Which LLM Generates Tokens Fastest? A Real‑World Speed Benchmark Across Major Models

This article presents a practical Python benchmark that measures the tokens‑per‑second generation speed of several large language models, including GPT‑4o, glm‑4‑airx, and moonshot‑v1‑32k, by timing text generation in a Colab environment and summarizing the results by model size.

Baobao Algorithm Notes

Jul 13, 2024

The author needed a high‑throughput LLM for a business scenario and wanted to compare the token generation speed of several popular models without sacrificing output quality. A friend shared a Colab notebook that computes speed as the amount of generated output divided by the elapsed time, with the timer effectively starting at the first streamed token so that network latency and prompt processing do not distort the result.
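The measurement idea can be sketched as a small calculation. The function name and the sample numbers below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the benchmark.

```python
def generation_speed(completion_tokens, total_time, first_token_time):
    """Tokens per second, counted from the first token rather than from the request start."""
    generation_time = total_time - first_token_time
    if generation_time <= 0:
        # Guard against a zero or negative window (e.g. inconsistent timings).
        return 0.0
    return completion_tokens / generation_time


# Hypothetical numbers: 600 generated tokens, 10.5 s in total,
# of which 0.5 s passed before the first token arrived.
speed = generation_speed(600, 10.5, 0.5)
print(f"{speed:.2f} token/s")  # prints "60.00 token/s"
```

Subtracting the time to the first token keeps the waiting period out of the denominator, so a slow connection lowers the reported speed less than it would with a plain tokens/total-time ratio.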

Benchmark Script Overview

The script defines a configuration dictionary for multiple providers (OpenAI, Moonshot, Zhipu, Qwen, DeepSeek, Stepfun, Baichuan) with model version, API key, and base URL. It then implements two core functions:

LLM(messages, model, result_dict, use_streaming=False) – Calls the provider’s chat completion API, records start time, and captures either streaming or non‑streaming responses. It stores content snippets, token counts, total time, and calculates inference and generation speeds.

testALL(messages=[], prompts=[], models={}) – Submits the LLM function for every model to a ThreadPoolExecutor, first with streaming enabled and then without, and prints the results sorted by generation speed.

# Configuration
from google.colab import userdata  # Colab secrets store that holds the API keys
models_venti = {
    'GPT': {'brand': 'OpenAI', 'model_version': 'gpt-4', 'api_key': userdata.get('Key_OpenAI'), 'base_url': 'https://api.openai.com/v1'},
    'Moonshot': {'brand': '月之暗面', 'model_version': 'moonshot-v1-32k', 'api_key': userdata.get('Key_Moonshot'), 'base_url': 'https://api.moonshot.cn/v1'},
    'Zhipu': {'brand': '智谱', 'model_version': 'glm-4', 'api_key': userdata.get('Key_Zhipu'), 'base_url': 'https://open.bigmodel.cn/api/paas/v4/'},
    'Qwen': {'brand': '通义千问', 'model_version': 'qwen-max', 'api_key': userdata.get('Key_Qwen'), 'base_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1'},
    'DeepSeek': {'brand': '深度求索', 'model_version': 'deepseek-chat', 'api_key': userdata.get('Key_DeepSeek'), 'base_url': 'https://api.deepseek.com'},
    'Stepfun': {'brand': '阶跃星辰', 'model_version': 'step-1-8k', 'api_key': userdata.get('Key_Stepfun'), 'base_url': 'https://api.stepfun.com/v1'},
    'Baichuan': {'brand': '百川', 'model_version': 'Baichuan4', 'api_key': userdata.get('Key_Baichuan'), 'base_url': 'https://api.baichuan-ai.com/v1'}
}

# ... (similar definitions for models_grande and models_tall)

from concurrent.futures import ThreadPoolExecutor, as_completed
from openai import OpenAI
import time, datetime, pytz

def LLM(messages, model, result_dict, use_streaming=False):
    client = OpenAI(api_key=model['api_key'], base_url=model['base_url'])
    api_params = {"model": model['model_version'], "messages": messages, "stream": use_streaming}
    start_time = time.time()
    try:
        response = client.chat.completions.create(**api_params)
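        if use_streaming:
            # Streaming pass: stop at the first chunk, so the elapsed time approximates
            # prompt processing (time to first token) rather than the full response.
            for _ in response:
                break
            infer_time = time.time() - start_time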
            result = result_dict.get(model['model_version'], {'model': model})
            result.update({'content': None, 'duration': infer_time, 'use_streaming': use_streaming, 'infer_time': infer_time})
            result_dict[model['model_version']] = result
            return
        else:
            result_message = response.choices[0].message
            content = result_message.content or f"No output from {model['model_version']}"
            total_time = time.time() - start_time
            result = result_dict.get(model['model_version'], {'model': model})
            result.update({
                'content': content[:20] + '...',
                'duration': total_time,
                'prompt_tokens': response.usage.prompt_tokens,
                'completion_tokens': response.usage.completion_tokens,
                'total_tokens': response.usage.total_tokens,
                'use_streaming': use_streaming,
                'generation_time': total_time - result.get('infer_time', 0),
            })
            if 'infer_time' in result:
                result['infer_speed'] = result['prompt_tokens'] / result['infer_time'] if result['infer_time'] > 0 else 0
                result['generation_speed'] = result['completion_tokens'] / result['generation_time'] if result['generation_time'] > 0 else 0
            result_dict[model['model_version']] = result
            return
    except Exception as e:
        result_dict[model['model_version']] = {'model': model, 'content': f"Error: {e}", 'use_streaming': use_streaming}

def testALL(messages=[], prompts=[], models={}):
    tz = pytz.timezone('Asia/Shanghai')
    now = datetime.datetime.now(tz)
    print(f"\nTest start time: {now.strftime('%Y-%m-%d %H:%M:%S')}\n")
    result_dict = {}
    def process_messages(messages, use_streaming):
        futures = [executor.submit(LLM, messages, models[key], result_dict, use_streaming) for key in models]
        for future in as_completed(futures):
            future.result()
    with ThreadPoolExecutor() as executor:
        for use_streaming in [True, False]:
            if prompts:
                for prompt in prompts:
                    msgs = [{"role": "user", "content": prompt}]
                    process_messages(msgs, use_streaming)
            else:
                process_messages(messages, use_streaming)
    sorted_results = sorted(result_dict.values(), key=lambda x: x.get('generation_speed', 0), reverse=True)
    for result in sorted_results:
        print(f"From {result['model']['brand']} {result['model']['model_version']}:")
        print(f"Total tokens: {result.get('total_tokens')}, Time: {result.get('duration',0):.2f}s")
        print(f"Content: {result.get('content','No output')}")
        print(f"Prompt tokens: {result.get('prompt_tokens')}, Infer time: {result.get('infer_time',0):.2f}s")
        print(f"Completion tokens: {result.get('completion_tokens')}, Generation time: {result.get('generation_time',0):.2f}s, Speed: {result.get('generation_speed',0):.2f} token/s\n")

Running the Benchmark

Execute the script in a Colab notebook with a list of prompts (e.g., a Chinese text translation task). The script prints a start timestamp, runs each model with and without streaming, and finally displays a sorted list of models by generation speed.
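A run over several model groups might be organized as in the sketch below. The helper `run_groups` and the prompt text are hypothetical; `testALL`, `models_venti`, `models_grande`, and `models_tall` are the names used in the script above. The benchmark function is passed in as an argument so the helper can be tried without any API keys.

```python
def run_groups(test_fn, prompts, model_groups):
    """Run the benchmark once per model group and return the group names in run order."""
    finished = []
    for group_name, models in model_groups.items():
        print(f"=== {group_name} ===")
        test_fn(prompts=prompts, models=models)
        finished.append(group_name)
    return finished


# Hypothetical prompt; the article only says the task was a Chinese text translation.
prompts = ["Translate the following Chinese passage into English: ..."]

# In the notebook, the call would look roughly like this:
# run_groups(testALL, prompts,
#            {"venti": models_venti, "grande": models_grande, "tall": models_tall})
```

Running the groups one after another, rather than all at once, keeps each group's output together.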

Results Overview

The benchmark covered three groups of models by parameter size: small, medium, and large. For each model, the entries below list the context token count, the total time, and the average generation speed in tokens per second.

Small‑Parameter Models

OpenAI gpt-3.5-turbo: 2250 context tokens, 14.62 s total, 83.42 token/s.

Zhipu glm-4-flash: 1409 context tokens, 11.35 s total, 72.14 token/s.

Qwen qwen-turbo: 1440 context tokens, 18.17 s total, 43.99 token/s.

Baichuan Baichuan3-Turbo: 1485 context tokens, 22.00 s total, 36.36 token/s.

Medium‑Parameter Models

Zhipu glm-4-airx: 1415 context tokens, 9.67 s total, 92.95 token/s.

OpenAI gpt-4o: 1690 context tokens, 12.68 s total, 75.14 token/s.

Qwen qwen-plus: 1500 context tokens, 48.10 s total, 17.22 token/s.

Large‑Parameter Models

Moonshot moonshot-v1-32k: 1445 context tokens, 27.03 s total, 30.26 token/s.

OpenAI gpt-4: 2409 context tokens, 45.64 s total, 30.14 token/s.

Zhipu glm-4: 1415 context tokens, 25.98 s total, 30.13 token/s.

Stepfun step-1-8k: 1590 context tokens, 33.38 s total, 28.58 token/s.

Baichuan Baichuan4: 1413 context tokens, 40.05 s total, 19.21 token/s.

DeepSeek deepseek-chat: 1691 context tokens, 52.08 s total, 18.80 token/s.

Qwen qwen-max: 1538 context tokens, 57.39 s total, 15.07 token/s.

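Although the large‑parameter models are generally expected to produce higher‑quality output, the medium‑parameter glm-4-airx was the fastest model in the entire benchmark at 92.95 token/s, ahead of the small gpt-3.5-turbo (83.42 token/s) and roughly three times faster than the quickest large models (about 30 token/s).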
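To see the overall ranking across all three groups, the reported figures can be sorted in a few lines. The numbers are copied from the results above; the tuple layout and the helper name `rank_by_speed` are illustrative choices, not part of the original notebook.

```python
# (group, provider, model, total_seconds, tokens_per_second) as reported above
RESULTS = [
    ("small", "OpenAI", "gpt-3.5-turbo", 14.62, 83.42),
    ("small", "Zhipu", "glm-4-flash", 11.35, 72.14),
    ("small", "Qwen", "qwen-turbo", 18.17, 43.99),
    ("small", "Baichuan", "Baichuan3-Turbo", 22.00, 36.36),
    ("medium", "Zhipu", "glm-4-airx", 9.67, 92.95),
    ("medium", "OpenAI", "gpt-4o", 12.68, 75.14),
    ("medium", "Qwen", "qwen-plus", 48.10, 17.22),
    ("large", "Moonshot", "moonshot-v1-32k", 27.03, 30.26),
    ("large", "OpenAI", "gpt-4", 45.64, 30.14),
    ("large", "Zhipu", "glm-4", 25.98, 30.13),
    ("large", "Stepfun", "step-1-8k", 33.38, 28.58),
    ("large", "Baichuan", "Baichuan4", 40.05, 19.21),
    ("large", "DeepSeek", "deepseek-chat", 52.08, 18.80),
    ("large", "Qwen", "qwen-max", 57.39, 15.07),
]


def rank_by_speed(results):
    """Return the rows sorted from fastest to slowest generation speed."""
    return sorted(results, key=lambda row: row[4], reverse=True)


ranked = rank_by_speed(RESULTS)
for position, (group, provider, model, seconds, speed) in enumerate(ranked, start=1):
    print(f"{position:>2}. {provider:<9}{model:<17}{group:<7}{seconds:>6.2f} s {speed:>6.2f} token/s")
```

The top three entries are glm-4-airx, gpt-3.5-turbo, and gpt-4o, and qwen-max comes last.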

Qualitative Tests

Beyond raw speed, the author posed linguistic puzzles to compare answer quality. For example, the prompt "I often feel out of place because I'm not perverse enough" was answered correctly by both glm-4-airx and gpt-4. Another test asked for the meaning of each of four consecutive "把" characters in a Chinese sentence; glm-4-airx gave a more accurate breakdown than gpt-4, suggesting that higher speed does not necessarily come at the cost of comprehension.

The article attributes the speed advantage of glm-4-airx to architectural improvements that shorten both the prefill and decoding phases of inference, in particular the latency to the first token, which is critical for interactive applications.
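The two phases can also be timed separately in a single streaming pass, as in the minimal sketch below. It is not the notebook's method (which uses one streaming and one non-streaming request); `time_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a real API response so the example runs without network access. Streamed chunks do not necessarily map one-to-one to tokens, so a chunk count is only a rough proxy.

```python
import time


def time_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of streamed chunks.

    Returns (time_to_first_chunk, decode_time, chunk_count).
    """
    start = clock()
    first_at = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = clock()  # the wait up to here approximates the prefill phase
        count += 1
    end = clock()
    if first_at is None:  # the stream produced nothing
        return end - start, 0.0, 0
    return first_at - start, end - first_at, count


def fake_stream(n_chunks, prefill_delay=0.05, per_chunk_delay=0.01):
    """Stand-in for a streaming API response (hypothetical; no network involved)."""
    time.sleep(prefill_delay)  # simulated wait before the first chunk
    for i in range(n_chunks):
        if i:
            time.sleep(per_chunk_delay)  # simulated gap between later chunks
        yield f"chunk-{i}"


ttft, decode_time, n = time_stream(fake_stream(20))
print(f"time to first chunk: {ttft:.3f}s, decode: {decode_time:.3f}s, chunks: {n}")
```

Passing the clock as a parameter makes the function easy to check with fixed timestamps instead of real delays.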

All scripts and data were provided by the community member "赛博禅心".
