Industry Insights

Why GLM‑Z1‑AirX Hits 150‑200 TPS: A Deep Dive into LLM Speed Benchmarking

The article examines the slowdown caused by long‑chain‑of‑thought LLMs, presents a Python benchmarking script, compares token‑per‑second performance of several models—including the ultra‑fast GLM‑Z1‑AirX—and demonstrates a real‑time anti‑fraud use case that benefits from sub‑second response times.

Baobao Algorithm Notes

When deploying the R1 model, the author noticed that although answer quality was high, the inference speed was limited to about 45 tokens/s, which hurt completion‑rate metrics, especially for chain‑of‑thought (CoT) prompts that generate long outputs.

Typical mitigations such as UI tricks, folding the reasoning chain, model distillation, quantization, or DPO fine‑tuning either only mask the latency or degrade answer quality, leaving raw output speed as the core problem.

During this investigation, a colleague from Zhipu AI introduced the GLM‑Z1‑AirX model, claiming a throughput of 150‑200 tokens per second (TPS), well beyond comparable models on the market.

Benchmark Script

The provided Python script runs in Google Colab, measures generation speed by dividing generated token count by elapsed time (starting from the first streamed token to capture true TPS rather than TTFT), and supports multiple providers (OpenAI, Moonshot, Zhipu, Qwen, DeepSeek, Stepfun, Baichuan). The script defines three model groups (venti, grande, tall) and a testALL function that executes a prompt against each model with and without streaming.
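The core measurement described above can be sketched as follows. `measure_tps` and `fake_stream` are hypothetical helper names (the article's actual script is not reproduced here); the key point, as stated, is that the clock starts at the first streamed token so TTFT does not get mixed into the generation rate.

```python
import time

def measure_tps(token_stream):
    """Tokens/s measured from the first streamed token onward, so
    time-to-first-token (TTFT) does not distort the generation rate."""
    first_token_time = None
    token_count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # clock starts at the first token
        token_count += 1
    if first_token_time is None:
        return 0.0  # empty stream
    elapsed = time.perf_counter() - first_token_time
    return token_count / elapsed if elapsed > 0 else float("inf")

def fake_stream(n_tokens, inter_token_delay):
    """Stand-in for a provider's streaming response."""
    for _ in range(n_tokens):
        time.sleep(inter_token_delay)
        yield "tok"

print(round(measure_tps(fake_stream(20, 0.01))))  # roughly 100 tokens/s
```

In a real run, the token stream would come from each provider's streaming chat-completions response rather than a simulated generator.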

# Configuration example (only the relevant keys are shown)
from google.colab import userdata  # Colab secret store holding the API keys

models_venti = {
    'GPT': {
        'brand': 'OpenAI',
        'model_version': 'gpt-4',
        'api_key': userdata.get('Key_OpenAI'),
        'base_url': 'https://api.openai.com/v1'
    },
    'Zhipu': {
        'brand': 'Zhipu',
        'model_version': 'glm-4',
        'api_key': userdata.get('Key_Zhipu'),
        'base_url': 'https://open.bigmodel.cn/api/paas/v4/'
    }
    # ... other providers omitted for brevity
}

Running the script yields a sorted list of models by generation speed.

Speed Results (Token‑per‑Second)

OpenAI gpt-3.5‑turbo : 83.42 TPS (total tokens 2250, 14.62 s)

Qwen‑turbo (Alibaba) : 43.99 TPS (total tokens 1440, 18.17 s)

GLM‑Z1‑AirX (Zhipu) : 149.13 TPS (total tokens 2017, 9.46 s)

The GLM‑Z1‑AirX model not only outperforms dense and MoE models in overall TPS but also leads in time‑to‑first‑token (TTFT), delivering a “double‑kill” in speed.
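The ranking above boils down to a simple fastest-first sort of the measured TPS values; a minimal sketch using the figures reported here:

```python
# Measured TPS figures from the results above
results = {
    "gpt-3.5-turbo (OpenAI)": 83.42,
    "qwen-turbo (Alibaba)": 43.99,
    "glm-z1-airx (Zhipu)": 149.13,
}

# Sort fastest-first, as the benchmark summary does
ranking = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for name, tps in ranking:
    print(f"{name}: {tps:.2f} TPS")
```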

Practical Use Case: Real‑Time Anti‑Fraud Agent

In a telecom anti‑fraud scenario, an AI must analyze ASR transcripts of phone calls instantly to detect scams. Every second saved can prevent financial loss. The article provides a prompt template and a three‑step JSON output schema (risk level, risk types, triggers, suggestion) for automated decision‑making.

# Example risk-assessment logic (Python)
def classify_risk(high_risk_categories: int, vague_threat: bool) -> str:
    """Map detected trigger counts onto the schema's three risk levels."""
    if high_risk_categories >= 2:
        return "high"
    elif high_risk_categories == 1 and vague_threat:
        return "medium"
    return "low"

Given the input "Here is XX police, your account is suspected of money‑laundering…", the model returns a JSON indicating a high‑risk classification with specific keywords (e.g., "police", "safe account", "money‑laundering") and a recommendation to hang up and call emergency services.
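A sketch of what that returned JSON might look like, using the triggers and recommendation mentioned above; the exact key names are assumptions, and `is_actionable` is a hypothetical validation helper an automated pipeline might run before acting on the answer.

```python
import json

# Illustrative model output for the high-risk example above; field names
# follow the schema described in the article, but exact keys are assumed.
model_output = json.dumps({
    "risk_level": "high",
    "risk_types": ["police impersonation", "money-laundering accusation"],
    "triggers": ["police", "safe account", "money-laundering"],
    "suggestion": "Hang up and call the official emergency/anti-fraud line.",
})

def is_actionable(raw: str) -> bool:
    """Minimal structural check before automated decision-making."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    required = {"risk_level", "risk_types", "triggers", "suggestion"}
    return (required <= payload.keys()
            and payload["risk_level"] in {"high", "medium", "low"})

print(is_actionable(model_output))  # True
```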

Model Availability

Beyond the benchmark, Zhipu has opened the GLM‑Z1‑AirX family on the MaaS platform (bigmodel.cn) with three variants:

GLM‑Z1‑AirX (ultra‑fast, up to 200 TPS)

GLM‑Z1‑Air (high cost‑performance, 1/30 the price of DeepSeek‑R1)

GLM‑Z1‑Flash (free tier for low‑cost experimentation)

These models are accessible via API, enabling developers to integrate high‑speed LLM inference into latency‑sensitive applications such as real‑time fraud detection.

Tags: Performance, Python, LLM, anti-fraud, benchmark, GLM-Z1-AirX, token per second
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.