How GLM‑5.1‑highspeed Achieves 7× Faster Inference to Become the World’s Fastest Flagship Model

On May 22, Zhipu launched the GLM‑5.1‑highspeed API, which delivers 400 tokens per second, about 7× faster than the original model and twice as fast as Gemini 3.5 Flash. The speedup comes from a three‑layer optimization that rewrites the MoE inference path, introduces dynamic scheduling, and uses TileRT’s AOT engine to cut latency while preserving full flagship capabilities.

SuanNi

Performance breakthrough

On May 22, the GLM‑5.1‑highspeed API (interface name GLM‑5.1‑highspeed) achieved an output speed of 400 tokens per second, raising the global ceiling from the previous 50‑60 tokens/s and surpassing Google Gemini 3.5 Flash by roughly a factor of two.

Typical Chinese reading speed is 300‑500 characters/min (≈5‑8 tokens/s, counting roughly one token per character). At 400 tokens/s the model generates text 50‑80 times faster than a human can read, completing an entire document before a person finishes a single sentence.
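The comparison above is simple arithmetic and can be checked with a short sketch. It assumes one Chinese character counts as roughly one token, which is the conversion the paragraph implies; the function names are illustrative only.

```python
# Assumption: one Chinese character is counted as roughly one token.
MODEL_TOKENS_PER_SECOND = 400  # output speed reported in the article


def reading_rate_per_second(chars_per_minute: float) -> float:
    """Convert a reading speed from characters per minute to characters per second."""
    return chars_per_minute / 60


def speed_ratio(chars_per_minute: float) -> float:
    """How many times faster the model writes than a person reads."""
    return MODEL_TOKENS_PER_SECOND / reading_rate_per_second(chars_per_minute)


slow_reader = speed_ratio(300)  # 400 / 5.0 characters per second
fast_reader = speed_ratio(500)  # 400 / 8.33 characters per second
print(f"{fast_reader:.0f}x to {slow_reader:.0f}x faster than reading")
```

Under that assumption the ratio works out to roughly 48 to 80, in line with the rounded 50‑80 range quoted above.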

Compared with the original GLM‑5.1, the high‑speed version finishes the same workload in 30 seconds instead of 7 minutes.

Three‑layer optimization

Inference engine layer: The MoE (Mixture‑of‑Experts) architecture was re‑engineered. Only a subset of experts is activated per token, and the routing and expert‑scheduling logic was rewritten to raise single‑card throughput.

Scheduling system layer: Dynamic batching, request merging, and KV‑cache scheduling were introduced. Dynamic batching packs heterogeneous user requests together, request merging eliminates duplicate computation, and KV‑cache optimization raises cache‑hit rates for repeated content.

Infrastructure layer: Inference clusters, network links, and load balancing are deployed in coordination so that the reported 400 TPS is a stable, production‑grade figure rather than a transient peak.

The three “punches” dramatically reduce tail latency under high concurrency, keeping response times low even when many users query simultaneously.
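To make the scheduling layer more concrete, here is a minimal sketch of request merging followed by dynamic batching. It is not Zhipu's scheduler: the `Request` type, the token budget, and the cost model (whitespace word count plus the requested output length) are all hypothetical, and merging identical prompts is assumed to be valid only when decoding is deterministic.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt: str
    max_new_tokens: int


def merge_duplicates(requests):
    """Group requests with identical prompts so each prompt is computed once."""
    groups = OrderedDict()
    for req in requests:
        groups.setdefault(req.prompt, []).append(req)
    return groups


def build_batches(groups, token_budget):
    """Greedily pack unique prompts into batches under a token budget.

    Hypothetical cost model: whitespace word count of the prompt plus the
    largest max_new_tokens among the merged requests. A prompt that exceeds
    the budget on its own is placed in a batch by itself.
    """
    batches, current, used = [], [], 0
    for prompt, reqs in groups.items():
        cost = len(prompt.split()) + max(r.max_new_tokens for r in reqs)
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append((prompt, [r.request_id for r in reqs]))
        used += cost
    if current:
        batches.append(current)
    return batches


requests = [
    Request("a", "summarize this report", 50),
    Request("b", "translate hello", 10),
    Request("c", "summarize this report", 50),  # duplicate of "a"
    Request("d", "write a long essay on MoE", 200),
]
batches = build_batches(merge_duplicates(requests), token_budget=128)
for index, batch in enumerate(batches):
    print(index, batch)
```

In this example requests "a" and "c" share one computation, the two short prompts are packed into the same batch, and the long request is scheduled separately. A production scheduler would also account for KV‑cache reuse and arrival times, which this sketch leaves out.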

TileRT’s contribution

Most existing inference frameworks schedule at the operator/kernel level, incurring host‑CPU launch, weight loading, computation, write‑back, and synchronization for every operator. In single‑token, small‑batch, multi‑card Tensor‑Parallel scenarios, operator overhead dominates, leaving little time for actual computation.

TileRT discards runtime‑dynamic scheduling and, during the AOT (Ahead‑Of‑Time) compilation stage, stitches the entire computation graph into a single persistent engine kernel that launches only once. Within a GPU, intermediate results flow directly through registers, shared memory, and L2 cache without writing back to global memory, and host‑side scheduling and cross‑operator sync are folded into the same kernel.
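The effect of launching once instead of once per operator can be illustrated with a toy latency model. This is only a back‑of‑the‑envelope sketch: the operator count and the microsecond costs below are invented for illustration and are not measurements of TileRT or of any real GPU.

```python
def per_operator_latency_us(num_ops, launch_us, sync_us, compute_us):
    """Every operator pays its own launch and synchronization cost."""
    return num_ops * (launch_us + sync_us + compute_us)


def persistent_kernel_latency_us(num_ops, launch_us, compute_us):
    """One launch for the whole graph; per-operator work is compute only."""
    return launch_us + num_ops * compute_us


# Illustrative values only (assumptions, not benchmark data).
NUM_OPS = 1000      # operators executed per generated token
LAUNCH_US = 5.0     # host-side launch cost per kernel
SYNC_US = 2.0       # synchronization cost between operators
COMPUTE_US = 3.0    # useful work per operator at small batch size

baseline = per_operator_latency_us(NUM_OPS, LAUNCH_US, SYNC_US, COMPUTE_US)
fused = persistent_kernel_latency_us(NUM_OPS, LAUNCH_US, COMPUTE_US)
overhead_share = 1 - (NUM_OPS * COMPUTE_US) / baseline

print(f"per-operator: {baseline:.0f} us, persistent: {fused:.0f} us")
print(f"overhead share in the per-operator case: {overhead_share:.0%}")
```

With these made‑up numbers, most of the per‑token time in the per‑operator case is overhead rather than computation, which is the regime the paragraph above describes for single‑token, small‑batch decoding.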

On multi‑card setups, TileRT extends warp‑specialization across an 8‑GPU NVL topology, assigning different GPUs specialized workers (e.g., dedicated to attention or feed‑forward layers). This heterogeneous division outperforms traditional homogeneous parallelism.

The result is a Time‑to‑First‑Token (TTFT) under one second, enabling near‑real‑time interaction.
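For readers who want to check TTFT and output speed themselves, the following sketch shows one way to measure both from any token stream. `fake_token_stream` is a hypothetical stand‑in for a streaming response, not a real client for the GLM API; in practice it would be replaced with whatever iterator a streaming SDK returns.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(num_tokens: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response (not a real API client)."""
    time.sleep(first_delay_s)  # simulated queueing and prefill before the first token
    for i in range(num_tokens):
        if i:
            time.sleep(gap_s)  # simulated gap between decoded tokens
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> dict:
    """Return time to first token and decode throughput for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        return {"ttft_s": None, "tokens": 0, "decode_tokens_per_s": None}
    decode_time = end - first_token_at
    # Throughput is measured after the first token so prefill time is excluded.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": first_token_at - start, "tokens": count, "decode_tokens_per_s": rate}


stats = measure_stream(fake_token_stream(20, first_delay_s=0.05, gap_s=0.002))
print(stats)
```

Separating TTFT from decode throughput matters because the two are affected by different optimizations: prefill and scheduling dominate the first, while per‑token kernel overhead dominates the second.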

Model capability retained

GLM‑5.1‑highspeed keeps the full flagship specifications: a 754 B parameter MoE model with 256 experts (≈44 B active parameters), 200 K context length, and a 128 K output window.
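The gap between total and active parameters comes from sparse expert routing. Below is a minimal sketch of generic top‑k routing over 256 experts. The value of k, the random router scores, and the softmax‑over‑selected‑experts weighting are assumptions for illustration; the article does not describe GLM‑5.1's actual router.

```python
import math
import random

NUM_EXPERTS = 256  # from the article
TOP_K = 8          # assumption: the article does not state how many experts fire per token


def top_k_route(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are a softmax computed over the selected logits only.
    """
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
routes = top_k_route(logits, TOP_K)

# Share of parameters touched per token, using the figures quoted above.
active_fraction = 44 / 754
print(f"experts used per token: {len(routes)} of {NUM_EXPERTS}")
print(f"active parameter share: {active_fraction:.1%}")
```

Only the selected experts run for a given token, so per‑token compute scales with the roughly 44 B active parameters rather than the full 754 B. The active share is not simply k/256, since attention layers and other shared weights are used by every token.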

Code‑generation throughput improves roughly tenfold. The model can sustain autonomous operation for up to eight hours, completing planning, execution, and iterative optimization in a single task.

On the SWE‑bench Pro benchmark the model scores 58.4, surpassing Claude Opus 4.6 and becoming the first Chinese open‑source model to achieve an eight‑hour continuous‑work capability. OpenRouter data show state‑of‑the‑art performance in coding and agent tasks.

New application scenarios

Interactive 3D game scenes can update in near‑real time as players type, eliminating the previous multi‑second lag.

Coding agents see a tenfold boost in iteration speed, turning model‑output latency from a bottleneck into a non‑issue.

Real‑time customer service, education, and financial analysis benefit from sub‑second responses, keeping decision windows within human reaction time.

Availability

The high‑speed model is currently offered to select enterprise customers on Zhipu’s MaaS (Model‑as‑a‑Service) platform; a public rollout date has not been announced.

Reference: https://docs.bigmodel.cn/cn/guide/models/text/glm-5.1-highspeed
