Inside GLM-5: Training Techniques, Architecture Innovations, and Benchmark Performance
The article dissects GLM-5's 744B-parameter MoE design, its 28.5T-token training corpus, the novel Muon Split and MLA-256 optimizations, DSA sparse attention, a fully asynchronous RL pipeline, extensive domestic-chip adaptation, and benchmark results that place it on par with Claude Opus 4.5 and ahead of Gemini 3 Pro.
Paper Overview and Goal
GLM-5's model files have been released together with the paper "GLM-5: from Vibe Coding to Agentic Engineering". Unlike earlier "Vibe Coding", where a user feeds prompts line by line, GLM-5 aims for an AI that plans, implements, and iterates autonomously, approaching the behavior of a real software engineer.
Key Specifications
744B-parameter MoE architecture with 40B active parameters. Despite having twice the total parameters of GLM-4.5, the combination of MoE routing and DSA sparse attention keeps inference cost down (a minimal routing sketch follows this list).
28.5T-token training corpus, heavily weighted toward code and reasoning data; the context length is expanded from 4K to 200K during the mid-training stage.
Fully asynchronous reinforcement‑learning framework that decouples training and inference, dramatically improving GPU utilization.
Adaptation to seven domestic chip platforms (Huawei Ascend, Moore Threads, HaiGuang, Cambricon, Kunlun, MuXi, SuiYuan) from day one.
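To make the "40B active parameters" figure concrete, here is a minimal sketch of top-k expert routing in an MoE layer. It is an illustration rather than GLM-5's implementation: the expert count, hidden sizes, and top-k value are placeholders, and real deployments use fused, load-balanced kernels rather than this Python loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k MoE layer: each token runs through only k of n experts."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: [tokens, d_model]
        gate_logits = self.router(x)               # [tokens, n_experts]
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Because only the routed experts execute, the parameters touched per token are a small fraction of the total; the reported 40B-active-of-744B ratio comes from the same mechanism at a much larger scale.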
Architecture Innovations
1. Muon Split: After observing that MLA (Multi-head Latent Attention) underperforms GQA-8 when trained with the Muon optimizer, the authors split the projection matrices per attention head and orthogonalize each head's update separately, allowing each head to update at its own scale and bringing MLA performance up to the GQA-8 level (a minimal sketch follows after this list).
2. MLA-256: The head dimension is increased from 192 to 256 while the number of heads is reduced by one-third, keeping training compute constant but cutting decoding compute substantially.
3. MTP Parameter Sharing: Sharing three layers of MTP parameters (instead of a single layer) raises the average accepted token length from 2.55 to 2.76 under the same four-step speculative decoding, yielding a noticeable speed gain (rough arithmetic below).
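For item 1, here is a minimal sketch of the per-head orthogonalization idea as I read it; this is my own illustration, not the paper's code. The Newton-Schulz coefficients are the ones used in the public Muon reference implementation, and the function names and shape conventions below are assumptions.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D update matrix (the core Muon step)."""
    a, b, c = 3.4445, -4.7750, 2.0315    # coefficients from the public Muon implementation
    x = g / (g.norm() + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_split_update(proj_grad, n_heads, head_dim):
    """"Muon Split": orthogonalize the projection gradient head-by-head instead of
    as one big matrix, so each head's update keeps its own scale.

    proj_grad: [n_heads * head_dim, d_in] gradient of a per-head projection matrix.
    """
    per_head = proj_grad.view(n_heads, head_dim, -1)
    updates = [newton_schulz_orthogonalize(per_head[h]) for h in range(n_heads)]
    return torch.cat(updates, dim=0)
```

And for item 3, rough back-of-the-envelope arithmetic for the accepted-length numbers, under the simplifying assumption that decoding throughput scales with the average number of tokens accepted per verification pass (draft cost ignored):

```python
old_accept, new_accept = 2.55, 2.76
speedup = new_accept / old_accept
print(f"relative decoding throughput: ~{speedup:.2f}x (~{(speedup - 1) * 100:.0f}% faster)")
# relative decoding throughput: ~1.08x (~8% faster)
```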
The figure shows GLM-5's results on eight core ARC (Agentic, Reasoning, Coding) benchmarks: an average gain of roughly 20% over GLM-4.7, performance matching Claude Opus 4.5 and GPT-5.2 (xhigh), and a lead over Gemini 3 Pro.
DeepSeek Sparse Attention (DSA)
DSA starts from the observation that roughly 90% of attention computation in long contexts is redundant. A "Lightning Indexer" dynamically selects the important tokens, achieving token-level sparsity without losing long-range dependencies. The ablation studies report:
Sliding‑window attention drops 30 points on RULER@128K (catastrophic).
Search‑based SWA improves but still lags by 5.69 points.
Linear‑attention variants (GDN/SimpleGDN) exceed the baseline on some metrics but lag on fine‑grained retrieval.
DSA shows no degradation across all scenarios because it never discards long‑range information.
Training GLM-5 from the GLM-4.7 MLA base model with only 20B tokens of sparse adaptation matches the original MLA performance, an order-of-magnitude efficiency improvement over DeepSeek-V3.2's 943.7B-token training.
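To make the mechanism concrete, here is a minimal sketch of indexer-guided, token-level sparse attention. It is a generic illustration under my own assumptions (dense PyTorch, a placeholder top-k budget, toy indexer features), not the DSA kernel itself, which is a fused GPU implementation.

```python
import torch
import torch.nn.functional as F

def sparse_attention_with_indexer(q, k, v, index_q, index_k, keep_k=512):
    """Token-level sparse attention: a cheap indexer scores every past token and
    full attention is computed only over the top-`keep_k` of them.

    q, k, v:          [n_heads, seq, head_dim]  full attention inputs
    index_q, index_k: [seq, idx_dim]            cheap low-dimensional indexer features
    """
    seq = q.shape[1]
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)

    # 1) Cheap relevance score for every (query, key) pair.
    idx_scores = (index_q @ index_k.T).masked_fill(causal, float("-inf"))

    # 2) Keep only the top-k keys per query token (token-level sparsity).
    topk = idx_scores.topk(min(keep_k, seq), dim=-1).indices
    keep_mask = torch.zeros(seq, seq, dtype=torch.bool).scatter_(1, topk, True)

    # 3) Full attention restricted to the selected tokens.
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(causal | ~keep_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the selection is per token rather than a fixed window, distant but relevant tokens stay reachable, which is why the ablations above show no loss on fine-grained long-range retrieval.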
Three‑Stage RL Pipeline
Stage 1 – Reasoning RL: Built on the GRPO algorithm with IcePop techniques and focused on math and code reasoning. KL regularization is removed to speed up convergence, and DSA models gain a 1–2% "Reasoning Bonus" over MLA models.
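For reference, a minimal sketch of the GRPO-style group advantage (without the IcePop additions, which I have not tried to reproduce): rewards for a group of rollouts of the same prompt are normalized within the group, and no KL penalty term appears, matching the "KL regularization removed" detail above.

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages, GRPO-style.

    rewards: [n_prompts, group_size] scalar rewards for the sampled completions
             of each prompt. Returns advantages of the same shape.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # no KL term is added anywhere

# The per-token policy loss is then roughly -advantage * log_prob, optionally with
# the importance-ratio clipping used in the agentic stage described next.
```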
Stage 2 – Agentic RL (the core highlight) introduces several heavyweight systems designs:
Fully asynchronous decoupled framework using a Multi‑Task Rollout Orchestrator, eliminating GPU idle time.
Token-in-Token-out (TITO) gateway ensuring exact token-level correspondence between the tokens the inference engine produces and the tokens the trainer consumes, avoiding re-tokenization mismatches.
Direct double-sided importance sampling with token-level clipping to control off-policy bias without tracking historical checkpoints (see the sketch after this list).
DP‑aware Routing to maximize KV‑cache reuse, accelerating long‑context inference in MoE models.
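Here is a minimal sketch of what double-sided, token-level importance-sampling clipping can look like. It reflects my reading of the bullet above; the clip bounds and exact loss form are placeholders, not the paper's formulation.

```python
import torch

def clipped_token_is_loss(new_logp, old_logp, advantages, clip_low=0.8, clip_high=1.2):
    """Token-level importance sampling clipped on BOTH sides.

    new_logp, old_logp: [batch, seq] log-probs of the chosen tokens under the
                        current (training) policy and the stale (rollout) policy.
    advantages:         [batch, seq] per-token advantages.

    The ratio is clamped into [clip_low, clip_high] regardless of the advantage's
    sign, so a badly off-policy token can neither dominate nor vanish from the
    gradient; only old_logp from rollout time is needed, not old checkpoints.
    """
    ratio = torch.exp(new_logp - old_logp)      # token-level importance ratio
    clipped = ratio.clamp(clip_low, clip_high)
    return -(clipped * advantages).mean()
```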
The training environment includes over 10,000 real software‑engineering (SWE) tasks, terminal tasks, and complex multi‑hop search tasks.
Stage 3 – General RL combines multi‑objective optimization (safety, creativity, QA) with a mixed reward system (rule‑based + AI‑Judge) and human‑in‑the‑loop alignment.
Cross‑stage on‑policy distillation is applied after each RL stage to prevent catastrophic forgetting, using the current best model’s on‑policy data to teach the next stage.
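A minimal sketch of one common formulation of on-policy distillation, to make the idea concrete; the temperature and the exact KL direction are my assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Distillation loss KL(teacher || student) evaluated on sequences that the
    student itself generated (hence "on-policy").

    student_logits, teacher_logits: [batch, seq, vocab], computed on the same
    student-sampled tokens; the teacher is the current best model and is run
    without gradients.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits.detach() / temperature, dim=-1)
    # F.kl_div(input, target, log_target=True) = sum exp(target) * (target - input)
    return F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
```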
Domestic Chip Adaptation
GLM-5 has full-stack adaptation for seven Chinese chip platforms. The paper details the adaptation for the Huawei Ascend Atlas 800T A3, including:
W4A8/W8A8 mixed-precision quantization (W8A8 for attention/MLP, W4A8 for the MoE experts) with QuaRot outlier suppression (see the sketch after this list).
High‑performance fused kernels: Lightning Indexer, Sparse Flash Attention, MLAPO.
Inference-engine optimizations: asynchronous scheduling to eliminate pipeline bubbles, RadixCache prefix sharing, and hybrid parallelism (Attention DP + MoE EP).
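To illustrate the quantization bullet, here is a minimal sketch of the rotation-then-quantize idea behind QuaRot-style outlier suppression. A random orthogonal matrix stands in for the Hadamard transforms used in practice, the bit widths and sizes are placeholders, and this is plain PyTorch rather than Ascend kernel code.

```python
import torch

def random_orthogonal(n, seed=0):
    """Stand-in for the Hadamard rotation used by QuaRot-style methods."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

def quantize_int(x, bits):
    """Toy symmetric per-tensor quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return (x / scale).round().clamp(-qmax, qmax), scale

# Rotating weights and activations by the same orthogonal matrix R leaves the layer
# output unchanged ((W R)(R^T x) = W x) but spreads activation outliers across
# channels, which is what makes aggressive low-bit quantization workable.
d = 512
W = torch.randn(d, d)
x = torch.randn(d) * torch.linspace(0.1, 10.0, d)    # activations with outlier channels
R = random_orthogonal(d)

q_w, s_w = quantize_int(W @ R, bits=4)               # e.g. W4 for MoE expert weights
q_x, s_x = quantize_int(R.T @ x, bits=8)             # A8 activations
y_approx = (q_w * s_w) @ (q_x * s_x)
print("relative error vs. full precision:", ((y_approx - W @ x).norm() / (W @ x).norm()).item())
```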
A single domestic server node reaches the speed of a dual-card deployment on international GPUs, cutting deployment cost for long-sequence scenarios by 50%.
Benchmark Results
On LMArena, GLM-5 ranks first among open-weight models in both the Text Arena and the Code Arena, and is comparable to Claude Opus 4.5 and Gemini 3 Pro.
In the long-horizon planning benchmark Vending-Bench 2, GLM-5 ends with a $4,432 balance, the best open-source result and close to Claude Opus 4.5. Code-related benchmarks show:
SWE‑bench Verified: open‑source SOTA, surpassing Gemini 3 Pro.
SWE‑bench Multilingual: beats Gemini 3 Pro and GPT‑5.2 (xhigh).
Terminal‑Bench 2.0: on par with Claude Opus 4.5.
Agentic abilities:
BrowseComp (web search): SOTA among frontier models.
MCP‑Atlas (tool use): comparable to Claude Opus 4.5.
τ²‑Bench (dialogue agent): strong performance.
CC-Bench-V2, an internal evaluation suite covering front-end tasks (judged with a Playwright-based Agent-as-a-Judge) and back-end tasks across six languages and 85 real tasks, shows GLM-5 matching or exceeding Claude Opus 4.5, especially in large-codebase exploration.
Easter Egg: "Pony Alpha" Experiment
The paper concludes with an "Easter egg": GLM-5 was anonymously released on OpenRouter under the name "Pony Alpha". Within days the community tried to guess its identity (25% guessed Claude Sonnet 5, 20% DeepSeek, 10% Grok, and the rest GLM-5). The reveal was described as a "profound moment" that disproves the notion that Chinese models cannot compete at the frontier.
Paper: https://arxiv.org/abs/2602.15763v1
GitHub: https://github.com/zai-org/GLM-5
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.