Why GLM‑4.5 Sets a New Benchmark for Open‑Source Large Language Models

GLM‑4.5 and its lightweight Air variant, featuring a deep‑layered MoE design, grouped‑query attention, and dual inference modes, achieve third‑place overall on 12 hard‑core benchmarks, excel in web‑browsing and tool‑calling with a 90.6 % success rate, and introduce novel training tricks such as the Muon optimizer and Slime RL framework.


Introduction

Within a few weeks, the open‑source LLM crown passed from Kimi‑K2 and Qwen‑3 to GLM‑4.5, a model that aims to be not just a top code generator, a reasoning specialist, or an agent, but all of these at once.

Model specifications

GLM‑4.5 has 355 billion total parameters with 32 billion active per token; the lightweight GLM‑4.5‑Air has 106 billion total and 12 billion active. Both use a Mixture‑of‑Experts (MoE) architecture.
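
To make the total‑versus‑active distinction concrete, here is a back‑of‑the‑envelope sketch of how MoE parameter counts split. The layer count, expert width, expert count, and top‑k below are illustrative placeholders, not GLM‑4.5's published configuration, and the function ignores attention and embedding parameters.

```python
def moe_ffn_params(n_layers, d_model, d_expert, n_experts, n_active):
    """Rough MoE feed-forward parameter split (ignores attention/embeddings).

    Each expert is modeled as a two-matrix MLP; the checkpoint stores every
    expert, but each token only runs through the top `n_active` of them.
    """
    per_expert = 2 * d_model * d_expert        # up- and down-projection
    total = n_layers * n_experts * per_expert  # stored in the checkpoint
    active = n_layers * n_active * per_expert  # actually computed per token
    return total, active


# Placeholder configuration, NOT GLM-4.5's real one: it only shows how a
# large gap between total and active parameters arises from top-k routing.
total, active = moe_ffn_params(n_layers=90, d_model=5120, d_expert=1536,
                               n_experts=160, n_active=8)
print(f"total ≈ {total / 1e9:.0f}B, active per token ≈ {active / 1e9:.0f}B")
```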

Architectural innovations

The team narrowed the expert networks while stacking more layers; the deeper, narrower design sharpens attention, improves long‑text reasoning, and stabilizes multi‑turn tool use. They also introduced a dual inference mode: a “deep thinking” mode that activates automatically for complex tasks or tool chains, and an “instant response” mode for concise answers.
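
For readers who want to try both modes, the sketch below shows how such a toggle might be exposed through an OpenAI‑compatible chat endpoint. The base URL, the `extra_body` usage, and the `thinking` parameter name are assumptions for illustration, not the documented GLM‑4.5 API; consult the official docs for the real parameter names.

```python
from openai import OpenAI

# Hypothetical endpoint and parameter names -- check the official GLM-4.5
# documentation for the real ones before relying on this.
client = OpenAI(base_url="https://example-glm-endpoint/v1", api_key="YOUR_KEY")

def ask(prompt: str, deep_thinking: bool):
    """Send one chat turn, optionally requesting the 'deep thinking' mode."""
    return client.chat.completions.create(
        model="glm-4.5",
        messages=[{"role": "user", "content": prompt}],
        # extra_body passes provider-specific fields through the OpenAI SDK;
        # the "thinking" payload here is an assumed name, not a confirmed one.
        extra_body={"thinking": {"type": "enabled" if deep_thinking else "disabled"}},
    )

# Complex, tool-heavy request: let the model reason step by step.
long_answer = ask("Plan a multi-step web search to compare two papers.", deep_thinking=True)
# Simple lookup: instant response mode keeps the reply short and fast.
short_answer = ask("What is 17 * 24?", deep_thinking=False)
```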

Performance results

On 12 hard‑core benchmarks covering coding, agents, and logical reasoning, GLM‑4.5 ranks third overall, behind only OpenAI and Grok‑4; the Air variant ranks sixth. In the BrowseComp web‑browsing test it outperforms Claude‑4‑Opus by nearly 8 points, and on BFCL‑v3 (the Berkeley Function‑Calling Leaderboard) it remains robust. Its tool‑calling success rate reaches 90.6 %, surpassing Claude‑4 Sonnet, Kimi‑K2, and Qwen‑3.

On reasoning benchmarks it scores 98.2 % on MATH‑500, 91.0 % on AIME‑24, and 79.1 % on GPQA. On SWE‑bench Verified it scores 64.2, higher than GPT‑4.1 and DeepSeek, and on Terminal‑Bench it scores 37.5, beating Claude‑4 Sonnet. In human evaluation it beats Kimi‑K2 in 54 % of cases and outperforms Qwen‑3‑Coder with an 80.8 % success rate.

Training techniques

The MoE layers use loss‑free balanced routing with a sigmoid gate, in contrast to the wider, shallower designs of DeepSeek‑V3 and Kimi‑K2. Grouped‑Query Attention is combined with partial RoPE, and the head count is raised to 96 (about 2.5× what comparable models use) at a hidden size of 5120. The Muon optimizer accelerates convergence and supports larger batch sizes, while QK‑Norm stabilizes attention scores. Both models employ Multi‑Token Prediction (MTP) for speculative decoding.
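
As a rough illustration of what sigmoid‑gated, loss‑free balanced routing can look like, here is a minimal PyTorch sketch in the spirit of the recipe popularized by DeepSeek‑V3; it is not GLM‑4.5's actual router, and the bias‑update rule described in the docstring is one common choice rather than a confirmed detail.

```python
import torch

def sigmoid_topk_router(hidden, gate_weight, expert_bias, k=8):
    """Sigmoid-gated top-k expert selection with a load-balancing bias.

    hidden:      [n_tokens, d_model] token representations
    gate_weight: [n_experts, d_model] router projection
    expert_bias: [n_experts] per-expert bias, nudged up for underloaded
                 experts and down for overloaded ones between steps, so
                 balance is enforced without an auxiliary loss term.
    """
    scores = torch.sigmoid(hidden @ gate_weight.T)      # [n_tokens, n_experts]
    # The bias influences *which* experts are picked...
    _, topk_idx = torch.topk(scores + expert_bias, k, dim=-1)
    # ...but the mixing weights come from the raw, unbiased scores.
    topk_scores = torch.gather(scores, -1, topk_idx)
    weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, weights

# Toy shapes just to show the call; a real model routes inside every MoE layer.
hidden = torch.randn(4, 5120)
gate_w = torch.randn(160, 5120)
bias = torch.zeros(160)
experts, weights = sigmoid_topk_router(hidden, gate_w, bias, k=8)
```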

Pre‑training consumed 22 T tokens (15 T general data plus 7 T code and reasoning data); post‑training added domain‑specific instruction data. Reinforcement learning ran on a custom framework called Slime, which supports both synchronous and asynchronous rollouts, decouples training from inference, and pairs FP8 inference with BF16 training. RL focused on real‑world agent workflows: retrieval‑augmented QA, software‑development tasks, and web‑scraped QA pairs constructed so the model has to search rather than recite from memory.
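
The decoupling of rollout generation from gradient updates is the key systems idea here; the toy sketch below mimics that producer/consumer split with Python threads. None of these names come from the Slime codebase, and real FP8 inference / BF16 training is stood in for by trivial placeholders.

```python
import queue
import random
import threading
import time

# Toy illustration of decoupled rollouts and training, NOT the Slime
# framework: one thread produces trajectories (stand-in for FP8 inference
# rollouts) while the main thread consumes them for "updates" (stand-in
# for BF16 training steps).
trajectories = queue.Queue(maxsize=64)

def rollout_worker(n_episodes):
    """Stand-in for the inference engine: emit (episode, reward) records."""
    for i in range(n_episodes):
        time.sleep(0.01)  # pretend to run a tool-use episode
        trajectories.put({"episode": i, "reward": random.random()})

def trainer(n_steps, batch_size=4):
    """Stand-in for the learner: pull finished episodes and log a metric."""
    for step in range(n_steps):
        batch = [trajectories.get() for _ in range(batch_size)]
        mean_reward = sum(t["reward"] for t in batch) / batch_size
        print(f"step {step}: batch mean reward = {mean_reward:.3f}")

producer = threading.Thread(target=rollout_worker, args=(32,))
producer.start()
trainer(n_steps=8)
producer.join()
```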

Conclusion

GLM‑4.5 is positioned as a reset of industry benchmarks, offering a technically robust, open‑source LLM that excels across multiple tracks, though Claude still leads in some domains and GPT‑4.1 remains superior for long‑text depth.

Tags: AI, large language model, benchmark, MoE, training techniques, GLM-4.5
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
