How GLM‑5 Breaks New Ground with Sparse Attention and Asynchronous RL
GLM‑5 is a 744‑billion‑parameter open‑source LLM that introduces DeepSeek Sparse Attention, Multi‑latent Attention, the Muon Split optimizer, and a fully asynchronous agentic reinforcement‑learning framework. It achieves state‑of‑the‑art performance on long‑context, code, math, and multimodal benchmarks while running efficiently on domestic Chinese chips.
Architecture and Pre‑training Efficiency
GLM‑5 scales to 744 billion total parameters with 256 expert branches and a reduced depth of 80 layers. The model uses a Multi‑latent Attention (MLA) backbone and is trained on 28.5 trillion tokens. A modified Muon optimizer, called Muon Split, updates the projection weights of each attention head at different rates, closing the performance gap with grouped‑query attention.
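The paper does not publish Muon Split's internals, but the idea of a Muon‑style orthogonalized update applied with per‑projection step sizes can be sketched as follows. The Newton‑Schulz iteration is the standard Muon ingredient; the `lr_by_group` mapping of projection names to separate learning rates is an illustrative assumption about how "split" rates could be wired up.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=8):
    """Approximately orthogonalize a gradient matrix (Muon-style).

    Frobenius normalization keeps singular values below sqrt(3), the
    convergence region of the classic Newton-Schulz polar iteration.
    """
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def muon_split_update(weights, grads, lr_by_group):
    """Apply Muon updates with a separate step size per projection group.

    `lr_by_group` maps a projection name (e.g. 'q', 'v') to its own
    learning rate -- a hypothetical stand-in for per-head split rates.
    """
    new_weights = {}
    for name, w in weights.items():
        update = newton_schulz_orthogonalize(grads[name])
        new_weights[name] = w - lr_by_group[name] * update
    return new_weights
```

This is a minimal sketch, not the published optimizer; in particular the real Muon uses tuned polynomial coefficients and momentum accumulation, both omitted here.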
During inference the attention‑head dimension is expanded to 256, which lowers decoding latency while keeping overall FLOPs constant. The Multi‑token Prediction (MTP) mechanism shares parameters across three MTP layers, matching the memory footprint of the most efficient existing models and improving token acceptance rates.
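The parameter‑sharing idea behind the MTP heads can be illustrated in a few lines: rather than allocating three separate prediction layers, one layer is applied recurrently at each lookahead depth, so the memory footprint stays that of a single layer. The sizes and the tanh layer below are toy assumptions, not GLM‑5's actual head design.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, DEPTHS = 8, 16, 3

# One shared projection reused by all three MTP depths (parameter sharing):
# memory cost is one layer, not DEPTHS layers.
shared_mtp_layer = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
unembed = rng.normal(size=(HIDDEN, VOCAB)) * 0.1

def mtp_logits(hidden_state):
    """Predict logits for the next DEPTHS tokens by reusing one MTP layer."""
    logits = []
    h = hidden_state
    for _ in range(DEPTHS):
        h = np.tanh(h @ shared_mtp_layer)  # same weights at every depth
        logits.append(h @ unembed)
    return logits
```

The extra drafted tokens are then verified by the main model, which is where the improved acceptance rate pays off in decoding speed.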
The model incorporates DeepSeek Sparse Attention (DSA), a dynamic sparsity pattern that selects the most informative tokens in sequences of up to 200K tokens, cutting compute roughly in half. DSA training consists of a dense warm‑up stage followed by a sparse adaptation stage, achieving dense‑model quality with a fraction of the training budget.
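The core of the dynamic sparsity is a cheap scoring pass that picks which keys each query attends to. A minimal single-query sketch, with plain dot‑product scores standing in for DSA's learned selector:

```python
import numpy as np

def sparse_attention(q, K, V, k_keep):
    """Score all keys cheaply, then run full attention over only the
    top-k_keep of them -- the selection step is where compute is saved,
    since the expensive softmax-weighted sum touches k_keep values
    instead of the whole sequence."""
    scores = K @ q                       # (seq_len,) cheap relevance scores
    keep = np.argsort(scores)[-k_keep:]  # indices of the most informative keys
    s = scores[keep] / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())              # numerically stable softmax
    w /= w.sum()
    return w @ V[keep]                   # attend over selected values only
```

With a 200K‑token context and a fixed `k_keep`, the per‑query attention cost becomes roughly constant rather than linear in sequence length, which is the source of the claimed compute savings.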
Extensive ablations of attention variants show that interleaved sliding‑window attention performs poorly on long‑context tasks, while search‑based sparsity patterns recover performance. The SimpleGDN linearization strategy maximally re‑uses pretrained weights to balance efficiency and accuracy.
Asynchronous Reinforcement Learning
GLM‑5 adopts a progressive alignment pipeline: supervised fine‑tuning (SFT) → reasoning‑focused reinforcement learning (RL) → multi‑domain RL. In SFT three reasoning modes are defined:
Interleaved Thinking: deep deliberation before action.
Retained Thinking: preserves multi‑turn reasoning traces.
Round‑level Thinking: switches between casual chat and intensive reasoning.
Reasoning RL fine‑tunes the model on mathematics, science, code and tool‑integration using the IcePop technique based on the GRPO algorithm, which removes complex regularization terms and accelerates convergence.
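IcePop's specifics are not spelled out here, but the GRPO baseline it builds on is simple to show: each prompt gets a group of sampled completions, and every completion's advantage is its reward normalized by the group's mean and standard deviation, with no learned value network.

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages as in GRPO: center and scale each
    sampled completion's reward within its own group, so no separate
    value model is needed to estimate a baseline."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in group_rewards]
```

Completions above the group average get positive advantage and are reinforced; those below are pushed down. Whatever regularization IcePop removes on top of this is not shown.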
A deterministic top‑k operator stabilises training on sparse‑attention models by eliminating stochastic noise. To avoid idle compute in long‑horizon tasks, the team rebuilt a fully asynchronous Agentic RL architecture that decouples data‑generation workers from the training engine.
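The determinism point is concrete: a naive top‑k over tied scores can return different token sets across runs or devices. A sketch of a tie‑stable operator, using index order as the tiebreak (the tiebreak rule itself is an illustrative choice):

```python
def deterministic_topk(scores, k):
    """Top-k with index-based tie-breaking, so identical scores always
    select the identical token set -- removing one source of stochastic
    noise when training through a sparse-attention selector."""
    # Sort by (score descending, index ascending): ties go to the earlier token.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])
```

On a tied input the selection is reproducible: `deterministic_topk([0.5, 0.9, 0.5, 0.1], 2)` always picks indices 0 and 1, never 1 and 2.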
Key engineering components include:
Token‑in‑Token‑out: removes alignment bias caused by text reconstruction.
Dual‑side importance sampling combined with data‑parallel routing, maximising cache reuse for mixture‑of‑experts models.
Hybrid reward system that blends rule‑based judgments with generative evaluation to filter hallucinations and logical errors.
High‑quality human‑written responses serve as calibration anchors.
Online distillation across stages preserves earlier capabilities and mitigates catastrophic forgetting.
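The decoupling described above can be reduced to a producer‑consumer skeleton: rollout workers generate trajectories at their own pace and hand them to the trainer through a queue, so neither side idles on the other. This toy threading sketch only shows the handoff pattern, not the real distributed system.

```python
import queue, threading, time

def run_async_rl(num_rollouts=8):
    """Decoupled generation and training: two rollout workers push
    trajectories into a shared queue while the trainer consumes them
    as soon as they are ready, instead of waiting for a synchronous
    batch of long-horizon episodes to finish."""
    buffer = queue.Queue()
    consumed = []

    def rollout_worker(worker_id):
        for step in range(num_rollouts // 2):
            time.sleep(0.001)              # stand-in for slow agentic rollout
            buffer.put((worker_id, step))  # trajectory handed off immediately

    def trainer():
        for _ in range(num_rollouts):
            consumed.append(buffer.get())  # train on whatever arrives first

    workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(2)]
    t = threading.Thread(target=trainer)
    for w in workers:
        w.start()
    t.start()
    for w in workers:
        w.join()
    t.join()
    return consumed
```

In the off‑policy regime this creates, the dual‑side importance sampling mentioned above is what corrects for trajectories generated by a slightly stale policy.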
Agentic Engineering for Real‑World Development
An asynchronous multi‑task scheduler orchestrates thousands of concurrent test executions, allowing GLM‑5 to interact with realistic software environments. Real‑world defect data extracted from massive open‑source repositories are used to generate multilingual, verifiable sandboxes.
Generation is guided by a three‑layer reward hierarchy:
Static code‑structure compliance.
Runtime geometry and resource constraints.
Visual‑aesthetic quality.
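One natural reading of the hierarchy is as gated layers: a generated page earns runtime credit only if it passes static checks, and aesthetic credit only if it also runs cleanly. The gating scheme and the 0‑1 aesthetic scale below are illustrative assumptions, not the paper's actual reward weights.

```python
def hierarchical_reward(static_ok, runtime_ok, aesthetic_score):
    """Three-layer reward sketch: each layer is reachable only after
    the previous one passes, so aesthetics can never compensate for
    code that does not compile or run."""
    reward = 0.0
    if static_ok:                # layer 1: code-structure compliance
        reward += 1.0
        if runtime_ok:           # layer 2: runtime and resource constraints
            reward += 1.0
            # layer 3: visual quality, clamped to [0, 1]
            reward += max(0.0, min(1.0, aesthetic_score))
    return reward
```

The strict ordering matters for RL: it prevents the policy from chasing visual polish on pages that fail the cheaper, more objective checks.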
Mask correction and rejection sampling act as precise surgical tools, removing flawed pages while preserving valuable layout elements, thereby improving the logical coherence and visual impact of generated presentations.
Comprehensive Evaluation
Benchmarking on authoritative leaderboards shows GLM‑5 achieving top ranks, surpassing the 50‑point threshold on the AI Index, the first open‑source model to do so. On LMArena, GLM‑5 leads both the text and code tracks, outperforming closed‑source competitors.
On long‑duration simulations such as a year‑long vending‑machine business model and large‑scale code‑challenge suites, GLM‑5 matches the performance of Claude Opus 4.5.
Academic reasoning and programming test sets confirm a strong lead over previous open‑source releases and a narrowing gap to the most advanced proprietary systems.
The CC‑Bench‑V2 automated benchmark evaluates front‑end development by having an agent click buttons and adjust windows, simulating real user interactions. GLM‑5 attains a high success rate on both code generation and backend tasks, satisfying strict functional and boundary tests.
Domestic Compute Ecosystem
GLM‑5 natively supports seven major Chinese chip platforms, including Huawei Ascend, Cambricon, Moore Threads and HaiGuang. Kernel‑to‑framework optimisations enable a single domestic node to match the performance of an international dual‑node cluster, halving deployment costs for long‑sequence workloads.
With its sparse architecture and asynchronous collaboration design, GLM‑5 provides a high‑efficiency pathway toward powerful intelligent agents.
Reference: https://arxiv.org/pdf/2602.15763