Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge
The GLM-5 architecture, uncovered in a vLLM GitHub pull request, roughly doubles its predecessor to 745 B parameters, adopts DeepSeek‑V3's sparse attention and multi‑token prediction, pairs 78 layers with a 256‑expert MoE design, and supports a 202K‑token context window; meanwhile its rumored test model "Pony Alpha" sparked a 60% rise in Zhipu AI's stock amid a crowded AI release season.
Architecture leak via vLLM pull request
Developers inspecting a pull request in the vLLM inference framework observed that the implementation of the upcoming GLM‑5 model maps directly onto DeepSeek‑V3 components, exposing the architecture before any official announcement.
Core components
DeepSeek Sparse Attention (DSA) operates in two stages. First, a lightweight Lightning Indexer scans all historical tokens and scores their relevance to the current query token. Then only the top‑k highest‑scoring tokens are processed with full attention while the remaining tokens are skipped, dramatically improving long‑text efficiency with minimal impact on output quality.
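As a rough illustration of that two-stage pattern, here is a minimal PyTorch sketch of a single decode step; the dot-product indexer, the shapes, and the function name are simplified assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn.functional as F

def sparse_attention_step(q, keys, values, idx_q, idx_keys, k=64):
    """One decode step of top-k sparse attention (illustrative shapes):
    q        (d,)     full query for the current token
    keys     (T, d)   cached keys for all previous tokens
    values   (T, d)   cached values
    idx_q    (d_i,)   cheap indexer query, d_i << d
    idx_keys (T, d_i) cheap indexer keys"""
    # Stage 1: the lightweight indexer scores every historical token.
    scores = idx_keys @ idx_q                        # (T,)
    # Stage 2: keep only the top-k highest-scoring tokens ...
    top = torch.topk(scores, min(k, scores.numel())).indices
    sel_k, sel_v = keys[top], values[top]            # (k, d)
    # ... and run ordinary scaled dot-product attention over just those.
    attn = F.softmax(sel_k @ q / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ sel_v                              # (d,)
```

The per-step cost thus drops from full attention over all T cached tokens to one cheap indexer pass plus full attention over only k tokens.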
Multi‑Token Prediction (MTP) lets the model draft several future tokens in a single forward pass, accelerating generation; the drafted tokens are typically verified by the main model in a speculative-decoding loop, preserving output quality.
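A common way to realize this is a set of lightweight extra heads on the shared trunk, each drafting one additional future token. The sketch below shows only that head structure; it is a generic illustration of the MTP idea, not GLM‑5's actual module.

```python
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    """Generic multi-token-prediction heads (illustrative, not GLM-5's code):
    one main head predicts token t+1, and each extra head drafts a further
    future token (t+2, t+3, ...) from the same hidden state."""

    def __init__(self, d_model: int, vocab: int, n_extra: int = 2):
        super().__init__()
        self.main = nn.Linear(d_model, vocab)
        self.extra = nn.ModuleList(
            nn.Linear(d_model, vocab) for _ in range(n_extra)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) hidden state at the last position.
        logits = [self.main(h)] + [head(h) for head in self.extra]
        return torch.stack(logits, dim=1)   # (batch, 1 + n_extra, vocab)
```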
Model scale and architecture
Code analysis shows GLM‑5 contains 78 hidden layers and adopts a mixture‑of‑experts (MoE) design with 256 experts, activating 8 experts per token. This yields roughly 44 B active parameters, an activation ratio of about 5.9 % (comparable to DeepSeek‑V3.2's 5.4 %). The total parameter count is reported as 745 B, about twice that of the previous GLM‑4.7.
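Those figures are easy to sanity-check, and the routing itself follows the standard top‑k gating pattern. Below is a back-of-the-envelope check plus a generic sketch of such a routed layer; the `route` function, its shapes, and the gate are illustrative assumptions, not leaked code.

```python
import torch

# Sanity check of the reported figures (numbers from the leaked config).
total, active = 745e9, 44e9
print(f"activation ratio: {active / total:.1%}")   # -> 5.9%

def route(x, gate, experts, k=8):
    """Generic top-k MoE routing sketch (not GLM-5's actual code).
    x: (d,) token hidden state; gate: (n_experts, d) router weights;
    experts: list of n_experts callables, each mapping (d,) -> (d,)."""
    logits = gate @ x                                # (n_experts,) router scores
    weights, idx = torch.topk(logits.softmax(-1), k)
    weights = weights / weights.sum()                # renormalize over the k picked
    # Only k of the 256 experts actually run for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))
```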
The context window extends to 202 K tokens.
Inference compatibility
Because GLM‑5 reuses the DeepSeek‑V3 architecture, it can directly benefit from existing optimizations in inference engines such as vLLM and SGLang, lowering deployment barriers.
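If that holds, serving should reduce to a few lines once support merges. The sketch below uses vLLM's offline LLM API; the model id is a placeholder assumption, since no official checkpoint has been published at the time of writing.

```python
# Hypothetical serving sketch: no official GLM-5 checkpoint exists yet,
# so the model id below is a placeholder guess, not a real repository.
from vllm import LLM, SamplingParams

llm = LLM(model="zai-org/GLM-5", tensor_parallel_size=8)   # id assumed
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain sparse attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```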
Potential limitation
Some observers note that DeepSeek‑V3 is primarily a text‑only architecture, raising the question of whether the initial GLM‑5 release will ship without multimodal capabilities.
Anonymous model "Pony Alpha" on OpenRouter
An anonymous model named "Pony Alpha" appeared on the OpenRouter platform with a context window of roughly 200 K tokens. The community reported exceptionally strong programming and reasoning abilities, and more than 91 % of users identified it as a GLM‑5 test version.
Evidence linking Pony Alpha to GLM‑5 includes:
Temporal alignment: the model’s appearance coincides with multiple hints from Zhipu AI chief scientist Tang Jie about a GLM‑5 release window.
Token‑level response similarity: developers observed that Pony Alpha’s reactions to specific tokens match those of previous GLM series models.
Stylistic consistency: output formatting habits of Pony Alpha closely resemble those of the GLM family.
Release timeline
Internal communications indicate that GLM‑5 is slated for release in mid‑February 2026, around the Chinese New Year period, a timeframe that also sees announcements from DeepSeek, Qwen 3.5, and MiniMax M2.2.
Reference links:
[1] https://github.com/vllm-project/vllm/pull/34124
[2] https://x.com/chetaslua/status/2020832197771714943