Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge

The GLM-5 architecture, uncovered in a GitHub pull request, roughly doubles the previous model's size to 745B parameters, adopts DeepSeek-V3-style sparse attention and multi-token prediction, features a 78-layer MoE with 256 experts, and supports a 202K-token context window. Its rumored test model, "Pony Alpha," sparked a 60% rise in Zhipu AI's stock amid a crowded AI release season.

Machine Learning Algorithms & Natural Language Processing

Architecture leak via vLLM pull request

Developers inspecting a pull request in the vLLM inference framework observed that the implementation of the upcoming GLM‑5 model maps directly onto DeepSeek‑V3 components, exposing the architecture before any official announcement.

Core components

DeepSeek Sparse Attention (DSA) operates in two stages. First, a lightweight Lightning Indexer scans all historical tokens and scores their relevance to the current query token. Then only the top‑k highest‑scoring tokens are processed with full attention while the remaining tokens are skipped, dramatically improving long‑text efficiency with minimal impact on output quality.
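The two-stage flow can be sketched for a single query token in NumPy. This is a toy illustration under stated assumptions: the function name, the indexer's cheap dot-product scoring, and all shapes are hypothetical, not GLM-5's or DeepSeek's actual kernels.

```python
import numpy as np

def dsa_sparse_attention(q, K, V, idx_q, idx_K, top_k=8):
    """Toy two-stage sparse attention for one query token.

    Stage 1: a lightweight indexer scores every past token against the
    current query and keeps only the top-k highest-scoring positions.
    Stage 2: full softmax attention runs over just those k tokens.
    """
    # Stage 1: indexer relevance scores (a cheap dot product here).
    scores = idx_K @ idx_q                       # (seq_len,)
    keep = np.argsort(scores)[-top_k:]           # top-k token positions

    # Stage 2: dense attention restricted to the selected tokens.
    logits = K[keep] @ q / np.sqrt(q.shape[0])   # (top_k,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[keep]                     # (d_v,)

rng = np.random.default_rng(0)
seq, d = 64, 16
out = dsa_sparse_attention(rng.normal(size=d), rng.normal(size=(seq, d)),
                           rng.normal(size=(seq, d)), rng.normal(size=d),
                           rng.normal(size=(seq, d)))
print(out.shape)  # (16,)
```

Because stage 2 touches only `top_k` tokens instead of all `seq_len`, the cost of the expensive attention step no longer grows with the full history length, which is the source of the long-text efficiency gain.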

Multi‑Token Prediction (MTP) enables the model to generate multiple tokens in a single step, accelerating generation speed.
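Conceptually, MTP amortizes one forward pass over several output tokens. The sketch below assumes a hypothetical `predict_k` interface that returns k next tokens per call; real MTP implementations also verify drafted tokens against the main head, which is omitted here.

```python
def mtp_generate(predict_k, tokens, steps, k=2):
    """Sketch of multi-token decoding: k tokens per model call instead
    of one, so `steps` calls yield steps * k new tokens (verification
    of drafted tokens, used in practice, is omitted)."""
    for _ in range(steps):
        tokens = tokens + predict_k(tokens, k)
    return tokens

# Toy stand-in "model": emits the next k sequential token ids.
toy = lambda toks, k: [len(toks) + i for i in range(k)]
print(mtp_generate(toy, [0], steps=3, k=2))  # [0, 1, 2, 3, 4, 5, 6]
```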

Model scale and architecture

Code analysis shows GLM‑5 contains 78 hidden layers and adopts a mixture‑of‑experts (MoE) design with 256 experts, of which 8 are activated per token. This yields roughly 44B active parameters, a sparsity of 5.9% (comparable to DeepSeek‑V3.2's 5.4%). The total parameter count is reported as 745B, about twice that of the previous GLM‑4.7.
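The sparsity figure follows directly from the reported counts. Note that the raw 8-of-256 expert routing ratio is lower than the parameter sparsity, plausibly because non-expert parameters (attention, embeddings, any shared layers) are always active:

```python
total_b, active_b = 745, 44          # reported parameter counts (billions)
print(f"{active_b / total_b:.1%}")   # 5.9% of parameters active per token

experts_active, experts_total = 8, 256
print(f"{experts_active / experts_total:.1%}")  # 3.1% of experts routed
```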

The context window supports up to 202 K tokens.

Inference compatibility

Because GLM‑5 reuses the DeepSeek‑V3 architecture, it can directly benefit from existing optimizations in inference engines such as vLLM and SGLang, lowering deployment barriers.

Potential limitation

Some observers note that DeepSeek‑V3 is primarily a pure‑text architecture, raising the question of whether the initial GLM‑5 release may lack multimodal capabilities.

Anonymous model "Pony Alpha" on OpenRouter

An anonymous model named "Pony Alpha" appeared on the OpenRouter platform with a context window of roughly 200K tokens. The community reported exceptionally strong programming and reasoning abilities, and in community polling more than 91% of users identified it as a GLM‑5 test version.

Evidence linking Pony Alpha to GLM‑5 includes:

- Temporal alignment: the model's appearance coincides with multiple hints from Zhipu AI chief scientist Tang Jie about a GLM‑5 release window.
- Token‑level response similarity: developers observed that Pony Alpha's reactions to specific tokens match those of previous GLM series models.
- Stylistic consistency: output formatting habits of Pony Alpha closely resemble those of the GLM family.

Release timeline

Internal communications indicate that GLM‑5 is slated for release in mid‑February 2026, around the Chinese New Year period, a timeframe that also sees announcements from DeepSeek, Qwen 3.5, and MiniMax M2.2.

Reference links:

[1] https://github.com/vllm-project/vllm/pull/34124

[2] https://x.com/chetaslua/status/2020832197771714943

Tags: Mixture of Experts, DeepSeek, Large Language Model, Multi‑Token Prediction, Sparse Attention, GLM-5, AI Stock Impact
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
