Inside Kimi Linear: How Aggressive MoE Sparsity and Hybrid Linear Attention Boost a 3B‑Scale LLM

The author details Kimi Linear's architecture, training challenges, aggressive MoE sparsity, hybrid linear attention design, benchmark gains, and post‑training insights, offering a transparent technical review of this 48B‑parameter MoE LLM trained on 5.7 T tokens.


Model Architecture

The Kimi Linear model extends the Moonlight design [1] by increasing Mixture‑of‑Experts (MoE) sparsity from 8 to 32 experts. Its core attention mechanism is KDA (Kimi Delta Attention), a linear‑attention variant that combines GDN‑style gating with the fine‑grained, per‑channel control of GLA. To reduce the risk of relying on pure linear attention in a production‑grade LLM, a hybrid configuration interleaves KDA layers with conventional MLA layers at a 3:1 ratio, a setting identified as optimal after extensive ablation studies.
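
To make the 3:1 interleaving concrete, here is a minimal sketch of how such a hybrid stack could be laid out. The class and function names are hypothetical illustrations, not the released Kimi Linear code.

```python
# Illustrative 3:1 KDA/MLA hybrid stack (hypothetical names, not Moonshot's code).
from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str          # "kda" (linear attention) or "mla" (full attention)
    layer_index: int

def build_hybrid_stack(num_layers: int, kda_per_mla: int = 3) -> list[LayerSpec]:
    """Every (kda_per_mla + 1)-th layer is full-attention MLA; the rest are KDA."""
    specs = []
    for i in range(num_layers):
        kind = "mla" if (i + 1) % (kda_per_mla + 1) == 0 else "kda"
        specs.append(LayerSpec(kind=kind, layer_index=i))
    return specs

if __name__ == "__main__":
    stack = build_hybrid_stack(num_layers=8)
    print([s.kind for s in stack])
    # ['kda', 'kda', 'kda', 'mla', 'kda', 'kda', 'kda', 'mla']
```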

Training used a token budget of 5.7 T tokens with roughly 3 B activated parameters per token. The small KV‑cache footprint of KDA enables a decoding speedup of about 6× compared with full‑attention baselines.
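
The cache saving behind that speedup can be estimated with back‑of‑envelope arithmetic: MLA layers keep a per‑token KV cache that grows with context length, while KDA layers carry only a fixed‑size recurrent state. The dimensions in the sketch below are made‑up placeholders, not the real Kimi Linear sizes; the point is only the scaling behaviour of a 3:1 hybrid.

```python
# Rough KV-cache comparison: hybrid (MLA every 4th layer) vs. full-attention stack.
# All dimensions are illustrative placeholders, not Kimi Linear's actual sizes.
def cache_bytes(num_layers, mla_every, seq_len,
                kv_dim=512, kda_state_dim=128 * 128, dtype_bytes=2):
    num_mla = num_layers // mla_every
    num_kda = num_layers - num_mla
    mla_cache = num_mla * seq_len * kv_dim * dtype_bytes   # grows with context length
    kda_state = num_kda * kda_state_dim * dtype_bytes      # constant in seq_len
    return mla_cache + kda_state

seq_len, layers = 1_000_000, 48
hybrid = cache_bytes(layers, mla_every=4, seq_len=seq_len)  # 3:1 KDA:MLA
full = cache_bytes(layers, mla_every=1, seq_len=seq_len)    # every layer full attention
print(f"hybrid/full cache ratio at 1M tokens: {hybrid / full:.2f}")  # ~0.25
```

The cache ratio alone does not equal wall‑clock speedup; the linear‑time KDA computation contributes as well, but the shrinking cache is the main reason long‑context decoding gets cheaper.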

Paper: https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf
Code: https://github.com/MoonshotAI/Kimi-Linear
Reddit discussion: https://www.reddit.com/r/LocalLLaMA/comments/1ojz8pz/kimi_linear_released/

Training Procedure

The model was scaled to a 48 B MoE configuration with the 5.7 T token budget. Distributed training frequently experienced interruptions, requiring manual monitoring and coordination across UTC+8 and Bay‑Area time zones.

Key ablation topics included:

Positional encoding choice: NoPE vs. RoPE.

Forget‑gate design: pure sigmoid versus GDN‑style (see the sketch after this list).

Output‑gate role and its impact on short‑ and long‑context handling.
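
A minimal sketch of the two forget‑gate parameterizations, written the way they usually appear in the gated linear‑attention literature. This is an illustration of the design space under my reading, not the exact formulation used in the Kimi Linear code.

```python
# Two common forget-gate parameterizations for gated linear attention
# (illustrative, not Moonshot's implementation).
import torch
import torch.nn.functional as F

def sigmoid_forget_gate(x, w_f):
    """Pure sigmoid gate: a decay in (0, 1) predicted directly per channel."""
    return torch.sigmoid(x @ w_f)

def gdn_style_forget_gate(x, w_a, A_log):
    """GDN/Mamba2-style gate: decay = exp(-softplus(x @ w_a) * exp(A_log)),
    where A_log is a learned per-head bias (the vector discussed later)."""
    return torch.exp(-F.softplus(x @ w_a) * torch.exp(A_log))

# Toy usage with random projections.
x = torch.randn(2, 16, 256)                         # (batch, seq, hidden)
w_f = torch.randn(256, 64) * 0.02                   # per-channel gate projection
w_a = torch.randn(256, 8) * 0.02                    # one decay rate per head
A_log = torch.zeros(8)                              # learned bias, zero-initialized here
print(sigmoid_forget_gate(x, w_f).shape)            # torch.Size([2, 16, 64])
print(gdn_style_forget_gate(x, w_a, A_log).shape)   # torch.Size([2, 16, 8])
```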

The “Scaling Ladder” strategy was followed: start from a model with roughly 1 B activated parameters, require benchmark improvements at that scale, and only then progress to the next size. This incremental approach ensured that each size met predefined performance thresholds before scaling further.
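
The gating logic of that ladder reduces to a simple loop; the rungs, thresholds, and evaluation hook below are hypothetical placeholders, since the actual criteria are not public.

```python
# Sketch of a "Scaling Ladder": only advance to the next model size when the
# current rung improves on its predecessor. All numbers are placeholders.
LADDER = [
    {"activated_params": "1B"},
    {"activated_params": "3B"},
]

def run_ladder(train_and_eval):
    """train_and_eval(rung) -> benchmark score; the caller supplies the real logic."""
    best = float("-inf")
    for rung in LADDER:
        score = train_and_eval(rung)
        if score <= best:
            raise RuntimeError(f"Rung {rung['activated_params']} did not improve; stop scaling.")
        best = score
    return best

if __name__ == "__main__":
    scores = iter([0.42, 0.57])                    # dummy, monotonically improving
    print(run_ladder(lambda rung: next(scores)))   # 0.57
```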

During training, two bias vectors (named A_log) were initially stored in bf16. Mid‑training they were switched to fp32, after which their maximum values dropped rapidly. This precision change was a deliberate derisking step, though its exact impact on final quality remains uncertain.
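
Mechanically, promoting such a bias to fp32 mid‑run is just a cast of the parameter (plus, in practice, its optimizer state). A minimal PyTorch sketch under those assumptions, not the actual training code:

```python
# Minimal sketch of promoting a single parameter from bf16 to fp32 mid-training.
# Illustrative only; real mixed-precision setups typically also recast the
# optimizer state or keep fp32 master weights from the start.
import torch
import torch.nn as nn

class ToyGate(nn.Module):
    def __init__(self, num_heads: int = 8):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(num_heads, dtype=torch.bfloat16))

model = ToyGate()
print(model.A_log.dtype)   # torch.bfloat16

# Replacing .data is the standard way to change a parameter's dtype in place.
model.A_log.data = model.A_log.data.to(torch.float32)
print(model.A_log.dtype)   # torch.float32
```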

Post‑Training and Evaluation

Post‑training recipes were still experimental. Dozens of data formulations were tested, borrowing ideas from Moonlight and K2. A clear trade‑off emerged: configurations that maximized math/code benchmark scores often degraded real‑world user experience. The final configuration balances leaderboard performance with usability, delivering a model that is both fast and engaging.

Quantitative outcomes:

Decoding speedup ≈ 6× due to the small KV‑cache of KDA.

Personality described as “small K2‑like”, with strong engagement in interactive settings.

Benchmark gains over comparable baselines, though some comparison settings were not perfectly fair.

Broader Reflections

Comparisons with RWKV‑7 highlight shared components, while the community is converging on Delta‑variant linear attention as a promising direction. Sparse‑attention approaches such as NSA remain attractive alternatives.

The original goal was state‑of‑the‑art performance at this scale, but resource constraints made that impractical. Consequently, the focus shifted to fair comparisons at a 1 T token budget as part of the Scaling Ladder, providing solid technical validation for future flagship models (e.g., K3).

References

[1] Moonlight: https://arxiv.org/abs/2502.16982
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
