How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits
MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.
Background
Standard Transformers use full attention, whose compute and memory scale as O(N²) in sequence length N, creating both a compute wall and a memory wall at million‑token contexts. Pure sparse attention reduces compute but still requires a full KV‑cache; pure linear attention reduces compute to O(N) but loses accuracy on long‑range dependencies.
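To make the scaling gap concrete, here is a back‑of‑the‑envelope FLOP comparison, a sketch with illustrative constants (the function names, head dimension, and cost formulas are simplified assumptions, not MiniCPM‑SALA measurements):

```python
# Rough per-layer cost models: full attention does ~O(N^2 * d) work
# (QK^T scores plus attention-weighted values), while linear attention
# does ~O(N * d^2) (kernelized state update plus readout).

def full_attention_flops(n_tokens: int, d_head: int = 128) -> int:
    # 2 * N^2 * d: score matrix plus value aggregation
    return 2 * n_tokens**2 * d_head

def linear_attention_flops(n_tokens: int, d_head: int = 128) -> int:
    # 2 * N * d^2: running (K^T V) state plus per-token readout
    return 2 * n_tokens * d_head**2

for n in (4_096, 262_144, 1_048_576):
    ratio = full_attention_flops(n) / linear_attention_flops(n)
    print(f"N={n:>9,}: full/linear FLOP ratio ~ {ratio:,.0f}x")
```

The ratio simplifies to N/d, which is why the gap widens from tens at 4 K tokens to thousands at a million tokens.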
Hybrid SALA Architecture
MiniCPM‑SALA combines sparse attention (InfLLM‑V2) and linear attention (Lightning Attention) in a single 8B‑parameter model. A quarter of the layers use InfLLM‑V2 for high‑fidelity modeling with a small KV‑cache, while the remaining three quarters use Lightning Attention for O(N) global computation. This 75/25 linear/sparse split empirically yields the best trade‑off between efficiency and semantic precision, enabling context windows up to 2,048 K tokens without extrapolation tricks such as YaRN.
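One way to realize the split above is to interleave the sparse layers evenly among the linear ones. The following is a hypothetical layer‑assignment sketch; the function name, layer labels, and even‑interleaving policy are illustrative assumptions, not the released model's configuration:

```python
# Hypothetical assignment of layer types for a 75/25 linear/sparse hybrid.
# Evenly spacing sparse layers is one plausible policy; the actual
# MiniCPM-SALA layout may differ.

def assign_layer_types(num_layers: int, sparse_fraction: float = 0.25) -> list[str]:
    """Label each layer 'sparse' or 'linear', spreading sparse layers evenly."""
    num_sparse = max(1, round(num_layers * sparse_fraction))
    stride = num_layers / num_sparse
    sparse_ids = {round(i * stride) for i in range(num_sparse)}
    return ["sparse" if i in sparse_ids else "linear" for i in range(num_layers)]

layout = assign_layer_types(32)
print(layout.count("sparse"), layout.count("linear"))  # 8 24
```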
Key technical contributions
Mixed attention design (SALA): the first architecture to integrate InfLLM‑V2 sparse attention and Lightning linear attention in one model.
HALO conversion: a lightweight procedure that transforms a pretrained full‑attention Transformer into the mixed architecture, cutting total pre‑training cost to roughly 25 % of training from scratch.
Hybrid Position Encoding (HyPE): linear layers retain RoPE while sparse layers use NoPE, eliminating the long‑range decay of rotary embeddings.
Inference efficiency: a 3.5× speed‑up over Qwen3‑8B on 256 K‑token sequences, and the ability to process up to 1 M tokens on consumer‑grade GPUs without running out of memory.
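The HyPE idea above can be sketched as a per‑layer switch: apply rotary embeddings only in linear‑attention layers and leave sparse layers unrotated (NoPE). The helper names and vector‑based RoPE below are illustrative assumptions, not MiniCPM‑SALA's actual API:

```python
import math

# Minimal HyPE sketch: linear layers get RoPE, sparse layers get NoPE
# (the identity). Real implementations operate on batched tensors.

def rope_rotate(q: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Apply rotary position embedding to a query vector at position `pos`."""
    out = []
    for i in range(0, len(q), 2):
        theta = pos / base ** (i / len(q))
        c, s = math.cos(theta), math.sin(theta)
        out += [q[i] * c - q[i + 1] * s, q[i] * s + q[i + 1] * c]
    return out

def position_encode(q: list[float], pos: int, layer_type: str) -> list[float]:
    # HyPE: rotate only in linear-attention layers; sparse layers are NoPE.
    return rope_rotate(q, pos) if layer_type == "linear" else q

q = [1.0, 0.0, 1.0, 0.0]
assert position_encode(q, 5, "sparse") == q  # NoPE: vector unchanged
assert position_encode(q, 0, "linear") == q  # RoPE at position 0 is identity
```

Because sparse layers carry no rotary rotation, their attention scores do not decay with token distance, which is the long‑range benefit the bullet describes.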
Training pipeline
The training consists of five stages:
HALO conversion: convert 75 % of layers to linear attention, keeping the first and last layers unchanged; trained on 1.3 B tokens of length 512.
Stable continued training: 314.6 B tokens of length 4 K with sparse attention disabled; learning rate 7.5e‑3.
Short‑Decay phase: 1 T tokens of length 4 K with exponential LR decay to 3.75e‑4, weighted toward L2‑filtered data and PDF corpora.
Long‑Decay phase: context window gradually expanded to 32 K, 160 K, then 520 K tokens with 102.2 B + 62.9 B + 50.6 B tokens respectively; sparse attention re‑enabled.
Supervised fine‑tuning (SFT): high‑quality reasoning, code, math, and function‑calling data; trained at 64 K and 140 K contexts with 204.5 B + 213.3 B tokens.
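The five stages above can be laid out as a plain schedule table. Token counts are copied from the text; the field names and the `ctx` column (showing each stage's maximum context length) are illustrative:

```python
# Training schedule summary; tokens_B = billions of tokens per stage.
STAGES = [
    {"name": "HALO conversion",          "tokens_B": 1.3,                   "ctx": "512"},
    {"name": "Stable continued training", "tokens_B": 314.6,                 "ctx": "4K"},
    {"name": "Short-Decay",              "tokens_B": 1000.0,                "ctx": "4K"},
    {"name": "Long-Decay",               "tokens_B": 102.2 + 62.9 + 50.6,   "ctx": "32K-520K"},
    {"name": "SFT",                      "tokens_B": 204.5 + 213.3,         "ctx": "64K-140K"},
]

total = sum(s["tokens_B"] for s in STAGES)
print(f"total ~ {total:.1f} B tokens across {len(STAGES)} stages")
```

Summing the stages gives roughly 1.95 T tokens end to end, consistent with the claim that HALO conversion keeps total cost near a quarter of from‑scratch pre‑training.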
Evaluation
On short‑context benchmarks (knowledge QA, math, code generation) MiniCPM‑SALA matches full‑attention 8 B models. On long‑context benchmarks it surpasses them, maintaining stable performance up to 2,048 K tokens without any extra techniques.
Inference speed was measured on an NVIDIA A6000D (96 GB) and an RTX 5090 (32 GB): at 256 K tokens, TTFT drops from 180.8 s (Qwen3‑8B) to 51.6 s (MiniCPM‑SALA), a 3.5× acceleration. MiniCPM‑SALA also avoids the out‑of‑memory failures that stop Qwen3‑8B, enabling million‑token processing on consumer GPUs.
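The headline speed‑up follows directly from the two TTFT figures quoted above; a quick sanity check:

```python
# Reported time-to-first-token at 256 K tokens (seconds, from the text).
ttft_qwen3_8b = 180.8
ttft_minicpm_sala = 51.6

speedup = ttft_qwen3_8b / ttft_minicpm_sala
print(f"speedup ~ {speedup:.1f}x")  # ~3.5x
```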
Resources
GitHub repository: https://github.com/openbmb/minicpm
HuggingFace model page: https://huggingface.co/openbmb/MiniCPM-SALA
ModelScope: https://www.modelscope.cn/models/OpenBMB/MiniCPM-SALA
Technical report PDF: https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf