How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces the quadratic compute and memory costs of full attention, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences of up to 1 million tokens.


Background

Standard Transformers use full attention, whose compute and memory scale as O(N²) with sequence length N, creating both a compute wall and a memory wall for million‑token contexts. Pure sparse attention reduces compute but still requires the full KV cache; pure linear attention reduces compute to O(N) but loses accuracy on long‑range dependencies.
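The trade‑off is easiest to see in code. Below is a minimal, non‑causal single‑head sketch, assuming PyTorch and a simple ReLU feature map; it is not the paper's InfLLM‑V2 or Lightning Attention kernels, only an illustration of why softmax attention costs O(N²) while a kernelized linear form costs O(N).

```python
# Minimal single-head sketch (illustrative only; not the paper's kernels).
# N = sequence length, d = head dimension.
import torch

def softmax_attention(q, k, v):
    # Materializes an N x N score matrix: compute and memory grow as O(N^2).
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized (non-causal) form: a positive feature map plus associativity
    # lets us build a d x d summary first, so cost grows as O(N).
    q, k = torch.relu(q) + eps, torch.relu(k) + eps
    kv = k.transpose(-1, -2) @ v                              # (d, d) key-value summary
    norm = q @ k.sum(dim=0, keepdim=True).transpose(-1, -2)   # (N, 1) normalizer
    return (q @ kv) / norm

N, d = 4096, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)  # (N, d) each
```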

Hybrid SALA Architecture

MiniCPM‑SALA combines sparse attention (InfLLM‑V2) and linear attention (Lightning Attention) in a single 8B‑parameter model. Twenty‑five percent of the layers use InfLLM‑V2 for high‑fidelity local modeling with a small KV cache, while the remaining seventy‑five percent use Lightning Attention for O(N) global computation. This 75/25 split empirically gives the best trade‑off between efficiency and semantic precision, enabling context windows of up to 2,048K tokens without additional tricks such as YaRN.
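As a rough illustration of the split (the exact depth and layer placement in MiniCPM‑SALA are not given here, so the pattern below is an assumption), a 75/25 plan could look like this:

```python
# Hypothetical layer plan: 1 sparse (InfLLM-V2) layer for every 3 linear
# (Lightning Attention) layers. Depth and placement pattern are illustrative,
# not taken from the MiniCPM-SALA technical report.
NUM_LAYERS = 32
SPARSE_EVERY = 4  # 1 in 4 layers is sparse -> 25% sparse, 75% linear

layer_plan = [
    "sparse" if (i + 1) % SPARSE_EVERY == 0 else "linear"
    for i in range(NUM_LAYERS)
]
assert layer_plan.count("sparse") == NUM_LAYERS // SPARSE_EVERY
print(layer_plan[:8])  # ['linear', 'linear', 'linear', 'sparse', 'linear', ...]
```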

Key technical contributions

Mixed attention design (SALA): the first architecture to integrate InfLLM‑V2 and Lightning Attention.

HALO conversion: a lightweight procedure that transforms a pretrained full‑attention Transformer into the mixed architecture, reducing total pre‑training cost to roughly 25% of training from scratch.

Hybrid Position Encoding (HyPE): linear layers retain RoPE while sparse layers use NoPE, eliminating the long‑range decay of rotary embeddings (see the sketch after this list).

Inference efficiency: a 3.5× speed‑up over Qwen3‑8B on 256K‑token sequences, and the ability to process up to 1M tokens on consumer‑grade GPUs without running out of memory.
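A sketch of what HyPE's per‑layer switch could look like, assuming PyTorch; the apply_rope helper is a standard rotary‑embedding rotation and the function names are ours, not MiniCPM‑SALA's code:

```python
import torch

def apply_rope(x, base=10000.0):
    # Standard rotary embedding: rotate channel pairs by a position-dependent angle.
    # x: (seq_len, dim), dim even.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(-1)          # (seq_len, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # (dim/2,)
    angles = pos * freqs                                                    # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def encode_positions(q, k, layer_kind):
    # HyPE idea: RoPE on linear-attention layers, NoPE (no rotation) on sparse layers.
    if layer_kind == "linear":
        return apply_rope(q), apply_rope(k)
    return q, k

q = k = torch.randn(1024, 64)
q_rope, k_rope = encode_positions(q, k, "linear")   # rotated
q_nope, k_nope = encode_positions(q, k, "sparse")   # unchanged
```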

Training pipeline

Training proceeds in five stages, summarized in the config sketch after this list:

HALO conversion: convert 75% of the layers to linear attention while keeping the first and last layers unchanged; trained on 1.3B tokens at sequence length 512.

Stable continued training: 314.6B tokens at length 4K, with sparse attention disabled and a learning rate of 7.5e‑3.

Short‑decay phase: 1T tokens at length 4K with exponential learning‑rate decay to 3.75e‑4, drawing heavily on L2‑filtered data and PDF corpora.

Long‑decay phase: the context window is gradually expanded to 32K, 160K, and then 520K tokens (102.2B + 62.9B + 50.6B tokens respectively), with sparse attention re‑enabled.

Supervised fine‑tuning (SFT): high‑quality reasoning, code, math, and function‑calling data; trained at 64K and 140K contexts with 204.5B + 213.3B tokens.
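The five stages above, collected into one illustrative schedule (the field names are ours; token counts, context lengths, and learning rates are the reported values):

```python
# Illustrative summary of the MiniCPM-SALA training schedule described above.
# Keys are hypothetical; the numbers mirror the five stages in the list.
PIPELINE = [
    {"stage": "halo_conversion", "tokens": "1.3B", "context": "512"},
    {"stage": "stable_continued_training", "tokens": "314.6B", "context": "4K",
     "lr": 7.5e-3, "sparse_attention": False},
    {"stage": "short_decay", "tokens": "1T", "context": "4K", "lr_decay_to": 3.75e-4},
    {"stage": "long_decay", "tokens": "102.2B + 62.9B + 50.6B",
     "context": "32K -> 160K -> 520K", "sparse_attention": True},
    {"stage": "sft", "tokens": "204.5B + 213.3B", "context": "64K / 140K"},
]

for step in PIPELINE:
    print(f"{step['stage']:>26}  {step['tokens']:>25}  ctx={step['context']}")
```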

Evaluation

On short‑context benchmarks (knowledge QA, math, code generation), MiniCPM‑SALA matches full‑attention 8B models. On long‑context benchmarks it surpasses them, maintaining stable performance up to 2,048K tokens without any extra context‑extension techniques.

Inference speed was measured on an NVIDIA A6000D (96 GB) and an RTX 5090 (32 GB): at 256K tokens, time to first token (TTFT) drops from 180.8 s for Qwen3‑8B to 51.6 s for MiniCPM‑SALA, a 3.5× acceleration. The model also avoids the out‑of‑memory failures that stop Qwen3‑8B, enabling million‑token processing on consumer GPUs.

Resources

GitHub repository: https://github.com/openbmb/minicpm

HuggingFace model page: https://huggingface.co/openbmb/MiniCPM-SALA

ModelScope: https://www.modelscope.cn/models/OpenBMB/MiniCPM-SALA

Technical report PDF: https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf

Tags: LLM · long context · model architecture · sparse attention · linear attention