Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

This article presents SALA, a sparse-linear hybrid attention architecture that replaces full attention in 9B-parameter models. It matches full-attention accuracy while cutting compute and memory costs, enabling million-token inference on a single RTX 5090 and delivering up to a 3.5× speed-up over Qwen3-8B.


Transformer models rely on full (quadratic) attention, which becomes a compute and memory bottleneck for ultra‑long contexts; the KV‑Cache alone can require tens or hundreds of gigabytes for million‑token sequences.
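To see where that estimate comes from, here is a back-of-the-envelope calculation of dense KV-Cache size; the layer, head, and dimension values are illustrative assumptions for a ~9B-parameter model with grouped-query attention, not the published MiniCPM-SALA configuration.

```python
# Rough KV-Cache size for a dense-attention model (illustrative config, not SALA's).
def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Both K and V are cached for every layer, KV head, and token (bf16 = 2 bytes).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

print(f"{kv_cache_bytes(1_000_000) / 1e9:.0f} GB")  # ~147 GB for a single 1M-token sequence
```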

Existing remedies fall into two camps: sparse attention reduces computation by attending to only a subset of tokens but still stores the full dense KV-Cache, while linear attention lowers complexity to O(N) at the cost of some information loss.
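To make the complexity gap concrete, the sketch below contrasts the two families in plain PyTorch. It is a non-causal, single-head toy, not the Lightning Attention or InfLLM-v2 kernels used by SALA, and the ELU-based feature map is an assumption.

```python
# Minimal, non-causal, single-head contrast of quadratic vs. linear attention.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Materializes an (N, N) score matrix: O(N^2) time and memory in sequence length.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Kernel trick: with a positive feature map, attention can be rewritten so that
    # a (d, d) summary of keys/values is built once, giving O(N) cost overall.
    q, k = F.elu(q) + 1, F.elu(k) + 1                   # feature map choice is an assumption
    kv = k.transpose(-1, -2) @ v                        # (d, d) running summary
    z = k.sum(dim=-2, keepdim=True).transpose(-1, -2)   # (d, 1) normalizer
    return (q @ kv) / (q @ z + 1e-6)

q = k = v = torch.randn(1, 4096, 64)                    # batch 1, N = 4096 tokens, head dim 64
full_out, linear_out = softmax_attention(q, k, v), linear_attention(q, k, v)
```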

SALA (Sparse-Linear Attention) integrates both approaches: 75% of the layers use Lightning Attention (a linear-attention variant whose computation closely tracks full attention), and 25% employ InfLLM-v2 sparse attention, which selects fine-grained key/value blocks for each query. This mix preserves global information flow while keeping per-token cost low.
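The article only states the 75%/25% ratio; the layout below is a hypothetical illustration of how such an interleaved layer assignment might look for a 32-layer stack.

```python
# Hypothetical interleaving: one sparse (InfLLM-v2) layer per block of four,
# the rest linear (Lightning Attention). The real placement pattern is not specified.
N_LAYERS = 32
SPARSE_EVERY = 4  # 1 of every 4 layers is sparse -> 25% sparse, 75% linear

layer_types = ["sparse" if i % SPARSE_EVERY == SPARSE_EVERY - 1 else "linear"
               for i in range(N_LAYERS)]
assert layer_types.count("linear") / N_LAYERS == 0.75
```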

The architecture also adopts a hybrid position-encoding strategy (HyPE). Linear layers retain RoPE to stay compatible with the original full-attention weights, whereas sparse layers use NoPE (no position encoding), eliminating positional decay and improving long-range recall.
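Schematically, the HyPE rule amounts to a per-layer switch like the one below, where `apply_rope` stands in for a standard rotary-embedding routine (names are illustrative, not SALA's actual code).

```python
# Per-layer position-encoding switch under HyPE (schematic).
def position_encode(q, k, layer_type, positions, apply_rope):
    if layer_type == "linear":
        # Linear layers keep RoPE, staying close to the original full-attention weights.
        return apply_rope(q, positions), apply_rope(k, positions)
    # Sparse layers use NoPE: queries/keys are left unrotated, so attention scores
    # carry no positional decay at very long ranges.
    return q, k
```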

Training proceeds via a HALO-style conversion that transforms a pretrained full-attention Transformer into the mixed architecture instead of training from scratch. The pipeline comprises the following stages (a condensed configuration sketch follows the list):

HALO conversion: keep the first and last layers unchanged, and apply HALO layer selection to retain a subset of full-attention layers (later trained as sparse layers).

Stable pre-training: 1.3B tokens at a 512-token context length; only the newly introduced linear layers are unfrozen.

Continued pre-training: 314.6B tokens at a 4K context length on the MiniCPM-4.0 dataset, with sparse layers frozen and a learning rate of 7.5e-3.

Short-decay stage: 1T tokens at a 4K context length, with the learning rate decayed exponentially to 3.75e-4.

Long-decay stage: gradually expand the context window from 4K → 32K → 160K → 520K tokens, using 102.2B, 62.9B, and 50.6B tokens respectively, with learning rates stepping 3e-4 → 2e-4 → 1e-4 → 3.75e-5, and enable sparse attention for the longest windows.

SFT (supervised fine-tuning): high-quality reasoning data (code, math, knowledge, function calling) at 64K and 140K context lengths, with 204.5B and 213.3B tokens respectively, keeping sparse attention active.
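For reference, the reported schedule condenses into the configuration sketch below; the field names are illustrative, while the token counts, context lengths, and learning rates are the figures quoted above.

```python
# Condensed view of the reported conversion-and-training schedule (field names illustrative).
TRAINING_STAGES = [
    {"stage": "stable_pretrain",    "tokens": "1.3B",   "ctx": "512",
     "note": "only newly introduced linear layers unfrozen"},
    {"stage": "continued_pretrain", "tokens": "314.6B", "ctx": "4K", "lr": "7.5e-3",
     "note": "sparse layers frozen, MiniCPM-4.0 data"},
    {"stage": "short_decay",        "tokens": "1T",     "ctx": "4K", "lr": "decay to 3.75e-4"},
    {"stage": "long_decay",         "tokens": "102.2B + 62.9B + 50.6B",
     "ctx": "4K -> 32K -> 160K -> 520K", "lr": "3e-4 -> 2e-4 -> 1e-4 -> 3.75e-5",
     "note": "sparse attention enabled for the longest windows"},
    {"stage": "sft",                "tokens": "204.5B + 213.3B", "ctx": "64K / 140K",
     "note": "sparse attention kept active"},
]
```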

Benchmark results show that MiniCPM-SALA matches full-attention 9B models on knowledge, math, and code tasks while excelling on long-context benchmarks. On an NVIDIA A6000D (96 GB), SALA achieves a 3.5× speed-up over Qwen3-8B at 256K tokens (TTFT of 51.6 s vs 180.8 s) and avoids the out-of-memory failures that Qwen3-8B hits at 512K and 1M tokens. On a consumer-grade RTX 5090 (32 GB), SALA processes up to 1M tokens without running out of memory, whereas Qwen3-8B fails at 128K (unquantized) or 256K (INT4-quantized). The model also generalises to a 2048K (2M) token context without additional techniques such as YaRN.

In summary, by blending sparse and linear attention, preserving positional information where beneficial, and converting existing Transformers via HALO, SALA delivers a practical solution for efficient ultra‑long‑context LLM inference, positioning hybrid attention as a leading direction for 2026 and beyond.

Tags: long context, Sparse Attention, Linear Attention, LLM efficiency, Hybrid Position Encoding, SALA
Written by Machine Learning Algorithms & Natural Language Processing, a channel focused on frontier AI technologies and supporting AI researchers' progress.