LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction
The Latent‑Condensed Attention (LCA) method cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5×, and reduces decode latency by 1.8× at 128K‑token contexts, all while adding no extra parameters and preserving model quality across diverse LLMs.
Problem
Large language models (LLMs) encounter two scalability bottlenecks when processing long texts: KV‑cache memory grows linearly with token length, and the quadratic cost of full attention makes pre‑fill and decode slow and memory‑intensive.
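For scale, here is a back‑of‑the‑envelope calculation of the cache footprint; the hyperparameters are illustrative assumptions rather than any specific model's.

```python
# Rough KV-cache size at 128K tokens for an fp16 model.
# All hyperparameters below are illustrative assumptions, not a specific model.
tokens     = 128 * 1024   # context length
layers     = 32           # transformer layers
kv_heads   = 8            # key/value heads per layer
head_dim   = 128          # dimension per head
kv_per_tok = 2            # one key and one value vector per head per layer
bytes_fp16 = 2

cache_bytes = tokens * layers * kv_heads * head_dim * kv_per_tok * bytes_fp16
print(f"{cache_bytes / 2**30:.1f} GiB")  # ~16 GiB for a single sequence
```

Under these assumptions the cache alone occupies on the order of the model weights, which is why cutting it by 90% translates directly into longer feasible contexts.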
Prior approaches
Multi‑head latent attention (MLA): projects tokens into a low‑dimensional latent space, reducing KV‑cache size but still incurring the O(N²) cost of full attention.
Sparse attention: skips attention blocks to lower compute, but requires the full Q/K/V matrices and cannot be combined with MLA without decompress‑then‑compress overhead.
Latent‑Condensed Attention (LCA)
LCA redesigns attention directly in the latent space of MLA. It consists of three deterministic steps; a minimal PyTorch sketch follows the list.
Intelligent grouping: tokens are divided into fixed groups of 16; the most recent 1,024 tokens are kept uncompressed.
Semantic compression: each group is condensed into a single weighted‑pooled vector. The weight for token i is the softmax‑normalized attention score between the current query and the token’s latent vector z_i, yielding a representative vector c = Σ_i α_i z_i.
Position anchoring: the token with the highest attention score in each group is selected as the positional anchor, providing a compatible position key vector p for the group.
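The sketch below shows the three steps for a single query and one head. Function and argument names (lca_condense, group_size, keep_recent) are illustrative, and the real implementation operates on the MLA latent cache inside a fused Triton kernel.

```python
import torch
import torch.nn.functional as F

def lca_condense(q, z, pos, group_size=16, keep_recent=1024):
    """Sketch of LCA's grouping / compression / anchoring for one query.
    q:   (d,)   current query projected into the latent space
    z:   (N, d) per-token latent vectors (the MLA KV latents)
    pos: (N,)   integer token positions, used for positional anchoring
    Returns condensed group vectors, their anchor positions, and the
    uncompressed recent tail. Tokens that do not fill a complete group
    are ignored here for brevity."""
    N, d = z.shape
    tail_start = max(0, N - keep_recent)
    head, tail = z[:tail_start], z[tail_start:]            # old vs. recent tokens

    # Step 1: intelligent grouping -- split older tokens into fixed groups of 16.
    n_groups = head.shape[0] // group_size
    grouped = head[: n_groups * group_size].view(n_groups, group_size, d)
    grouped_pos = pos[: n_groups * group_size].view(n_groups, group_size)

    # Step 2: semantic compression -- softmax-normalized query/latent scores
    # weight each token, and every group is pooled into c = sum_i alpha_i z_i.
    scores = grouped @ q                                    # (n_groups, group_size)
    alpha = F.softmax(scores, dim=-1)
    condensed = (alpha.unsqueeze(-1) * grouped).sum(dim=1)  # (n_groups, d)

    # Step 3: position anchoring -- the highest-scoring token in each group
    # donates its position, giving the group a compatible position key.
    anchor_idx = scores.argmax(dim=-1)                      # (n_groups,)
    anchor_pos = grouped_pos.gather(1, anchor_idx.unsqueeze(1)).squeeze(1)

    return condensed, anchor_pos, tail
```

Attention is then computed over the condensed vectors, their anchors, and the uncompressed tail, so the effective cache holds roughly N/16 + 1,024 entries instead of N.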
LCA reuses the original projection matrices, introduces zero additional parameters, and can be dropped in as a replacement for the MLA (or GQA) layer without architectural changes.
Adapting a pretrained model requires only 1,000 fine‑tuning steps.
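Integration can be as simple as swapping the attention submodules in place; the pattern below is a generic sketch, with `make_lca` standing in for a hypothetical factory that wraps an existing MLA/GQA module and reuses its projection weights.

```python
import torch.nn as nn

def swap_in_lca(model: nn.Module, make_lca) -> nn.Module:
    """Replace every attention submodule with an LCA-style drop-in.
    `make_lca(old_module)` is a hypothetical factory that builds the
    replacement around the original projection matrices, so the swap
    introduces no new parameters."""
    for parent in model.modules():
        for name, child in list(parent.named_children()):
            if "attn" in name.lower() or "attention" in name.lower():
                setattr(parent, name, make_lca(child))
    return model
```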
Theoretical guarantee
The authors prove that the approximation error of LCA has a uniform upper bound that does not depend on context length, i.e., ‖ΔK‖ ≤ ε and ‖ΔV‖ ≤ ε for all tokens.
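Spelled out with assumed notation (not the paper's own), the claim is a per‑token bound that stays constant as the context grows:

```latex
% k_i, v_i are the exact keys/values; \hat{k}_i, \hat{v}_i the ones
% reconstructed from the condensed latents. Notation is assumed here.
\max_{1 \le i \le N} \lVert k_i - \hat{k}_i \rVert \le \varepsilon,
\qquad
\max_{1 \le i \le N} \lVert v_i - \hat{v}_i \rVert \le \varepsilon,
\qquad \text{with } \varepsilon \text{ independent of } N.
```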
Implementation
A Triton kernel implementation achieves a 24.4× speedup over the baseline PyTorch kernel at 64K context length.
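Kernel‑level speedups like this are typically measured with CUDA events; a small timing harness along these lines (generic, not the paper's benchmark code) makes the comparison reproducible.

```python
import torch

def time_kernel(fn, *args, warmup=10, iters=50):
    """Average milliseconds per call for a CUDA kernel, measured with
    CUDA events after a warmup phase. Pass any two attention
    implementations to compare them on identical inputs."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```

The reported speedup would then be the ratio of the baseline PyTorch kernel's time to the Triton kernel's time on identical 64K‑token inputs.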
Experimental results
At 128K context, LCA yields 2.5× pre‑fill acceleration, 90% KV‑cache reduction, and 1.8× lower decode latency compared with the original MLA.
LongBench‑E and RULER benchmarks show performance parity with MLA; on RULER at 128K the result is slightly better.
Short‑context tasks (MMLU, GSM8K, MBPP) exhibit virtually unchanged accuracy.
MiniCPM3‑4B: 2.2× pre‑fill speedup and 93% KV‑cache reduction.
Adaptation to Grouped Query Attention (GQA) on DeepSeek‑R1‑Distill‑Qwen‑7B: 3.25× inference acceleration and 93% cache saving.
Comparison with other efficient attention methods
Parameter overhead: DSA/KDA need extra gating/routing modules; LCA adds zero parameters and fully reuses the projection matrices.
Training dependency: DSA/KDA require pre‑training or large‑scale continual training; LCA needs only 1,000 fine‑tuning steps.
Integration: DSA/KDA require architecture or training‑pipeline changes; LCA is plug‑and‑play and directly replaces MLA/GQA.
Practical impact
Reduces deployment cost by eliminating extra modules.
Lowers hardware requirements: the 90% KV‑cache reduction allows several‑fold longer contexts on the same GPU.
Improves user‑facing response time with 2.5× pre‑fill speedup.
Maintains model capability across long‑ and short‑context benchmarks.
Resources
Paper: “Latent‑Condensed Transformer for Efficient Long Context Modeling”, arXiv:2604.12452.
Code repository: