LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction
The Latent‑Condensed Attention (LCA) method cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5×, and reduces decode latency by 1.8× at 128K‑token contexts, all while adding no extra parameters and preserving model quality across diverse LLMs.
Problem
Large language models (LLMs) encounter two scalability bottlenecks when processing long texts: KV‑cache memory grows linearly with token length, and the quadratic cost of full attention makes pre‑fill and decode slow and memory‑intensive.
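For scale, here is a back‑of‑the‑envelope calculation of the cache footprint; the hyperparameters are illustrative assumptions rather than any specific model's.

```python
# Rough KV-cache size at 128K tokens for an fp16 model.
# All hyperparameters below are illustrative assumptions, not a specific model.
tokens     = 128 * 1024   # context length
layers     = 32           # transformer layers
kv_heads   = 8            # key/value heads per layer
head_dim   = 128          # dimension per head
kv_per_tok = 2            # one key and one value vector per head per layer
bytes_fp16 = 2

cache_bytes = tokens * layers * kv_heads * head_dim * kv_per_tok * bytes_fp16
print(f"{cache_bytes / 2**30:.1f} GiB")  # ~16 GiB for a single sequence
```

Under these assumptions the cache alone occupies on the order of the model weights, which is why cutting it by 90% translates directly into longer feasible contexts.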
Prior approaches
Multi‑head latent attention (MLA): projects tokens into a low‑dimensional latent space, reducing KV‑cache size but still incurring the O(N²) cost of full attention.
Sparse attention: skips attention blocks to lower compute, but requires the full Q/K/V matrices and cannot be combined with MLA without decompress‑then‑compress overhead.
Latent‑Condensed Attention (LCA)
LCA redesigns attention directly in the latent space of MLA. It consists of three deterministic steps; a minimal PyTorch sketch follows the list.
Intelligent grouping: tokens are divided into fixed groups of 16; the most recent 1,024 tokens are kept uncompressed.
Semantic compression: each group is condensed into a single weighted‑pooled vector. The weight for token i is the softmax‑normalized attention score between the current query and the token’s latent vector z_i, yielding a representative vector c = Σ_i α_i z_i.
Position anchoring: the token with the highest attention score in each group is selected as the positional anchor, providing a compatible position key vector p for the group.
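The sketch below shows the three steps for a single query and one head. Function and argument names (lca_condense, group_size, keep_recent) are illustrative, and the real implementation operates on the MLA latent cache inside a fused Triton kernel.

```python
import torch
import torch.nn.functional as F

def lca_condense(q, z, pos, group_size=16, keep_recent=1024):
    """Sketch of LCA's grouping / compression / anchoring for one query.
    q:   (d,)   current query projected into the latent space
    z:   (N, d) per-token latent vectors (the MLA KV latents)
    pos: (N,)   integer token positions, used for positional anchoring
    Returns condensed group vectors, their anchor positions, and the
    uncompressed recent tail. Tokens that do not fill a complete group
    are ignored here for brevity."""
    N, d = z.shape
    tail_start = max(0, N - keep_recent)
    head, tail = z[:tail_start], z[tail_start:]            # old vs. recent tokens

    # Step 1: intelligent grouping -- split older tokens into fixed groups of 16.
    n_groups = head.shape[0] // group_size
    grouped = head[: n_groups * group_size].view(n_groups, group_size, d)
    grouped_pos = pos[: n_groups * group_size].view(n_groups, group_size)

    # Step 2: semantic compression -- softmax-normalized query/latent scores
    # weight each token, and every group is pooled into c = sum_i alpha_i z_i.
    scores = grouped @ q                                    # (n_groups, group_size)
    alpha = F.softmax(scores, dim=-1)
    condensed = (alpha.unsqueeze(-1) * grouped).sum(dim=1)  # (n_groups, d)

    # Step 3: position anchoring -- the highest-scoring token in each group
    # donates its position, giving the group a compatible position key.
    anchor_idx = scores.argmax(dim=-1)                      # (n_groups,)
    anchor_pos = grouped_pos.gather(1, anchor_idx.unsqueeze(1)).squeeze(1)

    return condensed, anchor_pos, tail
```

Attention is then computed over the condensed vectors, their anchors, and the uncompressed tail, so the effective cache holds roughly N/16 + 1,024 entries instead of N.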
LCA reuses the original projection matrices, introduces zero additional parameters, and can be dropped in as a replacement for the MLA (or GQA) layer without architectural changes.
Adapting a pretrained model requires only 1,000 fine‑tuning steps.
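Integration can be as simple as swapping the attention submodules in place; the pattern below is a generic sketch, with `make_lca` standing in for a hypothetical factory that wraps an existing MLA/GQA module and reuses its projection weights.

```python
import torch.nn as nn

def swap_in_lca(model: nn.Module, make_lca) -> nn.Module:
    """Replace every attention submodule with an LCA-style drop-in.
    `make_lca(old_module)` is a hypothetical factory that builds the
    replacement around the original projection matrices, so the swap
    introduces no new parameters."""
    for parent in model.modules():
        for name, child in list(parent.named_children()):
            if "attn" in name.lower() or "attention" in name.lower():
                setattr(parent, name, make_lca(child))
    return model
```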
Theoretical guarantee
The authors prove that the approximation error of LCA has a uniform upper bound that does not depend on context length, i.e., ‖ΔK‖ ≤ ε and ‖ΔV‖ ≤ ε for all tokens.
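Spelled out with assumed notation (not the paper's own), the claim is a per‑token bound that stays constant as the context grows:

```latex
% k_i, v_i are the exact keys/values; \hat{k}_i, \hat{v}_i the ones
% reconstructed from the condensed latents. Notation is assumed here.
\max_{1 \le i \le N} \lVert k_i - \hat{k}_i \rVert \le \varepsilon,
\qquad
\max_{1 \le i \le N} \lVert v_i - \hat{v}_i \rVert \le \varepsilon,
\qquad \text{with } \varepsilon \text{ independent of } N.
```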
Implementation
A Triton kernel implementation achieves a 24.4× speedup over the baseline PyTorch kernel at 64K context length.
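Kernel‑level speedups like this are typically measured with CUDA events; a small timing harness along these lines (generic, not the paper's benchmark code) makes the comparison reproducible.

```python
import torch

def time_kernel(fn, *args, warmup=10, iters=50):
    """Average milliseconds per call for a CUDA kernel, measured with
    CUDA events after a warmup phase. Pass any two attention
    implementations to compare them on identical inputs."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```

The reported speedup would then be the ratio of the baseline PyTorch kernel's time to the Triton kernel's time on identical 64K‑token inputs.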
Experimental results
At 128K context, LCA yields 2.5× pre‑fill acceleration, 90% KV‑cache reduction, and 1.8× lower decode latency compared with the original MLA.
LongBench‑E and RULER benchmarks show performance parity with MLA; on RULER at 128K the result is slightly better.
Short‑context tasks (MMLU, GSM8K, MBPP) exhibit virtually unchanged accuracy.
MiniCPM3‑4B: 2.2× pre‑fill speedup and 93% KV‑cache reduction.
Adaptation to Grouped Query Attention (GQA) on DeepSeek‑R1‑Distill‑Qwen‑7B: 3.25× inference acceleration and 93% cache saving.
Comparison with other efficient attention methods
Parameter overhead: DSA/KDA need extra gating/routing modules; LCA adds zero parameters and fully reuses the projection matrices.
Training dependency: DSA/KDA require pre‑training or large‑scale continual training; LCA needs only 1,000 fine‑tuning steps.
Integration: DSA/KDA require architecture or training‑pipeline changes; LCA is plug‑and‑play and directly replaces MLA/GQA.
Practical impact
Reduces deployment cost by eliminating extra modules.
Lowers hardware requirements: the 90% KV‑cache reduction allows several‑fold longer contexts on the same GPU.
Improves user‑facing response time with 2.5× pre‑fill speedup.
Maintains model capability across long‑ and short‑context benchmarks.
Resources
Paper: “Latent‑Condensed Transformer for Efficient Long Context Modeling”, arXiv:2604.12452.
Code repository: