From Infinite Context to Linear Regression: MIT’s Attention Matching Accelerates KV Compression 100×

MIT’s new “Fast KV Compaction via Attention Matching” paper reformulates the costly KV‑cache compression problem as a series of closed‑form linear‑regression tasks, eliminating gradient descent, cutting compression time by two orders of magnitude and achieving up to 200× overall reduction while preserving accuracy on long‑context benchmarks.


Background and Motivation

Modern large language models store a key-value (KV) cache for every token during inference. In long-context scenarios such as multi-turn dialogue or code generation, this cache consumes a prohibitive amount of GPU memory and becomes the primary bottleneck. Traditional compression methods such as token dropping or merging degrade performance sharply when high compression ratios are required.

The recent Cartridges work demonstrated that training a highly compressed KV cache in latent space can retain most of the original context quality, but its end-to-end gradient-based optimization is extremely expensive, often taking several GPU-hours per compression.

Key Insight: Attention Matching as Linear Regression

The MIT team observed that the most time-consuming step, feature fitting, can be transformed into a classic linear-regression problem. By returning to the fundamentals of the attention mechanism, they construct compressed KV pairs that closely reproduce the original attention outputs on a reference query set.

The matching problem is decomposed into multiple efficiently solvable closed‑form sub‑tasks, completely avoiding slow gradient descent.

Theoretical Reconstruction

For a token sequence of length L with original keys K\in\mathbb{R}^{L\times d} and values V\in\mathbb{R}^{L\times d}, the goal is to find compressed pairs \tilde K,\tilde V\in\mathbb{R}^{m\times d} with m\ll L such that, for any future query q, the attention output matches the original:

\text{Attention}(q, K, V) \approx \text{Attention}(q, \tilde K, \tilde V)
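For concreteness, here is a minimal single-head NumPy sketch of this objective (our own illustration, not the authors' code; the function names and the squared-error criterion are assumptions):

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention: softmax(q K^T / sqrt(d)) V."""
    s = K @ q / np.sqrt(K.shape[1])   # (L,) attention logits
    w = np.exp(s - s.max())           # stabilized softmax numerator
    return (w / w.sum()) @ V          # (d,) attention output

def matching_error(Q_ref, K, V, K_c, V_c):
    """Mean squared error between original and compressed attention
    outputs over a reference query set -- the quantity to minimize."""
    return float(np.mean([np.sum((attention(q, K, V)
                                  - attention(q, K_c, V_c)) ** 2)
                          for q in Q_ref]))
```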

The authors prove a mixture identity: the output of attention over concatenated KV blocks is a weighted mixture of the per-block attention outputs, with weights determined by each block's attention mass, i.e. its share of the total softmax normalizer. Matching both the local attention outputs and the attention masses therefore guarantees high-fidelity reconstruction.
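Written out for a cache split into two blocks (our notation, following the definitions above), the identity for softmax attention reads:

\text{Attention}(q, [K_1; K_2], [V_1; V_2]) = \alpha_1(q)\,\text{Attention}(q, K_1, V_1) + \alpha_2(q)\,\text{Attention}(q, K_2, V_2)

where \alpha_i(q) = m_i(q) / \big(m_1(q) + m_2(q)\big) and m_i(q) = \sum_{k\in K_i} e^{q^\top k/\sqrt{d}} is block i's attention mass.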

Three‑Step Gradient‑Free Algorithm

1. Reference Query Construction: generate a set of query vectors, using the Repeat-prefill and Self-study mechanisms to broaden the query distribution. An on-policy strategy additionally extracts queries from already-compressed layers to mitigate distribution shift.

2. Key Selection & Bias Fitting: instead of iterative optimization, select a representative subset of the original keys. The authors employ a greedy Orthogonal Matching Pursuit (OMP) procedure that picks the keys minimizing the residual of the attention-mass fit, followed by a non-negative least-squares (NNLS) solve that assigns each retained key a scalar bias. Added to the attention logit, this bias acts as a multiplicative weight on the attention score, allowing the retained keys to absorb the summed attention mass of the discarded tokens with negligible memory overhead.

3. Value Fitting: with the compressed keys and biases fixed, solve a standard ordinary least-squares (OLS) problem for the compressed value matrix, completing the reconstruction without any gradient steps.

The entire pipeline relies solely on linear‑algebra operations, making it highly parallelizable on GPUs.
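Below is a minimal end-to-end sketch of the three steps for a single head, using only NumPy and SciPy solvers (our own simplification; the residual rule, the exact bias parameterization, and all names are assumptions rather than the authors' implementation):

```python
import numpy as np
from scipy.optimize import nnls

def compress_kv(Q_ref, K, V, m_keep):
    """Gradient-free KV compression for one head: greedy key selection,
    NNLS bias fitting, and OLS value fitting -- all closed-form solves."""
    n_q, d = Q_ref.shape
    logits = Q_ref @ K.T / np.sqrt(d)             # (n_q, L)
    E = np.exp(logits - logits.max())             # shared shift for stability
    mass = E.sum(axis=1)                          # per-query attention mass
    Y = (E / mass[:, None]) @ V                   # target attention outputs

    # Step 2a: greedy OMP-style selection of keys against the mass fit.
    selected, resid, c = [], mass.copy(), np.zeros(0)
    for _ in range(m_keep):
        scores = np.abs(E.T @ resid)              # correlation with residual
        scores[selected] = -np.inf                # never re-pick a key
        selected.append(int(np.argmax(scores)))
        # Step 2b: refit non-negative scales c = exp(bias) by NNLS.
        c, _ = nnls(E[:, selected], mass)
        resid = mass - E[:, selected] @ c

    K_c = K[selected]                             # compressed keys (m_keep, d)
    bias = np.log(np.maximum(c, 1e-12))           # additive logit biases

    # Step 3: fit compressed values by ordinary least squares.
    E_c = E[:, selected] * np.exp(bias)           # biased exp-logits
    A_c = E_c / E_c.sum(axis=1, keepdims=True)    # compressed attn weights
    V_c, *_ = np.linalg.lstsq(A_c, Y, rcond=None)
    return K_c, bias, V_c
```

Note that every step is a closed-form solve (an argmax, an NNLS fit, an OLS fit); nothing in the loop requires backpropagation.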

Engineering Optimizations

Standard OMP re-solves the NNLS problem after every single key selection, which becomes a bottleneck for long contexts. The authors introduce two tricks (sketched below):

Top-k batch selection: choose multiple keys per iteration instead of one.

Delayed refitting: postpone NNLS updates until a batch of keys is selected, reducing the number of solves.

These changes shrink key-selection time from 565 s to 104 s in a 60k-token scenario.
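A sketch of how the selection loop from the earlier example might change under these two tricks (the batch size and the unordered top-k rule are assumptions):

```python
import numpy as np
from scipy.optimize import nnls

def select_keys_batched(E, mass, m_keep, batch=16):
    """Variant of the greedy selection loop: pick `batch` keys per
    iteration (top-k batch selection) and refit NNLS only once per
    batch (delayed refitting). Assumes m_keep <= number of keys."""
    selected, resid, c = [], mass.copy(), None
    while len(selected) < m_keep:
        scores = np.abs(E.T @ resid)
        scores[selected] = -np.inf                # exclude picked keys
        k = min(batch, m_keep - len(selected))
        top = np.argpartition(scores, -k)[-k:]    # k best keys, unordered
        selected.extend(int(j) for j in top)
        c, _ = nnls(E[:, selected], mass)         # one solve per batch
        resid = mass - E[:, selected] @ c
    return selected, c
```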

Non‑Uniform Head Budgeting

Different attention heads exhibit vastly different sensitivity to KV capacity. By pre-computing a sensitivity curve for each head (Figure 3 of the paper), the authors allocate the limited KV budget preferentially to the most sensitive heads using a greedy exchange algorithm. Ablation studies show that discarding this non-uniform allocation causes a sharp cliff-drop in reconstruction quality.
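As an illustration, here is a marginal-gain greedy allocator over precomputed sensitivity curves (a simplification: the paper's greedy exchange algorithm can also swap budget between heads, which matters when the curves are not convex; all names here are ours):

```python
import heapq

def allocate_head_budgets(err_curves, total_budget):
    """Greedily hand out cache slots, one at a time, to the head with
    the largest marginal error reduction. err_curves[h][b] is head h's
    precomputed reconstruction error at budget b (sensitivity curve)."""
    budgets = [0] * len(err_curves)
    # Max-heap (negated gains) of the marginal gain of one more slot.
    heap = [(-(curve[0] - curve[1]), h)
            for h, curve in enumerate(err_curves) if len(curve) > 1]
    heapq.heapify(heap)
    for _ in range(total_budget):
        if not heap:
            break                                  # all curves exhausted
        _, h = heapq.heappop(heap)
        budgets[h] += 1
        b = budgets[h]
        if b + 1 < len(err_curves[h]):             # push next marginal gain
            heapq.heappush(heap, (-(err_curves[h][b] - err_curves[h][b + 1]), h))
    return budgets
```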

Experimental Evaluation

Benchmarks:

QuALITY and LongHealth for reading comprehension and medical QA.

AIME 2025 synthetic tasks for online continuous compression.

Key results:

At 50× compression, the Attention Matching method reaches the Pareto frontier of speed versus accuracy, matching Cartridges’ performance while being two orders of magnitude faster.

Combining a summarization step with AM‑OMP yields up to 200× total compression with negligible loss compared to using summarization alone.

Online compression in the AIME scenario repeatedly compresses 50% of the global cache (up to six successive rounds) while preserving inference accuracy comparable to an uncompressed baseline.

Conclusion

The proposed Attention Matching paradigm converts KV compression into a clear linear‑algebra workflow, eliminating gradient‑based optimization and delivering dramatic speedups. Precise local‑behavior matching, bias‑augmented key selection, and non‑uniform head budgeting together provide a robust foundation for memory‑constrained long‑context inference and continuous online compression.
