Training Transformers to Be Compression‑Friendly: A New Memory‑Discard Paradigm

The article analyzes the KV‑Cache memory bottleneck of long‑context Transformers, introduces KV‑CAT (KV‑Compression Aware Training), an approach that simulates cache compression during training, and reports experiments in which base abilities stay unchanged while post‑training compression, retrieval, and long‑text QA performance improve substantially.

Machine Heart

As AI vendors push context windows to millions of tokens, the KV‑Cache required by Transformers grows linearly with sequence length. It can consume tens to hundreds of gigabytes of GPU memory and slow inference, turning the context‑window race into a memory war.
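The linear growth can be made concrete with a back‑of‑the‑envelope calculation. The sketch below assumes the usual layout that caches one key vector and one value vector per layer, per KV head, per token. The configuration (32 layers, 8 KV heads, head dimension 128, 16‑bit values) is hypothetical and is not taken from any model discussed in the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:7.1f} GiB")
```

Under these assumed settings the cache costs 128 KiB per token, so a single million‑token sequence needs about 128 GiB before any model weights are counted.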

The authors highlight that two models producing identical outputs can differ drastically in how compressible their KV‑Cache is. Using a simple letter‑frequency task, they contrast two implementations. A naïve one averages token representations, so its answer changes as soon as cache entries are compressed away. A structured one records the sequence length, which lets the compressed cache be recalibrated so that the output stays error‑free.
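The summary does not spell out either construction, so the following is only a toy illustration of the contrast, not the authors' implementation. It assumes compression means evicting cache slots, and the "structured" layout (each slot stores a running match count together with the sequence length so far) is a hypothetical stand‑in for a representation that records length.

```python
def naive_frequency(slots, letter):
    """Average a 0/1 indicator over whatever slots are still cached."""
    return sum(1 for ch in slots if ch == letter) / len(slots)


def build_structured_cache(text, letter):
    """Hypothetical layout: slot i stores (matches so far, length so far)."""
    cache, count = [], 0
    for length, ch in enumerate(text, start=1):
        count += ch == letter  # True counts as 1
        cache.append((count, length))
    return cache


def structured_frequency(cache):
    """Read the answer from the most recent surviving slot."""
    count, length = cache[-1]
    return count / length


text, letter = "abracadabra", "a"
keep = [0, 2, 5, 6, 9, 10]  # toy eviction pattern; the last slot survives

full = naive_frequency(text, letter)
naive_compressed = naive_frequency([text[i] for i in keep], letter)

structured = build_structured_cache(text, letter)
structured_compressed = structured_frequency([structured[i] for i in keep])

print(f"full cache:            {full:.4f}")
print(f"naive, compressed:     {naive_compressed:.4f}")
print(f"structured, compressed:{structured_compressed:.4f}")
```

Both versions agree on the full cache, but only the one that carries the length with it gives the same answer after half the slots are dropped. In this toy layout that guarantee depends on the most recent slot being retained.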

To encourage the latter, the joint Oxford‑NVIDIA team proposes KV‑CAT (KV‑Compression Aware Training). The method adds lightweight router modules that, at each training step, decide which KV slots to keep, with a target retention of about 50%. Each batch goes through two forward passes, one with the full cache and one with the compressed cache, and training combines three loss terms: a self‑distillation loss that aligns compressed‑cache outputs with full‑cache outputs, an anchoring loss that preserves next‑token prediction ability, and a router loss that enforces the 50% retention target.
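A rough sketch of how three such terms could be combined for a single token position is shown below. The summary names the terms but not their exact form, so every concrete choice here is an assumption: KL divergence for self‑distillation, cross‑entropy on the full‑cache pass for anchoring, a squared penalty on the mean keep probability for the router, and equal weights. The function name `kv_cat_style_loss` and its parameters are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes every q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kv_cat_style_loss(full_logits, compressed_logits, target, keep_probs,
                      target_retention=0.5,
                      w_distill=1.0, w_anchor=1.0, w_router=1.0):
    """Combine the three terms for one position (all forms are assumptions)."""
    p_full = softmax(full_logits)        # pass 1: full cache
    p_comp = softmax(compressed_logits)  # pass 2: router-compressed cache

    # Self-distillation: pull compressed-cache outputs toward full-cache ones.
    distill = kl_divergence(p_full, p_comp)
    # Anchoring: ordinary next-token cross-entropy (assumed on the full pass).
    anchor = -math.log(p_full[target])
    # Router: penalize deviation of mean keep probability from the target.
    retention = sum(keep_probs) / len(keep_probs)
    router = (retention - target_retention) ** 2

    total = w_distill * distill + w_anchor * anchor + w_router * router
    return total, {"distill": distill, "anchor": anchor, "router": router}


total, parts = kv_cat_style_loss(
    full_logits=[2.0, 0.5, -1.0],
    compressed_logits=[1.5, 0.8, -0.5],
    target=0,
    keep_probs=[0.9, 0.2, 0.7, 0.4],
)
print(total, parts)
```

In a real training loop the full‑cache distribution would typically be treated as a fixed teacher for the distillation term, and the relative weights would need tuning; neither detail is specified in the summary.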

After training, the router is disabled; the model remains a standard Transformer but its internal representations are inherently compression‑friendly, allowing any downstream KV‑compression technique to work more effectively.
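For readers unfamiliar with what a downstream KV‑compression technique looks like, here is a minimal sketch of one common family: score each cached position and evict the lowest‑scoring ones until a budget is met. It is a generic illustration, not the specific compressor used in the article's experiments. The function name `evict_by_attention` and the idea of passing in a precomputed `attention_mass` list are assumptions made for the example.

```python
def evict_by_attention(keys, values, attention_mass, budget):
    """Keep the `budget` slots with the largest accumulated attention mass."""
    ranked = sorted(range(len(keys)),
                    key=lambda i: attention_mass[i], reverse=True)
    kept = sorted(ranked[:budget])  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache of six slots with made-up attention statistics.
keys = [f"k{i}" for i in range(6)]
values = [f"v{i}" for i in range(6)]
attention_mass = [0.30, 0.05, 0.20, 0.02, 0.03, 0.40]

small_keys, small_values, kept = evict_by_attention(
    keys, values, attention_mass, budget=3)
print(kept, small_keys)
```

A policy like this runs purely at inference time, which is why a model whose representations tolerate missing slots should benefit without any change to the compressor itself.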

Experiments on Qwen2.5 models (0.5B and 1.5B parameters) show that baseline capabilities are unchanged (±0.5% on six language‑understanding benchmarks). Under equal compression budgets, KV‑CAT reduces perplexity gaps by up to 3.21× and cuts required optimization steps by up to 5×. In a long‑text retrieval test, accuracy rises from 28% to 47% (0.5B) and from 49% to 67% (1.5B). On LongBench v2, compressed KV‑CAT models improve average accuracy by up to 39% across seven long‑document QA tasks.

The approach adds extra pre‑training cost and router complexity, and current results are limited to sub‑2B‑parameter models; scaling to hundred‑billion‑parameter LLMs remains an open question.

Nevertheless, training‑time preparation for inference, by making models naturally compressible, is presented as a valuable direction for future large‑model engineering.
