Machine Heart
Jun 2, 2026 · Artificial Intelligence
Training Transformers to Be Compression‑Friendly: A New Memory‑Discard Paradigm
The article analyzes the KV‑cache memory bottleneck of long‑context Transformers, introduces KV‑CAT (KV‑Compression Aware Training), an approach that simulates cache compression during pre‑training, and presents experiments in which base capabilities are unchanged while performance under post‑training compression, on retrieval, and on long‑text QA improves dramatically.
KV cache · KV-CAT · Transformer
