Do Transformers Need Three Projections? Sharing K‑V Cuts KV Cache by 50%

A systematic ICML 2026 study shows that sharing the K and V projection matrices in Transformers reduces KV cache size by half while incurring less than 5% perplexity degradation, offering a simple, retrain‑once solution for long‑context and edge inference.
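
The 50% figure follows from simple bookkeeping: a standard decoder caches two tensors per layer (keys and values), whereas a shared K‑V projection leaves only one to store. The sketch below works through that arithmetic. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 heads, head size 128, fp16, 4096 tokens) are illustrative assumptions, not values taken from the study.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, shared_kv=False):
    """Estimate KV cache size in bytes.

    Assumption: with a shared K-V projection the key and value tensors
    are identical, so only one tensor per layer needs to be cached.
    """
    tensors_per_layer = 1 if shared_kv else 2
    elems_per_tensor = batch_size * num_kv_heads * seq_len * head_dim
    return tensors_per_layer * num_layers * elems_per_tensor * bytes_per_elem


# Hypothetical model configuration, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

baseline = kv_cache_bytes(**cfg)                # separate K and V
shared = kv_cache_bytes(**cfg, shared_kv=True)  # shared K-V projection

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"shared:   {shared / 2**30:.1f} GiB")
print(f"saving:   {1 - shared / baseline:.0%}")
```

Under these assumptions the saving is the same at every sequence length, because both cache sizes grow linearly with the number of tokens.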

KV cache, QKV sharing, Transformer

