Do Transformers Need Three Projections? Sharing K‑V Cuts KV Cache by 50%
A systematic ICML 2026 study shows that sharing the K and V projection matrices in Transformers reduces KV cache size by half while incurring less than 5% perplexity degradation, offering a simple, retrain‑once solution for long‑context and edge inference.
Background
Standard self‑attention uses three independent projection matrices Q, K, V. During autoregressive inference each token’s K and V must be cached, so the KV cache grows linearly with context length and dominates memory usage.
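The linear growth described above follows from simple arithmetic, sketched here. The function name `kv_cache_bytes` and the configuration values (20 layers, 16 KV heads of dimension 64, 2-byte elements) are hypothetical and not taken from the paper; they were chosen only because they produce figures of the same order as those reported in this article.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, shared_kv=False):
    """Estimate KV-cache size in bytes for a single sequence.

    With separate K and V, two tensors are stored per token per layer.
    If K and V come from one shared projection, only one is stored.
    """
    tensors_per_token = 1 if shared_kv else 2
    return (seq_len * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * tensors_per_token)


# Hypothetical configuration (an assumption, not the paper's model spec).
config = dict(num_layers=20, num_kv_heads=16, head_dim=64, bytes_per_elem=2)

for tokens in (32_000, 128_000):
    full = kv_cache_bytes(tokens, **config)
    shared = kv_cache_bytes(tokens, shared_kv=True, **config)
    print(f"{tokens:>7} tokens: {full / 1e9:.2f} GB separate, "
          f"{shared / 1e9:.2f} GB shared")
```

Under these assumptions the cache scales in direct proportion to the number of tokens, and storing one tensor instead of two halves it at every context length.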
Projection‑sharing variants
The paper defines three sharing configurations:
Q=K‑V: Q and K share a matrix, V remains separate.
Q‑K=V: K and V share a matrix, Q stays independent (primary scheme).
Q=K=V: all three share a single matrix.
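To make the primary Q‑K=V scheme concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the class name `SharedKVHead`, the weight names `w_q` and `w_kv`, and the random initialization are all hypothetical. The point it shows is that one projected vector per token is cached and then used both as key and as value.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedKVHead:
    """One causal attention head where K and V share a projection."""

    def __init__(self, d_model, d_head, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                    for _ in range(d_head)]

        self.w_q = init()    # query projection stays independent
        self.w_kv = init()   # single matrix produces keys AND values
        self.cache = []      # one vector per past token, not two
        self.scale = 1.0 / math.sqrt(d_head)

    def step(self, x):
        """Process one new token embedding and return the head output."""
        q = matvec(self.w_q, x)
        self.cache.append(matvec(self.w_kv, x))
        # The cached vectors act as keys when scoring ...
        scores = [self.scale * sum(a * b for a, b in zip(q, kv))
                  for kv in self.cache]
        weights = softmax(scores)
        # ... and as values when mixing.
        return [sum(w * kv[j] for w, kv in zip(weights, self.cache))
                for j in range(len(q))]


head = SharedKVHead(d_model=8, d_head=4)
rng = random.Random(1)
outputs = [head.step([rng.gauss(0.0, 1.0) for _ in range(8)]) for _ in range(3)]
print(len(head.cache), "cached vectors for 3 tokens")
```

A standard head would append two vectors per step (one key, one value); in this sketch the cache holds one, which is where the halving comes from.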
Experimental setup
Experiments cover synthetic, vision, and autoregressive language tasks. Language models of 300 M and 1.2 B parameters are trained on the SlimPajama corpus (~10 B tokens). All non‑attention components are kept identical across variants. Evaluation metrics include perplexity (PPL), KV‑cache size, peak memory, decoding throughput, and five‑shot downstream accuracy.
Results – 300 M language model
Baseline QKV PPL = 5.11. KV cache for 32 K context ≈ 2.62 GB, for 128 K ≈ 10.49 GB.
Q‑K=V PPL = 5.27 (+3.1 % relative). KV cache halved to 1.31 GB (32 K) and 5.24 GB (128 K). Peak memory reduced 6.5 %–6.9 %; decoding throughput increased 4.4 %–5.3 %.
Q=K‑V PPL = 5.36 (+4.9 % relative), with no cache reduction, because both K and V must still be stored.
Q=K=V PPL = 6.41 (+25.4 %), indicating severe quality loss.
Head‑sharing methods reduce the cache on their own, and combining them with KV sharing reduces it further: GQA‑4 alone gives a 75 % reduction and MQA 93.8 %, while the hybrids Q‑GQA‑4 and Q‑MQA (KV sharing on top of GQA‑4 and MQA) reach 87.5 % and 96.9 %.
Five‑shot downstream evaluation: Q‑K=V accuracy = 35.99 % vs baseline 36.40 % (Δ = ‑0.41 pp).
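The cache reductions listed above for head sharing and the hybrid schemes can be reproduced with a few lines of arithmetic. The sketch below assumes 16 attention heads and reads GQA‑4 as 4 KV heads; both are inferences from the reported percentages, not values stated in this article. The function name `cache_reduction` is hypothetical.

```python
def cache_reduction(num_heads, kv_heads, share_kv):
    """Fraction of the baseline KV cache removed by a scheme.

    Baseline stores 2 tensors (K and V) for each of `num_heads` heads.
    Head sharing lowers the number of KV heads; K=V sharing lowers the
    number of stored tensors from 2 to 1.
    """
    tensors = 1 if share_kv else 2
    remaining = (kv_heads * tensors) / (num_heads * 2)
    return 1.0 - remaining


NUM_HEADS = 16  # assumption inferred from the reported MQA figure

schemes = {
    "Q-K=V":   (NUM_HEADS, True),
    "GQA-4":   (4, False),
    "MQA":     (1, False),
    "Q-GQA-4": (4, True),
    "Q-MQA":   (1, True),
}

for name, (kv_heads, share_kv) in schemes.items():
    pct = 100.0 * cache_reduction(NUM_HEADS, kv_heads, share_kv)
    print(f"{name:8s} {pct:.1f} % smaller cache")
```

Under these assumptions the two mechanisms multiply: head sharing shrinks the number of cached heads, and KV sharing halves whatever remains.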
Results – 1.2 B language model
Baseline PPL = 5.004.
Q‑K=V PPL = 5.128 (+2.48 %). KV cache reduced from 5900 MB to 2950 MB (≈ 50 %). Peak memory lowered 6.5 %–6.9 %; decoding throughput improved 4.4 %–5.3 %.
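The relative perplexity changes quoted in both results sections are plain percentage increases over the baseline. A short check, using only the PPL values reported above (the helper name `relative_increase` is hypothetical):

```python
def relative_increase(baseline, variant):
    """Percentage increase of `variant` over `baseline`."""
    return 100.0 * (variant - baseline) / baseline


# (model, scheme) -> (baseline PPL, variant PPL), as reported in this article
reported = {
    ("300M", "Q-K=V"): (5.11, 5.27),
    ("300M", "Q=K-V"): (5.11, 5.36),
    ("300M", "Q=K=V"): (5.11, 6.41),
    ("1.2B", "Q-K=V"): (5.004, 5.128),
}

for (model, scheme), (base, ppl) in reported.items():
    print(f"{model} {scheme}: +{relative_increase(base, ppl):.2f} %")
```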
Representation analysis
Cosine similarity between K and V projection matrices ≈ 0.73 with comparable effective rank, while similarity of Q to K (≈ 0.42) and Q to V (≈ 0.31) is much lower. This suggests strong redundancy between K and V and justifies keeping Q separate.
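This summary does not say exactly how the similarity between projection matrices is computed. One common choice, assumed here, is to flatten each matrix and take the cosine of the resulting vectors. The sketch below uses that assumption with tiny made-up matrices; it does not cover the effective-rank measurement, and the helper names are hypothetical.

```python
import math


def flatten(matrix):
    """Turn a list-of-rows matrix into a flat list of entries."""
    return [value for row in matrix for value in row]


def cosine_similarity(a, b):
    """Cosine similarity between two same-shaped matrices, flattened."""
    fa, fb = flatten(a), flatten(b)
    if len(fa) != len(fb):
        raise ValueError("matrices must have the same number of entries")
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero matrix")
    return dot / (norm_a * norm_b)


# Toy matrices for illustration only (not real model weights).
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 0.2], [0.1, 0.9]]
print(f"toy K-V similarity: {cosine_similarity(w_k, w_v):.3f}")
```

A value near 1 means the two matrices point in nearly the same direction in weight space, which is the kind of redundancy the analysis above attributes to K and V.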
Trade‑offs and recommendations
KV sharing reduces memory without extra training‑time compressors or cache‑eviction logic, making it attractive for long‑context, multi‑user, edge, or on‑device inference. The quality‑efficiency frontier shows:
Q‑K=V halves the cache at the cost of a modest PPL increase (<5 %).
GQA (head sharing) preserves quality best when cache pressure is low.
Hybrid schemes (Q‑GQA, Q‑MQA) achieve extreme cache reductions at higher quality cost.
Limitations
Findings are based on models trained from scratch with the shared projection structure; the method has not been validated on existing pretrained QKV models, on scales beyond 1.2 B parameters, or on sequence lengths greater than 2048 tokens.
Reproducibility
Code and training scripts are available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections.