
Do Transformers Need Three Projections? Sharing K‑V Cuts KV Cache by 50%

A systematic ICML 2026 study shows that sharing the K and V projection matrices in Transformers halves the KV cache at a perplexity cost of under 5%, offering a simple option for long‑context and edge inference, provided the model is trained with the shared structure.

Machine Learning Algorithms & Natural Language Processing

Jun 11, 2026

Background

Standard self‑attention uses three independent projection matrices, one each for Q, K, and V. During autoregressive inference each token’s K and V must be cached, so the KV cache grows linearly with context length and can dominate memory usage at long contexts.
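The linear growth can be made concrete with a back‑of‑the‑envelope calculation. The sketch below is illustrative only: the function name and the configuration (20 layers, 16 heads, head dimension 64, 2‑byte values, "32 K" read as 32,000 tokens) are assumptions, chosen because they happen to reproduce the 2.62 GB and 10.49 GB baseline figures reported for the 300 M model. The article does not state the actual architecture.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, tensors_per_token=2):
    """Bytes needed to cache attention state for one sequence.

    tensors_per_token is 2 when K and V are stored separately and
    1 when a single shared K=V tensor is stored.
    """
    values_per_token = num_layers * num_kv_heads * head_dim * tensors_per_token
    return values_per_token * context_len * bytes_per_value


# Hypothetical configuration (an assumption, not taken from the paper).
cfg = dict(num_layers=20, num_kv_heads=16, head_dim=64, bytes_per_value=2)

baseline_32k = kv_cache_bytes(context_len=32_000, **cfg)
baseline_128k = kv_cache_bytes(context_len=128_000, **cfg)
shared_32k = kv_cache_bytes(context_len=32_000, tensors_per_token=1, **cfg)

print(f"baseline 32K:  {baseline_32k / 1e9:.2f} GB")
print(f"baseline 128K: {baseline_128k / 1e9:.2f} GB")
print(f"shared   32K:  {shared_32k / 1e9:.2f} GB")
```

Because every factor enters multiplicatively, storing one tensor per token instead of two halves the total regardless of the other settings.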

Projection‑sharing variants

The paper defines three sharing configurations:

Q=K‑V : Q and K share a matrix, V remains separate.

Q‑K=V : K and V share a matrix, Q stays independent (primary scheme).

Q=K=V : all three share a single matrix.
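To make the primary Q‑K=V scheme concrete, here is a minimal single‑head NumPy sketch. It assumes that "sharing" means one weight matrix (called `w_kv` here, a hypothetical name) produces a tensor used as both keys and values. The paper's actual implementation, including multi‑head layout and any normalisation, may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention_shared_kv(x, w_q, w_kv):
    """Single-head causal attention where K and V come from one projection.

    x: (seq_len, d_model); w_q and w_kv: (d_model, d_head).
    Returns the attention output and the one tensor that would be cached.
    """
    q = x @ w_q
    kv = x @ w_kv  # used as keys and as values
    scores = (q @ kv.T) / np.sqrt(kv.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    out = softmax(scores) @ kv
    return out, kv


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))      # 5 tokens, d_model = 16
w_q = rng.standard_normal((16, 8))
w_kv = rng.standard_normal((16, 8))
out, cache = causal_attention_shared_kv(x, w_q, w_kv)
print(out.shape, cache.shape)
```

Only `kv` needs to be kept between decoding steps, which is where the cache saving comes from. In the Q=K‑V variant, by contrast, K and V are still distinct tensors and both must be stored.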

Experimental setup

Experiments cover synthetic, vision, and autoregressive language tasks. Language models of 300 M and 1.2 B parameters are trained on the SlimPajama corpus (~10 B tokens). All non‑attention components are kept identical across variants. Evaluation metrics include perplexity (PPL), KV‑cache size, peak memory, decoding throughput, and five‑shot downstream accuracy.

Results – 300 M language model

Baseline QKV PPL = 5.11. KV cache for 32 K context ≈ 2.62 GB, for 128 K ≈ 10.49 GB.

Q‑K=V PPL = 5.27 (+3.1 % relative). KV cache halved to 1.31 GB (32 K) and 5.24 GB (128 K). Peak memory reduced 6.5 %–6.9 %; decoding throughput increased 4.4 %–5.3 %.

Q=K‑V PPL = 5.36 (+4.9 % relative), with no cache reduction because both K and V must still be stored.

Q=K=V PPL = 6.41 (+25.4 %), indicating severe quality loss.
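The relative increases quoted above follow directly from the reported perplexities. A short check, using only the numbers in this section (the helper name is illustrative):

```python
def relative_increase_pct(value, baseline):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100.0


baseline_ppl = 5.11
variant_ppl = {"Q-K=V": 5.27, "Q=K-V": 5.36, "Q=K=V": 6.41}

for name, ppl in variant_ppl.items():
    print(f"{name}: +{relative_increase_pct(ppl, baseline_ppl):.1f}%")
# Q-K=V: +3.1%
# Q=K-V: +4.9%
# Q=K=V: +25.4%
```

By this measure Q=K‑V costs more perplexity than Q‑K=V while saving no cache, which is why Q‑K=V is treated as the primary scheme.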

Head sharing reduces the cache on its own (GQA‑4 by 75 %, MQA by 93.8 %), and combining it with K‑V sharing halves what remains: the hybrid Q‑GQA‑4 reaches an 87.5 % reduction and Q‑MQA reaches 96.9 %.
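These percentages are consistent with simple counting of cached tensors. The sketch below assumes 16 attention heads, a value inferred from the quoted percentages rather than stated in the article, with GQA‑4 keeping 4 key/value heads and MQA keeping 1.

```python
def cache_reduction_pct(num_heads, num_kv_heads, share_kv=False):
    """Percentage of the baseline KV cache that is removed.

    The baseline stores separate K and V tensors for all num_heads heads.
    """
    tensors = 1 if share_kv else 2
    remaining = (num_kv_heads * tensors) / (num_heads * 2)
    return (1.0 - remaining) * 100.0


NUM_HEADS = 16  # assumption inferred from the reported percentages

schemes = {
    "GQA-4": cache_reduction_pct(NUM_HEADS, 4),
    "MQA": cache_reduction_pct(NUM_HEADS, 1),
    "Q-GQA-4": cache_reduction_pct(NUM_HEADS, 4, share_kv=True),
    "Q-MQA": cache_reduction_pct(NUM_HEADS, 1, share_kv=True),
}
for name, pct in schemes.items():
    print(f"{name}: {pct:.1f}% reduction")
```

The two savings multiply: head sharing shrinks the number of cached heads, and K‑V sharing halves the tensors stored per head.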

Five‑shot downstream evaluation: Q‑K=V accuracy = 35.99 % vs baseline 36.40 % (Δ = ‑0.41 pp).

Results – 1.2 B language model

Baseline PPL = 5.004.

Q‑K=V PPL = 5.128 (+2.48 %). KV cache reduced from 5900 MB to 2950 MB (≈ 50 %). Peak memory lowered 6.5 %–6.9 %; decoding throughput improved 4.4 %–5.3 %.

Representation analysis

Cosine similarity between the K and V projection matrices is ≈ 0.73, and the two have comparable effective rank, while the similarity of Q to K (≈ 0.42) and Q to V (≈ 0.31) is much lower. This suggests strong redundancy between K and V and justifies keeping Q separate.
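The article does not say exactly how these quantities were computed. One common approach, shown here purely as an assumption, is to take the cosine of the flattened weight matrices and to define effective rank as the exponential of the entropy of the normalised singular values. The weights below are synthetic stand‑ins, not the paper's.

```python
import numpy as np


def matrix_cosine_similarity(a, b):
    """Cosine of the angle between two same-shaped matrices, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def effective_rank(w):
    """exp(Shannon entropy) of the normalised singular values of w."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))


rng = np.random.default_rng(0)
w_k = rng.standard_normal((64, 16))
# Synthetic V weights, deliberately correlated with K for illustration.
w_v = 0.7 * w_k + 0.3 * rng.standard_normal((64, 16))
w_q = rng.standard_normal((64, 16))

print("cos(K, V):", round(matrix_cosine_similarity(w_k, w_v), 2))
print("cos(Q, K):", round(matrix_cosine_similarity(w_q, w_k), 2))
print("effective rank K:", round(effective_rank(w_k), 1))
print("effective rank V:", round(effective_rank(w_v), 1))
```

Applied to trained checkpoints, the same two functions would be run per layer (and per head, if desired) on the learned projection weights.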

Trade‑offs and recommendations

KV sharing reduces memory without extra training‑time compressors or cache‑eviction logic, making it attractive for long‑context, multi‑user, edge, or on‑device inference. The quality‑efficiency frontier shows:

Q‑K=V offers a modest PPL increase (<5 %) while halving cache.

GQA (head sharing) preserves quality best when cache pressure is low.

Hybrid schemes (Q‑GQA, Q‑MQA) achieve extreme cache reductions at higher quality cost.

Limitations

Findings are based on models trained from scratch with the shared projection structure; the method has not been validated on existing pretrained QKV models, on scales beyond 1.2 B parameters, or on sequence lengths greater than 2048 tokens.

Reproducibility

Code and training scripts are available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections
