ClusterAttn: Compressing KV Cache with Intrinsic Attention Clustering
ClusterAttn addresses the KV-cache bottleneck of large language models by exploiting the natural clustering of attention scores. It achieves up to 92% cache compression without accuracy loss, boosts inference throughput by 2.6–4.8×, fits 128K-token sequences on a single GPU, and outperforms existing training-free compression methods.
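To make the idea concrete, here is a minimal sketch of KV-cache compression driven by attention-score clustering. The function name, the largest-gap clustering heuristic, and the `max_keep_ratio` cap are all illustrative assumptions, not the paper's actual algorithm: cached tokens whose average attention scores fall in the high-score cluster are kept, and the rest are evicted.

```python
import numpy as np

def compress_kv_cache(keys, values, attn_scores, max_keep_ratio=0.5):
    """Hypothetical sketch: keep only KV entries whose attention
    scores fall in the high-score cluster."""
    # Average attention each cached token receives across query rows.
    importance = attn_scores.mean(axis=0)      # shape: (seq_len,)
    order = np.argsort(importance)[::-1]       # token indices, descending
    sorted_imp = importance[order]
    # Split at the largest gap between consecutive sorted scores:
    # tokens above the gap form the "important" cluster.
    gaps = sorted_imp[:-1] - sorted_imp[1:]
    split = int(np.argmax(gaps)) + 1
    # Cap the kept set so some compression is always guaranteed.
    split = max(1, min(split, int(len(importance) * max_keep_ratio)))
    keep = np.sort(order[:split])              # preserve token order
    return keys[keep], values[keep], keep

# Toy example: 8 cached tokens, 4-dim keys/values, 3 query rows.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))
values = rng.normal(size=(8, 4))
attn = np.array([[.30, .02, .25, .01, .28, .02, .10, .02]] * 3)
k2, v2, kept = compress_kv_cache(keys, values, attn)
print(kept, k2.shape)  # tokens 0, 2, 4 survive; cache shrinks to 3 entries
```

A real implementation would operate per attention head on GPU tensors and recompute the kept set as the cache grows; this sketch only illustrates the selection step.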
