Network Intelligence Research Center (NIRC)
Dec 23, 2025 · Artificial Intelligence

ClusterAttn: Compressing KV Cache with Intrinsic Attention Clustering

ClusterAttn tackles the KV‑cache bottleneck of large language models by exploiting the natural clustering of attention scores. It achieves up to 92% compression without accuracy loss, boosts throughput 2.6–4.8×, handles 128K‑token sequences on a single GPU, and outperforms existing training‑free compression methods.
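The core idea can be illustrated with a toy sketch. The snippet below is not ClusterAttn's actual algorithm (which the post details later); it is a minimal density‑style selection that splits sorted attention scores at the largest gap and keeps only the high‑score cluster of KV entries. The scores are synthetic, and `cluster_select` and `gap_factor` are hypothetical names for illustration.

```python
# Illustrative sketch (not the paper's method): compress a KV cache by
# keeping only tokens whose attention scores fall in the dense
# high-score cluster. Scores here are synthetic.
import random

def cluster_select(scores, gap_factor=3.0):
    """Split sorted scores at the largest gap and keep the high-score
    cluster. `gap_factor` (hypothetical) guards against splitting on
    ordinary score noise."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    vals = [scores[i] for i in order]
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    mean_gap = sum(gaps) / len(gaps)
    split = max(range(len(gaps)), key=lambda i: gaps[i])
    if gaps[split] < gap_factor * mean_gap:
        return set(order)  # no clear boundary: keep every token
    return set(order[split + 1:])  # keep the high-score cluster

random.seed(0)
# 90% low-attention tokens plus 10% heavy hitters
scores = [random.uniform(0.0, 0.01) for _ in range(90)] + \
         [random.uniform(0.5, 0.7) for _ in range(10)]
kept = cluster_select(scores)
print(f"kept {len(kept)}/{len(scores)} tokens "
      f"({100 * (1 - len(kept) / len(scores)):.0f}% compression)")
```

With this synthetic distribution the gap between the two clusters dominates, so only the heavy‑hitter tokens survive, mirroring the intuition that a small dense cluster of keys receives most of the attention mass.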

KV cache compression · attention clustering · density clustering