ClusterAttn: Compressing KV Cache with Intrinsic Attention Clustering
ClusterAttn tackles the KV‑cache bottleneck of large language models by exploiting the natural clustering of attention scores. It achieves up to 92% cache compression without accuracy loss, boosts throughput 2.6–4.8×, handles 128K‑token sequences on a single GPU, and outperforms existing training‑free compression methods.
Problem
When large language models (LLMs) process long inputs, the key‑value (KV) cache grows linearly with the sequence length, slowing inference, inflating memory usage, and eventually triggering out‑of‑memory (OOM) failures.
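To put numbers on this, here is a back‑of‑the‑envelope calculation (an illustration, not from the paper) using Mistral‑7B's published configuration: 32 layers, 8 grouped‑query KV heads, head dimension 128, and an fp16 cache.

```python
# Rough KV-cache footprint for Mistral-7B with an fp16 cache.
# Every layer stores one key and one value vector per KV head per token.
n_layers, n_kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2

per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes  # K and V
print(per_token // 1024, "KiB per token")                      # 128 KiB

seq_len = 128 * 1024                                           # a 128K-token input
print(per_token * seq_len / 2**30, "GiB for the whole cache")  # 16.0 GiB
```

That 16 GiB sits on top of roughly 14 GiB of fp16 weights and keeps growing with the input; at the paper's 92% compression ratio, the same cache would shrink to about 1.3 GiB.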
Key Insight
During decoding, attention scores naturally cluster: tokens that receive high attention tend to gather into meaningful, contiguous “information clusters”. This intrinsic attention clustering can be used to identify the most important parts of the KV cache.
Method: ClusterAttn
Feature aggregation – discover clusters: the attention scores that the last user query (the “observation window”) assigns to each preceding token are summed, producing an importance vote that highlights candidate clusters.
Density clustering – fit clusters: a DBSCAN‑inspired density‑based attention clustering algorithm (DBAC) adaptively identifies contiguous token ranges with high attention density, avoiding the semantic fragmentation caused by naïve top‑K selection.
Cache stitching – feed inference: the KV entries of the identified clusters are concatenated with the KV cache of the observation window, forming a much smaller cache that is fed to the decoder (a sketch of all three steps follows this list).
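The summary gives no reference code, so the following PyTorch sketch is only a plausible rendering of the three steps under stated assumptions: prompt‑phase attention weights arrive as an `[n_heads, q_len, kv_len]` tensor, the adaptive threshold (mean plus one standard deviation) and the `eps` merge distance stand in for DBAC's actual profiled hyper‑parameters, and all function names are illustrative, not the paper's.

```python
import torch

def aggregate_scores(attn: torch.Tensor, window: int) -> torch.Tensor:
    """Step 1, feature aggregation: sum the attention that the last
    `window` queries (the observation window) pay to each earlier token,
    pooled over all heads, yielding one importance score per prefix token.
    attn: [n_heads, q_len, kv_len] prompt-phase attention weights."""
    prefix_len = attn.shape[-1] - window
    obs = attn[:, -window:, :prefix_len]    # window queries -> prefix keys
    return obs.sum(dim=(0, 1))              # [prefix_len]

def dbac_clusters(scores: torch.Tensor, eps: int = 4):
    """Step 2, density-based attention clustering (DBSCAN-flavoured):
    tokens whose score clears an adaptive threshold are core points, and
    core points within `eps` positions of each other merge into one
    contiguous range, avoiding the fragmentation of naive top-K picks."""
    threshold = scores.mean() + scores.std()              # illustrative choice
    core = (scores >= threshold).nonzero(as_tuple=True)[0].tolist()
    clusters, start, prev = [], None, None
    for i in core:
        if start is None:
            start = prev = i                # open the first cluster
        elif i - prev <= eps:
            prev = i                        # dense enough: extend the cluster
        else:
            clusters.append((start, prev))  # gap too wide: close and reopen
            start = prev = i
    if start is not None:
        clusters.append((start, prev))
    return clusters                         # [(first, last), ...] token ranges

def stitch_cache(keys, values, clusters, window):
    """Step 3, cache stitching: keep the KV entries inside the selected
    clusters, append the observation window's own entries, and hand the
    much smaller cache to the decoder.
    keys/values: [n_kv_heads, seq_len, head_dim] for one layer."""
    seq_len = keys.shape[1]
    keep = torch.cat([torch.arange(a, b + 1) for a, b in clusters]
                     + [torch.arange(seq_len - window, seq_len)])
    return keys[:, keep], values[:, keep]

# Illustrative wiring for one layer's prompt-phase tensors:
# scores = aggregate_scores(attn, window=32)
# k_small, v_small = stitch_cache(keys, values, dbac_clusters(scores), window=32)
```

Selecting contiguous (first, last) ranges rather than isolated top‑K positions is what keeps whole “information clusters” intact.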
The entire pipeline requires no additional training; only about 20% of the data is needed for lightweight hyper‑parameter profiling.
Experiments
ClusterAttn was evaluated with Mistral‑7B and LWM on several long‑context benchmarks (LongBench, Needle‑in‑a‑Haystack).
Accuracy remains virtually unchanged at a compression ratio of up to 92%, i.e., retaining only about 8% of the KV cache; on some tasks performance even improves.
Throughput increases 2.6–4.8× and decoding latency drops 12–23% compared with full attention (FlashAttention).
On the Needle‑in‑a‑Haystack test, ClusterAttn processes 128K‑token sequences on a single A100‑80GB GPU with >90% retrieval accuracy, whereas full‑attention models fail beyond 40K tokens.
Against existing training‑free compression methods (H2O, StreamingLLM, SnapKV), ClusterAttn delivers higher accuracy, lower perplexity, and better efficiency at smaller cache budgets.
Figures & Tables
Figure 1: Intrinsic attention clustering observed in Mistral‑7B during the prompt phase.
Figure 2: Three‑step ClusterAttn algorithm flow.
Table 1: LongBench evaluation – 92% compression with negligible performance drop.
Figure 3: Throughput comparison – ClusterAttn vs. full attention.
Figure 4: Needle‑in‑a‑Haystack – handling 128K tokens on a single GPU.
Table 2: Perplexity comparison – ClusterAttn achieves lower perplexity than H2O, StreamingLLM, and SnapKV.
Conclusion
ClusterAttn provides a training‑free, high‑fidelity, adaptive solution for KV‑cache compression. It dramatically reduces the memory footprint while preserving, and sometimes even enhancing, model performance, and it opens new avenues for efficient long‑context LLM applications.