ClusterAttn: Compressing KV Cache with Intrinsic Attention Clustering

ClusterAttn tackles the KV‑cache bottleneck of large language models by exploiting the natural clustering of attention scores, achieving up to 92% compression without accuracy loss, boosting throughput 2.6–4.8×, handling 128K‑token sequences on a single GPU, and outperforming existing training‑free compression methods.

Network Intelligence Research Center (NIRC)

Problem

When large language models (LLMs) process long inputs, the key‑value (KV) cache grows linearly with sequence length, slowing inference, inflating memory usage, and in the worst case triggering out‑of‑memory (OOM) failures.
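To make the growth concrete, here is a back‑of‑the‑envelope sizing, assuming Mistral‑7B‑like dimensions (32 layers, 8 KV heads under grouped‑query attention, head dimension 128, fp16). These numbers are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# At 128K tokens the cache alone needs ~16 GiB in fp16,
# before weights and activations are even counted.
gib = kv_cache_bytes(128 * 1024) / 2**30
```

A 92% compression ratio would shrink that same cache to roughly 1.3 GiB, which is why long contexts become feasible on a single GPU.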

Key Insight

During decoding, attention heads naturally “cluster”: tokens with high attention scores tend to gather into meaningful “information clusters”. This intrinsic attention clustering can be used to identify the most important parts of the KV cache.

Method: ClusterAttn

Feature aggregation – discover clusters: Attention scores from the last user query (the "observation window") are summed for each preceding token, producing an importance vote that highlights candidate clusters.
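The aggregation step might be sketched as follows; `cluster_votes` is a hypothetical helper, and the attention‑tensor layout is an assumption for illustration, not the paper's implementation:

```python
import numpy as np

def cluster_votes(attn, window):
    """attn: (n_heads, q_len, k_len) prompt-phase attention weights.
    Sum the last `window` query rows over heads and window positions,
    yielding one importance score per prefix token (the window's own
    keys are excluded from the vote)."""
    obs = attn[:, -window:, :-window]   # queries in the observation window
    return obs.sum(axis=(0, 1))         # shape: (k_len - window,)
```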

Density clustering – fit clusters: Inspired by DBSCAN, a density‑based attention clustering algorithm (DBAC) adaptively identifies contiguous token ranges with high attention density, avoiding the semantic fragmentation caused by naïve top‑K selection.
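A minimal one‑dimensional stand‑in for the DBAC idea; the `thresh`, `eps`, and `min_size` parameters are hypothetical (the paper's algorithm adapts its thresholds), and this sketch only shows the contiguous‑range grouping that distinguishes clustering from top‑K picking:

```python
def density_clusters(scores, thresh, eps=2, min_size=3):
    """Group high-scoring token indices into contiguous ranges,
    merging indices at most `eps` apart (DBSCAN-style reachability)
    and discarding ranges spanning fewer than `min_size` tokens."""
    idx = [i for i, s in enumerate(scores) if s >= thresh]
    clusters, start = [], None
    for a, b in zip(idx, idx[1:] + [None]):
        if start is None:
            start = a
        if b is None or b - a > eps:      # gap too large: close the range
            if a - start + 1 >= min_size:
                clusters.append((start, a))
            start = None
    return clusters
```

Unlike top‑K, nearby important tokens are kept as whole spans, so a cluster retains the intermediate tokens that give it semantic context.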

Cache stitching – feed inference: The compressed KV cache of the identified clusters is concatenated with the KV cache of the observation window, forming a much smaller cache that is fed to the decoder.
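In outline, the stitching step could look like this; `stitch_cache` and its per‑layer tensor layout are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def stitch_cache(keys, values, clusters, window):
    """keys/values: (seq_len, ...) per-layer cache tensors.
    Keep only tokens inside the selected cluster ranges, then
    append the observation window's own entries unchanged."""
    keep = [i for s, e in clusters for i in range(s, e + 1)]
    keep += list(range(keys.shape[0] - window, keys.shape[0]))
    return keys[keep], values[keep]
```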

The entire pipeline requires no additional training; only about 20% of the data is used for lightweight hyper‑parameter profiling.

Experiments

ClusterAttn was evaluated on Mistral‑7B and LWM across long‑text benchmarks including LongBench and Needle‑in‑a‑Haystack.

Accuracy remains virtually unchanged while achieving a compression ratio of up to 92%; on some tasks performance even improves.

Throughput increases 2.6–4.8× and decoding latency drops 12%–23% compared with full‑attention (FlashAttn).

On the Needle‑in‑a‑Haystack test, ClusterAttn processes 128K‑token sequences on a single A100‑80GB GPU with >90% retrieval accuracy, whereas full‑attention models fail beyond 40K tokens.

Against current training‑free compression methods (H2O, StreamingLLM, SnapKV), ClusterAttn delivers superior precision, lower perplexity, and higher efficiency at smaller cache sizes.

Figures & Tables

Mistral 7B intrinsic attention clustering

Figure 1: Intrinsic attention clustering observed in Mistral‑7B during the prompt phase.

ClusterAttn algorithm flow

Figure 2: Three‑step ClusterAttn algorithm flow.

LongBench results

Table 1: LongBench evaluation – 92% compression with negligible performance drop.

Throughput comparison

Figure 3: Throughput comparison – ClusterAttn vs. full attention.

Needle-in-a-Haystack results

Figure 4: Needle‑in‑a‑Haystack – handling 128K tokens on a single GPU.

Perplexity comparison

Table 2: Perplexity comparison – ClusterAttn achieves lower perplexity than H2O, StreamingLLM, and SnapKV.

Conclusion

ClusterAttn provides a training‑free, high‑fidelity, adaptive solution for KV‑cache compression, dramatically reducing memory footprint while preserving or even enhancing model performance, and opens new avenues for efficient long‑text LLM applications.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language models, performance benchmarking, KV cache compression, attention clustering, density clustering
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
