Why Does Evicting 90% of KV‑Cache Tokens Not Reduce GPU Memory in LLM Inference?

A KV‑cache eviction strategy can discard 90% of tokens and still leave GPU memory usage almost unchanged, for two reasons. Paged‑attention memory blocks stay allocated as long as any surviving token sits in them, and fast attention kernels never materialize the full score matrix that score‑based eviction needs to rank tokens.

AI Engineering

This article works through an interview question about LLM inference deployment: after applying a KV‑cache compression scheme that evicts 90% of supposedly unimportant tokens, GPU memory consumption hardly changes and long sequences still hit OOM. Why?

The KV cache is the main consumer of GPU memory during inference after the model weights. Each generated token adds a key vector and a value vector for every layer, and these stay in memory for the rest of the request. For example, a 4‑bit quantized Qwen3‑32B model running a 32K‑token chain on a 24 GB GPU runs out of memory at around 24K tokens.
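The growth described above can be estimated with simple arithmetic. The sketch below is illustrative only: the layer count, number of KV heads, head dimension, and 16‑bit cache precision are assumed values chosen for the example, not figures given in the article.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one sequence.

    Each token stores one key and one value vector (hence the factor 2)
    per layer, each of size num_kv_heads * head_dim.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


# Assumed configuration for illustration (not stated in the article):
# 64 layers, 8 KV heads, head dimension 128, 16-bit cache entries.
ASSUMED = dict(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

for tokens in (8 * 1024, 24 * 1024, 32 * 1024):
    size = kv_cache_bytes(tokens, **ASSUMED)
    print(f"{tokens:>6} tokens -> {size / 2**30:.1f} GiB of KV cache")
```

Under these assumptions the cache costs 256 KiB per token, so 24K tokens need about 6 GiB on top of the quantized weights. Note that quantizing the weights to 4 bits does not by itself shrink the cache.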

An analogy: the prefill stage is like reading a full meeting transcript, and the decode stage is like speaking afterwards. Roughly 90% of the time goes to handling context, which is why the KV cache matters so much for speed and cost.

The first obstacle is the paged‑attention memory block mechanism. vLLM and most production inference services allocate GPU memory in fixed‑size blocks (≈16 tokens each). A block is released only when all of its slots are empty. After evicting 14K of 16K tokens, the surviving tokens are scattered across many blocks, so few, if any, blocks end up completely empty and almost no memory is returned to the allocator. Moving tokens to fill the gaps is possible, but it adds bookkeeping overhead.
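A small simulation makes the block effect concrete. It is a sketch under a simplifying assumption: evicted tokens are chosen uniformly at random, whereas real importance scores may cluster. Under that assumption, a 16‑token block is fully emptied with probability of roughly 0.9^16 ≈ 0.185, so most blocks stay pinned. The function names are hypothetical and do not come from vLLM.

```python
import random

BLOCK_SIZE = 16  # tokens per block, matching the article's approximate figure


def blocks_in_use(kept_positions, block_size=BLOCK_SIZE):
    """Number of blocks that still hold at least one surviving token."""
    return len({pos // block_size for pos in kept_positions})


def simulate(num_tokens=16_384, evict_fraction=0.9, seed=0):
    """Evict tokens uniformly at random and count blocks that cannot be freed."""
    rng = random.Random(seed)
    num_keep = round(num_tokens * (1 - evict_fraction))
    kept = rng.sample(range(num_tokens), num_keep)
    total_blocks = -(-num_tokens // BLOCK_SIZE)   # ceiling division
    ideal_blocks = -(-num_keep // BLOCK_SIZE)     # if survivors were packed tightly
    return total_blocks, blocks_in_use(kept), ideal_blocks


total, used, ideal = simulate()
print(f"blocks allocated before eviction: {total}")
print(f"blocks still occupied after eviction: {used}")
print(f"blocks needed if survivors were packed: {ideal}")
```

The gap between the occupied count and the packed count is the memory that eviction alone fails to return.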

The second obstacle is the nature of fast attention kernels such as FlashAttention. They compute attention chunk by chunk and never materialize the complete attention‑score matrix. Eviction methods that rank token importance by attention score therefore have to fall back to eager attention to obtain those scores, which gives up the speed advantage.
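The following pure‑Python sketch illustrates the idea for a single query vector. It uses the online‑softmax trick that chunked kernels rely on; it is a simplified illustration, not FlashAttention's actual GPU implementation. The eager version returns per‑token weights that an eviction policy could rank, while the chunked version produces the same output but only ever holds one chunk of scores.

```python
import math


def eager_attention(q, keys, values):
    """Materializes every score, so per-token weights are available for ranking."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def chunked_attention(q, keys, values, chunk=4):
    """Same output via online softmax; the full weight vector never exists."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk):
        ks, vs = keys[start:start + chunk], values[start:start + chunk]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on first chunk).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only `eager_attention` can hand back `weights`; once the computation is chunked, the scores for earlier chunks are gone by the time the output is ready.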

NVIDIA's recent TriAttention method addresses both problems. It scores tokens by the geometric clustering of key and query vectors before RoPE is applied, so it does not need a full attention matrix. It also compacts the cache every 128 decoded tokens: surviving tokens are moved forward in their original order, which consolidates the empty slots into whole blocks that can be returned to the allocator.
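The compaction step can be sketched on a toy slot list. This is a hypothetical illustration of the move‑survivors‑forward idea only, not NVIDIA's implementation; a real system would move key and value tensors on the GPU and update its block tables. The `compact` function and the use of `None` for an evicted slot are assumptions made for the example.

```python
BLOCK_SIZE = 16  # tokens per block, matching the article's approximate figure


def compact(slots, block_size=BLOCK_SIZE):
    """Pack surviving tokens to the front, preserving their order.

    slots: list of token ids, with None marking an evicted slot.
    Returns (compacted_slots, freed_blocks).
    """
    survivors = [tok for tok in slots if tok is not None]  # order preserved
    blocks_before = -(-len(slots) // block_size)           # ceiling division
    blocks_after = -(-len(survivors) // block_size)
    # Pad the last partially filled block with empty slots.
    padding = blocks_after * block_size - len(survivors)
    return survivors + [None] * padding, blocks_before - blocks_after


# Four blocks with one survivor in every 8 slots: no block is empty
# before compaction, yet only one block is needed afterwards.
slots = [i if i % 8 == 0 else None for i in range(64)]
compacted, freed = compact(slots)
print(f"survivors packed into {len(compacted) // BLOCK_SIZE} block(s), {freed} freed")
```

Running this periodically rather than on every step, as the article describes with a 128‑token interval, amortizes the copy cost over many decode steps.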

Empirical results show that in long‑sequence scenarios TriAttention matches full‑attention accuracy, speeds up decoding by 2.5×, and reduces KV‑cache memory usage by 10.7×.

The takeaway the industry is converging on is that the key metric for KV‑cache compression is the number of physical blocks actually released, not the number of tokens evicted. Many research evaluations use pre‑allocated contiguous tensors, which do not reflect how paged memory behaves in real deployments.

Related resources:

https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/

http://x.com/i/article/2034896077460316163

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
