Tagged articles

TriAttention

1 article · Page 1 of 1

Jun 28, 2026 · Artificial Intelligence

Why Does Evicting 90% of KV‑Cache Tokens Not Reduce GPU Memory in LLM Inference?

Although a KV‑cache eviction strategy can discard 90% of tokens, GPU memory usage stays almost unchanged. The paged‑attention memory blocks that held those tokens remain occupied, and fast attention kernels do not retain the full attention-score matrix, so the memory is never effectively released.
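The block-occupancy half of that argument can be checked with a toy count. This is a minimal sketch and not code from the article. It assumes 16-token blocks, a 4096-token context, and a hypothetical eviction pattern that keeps every 10th token. A block can be freed only when every token in it has been evicted. The helper names `allocated_blocks` and `compacted_blocks` are hypothetical.

```python
import math

BLOCK_SIZE = 16     # tokens per KV block (assumed value)
NUM_TOKENS = 4096   # context length (assumed value)

def allocated_blocks(num_tokens, block_size, kept_positions):
    """Blocks that must stay allocated: any block holding at least one kept token."""
    return len({pos // block_size for pos in kept_positions if 0 <= pos < num_tokens})

def compacted_blocks(num_kept, block_size):
    """Blocks needed if the kept tokens were copied into fully packed blocks."""
    return math.ceil(num_kept / block_size)

# Hypothetical eviction pattern: keep every 10th token, evict the rest (about 90%).
kept = [i for i in range(NUM_TOKENS) if i % 10 == 0]
evicted_fraction = 1 - len(kept) / NUM_TOKENS

total = math.ceil(NUM_TOKENS / BLOCK_SIZE)
in_place = allocated_blocks(NUM_TOKENS, BLOCK_SIZE, kept)
packed = compacted_blocks(len(kept), BLOCK_SIZE)

print(f"evicted {evicted_fraction:.1%} of tokens")                      # evicted 90.0% of tokens
print(f"blocks allocated before eviction: {total}")                     # 256
print(f"blocks still allocated after in-place eviction: {in_place}")    # 256
print(f"blocks needed after compaction: {packed}")                      # 26
```

Because the 10-token stride is shorter than a 16-token block, every block keeps at least one token, so none of the 256 blocks can be freed. Packing the 410 survivors would need only 26 blocks, but that requires copying KV entries, which this sketch does not model. It also does not model the attention-kernel side of the argument.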

FlashAttention · GPU memory · KV cache

0 likes · 7 min read
