The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

Background: Prefill, Decode, and KV Cache

LLM inference consists of a compute‑heavy Prefill stage that processes all input tokens in parallel and a Decode stage that generates tokens autoregressively, spending most GPU time reading KV cache from HBM rather than computing. KV cache stores previously computed Key and Value tensors, avoiding recomputation for each new token.
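To make the caching step concrete, here is a minimal single-head sketch in plain numpy (toy dimensions, no framework API): each decode step computes K/V only for the new token and appends them to the cache, while attention reads over the full cached history.

```python
# Minimal single-head decode step with a KV cache (toy dimensions, numpy only).
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """Append the new token's K/V to the cache and attend over all cached tokens."""
    q = x_new @ W_q                               # query for the new token only
    k_cache = np.vstack([k_cache, x_new @ W_k])   # K/V of earlier tokens are reused, not recomputed
    v_cache = np.vstack([v_cache, x_new @ W_v])
    scores = (q @ k_cache.T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over all cached positions
    return weights @ v_cache, k_cache, v_cache

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
for _ in range(4):                                # four decode steps; the cache grows one row per step
    out, k_cache, v_cache = decode_step(rng.standard_normal((1, d)), W_q, W_k, W_v, k_cache, v_cache)
print(k_cache.shape)                              # (4, 8): one cached K row per generated token
```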

For Llama‑3‑70B with an 8K context, KV cache per token equals 2 × 80 × 8 × 128 × 2 bytes ≈ 320 KB (2 for K and V, 80 layers, 8 KV heads, a head_dim of 128, and 2 bytes for FP16). An 8K‑token request therefore needs roughly 2.7 GB, and 32 concurrent requests need about 86 GB, which exceeds the memory of a single A100 80 GB GPU and makes KV cache management critical.
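The arithmetic is easy to reproduce; the sketch below assumes the commonly cited Llama‑3‑70B GQA configuration (80 layers, 8 KV heads, head_dim 128, FP16):

```python
# Reproducing the KV-cache sizing arithmetic (assumed Llama-3-70B GQA config:
# 80 layers, 8 KV heads, head_dim 128, FP16 = 2 bytes per element).
num_layers, num_kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes   # K and V
per_request = per_token * 8192                                       # 8K-token context
concurrent = per_request * 32                                        # 32 concurrent requests

print(f"per token:   {per_token / 1024:.0f} KB")     # 320 KB
print(f"per request: {per_request / 1e9:.2f} GB")    # ~2.68 GB
print(f"32 requests: {concurrent / 1e9:.1f} GB")     # ~85.9 GB, exceeding an A100's 80 GB
```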

Era 0: Pre‑GenAI (before 2017)

Before Transformers, dominant models (ResNet, YOLO, VGG, Inception) were stateless feed‑forward networks, so the concept of KV cache did not exist. Inference frameworks such as ONNX Runtime and TensorRT were designed for these stateless workloads, loading a model, running forward passes, and returning results.

Era 1: Continuous KV Cache (2017)

The original Transformer paper introduced self‑attention and the need to cache K/V tensors between decode steps. Early engines (e.g., HuggingFace Transformers) allocated a contiguous tensor of size max_seq_len per request, with per‑request storage calculated as

2 × num_layers × num_heads × head_dim × max_seq_len × dtype_bytes

This simple approach yields large speed gains over recomputing attention each step.

The downside is linear memory growth with max_seq_len × batch_size, causing severe internal fragmentation because most requests are far shorter than the maximum length. Empirical data shows only 20–38 % of allocated KV memory holds useful token state.
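A toy model of this allocation policy (illustrative only, with hypothetical request lengths) makes the waste visible: memory is reserved for max_seq_len up front, while utilization tracks only the tokens actually produced.

```python
# Toy model of Era-1 contiguous per-request allocation (illustrative only).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2
MAX_SEQ_LEN = 8192

def bytes_per_token():
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V

def contiguous_alloc(num_requests):
    """Memory reserved by the Era-1 policy, independent of actual sequence lengths."""
    return num_requests * MAX_SEQ_LEN * bytes_per_token()

actual_lengths = [512, 1800, 300, 4096, 950, 2400, 700, 1200]   # hypothetical request lengths
reserved = contiguous_alloc(len(actual_lengths))
used = sum(actual_lengths) * bytes_per_token()
print(f"reserved: {reserved / 1e9:.1f} GB, used: {used / 1e9:.1f} GB, "
      f"utilization: {100 * used / reserved:.0f}%")   # most of the reservation holds nothing useful
```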

Era 2: PagedAttention (2023)

vLLM introduced a paging‑style KV cache inspired by operating‑system virtual memory. KV cache is split into fixed‑size blocks allocated on demand, with a block table mapping logical pages to physical memory, mirroring OS page tables.
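A minimal sketch of the idea (not vLLM's actual implementation) shows the core data structures: a pool of fixed-size physical blocks plus a per-request block table that grows one block at a time.

```python
# Toy sketch of a paged KV cache: fixed-size blocks allocated on demand and tracked
# in a per-request block table, analogous to an OS page table. Not vLLM's real code.
BLOCK_SIZE = 16  # tokens per block

class PagedKVPool:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.block_tables = {}                       # request_id -> list of physical block IDs
        self.seq_lens = {}                           # request_id -> tokens written so far

    def append_token(self, request_id):
        """Reserve KV space for one new token; grab a new physical block only on a boundary."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.seq_lens.get(request_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())
        self.seq_lens[request_id] = length + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)

pool = PagedKVPool(num_blocks=1024)
for _ in range(40):                                  # 40 tokens -> ceil(40/16) = 3 physical blocks
    pool.append_token("req-0")
print(pool.block_tables["req-0"])                    # logical pages mapped to 3 scattered blocks
pool.free("req-0")
```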

The vLLM paper reports 2–4× higher throughput than FasterTransformer and Orca, fragmentation reduced to <4 % (down from 60–80 %), and concurrency scaling from dozens to thousands of requests.

PagedAttention also enables prefix caching: SGLang’s RadixAttention reuses KV pages for shared prompts, dramatically increasing throughput for multi‑turn dialogue and Retrieval‑Augmented Generation (RAG) scenarios.

Practical Comparison: vLLM vs SGLang Prefix Caching

Both frameworks support prefix caching but differ in implementation. vLLM uses hash‑based block‑level prefix matching, while SGLang employs a RadixAttention tree with LRU management, achieving higher cache‑hit rates in complex multi‑call workloads; vLLM’s approach remains simpler and performs well for standard chat use cases.
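The hash-based style attributed to vLLM above can be sketched as follows (illustrative only, with a made-up block size and a dict standing in for the KV pool): each block's identity is a hash chained over everything before it, so requests sharing a prompt prefix hit the same leading blocks.

```python
# Hedged sketch of hash-based block-level prefix matching (not vLLM's actual code).
import hashlib

BLOCK_SIZE = 4  # tokens per block (small for illustration)

def block_hashes(token_ids):
    """One chained hash per full block: hash(previous block hash + this block's tokens)."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        prev = hashlib.sha256(prev + str(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

cache = {}  # block hash -> physical block ID (toy stand-in for cached KV pages)

def lookup_and_insert(token_ids):
    """Return how many leading blocks were already cached, registering the misses."""
    hits = 0
    for h in block_hashes(token_ids):
        if h in cache:
            hits += 1
        else:
            cache[h] = len(cache)       # pretend to allocate a new physical block
    return hits

system_prompt = list(range(12))         # three full blocks of shared prefix
print(lookup_and_insert(system_prompt + [100, 101, 102, 103]))  # 0 hits, cold cache
print(lookup_and_insert(system_prompt + [200, 201, 202, 203]))  # 3 hits on the shared prefix
```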

Era 3: Heterogeneous KV Cache (2024)

Model architectures and optimizations diversified, requiring management of heterogeneous cache types:

Speculative decoding maintains separate KV caches for a small draft model and a large target model.

Vision‑language models (e.g., QwenVL, InternVL) generate large image embeddings that can be cached across requests, but differ in size from text KV.

Quantized KV caches (FP8, etc.) store scaling factors alongside low‑precision data.

Sliding‑window attention (SWA) keeps only the most recent window_size tokens, requiring eviction of older entries.

These heterogeneous caches introduce fragmentation across independent managers, unpredictable memory allocation at startup, reduced prefix‑cache hit rates, and increased system complexity. vLLM mitigates this by separating managers for text KV, visual encoder cache, and Mamba cache, though this approach remains brittle and hard to extend.
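The fragmentation problem can be illustrated with a toy static split (not vLLM's code): each cache type gets its own manager and a budget fixed at startup, so one pool can be exhausted while free memory sits stranded in the others.

```python
# Toy illustration of Era-3 separate managers with static, startup-time budgets.
TOTAL_GB = 40.0

class StaticPool:
    def __init__(self, name, budget_gb):
        self.name, self.budget, self.used = name, budget_gb, 0.0

    def alloc(self, gb):
        if self.used + gb > self.budget:
            raise MemoryError(f"{self.name} pool exhausted ({self.used:.1f}/{self.budget:.1f} GB)")
        self.used += gb

# Split guessed before the real workload mix is known (hypothetical ratios).
pools = {
    "text_kv":  StaticPool("text_kv",  TOTAL_GB * 0.6),
    "image_kv": StaticPool("image_kv", TOTAL_GB * 0.3),
    "draft_kv": StaticPool("draft_kv", TOTAL_GB * 0.1),   # speculative-decoding draft model
}

pools["image_kv"].alloc(10.0)
try:
    pools["image_kv"].alloc(3.0)       # image-heavy traffic blows past its 12 GB slice...
except MemoryError as e:
    idle = sum(p.budget - p.used for n, p in pools.items() if n != "image_kv")
    print(e, f"- yet {idle:.0f} GB sits idle in the other pools")
```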

Era 4: Distributed KV Cache (2025+)

Model sizes outgrow a single GPU, turning KV cache management into a multi‑node, data‑center‑scale problem.

Decoupled Inference

DistServe proposes deploying Prefill and Decode on separate GPU instances: Prefill is compute‑bound, Decode is memory‑bound, allowing each to use hardware optimally. Benchmarks show up to a 4.48× increase in request throughput (or a 10.2× tighter SLO at equal throughput).
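A minimal sketch of the handoff (illustrative, not DistServe's API): the prefill worker builds the full-prompt KV cache once, the cache is transferred, and the decode worker only appends one token's worth of KV per step.

```python
# Illustrative prefill/decode disaggregation (placeholder objects, no real tensors).
from dataclasses import dataclass, field

@dataclass
class KVCache:
    keys: list = field(default_factory=list)     # stand-ins for per-token K tensors
    values: list = field(default_factory=list)

def prefill_worker(prompt_tokens):
    """Process the whole prompt and return the populated KV cache (compute-bound)."""
    cache = KVCache()
    for t in prompt_tokens:                      # in a real engine this is one batched pass
        cache.keys.append(("K", t)); cache.values.append(("V", t))
    return cache

def decode_worker(cache, num_new_tokens):
    """Generate tokens one at a time against the transferred cache (memory-bound)."""
    out = []
    for step in range(num_new_tokens):
        token = f"tok{step}"                     # placeholder for the sampled token
        cache.keys.append(("K", token)); cache.values.append(("V", token))
        out.append(token)
    return out

cache = prefill_worker(["The", "quick", "brown", "fox"])   # runs on the prefill instance
# ... cache is shipped over the interconnect to a decode instance ...
print(decode_worker(cache, 3), len(cache.keys))            # ['tok0', 'tok1', 'tok2'] 7
```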

vLLM’s Encoder Disaggregation isolates the visual encoder as an independent scalable service, improving goodput by 2–2.5× for multimodal workloads.

KV‑Cache‑Aware Load Balancing

NVIDIA Dynamo adds KV‑cache‑aware routing, directing requests to instances that already hold the relevant KV pages, maximizing prefix‑cache hit rates. This requires each instance to maintain a cluster‑wide view of cache state.
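A hedged sketch of such routing (not Dynamo's actual algorithm): send each request to the instance whose cache covers the longest prefix of it, breaking ties in favor of the least-loaded instance.

```python
# Toy KV-cache-aware router: longest cached prefix wins, load breaks ties.
BLOCK_SIZE = 4

def prefix_blocks(token_ids):
    """Split a request into full block-sized chunks, the unit of cache reuse."""
    return [tuple(token_ids[i:i + BLOCK_SIZE])
            for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE)]

def cached_prefix_len(instance_cache, token_ids):
    """Number of leading blocks this instance already holds."""
    n = 0
    for block in prefix_blocks(token_ids):
        if block not in instance_cache:
            break
        n += 1
    return n

def route(request, instances):
    """instances: {name: {"cache": set of cached blocks, "load": active requests}}."""
    return max(instances,
               key=lambda name: (cached_prefix_len(instances[name]["cache"], request),
                                 -instances[name]["load"]))

shared = list(range(8))                                   # shared system-prompt tokens
instances = {
    "gpu-a": {"cache": set(prefix_blocks(shared)), "load": 5},   # warm cache, busier
    "gpu-b": {"cache": set(),                      "load": 1},   # cold cache, idle
}
print(route(shared + [42, 43, 44, 45], instances))        # 'gpu-a': the prefix hit outweighs load
```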

Hierarchical KV Cache

Moonshot AI’s Mooncake uses a tiered cache: hot pages stay in GPU HBM, cold pages spill to CPU DRAM or SSD. Overlapping I/O with GPU compute hides latency. In long‑context scenarios, Mooncake boosts throughput up to 525 % and adds 75 % more request capacity in real Kimi workloads.
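In the spirit of that design, a two-tier sketch (not Mooncake's implementation) demotes the least recently used page from the GPU tier to the CPU tier when the fast tier fills, and promotes it back on access.

```python
# Toy hierarchical KV cache: a small "GPU" tier with LRU spill to a larger "CPU" tier.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity):
        self.gpu = OrderedDict()          # page_id -> payload, ordered by recency
        self.cpu = {}                     # overflow tier (DRAM/SSD in a real system)
        self.gpu_capacity = gpu_capacity

    def put(self, page_id, payload):
        self.gpu[page_id] = payload
        self.gpu.move_to_end(page_id)
        if len(self.gpu) > self.gpu_capacity:
            cold_id, cold_payload = self.gpu.popitem(last=False)   # evict the LRU page
            self.cpu[cold_id] = cold_payload                       # spill, don't discard

    def get(self, page_id):
        if page_id in self.gpu:
            self.gpu.move_to_end(page_id)              # refresh recency
            return self.gpu[page_id]
        if page_id in self.cpu:                        # hit in the slow tier:
            self.put(page_id, self.cpu.pop(page_id))   # promote back to the GPU tier
            return self.gpu[page_id]
        return None                                    # true miss -> recompute via prefill

cache = TieredKVCache(gpu_capacity=2)
for i in range(4):
    cache.put(i, f"kv-page-{i}")
print(list(cache.gpu), list(cache.cpu))   # [2, 3] on GPU, [0, 1] spilled to CPU
print(cache.get(0), list(cache.gpu))      # page 0 promoted back, another page demoted
```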

Era 5: Unified Hybrid KV Cache (2025+)

The current frontier is a unified memory pool shared by all heterogeneous KV types, emphasizing composability—optimizations should stack without interfering.

Jenga: Large Pages + LCM Size Alignment

Jenga introduces a two‑level allocator. It computes the least common multiple (LCM) of different KV element sizes to define a “large page” that can host multiple KV shapes without fragmentation. For example, image token KV of 256 bytes and text token KV of 384 bytes yield an LCM of 768 bytes as the large‑page size, which is then subdivided into smaller pages per layer.
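The sizing rule is easy to reproduce (toy reconstruction, not Jenga's code; the 192-byte draft entry is a hypothetical addition to the two sizes quoted above):

```python
# Large-page sizing via the LCM of per-token KV sizes: every type subdivides the
# page into a whole number of slots, leaving no remainder.
from math import lcm

kv_bytes = {"text": 384, "image": 256, "draft": 192}   # hypothetical per-token KV sizes

large_page = lcm(*kv_bytes.values())
print(f"large page: {large_page} bytes")                # lcm(384, 256, 192) = 768

for kind, size in kv_bytes.items():
    slots, remainder = divmod(large_page, size)
    print(f"  {kind}: {slots} slots per page, {remainder} bytes wasted")
```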

Compared with the original vLLM, Jenga improves GPU memory utilization by up to 79.6 % and achieves up to 4.92× higher throughput (average 1.80×).

SGLang: CUDA Virtual Memory

SGLang leverages the CUDA Virtual Memory API to remap device memory dynamically, keeping KV pages contiguous in virtual address space while physically scattered. An elastic memory pool can adjust allocation ratios between Mamba pools and KV cache pools at runtime.
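A plain-Python analogue of that indirection (the real mechanism is the CUDA virtual memory API, which reserves a virtual range and maps physical allocations into it; nothing here touches CUDA): logical page indices stay contiguous while the backing physical chunks are scattered and can be granted or released at runtime.

```python
# Conceptual sketch only: a pool that is contiguous in its logical index space
# while backed by scattered physical chunks that can be added or handed back.
class VirtuallyContiguousPool:
    def __init__(self):
        self.page_map = []                    # logical page index -> physical chunk id

    def grow(self, physical_chunk_ids):
        """Map freshly granted physical chunks onto the end of the logical range."""
        self.page_map.extend(physical_chunk_ids)

    def shrink(self, num_pages):
        """Unmap trailing logical pages and return their physical chunks."""
        released, self.page_map = self.page_map[-num_pages:], self.page_map[:-num_pages]
        return released

kv_pool = VirtuallyContiguousPool()
kv_pool.grow([7, 3, 42])                      # physically scattered chunks...
print(kv_pool.page_map)                       # ...but logically contiguous: indices 0, 1, 2
freed = kv_pool.shrink(1)                     # hand a chunk back, e.g. to a Mamba pool
print(freed, kv_pool.page_map)                # [42] [7, 3]
```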

The 2026 Q1 roadmap lists composability as a core goal, aiming to execute speculative decoding for mixed‑modality VLMs across multiple nodes, which will require extensive architectural refactoring.

Choosing the Right Approach for Different Scenarios

Standard text LLM services (chat, completion): adopt Era 2 (PagedAttention) with vLLM or SGLang; enable prefix caching when prompts are shared.

Multimodal models (VLM): fall into Era 3; evaluate framework handling of visual embeddings and consider Era 4 encoder disaggregation for image‑heavy loads.

Mixed architectures (Gemma 3, Jamba, Llama 4): apply Era 5 solutions such as Jenga’s LCM allocator or SGLang’s CUDA virtual memory.

Massive high‑throughput production: prioritize Era 4—decoupled Prefill/Decode and KV‑aware routing (e.g., NVIDIA Dynamo, Mooncake) for cost‑effective scaling.

Very long‑context workloads (>100K tokens): require hierarchical KV cache with GPU‑to‑CPU overflow as in Mooncake.

Conclusion

KV cache is the true bottleneck: Llama‑3‑70B handling 32 concurrent 8K‑token requests consumes over 80 GB of KV cache, exceeding the memory of a single A100 GPU.

The evolution of KV cache management mirrors the history of operating‑system memory management, moving from contiguous allocation to virtual memory paging and then to distributed shared memory, yet it has compressed decades of that progress into roughly eight years, driven by explosive LLM demand. Understanding these eras is indispensable for any team building LLM infrastructure, as all future work builds on this foundation.

