How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference
This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.
Introduction
vLLM uses PagedAttention, a technique inspired by operating‑system virtual‑memory paging, to dynamically allocate KV‑cache on GPU and reduce memory fragmentation.
LLM inference stages
Inference is split into prefill (forward pass on the whole prompt) and decode (token‑by‑token generation). KV‑cache stores key/value pairs for each token so later attention can reuse them. During prefill the KV pairs are written to cache_k and cache_v; during decode each new token’s KV pair is appended, causing the cache to grow and dominate latency.
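The prefill/decode split above can be sketched in a few lines. This is a toy illustration (random values stand in for real attention outputs; shapes and function names are hypothetical, not vLLM internals) showing how the cache grows by one row per decoded token:

```python
# Toy sketch of KV-cache growth across prefill and decode
# (hypothetical shapes; real caches are per-layer GPU tensors).
import numpy as np

HEAD_DIM = 4

def prefill(prompt_len):
    """Compute K/V for every prompt token in one forward pass."""
    cache_k = np.random.rand(prompt_len, HEAD_DIM)
    cache_v = np.random.rand(prompt_len, HEAD_DIM)
    return cache_k, cache_v

def decode_step(cache_k, cache_v):
    """Append the new token's K/V pair; the cache grows by one row."""
    new_k = np.random.rand(1, HEAD_DIM)
    new_v = np.random.rand(1, HEAD_DIM)
    return np.vstack([cache_k, new_k]), np.vstack([cache_v, new_v])

cache_k, cache_v = prefill(7)          # 7-token prompt -> 7 cached rows
for _ in range(3):                     # generate 3 tokens
    cache_k, cache_v = decode_step(cache_k, cache_v)
print(cache_k.shape)                   # (10, 4)
```

Because decode appends one KV row per generated token per layer, the cache, not the model weights, becomes the variable memory cost per request.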
Key observations:
KV‑cache size grows with prompt length and concurrent request count, stressing GPU memory.
The output sequence length is unknown beforehand, making static allocation inefficient.
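To make the first observation concrete, here is a back-of-the-envelope size estimate. The formula is standard (2 tensors for K and V, per layer, per head); the parameter values are illustrative of a 13B-class model in fp16, not taken from the article:

```python
# Back-of-the-envelope KV-cache size for one request.
# Parameters are illustrative (roughly a 13B-class model in fp16).
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, dtype_bytes=2):
    # factor of 2 = one tensor for keys, one for values
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

gb = kv_cache_bytes(40, 40, 128, 2048) / 1024**3
print(f"{gb:.2f} GiB")   # ~1.56 GiB for a single 2048-token sequence
```

At that rate, a few dozen concurrent long requests can consume more GPU memory than the model weights themselves.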
Traditional KV‑cache allocation
Most inference servers allocate a fixed rectangular block per request based on (batch_size, max_seq_len). This static layout creates internal and external fragmentation:
Prefill KV‑cache: always used.
Reserved decode KV‑cache that is only partially used (reservation fragmentation).
Reserved decode KV‑cache that is never used (internal fragmentation).
External fragments: non‑contiguous gaps between allocations that cannot be reused by other requests.
Fragmentation wastes GPU memory and limits throughput.
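A small calculation shows how severe this waste can be. Assuming a static layout where every request reserves `max_seq_len` slots (the scenario described above; the helper name is ours):

```python
# Illustrative waste under static (batch, max_seq_len) allocation:
# every request reserves max_seq_len slots regardless of actual use.
def wasted_fraction(actual_lens, max_seq_len):
    reserved = len(actual_lens) * max_seq_len
    used = sum(actual_lens)
    return 1 - used / reserved

# three requests that actually use 100, 500, and 1300 tokens
print(wasted_fraction([100, 500, 1300], 2048))  # ~0.69
```

Roughly 69% of the reserved KV‑cache memory sits idle in this example, which matches the order of magnitude of waste reported for static allocation schemes.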
PagedAttention principle
Operating‑system virtual memory
OS paging divides physical memory into fixed‑size pages and maps virtual pages to physical frames via a page table, eliminating fragmentation.
PagedAttention mechanics
vLLM treats each request as a process, logical KV blocks as virtual pages, and physical KV blocks as GPU frames. A block table maps logical blocks to physical blocks.
Single‑request flow :
During prefill, the prompt is split into logical blocks of size B (e.g., B=4). A 7‑token prompt creates two logical blocks.
Logical blocks are mapped to physical blocks; filled slots are recorded.
During decode, attention operates on the logical view while the block table fetches the underlying physical data.
When a logical block becomes full, a new logical block and a corresponding physical block are allocated.
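The single‑request flow above can be sketched with a minimal block table. The class and method names here are hypothetical, not vLLM's actual implementation; the point is the mapping from logical slots to physical block ids and the lazy allocation on block boundaries:

```python
# Minimal sketch of a block table mapping logical to physical KV blocks
# (hypothetical names; vLLM's real block manager is more elaborate).
BLOCK_SIZE = 4

class BlockTable:
    def __init__(self, free_physical):
        self.free = free_physical       # pool of free physical block ids
        self.mapping = []               # logical block index -> physical id
        self.filled = 0                 # tokens written so far

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.filled % BLOCK_SIZE == 0:
            self.mapping.append(self.free.pop())
        self.filled += 1

table = BlockTable(free_physical=[7, 3, 5, 1])
for _ in range(7):                      # 7-token prompt, B = 4
    table.append_token()
print(table.mapping)                    # two physical blocks, e.g. [1, 5]
```

Note that the two physical blocks need not be contiguous: the table, like an OS page table, hides the physical layout from the attention computation.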
Multi‑request flow :
Requests with identical KV data share the same physical blocks, tracked with a reference count.
When generated tokens diverge, a copy‑on‑write creates new physical blocks for the differing tokens.
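The sharing and copy‑on‑write behavior can be sketched as pure refcount bookkeeping (no tensor data; `BlockManager` and its methods are illustrative names, not vLLM's API):

```python
# Sketch of copy-on-write sharing for physical KV blocks
# (refcount bookkeeping only; names are hypothetical).
class BlockManager:
    def __init__(self):
        self.refcount = {}              # physical block id -> ref count
        self.next_id = 0

    def allocate(self):
        bid = self.next_id
        self.next_id += 1
        self.refcount[bid] = 1
        return bid

    def share(self, bid):
        self.refcount[bid] += 1         # another request reuses the block

    def write(self, bid):
        # Copy-on-write: if the block is shared, copy before mutating.
        if self.refcount[bid] > 1:
            self.refcount[bid] -= 1
            return self.allocate()      # private copy for the writer
        return bid

mgr = BlockManager()
shared = mgr.allocate()
mgr.share(shared)                       # two requests share one prompt block
private = mgr.write(shared)             # divergent token -> new block
print(shared != private, mgr.refcount)  # True {0: 1, 1: 1}
```

Only the block holding the diverging token is copied; all earlier prompt blocks remain shared.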
PagedAttention in decoding scenarios
Parallel sampling
When the same prompt is sampled multiple times, traditional KV‑cache allocates separate space for each copy. PagedAttention shares the physical KV blocks for the identical prompt tokens, reducing memory usage. During decode, divergent tokens trigger copy‑on‑write, creating new physical blocks only for the differing tokens.
Beam search
Beam search expands multiple candidate sequences. PagedAttention keeps a shared logical block for the prompt (block 0) and creates new logical blocks for each beam’s tokens. When a beam is pruned, its logical and physical blocks are released, freeing GPU memory.
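Releasing a pruned beam's blocks is again reference counting: decrement every block the beam touched and reclaim those that reach zero. A minimal sketch (the helper and the refcount layout are ours, not vLLM's):

```python
# Sketch of freeing a pruned beam's blocks via reference counting
# (hypothetical helper; the shared prompt block survives the prune).
def free_beam(refcount, beam_blocks):
    """Decrement each block; reclaim those that reach zero."""
    freed = []
    for bid in beam_blocks:
        refcount[bid] -= 1
        if refcount[bid] == 0:
            del refcount[bid]
            freed.append(bid)
    return freed

# prompt block 0 is shared by two beams; blocks 1 and 2 belong to one beam
refcount = {0: 2, 1: 1, 2: 1}
print(free_beam(refcount, [0, 1, 2]))   # [1, 2] reclaimed; block 0 kept
print(refcount)                         # {0: 1}
```

The shared prompt block survives because the surviving beam still references it, while the pruned beam's private blocks return to the free pool immediately.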
Scheduling and preemption
General principle
First‑Come‑First‑Serve (FCFS) ordering of incoming requests.
If GPU memory becomes scarce, later requests are preempted to free space for earlier ones.
Handling preempted requests
When a request is preempted, vLLM swaps its entire KV‑cache from GPU to CPU memory (all‑or‑nothing strategy). Once GPU memory is sufficient, the cached blocks are swapped back and computation resumes.
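The all‑or‑nothing swap can be sketched as moving a request's entire block set between two stores (plain dicts stand in for GPU and CPU memory; function names are illustrative):

```python
# Sketch of all-or-nothing preemption by swapping KV blocks
# (dicts stand in for GPU/CPU block storage; names are hypothetical).
def swap_out(gpu, cpu, request_id):
    # Move every KV block of the preempted request to CPU memory.
    cpu[request_id] = gpu.pop(request_id)

def swap_in(gpu, cpu, request_id):
    # Restore all blocks once GPU memory is available again.
    gpu[request_id] = cpu.pop(request_id)

gpu = {"req1": ["blk0", "blk1"], "req2": ["blk2"]}
cpu = {}
swap_out(gpu, cpu, "req2")              # preempt the later request (FCFS)
swap_in(gpu, cpu, "req2")               # resume once memory frees up
print("req2" in gpu and not cpu)        # True
```

Swapping the whole request at once keeps the bookkeeping simple: a request is either fully resident on GPU and runnable, or fully swapped out and paused, never half‑resident.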
Distributed management
In multi‑GPU setups a central Scheduler maintains block tables for each device and broadcasts them to workers. Each worker’s cache engine manages its local KV blocks. In tensor‑parallel deployments (e.g., Megatron‑LM) all GPUs share the same logical‑to‑physical mapping but store different slices of the KV cache.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
