How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference
This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.
Introduction
vLLM uses PagedAttention, a technique inspired by operating‑system virtual‑memory paging, to dynamically allocate KV‑cache on GPU and reduce memory fragmentation.
LLM inference stages
Inference is split into prefill (forward pass on the whole prompt) and decode (token‑by‑token generation). KV‑cache stores key/value pairs for each token so later attention can reuse them. During prefill the KV pairs are written to cache_k and cache_v; during decode each new token’s KV pair is appended, causing the cache to grow and dominate latency.
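The prefill/decode split above can be sketched in a few lines. This is a toy illustration (random values stand in for real attention outputs; shapes and function names are hypothetical, not vLLM internals) showing how the cache grows by one row per decoded token:

```python
# Toy sketch of KV-cache growth across prefill and decode
# (hypothetical shapes; real caches are per-layer GPU tensors).
import numpy as np

HEAD_DIM = 4

def prefill(prompt_len):
    """Compute K/V for every prompt token in one forward pass."""
    cache_k = np.random.rand(prompt_len, HEAD_DIM)
    cache_v = np.random.rand(prompt_len, HEAD_DIM)
    return cache_k, cache_v

def decode_step(cache_k, cache_v):
    """Append the new token's K/V pair; the cache grows by one row."""
    new_k = np.random.rand(1, HEAD_DIM)
    new_v = np.random.rand(1, HEAD_DIM)
    return np.vstack([cache_k, new_k]), np.vstack([cache_v, new_v])

cache_k, cache_v = prefill(7)          # 7-token prompt -> 7 cached rows
for _ in range(3):                     # generate 3 tokens
    cache_k, cache_v = decode_step(cache_k, cache_v)
print(cache_k.shape)                   # (10, 4)
```

Because decode appends one KV row per generated token per layer, the cache, not the model weights, becomes the variable memory cost per request.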
Key observations:
KV‑cache size grows with prompt length and concurrent request count, stressing GPU memory.
The output sequence length is unknown beforehand, making static allocation inefficient.
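To make the first observation concrete, here is a back-of-the-envelope size estimate. The formula is standard (2 tensors for K and V, per layer, per head); the parameter values are illustrative of a 13B-class model in fp16, not taken from the article:

```python
# Back-of-the-envelope KV-cache size for one request.
# Parameters are illustrative (roughly a 13B-class model in fp16).
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, dtype_bytes=2):
    # factor of 2 = one tensor for keys, one for values
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

gb = kv_cache_bytes(40, 40, 128, 2048) / 1024**3
print(f"{gb:.2f} GiB")   # ~1.56 GiB for a single 2048-token sequence
```

At that rate, a few dozen concurrent long requests can consume more GPU memory than the model weights themselves.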
Traditional KV‑cache allocation
Most inference servers allocate a fixed rectangular block per request based on (batch_size, max_seq_len). This static layout creates internal and external fragmentation:
Prefill KV‑cache: always used.
Reserved decode KV‑cache that is only partially used (reservation fragmentation).
Reserved decode KV‑cache that is never used (internal fragmentation).
External fragments: non‑contiguous gaps between allocations that cannot be reused by other requests.
Fragmentation wastes GPU memory and limits throughput.
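A small calculation shows how severe this waste can be. Assuming a static layout where every request reserves `max_seq_len` slots (the scenario described above; the helper name is ours):

```python
# Illustrative waste under static (batch, max_seq_len) allocation:
# every request reserves max_seq_len slots regardless of actual use.
def wasted_fraction(actual_lens, max_seq_len):
    reserved = len(actual_lens) * max_seq_len
    used = sum(actual_lens)
    return 1 - used / reserved

# three requests that actually use 100, 500, and 1300 tokens
print(wasted_fraction([100, 500, 1300], 2048))  # ~0.69
```

Roughly 69% of the reserved KV‑cache memory sits idle in this example, which matches the order of magnitude of waste reported for static allocation schemes.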
PagedAttention principle
Operating‑system virtual memory
OS paging divides physical memory into fixed‑size pages and maps virtual pages to physical frames via a page table, eliminating fragmentation.
PagedAttention mechanics
vLLM treats each request as a process, logical KV blocks as virtual pages, and physical KV blocks as GPU frames. A block table maps logical blocks to physical blocks.
Single‑request flow :
During prefill, the prompt is split into logical blocks of size B (e.g., B=4). A 7‑token prompt creates two logical blocks.
Logical blocks are mapped to physical blocks; filled slots are recorded.
During decode, attention operates on the logical view while the block table fetches the underlying physical data.
When a logical block becomes full, a new logical block and a corresponding physical block are allocated.
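The single‑request flow above can be sketched with a minimal block table. The class and method names here are hypothetical, not vLLM's actual implementation; the point is the mapping from logical slots to physical block ids and the lazy allocation on block boundaries:

```python
# Minimal sketch of a block table mapping logical to physical KV blocks
# (hypothetical names; vLLM's real block manager is more elaborate).
BLOCK_SIZE = 4

class BlockTable:
    def __init__(self, free_physical):
        self.free = free_physical       # pool of free physical block ids
        self.mapping = []               # logical block index -> physical id
        self.filled = 0                 # tokens written so far

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.filled % BLOCK_SIZE == 0:
            self.mapping.append(self.free.pop())
        self.filled += 1

table = BlockTable(free_physical=[7, 3, 5, 1])
for _ in range(7):                      # 7-token prompt, B = 4
    table.append_token()
print(table.mapping)                    # two physical blocks, e.g. [1, 5]
```

Note that the two physical blocks need not be contiguous: the table, like an OS page table, hides the physical layout from the attention computation.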
Multi‑request flow :
Requests with identical KV data share the same physical blocks, tracked with a reference count.
When generated tokens diverge, a copy‑on‑write creates new physical blocks for the differing tokens.
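The sharing and copy‑on‑write behavior can be sketched as pure refcount bookkeeping (no tensor data; `BlockManager` and its methods are illustrative names, not vLLM's API):

```python
# Sketch of copy-on-write sharing for physical KV blocks
# (refcount bookkeeping only; names are hypothetical).
class BlockManager:
    def __init__(self):
        self.refcount = {}              # physical block id -> ref count
        self.next_id = 0

    def allocate(self):
        bid = self.next_id
        self.next_id += 1
        self.refcount[bid] = 1
        return bid

    def share(self, bid):
        self.refcount[bid] += 1         # another request reuses the block

    def write(self, bid):
        # Copy-on-write: if the block is shared, copy before mutating.
        if self.refcount[bid] > 1:
            self.refcount[bid] -= 1
            return self.allocate()      # private copy for the writer
        return bid

mgr = BlockManager()
shared = mgr.allocate()
mgr.share(shared)                       # two requests share one prompt block
private = mgr.write(shared)             # divergent token -> new block
print(shared != private, mgr.refcount)  # True {0: 1, 1: 1}
```

Only the block holding the diverging token is copied; all earlier prompt blocks remain shared.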
PagedAttention in decoding scenarios
Parallel sampling
When the same prompt is sampled multiple times, traditional KV‑cache allocates separate space for each copy. PagedAttention shares the physical KV blocks for the identical prompt tokens, reducing memory usage. During decode, divergent tokens trigger copy‑on‑write, creating new physical blocks only for the differing tokens.
Beam search
Beam search expands multiple candidate sequences. PagedAttention keeps a shared logical block for the prompt (block 0) and creates new logical blocks for each beam’s tokens. When a beam is pruned, its logical and physical blocks are released, freeing GPU memory.
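Releasing a pruned beam's blocks is again reference counting: decrement every block the beam touched and reclaim those that reach zero. A minimal sketch (the helper and the refcount layout are ours, not vLLM's):

```python
# Sketch of freeing a pruned beam's blocks via reference counting
# (hypothetical helper; the shared prompt block survives the prune).
def free_beam(refcount, beam_blocks):
    """Decrement each block; reclaim those that reach zero."""
    freed = []
    for bid in beam_blocks:
        refcount[bid] -= 1
        if refcount[bid] == 0:
            del refcount[bid]
            freed.append(bid)
    return freed

# prompt block 0 is shared by two beams; blocks 1 and 2 belong to one beam
refcount = {0: 2, 1: 1, 2: 1}
print(free_beam(refcount, [0, 1, 2]))   # [1, 2] reclaimed; block 0 kept
print(refcount)                         # {0: 1}
```

The shared prompt block survives because the surviving beam still references it, while the pruned beam's private blocks return to the free pool immediately.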
Scheduling and preemption
General principle
First‑Come‑First‑Serve (FCFS) ordering of incoming requests.
If GPU memory becomes scarce, later requests are preempted to free space for earlier ones.
Handling preempted requests
When a request is preempted, vLLM swaps its entire KV‑cache from GPU to CPU memory (all‑or‑nothing strategy). Once GPU memory is sufficient, the cached blocks are swapped back and computation resumes.
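The all‑or‑nothing swap can be sketched as moving a request's entire block set between two stores (plain dicts stand in for GPU and CPU memory; function names are illustrative):

```python
# Sketch of all-or-nothing preemption by swapping KV blocks
# (dicts stand in for GPU/CPU block storage; names are hypothetical).
def swap_out(gpu, cpu, request_id):
    # Move every KV block of the preempted request to CPU memory.
    cpu[request_id] = gpu.pop(request_id)

def swap_in(gpu, cpu, request_id):
    # Restore all blocks once GPU memory is available again.
    gpu[request_id] = cpu.pop(request_id)

gpu = {"req1": ["blk0", "blk1"], "req2": ["blk2"]}
cpu = {}
swap_out(gpu, cpu, "req2")              # preempt the later request (FCFS)
swap_in(gpu, cpu, "req2")               # resume once memory frees up
print("req2" in gpu and not cpu)        # True
```

Swapping the whole request at once keeps the bookkeeping simple: a request is either fully resident on GPU and runnable, or fully swapped out and paused, never half‑resident.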
Distributed management
In multi‑GPU setups a central Scheduler maintains block tables for each device and broadcasts them to workers. Each worker’s cache engine manages its local KV blocks. In tensor‑parallel deployments (e.g., Megatron‑LM) all GPUs share the same logical‑to‑physical mapping but store different slices of the KV cache.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
