vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference

This article walks through a two‑month source‑code study of vLLM, explaining how token‑level scheduling, continuous batching, and the Paged Attention mechanism reshape tensor dimensions to turn large‑model inference into a compute‑bound, high‑throughput process while managing GPU memory efficiently.

Tencent Technical Engineering

Overview

vLLM implements high‑throughput inference for decoder‑only LLMs (e.g., Llama‑3) by moving the scheduling granularity from request level to token level, allocating KV‑Cache in fixed‑size blocks, and fusing key kernels such as attention and linear projections.

Continuous Batching (Token‑Level Scheduling)

Each request tracks two counters:

num_computed_tokens: tokens already processed (including prefix‑cache hits).

num_tokens: total tokens in the request (prompt + generated output).

The scheduler repeatedly selects a set of tokens to compute such that num_computed_tokens catches up to num_tokens while respecting four hard limits:

Maximum concurrent requests: self.max_num_running_reqs = scheduler_config.max_num_seqs

Token budget per step: self.max_num_scheduled_tokens = scheduler_config.max_num_scheduled_tokens or scheduler_config.max_num_batched_tokens

Model maximum sequence length: self.max_model_len = model_config.max_model_len

Availability of free KV‑Cache blocks.

All selected tokens are flattened into a single tensor [num_sched_tokens, …], eliminating padding and allowing the same GEMM kernels to be reused across requests.
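
The budget loop described above can be sketched roughly as follows. This is a simplified illustration rather than vLLM's scheduler: the Request dataclass and the schedule_step function are hypothetical names, and the KV‑Cache block check and the max_model_len check are left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    num_tokens: int                 # prompt + generated tokens so far
    num_computed_tokens: int = 0    # tokens already processed


def schedule_step(requests, max_num_running_reqs, max_num_scheduled_tokens):
    """Decide how many tokens each request computes in this step (hypothetical helper)."""
    budget = max_num_scheduled_tokens
    scheduled = {}
    for req in requests[:max_num_running_reqs]:
        if budget == 0:
            break
        remaining = req.num_tokens - req.num_computed_tokens
        n = min(remaining, budget)  # a long prompt is cut to fit the budget
        if n > 0:
            scheduled[req.req_id] = n
            budget -= n
    return scheduled


reqs = [
    Request("a", num_tokens=10),                          # new prompt
    Request("b", num_tokens=33, num_computed_tokens=32),  # decoding, 1 token
    Request("c", num_tokens=100),                         # long prompt
]
plan = schedule_step(reqs, max_num_running_reqs=8, max_num_scheduled_tokens=16)
print(plan)  # {'a': 10, 'b': 1, 'c': 5}
```

The 16 scheduled tokens come from three different requests in different phases, yet they form one flat batch with no padding.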

Paged Attention and Block Table

At startup vLLM allocates the KV‑Cache with shape

[num_layers, 2, num_blocks, block_size, num_kv_heads, head_dim]

where the dimension of size 2 separates keys from values. A block_table of shape [max_num_reqs, max_num_blocks_per_req] stores, for each request, the physical block IDs assigned to it.

Example (block size = 16): token 25 belongs to virtual block 1 (since 25 // 16 = 1). The block table entry for request 0 at index 1 is 8, so the final slot index is

slot_idx = block_idx * block_size + block_offset
# block_idx = 8, block_offset = 25 % 16 = 9
# slot_idx = 8 * 16 + 9 = 137

This virtual‑page‑table mechanism removes the need for one large contiguous KV allocation per request, so memory is not reserved up front for tokens that may never be generated, and waste is limited to the last, partially filled block of each request. The trade‑off is occasional uncoalesced accesses when an attention kernel reads across block boundaries, which slightly reduces L2 cache hit rate.
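
The lookup in the example generalises to any request and token position. The sketch below assumes the block table is a plain nested list; slot_index is a hypothetical helper name, not a vLLM function.

```python
def slot_index(block_table, req_idx, token_pos, block_size):
    """Map a token position inside a request to its physical KV-Cache slot."""
    logical_block = token_pos // block_size
    block_offset = token_pos % block_size
    physical_block = block_table[req_idx][logical_block]
    return physical_block * block_size + block_offset


# Physical blocks of one request need not be contiguous or ordered.
block_table = [
    [3, 8, 5],  # request 0
    [1, 9, 2],  # request 1
]
print(slot_index(block_table, req_idx=0, token_pos=25, block_size=16))  # 137
print(slot_index(block_table, req_idx=1, token_pos=16, block_size=16))  # 144
```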

Attention Optimisation – FlashAttention & Online Softmax

Standard attention materialises large intermediate matrices (scores and softmax probabilities) in HBM, and reading and writing them quickly hits the memory wall. FlashAttention fuses the entire attention flow into a single kernel that processes the matrix in tiles. A conventional softmax needs the maximum and the sum of a whole row before it can produce any output; to avoid that global reduction, the kernel uses an Online Softmax that maintains a running maximum and a running sum, rescaling the partial result whenever a new tile raises the maximum. This eliminates the need to store the full intermediate matrix in HBM.
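
To make the running maximum and sum concrete, here is a minimal pure‑Python sketch for a single query row with scalar values. It only illustrates the arithmetic, not a GPU kernel; the function names are made up for this example.

```python
import math


def attention_reference(scores, values):
    """Two-pass softmax-weighted sum that needs the whole score row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores, values, tile_size):
    """Single pass over tiles, keeping only a running max, sum and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile_size):
        tile_s = scores[start:start + tile_size]
        tile_v = values[start:start + tile_size]
        new_max = max(running_max, max(tile_s))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(tile_s, tile_v):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ref = attention_reference(scores, values)
out = attention_online(scores, values, tile_size=3)
print(math.isclose(ref, out, rel_tol=1e-9))
```

Only three scalars per row survive between tiles, which is why the full score matrix never has to be written out.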

Prefill vs. Decode – Compute‑Bound vs. Memory‑Bound

During a Prefill step many tokens are processed at once; QKV projection, attention, O‑projection and MLP are dense GEMM operations, making the step compute‑bound. In a Decode step each request contributes only one new token (query_lens = 1), so for a single request the same operations degenerate into GEMV and the step becomes memory‑bound: the model weights and the request's KV‑Cache must be streamed from HBM to SRAM for every generated token.

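Continuous Batching restores GEMM intensity in Decode by grouping many requests at the token level, so the linear layers reuse the same high‑throughput kernels and each weight read from HBM is amortised over the whole batch. Attention still has to read each request's own KV‑Cache, so that part remains sensitive to memory bandwidth.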
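
A rough roofline‑style estimate shows why batching helps the linear layers. The sketch assumes 2‑byte (fp16/bf16) elements and counts only one read of the weights and inputs and one write of the outputs, ignoring caches; the function name is hypothetical.

```python
def linear_arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W with x: [num_tokens, d_in]."""
    flops = 2 * num_tokens * d_in * d_out
    elems_moved = d_in * d_out + num_tokens * d_in + num_tokens * d_out
    return flops / (bytes_per_elem * elems_moved)


# One decode token alone vs. many decode tokens batched together.
for num_tokens in (1, 16, 256):
    intensity = linear_arithmetic_intensity(num_tokens, 4096, 4096)
    print(num_tokens, round(intensity, 1))
```

With a single token the layer performs about one FLOP per byte, because the weight matrix dominates the traffic; the ratio grows almost linearly with the number of batched tokens until activations start to matter.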

Fused Linear Layers

vLLM concatenates the Q, K and V weights into one matrix, and the two FFN projections (Gate and Up) into another, along the column (output) dimension. Three attention projections become a single wide GEMM, and the two FFN projections become another. For Llama‑3‑8B the fused QKV width is 6144 (4096 + 1024 + 1024) and the fused Gate/Up width is 28672 (2 × 14336). This reduces kernel launch overhead and improves memory‑bandwidth utilisation, because the input activations are read once instead of once per projection.
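
The equivalence between the fused and separate projections can be checked on a toy example. This is a pure‑Python sketch with nested lists, not vLLM code; matmul and concat_columns are helper names invented here, and the Llama‑3‑8B head counts are the publicly documented configuration values.

```python
def matmul(x, w):
    """x: [tokens, d_in], w: [d_in, d_out], both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def concat_columns(*weights):
    """Concatenate weight matrices along the output (column) dimension."""
    fused = []
    for rows in zip(*weights):  # one row from each matrix
        fused_row = []
        for r in rows:
            fused_row.extend(r)
        fused.append(fused_row)
    return fused


x = [[1, 2], [3, 4]]          # two tokens, d_in = 2
wq = [[1, 0], [0, 1]]         # d_out = 2
wk = [[2], [3]]               # d_out = 1
wv = [[-1], [1]]              # d_out = 1

qkv = matmul(x, concat_columns(wq, wk, wv))  # one wide GEMM
q = [row[:2] for row in qkv]
k = [row[2:3] for row in qkv]
v = [row[3:4] for row in qkv]
print(q == matmul(x, wq), k == matmul(x, wk), v == matmul(x, wv))

# Width arithmetic for Llama-3-8B.
hidden_size, num_heads, num_kv_heads, intermediate_size = 4096, 32, 8, 14336
head_dim = hidden_size // num_heads
qkv_width = (num_heads + 2 * num_kv_heads) * head_dim
gate_up_width = 2 * intermediate_size
print(qkv_width, gate_up_width)  # 6144 28672
```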

RMSNorm and Activation

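RMSNorm simplifies LayerNorm by dropping the mean subtraction and the bias term: each token's feature vector is divided by its root mean square and multiplied by a learned per‑feature weight, which keeps the activation scale stable in deep networks at lower cost. The FFN activation is SiLU applied to the Gate projection, multiplied element‑wise by the Up projection. It is implemented in a single silu_and_mul kernel, so the intermediate SiLU output stays on chip instead of making a round trip through HBM.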
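
As a reference for the arithmetic only, both operations can be written in a few lines of pure Python. The eps default and the [gate | up] input layout are assumptions made for this sketch.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale one feature vector by the inverse of its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    return v / (1.0 + math.exp(-v))


def silu_and_mul(gate_up):
    """Input is [gate | up] concatenated; output is silu(gate) * up."""
    d = len(gate_up) // 2
    return [silu(g) * u for g, u in zip(gate_up[:d], gate_up[d:])]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
print(sum(v * v for v in normed) / len(normed))  # mean square is ~1 after normalisation
print(silu_and_mul([0.0, 1.0, 5.0, 2.0]))        # half as wide as the input
```

The output of silu_and_mul is half as wide as its input, which matches the fused Gate/Up projection producing both halves in one GEMM.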

Sampling Pipeline

The sampler first snapshots raw logits with log_softmax if log‑probs are requested. Greedy requests perform an argmax before temperature scaling; all other requests apply temperature scaling, optional min_p filtering, top‑k/top‑p truncation, and finally Gumbel‑Max sampling. The selected token ID is appended to the input sequence for the next iteration.
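
A simplified version of this pipeline is sketched below. It covers only the greedy path, temperature scaling, top‑k and Gumbel‑Max; min_p, top‑p and log‑probs are omitted, and sample_token is a hypothetical function, not the vLLM sampler.

```python
import math
import random


def sample_token(logits, temperature, top_k=None, rng=random):
    """Return one token ID from a list of logits."""
    if temperature == 0.0:
        # Greedy: plain argmax, no scaling or noise.
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [l / temperature for l in logits]

    if top_k is not None and top_k < len(scaled):
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]

    def gumbel():
        u = max(rng.random(), 1e-300)  # guard against log(0)
        return -math.log(-math.log(u))

    # Gumbel-Max: argmax(logit + Gumbel noise) draws from softmax(logits).
    noisy = [s + gumbel() for s in scaled]
    return max(range(len(noisy)), key=lambda i: noisy[i])


logits = [0.1, 2.0, -1.0]
print(sample_token(logits, temperature=0.0))                                   # 1
print(sample_token(logits, temperature=1.0, top_k=1, rng=random.Random(0)))    # 1
```

Gumbel‑Max replaces an explicit softmax, cumulative sum and search with a single argmax, which maps well to a batched GPU operation.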

Preemption Strategy

If the KV‑Cache runs out of free blocks, the scheduler preempts the lowest‑priority RUNNING request, frees its cache blocks, resets num_computed_tokens to 0 and moves it back to the WAITING queue. In v1 the only preemption mode is RECOMPUTE (no swap to host RAM), because re‑running prefill for the preempted request is cheaper than moving its KV blocks across PCIe.
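
The recompute path can be sketched as below. The sketch assumes the running list is ordered so that the lowest‑priority request is last and that a preempted request goes to the front of the waiting queue; the Request fields and preempt_one are illustrative names, not vLLM's classes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    num_computed_tokens: int = 0
    block_ids: list = field(default_factory=list)


def preempt_one(running, waiting, free_blocks):
    """Evict the last (lowest-priority) running request, RECOMPUTE style."""
    victim = running.pop()
    free_blocks.extend(victim.block_ids)  # return its KV blocks to the pool
    victim.block_ids = []
    victim.num_computed_tokens = 0        # its tokens will be recomputed later
    waiting.appendleft(victim)            # assumed: retried before newer arrivals
    return victim


running = [Request("a", 40, [0, 1, 2]), Request("b", 20, [3, 4])]
waiting = deque([Request("c")])
free_blocks = []

victim = preempt_one(running, waiting, free_blocks)
print(victim.req_id, free_blocks, [r.req_id for r in waiting])  # b [3, 4] ['b', 'c']
```

No KV data is copied anywhere: the freed blocks are immediately reusable, and the cost is paid later as an extra prefill for the preempted request.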

Key Takeaways

Token‑level scheduling (continuous batching) batches Decode tokens across requests, moving the linear layers of Decode from memory‑bound GEMV towards compute‑bound GEMM.

Paged Attention’s block‑table layout maximises GPU memory utilisation while incurring modest bandwidth penalties from occasional uncoalesced accesses.

Fused QKV and Gate/Up projections reduce kernel launches and improve throughput.

FlashAttention with Online Softmax avoids materialising large intermediate tensors in HBM, easing the memory wall.

Preemption prefers recompute over swapping to avoid PCIe bottlenecks.

References

Attention Is All You Need – https://arxiv.org/abs/1706.03762

Efficient Memory Management for Large Language Model Serving with PagedAttention – https://arxiv.org/abs/2309.06180

FlashAttention: Fast and Memory‑Efficient Exact Attention with IO‑Awareness – https://arxiv.org/abs/2205.14135

Flash‑Decoding for Long‑Context Inference – https://crfm.stanford.edu/2023/10/12/flashdecoding.html

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills – https://arxiv.org/abs/2308.16369
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
