Why Separate Prefill and Decode? A Deep Dive into DistServe’s Split Inference Architecture
This article explores the two‑stage LLM inference pipeline, introduces the TTFT and TPOT metrics, explains the motivation for prefill–decode separation, presents experimental comparisons between split and merged architectures, and details optimization techniques and parallel‑strategy modeling for DistServe.
LLM Inference Stages and Evaluation Metrics
LLM inference consists of two sequential stages:
Prefill: the entire prompt is fed to the model to produce the first token.
Decode: subsequent tokens are generated one by one.
Performance is measured by:
TTFT (Time To First Token) – latency of the prefill stage.
TPOT (Time Per Output Token) – per‑token latency of the decode stage.
Practitioners often define Service Level Objectives (SLOs), e.g. P90 TTFT ≤ 0.4 s and P90 TPOT ≤ 0.04 s, and use the maximum request rate that satisfies both as the system’s goodput.
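As a concrete illustration of how these SLOs can be checked, here is a minimal sketch (an assumed helper, not part of DistServe) that tests whether a measured trace at a given request rate satisfies both P90 targets; goodput is then the highest rate that still passes.

```python
# A small sketch of checking a measured latency trace against the P90 SLOs;
# goodput is the highest request rate whose trace still passes this check.
import numpy as np

def meets_slos(ttfts, tpots, ttft_slo=0.4, tpot_slo=0.04, percentile=90):
    """Return True if the P90 TTFT and P90 TPOT both stay within their SLOs."""
    return (np.percentile(ttfts, percentile) <= ttft_slo and
            np.percentile(tpots, percentile) <= tpot_slo)

# Example with made-up latency samples (seconds):
ttfts = [0.30, 0.33, 0.36, 0.39]
tpots = [0.030, 0.033, 0.036, 0.039]
print(meets_slos(ttfts, tpots))  # True: both P90 values are within the SLOs here
```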
Why Split Prefill and Decode (PD Separation)
In monolithic servers (e.g., vLLM) prefill and decode share the same GPU resources and alternate based on request state. DistServe proposes a split architecture where dedicated prefill instances and decode instances run on separate GPU groups and exchange KV‑Cache after prefill finishes.
Although splitting appears to double memory usage and add KV‑Cache transfer overhead, experiments show a substantial throughput gain.
Experimental Comparison
Hardware: 1× NVIDIA A100 80 GB, 13 B model, input length = 512, output length = 64.
SLOs: P90 TTFT = 0.4 s, P90 TPOT = 0.04 s.
When prefill and decode are merged (single GPU handling both), the system meets both SLOs up to 1.6 rps (goodput = 1.6). When the GPU is dedicated to only prefill or only decode, goodput rises to 5.6 rps and 10 rps respectively. Assuming negligible KV‑Cache transfer, three GPUs allocated as two prefill instances and one decode instance sustain 10 rps (≈3.3 rps per GPU), more than twice the merged architecture’s throughput.
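To make the arithmetic explicit, the short sketch below reproduces the per‑GPU goodput comparison; the rates are the measured values quoted above, and KV‑Cache transfer cost is assumed negligible, as stated.

```python
# Per-GPU goodput arithmetic for the merged vs. split allocations above.
merged_goodput_per_gpu = 1.6           # 1 GPU handling both prefill and decode
prefill_rate, decode_rate = 5.6, 10.0  # rates when a GPU is dedicated to one stage

n_prefill, n_decode = 2, 1
split_goodput = min(n_prefill * prefill_rate, n_decode * decode_rate)  # bottleneck stage
split_goodput_per_gpu = split_goodput / (n_prefill + n_decode)

print(split_goodput_per_gpu)  # ~3.33 rps per GPU, more than 2x the merged 1.6 rps
```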
Optimization Directions for the Split Architecture
Independent Compute and Storage
Prefill is compute‑bound (high FLOPs) while decode is memory‑bound (frequent KV‑Cache reads). Separate GPU pools allow allocating compute‑heavy resources to prefill and memory‑heavy resources to decode.
Separate Batching Strategies
Increasing batch size yields diminishing returns for prefill (compute‑bound) but improves decode throughput (memory‑bound). Therefore distinct batching policies are beneficial.
Parallel‑Strategy Optimization
Because prefill and decode no longer share a model replica, their parallelism can be tuned independently:
Tensor‑parallelism (TP) splits the model’s attention heads across GPUs.
Pipeline‑parallelism (PP) splits transformer layers across GPUs.
DistServe explores configurations such as 2‑way PP for prefill and 2‑way TP for decode.
Prefill Parallelism Modeling
Using an M/D/1 queue (Poisson arrivals at rate R, deterministic service time D per prefill), the average TTFT on a single GPU is: Avg_TTFT = D + R·D² / (2·(1 − R·D)), where the first term is the service time itself and the second is the expected queueing delay. Extending to multi‑GPU PP or TP introduces modified service times (D_m, D_s) and a degradation factor K that captures TP communication overhead.
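A minimal sketch of this model follows: it evaluates the M/D/1 average TTFT and binary‑searches the largest arrival rate that keeps the average within a TTFT budget. The example numbers are assumptions for illustration, not measurements from the paper.

```python
# M/D/1 average-TTFT model and a binary search for the highest sustainable rate.
def avg_ttft(R: float, D: float) -> float:
    """Average TTFT for an M/D/1 queue: service time D plus expected queueing delay."""
    rho = R * D
    assert rho < 1.0, "arrival rate must keep the server below saturation"
    return D + (R * D * D) / (2.0 * (1.0 - rho))

def max_rate_under_slo(D: float, ttft_slo: float, iters: int = 50) -> float:
    """Binary-search the highest Poisson rate whose average TTFT stays under the SLO."""
    lo, hi = 0.0, 1.0 / D  # rates at or above 1/D saturate the queue
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if avg_ttft(mid, D) <= ttft_slo:
            lo = mid
        else:
            hi = mid
    return lo

# Example with assumed numbers: 0.2 s prefill service time, 0.4 s TTFT budget.
print(max_rate_under_slo(D=0.2, ttft_slo=0.4))
```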
Decode Parallelism
DistServe relies on empirical simulation for decode: more GPUs improve throughput under PP, while TP reduces latency.
Practical Code: Finding the Best Parallel Strategy
The goal is to maximize goodput for each stage by selecting optimal parallel configurations and the number of instances (placement). Two hardware scenarios are considered:
Scenario 1: high inter‑node bandwidth; KV‑Cache transfer cost is ignored.
Scenario 2: limited bandwidth; prefill and decode instances that exchange KV‑Cache frequently must reside on the same node, forcing matching PP dimensions.
Key variables used in the search:
N: maximum number of nodes an instance may occupy.
M: maximum number of GPUs per node.
C: total GPU memory in the cluster.
W: sampled workload distribution (e.g., input lengths, model sizes).
R: request arrival rate.
best_plm: tuple (n, config_p, m, config_d) yielding the maximal goodput.
config_p: best parallel configuration (TP/PP settings) for a prefill instance.
config_d: best parallel configuration for a decode instance.
n: number of prefill instances (data‑parallel factor).
m: number of decode instances.
The algorithm performs a grid search over n, m, config_p, and config_d, invoking two simulators:
simu_prefill: binary‑searches the highest request rate that satisfies the TTFT SLO under a given config_p and workload W.
simu_decode: the analogous simulation for the TPOT SLO under config_d.
The highest rate that meets both SLOs defines the goodput for that placement; the search returns the placement with maximal goodput.
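The sketch below illustrates this search loop. The simulator signatures, the (tp, pp) configuration representation, and the instance‑count bound are assumptions for illustration, not DistServe’s actual API.

```python
# A simplified sketch of the placement search described above.
from itertools import product

def find_best_placement(configs_p, configs_d, max_instances,
                        simu_prefill, simu_decode):
    """Grid-search instance counts and parallel configs to maximize goodput.

    configs_p / configs_d: candidate (tp, pp) configurations for prefill / decode.
    simu_prefill(config, n) -> max request rate meeting the TTFT SLO with n instances.
    simu_decode(config, m)  -> max request rate meeting the TPOT SLO with m instances.
    """
    best_plm, best_goodput = None, 0.0
    for n, m in product(range(1, max_instances + 1), repeat=2):
        for config_p, config_d in product(configs_p, configs_d):
            rate_p = simu_prefill(config_p, n)   # prefill-side capacity
            rate_d = simu_decode(config_d, m)    # decode-side capacity
            goodput = min(rate_p, rate_d)        # both SLOs must hold
            if goodput > best_goodput:
                best_goodput = goodput
                best_plm = (n, config_p, m, config_d)
    return best_plm, best_goodput
```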
Modeling TTFT and TPOT in Practice
DistServe’s appendix separates the compute‑bound FLOP cost (C1) from the memory‑bound KV‑Cache I/O cost (C2). Because attention runs with FlashAttention, the attention‑score computation is excluded from the FLOP term, so only the remaining GEMM operations contribute to C1. The latency model is:
T1 = C1 · FLOPs
T2 = C2 · IO
TTFT = T1 + T2
Analogous expressions apply to TPOT. Parameters C1 and C2 are fitted to empirical measurements, enabling fast latency estimation without full simulation.
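As a rough illustration of the fitting step, the sketch below estimates C1 and C2 by least squares from a few placeholder measurements; the feature values and helper function are made up for demonstration and are not DistServe’s profiling code.

```python
# Fitting the latency coefficients C1 and C2 from measured prefill latencies.
import numpy as np

# Each row: (GEMM FLOPs, KV-cache bytes moved) for one profiled request;
# the numbers below are placeholders, not real measurements.
features = np.array([
    [1.2e12, 3.0e8],
    [2.4e12, 6.0e8],
    [4.8e12, 1.2e9],
])
measured_latency = np.array([0.11, 0.21, 0.42])  # seconds (placeholder values)

# Solve measured_latency ≈ C1 * FLOPs + C2 * IO in the least-squares sense.
(c1, c2), *_ = np.linalg.lstsq(features, measured_latency, rcond=None)

def predict_ttft(flops: float, io_bytes: float) -> float:
    """Estimate TTFT with the fitted linear model TTFT = C1*FLOPs + C2*IO."""
    return c1 * flops + c2 * io_bytes

print(predict_ttft(3.6e12, 9.0e8))
```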
Final Recommendations
For a 175 B model, DistServe recommends deploying multiple prefill and decode instances with tailored parallelism (e.g., 2‑way PP for prefill instances, 2‑way TP for decode instances) and placing them according to the bandwidth scenario. This configuration achieves the highest goodput while respecting the TTFT and TPOT SLOs.
Code and reproducible experiments are available at https://github.com/LLMServe/DistServe/tree/main/simdistserve.