How PD (Prefill‑Decode) Disaggregation Makes LLM Inference Faster and More Stable
This article explains PD (Prefill-Decode) disaggregation, an architecture that places the compute-bound Prefill stage and the memory-bandwidth-bound Decode stage on separate GPU pools. The separation removes interference between the two stages, lets each pool scale independently, allows hardware to be matched to each stage, and in the cited benchmarks cuts Decode P99 inter-token latency by up to 85%.
Why PD Disaggregation Matters
Large language model (LLM) inference faces a trade‑off between fast response and stable throughput. Traditional deployments mix Prefill (the compute‑intensive stage that processes the entire prompt in parallel) and Decode (the memory‑bandwidth‑intensive stage that generates tokens one by one) on the same GPU pool, causing resource contention and unpredictable tail latency.
Two Fundamental Stages
Prefill processes the input prompt, builds the KV cache, and emits the first token. It is compute-bound: at roughly 2 FLOPs per parameter per token, a 2000-token prompt costs on the order of tens of trillions of FLOPs for a 7B-parameter model.
Decode performs autoregressive generation token by token, reading model weights and the KV cache. It is memory‑bandwidth‑bound, with the bandwidth of the GPU’s HBM becoming the bottleneck.
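A back-of-envelope estimate makes the contrast between the two stages concrete. The sketch below is illustrative only: it assumes a dense 7B-parameter model with 16-bit weights, the common approximation of about 2 FLOPs per parameter per token, and hardware figures of roughly 989 TFLOP/s peak compute and 3.35 TB/s memory bandwidth. None of these constants come from the article, and KV-cache reads and attention costs are ignored.

```python
# Illustrative assumptions (not from the article): 7B dense model, 16-bit
# weights, ~989 TFLOP/s peak compute, 3.35 TB/s memory bandwidth.
N_PARAMS = 7e9
BYTES_PER_PARAM = 2
PEAK_FLOPS = 989e12
MEM_BANDWIDTH = 3.35e12


def prefill_flops(n_params: float, prompt_tokens: int) -> float:
    """Approximate forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params * prompt_tokens


def prefill_time_s(n_params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Lower bound on Prefill time if the GPU ran at peak compute."""
    return prefill_flops(n_params, prompt_tokens) / peak_flops


def decode_step_time_s(n_params: float, bytes_per_param: int, mem_bandwidth: float) -> float:
    """Lower bound on one Decode step: all weights are read once per token."""
    return n_params * bytes_per_param / mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read: (2 * P * T) / (P * bytes) = 2 * T / bytes."""
    return 2.0 * tokens_per_pass / bytes_per_param


# Ridge point of the roofline: above it a kernel is compute-bound,
# below it the kernel is limited by memory bandwidth.
ridge = PEAK_FLOPS / MEM_BANDWIDTH

print(f"prefill FLOPs (2000 tokens): {prefill_flops(N_PARAMS, 2000):.2e}")
print(f"prefill time lower bound:    {prefill_time_s(N_PARAMS, 2000, PEAK_FLOPS) * 1e3:.1f} ms")
print(f"decode step lower bound:     {decode_step_time_s(N_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH) * 1e3:.2f} ms")
print(f"intensity prefill / decode / ridge: "
      f"{arithmetic_intensity(2000):.0f} / {arithmetic_intensity(1):.0f} / {ridge:.0f} FLOPs per byte")
```

Under these assumptions a 2000-token Prefill sits far above the ridge point (compute-bound), while a single-request Decode step sits far below it (bandwidth-bound). Batching Decode requests raises its intensity, but the KV cache then adds memory traffic that this sketch leaves out.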
Problems with Mixed Deployment
Unpredictable tail latency: a heavy Prefill blocks short Decode requests.
Low resource utilization: Prefill needs high compute but modest bandwidth, Decode needs the opposite.
Scaling difficulty: scaling the whole node forces a compromise between compute and bandwidth.
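The tail-latency problem can be illustrated with a toy model. In the sketch below, a Decode step normally takes a fixed time, but in a mixed deployment it occasionally has to wait behind a Prefill running on the same GPU. All numbers (20 ms Decode step, 400 ms Prefill, 5% collision probability) are invented for illustration and are not measurements from the article.

```python
import random


def inter_token_latencies(steps, decode_ms, prefill_ms, prefill_prob, seed=0):
    """Simulate per-token gaps; a gap grows when a Prefill cuts in ahead of Decode."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(steps):
        gap = decode_ms
        if rng.random() < prefill_prob:
            gap += prefill_ms  # the Decode step waits for the Prefill to finish
        gaps.append(gap)
    return gaps


def percentile(values, pct):
    """Nearest-rank style percentile on a sorted copy of the values."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]


# Mixed pool: Prefill and Decode share a GPU, so collisions happen.
mixed = inter_token_latencies(10_000, decode_ms=20.0, prefill_ms=400.0, prefill_prob=0.05)
# Disaggregated pool: Decode GPUs never run Prefill, so no collisions.
separated = inter_token_latencies(10_000, decode_ms=20.0, prefill_ms=400.0, prefill_prob=0.0)

for name, gaps in (("mixed", mixed), ("separated", separated)):
    print(f"{name:>9}: P50={percentile(gaps, 50):.0f} ms  P99={percentile(gaps, 99):.0f} ms")
```

The median is identical in both cases; only the tail differs, which is why the interference tends to show up in P99 metrics and not in averages.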
What PD Disaggregation Does
PD (Prefill‑Decode) Disaggregation physically separates the two stages into dedicated hardware pools, allowing each to run on the most suitable GPU type.
Interference Elimination
Benchmarks show that after applying PD disaggregation:
Decode P99 inter‑token latency drops by 52%–85%.
Tail latency becomes predictable rather than “unacceptable”.
User‑perceived response‑time standard deviation shrinks dramatically.
Independent Scaling
Prefill load scales with input token count, while Decode load scales with output token count and request concurrency. PD lets operators:
Scale the Decode cluster during traffic spikes.
Scale the Prefill cluster for long‑text workloads.
Shrink resources in off‑peak periods to save cost.
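The scaling rules above can be turned into a rough capacity estimate. The following sketch assumes each replica has a known sustainable token throughput and that operators target a utilization ceiling; the function names, throughput figures, and the 70% headroom value are hypothetical and chosen only to show how the two pools respond to different workloads.

```python
import math


def replicas_needed(load_tokens_per_s, per_replica_tokens_per_s, headroom=0.7):
    """Replicas required to keep each one below the utilization ceiling."""
    usable = per_replica_tokens_per_s * headroom
    return max(1, math.ceil(load_tokens_per_s / usable))


def plan_pools(req_per_s, avg_prompt_tokens, avg_output_tokens,
               prefill_tokens_per_s, decode_tokens_per_s):
    """Size the two pools independently from input and output token rates."""
    prefill_load = req_per_s * avg_prompt_tokens   # driven by input length
    decode_load = req_per_s * avg_output_tokens    # driven by output length
    return {
        "prefill": replicas_needed(prefill_load, prefill_tokens_per_s),
        "decode": replicas_needed(decode_load, decode_tokens_per_s),
    }


# Hypothetical per-replica throughputs, for illustration only.
PREFILL_TPS = 20_000
DECODE_TPS = 2_000

chat = plan_pools(50, 500, 300, PREFILL_TPS, DECODE_TPS)       # many short chats
long_doc = plan_pools(5, 8_000, 200, PREFILL_TPS, DECODE_TPS)  # few long documents

print("chat workload:    ", chat)
print("long-doc workload:", long_doc)
```

With these made-up inputs the chat workload needs far more Decode replicas than Prefill replicas, while the long-document workload needs the reverse, which is the asymmetry that a single mixed pool cannot express.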
Hardware Specialization
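Different GPUs suit different stages. The NVIDIA H100 offers very high HBM bandwidth, which suits Decode, while GPUs such as the A100 or L40S can be a cost-effective home for Prefill, which depends on compute throughput more than on memory bandwidth (the L40S, for example, uses GDDR6 with far less bandwidth than HBM). PD lets each stage run on the hardware that fits it best.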
PD Functional Architecture
Router / Scheduler
The router examines request characteristics (prompt length, expected output length), dispatches each new request to a Prefill node, and selects the Decode node that will continue generation. It also handles load balancing, queue management, retries, and timeouts.
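A minimal sketch of the routing decision, assuming every new request is prefilled first and then handed to a Decode node, and that the router balances by outstanding token work. The `Node` and `Router` classes and the least-loaded policy are hypothetical illustrations, not the scheduler of any particular platform; retries, timeouts, and completion accounting are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str
    queued_tokens: int = 0  # outstanding work, measured in tokens


@dataclass
class Router:
    prefill_nodes: List[Node]
    decode_nodes: List[Node]

    def route(self, prompt_tokens: int, expected_output_tokens: int) -> Tuple[str, str]:
        """Pick the least-loaded node in each pool and record the new work."""
        prefill = min(self.prefill_nodes, key=lambda n: n.queued_tokens)
        decode = min(self.decode_nodes, key=lambda n: n.queued_tokens)
        prefill.queued_tokens += prompt_tokens          # Prefill cost ~ input length
        decode.queued_tokens += expected_output_tokens  # Decode cost ~ output length
        return prefill.name, decode.name


router = Router(
    prefill_nodes=[Node("prefill-0"), Node("prefill-1")],
    decode_nodes=[Node("decode-0"), Node("decode-1")],
)

first = router.route(prompt_tokens=2000, expected_output_tokens=100)
second = router.route(prompt_tokens=100, expected_output_tokens=500)
third = router.route(prompt_tokens=100, expected_output_tokens=50)
print(first, second, third)
```

Tracking load separately per pool means a long prompt only pushes later requests away from the busy Prefill node, without affecting the choice of Decode node.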
Prefill Cluster
Acts as a compute engine. Each node receives the user prompt, runs self‑attention to build the KV cache, emits the first token, and forwards the KV cache and token to the Decode side. Design focus: high compute density and parallel efficiency; batch sizes > 1 dramatically improve Prefill throughput.
Decode Cluster
Acts as a generation engine. Each node receives the KV cache and first token, then performs step‑wise inference, streaming tokens until generation ends. Design focus: maximal HBM bandwidth and KV‑cache access speed; H100’s 3.35 TB/s bandwidth is highlighted as ideal.
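The division of labor between the two clusters can be sketched with a stand-in model. In the code below, `toy_next_token` is a deterministic placeholder and not a real LLM, the KV cache is represented as a plain list of token ids in place of key/value tensors, and the `Handoff` structure is a hypothetical name for whatever payload a real system sends from Prefill to Decode.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Handoff:
    request_id: str
    kv_cache: List[int]  # stand-in for per-token key/value tensors
    first_token: int


def toy_next_token(kv_cache: List[int]) -> int:
    """Placeholder for a model forward pass; NOT a real LLM."""
    return (sum(kv_cache) + len(kv_cache)) % 100


def prefill(request_id: str, prompt_tokens: List[int]) -> Handoff:
    """Prefill side: process the whole prompt at once and emit the first token."""
    kv_cache = list(prompt_tokens)
    return Handoff(request_id, kv_cache, toy_next_token(kv_cache))


def decode(handoff: Handoff, max_new_tokens: int, eos_token: int = 0) -> Iterator[int]:
    """Decode side: continue from the received cache, one token per step."""
    kv_cache = list(handoff.kv_cache)  # local copy, as if received over the network
    token = handoff.first_token
    for _ in range(max_new_tokens):
        yield token                    # stream the token to the client
        if token == eos_token:
            break
        kv_cache.append(token)         # the cache grows by one entry per step
        token = toy_next_token(kv_cache)


handoff = prefill("req-1", [1, 2, 3])
generated = list(decode(handoff, max_new_tokens=3))
print(generated)
```

The Decode side never sees the prompt text, only the cache and the first token, which is why the transport of that cache becomes a component in its own right.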
KV‑Cache Transport Layer
Connects the two clusters via a high‑speed “highway”. Challenges include large data volume (hundreds of MB for a 2000‑token request), latency sensitivity (affects first‑token delay), and bandwidth cost. Common solutions in the industry are RDMA (zero‑copy, low latency), NVLink/NVSwitch (intra‑server GPU interconnect), and specialized middleware such as NIXL or Mooncake.
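The data-volume figure can be checked with simple arithmetic: the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses an illustrative grouped-query-attention configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) and assumed link speeds; these parameters are not taken from the article, and protocol overhead is ignored.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: 2 (key and value) x layers x KV heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def transfer_time_ms(n_bytes, link_gbit_per_s):
    """Ideal transfer time over a link of the given speed, ignoring overhead."""
    return n_bytes * 8 / (link_gbit_per_s * 1e9) * 1e3


# Illustrative model shape (assumption, not from the article).
size = kv_cache_bytes(tokens=2000, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"KV cache for 2000 tokens: {size / 1e6:.0f} MB")
for gbit in (25, 100, 400):
    print(f"  {gbit:>3} Gbit/s link: {transfer_time_ms(size, gbit):.1f} ms")
```

With this configuration a 2000-token request carries about 262 MB of cache, consistent with the "hundreds of MB" figure above. Because the transfer happens between Prefill and the start of Decode, link speed adds directly to the delay before generation continues, which is why low-latency transports such as RDMA are favored.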
Deploying PD on the TAI Platform
To deploy, users enable PD mode in the TAI AI development platform and select one of the provided inference engines (vLLM or SGLang). They then configure GPU resources, instance counts, startup commands, and environment variables for each of the Prefill, Decode, and proxy roles, and launch the deployment.
Typical Use Cases
High‑concurrency online chat or customer‑service bots where millisecond‑level latency is critical.
Long‑text inference such as Retrieval‑Augmented Generation or document summarization, where prompts can contain thousands of tokens.
Multi‑turn agent workflows (e.g., AutoGPT) that maintain extensive context histories and require steady throughput.
While PD disaggregation is not a universal solution, it delivers clear performance and scalability benefits in the scenarios above.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
