Baobao Algorithm Notes
Jul 11, 2024 · Artificial Intelligence

Why Separate Prefill and Decode? A Deep Dive into DistServe’s Split Inference Architecture

This article walks through the two‑stage LLM inference pipeline, introduces the TTFT and TPOT metrics, explains the motivation for prefill–decode separation, presents experimental comparisons between split and merged architectures, and details DistServe's optimization techniques and parallel‑strategy modeling.

DistServe · Goodput · LLM inference
28 min read