How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations
This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.
Background and Results
After DeepSeek's release, inference demand surged. Existing engines such as vLLM and SGLang suffered from low GPU utilization and slow generation. The Angel‑HCF team, together with Tencent Cloud, optimized DeepSeek inference to industry‑leading performance: 15,800+ tokens/s and QPM = 212 under a 50 ms per‑token latency limit.
Core Goal: Maximize throughput while keeping first‑token latency <2 s and per‑token generation <50 ms.
Technical Roadmap: Hardware co‑design, algorithm innovations, system engineering (framework optimization, w4a8c8 quantization, MTP parallel decoding, PD separation, large EP, TBO, etc.).
Multi‑Machine Performance Optimization
We migrated from a pure C++ framework to a Python runtime with C++ kernels, spending three months on framework, operator, and quantization improvements. This article focuses on PD separation, large EP, DP parallelism, and multi‑layer MTP optimizations.
2.1 PD Separation Design and Optimization
2.1.1 Different Parallel Strategies for Prefill and Decode
Prefill is compute‑bound, so it uses large TP combined with small EP. Decode is memory‑bound, so it uses DP plus large EP, which increases the achievable batch size and reduces per‑GPU memory pressure.
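The role split above can be sketched as a pair of parallel configurations. This is a minimal illustration; the class and degree values are assumptions for a 16‑GPU group, not the Angel‑HCF framework's real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelConfig:
    tp: int  # tensor-parallel degree
    dp: int  # data-parallel degree
    ep: int  # expert-parallel degree

def config_for(role: str, gpus_per_group: int = 16) -> ParallelConfig:
    """Illustrative role-specific parallelism (degrees are hypothetical)."""
    if role == "prefill":
        # Compute-bound: large TP to split the heavy matmuls, small EP.
        return ParallelConfig(tp=gpus_per_group, dp=1, ep=gpus_per_group // 4)
    if role == "decode":
        # Memory-bound: DP replicas grow batch size; EP spans all GPUs
        # so each GPU holds fewer experts and more KV-Cache.
        return ParallelConfig(tp=1, dp=gpus_per_group, ep=gpus_per_group)
    raise ValueError(f"unknown role: {role}")
```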
2.1.2 Efficient KV‑Cache Transfer under PD Separation
2.1.2.1 Overlapping Compute and Transfer
Asynchronous KV‑Cache transfer overlaps with the current iteration, using RDMA to keep GPU utilization high.
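The overlap idea can be sketched with a background worker: the transfer of the previous request's KV blocks is launched asynchronously while the current iteration computes. The function names and sleep-based stand-ins are illustrative; a real implementation would issue RDMA writes, not thread sleeps.

```python
import concurrent.futures
import time

def transfer_kv_cache(layer_blobs):
    # Stand-in for an asynchronous RDMA send of finished KV blocks.
    time.sleep(0.05)
    return sum(len(b) for b in layer_blobs)

def decode_step():
    # Stand-in for one forward iteration on the GPU.
    time.sleep(0.05)
    return "token"

def step_with_overlap(pending_kv):
    """Run the KV transfer concurrently with the current compute step."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(transfer_kv_cache, pending_kv)  # NIC works...
        tok = decode_step()                               # ...while GPU computes
        sent = fut.result()  # join before the KV blocks are reused
    return tok, sent
```

Run serially, the two 50 ms stand-ins would take 100 ms; overlapped, the step finishes in roughly the time of one.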
2.1.2.2 Layerwise Transfer
Transmitting the KV‑Cache layer by layer keeps transfer latency largely independent of sequence length; when the input sequence length (ISL) is below 8k, small per‑layer packets are merged to amortize per‑transfer overhead.
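A sketch of that packetization logic, under stated assumptions: the 8k threshold comes from the article, while the grouping factor and function name are hypothetical.

```python
def plan_transfers(num_layers, isl, merge_threshold_isl=8192):
    """Group per-layer KV blocks into network packets.

    Long inputs: each layer's KV block is large enough to send on its own,
    enabling layerwise pipelining. Short inputs (ISL < threshold): adjacent
    layers are merged into one packet to avoid many tiny sends. The grouping
    factor below is an illustrative heuristic, not the production formula.
    """
    if isl >= merge_threshold_isl:
        return [[layer] for layer in range(num_layers)]  # one packet per layer
    group = max(1, merge_threshold_isl // max(isl, 1))   # merge more when shorter
    return [list(range(start, min(start + group, num_layers)))
            for start in range(0, num_layers, group)]
```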
2.1.3 PD Load Balancing
Prefill uses length‑sorted chunk scheduling; Decode uses remaining‑slot‑based scheduling and dynamic scaling.
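Both policies can be sketched in a few lines. The chunk budget, names, and data shapes are assumptions for illustration; only the strategies themselves (length‑sorted chunk packing for Prefill, most‑remaining‑slots for Decode) come from the article.

```python
def schedule_prefill(requests, chunk_tokens=8192):
    """Length-sorted chunk scheduling: sort by input length, then pack
    requests into batches whose total token count stays under the chunk
    budget, so similarly sized requests share a batch."""
    batches, current, used = [], [], 0
    for rid, length in sorted(requests, key=lambda r: r[1]):
        if current and used + length > chunk_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(rid)
        used += length
    if current:
        batches.append(current)
    return batches

def pick_decode_instance(free_slots):
    """Remaining-slot scheduling: route the new request to the decode
    instance with the most free batch slots."""
    return max(free_slots, key=lambda name: free_slots[name])
```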
2.1.4 PD Architecture and Performance
Tests on 16×H20‑96G show PD separation improves throughput by 30‑40 % when per‑user generation speed is in the 20‑25 tokens/s range.
2.2 EP Parallel Optimization
DeepSeek’s MoE sparsity required communication and load‑balancing improvements.
2.2.1 DeepEP Multi‑Machine Communication
Using the TRMT library cut multi‑machine communication overhead, which previously accounted for over 40 % of step time, by roughly 60 %.
2.2.2 Expert Load Balancing
Hot‑spot experts were balanced with the EPLB algorithm plus redundant expert replicas, lowering the activation‑imbalance ratio to 1.2‑1.5×.
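A greedy sketch of the redundant‑expert idea: give every expert one replica, then hand each spare (redundant) slot to whichever expert currently carries the highest load per replica. This is a simplified stand‑in, not the published EPLB algorithm; slot counts and loads are made up.

```python
import heapq

def balance_experts(loads, num_slots):
    """Greedy redundant-expert placement sketch.

    loads:     observed activation count per expert.
    num_slots: total expert slots available (>= number of experts).
    Returns the replica count per expert; replicating a hot expert divides
    its per-replica load, shrinking the imbalance ratio.
    """
    replicas = [1] * len(loads)
    # Max-heap keyed on current per-replica load (negated for heapq).
    heap = [(-load, e) for e, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(num_slots - len(loads)):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-loads[e] / replicas[e], e))
    return replicas
```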
2.3 DP Parallel Adaptation and Optimization
DP parallelism, adapted from training frameworks, boosts single‑node throughput by more than 50 % while still meeting SLOs.
2.3.1 Runtime Modes
Under DP, all ranks must stay in lockstep: ranks with no real work submit mock requests each forward step, and an all‑reduce synchronizes whether any rank still has work.
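The lockstep rule can be sketched without any distributed runtime. Here a plain `max` over ranks stands in for the all‑reduce, and `"<mock>"` marks a padding request; both names are illustrative.

```python
def dp_step(rank_batches):
    """Simulate one DP forward step across ranks.

    A stand-in all-reduce (max over ranks) decides whether anyone still has
    real work. If so, ranks with empty batches contribute a mock request so
    every rank runs the same forward pass and collectives don't deadlock.
    Returns the batches to execute, or None when all ranks are idle.
    """
    any_work = max(len(batch) for batch in rank_batches) > 0
    if not any_work:
        return None  # every rank idle: skip the step entirely
    return [batch if batch else ["<mock>"] for batch in rank_batches]
```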
2.3.2 PD Adaptation for DP
Requests are balanced across DP ranks via round‑robin, request‑count, and KV‑Cache‑size policies. (Figure in the original: KV‑Cache transfer under DP.)
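The three routing policies can be sketched as one dispatcher. The rank-state shape and function name are assumptions made for this illustration, not the framework's real interface.

```python
def route_request(ranks, policy, rr_state=None):
    """Pick a DP rank for a new request.

    ranks:    rank id -> {"reqs": outstanding requests, "kv_bytes": cache size}
    policy:   "round_robin" | "request_count" | "kv_cache"
    rr_state: mutable dict holding the round-robin cursor.
    """
    ids = sorted(ranks)
    if policy == "round_robin":
        i = rr_state.get("next", 0)
        rr_state["next"] = (i + 1) % len(ids)
        return ids[i]
    if policy == "request_count":
        return min(ids, key=lambda r: ranks[r]["reqs"])   # fewest in flight
    if policy == "kv_cache":
        return min(ids, key=lambda r: ranks[r]["kv_bytes"])  # most cache headroom
    raise ValueError(f"unknown policy: {policy}")
```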
Multi‑Layer MTP Optimization and Practice
MTP (Multi‑Token Prediction) drafts several future tokens per decoding step, which are then verified in parallel, improving inference speed. Training additional MTP layers and optimizing the sampling/verification scheme raise the acceptance rate.
3.1 Training Multi‑Layer MTP
Two approaches were tried with Megatron‑LM: five independent sets of MTP weights, or a single set of weights shared across five MTP layers.
Results: shared‑weight MTP2 +7.4 %, independent MTP2 +8.8 %, independent MTP3 +9.0 % over the open‑source baseline.
3.2 Sampling Optimization
Three verification methods—token‑by‑token, rejection sampling, typical sampling—were evaluated; our hybrid method with dynamic temperature achieved ~0.7 acceptance without accuracy loss.
3.2.1 Token‑by‑Token
Strict verification: each draft token must exactly match the target model's output; ~0.51 acceptance rate.
3.2.2 Rejection Sampling
Rejection‑sampling verification, as in standard speculative decoding; ~0.56 acceptance rate.
3.2.3 Typical Sampling
Typical‑sampling verification; ~0.62 acceptance rate.
3.2.4 Our Method
Combines different samplers for the draft and verify stages with dynamic temperature, reaching ~0.7 acceptance while leaving model accuracy unchanged.
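For context, the rejection‑sampling verifier that these methods build on can be sketched as follows. This is the standard speculative‑decoding acceptance rule (accept draft token t with probability min(1, p_target(t)/p_draft(t)), stop at the first rejection), not the team's hybrid dynamic‑temperature method; all names are illustrative.

```python
import random

def verify_draft(draft_tokens, p_draft, p_target, rng=random.random):
    """Standard rejection-sampling verification for speculative decoding.

    draft_tokens: tokens proposed by the MTP/draft model, in order.
    p_draft[i], p_target[i]: dicts mapping token -> probability at position i
    under the draft and target models. Returns the accepted prefix.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q = p_draft[i].get(tok, 1e-9)   # draft probability of its own token
        p = p_target[i].get(tok, 0.0)   # target model's probability
        if rng() < min(1.0, p / q):
            accepted.append(tok)        # token verified, keep going
        else:
            break                       # first rejection ends verification
    return accepted
```

When the two distributions agree exactly, every draft token is accepted; the further the draft diverges from the target, the earlier verification stops, which is why the acceptance rate directly governs the speedup.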
Future Work
We continue to explore large EP, TBO, DeepEP communication, and a global KV‑Cache, aiming for more than 20,000 tokens/s.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.