How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

Tencent Technical Engineering

Background and Results

After DeepSeek's release, inference demand surged, but existing engines such as vLLM and SGLang suffered from low GPU utilization and slow generation. The Angel‑HCF team, together with Tencent Cloud, optimized DeepSeek inference to industry‑leading performance: over 15,800 tokens/s and 212 queries per minute (QPM) under a 50 ms per‑token latency limit.

Core Goal: Maximize throughput while keeping first‑token latency <2 s and per‑token generation <50 ms.

Technical Roadmap: Hardware co‑design, algorithm innovations, system engineering (framework optimization, w4a8c8 quantization, MTP parallel decoding, PD separation, large EP, TBO, etc.).

Multi‑Machine Performance Optimization

We migrated from a pure C++ framework to a Python runtime with C++ kernels, spending three months on framework, operator, and quantization improvements. This article focuses on PD separation, large EP, DP parallelism, and multi‑layer MTP optimizations.

2.1 PD Separation Design and Optimization

2.1.1 Different Parallel Strategies for Prefill and Decode

Prefill is compute‑bound; we use large TP + small EP parallelism. Decode is memory‑bound; we use DP + large EP to increase batch size and reduce memory pressure.
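As a rough illustration of this split, here is a minimal Python sketch; the `ParallelConfig` dataclass, its field names, and the parallel degrees are illustrative placeholders, not the Angel‑HCF configuration API:

```python
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    """Illustrative parallel-layout knobs for one role in a PD-separated deployment."""
    role: str              # "prefill" or "decode"
    tensor_parallel: int   # TP degree
    expert_parallel: int   # EP degree across the MoE expert shards
    data_parallel: int     # DP replicas of the attention/dense path

# Prefill is compute-bound: a large TP group with a small EP group.
# The exact degrees below are placeholders; real values depend on cluster size and SLOs.
prefill_cfg = ParallelConfig(role="prefill", tensor_parallel=8, expert_parallel=2, data_parallel=1)

# Decode is memory-bound: DP replicas plus a large EP group, so each replica can
# batch more requests while experts are spread across machines.
decode_cfg = ParallelConfig(role="decode", tensor_parallel=1, expert_parallel=16, data_parallel=8)

if __name__ == "__main__":
    for cfg in (prefill_cfg, decode_cfg):
        print(cfg)
```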

2.1.2 Efficient KV‑Cache Transfer under PD Separation

2.1.2.1 Overlapping Compute and Transfer

KV‑Cache transfer is issued asynchronously so that it overlaps with the current compute iteration; RDMA keeps the transfer off the GPU's critical path and utilization high.
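A minimal sketch of the overlap pattern, with the RDMA transport, prefill step, and router replaced by placeholder stubs (`rdma_send_kv_cache`, `run_prefill`, and `pick_decode_rank` are hypothetical names, not the engine's API):

```python
import queue
import threading
import time

# --- placeholder stubs for the real engine/transport primitives ---------------
def run_prefill(request):                      # stands in for the GPU prefill step
    time.sleep(0.01)
    return [f"kv_block_{request['id']}"]

def rdma_send_kv_cache(request_id, kv_blocks, decode_rank):  # stands in for the RDMA send
    time.sleep(0.02)

def pick_decode_rank(request):                 # stands in for the PD router
    return request["id"] % 4
# -------------------------------------------------------------------------------

transfer_queue = queue.Queue()

def transfer_worker():
    """Drains finished-prefill KV blocks and ships them while the GPU keeps computing."""
    while True:
        job = transfer_queue.get()
        if job is None:                        # shutdown sentinel
            break
        rdma_send_kv_cache(job["request_id"], job["kv_blocks"], job["decode_rank"])
        transfer_queue.task_done()

def prefill_loop(batch):
    threading.Thread(target=transfer_worker, daemon=True).start()
    for request in batch:
        kv_blocks = run_prefill(request)       # current iteration's compute
        # Enqueue the transfer instead of blocking on it: the next prefill
        # iteration starts immediately while the KV-Cache is still in flight.
        transfer_queue.put({"request_id": request["id"],
                            "kv_blocks": kv_blocks,
                            "decode_rank": pick_decode_rank(request)})
    transfer_queue.join()
    transfer_queue.put(None)

if __name__ == "__main__":
    prefill_loop([{"id": i} for i in range(8)])
```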

2.1.2.2 Layerwise Transfer

Transmitting the KV‑Cache layer by layer makes transfer latency largely independent of sequence length; for short inputs (ISL < 8k tokens), the small per‑layer packets are merged to avoid per‑packet overhead.
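A sketch of the layerwise path under these assumptions; the 8k threshold comes from the text, while the merge granularity and helper names are illustrative:

```python
MERGE_THRESHOLD_TOKENS = 8 * 1024   # below this input length, merge small per-layer packets
MERGE_GROUP_LAYERS = 4              # illustrative merge granularity, not the production value

def send_kv_layerwise(request_id, input_len, num_layers, compute_layer_kv, send_packet):
    """Ship KV-Cache layer by layer so transfer latency stops scaling with sequence length.

    compute_layer_kv(layer) -> bytes and send_packet(request_id, payload) are
    placeholders for the real per-layer prefill output and the RDMA send.
    """
    pending = []
    for layer in range(num_layers):
        kv_bytes = compute_layer_kv(layer)
        if input_len >= MERGE_THRESHOLD_TOKENS:
            # Long inputs: each layer is already a large packet, so send it immediately
            # and let transfer of layer i overlap with compute of layer i+1.
            send_packet(request_id, kv_bytes)
        else:
            # Short inputs: buffer a few layers and send one merged packet to avoid
            # paying per-packet overhead on many tiny transfers.
            pending.append(kv_bytes)
            if len(pending) == MERGE_GROUP_LAYERS:
                send_packet(request_id, b"".join(pending))
                pending.clear()
    if pending:
        send_packet(request_id, b"".join(pending))

if __name__ == "__main__":
    sent = []
    send_kv_layerwise("req-0", input_len=2048, num_layers=8,
                      compute_layer_kv=lambda layer: bytes([layer]) * 16,
                      send_packet=lambda rid, payload: sent.append(len(payload)))
    print(sent)   # two merged packets of four layers each
```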

2.1.3 PD Load Balancing

Prefill uses length‑sorted chunk scheduling; Decode uses remaining‑slot‑based scheduling and dynamic scaling.
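The two policies could look roughly like this; the token budget, slot counts, and field names are assumptions for illustration, not the production scheduler:

```python
def schedule_prefill(pending_requests, chunk_token_budget=8192):
    """Length-sorted chunk scheduling (sketch): sort waiting requests by input length,
    then pack them into chunks that fit a per-step token budget so short prompts are
    not serialized behind very long ones. The budget value is illustrative."""
    chunks, current, used = [], [], 0
    for req in sorted(pending_requests, key=lambda r: r["input_len"]):
        if used + req["input_len"] > chunk_token_budget and current:
            chunks.append(current)
            current, used = [], 0
        current.append(req)
        used += req["input_len"]
    if current:
        chunks.append(current)
    return chunks

def pick_decode_instance(instances):
    """Remaining-slot scheduling (sketch): route a finished prefill to the decode
    instance with the most free batch slots; when every instance is full,
    a dynamic-scaling hook would fire."""
    best = max(instances, key=lambda inst: inst["max_slots"] - inst["used_slots"])
    if best["max_slots"] - best["used_slots"] == 0:
        raise RuntimeError("all decode instances full; trigger dynamic scaling")
    return best

if __name__ == "__main__":
    reqs = [{"id": i, "input_len": n} for i, n in enumerate([512, 7000, 1200, 300])]
    print([[r["id"] for r in chunk] for chunk in schedule_prefill(reqs)])
    print(pick_decode_instance([{"name": "d0", "max_slots": 256, "used_slots": 250},
                                {"name": "d1", "max_slots": 256, "used_slots": 90}]))
```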

2.1.4 PD Architecture and Performance

Tests on 16×H20‑96G GPUs show that PD separation improves throughput by 30‑40 % at per‑token generation speeds of 20‑25 tokens/s.

2.2 EP Parallel Optimization

DeepSeek’s MoE sparsity required communication and load‑balancing improvements.

2.2.1 DeepEP Multi‑Machine Communication

Multi‑machine EP communication previously accounted for more than 40 % of step time; adopting the TRMT library cut this communication overhead by roughly 60 %.

2.2.2 Expert Load Balancing

Hot‑spot experts were balanced with the EPLB algorithm and redundant experts, lowering activation imbalance to 1.2‑1.5×.
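An EPLB‑style rebalance can be sketched as a greedy replica assignment. This is a simplified illustration of the idea with made‑up activation counts; real EPLB also decides where replicas are placed and when to re‑shuffle them:

```python
from collections import Counter

def plan_redundant_experts(activation_counts, num_redundant_slots):
    """Greedy sketch: give the hottest experts extra replicas so their traffic can be
    split, pushing the max/mean activation ratio down. Returns {expert_id: replicas}."""
    replicas = Counter({eid: 1 for eid in activation_counts})
    load = dict(activation_counts)                    # estimated per-replica load
    for _ in range(num_redundant_slots):
        hottest = max(load, key=load.get)             # expert with highest per-replica load
        replicas[hottest] += 1
        load[hottest] = activation_counts[hottest] / replicas[hottest]
    return dict(replicas)

if __name__ == "__main__":
    counts = {0: 9000, 1: 1200, 2: 800, 3: 7000, 4: 900, 5: 1100}   # made-up activations
    plan = plan_redundant_experts(counts, num_redundant_slots=4)
    per_replica = [counts[e] / n for e, n in plan.items()]
    print(plan)                                       # hot experts 0 and 3 get extra copies
    print(max(per_replica) / (sum(per_replica) / len(per_replica)))  # imbalance ratio shrinks
```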

2.3 DP Parallel Adaptation and Optimization

DP parallelism, adapted from training frameworks, boosts single‑node throughput >50 % while meeting SLOs.

2.3.1 Runtime Modes

Under DP, every rank must execute every forward step: ranks that have no real work for a step submit mock (padding) requests, and per‑step state is synchronized across ranks via all‑reduce so that collective operations never stall.
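A toy sketch of this lockstep rule, simulating the all‑reduce with a plain `max()` over ranks; the queue contents and naming are illustrative:

```python
def dp_step_plan(per_rank_pending):
    """Every DP rank runs every forward step. An 'all-reduce' (simulated here with
    max()) agrees on whether any rank still has real work; ranks without a request
    for that step run a mock request so collective ops inside MoE/attention match up."""
    plans = {rank: [] for rank in per_rank_pending}
    while True:
        any_real = max(len(q) > 0 for q in per_rank_pending.values())  # all-reduce(MAX)
        if not any_real:
            break
        for rank, q in per_rank_pending.items():
            plans[rank].append(q.pop(0) if q else "MOCK")   # idle rank pads the step
    return plans

if __name__ == "__main__":
    pending = {0: ["r0", "r1", "r2"], 1: ["r3"], 2: [], 3: ["r4", "r5"]}
    for rank, steps in dp_step_plan(pending).items():
        print(rank, steps)
    # rank 2 runs three mock forwards so every DP rank stays in lockstep
```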

2.3.2 PD Adaptation for DP

New requests are balanced across DP ranks by round‑robin, by in‑flight request count, or by KV‑Cache size, and the prefill‑to‑decode KV‑Cache transfer path is adapted to route to the selected rank.
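The three routing policies might be sketched as follows; the `DPRouter` class and its per‑rank statistics are hypothetical names used only for illustration:

```python
import itertools

class DPRouter:
    """Sketch of three policies for picking which DP rank receives a new request:
    round-robin, fewest in-flight requests, or smallest KV-Cache footprint."""

    def __init__(self, ranks):
        self.ranks = ranks
        self._rr = itertools.cycle(range(len(ranks)))

    def pick(self, policy="kv_cache"):
        if policy == "round_robin":
            return next(self._rr)
        if policy == "request_count":
            return min(range(len(self.ranks)), key=lambda i: self.ranks[i]["num_requests"])
        if policy == "kv_cache":
            return min(range(len(self.ranks)), key=lambda i: self.ranks[i]["kv_cache_tokens"])
        raise ValueError(policy)

if __name__ == "__main__":
    router = DPRouter([{"num_requests": 12, "kv_cache_tokens": 40_000},
                       {"num_requests": 9,  "kv_cache_tokens": 55_000},
                       {"num_requests": 10, "kv_cache_tokens": 32_000}])
    print(router.pick("round_robin"), router.pick("request_count"), router.pick("kv_cache"))
```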

Multi‑Layer MTP Optimization and Practice

MTP (multi‑token prediction) drafts several future tokens per decoding step, which the main model then verifies, improving generation speed. Training multiple MTP layers and optimizing the sampling and verification strategy both raise the acceptance rate.

3.1 Training Multi‑Layer MTP

We trained multi‑layer MTP with Megatron‑LLM using two approaches: five independent sets of MTP weights, or a single set of weights shared across the five MTP layers.

Results: shared MTP2 +7.4 %, independent MTP2 +8.8 %, independent MTP3 +9.0 % over open‑source baselines.
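As a rough picture of the two weight layouts (not the actual Megatron‑LLM training code), here is a toy PyTorch sketch with a stand‑in MTP block; the hidden size and block internals are placeholders:

```python
import torch
from torch import nn

HIDDEN = 64          # toy size; the real MTP blocks reuse DeepSeek's hidden dimension
NUM_MTP_STEPS = 5    # the number of speculative steps discussed above

class MTPBlock(nn.Module):
    """Stand-in for one MTP layer (projection + transformer block in the real model)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2 * HIDDEN, HIDDEN)   # fuse hidden state with the token embedding
    def forward(self, hidden, tok_emb):
        return torch.tanh(self.proj(torch.cat([hidden, tok_emb], dim=-1)))

# Option A: independent weights, one MTPBlock per speculative depth.
independent_mtp = nn.ModuleList(MTPBlock() for _ in range(NUM_MTP_STEPS))

# Option B: shared weights, the same MTPBlock applied at every depth.
shared_mtp = MTPBlock()

def draft_hidden_states(hidden, tok_emb, shared=True):
    states = []
    for depth in range(NUM_MTP_STEPS):
        block = shared_mtp if shared else independent_mtp[depth]
        hidden = block(hidden, tok_emb)
        states.append(hidden)
    return states

if __name__ == "__main__":
    h, e = torch.randn(1, HIDDEN), torch.randn(1, HIDDEN)
    print(len(draft_hidden_states(h, e, shared=True)), len(draft_hidden_states(h, e, shared=False)))
```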

3.2 Sampling Optimization

Three verification methods—token‑by‑token, rejection sampling, typical sampling—were evaluated; our hybrid method with dynamic temperature achieved ~0.7 acceptance without accuracy loss.

3.2.1 Token‑by‑Token

Strict verification, ~0.51 acceptance.

3.2.2 Rejection Sampling

~0.56 acceptance.
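For reference, the standard speculative‑decoding rejection‑sampling check that this subsection refers to can be sketched as follows; this is the generic formulation, not the team's exact kernel:

```python
import numpy as np

def rejection_sample_verify(draft_tokens, draft_probs, target_probs, rng):
    """Accept draft token t with probability min(1, p_target(t)/p_draft(t)); on the
    first rejection, resample from the residual distribution max(0, p - q) and stop.

    draft_probs[i] / target_probs[i] are full vocabulary distributions at step i."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
            accepted.append(tok)                      # draft token kept
            continue
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        accepted.append(int(rng.choice(len(p), p=residual)))   # corrected token
        break                                          # everything after a rejection is discarded
    return accepted

if __name__ == "__main__":
    vocab, rng = 8, np.random.default_rng(0)
    q = rng.dirichlet(np.ones(vocab), size=3)          # draft model distributions
    p = rng.dirichlet(np.ones(vocab), size=3)          # target model distributions
    drafts = [int(rng.choice(vocab, p=qi)) for qi in q]
    print(drafts, "->", rejection_sample_verify(drafts, q, p, rng))
```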

3.2.3 Typical Sampling

~0.62 acceptance.

3.2.4 Our Method

Combines different samplers for draft and verify, reaching ~0.7 acceptance while keeping model accuracy unchanged.

Future Work

We will continue exploring large EP, TBO, DeepEP communication, and a global KV‑Cache, aiming for more than 20,000 tokens/s.
