Why EP Outperforms TP for Deepseek V3/R1 Inference: Cost, Performance, and Reliability
This article analyzes Deepseek's EP‑based inference architecture for the V3/R1 models and compares it with TP: how EP cuts weight redundancy and GPU memory usage, enables much larger batch sizes, and what reliability, scalability, and maintainability challenges it introduces for large‑scale deployments.
Background
Deepseek recently released the Deepseek‑V3 and Deepseek‑R1 models, claiming ultra‑low training costs and performance comparable to top‑tier closed‑source models. Alongside the models, they open‑sourced high‑performance inference and training components and published their inference cost and profit margins, sparking industry discussion.
System Overview
Deepseek‑V3/R1 uses a split architecture combining Data Parallelism (DP) and Expert Parallelism (EP). Each MoE layer contains 256 routed experts, to which the deployment adds 32 redundant experts. The deployment configuration is:
Prefill: routed experts EP32, MLA and shared experts DP32; one deployment unit = 4 nodes; 32 redundant routed experts; each GPU hosts 9 routed experts + 1 shared expert
Decode: routed experts EP144, MLA and shared experts DP144; one deployment unit = 18 nodes; 32 redundant routed experts; each GPU hosts 2 routed experts + 1 shared expert
The article focuses on EP, arguing that without EP the split architecture offers no fundamental benefit. EP also cleanly separates two very different computational behaviors: the prefill stage processes a user's entire prompt in one pass, while each decode step processes a single token, making a stable, low‑fragmentation environment essential.
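The per‑GPU expert counts above follow directly from the deployment arithmetic. A minimal sketch (the helper name is illustrative, not Deepseek's code):

```python
# Hypothetical sketch of the expert-placement arithmetic described above.
# Expert counts and EP degrees come from the article.

def experts_per_gpu(routed_experts: int, redundant: int, ep_degree: int) -> int:
    """Routed experts hosted per GPU, assuming even placement."""
    total = routed_experts + redundant
    assert total % ep_degree == 0, "placement must divide evenly across GPUs"
    return total // ep_degree

# Prefill: EP32 over one deployment unit of 4 nodes (32 GPUs)
prefill = experts_per_gpu(routed_experts=256, redundant=32, ep_degree=32)

# Decode: EP144 over one deployment unit of 18 nodes (144 GPUs)
decode = experts_per_gpu(routed_experts=256, redundant=32, ep_degree=144)

print(prefill, decode)  # 9 2 -- each GPU additionally hosts 1 shared expert
```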
EP vs. TP Advantages
Cost Efficiency: EP enables higher throughput on fixed resources by maximizing batch size. In a decode scenario with 18 nodes (144 H800 GPUs), a TP16 deployment would replicate the model across 9 instances of 16 GPUs each, consuming ~5400 GB for model weights alone. EP consolidates the model into a single instance using only ~1.5× the weight size, saving ~4500 GB of GPU memory and allowing a much larger batch size.
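These figures can be checked with a quick back‑of‑the‑envelope sketch. The ~600 GB per‑copy weight size is an assumption (roughly 671 B parameters at FP8 plus overhead), not a figure from the article:

```python
# Back-of-the-envelope check of the memory comparison above.
# WEIGHTS_GB is an assumed per-copy footprint, not Deepseek's published number.

WEIGHTS_GB = 600          # assumed weight footprint of one full model copy (GB)
GPUS = 144                # 18 nodes x 8 H800
TP_DEGREE = 16            # one TP replica spans 16 GPUs

tp_replicas = GPUS // TP_DEGREE           # 9 independent TP replicas
tp_total_gb = tp_replicas * WEIGHTS_GB    # total weight memory under TP
ep_total_gb = 1.5 * WEIGHTS_GB            # single EP instance, ~1.5x the weights
saved_gb = tp_total_gb - ep_total_gb      # memory freed for KV cache / batches

print(tp_total_gb, ep_total_gb, saved_gb)  # 5400 900.0 4500.0
```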
Memory Utilization: The 671 B‑parameter model activates only ~37 B parameters per token. Using DP+EP, the weights each GPU must hold shrink toward this activated share, achieving a generation throughput of ~14.7 k tokens/s, which is unattainable with TP due to excessive weight duplication.
Technical Benefits of EP
Effective Model Parallelism
EP requires no tensor partitioning and naturally parallelizes across up to 256 GPUs (one per routed expert). TP, by contrast, must split Q/K/V heads and FFN weight matrices, producing inefficient tensor shapes and fragmented matrix multiplications.
Elimination of Degenerate Matrices
DP combined with EP avoids the “degenerate matrix” issue that arises when TP splits a GEMM dimension below the hardware’s optimal tile size (e.g., num_head/TP × seq_len < 64 during decode). Under DP, attention keeps the full num_head = 128, matching Hopper’s WGMMA tile requirements and eliminating wasted computation.
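A minimal sketch of the tile‑size argument, assuming the 64‑wide WGMMA tile and 128 heads mentioned above (the helper is illustrative):

```python
# Illustrative check of the degenerate-matrix argument: Hopper WGMMA works
# best when the GEMM M dimension fills 64-wide tiles. Values follow the article.

WGMMA_M = 64
NUM_HEADS = 128

def gemm_m(num_heads: int, tp_degree: int, seq_len: int) -> int:
    """Per-GPU GEMM M dimension for the attention matmul (simplified)."""
    return (num_heads // tp_degree) * seq_len

# TP8 decode: each GPU keeps 16 heads and seq_len = 1 -> M = 16, under-filled tile
tp_m = gemm_m(NUM_HEADS, tp_degree=8, seq_len=1)

# DP (with EP) decode: all 128 heads stay on one GPU -> M = 128, two full tiles
dp_m = gemm_m(NUM_HEADS, tp_degree=1, seq_len=1)

print(tp_m, tp_m < WGMMA_M)   # 16 True  (degenerate shape)
print(dp_m, dp_m % WGMMA_M)   # 128 0    (tile-aligned)
```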
Communication Independence from Global Batch Size
EP’s All‑2‑All communication volume depends only on the micro‑batch held by a single card, not on the total batch size of the whole instance. Scaling from 10 to 100 cards multiplies the total batch size tenfold while per‑card communication stays constant; under TP, by contrast, per‑card communication scales with the global batch size.
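The scaling difference can be sketched with simplified volume formulas. The hidden size and top‑k below are assumptions for illustration, not Deepseek's exact values:

```python
# Sketch of the communication-scaling claim. HIDDEN and TOPK are assumed
# illustrative values; the formulas are simplified models of the traffic.

HIDDEN = 7168   # assumed hidden size (elements per token)
TOPK = 8        # experts each token is routed to

def ep_all2all_per_card(per_card_batch: int) -> int:
    # EP: each card dispatches only its own tokens to top-k experts.
    return per_card_batch * HIDDEN * TOPK

def tp_comm_per_card(global_batch: int) -> int:
    # TP: every card all-reduces activations for every token in the global batch.
    return global_batch * HIDDEN

per_card = 64
# Growing the instance from 10 to 100 cards (global batch 640 -> 6400):
ep_small, ep_large = ep_all2all_per_card(per_card), ep_all2all_per_card(per_card)
tp_small, tp_large = tp_comm_per_card(640), tp_comm_per_card(6400)

print(ep_large == ep_small)   # True: EP per-card traffic is unchanged
print(tp_large // tp_small)   # 10: TP per-card traffic grows with global batch
```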
Memory‑Bound Mitigation
Memory‑bound here refers to GPU memory capacity limiting batch size, not bandwidth limits. In a TP‑based fused‑MoE deployment of the 671 B model on 16 GPUs, every token activates only ~37 B parameters, leaving ~634 B resident but idle, and the duplicated weights severely restrict batch size. EP instead places only a handful of routed experts plus one shared expert per card, matching the activated size and removing this memory bottleneck.
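Quick arithmetic for the memory‑bound argument, using the parameter counts from the article (the per‑GPU split is illustrative):

```python
# Arithmetic behind the memory-bound claim. Counts are in billions of
# parameters and come from the article; the TP split is illustrative.

TOTAL_B = 671      # total parameters of the model (billions)
ACTIVE_B = 37      # parameters activated per token (billions)
TP_GPUS = 16

idle_b = TOTAL_B - ACTIVE_B        # parameters resident but idle per token
tp_per_gpu_b = TOTAL_B / TP_GPUS   # weights each TP GPU must still hold

print(idle_b)                      # 634
print(round(tp_per_gpu_b, 1))      # 41.9
```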
Challenges of EP Deployment
Reliability
With an EP instance spanning 22 nodes and 176 GPUs, the probability that some card fails is roughly 176× that of a single‑GPU deployment, dramatically increasing the chance of a large‑scale outage. Maintaining high availability requires many redundant instances and rapid fault isolation; otherwise a single instance failure can overload the remaining services.
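A hedged sketch of the underlying math: if one GPU fails within some window with probability p, an instance spanning n GPUs survives only if all n do. The failure probability below is illustrative, not a measured value:

```python
# Reliability sketch: an n-GPU instance fails if any single GPU fails.
# p is an assumed per-GPU failure probability for illustration only.

def instance_failure_prob(p_single: float, n_gpus: int) -> float:
    """Probability that at least one of n independent GPUs fails."""
    return 1.0 - (1.0 - p_single) ** n_gpus

p = 0.001  # illustrative per-GPU failure probability per window
print(instance_failure_prob(p, 1))    # a single GPU: 0.1% risk
print(instance_failure_prob(p, 176))  # 176 GPUs: risk exceeds 16%
```

For small p this is approximately 176 × p, which is why the article describes the risk as multiplying with instance size.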
Scalability
Inside an instance, EP can scale only up to the number of experts (256 + 32 redundant), and All‑2‑All efficiency degrades as more cards synchronize. Across instances, sufficient redundancy is needed to absorb traffic from failed instances; otherwise service availability suffers.
Maintainability
Diagnosing issues in a 176‑GPU instance is far more complex than in single‑GPU deployments. Reproducing bugs requires coordinated replay of traffic across all cards, and manual troubleshooting becomes impractical, demanding robust monitoring and automated debugging tools.
Summary of EP Advantages
Significant reduction of model weight redundancy, freeing GPU memory for larger batch sizes.
Elimination of degenerate matrix operations, improving GPU utilization and alleviating memory‑bandwidth limits.
Per‑card communication volume independent of total instance batch size, enabling efficient scaling.
Effective “densification” of the sparse model: the memory saved is fully reinvested in larger batches.
The primary drawback is the increased risk of large‑scale failures due to the massive size of a single EP instance, which challenges system stability and availability.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.