How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations
The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.
GPU Parallelism Fundamentals
Data Parallel (DP) replicates the full model on each GPU and aggregates gradients. When model size exceeds a single GPU’s memory, parameters, gradients, optimizer states, and activations must be sharded, introducing communication overhead.
For a 10 B‑parameter model in FP16:
Parameters: 20 GB
Gradients: 20 GB
AdamW optimizer states (two 32‑bit moments + master weights): 80 GB + 40 GB = 120 GB
Baseline memory demand ≈ 160 GB without activations; activations push the requirement beyond a single GPU, necessitating distributed strategies.
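The arithmetic above can be reproduced with a short sketch. The function name and the convention 1 GB = 1e9 bytes are assumptions made for illustration; the byte widths follow the list above.

```python
def training_state_gb(n_params: float) -> dict:
    """Per-replica model-state memory in GB (1 GB = 1e9 bytes, an assumption)."""
    gb = 1e9
    params = 2 * n_params / gb        # FP16 weights, 2 bytes each
    grads = 2 * n_params / gb         # FP16 gradients, 2 bytes each
    moments = 2 * 4 * n_params / gb   # two FP32 AdamW moments, 4 bytes each
    master = 4 * n_params / gb        # FP32 master copy of the weights
    return {
        "params": params,
        "grads": grads,
        "optimizer": moments + master,
        "total": params + grads + moments + master,
    }


budget = training_state_gb(10e9)  # the 10 B-parameter example
print(budget)
```

Activations are deliberately left out, since they depend on batch size, sequence length, and recomputation policy.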
Common Parallel Strategies
ZeRO‑1: shards optimizer state only.
ZeRO‑2: shards optimizer state and gradients.
ZeRO‑3 / FSDP: shards model parameters as well.
Tensor Parallel (TP): splits tensors across GPUs; high communication volume, best on NVLink domains.
Pipeline Parallel (PP): distributes Transformer layers across GPU groups; low communication volume, suitable for cross‑node execution.
Expert Parallel (EP): native to MoE; experts placed on different GPUs, tokens routed via all‑to‑all, results combined; consumes significant InfiniBand (IB) bandwidth.
Context Parallel (CP): splits the sequence dimension inside attention for long‑context pre‑training.
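To make the three ZeRO stages in the list concrete, the following sketch estimates per‑GPU model‑state memory for each stage. It assumes mixed‑precision AdamW with 2 + 2 + 12 bytes per parameter; the data‑parallel degree of 8 is a hypothetical value chosen only for illustration.

```python
def zero_state_gb(n_params: float, dp: int, stage: int) -> float:
    """Per-GPU model-state memory (GB) under ZeRO stage 0-3.

    Assumes 2 bytes/param for FP16 weights, 2 for FP16 gradients and
    12 for optimizer state (two FP32 moments + FP32 master weights).
    Activations and communication buffers are ignored.
    """
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= dp    # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= dp    # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= dp   # ZeRO-3 / FSDP: also shard parameters
    return (params + grads + optim) * n_params / 1e9


# Hypothetical setup: 10 B parameters, 8 data-parallel ranks.
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_state_gb(10e9, dp=8, stage=stage):.1f} GB")
```

Stage 0 here means plain DP with no sharding, which reproduces the 160 GB baseline.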
Composability of Parallelisms
Parallelisms are orthogonal and can be stacked. 3‑D parallelism combines DP, TP, and PP, with total GPU count N = DP_total × TP × PP. EP can be overlaid on DP, giving DP_total = EP × DP_replica.
DeepSeek‑V3 uses 2048 H800 GPUs with PP=16, EP=64, and no tensor parallelism (TP=1), yielding DP_total = 2048 / 16 = 128 and DP_replica = 128 / 64 = 2. The global batch size is GBS = MBS × DP_total × grad_accum_steps, where MBS is the micro‑batch size.
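The factorization can be checked with a few lines of code. This is a minimal sketch: the helper names are hypothetical, TP=1 is assumed because the quoted numbers only balance without tensor parallelism, and the MBS and accumulation values in the last line are placeholders rather than DeepSeek's settings.

```python
def dp_layout(n_gpus: int, pp: int, ep: int, tp: int = 1) -> tuple[int, int]:
    """Return (DP_total, DP_replica) from N = DP_total * TP * PP
    and DP_total = EP * DP_replica."""
    if n_gpus % (tp * pp):
        raise ValueError("TP * PP must divide the GPU count")
    dp_total = n_gpus // (tp * pp)
    if dp_total % ep:
        raise ValueError("EP must divide DP_total")
    return dp_total, dp_total // ep


def global_batch_size(mbs: int, dp_total: int, grad_accum_steps: int) -> int:
    """GBS = MBS * DP_total * grad_accum_steps."""
    return mbs * dp_total * grad_accum_steps


dp_total, dp_replica = dp_layout(n_gpus=2048, pp=16, ep=64)
print(dp_total, dp_replica)  # 128 2

# Placeholder values for MBS and accumulation steps, for illustration only.
print(global_batch_size(mbs=1, dp_total=dp_total, grad_accum_steps=8))  # 1024
```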
Why DeepSeek Chooses DP + ZeRO‑1 + EP + PP
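ZeRO‑3 requires an all‑gather of the parameters for every micro‑batch, consuming GB‑scale IB bandwidth and colliding with EP traffic. ZeRO‑2 also shards gradients, which are synchronized with a reduce‑scatter rather than an all‑gather; when gradients are accumulated over many micro‑batches, as in PP, that reduction recurs per micro‑batch instead of once per step. ZeRO‑1 adds no communication during the forward pass and performs a single all‑gather only during the optimizer step, keeping the communication cost close to that of vanilla DP while sharding the optimizer state, which accounts for 12 of the 16 bytes per parameter in the FP16 example above.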
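Thus IB bandwidth is reserved for EP, while PP handles cross‑node communication with minimal volume: roughly B × seq_len × hidden_size × 2 bytes per micro‑batch at each stage boundary, where B is the micro‑batch size and the factor 2 is the byte width of a 16‑bit activation.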
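A quick calculation shows how small the PP transfer is. The shapes below are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def pp_boundary_bytes(micro_batch: int, seq_len: int, hidden_size: int,
                      bytes_per_elem: int = 2) -> int:
    """Bytes of activations sent across one pipeline-stage boundary
    for one micro-batch (16-bit activations by default)."""
    return micro_batch * seq_len * hidden_size * bytes_per_elem


# Hypothetical shapes: one sequence of 4096 tokens, hidden size 7168.
size = pp_boundary_bytes(micro_batch=1, seq_len=4096, hidden_size=7168)
print(size / 2**20, "MiB")  # 56.0 MiB
```

The same volume flows in the opposite direction during the backward pass, as activation gradients.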
Pipeline Bubble and Mitigation
PP introduces pipeline bubbles: stages sit idle while the pipeline fills and drains, because each stage must wait for activations from the previous stage and for gradients from the next. The classic GPipe bubble ratio is (P‑1)/(M+P‑1), where P is the number of pipeline stages and M the micro‑batch count.
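The formula makes it easy to see how the bubble shrinks as the micro‑batch count grows. A minimal sketch, with P=16 taken from the DeepSeek‑V3 configuration and the values of M chosen arbitrarily:

```python
def gpipe_bubble_ratio(p: int, m: int) -> float:
    """Fraction of idle time in a GPipe schedule with p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)


for m in (16, 64, 256):
    print(m, round(gpipe_bubble_ratio(16, m), 3))
# 16 0.484
# 64 0.19
# 256 0.055
```

With as many micro‑batches as stages, almost half of the schedule is idle, which is why the techniques below matter.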
Techniques:
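1F1B (one forward, one backward) alternates forward and backward passes so that each micro‑batch's activations are freed sooner, lowering memory usage and allowing larger micro‑batches.
Zero‑Bubble PP (ZB1P) splits each layer's backward pass into two sub‑steps: the input gradient (dx), which the previous stage is waiting for, and the weight gradient (dw), which only the optimizer needs. Computing dx first and deferring dw lets the schedule fill bubbles with dw work.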
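The dx/dw split is easiest to see on a single linear layer Y = X·W, where the two gradients are independent matrix products. The sketch below uses plain Python lists and hypothetical helper names; it illustrates the decomposition only, not a pipeline scheduler.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def backward_input(dy, w):
    """dx = dy @ W^T: the previous pipeline stage needs this right away."""
    return matmul(dy, transpose(w))


def backward_weight(x, dy):
    """dw = X^T @ dy: only the optimizer needs this, so it can be deferred."""
    return matmul(transpose(x), dy)


x = [[1.0, 2.0], [3.0, 4.0]]             # input, shape (batch=2, in=2)
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]   # weight, shape (in=2, out=3)
dy = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]  # upstream gradient, shape (2, 3)

dx = backward_input(dy, w)    # computed first and sent upstream
dw = backward_weight(x, dy)   # can run later, in an otherwise idle slot
print(dx)  # [[3.0, 4.0], [0.0, 1.0]]
print(dw)  # [[1.0, 4.0, 1.0], [2.0, 6.0, 2.0]]
```

Because dw does not feed into dx, a scheduler is free to delay it until a stage would otherwise idle.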
DualPipe: Overlapping EP Communication with Computation
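DualPipe builds on ZB1P with a bidirectional schedule: micro‑batches are fed into the pipeline from both ends, so each stage works on chunks travelling in opposite directions. While a chunk from one direction is computing, the EP all‑to‑all dispatch and combine of a chunk from the other direction run in parallel, effectively hiding EP traffic behind computation.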
Trade‑off: parameter memory doubles because each rank holds two copies of the parameters.
Waved‑EP in DeepSeek‑V4
Waved‑EP is a kernel that removes dependence on PP scheduling. Experts are divided into several “waves”; each wave’s dispatch, compute, and combine stages are overlapped with the next wave’s dispatch, achieving communication‑compute overlap within a single kernel.
Reported speed‑ups versus the baseline (Comet): 1.96× for reinforcement‑learning workloads and 1.50–1.73× for general workloads.
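The benefit of wave‑level overlap can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the Waved‑EP kernel: it assumes one communication lane and one compute lane that run concurrently, identical per‑wave costs, and arbitrary time units. It makes no attempt to reproduce the reported speed‑ups.

```python
def serial_time(waves: int, dispatch: float, compute: float, combine: float) -> float:
    """No overlap: every wave runs dispatch, compute, combine back to back."""
    return waves * (dispatch + compute + combine)


def waved_time(waves: int, dispatch: float, compute: float, combine: float) -> float:
    """Toy schedule: the next wave's dispatch overlaps the current wave's compute."""
    comm_free = dispatch          # communication lane busy with wave 0's dispatch
    dispatch_done = comm_free     # time at which the current wave's tokens arrive
    compute_free = 0.0            # time at which the compute lane is next idle
    for i in range(waves):
        # Compute starts once tokens have arrived and the compute lane is idle.
        compute_free = max(compute_free, dispatch_done) + compute
        # Prefetch: dispatch the next wave while this wave is computing.
        if i + 1 < waves:
            comm_free += dispatch
            dispatch_done = comm_free
        # Combine needs this wave's compute result and a free communication lane.
        comm_free = max(comm_free, compute_free) + combine
    return comm_free


# Hypothetical per-wave costs in arbitrary time units.
print(serial_time(4, 1.0, 3.0, 1.0))  # 20.0
print(waved_time(4, 1.0, 3.0, 1.0))   # 14.0
```

In this compute‑bound example, only the first dispatch and the last combine remain exposed; with a single wave the two functions return the same value, since there is nothing to overlap.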
Implementation Details
DeepSeek allocates 20 SMs per H800 GPU exclusively for custom PTX‑level communication kernels; the remaining SMs handle computation. These kernels are not NCCL‑based, enabling fine‑grained control.
Triton is not designed for communication‑compute fusion. Its abstraction focuses on high‑performance GEMM/element‑wise kernels within a single GPU and lacks robust support for cross‑GPU primitives such as NVSHMEM. TileLang, used for Waved‑EP, provides both high‑level DSL productivity and low‑level PTX/CUDA control.
Hardware Implications
With DP + ZeRO‑1 + EP + PP, DeepSeek can train models exceeding 1 T parameters on clusters of 8‑GPU nodes interconnected via InfiniBand. The authors suggest future hardware prioritize balanced compute‑communication overlap rather than merely increasing raw bandwidth.
Open Questions
How to split matrices when Muon, which operates on whole weight matrices, prevents ZeRO‑1 from sharding optimizer state element by element.
DualPipe‑V: removing redundant parameters while keeping activation memory comparable to 1F1B.
Integrating CP with attention compression techniques.
References
1. https://arxiv.org/pdf/2512.24880
2. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
3. https://arxiv.org/pdf/2412.1943
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.