How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

Data Party THU

GPU Parallelism Fundamentals

Data Parallel (DP) replicates the full model on each GPU and aggregates gradients. When model size exceeds a single GPU’s memory, parameters, gradients, optimizer states, and activations must be sharded, introducing communication overhead.

For a 10 B‑parameter model in FP16:

Parameters: 20 GB

Gradients: 20 GB

AdamW optimizer states (two 32‑bit moments + master weights): 80 GB + 40 GB = 120 GB

Baseline memory demand ≈ 160 GB without activations, already about twice the 80 GB of a single H800; activations push the requirement higher still, necessitating distributed strategies.
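The arithmetic above can be reproduced with a short sketch. It assumes the byte counts stated in the list (FP16 weights and gradients, two FP32 moments, an FP32 master copy) and decimal gigabytes; the function name is illustrative.

```python
GB = 10**9  # decimal gigabytes, matching the figures in the text

def model_state_bytes(num_params: int) -> dict:
    """Per-replica model-state memory for mixed-precision AdamW training."""
    params = 2 * num_params        # FP16 weights, 2 bytes each
    grads = 2 * num_params         # FP16 gradients, 2 bytes each
    moments = 2 * 4 * num_params   # two FP32 Adam moments, 4 bytes each
    master = 4 * num_params        # FP32 master copy of the weights
    return {
        "parameters": params,
        "gradients": grads,
        "optimizer_states": moments + master,
        "total": params + grads + moments + master,
    }

breakdown = model_state_bytes(10 * 10**9)  # 10 B parameters
for name, size in breakdown.items():
    print(f"{name:>16}: {size // GB} GB")
```

This works out to 16 bytes per parameter, of which 12 bytes are optimizer state.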

Common Parallel Strategies

ZeRO‑1: shards optimizer state only.

ZeRO‑2: shards optimizer state and gradients.

ZeRO‑3 / FSDP: shards model parameters as well.

Tensor Parallel (TP): splits tensors across GPUs; high communication volume, best on NVLink domains.

Pipeline Parallel (PP): distributes Transformer layers across GPU groups; low communication volume, suitable for cross‑node execution.

Expert Parallel (EP): native to MoE; experts placed on different GPUs, tokens routed via all‑to‑all, results combined; consumes significant InfiniBand (IB) bandwidth.

Context Parallel (CP): splits the sequence dimension inside attention for long‑context pre‑training.
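The dispatch, compute, and combine steps of expert parallelism can be sketched in plain Python. This is a single-process toy that assumes top-1 routing and contiguous expert placement; the "GPUs" are just dictionary buckets standing in for the two all-to-all exchanges, and all names are hypothetical.

```python
from collections import defaultdict

def dispatch(tokens, expert_ids, experts_per_gpu):
    """Group tokens by the GPU hosting their expert (stands in for the first all-to-all)."""
    buckets = defaultdict(list)
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid // experts_per_gpu].append((idx, eid, tok))
    return buckets

def run_experts(buckets, expert_fns):
    """Each GPU applies its local experts to the tokens it received."""
    results = {}
    for items in buckets.values():
        for idx, eid, tok in items:
            results[idx] = expert_fns[eid](tok)
    return results

def combine(results, num_tokens):
    """Return outputs in the original token order (stands in for the second all-to-all)."""
    return [results[i] for i in range(num_tokens)]

# Four toy experts, two per GPU; expert k multiplies its input by k + 1.
expert_fns = [lambda x, k=k: x * (k + 1) for k in range(4)]
tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [0, 3, 1, 2]  # router decisions, assumed given

buckets = dispatch(tokens, expert_ids, experts_per_gpu=2)
outputs = combine(run_experts(buckets, expert_fns), len(tokens))
print(outputs)  # [1.0, 8.0, 6.0, 12.0]
```

In a real system every token crosses the network twice, once in each direction, which is why EP is the dominant consumer of IB bandwidth.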

Composability of Parallelisms

Parallelisms are orthogonal and can be stacked. 3‑D parallelism combines DP × TP × PP, so the total GPU count is N = DP_total × TP × PP. EP can be overlaid on DP, giving DP_total = EP × DP_replica.

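DeepSeek‑V3 uses 2048 H800 GPUs with PP=16 and EP=64 and no tensor parallelism (TP=1), yielding DP_total = 2048 / 16 = 128 and DP_replica = 128 / 64 = 2. The global batch size is GBS = MBS × DP_total × grad_accum_steps, where MBS is the micro‑batch size per DP rank.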
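A small helper makes the bookkeeping explicit. The GPU count, PP, and EP values come from the text; TP = 1 is assumed, and the micro-batch size and accumulation steps in the last call are placeholder values, not DeepSeek's settings.

```python
def parallel_layout(num_gpus: int, pp: int, ep: int, tp: int = 1):
    """Return (DP_total, DP_replica) from N = DP_total * TP * PP and DP_total = EP * DP_replica."""
    if num_gpus % (tp * pp) != 0:
        raise ValueError("num_gpus must be divisible by TP * PP")
    dp_total = num_gpus // (tp * pp)
    if dp_total % ep != 0:
        raise ValueError("DP_total must be divisible by EP")
    return dp_total, dp_total // ep

def global_batch_size(mbs: int, dp_total: int, grad_accum_steps: int) -> int:
    """GBS = MBS * DP_total * grad_accum_steps."""
    return mbs * dp_total * grad_accum_steps

dp_total, dp_replica = parallel_layout(num_gpus=2048, pp=16, ep=64)
print(dp_total, dp_replica)  # 128 2

# Placeholder values for illustration only.
print(global_batch_size(mbs=2, dp_total=dp_total, grad_accum_steps=8))  # 2048
```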

Why DeepSeek Chooses DP + ZeRO‑1 + EP + PP

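ZeRO‑3 must all‑gather each layer's parameters before they are used, in every micro‑batch, which puts GB‑scale traffic on InfiniBand and competes with EP's all‑to‑all. ZeRO‑2 shards gradients, so under gradient accumulation they have to be reduce‑scattered after every micro‑batch rather than once per optimizer step. ZeRO‑1 adds no communication during the forward or backward pass: gradients are reduced once per step and the updated parameters are all‑gathered after the optimizer step, which matches the communication volume of vanilla DP. It still shards the optimizer states, the largest part of the model state (12 of the 16 bytes per parameter in the example above), so per‑GPU model‑state memory shrinks by up to 4×.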
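To see what each ZeRO stage buys in memory, the sketch below divides the sharded portion of the model state by the DP group size. It assumes the 2 + 2 + 12 bytes-per-parameter split from the 10 B example and ignores activations and buffers; the function name is illustrative.

```python
def zero_bytes_per_gpu(num_params: int, dp_size: int, stage: int) -> float:
    """Per-GPU model-state bytes under ZeRO stage 0 (plain DP), 1, 2, or 3."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 moments + FP32 master weights
    if stage >= 1:
        optim /= dp_size      # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= dp_size      # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= dp_size     # ZeRO-3 also shards parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(10 * 10**9, dp_size=128, stage=stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.2f} GB per GPU")
```

With a DP group of 128, ZeRO-1 takes the 10 B model from 160 GB to about 41 GB per GPU, close to the 4× limit, without adding any per-micro-batch communication.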

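Thus IB bandwidth is reserved for EP, while PP handles cross‑node communication with minimal volume: per micro‑batch, each stage boundary transfers one activation tensor of B × seq_len × hidden_size elements at 2 bytes each, where B is the micro‑batch size.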
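The formula is easy to evaluate for a concrete shape. The values below (micro-batch size 1, sequence length 4096, hidden size 7168) are illustrative assumptions, as is the 2-byte element size.

```python
def pp_send_bytes(micro_batch: int, seq_len: int, hidden_size: int,
                  bytes_per_elem: int = 2) -> int:
    """Bytes sent across one pipeline-stage boundary for one micro-batch."""
    return micro_batch * seq_len * hidden_size * bytes_per_elem

activation = pp_send_bytes(micro_batch=1, seq_len=4096, hidden_size=7168)
print(f"{activation / 2**20:.0f} MiB per stage boundary")  # 56 MiB

# For comparison: all-gathering the FP16 weights of a 10 B-parameter model.
full_params = 2 * 10 * 10**9
print(f"{full_params / activation:.0f}x larger")
```

Under these assumptions a pipeline hop moves tens of megabytes, while a full-parameter all-gather moves tens of gigabytes.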

Pipeline Bubble and Mitigation

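PP introduces pipeline bubbles because stages sit idle while the pipeline fills and drains: later stages wait for the first forward activations, and earlier stages wait for the last gradients to flow back. The classic GPipe bubble ratio is (P‑1)/(M+P‑1), where P is the number of pipeline stages and M the micro‑batch count.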
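A quick calculation shows how the bubble shrinks as the micro-batch count grows. The stage and micro-batch counts are example values, chosen only to illustrate the formula.

```python
from fractions import Fraction

def bubble_ratio(num_stages: int, num_micro_batches: int) -> Fraction:
    """GPipe bubble ratio (P - 1) / (M + P - 1)."""
    return Fraction(num_stages - 1, num_micro_batches + num_stages - 1)

for m in (16, 64, 256):
    ratio = bubble_ratio(num_stages=16, num_micro_batches=m)
    print(f"P=16, M={m}: {float(ratio):.1%} idle")
```

Raising M lowers the ratio, but every extra in-flight micro-batch costs activation memory, which is the tension the techniques below address.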

Techniques:

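1F1B (one forward, one backward) starts each micro‑batch's backward pass as soon as its forward pass completes, so activations are freed earlier; this lowers peak memory and leaves room for more or larger micro‑batches.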

Zero‑Bubble PP (ZB1P) splits each backward layer into sub‑steps, enabling the gradient‑w.r.t. activation (dx) to proceed before the weight gradient (dw), thereby compressing bubbles.
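The memory advantage of 1F1B can be checked by counting how many micro-batches a stage holds activations for at once. The sketch below builds a simplified per-stage operation order (warm-up forwards, alternating steady state, cool-down backwards); it ignores timing and does not model the dx/dw split used by ZB1P.

```python
def gpipe_schedule(num_micro: int) -> list:
    """GPipe: all forwards first, then all backwards."""
    return ["F"] * num_micro + ["B"] * num_micro

def one_f_one_b_schedule(stage: int, num_stages: int, num_micro: int) -> list:
    """1F1B order of operations for one stage (0-indexed)."""
    warmup = min(num_stages - stage - 1, num_micro)
    ops = ["F"] * warmup
    for _ in range(num_micro - warmup):
        ops += ["F", "B"]          # steady state: one forward, one backward
    ops += ["B"] * warmup          # cool-down: drain remaining backwards
    return ops

def peak_in_flight(ops: list) -> int:
    """Maximum number of micro-batches whose activations are alive at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak

P, M = 4, 8
print("GPipe peak:", peak_in_flight(gpipe_schedule(M)))
for s in range(P):
    print(f"1F1B stage {s} peak:", peak_in_flight(one_f_one_b_schedule(s, P, M)))
```

Under this model GPipe keeps all M micro-batches alive, while 1F1B keeps at most P minus the stage index, independent of M.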

DualPipe: Overlapping EP Communication with Computation

DualPipe builds on ZB1P by feeding micro‑batches into the pipeline from both ends, so each stage processes two streams that flow in opposite directions. A forward chunk from one stream is paired with a backward chunk from the other, and the all‑to‑all communication of each chunk runs while the other chunk computes, which hides most of the EP traffic.

Trade‑off: parameter memory (not the whole footprint) doubles, because two copies of the model parameters are kept across the pipeline.
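The benefit of pairing a forward chunk with a backward chunk can be illustrated with an idealized timing model. It assumes each chunk's all-to-all can run fully in parallel with the other chunk's computation and that there is no interference between the two; the durations are arbitrary units, not measurements.

```python
def serial_time(f_comp, f_comm, b_comp, b_comm):
    """No overlap: every compute and communication phase runs back to back."""
    return f_comp + f_comm + b_comp + b_comm

def paired_time(f_comp, f_comm, b_comp, b_comm):
    """Idealized pairing: forward compute hides backward comm, and vice versa."""
    return max(f_comp, b_comm) + max(b_comp, f_comm)

# Arbitrary example durations for one forward/backward chunk pair.
durations = dict(f_comp=1.0, f_comm=1.0, b_comp=2.0, b_comm=1.0)
print("serial:", serial_time(**durations))   # 5.0
print("paired:", paired_time(**durations))   # 3.0
```

In this model communication is free whenever it is no longer than the computation it is paired with; real kernels also have to share SMs between the two.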

Waved‑EP in DeepSeek‑V4

Waved‑EP is a kernel that removes dependence on PP scheduling. Experts are divided into several “waves”; each wave’s dispatch, compute, and combine stages are overlapped with the next wave’s dispatch, achieving communication‑compute overlap within a single kernel.
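The effect of splitting experts into waves can be approximated with a three-stage pipeline model. This is a toy timing model, not the kernel: it assumes dispatch, compute, and combine each use an independent resource, that every wave has the same stage durations, and the numbers are arbitrary units.

```python
def serial_time(num_waves: int, stage_times) -> int:
    """Each wave finishes dispatch, compute, and combine before the next starts."""
    return num_waves * sum(stage_times)

def pipelined_time(num_waves: int, stage_times) -> int:
    """A wave enters a stage once it left the previous stage and the stage is free."""
    stage_free_at = [0] * len(stage_times)
    for _ in range(num_waves):
        ready = 0  # time at which this wave is ready for its next stage
        for s, duration in enumerate(stage_times):
            start = max(stage_free_at[s], ready)
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]

stage_times = (1, 2, 1)  # dispatch, compute, combine
print("serial:   ", serial_time(4, stage_times))     # 16
print("pipelined:", pipelined_time(4, stage_times))  # 10
```

With enough waves the total time approaches the number of waves times the slowest stage, so communication is hidden whenever compute is the bottleneck.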

Reported speed‑ups versus the baseline (Comet): 1.96× for reinforcement‑learning workloads and 1.50–1.73× for general workloads.

Implementation Details

DeepSeek allocates 20 SMs per H800 GPU exclusively for custom PTX‑level communication kernels; the remaining SMs handle computation. These kernels are not NCCL‑based, enabling fine‑grained control.

Triton is not designed for communication‑compute fusion. Its abstraction focuses on high‑performance GEMM/element‑wise kernels within a single GPU and lacks robust support for cross‑GPU primitives such as NVSHMEM. TileLang, used for Waved‑EP, provides both high‑level DSL productivity and low‑level PTX/CUDA control.

Hardware Implications

With DP + ZeRO‑1 + EP + PP, DeepSeek can train models exceeding 1 T parameters on clusters of 8‑GPU nodes connected via InfiniBand. The authors suggest future hardware prioritize balanced compute‑communication overlap rather than merely increasing raw bandwidth.

Open Questions

How to partition weight matrices under Muon, whose update operates on whole matrices and so prevents ZeRO‑1 from sharding optimizer state element‑wise.

DualPipe‑V: removing redundant parameters while keeping activation memory comparable to 1F1B.

Integrating CP with attention compression techniques.

References

1. https://arxiv.org/pdf/2512.24880
2. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
3. https://arxiv.org/pdf/2412.1943
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
