How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

Data Party THU

May 17, 2026

GPU Parallelism Fundamentals

Data Parallel (DP) replicates the full model on each GPU and aggregates gradients. When model size exceeds a single GPU’s memory, parameters, gradients, optimizer states, and activations must be sharded, introducing communication overhead.

For a 10 B‑parameter model in FP16:

Parameters: 20 GB

Gradients: 20 GB

AdamW optimizer states (two 32‑bit moments + master weights): 80 GB + 40 GB = 120 GB

Baseline memory demand ≈ 160 GB without activations; activations push the requirement beyond a single GPU, necessitating distributed strategies.
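The arithmetic above can be reproduced with a short sketch. The function name and the convention of 1 GB = 1e9 bytes are assumptions made for illustration; the byte counts per parameter follow the breakdown listed above.

```python
def model_state_gb(n_params: float) -> dict:
    """Model-state memory per component, in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    parts = {
        "params_fp16": 2 * n_params / gb,            # 2 bytes per parameter
        "grads_fp16": 2 * n_params / gb,             # 2 bytes per gradient
        "adam_moments_fp32": 2 * 4 * n_params / gb,  # two 4-byte moments
        "master_weights_fp32": 4 * n_params / gb,    # 4-byte master copy
    }
    parts["total"] = sum(parts.values())
    return parts


breakdown = model_state_gb(10e9)
for name, value in breakdown.items():
    print(f"{name:>20}: {value:6.1f} GB")
```

That is 16 bytes per parameter before any activation memory is counted.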

Common Parallel Strategies

ZeRO‑1 : shards optimizer state only.

ZeRO‑2 : shards optimizer state and gradients.

ZeRO‑3 / FSDP : shards model parameters as well.

Tensor Parallel (TP) : splits tensors across GPUs; high communication volume, best on NVLink domains.

Pipeline Parallel (PP) : distributes Transformer layers across GPU groups; low communication volume, suitable for cross‑node execution.

Expert Parallel (EP) : native to MoE; experts placed on different GPUs, tokens routed via all‑to‑all, results combined; consumes significant InfiniBand (IB) bandwidth.

Context Parallel (CP) : splits the sequence dimension inside attention for long‑context pre‑training.
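The EP entry above describes a dispatch, compute, combine pattern. The following single-process sketch imitates it with plain Python lists standing in for GPUs. It assumes top-1 routing and a hypothetical `expert_fn` that merely scales its input; real MoE layers use top-k routing with gate weights and a collective all-to-all.

```python
def expert_fn(expert_id, x):
    # Hypothetical stand-in for an expert FFN: scale by (expert_id + 1).
    return x * (expert_id + 1)


def moe_layer_ep(tokens_per_rank, routes_per_rank, experts_per_rank):
    """Simulate EP: dispatch tokens to expert ranks, compute, combine."""
    n_ranks = len(tokens_per_rank)

    # Dispatch (first all-to-all): send each token to the rank hosting its expert.
    inbox = [[] for _ in range(n_ranks)]
    for src, (tokens, routes) in enumerate(zip(tokens_per_rank, routes_per_rank)):
        for pos, (x, expert_id) in enumerate(zip(tokens, routes)):
            dst = expert_id // experts_per_rank
            inbox[dst].append((src, pos, expert_id, x))

    # Expert compute on the destination rank, then combine (second all-to-all):
    # each result returns to its source rank and original position.
    outputs = [[None] * len(tokens) for tokens in tokens_per_rank]
    for dst in range(n_ranks):
        for src, pos, expert_id, x in inbox[dst]:
            outputs[src][pos] = expert_fn(expert_id, x)
    return outputs


# Two ranks, two experts each: experts 0-1 live on rank 0, experts 2-3 on rank 1.
tokens = [[1.0, 2.0], [3.0]]
routes = [[3, 0], [1]]
result = moe_layer_ep(tokens, routes, experts_per_rank=2)
print(result)
```

Every token whose expert lives on another rank crosses the network twice, which is why EP traffic dominates the interconnect budget.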

Composability of Parallelisms

Parallelisms are orthogonal and can be stacked. A 3‑D parallelism combines DP × TP × PP where total GPU count N = DP_total × TP × PP. EP can be overlaid on DP, giving DP_total = EP × DP_replica.

DeepSeek‑V3 uses 2048 H800 GPUs with PP=16, EP=64, yielding DP_total=128 and DP_replica=2. Global batch size GBS = MBS × DP_total × grad_accum_steps.
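The layout figures above can be checked with a small sketch. It assumes TP = 1, which is what the quoted numbers imply (2048 / 16 = 128); the micro-batch size and accumulation steps in the last call are illustrative values, not figures from the article.

```python
def parallel_layout(n_gpus, pp, ep, tp=1):
    """Derive DP_total and DP_replica from N = DP_total x TP x PP."""
    if n_gpus % (pp * tp) != 0:
        raise ValueError("n_gpus must be divisible by pp * tp")
    dp_total = n_gpus // (pp * tp)
    if dp_total % ep != 0:
        raise ValueError("dp_total must be divisible by ep")
    return {"dp_total": dp_total, "dp_replica": dp_total // ep}


def global_batch_size(mbs, dp_total, grad_accum_steps):
    """GBS = MBS x DP_total x grad_accum_steps."""
    return mbs * dp_total * grad_accum_steps


layout = parallel_layout(n_gpus=2048, pp=16, ep=64)
print(layout)

# Illustrative values only: micro-batch size 2, 8 accumulation steps.
gbs = global_batch_size(mbs=2, dp_total=layout["dp_total"], grad_accum_steps=8)
print(gbs)
```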

Why DeepSeek Chooses DP + ZeRO‑1 + EP + PP

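ZeRO‑3 requires an all‑gather of the full parameter set each micro‑batch, consuming GB‑scale IB bandwidth and colliding with EP traffic. ZeRO‑2 keeps gradients sharded, so with gradient accumulation it typically needs a gradient reduce‑scatter on every micro‑batch, which also competes with EP. ZeRO‑1 adds no communication during the forward or backward pass: gradients are reduced once per optimizer step and the updated parameters are all‑gathered afterwards, matching the communication volume of vanilla DP. It still shards the optimizer state, the largest part of model‑state memory (120 GB of the 160 GB in the 10 B example above), so per‑GPU model‑state memory falls by up to roughly 4× at large DP degrees.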
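As a rough check on the memory side of this trade-off, the sketch below estimates per-GPU model-state memory for each ZeRO stage. It assumes the 16 bytes per parameter from the 10 B example (2 for parameters, 2 for gradients, 12 for optimizer state), ignores activations and buffers, and uses a hypothetical function name.

```python
def zero_model_state_gb(n_params, dp, stage):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DP) to 3."""
    params = 2 * n_params      # FP16 parameters
    grads = 2 * n_params       # FP16 gradients
    optim = 12 * n_params      # FP32 moments + FP32 master weights
    if stage >= 1:
        optim /= dp            # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= dp            # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= dp           # ZeRO-3 also shards parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(10e9, dp=128, stage=stage)
    print(f"ZeRO-{stage}: {gb:8.2f} GB per GPU")
```

With DP = 128, ZeRO‑1 alone brings the 10 B example from 160 GB to about 41 GB per GPU, because the optimizer state accounts for 12 of the 16 bytes per parameter.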

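Thus IB bandwidth is reserved for EP, while PP handles cross‑node communication with minimal volume: each stage boundary passes one activation tensor of B × seq_len × hidden_size × 2 bytes per micro‑batch, the factor 2 being the bytes per 16‑bit element.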
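To get a feel for the PP volume, the sketch below evaluates B × seq_len × hidden_size × 2. It assumes the factor 2 means bytes per 16-bit element, and the shape values are illustrative, not taken from the article.

```python
def pp_activation_bytes(micro_batch, seq_len, hidden_size, bytes_per_elem=2):
    """Bytes sent across one pipeline-stage boundary for one micro-batch."""
    return micro_batch * seq_len * hidden_size * bytes_per_elem


# Illustrative shape: 1 sequence of 4096 tokens with hidden size 7168.
size = pp_activation_bytes(micro_batch=1, seq_len=4096, hidden_size=7168)
print(f"{size / 2**20:.1f} MiB per micro-batch per stage boundary")
```

A gradient tensor of the same shape travels in the opposite direction during the backward pass.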

Pipeline Bubble and Mitigation

PP introduces pipeline bubbles: later stages sit idle until the first micro‑batch's activations reach them, and earlier stages sit idle while waiting for gradients to flow back. The classic GPipe bubble ratio is (P‑1)/(M+P‑1), where P is the number of pipeline stages and M the micro‑batch count.
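The bubble formula can be tabulated for a few micro-batch counts. The function name is an assumption; P = 16 mirrors the PP degree quoted earlier, and the M values are illustrative.

```python
def gpipe_bubble_ratio(stages, micro_batches):
    """Fraction of pipeline time spent idle: (P - 1) / (M + P - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)


for m in (16, 64, 256):
    ratio = gpipe_bubble_ratio(stages=16, micro_batches=m)
    print(f"P=16, M={m:3d}: bubble ratio = {ratio:.3f}")
```

The ratio only shrinks as M grows, which is why schedules that allow more in-flight micro-batches per unit of memory matter.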

Techniques:

1F1B (one forward, one backward) interleaves forward and backward passes so each stage frees activations earlier, lowering memory usage and leaving room for more or larger micro‑batches.

Zero‑Bubble PP (ZB1P) splits each backward layer into sub‑steps, enabling the gradient‑w.r.t. activation (dx) to proceed before the weight gradient (dw), thereby compressing bubbles.
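The dx/dw split can be illustrated for a single linear layer y = x·W. This is a minimal pure-Python sketch with hypothetical helper names, not the scheduler itself: it only shows that the two gradients are independent computations, so dx can be sent upstream immediately while dw is deferred to fill a bubble.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def backward_input(dy, w):
    """dx = dy @ W^T. On the critical path: the previous stage waits for it."""
    return matmul(dy, transpose(w))


def backward_weight(x, dy):
    """dW = x^T @ dy. Needed only by the optimizer step, so it can be deferred."""
    return matmul(transpose(x), dy)


x = [[1.0, 2.0]]                          # 1 x 2 input activation
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # 2 x 3 weight
dy = [[1.0, 1.0, 1.0]]                    # 1 x 3 upstream gradient

dx = backward_input(dy, w)    # computed first and passed to the previous stage
dw = backward_weight(x, dy)   # scheduled later, wherever the pipeline is idle
print(dx, dw)
```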

DualPipe: Overlapping EP Communication with Computation

DualPipe builds on ZB1P by feeding micro‑batches into the pipeline from both ends, so every stage processes two streams that travel in opposite directions. A forward chunk from one direction is paired with a backward chunk from the other, and the EP all‑to‑all of one chunk overlaps with the computation of the other, effectively hiding EP traffic.

Trade‑off: parameter memory doubles because each rank holds two copies of parameters, one for each direction.

Waved‑EP in DeepSeek‑V4

Waved‑EP is a kernel that removes dependence on PP scheduling. Experts are divided into several “waves”; each wave’s dispatch, compute, and combine stages are overlapped with the next wave’s dispatch, achieving communication‑compute overlap within a single kernel.
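The benefit of splitting experts into waves can be estimated with a toy timing model. This is not the Waved‑EP kernel: it is a sketch that treats dispatch, compute, and combine as three independent resources with fixed, hypothetical per-wave durations, and it ignores the fact that dispatch and combine share link bandwidth.

```python
def serial_time(waves, dispatch, compute, combine):
    """No overlap: every wave runs its three stages back to back."""
    return waves * (dispatch + compute + combine)


def waved_time(waves, dispatch, compute, combine):
    """Pipelined waves: a stage starts once the same wave's previous stage
    and the previous wave's same stage have both finished."""
    durations = (dispatch, compute, combine)
    stage_free_at = [0.0] * len(durations)  # when each stage finished its last wave
    for _ in range(waves):
        t = 0.0
        for s, duration in enumerate(durations):
            start = max(t, stage_free_at[s])
            t = start + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


# Hypothetical durations in arbitrary time units.
base = serial_time(4, dispatch=1.0, compute=2.0, combine=1.0)
overlapped = waved_time(4, dispatch=1.0, compute=2.0, combine=1.0)
print(base, overlapped, base / overlapped)
```

In this model, once compute is the longest stage, only the first dispatch and the last combine remain exposed.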

Reported speed‑ups versus the baseline (Comet): 1.96× for reinforcement‑learning workloads and 1.50–1.73× for general workloads.

Implementation Details

DeepSeek allocates 20 SMs per H800 GPU exclusively for custom PTX‑level communication kernels; the remaining SMs handle computation. These kernels are not NCCL‑based, enabling fine‑grained control.

Triton is not designed for communication‑compute fusion. Its abstraction focuses on high‑performance GEMM/element‑wise kernels within a single GPU and lacks robust support for cross‑GPU primitives such as NVSHMEM. TileLang, used for Waved‑EP, provides both high‑level DSL productivity and low‑level PTX/CUDA control.

Hardware Implications

With DP + ZeRO‑1 + EP + PP, DeepSeek can train models exceeding 1 T parameters on clusters of 8‑GPU nodes connected via InfiniBand. The authors suggest future hardware prioritize balanced compute‑communication overlap rather than merely increasing raw bandwidth.

Open Questions

How to shard optimizer work under Muon, whose update operates on whole weight matrices and therefore conflicts with ZeRO‑1's element‑wise sharding.

DualPipe‑V: removing redundant parameters while keeping activation memory comparable to 1F1B.

Integrating CP with attention compression techniques.

References

1. https://arxiv.org/pdf/2512.24880
2. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
3. https://arxiv.org/pdf/2412.1943

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
