Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy—combining DP, ZeRO‑1, PP, EP and the new Waved‑EP kernel—overlaps communication and computation to reclaim throughput on 8‑card NVLink nodes linked by InfiniBand.


Recent models have moved from dense to Mixture‑of‑Experts (MoE) architectures, causing a sharp drop in Model FLOPs Utilization (MFU). The drop is not due to small expert dimensions (the split dimension is at least 1024); it stems from the demands of GPU distributed parallelism and the need to hide communication behind computation.
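For reference, MFU is the ratio of achieved training FLOPs to the hardware's peak. A minimal sketch using the standard ~6·params FLOPs‑per‑token estimate for a dense Transformer; the throughput and peak numbers below are illustrative, not from the paper:

```python
def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved training FLOPs over hardware peak.

    Uses the common ~6 * params FLOPs-per-token estimate for a dense
    Transformer forward + backward pass (attention FLOPs ignored).
    """
    achieved_flops_per_sec = 6 * params * tokens_per_sec
    return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)

# Illustrative numbers only: a 10B-parameter model on 8 GPUs with a
# ~989 TFLOPS dense BF16 peak each, training at 60k tokens/s.
print(f"MFU = {mfu(10e9, 60_000, 8, 989e12):.1%}")  # MFU = 45.5%
```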

DeepSeek Parallel Strategy Overview

DeepSeek‑V4 inherits the scalable infrastructure of DeepSeek‑V3. V3 used 2048 H800 GPUs with 8‑card NVLink interconnects and a parallel configuration of 16‑way Pipeline Parallel (PP) × 64‑way Expert Parallel (EP) × ZeRO‑1 Data Parallel (DP).

NVLink provides roughly 160 GB/s of intra‑node bandwidth per H800 GPU (the export‑limited variant of the H100, whose full NVLink reaches ~900 GB/s), while InfiniBand (IB) between nodes offers ~50 GB/s. EP's all‑to‑all traffic consumes IB bandwidth, making it the primary communication bottleneck and driving the need for sophisticated overlap techniques.

Fundamentals of GPU Parallelism

Data Parallel (DP) replicates the full model on each GPU, processes different data batches, and aggregates gradients. When model size exceeds a single GPU’s memory, parameters, gradients, optimizer states, and activations must be partitioned, introducing communication.

For a 10 B‑parameter model in FP16, parameters occupy 20 GB and gradients another 20 GB, while AdamW optimizer states require 80 GB (two FP32 moments per parameter). Adding FP32 master weights (40 GB) brings the baseline memory demand to 160 GB, before counting activations. Even 3–5 B‑parameter models therefore already need distributed strategies beyond naïve DP.
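The accounting above can be reproduced mechanically. A small sketch using the bytes‑per‑parameter constants implied by the paragraph (FP16 weights and gradients, FP32 AdamW moments and master weights); the ZeRO‑1 sharding term is added for illustration and anticipates the discussion below:

```python
def training_memory_gb(params: float, zero1_shards: int = 1) -> dict:
    """Per-GPU memory (GB) for mixed-precision AdamW training.

    FP16 params (2 B/param) + FP16 grads (2 B/param) + FP32 AdamW
    moments (8 B/param) + FP32 master weights (4 B/param). Under
    ZeRO-1, only the optimizer states (moments + master weights)
    are sharded across the DP group.
    """
    b = params / 1e9  # GB contributed by each byte-per-parameter
    opt = (8 + 4) * b / zero1_shards
    return {"params": 2 * b, "grads": 2 * b, "optimizer": opt,
            "total": 4 * b + opt}

print(training_memory_gb(10e9))                   # total: 160 GB, as above
print(training_memory_gb(10e9, zero1_shards=64))  # total: ~41.9 GB
```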

Common Parallel Strategies

ZeRO‑1: shards optimizer state only.

ZeRO‑2: shards optimizer state and gradients.

ZeRO‑3 / Fully Sharded Data Parallel (FSDP): shards model parameters as well.

Tensor Parallel (TP): splits tensors across GPUs, performing both storage and compute partitioning.

Pipeline Parallel (PP): assigns different Transformer layers to separate GPU groups, creating a pipeline with low inter‑node communication.

Expert Parallel (EP): MoE‑specific; routes tokens to expert GPUs via all‑to‑all, then combines results. Communication is heavy and runs over IB.

Context Parallel (CP): splits the sequence dimension of attention, used for long‑context pre‑training.

Composability of Strategies

When N GPUs are available, the strategies are largely orthogonal and can be stacked, e.g., 3‑D parallelism = DP × TP × PP, where the product of the degrees equals N. EP can piggyback on the DP group, reusing the same GPUs rather than claiming its own. In DeepSeek's 2048‑GPU setup, PP=16 leaves DP_total = 2048/16 = 128, and EP=64 inside that group gives DP_replica = 128/64 = 2 full copies of the expert set; the global batch size is GBS = MBS × DP_total × grad_accum_steps.
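The same bookkeeping in code; the micro‑batch size and gradient‑accumulation steps are illustrative placeholders, not values from the paper:

```python
# Degree bookkeeping for the 2048-GPU DeepSeek-V3 configuration.
total_gpus = 2048
pp = 16                        # pipeline stages
dp_total = total_gpus // pp    # 128 ZeRO-1 data-parallel ranks
ep = 64                        # expert parallelism nested inside DP
dp_replica = dp_total // ep    # 2 full replicas of the expert set

mbs, grad_accum_steps = 1, 32  # illustrative placeholders
gbs = mbs * dp_total * grad_accum_steps
print(dp_total, dp_replica, gbs)  # 128 2 4096
```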

DP ZeRO‑1 + EP/PP Combination

DeepSeek chooses ZeRO‑1 over ZeRO‑2/3 to avoid IB contention. ZeRO‑3 must all‑gather the full parameters for every micro‑batch (GB‑scale traffic), which would compete with EP for IB bandwidth, and ZeRO‑2's per‑micro‑batch reduce‑scatter of gradients likewise incurs GB‑scale traffic. ZeRO‑1 communicates only at optimizer steps, matching the communication volume of plain DP while sharding the optimizer states, the largest memory consumer at 12 bytes per parameter (AdamW moments plus FP32 master weights), across DP ranks, making it effectively free.
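To make the contention argument concrete, a rough per‑rank traffic estimate for the 10 B example, ignoring overlap, ring‑topology factors, and the exact accumulation schedule:

```python
def zero_traffic_gb(params: float, micro_batches: int) -> dict:
    """Very rough per-rank, per-optimizer-step traffic (GB), FP16 tensors.

    ZeRO-3 all-gathers full params for every micro-batch's forward and
    backward; ZeRO-2 reduce-scatters grads every micro-batch; ZeRO-1
    only reduces grads and re-gathers params once per optimizer step.
    """
    fp16_gb = 2 * params / 1e9
    return {
        "zero3_param_allgather": 2 * fp16_gb * micro_batches,
        "zero2_grad_reducescatter": fp16_gb * micro_batches,
        "zero1_per_step": 2 * fp16_gb,
    }

print(zero_traffic_gb(10e9, micro_batches=32))
# {'zero3_param_allgather': 1280.0, 'zero2_grad_reducescatter': 640.0,
#  'zero1_per_step': 40.0}
```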

Consequently, the final strategy is PP + EP + DP‑ZeRO‑1, without TP or ZeRO‑2/3, preserving IB bandwidth for EP and enabling efficient training on modest hardware.

Pipeline Parallel Bubble Issues

PP's communication volume is small, so the main inefficiency is pipeline bubbles. The naïve GPipe schedule idles for a fraction (P−1)/(M+P−1) of the time, where P is the number of PP stages and M the micro‑batch count. 1F1B (one forward, one backward) interleaves forward and backward passes so activations are freed earlier, cutting peak memory and permitting more in‑flight micro‑batches (larger M), which shrinks the bubble. Zero‑Bubble PP (ZB1P) goes further by splitting the backward pass: input gradients (dx) are computed early to unblock upstream stages, while weight‑gradient (dw) computation is deferred into what would otherwise be bubble time, at the cost of holding activations longer.
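Plugging numbers into the GPipe bubble formula shows why raising M matters; a minimal sketch:

```python
def gpipe_bubble(p: int, m: int) -> float:
    """Idle fraction of a naive GPipe schedule with p stages and
    m micro-batches: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

for m in (4, 16, 64, 256):
    print(f"P=16, M={m:>3}: bubble = {gpipe_bubble(16, m):.1%}")
# 78.9%, 48.4%, 19.0%, 5.5% -- more micro-batches shrink the bubble
```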

DualPipe for Overlap

DeepSeek‑V3 introduced DualPipe, building on ZB1P. DualPipe feeds micro‑batches into the pipeline from both ends, so during the steady state each PP stage holds one forward and one backward chunk at a time; their computation is interleaved, and EP's all‑to‑all communication is hidden behind the compute of the paired chunk.

Implementation details: on H800 GPUs (132 SMs each), 20 SMs are dedicated to communication kernels while the remaining 112 run compute kernels. This departs from the traditional SM‑shared NCCL model and relies on custom PTX‑level communication kernels.

Waved‑EP in V4

Waved‑EP is a new EP compute‑communication overlap kernel introduced in DeepSeek‑V4. It partitions experts into several “waves”; each wave performs dispatch, computation, and combine in a pipelined fashion, so that one wave's expert computation overlaps with another wave's dispatch or combine traffic.

Motivation: in small‑batch RL or inference scenarios, DualPipe's overlap falls short because inference has no backward pass to pair with forward communication. Benchmarks report a 1.96× speed‑up in RL workloads and 1.50–1.73× in general workloads over the Comet baseline.

Waved‑EP is a mega‑kernel that fuses NVSHMEM dispatch, GEMM, and combine, a fusion Triton cannot express given its limited support for cross‑GPU communication. TileLang, the DSL used in the V4 paper (Wang et al., 2026), provides both high‑level productivity and low‑level PTX/CUDA control, making such a kernel feasible.
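The real kernel is the fused TileLang mega‑kernel described above; purely to illustrate the wave‑scheduling idea, here is a stream‑level sketch in PyTorch. The helpers dispatch, expert_gemm, and combine are toy stand‑ins, not DeepSeek's API, and a single‑GPU matmul substitutes for the grouped expert GEMM and NVSHMEM all‑to‑all:

```python
import torch

# Toy stand-ins for Waved-EP's fused stages (hypothetical names).
def dispatch(wave):              # stands in for the all-to-all dispatch
    return wave.contiguous()

def expert_gemm(buf, experts):   # stands in for the grouped expert GEMM
    return buf @ experts

def combine(outputs):            # stands in for the all-to-all combine
    return torch.cat(outputs, dim=0)

def waved_ep_forward(tokens, experts, num_waves=4):
    """Wave pipelining sketch: wave i's GEMM overlaps with later waves'
    dispatch traffic by using separate communication and compute streams."""
    comm, compute = torch.cuda.Stream(), torch.cuda.Stream()
    dispatched, ready, outputs = [], [], []

    for wave in tokens.chunk(num_waves):
        with torch.cuda.stream(comm):       # queue dispatch on comm stream
            dispatched.append(dispatch(wave))
            ready.append(comm.record_event())

    for buf, evt in zip(dispatched, ready):
        compute.wait_event(evt)             # wait only for THIS wave's dispatch
        with torch.cuda.stream(compute):
            outputs.append(expert_gemm(buf, experts))

    torch.cuda.current_stream().wait_stream(compute)
    return combine(outputs)

if torch.cuda.is_available():
    x = torch.randn(4096, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")
    print(waved_ep_forward(x, w).shape)     # torch.Size([4096, 1024])
```

The per‑wave event is what lets a wave's GEMM start the moment its own dispatch lands, instead of waiting for the whole communication stream to drain.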

Overall, with careful parallel‑strategy design and kernel‑level overlap, nodes of eight NVLink‑connected cards joined by IB can train models exceeding 1 trillion parameters. The authors suggest that hardware designers prioritize balanced compute‑communication overlap capability rather than merely increasing raw bandwidth.

References

1. https://arxiv.org/pdf/2512.24880
2. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
3. https://arxiv.org/pdf/2412.1943