Tagged articles

GPU Distributed Training

1 articles · Page 1 of 1
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 6, 2026 · Artificial Intelligence

Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap

The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy—combining DP, ZeRO‑1, PP, EP and the new Waved‑EP kernel—overlaps communication and computation to reclaim throughput on 8‑card NVLink nodes linked by InfiniBand.

DeepSeek-V4Expert ParallelGPU Distributed Training
0 likes · 19 min read
Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap