Machine Learning Algorithms & Natural Language Processing
May 6, 2026 · Artificial Intelligence
Why DeepSeek‑V4’s MFU Drops: Parallel Strategies and Compute‑Communication Overlap
The article dissects DeepSeek‑V4’s shift from dense to MoE models, explains why MFU (Model FLOPs Utilization) plummets despite sufficient expert dimensions, and details how a carefully designed GPU parallel strategy that combines DP, ZeRO‑1, PP, EP, and the new Waved‑EP kernel overlaps communication with computation to reclaim throughput on 8‑GPU NVLink nodes linked by InfiniBand.
DeepSeek V4 · Expert Parallel · GPU Distributed Training
19 min read
