How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%
This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.
Background
DeepSeek V3 was released in December 2024 and was trained for roughly $5.57 million, prompting interest in its training efficiency.
Motivation
The goal is to compute the Model FLOPs Utilization (MFU) of DeepSeek V3 from publicly available information to provide a benchmark for future work.
Parameter Specification
dim = 7168
inter_dim = 18432
moe_inter_dim = 2048
n_layers = 61
n_dense_layers = 3
n_heads = 128
n_routed_experts = 256
n_shared_experts = 1
n_activated_experts = 8
q_lora_rank = 1536
kv_lora_rank = 512
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128
Forward FLOP Formulas
Define qk_head_dim = qk_nope_head_dim + qk_rope_head_dim.
Q projection (down + up):
flops = 2 * bs * seq_len * dim * q_lora_rank + 2 * bs * seq_len * q_lora_rank * n_heads * qk_head_dim
KV down projection:
flops += 2 * bs * seq_len * dim * (kv_lora_rank + qk_rope_head_dim)
KV up projection:
flops += 2 * bs * seq_len * kv_lora_rank * n_heads * (qk_nope_head_dim + v_head_dim)
Score (Q×Kᵀ), halved because causal masking computes only half the score matrix:
flops += 2 * bs * seq_len * seq_len * n_heads * qk_head_dim / 2
Score×V, likewise halved:
flops += 2 * bs * seq_len * seq_len * n_heads * v_head_dim / 2
Output projection (Wo):
flops += 2 * bs * seq_len * n_heads * v_head_dim * dim
Additional FLOPs are computed for the MoE, MLP, embedding, and LM-head layers:
MoE forward (per activated expert):
flops += 2 * bs * seq_len * dim * moe_inter_dim * 3 + 2 * bs * seq_len * moe_inter_dim
MLP forward:
flops += 2 * bs * seq_len * dim * inter_dim * 3 + 2 * bs * seq_len * inter_dim
Embedding:
flops += 2 * bs * seq_len * dim
LM head (single head, no MTP):
flops += 2 * bs * seq_len * dim * vocab_size
MFU Computation
Backward FLOPs are approximated as twice the forward FLOPs, excluding the recomputation cost of the attention backward pass. DeepSeek V3 has 61 layers (3 dense, 58 MoE) and a 4K-token context length. Summing FLOPs over all layers for 1 T tokens yields flops_per_1T_tokens.
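The per-token forward cost implied by the formulas and layer counts above can be collected into one function. This is a minimal sketch: the nine activated experts per MoE layer (8 routed + 1 shared) and vocab_size = 129280 are taken from the public DeepSeek-V3 config rather than stated in this article.

```python
def v3_fwd_flops_per_token(seq_len: int = 4096) -> float:
    """Approximate forward FLOPs per token for DeepSeek V3."""
    dim, n_layers, n_dense_layers = 7168, 61, 3
    inter_dim, moe_inter_dim = 18432, 2048
    n_heads = 128
    q_lora_rank, kv_lora_rank = 1536, 512
    qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    n_act_experts = 8 + 1       # 8 routed + 1 shared expert per token (V3 config)
    vocab_size = 129280         # from the public DeepSeek-V3 config

    # Attention (MLA) per token: projections plus causal-masked score terms
    attn = 2 * dim * q_lora_rank + 2 * q_lora_rank * n_heads * qk_head_dim
    attn += 2 * dim * (kv_lora_rank + qk_rope_head_dim)              # KV down
    attn += 2 * kv_lora_rank * n_heads * (qk_nope_head_dim + v_head_dim)  # KV up
    attn += seq_len * n_heads * (qk_head_dim + v_head_dim)  # QK^T and SV, /2 each
    attn += 2 * n_heads * v_head_dim * dim                  # output projection Wo

    # Feed-forward: dense MLP for 3 layers, MoE (per activated expert) otherwise
    mlp = 2 * dim * inter_dim * 3 + 2 * inter_dim
    moe = (2 * dim * moe_inter_dim * 3 + 2 * moe_inter_dim) * n_act_experts

    total = n_layers * attn
    total += n_dense_layers * mlp + (n_layers - n_dense_layers) * moe
    total += 2 * dim + 2 * dim * vocab_size                 # embedding + LM head
    return total
```

At a 4K context this comes to roughly 83 GFLOPs per token forward; with backward approximated as 2× forward, training costs roughly 250 GFLOPs per token.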
GPU time comes from the reported 2.664 M GPU-hours on H800s; since the H800's BF16 peak matches the H100's, the H100-equivalent accounting is direct:
gpu_seconds = 2.664e6 * 3600  # total GPU-seconds
H100_peak_bf16_flops = 989.5e12  # 989.5 TFLOPS
MFU = flops_per_1T_tokens * 14.8 / (gpu_seconds * H100_peak_bf16_flops)
Results
The original script (https://github.com/feifeibear/DPSKV3MFU/blob/main/dpskv3_flops.py) reports an MFU of 37.2%. After correcting unit inconsistencies and recalibrating the GPU-hour conversion, the refined MFU is 36.2%. An alternative upper-bound estimate (using a simplified attention FLOP model) yields 39.0%.
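As an independent sanity check, the standard 6·N FLOPs-per-token approximation (with N the activated parameter count, about 37 B in the V3 report) lands in the same range; it slightly underestimates because it ignores attention-score FLOPs.

```python
# Cross-check MFU with the standard 6*N FLOPs-per-token approximation.
n_activated = 37e9            # activated parameters per token (V3 report)
tokens = 14.8e12              # pre-training tokens
train_flops = 6 * n_activated * tokens  # forward + backward ~ 6*N per token

gpu_seconds = 2.664e6 * 3600  # reported H800 GPU-hours, in seconds
peak_flops = 989.5e12         # BF16 peak per GPU (H800 matches H100)

mfu = train_flops / (gpu_seconds * peak_flops)
print(f"MFU ~ {mfu:.1%}")
```

This prints roughly 34.6%, consistent with the 36–39% figures once attention-score FLOPs are added back.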
Comparison with DeepSeek V2
Using the same cluster assumptions, DeepSeek V2's MFU is estimated at ≈21% (the author's scaled throughput figure is ≈121, versus ≈196 for V3). The ratio 196/121 ≈ 1.61 gives V3 a roughly 61% efficiency improvement, reflecting engineering optimizations in the HAI-LLM framework.
Conclusion
DeepSeek V3 achieves an MFU of around 36–37%, a substantial efficiency gain over its predecessor. The formulas and the open-source Python script enable reproducible MFU calculations for other large-scale language models.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.