How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%
This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.
Background
DeepSeek V3 was released in December 2024 and was trained for roughly $5.57 million, prompting interest in its training efficiency.
Motivation
The goal is to compute the Model FLOPs Utilization (MFU) of DeepSeek V3 from publicly available information to provide a benchmark for future work.
Parameter Specification
dim = 7168
inter_dim = 18432
moe_inter_dim = 2048
n_layers = 61
n_dense_layers = 3
n_heads = 128
n_routed_experts = 256
n_shared_experts = 1
n_activated_experts = 8
q_lora_rank = 1536
kv_lora_rank = 512
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128
Forward FLOP Formulas
Define qk_head_dim = qk_nope_head_dim + qk_rope_head_dim.
Q projection (down + up):
flops = 2 * bs * seq_len * dim * q_lora_rank + 2 * bs * seq_len * q_lora_rank * n_heads * qk_head_dim
KV down projection:
flops += 2 * bs * seq_len * dim * (kv_lora_rank + qk_rope_head_dim)
KV up projection:
flops += 2 * bs * seq_len * kv_lora_rank * n_heads * (qk_nope_head_dim + v_head_dim)
Score (Q×Kᵀ), halved because causal masking computes only half the score matrix:
flops += 2 * bs * seq_len * seq_len * n_heads * qk_head_dim / 2
Score×V, likewise halved:
flops += 2 * bs * seq_len * seq_len * n_heads * v_head_dim / 2
Output projection (Wo):
flops += 2 * bs * seq_len * n_heads * v_head_dim * dim
Additional FLOPs are computed for the MoE, MLP, embedding, and LM-head layers:
MoE forward (per activated expert):
flops += 2 * bs * seq_len * dim * moe_inter_dim * 3 + 2 * bs * seq_len * moe_inter_dim
MLP forward:
flops += 2 * bs * seq_len * dim * inter_dim * 3 + 2 * bs * seq_len * inter_dim
Embedding:
flops += 2 * bs * seq_len * dim
LM head (single head, no MTP):
flops += 2 * bs * seq_len * dim * vocab_size
MFU Computation
Backward FLOPs are approximated as twice the forward FLOPs, excluding the recomputation cost of the attention backward pass. DeepSeek V3 has 61 layers (3 dense, 58 MoE) and a 4K-token context length. Summing FLOPs over all layers for 1 T tokens yields flops_per_1T_tokens.
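The per-token forward cost implied by the formulas and layer counts above can be collected into one function. This is a minimal sketch: the nine activated experts per MoE layer (8 routed + 1 shared) and vocab_size = 129280 are taken from the public DeepSeek-V3 config rather than stated in this article.

```python
def v3_fwd_flops_per_token(seq_len: int = 4096) -> float:
    """Approximate forward FLOPs per token for DeepSeek V3."""
    dim, n_layers, n_dense_layers = 7168, 61, 3
    inter_dim, moe_inter_dim = 18432, 2048
    n_heads = 128
    q_lora_rank, kv_lora_rank = 1536, 512
    qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    n_act_experts = 8 + 1       # 8 routed + 1 shared expert per token (V3 config)
    vocab_size = 129280         # from the public DeepSeek-V3 config

    # Attention (MLA) per token: projections plus causal-masked score terms
    attn = 2 * dim * q_lora_rank + 2 * q_lora_rank * n_heads * qk_head_dim
    attn += 2 * dim * (kv_lora_rank + qk_rope_head_dim)              # KV down
    attn += 2 * kv_lora_rank * n_heads * (qk_nope_head_dim + v_head_dim)  # KV up
    attn += seq_len * n_heads * (qk_head_dim + v_head_dim)  # QK^T and SV, /2 each
    attn += 2 * n_heads * v_head_dim * dim                  # output projection Wo

    # Feed-forward: dense MLP for 3 layers, MoE (per activated expert) otherwise
    mlp = 2 * dim * inter_dim * 3 + 2 * inter_dim
    moe = (2 * dim * moe_inter_dim * 3 + 2 * moe_inter_dim) * n_act_experts

    total = n_layers * attn
    total += n_dense_layers * mlp + (n_layers - n_dense_layers) * moe
    total += 2 * dim + 2 * dim * vocab_size                 # embedding + LM head
    return total
```

At a 4K context this comes to roughly 83 GFLOPs per token forward; with backward approximated as 2× forward, training costs roughly 250 GFLOPs per token.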
GPU time comes from the reported 2.664 M GPU-hours on H800s; since the H800's BF16 peak matches the H100's, the H100-equivalent accounting is direct:
gpu_seconds = 2.664e6 * 3600  # total GPU-seconds
H100_peak_bf16_flops = 989.5e12  # 989.5 TFLOPS
MFU = flops_per_1T_tokens * 14.8 / (gpu_seconds * H100_peak_bf16_flops)
Results
The original script (https://github.com/feifeibear/DPSKV3MFU/blob/main/dpskv3_flops.py) reports an MFU of 37.2%. After correcting unit inconsistencies and recalibrating the GPU-hour conversion, the refined MFU is 36.2%. An alternative upper-bound estimate (using a simplified attention FLOP model) yields 39.0%.
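As an independent sanity check, the standard 6·N FLOPs-per-token approximation (with N the activated parameter count, about 37 B in the V3 report) lands in the same range; it slightly underestimates because it ignores attention-score FLOPs.

```python
# Cross-check MFU with the standard 6*N FLOPs-per-token approximation.
n_activated = 37e9            # activated parameters per token (V3 report)
tokens = 14.8e12              # pre-training tokens
train_flops = 6 * n_activated * tokens  # forward + backward ~ 6*N per token

gpu_seconds = 2.664e6 * 3600  # reported H800 GPU-hours, in seconds
peak_flops = 989.5e12         # BF16 peak per GPU (H800 matches H100)

mfu = train_flops / (gpu_seconds * peak_flops)
print(f"MFU ~ {mfu:.1%}")
```

This prints roughly 34.6%, consistent with the 36–39% figures once attention-score FLOPs are added back.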
Comparison with DeepSeek V2
Using the same cluster assumptions, DeepSeek V2's MFU is estimated at ≈21% (the author's scaled throughput figure is ≈121, versus ≈196 for V3). The ratio 196/121 ≈ 1.61 gives V3 a roughly 61% efficiency improvement, reflecting engineering optimizations in the HAI-LLM framework.
Conclusion
DeepSeek V3 achieves an MFU of around 36–37%, a substantial efficiency gain over its predecessor. The formulas and the open-source Python script enable reproducible MFU calculations for other large-scale language models.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.