Why DeepSeek V3’s FP8 Training Beats Traditional Schemes: A Deep Dive

This article provides a detailed technical analysis of FP8 training, comparing Nvidia’s TransformerEngine approach with DeepSeek V3’s novel scheme, and examines how block‑wise scaling, high‑precision accumulation, and vector length and correlation affect quantization error and signal‑to‑noise ratio in large‑language‑model training.

Baobao Algorithm Notes

1. FP8 Floating-Point Format

FP8 is an 8‑bit floating‑point format that follows IEEE‑754 conventions (a sign bit, exponent bits, and mantissa bits) but is not itself part of the IEEE‑754 standard; Nvidia first shipped hardware support with the H100 GPU in 2022. The evolution of floating‑point formats on Nvidia hardware is:

2016: P100 introduced FP16.

2017: V100 introduced Tensor Core for FP16 matrix multiplication.

2020: A100 introduced TF32 and bfloat16.

2022: H100 introduced FP8.

FP8 is expected to continue Nvidia’s “Huang’s Law” of 1000× GPU compute growth every ten years. Two FP8 variants exist: E4M3 (4 exponent bits, 3 mantissa bits), which trades dynamic range for precision, and E5M2 (5 exponent bits, 2 mantissa bits), which trades precision for range.

Two FP8 formats: E5M2 and E4M3
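The two formats’ dynamic ranges follow directly from the bit layouts. A minimal sketch (assuming the E4M3 “fn” variant used on H100, where the all‑ones exponent still encodes normal numbers and only the all‑ones mantissa combined with it encodes NaN):

```python
def fmt_extremes(exp_bits, man_bits, reclaim_top_exponent):
    """Largest finite value and smallest positive normal for an FP8 layout."""
    bias = 2 ** (exp_bits - 1) - 1
    top = (2 ** exp_bits - 1) - bias                # unbiased top exponent
    if reclaim_top_exponent:
        # E4M3 "fn": the top exponent still encodes normals; only the all-ones
        # mantissa with it is NaN, so the max significand stops one step short.
        max_sig = 2 - 2 * 2.0 ** -man_bits
    else:
        # IEEE-style (E5M2): the top exponent is reserved for inf/NaN.
        top -= 1
        max_sig = 2 - 2.0 ** -man_bits
    return max_sig * 2.0 ** top, 2.0 ** (1 - bias)

print(fmt_extremes(4, 3, True))    # E4M3 -> (448.0, 0.015625)
print(fmt_extremes(5, 2, False))   # E5M2 -> (57344.0, 6.103515625e-05)
```

The asymmetry is the whole trade‑off: E4M3 tops out at 448 but has twice the mantissa resolution of E5M2, whose range extends to 57344.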

2. Transformer Engine FP8 Scheme

Nvidia’s TransformerEngine provides FP8 implementations of Linear, Attention, LayerNorm, and other layers. Weights and gradients are stored in high precision; only the matrix multiplications run in FP8, with E4M3 in the forward pass and E5M2 in the backward pass. The library fuses operators to reduce memory pressure. Compared with BF16, FP8 yields roughly a 30 % end‑to‑end speedup but adds about 5 % extra memory for scaling factors. In principle, moving matrix multiplications to FP8 promises:

Performance roughly doubles.

Memory usage roughly halves in theory, though the measured saving is limited because weights and gradients stay in high precision.

Communication volume roughly halves.

However, the scheme carries accuracy risk (loss can diverge from the BF16 baseline by 2 % or more), a more complex compute flow, and the need to support both FP8 formats.
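The per‑tensor scaling at the heart of this scheme can be sketched as follows. Note this is a simplified round‑to‑nearest simulation, not TransformerEngine’s actual API: the real library uses a delayed‑scaling recipe driven by an amax history, whereas this sketch recomputes the scale from the current tensor, and the saturating clamp is an assumption of the sketch.

```python
import numpy as np

E4M3_MAX = 448.0

def quantize_e4m3(x):
    """Round-to-nearest simulation of FP8 E4M3 (saturating, ignoring NaN)."""
    x = np.asarray(x, dtype=np.float64)
    a = np.minimum(np.abs(x), E4M3_MAX)        # saturate at the format max
    e = np.maximum(np.floor(np.log2(np.maximum(a, 2.0 ** -9))), -6.0)
    step = 2.0 ** (e - 3)                      # 3 mantissa bits -> spacing 2^(e-3)
    return np.sign(x) * np.round(a / step) * step

def quantize_tensorwise(x):
    """One scaling factor per tensor: the tensor's amax maps onto the format max."""
    scale = E4M3_MAX / max(np.abs(x).max(), 1e-12)
    return quantize_e4m3(x * scale) / scale, scale

x = np.random.default_rng(0).normal(size=4096)
xq, s = quantize_tensorwise(x)
print("scale:", s, "max abs error:", np.abs(x - xq).max())
```

A single scale factor means one outlier value in the tensor dictates the precision available to every other element, which is the weakness DeepSeek’s block‑wise scheme attacks.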

TransformerEngine FP8 matrix multiplication scheme

3. DeepSeek V3 FP8 Scheme

DeepSeek’s approach differs in three key ways:

Weights are stored directly in FP8.

Only the E4M3 format is used.

Block‑wise scaling replaces tensor‑wise scaling.

The architecture places master weights and weight gradients in FP32, activation gradients and optimizer states in BF16, and distributes them across data‑parallel ranks to control memory.

DeepSeek V3 FP8 architecture

3.1 Reducing Memory and Communication

Low‑precision optimizer states store AdamW’s first‑ and second‑order moments in BF16 while keeping master weights in FP32. Activations after the attention block are cached in E5M6, with scaling factors restricted to powers of two. Inputs to the SwiGLU operator in MoE layers are cached in FP8 and recomputed in the backward pass. Communication tensors are quantized to FP8, roughly halving bandwidth.
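Storing the optimizer moments in BF16 amounts to keeping only the top 8 exponent bits and 7 mantissa bits of each FP32 value. A sketch of round‑to‑nearest‑even BF16 truncation (a hypothetical helper for illustration, not DeepSeek’s code):

```python
import numpy as np

def to_bf16(x):
    """Round FP32 values to bfloat16 (round-to-nearest-even), returned in an
    FP32 container for easy inspection."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add 0x7FFF plus the lowest surviving mantissa bit, then zero the
    # 16 bits that bfloat16 drops.
    r = (u + np.uint32(0x7FFF) + ((u >> np.uint32(16)) & np.uint32(1))) & np.uint32(0xFFFF0000)
    return r.view(np.float32)

# BF16 keeps FP32's 8 exponent bits (same dynamic range, so tiny second
# moments don't flush to zero) but only 7 mantissa bits (~2-3 decimal digits).
m = np.array([1e-30, 0.123456789, 3.14159265], dtype=np.float32)
print(to_bf16(m))
```

The design choice matters: FP16 moments would underflow for small second moments, while BF16 only costs mantissa precision, which Adam‑style updates tolerate well.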

3.2 Fine‑Grained Quantization

Because FP8’s reduced dynamic range can cause overflow and underflow, DeepSeek applies block‑wise scaling: each 1×128 activation tile or 128×128 weight block gets its own scaling factor. Most blocks can then use a larger factor that preserves precision, while outlier‑heavy blocks get a smaller factor that contains the damage locally instead of degrading the whole tensor.
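A sketch of 1×128 block‑wise quantization on a flat vector follows; the real DeepGEMM kernels operate on tiles inside CUDA, and `quantize_e4m3` here is a simplified round‑to‑nearest simulation of the format, not the production code.

```python
import numpy as np

E4M3_MAX = 448.0

def quantize_e4m3(x):
    """Round-to-nearest simulation of FP8 E4M3 (saturating, no NaN handling)."""
    x = np.asarray(x, dtype=np.float64)
    a = np.minimum(np.abs(x), E4M3_MAX)
    e = np.maximum(np.floor(np.log2(np.maximum(a, 2.0 ** -9))), -6.0)
    step = 2.0 ** (e - 3)          # 3 mantissa bits -> grid spacing 2^(e-3)
    return np.sign(x) * np.round(a / step) * step

def quantize_blockwise(x, block=128):
    """One scaling factor per `block` elements, as in DeepSeek's 1x128 tiles."""
    out = np.empty_like(x)
    scales = np.empty(len(x) // block)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        s = E4M3_MAX / max(np.abs(blk).max(), 1e-12)
        out[i:i + block] = quantize_e4m3(blk * s) / s
        scales[i // block] = s
    return out, scales

# An outlier only inflates the scale of its own block; under tensor-wise
# scaling it would shrink the scale for the entire tensor.
rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x[0] = 1e4
xq_block, _ = quantize_blockwise(x)
s = E4M3_MAX / np.abs(x).max()
xq_tensor = quantize_e4m3(x * s) / s
print(np.abs(x - xq_block).mean(), np.abs(x - xq_tensor).mean())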

High‑precision accumulation is also used: the Tensor Core accumulates in low precision only over a limited interval, after which partial sums are promoted to CUDA cores and added into an FP32 accumulator, mitigating the roughly 14‑bit accumulation precision of H800 Tensor Cores.
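The effect of promotion can be demonstrated in miniature. Here fp16 is only a stand‑in for the Tensor Core’s limited‑width accumulator (the real issue is ~14‑bit accumulation precision, not fp16), but the failure mode is the same: once the running sum grows, small addends round away.

```python
import numpy as np

def naive_lowprec_dot(a, b):
    """Accumulate an entire dot product in one low-precision register
    (fp16 stands in for the Tensor Core's limited-width accumulator)."""
    acc = np.float16(0.0)
    for x, y in zip(a, b):
        acc = np.float16(acc + np.float16(x) * np.float16(y))
    return float(acc)

def chunked_dot(a, b, chunk=128):
    """DeepSeek-style promotion: accumulate short partials in low precision,
    then add each partial into an FP32 total."""
    total = np.float32(0.0)
    for i in range(0, len(a), chunk):
        partial = np.float16(0.0)
        for x, y in zip(a[i:i + chunk], b[i:i + chunk]):
            partial = np.float16(partial + np.float16(x) * np.float16(y))
        total = np.float32(total + np.float32(partial))
    return float(total)

# With all-ones inputs the effect is stark: once the naive accumulator
# reaches 2048, adding 1.0 rounds straight back down (fp16 spacing is 2
# there), so the sum stalls; the chunked version recovers the exact value.
n = 16384
a = b = np.ones(n)
print(naive_lowprec_dot(a, b))   # 2048.0 -- stalled
print(chunked_dot(a, b))         # 16384.0 -- exact
```

Keeping each partial small (128 elements) keeps every addend large relative to the running sum, which is exactly why DeepSeek promotes to FP32 at that interval.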

Block-wise scaling illustration
High‑precision accumulation results

4. Systematic FP8 Accuracy Analysis

The article studies how vector length and vector correlation affect quantization error in inner‑product operations, which dominate LLM training. Experiments quantize random vectors of varying lengths to FP8 E4M3 and measure error distribution and signal‑to‑noise ratio (SNR).

Longer vectors increase cumulative quantization error.

Higher correlation between vectors amplifies error.

Tensor‑wise scaling (a single factor per tensor) improves SNR, but block‑wise scaling (per block) yields far larger improvements, especially when vectors are highly correlated. High‑precision chunked accumulation further raises SNR.
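A sketch of such an SNR measurement follows. It uses Student‑t inputs to mimic outlier‑heavy activations and a simplified E4M3 round‑to‑nearest model; the exact distributions and formats in the article’s experiments may differ.

```python
import numpy as np

E4M3_MAX = 448.0

def quantize_e4m3(x):
    """Round-to-nearest simulation of FP8 E4M3 (saturating)."""
    x = np.asarray(x, dtype=np.float64)
    a = np.minimum(np.abs(x), E4M3_MAX)
    e = np.maximum(np.floor(np.log2(np.maximum(a, 2.0 ** -9))), -6.0)
    step = 2.0 ** (e - 3)
    return np.sign(x) * np.round(a / step) * step

def quantize_scaled(x, block):
    """One scaling factor per `block` elements; block=len(x) is tensor-wise."""
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        s = E4M3_MAX / max(np.abs(blk).max(), 1e-12)
        out[i:i + block] = quantize_e4m3(blk * s) / s
    return out

def dot_snr_db(rho, length, block, trials=100, seed=0):
    """SNR (dB) of FP8 inner products between vectors correlated by rho.
    Student-t entries (df=3) give the heavy tails typical of activations."""
    rng = np.random.default_rng(seed)
    sig = err = 0.0
    for _ in range(trials):
        a = rng.standard_t(3, size=length)
        b = rho * a + np.sqrt(1 - rho ** 2) * rng.standard_t(3, size=length)
        ref = a @ b
        est = quantize_scaled(a, block) @ quantize_scaled(b, block)
        sig += ref ** 2
        err += (ref - est) ** 2
    return 10 * np.log10(sig / max(err, 1e-300))

for rho in (0.0, 0.9):
    print(f"rho={rho}: tensor-wise {dot_snr_db(rho, 2048, 2048):.1f} dB, "
          f"block-wise {dot_snr_db(rho, 2048, 128):.1f} dB")
```

Varying `length`, `rho`, and `block` reproduces the qualitative trends described above: smaller blocks isolate outliers and lift SNR, and the gap matters most for correlated inputs.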

4.1 Impact of Vector Length

Histograms show error grows with vector size; scaling factors concentrate errors near zero.

Error distribution vs. vector length

4.2 Impact of Vector Correlation

When vectors are strongly correlated, error distributions shift, and block‑wise scaling dramatically improves SNR.

Correlation impact on error

4.3 Conclusions

DeepSeek V3’s FP8 scheme combines block‑wise scaling, high‑precision chunked accumulation, and selective FP8 storage to achieve comparable or slightly better SNR than Nvidia’s TransformerEngine while keeping memory and communication benefits. The analysis suggests that handling highly correlated attention computations with fine‑grained scaling is crucial for low‑precision LLM training.

All experimental code is available at https://github.com/deepseek-ai/DeepGEMM and https://github.com/reiase/ieee754_simulation/blob/master/simfloat_fp8_e4m3_v2.ipynb.

DeepSeek V3 optimization steps
Tags: LLM, Quantization, DeepSeek, FP8, Low-Precision Training, TransformerEngine
Written by Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
