
Challenges and Debugging Strategies for FP8 Training of Large Models

The article explains the performance benefits of using FP8 for large‑model training, outlines three main categories of FP8‑related issues such as loss spikes, divergence, and downstream metric gaps, and introduces a dedicated FP8 debug tool with metrics like MSE, cosine similarity, underflow, and overflow to help diagnose and resolve these problems.

DataFunSummit

More and more technical teams are adopting FP8 for large-model training because it offers significant advantages: on recent NVIDIA GPUs, Tensor Cores deliver twice the peak FP8 throughput of BF16 for compute-bound operators and four times that of TF32, while FP8's smaller data size eases memory pressure for memory-bound operators.

However, FP8's narrower dynamic range and lower precision compared with FP16/BF16/FP32 can introduce new challenges during training, so potential accuracy issues need careful handling.
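To make the range limits concrete, here is a small pure-Python sketch (not the actual hardware rounding logic) that simulates round-to-nearest quantization to FP8 E4M3, whose largest finite magnitude is 448 and smallest positive subnormal is 2^-9:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Simulate round-to-nearest FP8 E4M3 quantization (sketch only;
    ignores NaN encodings and saturates on overflow)."""
    E4M3_MAX = 448.0            # largest finite E4M3 magnitude
    MIN_SUBNORMAL = 2.0 ** -9   # smallest positive E4M3 value
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a > E4M3_MAX:
        return sign * E4M3_MAX  # overflow: saturate to the max value
    if a < MIN_SUBNORMAL / 2:
        return 0.0              # underflow: flush to zero
    e = max(math.floor(math.log2(a)), -6)  # -6 is the subnormal exponent
    step = 2.0 ** (e - 3)       # 3 mantissa bits -> grid spacing 2^e / 8
    return sign * round(a / step) * step
```

Values like gradients of magnitude 1e-5 flush straight to zero here, which is exactly the underflow behavior that per-tensor scaling factors exist to prevent.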

Major FP8 training problems and possible solutions

Based on discussions with many teams, FP8 training issues can be grouped into three categories:

Loss spikes – not unique to FP8; similar spikes can appear with BF16. The causes vary and are often algorithm-related, with no single fix. If spikes occur frequently and take many iterations to recover from, deeper investigation is warranted.
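Spikes are easiest to track down when flagged automatically during the run. A minimal detector sketch (the window size and threshold factor are arbitrary illustrative choices, not values from the article):

```python
def is_loss_spike(loss_history, loss, window=50, factor=1.5):
    """Flag a training step whose loss exceeds `factor` times the
    median loss over the last `window` recorded steps."""
    if len(loss_history) < window:
        return False  # not enough history to judge
    recent = sorted(loss_history[-window:])
    median = recent[window // 2]
    return loss > factor * median
```

Logging the step index whenever this fires makes it easy to correlate spikes with data shards, learning-rate changes, or FP8 scaling-factor updates.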

Loss increase or divergence – subdivided into three cases. Case 1: the loss diverges immediately at the start of training; this is usually a software bug, and upgrading to the latest NVIDIA NeMo/Megatron Core/Transformer Engine releases is recommended. Case 2: the divergence follows a configuration change; check for newly enabled features such as CPU offloading or FP8 parameters and try disabling them. Case 3: genuine numerical issues; try computing in BF16 while keeping FP8 tensors, apply a different scaling recipe (current scaling, fangrand scaling), or fall back sensitive layers (the first and last) to BF16.
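The "fall back the first and last layers to BF16" advice can be wired in as a per-layer precision switch. A hypothetical helper (the name and knobs are illustrative, not part of NeMo or Transformer Engine):

```python
def fp8_layer_mask(num_layers: int,
                   fallback_first: int = 1,
                   fallback_last: int = 1) -> list:
    """Return a per-layer flag: True = run the layer in FP8,
    False = fall back to BF16. The first/last counts are tunable knobs."""
    mask = [True] * num_layers
    for i in range(min(fallback_first, num_layers)):
        mask[i] = False                      # first layers -> BF16
    for i in range(min(fallback_last, num_layers)):
        mask[num_layers - 1 - i] = False     # last layers -> BF16
    return mask
```

In a Transformer Engine based stack, each flag would typically gate an `fp8_autocast(enabled=...)` context around the corresponding layer's forward pass (sketch-level assumption, not the framework's built-in mechanism).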

Downstream metric gaps despite a stable loss – two cases. Case 1: all downstream metrics are off; verify the inference pipeline, scaling factors, and weights, and consider adjusting the FP8 recipe or falling back some layers to BF16. Case 2: inference runs in BF16 while training used FP8; try FP8-to-FP8 inference (matching the training precision) to see whether the metrics improve.

FP8 Debug Tool Overview

The FP8 Debug tool provides metrics such as MSE and cosine similarity (quantization error between BF16 and FP8), tensor underflow/overflow counts, and compares delayed scaling factors with current scaling factors. It can dump selected tensors, record per‑step statistics, and works with any version of NVIDIA's NeMo Megatron without modifying framework code.
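The same statistics the tool reports can be computed by hand when comparing a BF16 reference tensor against its FP8 round-trip. A plain-Python sketch (the E4M3 limits are the standard format constants; the function names and thresholds are illustrative):

```python
import math

E4M3_MAX = 448.0            # largest finite E4M3 magnitude
E4M3_MIN_NORMAL = 2.0 ** -6 # smallest normal E4M3 magnitude

def quant_error_stats(ref, quant):
    """MSE and cosine similarity between reference and quantized values."""
    n = len(ref)
    mse = sum((r - q) ** 2 for r, q in zip(ref, quant)) / n
    dot = sum(r * q for r, q in zip(ref, quant))
    norm_r = math.sqrt(sum(r * r for r in ref))
    norm_q = math.sqrt(sum(q * q for q in quant))
    cosine = dot / (norm_r * norm_q) if norm_r and norm_q else 0.0
    return mse, cosine

def flow_counts(scaled):
    """Count nonzero elements whose scaled magnitude falls below the E4M3
    normal range (underflow candidates) or above its max (overflow)."""
    underflow = sum(1 for x in scaled if x != 0.0 and abs(x) < E4M3_MIN_NORMAL)
    overflow = sum(1 for x in scaled if abs(x) > E4M3_MAX)
    return underflow, overflow
```

Low cosine similarity or a high underflow ratio on a specific layer's tensors is the signal to consider a different scaling recipe or a BF16 fallback for that layer.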

During analysis, users can dump tensors (e.g., forward inputs, weights, gradients), print step‑wise results, and monitor metrics like AMin/AMax, current vs. delayed scaling, and underflow/overflow ratios to pinpoint problematic layers.
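Comparing delayed and current scaling factors is straightforward once per-step amax values are recorded. A sketch of the usual convention (scale maps the observed amax onto the FP8 max; delayed scaling takes amax from a history window, current scaling from the present step; margin handling is omitted for brevity):

```python
E4M3_MAX = 448.0

def scale_from_amax(amax: float, fp8_max: float = E4M3_MAX) -> float:
    """Scaling factor that maps the observed amax onto the FP8 max value."""
    return fp8_max / amax

def scaling_mismatch(amax_history, current_amax):
    """Return (delayed_scale, current_scale, ratio). A ratio far below 1
    means the delayed factor is stale and overflow is likely this step."""
    delayed = scale_from_amax(max(amax_history))   # history-based amax
    current = scale_from_amax(current_amax)        # this step's amax
    return delayed, current, current / delayed
```

A sudden amax jump (e.g., during a loss spike) makes the ratio collapse, which is exactly when delayed scaling clips values that current scaling would have handled.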

Internal experiments show that in bad cases, MSE can reach 10³ for certain forward tensors, while good cases stay around 10⁻²; underflow rates can exceed 80% for problematic layers, suggesting fallback to BF16 or alternative scaling strategies.

The tool is still in internal testing; interested parties should contact their NVIDIA technical representative for access and feedback.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
