Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training
The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructure for thousand‑card and ten‑thousand‑card clusters, detailing Baidu Baige's metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.
1. Evolution of AI Training Stability
Since AlexNet’s breakthrough in 2012, AI training has progressed from a few GPUs in research labs to massive GPU clusters requiring dedicated power systems, shifting stability management from simple operations to precise engineering.
1.1 Early Small‑Model Era: Manual Operations
Before 2022, training tasks typically used a dozen GPUs with PyTorch or TensorFlow data parallelism; engineers often preferred restarting over debugging. Monitoring resembled a car dashboard, showing only basic task status, and operators relied on nvidia-smi, DCGM, and nsys to inspect hardware metrics.
1.2 Large‑Model Storm: From Quantity to Quality
The emergence of ChatGPT triggered a shift to thousand‑card and ten‑thousand‑card clusters, exposing the inadequacy of legacy operational tools.
Case study: In early 2024, Baidu Baige helped an AIGC startup scale from hundreds to thousands of GPUs. A training hang went unnoticed for hours due to missing fault perception and tolerance mechanisms, resulting in a 30‑hour loss of valuable compute.
2. Baidu Baige’s Panorama of Training Stability
Training stability is now a core infrastructure component, akin to seismic design in buildings. Baidu Baige introduced the metric “invalid training time”:
Invalid Training Time = (Number of Fault Interruptions × Fault Recovery Duration) + Total Checkpoint Write Time
Fault recovery duration includes fault perception latency, scheduling time, initialization time, and re‑computation time. Reducing invalid training time requires focusing on infrastructure stability and task fault tolerance.
Improve infrastructure delivery quality.
Increase fault‑tolerance recall, precision, and timeliness.
Optimize checkpoint mechanisms to cut save and re‑compute time.
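The metric above is a simple linear model, so it can be computed directly. The following sketch uses purely hypothetical numbers (the function name and units are my own, not Baige's interface):

```python
def invalid_training_time(num_interruptions: int,
                          avg_recovery_minutes: float,
                          total_ckpt_write_minutes: float) -> float:
    """Invalid training time in minutes, per the formula above.
    Recovery duration bundles fault perception latency, scheduling,
    initialization, and re-computation time."""
    return num_interruptions * avg_recovery_minutes + total_ckpt_write_minutes

# Hypothetical week on a large cluster: 10 interruptions,
# 30 min average recovery, 120 min of checkpoint writes.
print(invalid_training_time(10, 30.0, 120.0))  # 420.0
```

The formula makes the two levers explicit: cutting either the number of interruptions (infrastructure quality), the recovery duration (fault tolerance), or the checkpoint write time (checkpoint optimization) reduces the total.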
Through a layered fault‑tolerance architecture spanning task load, framework, communication, and infrastructure, Baidu Baige achieves >90% coverage of training anomalies with sub‑second perception, minute‑level localization, and an average 3‑minute self‑healing time.
3. Infrastructure Delivery Quality Assurance
In the CPU era, hardware delivery involved basic stress tests. In the GPU era, delivery must evaluate CPU, GPU, RDMA, storage, power, and temperature, followed by real‑time fault perception and graded self‑healing (automatic drain/restart for Error level, automatic replacement for Fault level).
4. Task Fault Tolerance
Fault tolerance hinges on accurate fault detection (explicit vs. implicit) and rapid recovery.
4.1 Automatic Hang Perception
Typical NCCL timeout errors appear after 10–30 minutes, which is unacceptable at scale. Baidu Baige proposes multiple perception methods:
Log silence detection: if all workers stop logging for a configurable window (set well below the NCCL timeout), a hang is inferred.
Process‑stack sampling (e.g., py‑spy, pystack): unchanged stacks over several samples indicate a hang.
Metric anomalies: zero RDMA traffic, 100% GPU utilization but low SM utilization across ranks suggest a hang.
Communication‑library probing: BCCL timestamps each collective operation; a 30‑second stall flags a hang.
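The first of these methods, log-silence detection, reduces to a single check over per-rank log timestamps. This is a minimal sketch under assumed names and a 5-minute threshold, not Baige's actual implementation:

```python
def detect_log_silence(last_log_ts: dict[str, float],
                       now: float,
                       silence_threshold_s: float = 300.0) -> bool:
    """Infer a hang if *every* worker has been silent longer than the
    threshold. The threshold (here 5 min) is chosen far below the
    default NCCL timeout (30 min) so detection fires much earlier.
    last_log_ts maps worker rank -> timestamp of its latest log line."""
    return all(now - ts > silence_threshold_s
               for ts in last_log_ts.values())

now = 10_000.0
workers = {"rank0": now - 400, "rank1": now - 350, "rank2": now - 380}
print(detect_log_silence(workers, now))  # True: all silent > 5 min
```

Requiring *all* workers to be silent avoids false positives from a single rank that legitimately logs rarely; a genuine collective hang stalls every rank at once.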
Code example of an NCCL timeout:
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
Another line for Rank 0:
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802713 milliseconds before timing out.
4.2 Automatic Hang Diagnosis
Diagnosis combines multiple probes:
BCCL trace‑hang: identify nodes that consistently fail to complete communications across groups.
RDMA/GPU metric analysis: nodes with zero RDMA flow and low GPU utilization are likely sources.
Process‑stack comparison: ranks stuck in non‑barrier functions pinpoint the hang origin.
Integrated analysis to correlate these signals and isolate the root cause.
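The integrated-analysis step can be illustrated by intersecting the three signals above per rank. All names and thresholds here are assumptions for the sketch, not Baige's diagnosis pipeline:

```python
def suspect_ranks(rdma_bytes: dict[int, int],
                  sm_util: dict[int, float],
                  in_collective: dict[int, bool]) -> list[int]:
    """Cross-correlate per-rank probes: a rank with zero RDMA traffic,
    low SM utilization, and a stack stuck *outside* the collective
    operation is a likely hang source (healthy ranks sit waiting
    inside the collective)."""
    return sorted(
        r for r in rdma_bytes
        if rdma_bytes[r] == 0          # no communication flowing
        and sm_util[r] < 10.0          # GPU not doing real work
        and not in_collective[r]       # stack stuck in a non-barrier function
    )

rdma = {0: 0, 1: 512_000, 2: 0}
sm = {0: 3.0, 1: 80.0, 2: 95.0}
barrier = {0: False, 1: True, 2: True}
print(suspect_ranks(rdma, sm, barrier))  # [0]
```

Intersecting signals rather than trusting any single probe is what gives the diagnosis its precision: zero RDMA traffic alone, for example, is also consistent with a rank blamelessly waiting on a slow peer.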
4.3 eBPF‑Based Implicit Fault Perception & Diagnosis
eBPF probes capture kernel‑level events without instrumenting user code, tracking:
Key training functions (forward, backward, collective ops) at microsecond granularity.
Process scheduling blocks (TASK_UNINTERRUPTIBLE) exceeding a threshold.
CUDA runtime API latency via uprobe on libcuda.so.
RDMA verbs (ibv_post_send, ibv_poll_cq) latency and status.
Analysis includes baseline vs. real‑time anomaly detection and cross‑rank consistency checks, enabling second‑level hang detection with >40% higher accuracy.
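The baseline-versus-real-time comparison on the analysis side can be as simple as a deviation test over latencies the probes emit. A minimal sketch (the z-score threshold and interface are assumptions; the actual system presumably maintains rolling baselines per event type):

```python
from statistics import mean, stdev

def is_latency_anomaly(baseline_us: list[float],
                       current_us: float,
                       z_threshold: float = 4.0) -> bool:
    """Flag a kernel-event latency (e.g., an ibv_post_send -> completion
    interval captured by an eBPF probe) that deviates from the recorded
    baseline by more than z_threshold standard deviations."""
    mu, sigma = mean(baseline_us), stdev(baseline_us)
    return abs(current_us - mu) > z_threshold * max(sigma, 1e-9)

baseline = [100.0, 102.0, 98.0, 101.0, 99.0]  # healthy latencies, µs
print(is_latency_anomaly(baseline, 500.0))  # True
print(is_latency_anomaly(baseline, 103.0))  # False
```

The cross-rank consistency check mentioned above then asks whether the same event is anomalous on one rank but normal on its peers, which separates a sick node from a cluster-wide slowdown.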
5. Fault Recovery Timeliness
5.1 Multi‑Level Restart Strategies
Multi‑level restart strategies reduce interruption time:
Explicit single‑node faults: replace the node and mask it at the cluster level.
Implicit single‑node faults: replace the node and mask it at the task level.
Non‑single‑node faults: attempt in‑place restart; if unsuccessful, resubmit the entire job.
These strategies shrink average recovery from ~30 minutes to ~30 seconds with >95% success.
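The three-way strategy above is essentially a decision table over fault scope and type. A hedged sketch (the labels and function shape are mine, not Baige's scheduler API):

```python
def recovery_action(single_node: bool,
                    explicit: bool,
                    in_place_ok: bool) -> str:
    """Map a detected fault to the multi-level restart strategy:
    - explicit single-node fault  -> replace node, mask at cluster level
    - implicit single-node fault  -> replace node, mask at task level
    - non-single-node fault       -> in-place restart, else full resubmit
    """
    if single_node:
        level = "cluster" if explicit else "task"
        return f"replace node, mask at {level} level"
    return "in-place restart" if in_place_ok else "resubmit entire job"

print(recovery_action(single_node=True, explicit=True, in_place_ok=True))
```

Escalating from the cheapest action (in-place restart) to the most expensive (full resubmission) is what keeps the average recovery near the 30-second mark rather than the 30-minute one.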
5.2 Triggered Checkpointing
Unlike fixed‑interval checkpoints, triggered checkpoints save model state upon specific events (faults, OOM, etc.), reducing unnecessary I/O and storage. Zero‑redundancy checkpoints (step‑wise saving) eliminate re‑computation but incur massive storage costs. A hybrid approach—triggered checkpoints with asynchronous, incremental dumps and occasional redundant backups—balances reliability and efficiency.
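The event-driven idea can be sketched with a small dispatcher that saves only on triggering events; the trigger set, class name, and callback shape are illustrative assumptions, and a real system would hook these events to an asynchronous, incremental dump:

```python
class TriggeredCheckpointer:
    """Save model state only on specific events (fault, OOM, preemption)
    instead of on a fixed interval; save_fn is any callable that
    persists state for a given training step."""
    TRIGGERS = {"fault", "oom", "preemption"}

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.saves = 0

    def on_event(self, event: str, step: int) -> bool:
        if event in self.TRIGGERS:
            self.save_fn(step)   # in practice: async, incremental dump
            self.saves += 1
            return True
        return False             # routine events cost no I/O

saved_steps = []
ckpt = TriggeredCheckpointer(saved_steps.append)
ckpt.on_event("heartbeat", 100)   # ignored: not a trigger
ckpt.on_event("oom", 200)         # persists state at step 200
print(saved_steps)  # [200]
```

Compared with fixed-interval saving, nothing is written while training is healthy; the trade-off, as the text notes, is that purely triggered saving must be backed by occasional redundant checkpoints in case the fault itself prevents a clean save.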
6. Business Requirements for Stability
AI training stability has become a precision engineering discipline, with Baidu Baige achieving 99.5% effective training time for thousand‑card and ten‑thousand‑card clusters, supporting flagship models such as the domestic mathematics model “Jiuzhang” and the Sora‑like model “Vidu”.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.