Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training
The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructure for thousand‑card and ten‑thousand‑card clusters, detailing Baidu Baige's metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.
1. Evolution of AI Training Stability
Since AlexNet’s breakthrough in 2012, AI training has progressed from a few GPUs in research labs to massive GPU clusters requiring dedicated power systems, shifting stability management from simple operations to precise engineering.
1.1 Early Small‑Model Era: Manual Operations
Before 2022, training tasks typically used a dozen GPUs with PyTorch or TensorFlow data parallelism; engineers often preferred restarting over debugging. Monitoring resembled a car dashboard, showing only basic task status, and operators relied on nvidia-smi, DCGM, and nsys to inspect hardware metrics.
1.2 Large‑Model Storm: From Quantity to Quality
The emergence of ChatGPT triggered a shift to thousand‑card and ten‑thousand‑card clusters, exposing the inadequacy of legacy operational tools.
Case study: In early 2024, Baidu Baige helped an AIGC startup scale from hundreds to thousands of GPUs. A training hang went unnoticed for hours due to missing fault perception and tolerance mechanisms, resulting in a 30‑hour loss of valuable compute.
2. Baidu Baige’s Panorama of Training Stability
Training stability is now a core infrastructure component, akin to seismic design in buildings. Baidu Baige introduced the metric “invalid training time”:
Invalid Training Time = (Number of Fault Interruptions × Fault Recovery Duration) + Total Checkpoint Write Time
Fault recovery duration includes fault perception latency, scheduling time, initialization time, and re‑computation time. Reducing invalid training time requires focusing on infrastructure stability and task fault tolerance.
Improve infrastructure delivery quality.
Increase fault‑tolerance recall, precision, and timeliness.
Optimize checkpoint mechanisms to cut save and re‑compute time.
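The metric above is a simple linear model, so it can be computed directly. The following sketch uses purely hypothetical numbers (the function name and units are my own, not Baige's interface):

```python
def invalid_training_time(num_interruptions: int,
                          avg_recovery_minutes: float,
                          total_ckpt_write_minutes: float) -> float:
    """Invalid training time in minutes, per the formula above.
    Recovery duration bundles fault perception latency, scheduling,
    initialization, and re-computation time."""
    return num_interruptions * avg_recovery_minutes + total_ckpt_write_minutes

# Hypothetical week on a large cluster: 10 interruptions,
# 30 min average recovery, 120 min of checkpoint writes.
print(invalid_training_time(10, 30.0, 120.0))  # 420.0
```

The formula makes the two levers explicit: cutting either the number of interruptions (infrastructure quality), the recovery duration (fault tolerance), or the checkpoint write time (checkpoint optimization) reduces the total.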
Through a layered fault‑tolerance architecture spanning task load, framework, communication, and infrastructure, Baidu Baige achieves >90% coverage of training anomalies with sub‑second perception, minute‑level localization, and an average 3‑minute self‑healing time.
3. Infrastructure Delivery Quality Assurance
In the CPU era, hardware delivery involved basic stress tests. In the GPU era, delivery must evaluate CPU, GPU, RDMA, storage, power, and temperature, followed by real‑time fault perception and graded self‑healing (automatic drain/restart for Error level, automatic replacement for Fault level).
4. Task Fault Tolerance
Fault tolerance hinges on accurate fault detection (explicit vs. implicit) and rapid recovery.
4.1 Automatic Hang Perception
Typical NCCL timeout errors appear after 10–30 minutes, which is unacceptable at scale. Baidu Baige proposes multiple perception methods:
Log silence detection: if all workers stop logging for a configurable window (set well below the NCCL timeout), a hang is inferred.
Process‑stack sampling (e.g., py‑spy, pystack): unchanged stacks over several samples indicate a hang.
Metric anomalies: zero RDMA traffic, 100% GPU utilization but low SM utilization across ranks suggest a hang.
Communication‑library probing: BCCL timestamps each collective operation; a 30‑second stall flags a hang.
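The first of these methods, log-silence detection, reduces to a single check over per-rank log timestamps. This is a minimal sketch under assumed names and a 5-minute threshold, not Baige's actual implementation:

```python
def detect_log_silence(last_log_ts: dict[str, float],
                       now: float,
                       silence_threshold_s: float = 300.0) -> bool:
    """Infer a hang if *every* worker has been silent longer than the
    threshold. The threshold (here 5 min) is chosen far below the
    default NCCL timeout (30 min) so detection fires much earlier.
    last_log_ts maps worker rank -> timestamp of its latest log line."""
    return all(now - ts > silence_threshold_s
               for ts in last_log_ts.values())

now = 10_000.0
workers = {"rank0": now - 400, "rank1": now - 350, "rank2": now - 380}
print(detect_log_silence(workers, now))  # True: all silent > 5 min
```

Requiring *all* workers to be silent avoids false positives from a single rank that legitimately logs rarely; a genuine collective hang stalls every rank at once.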
Code example of an NCCL timeout:
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
Another line for Rank 0:
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802713 milliseconds before timing out.
4.2 Automatic Hang Diagnosis
Diagnosis combines multiple probes:
BCCL trace‑hang: identify nodes that consistently fail to complete communications across groups.
RDMA/GPU metric analysis: nodes with zero RDMA flow and low GPU utilization are likely sources.
Process‑stack comparison: ranks stuck in non‑barrier functions pinpoint the hang origin.
Integrated analysis to correlate these signals and isolate the root cause.
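The integrated-analysis step can be illustrated by intersecting the three signals above per rank. All names and thresholds here are assumptions for the sketch, not Baige's diagnosis pipeline:

```python
def suspect_ranks(rdma_bytes: dict[int, int],
                  sm_util: dict[int, float],
                  in_collective: dict[int, bool]) -> list[int]:
    """Cross-correlate per-rank probes: a rank with zero RDMA traffic,
    low SM utilization, and a stack stuck *outside* the collective
    operation is a likely hang source (healthy ranks sit waiting
    inside the collective)."""
    return sorted(
        r for r in rdma_bytes
        if rdma_bytes[r] == 0          # no communication flowing
        and sm_util[r] < 10.0          # GPU not doing real work
        and not in_collective[r]       # stack stuck in a non-barrier function
    )

rdma = {0: 0, 1: 512_000, 2: 0}
sm = {0: 3.0, 1: 80.0, 2: 95.0}
barrier = {0: False, 1: True, 2: True}
print(suspect_ranks(rdma, sm, barrier))  # [0]
```

Intersecting signals rather than trusting any single probe is what gives the diagnosis its precision: zero RDMA traffic alone, for example, is also consistent with a rank blamelessly waiting on a slow peer.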
4.3 eBPF‑Based Implicit Fault Perception & Diagnosis
eBPF probes capture kernel‑level events without instrumenting user code, tracking:
Key training functions (forward, backward, collective ops) at microsecond granularity.
Process scheduling blocks (TASK_UNINTERRUPTIBLE) exceeding a threshold.
CUDA runtime API latency via uprobe on libcuda.so.
RDMA verbs (ibv_post_send, ibv_poll_cq) latency and status.
Analysis includes baseline vs. real‑time anomaly detection and cross‑rank consistency checks, enabling second‑level hang detection with >40% higher accuracy.
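The baseline-versus-real-time comparison on the analysis side can be as simple as a deviation test over latencies the probes emit. A minimal sketch (the z-score threshold and interface are assumptions; the actual system presumably maintains rolling baselines per event type):

```python
from statistics import mean, stdev

def is_latency_anomaly(baseline_us: list[float],
                       current_us: float,
                       z_threshold: float = 4.0) -> bool:
    """Flag a kernel-event latency (e.g., an ibv_post_send -> completion
    interval captured by an eBPF probe) that deviates from the recorded
    baseline by more than z_threshold standard deviations."""
    mu, sigma = mean(baseline_us), stdev(baseline_us)
    return abs(current_us - mu) > z_threshold * max(sigma, 1e-9)

baseline = [100.0, 102.0, 98.0, 101.0, 99.0]  # healthy latencies, µs
print(is_latency_anomaly(baseline, 500.0))  # True
print(is_latency_anomaly(baseline, 103.0))  # False
```

The cross-rank consistency check mentioned above then asks whether the same event is anomalous on one rank but normal on its peers, which separates a sick node from a cluster-wide slowdown.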
5. Fault Recovery Timeliness
5.1 Multi‑Level Restart Strategies
Multi‑level restart strategies reduce interruption time:
Explicit single‑node faults: replace the node and mask it at the cluster level.
Implicit single‑node faults: replace the node and mask it at the task level.
Non‑single‑node faults: attempt in‑place restart; if unsuccessful, resubmit the entire job.
These strategies shrink average recovery from ~30 minutes to ~30 seconds with >95% success.
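The three-way strategy above is essentially a decision table over fault scope and type. A hedged sketch (the labels and function shape are mine, not Baige's scheduler API):

```python
def recovery_action(single_node: bool,
                    explicit: bool,
                    in_place_ok: bool) -> str:
    """Map a detected fault to the multi-level restart strategy:
    - explicit single-node fault  -> replace node, mask at cluster level
    - implicit single-node fault  -> replace node, mask at task level
    - non-single-node fault       -> in-place restart, else full resubmit
    """
    if single_node:
        level = "cluster" if explicit else "task"
        return f"replace node, mask at {level} level"
    return "in-place restart" if in_place_ok else "resubmit entire job"

print(recovery_action(single_node=True, explicit=True, in_place_ok=True))
```

Escalating from the cheapest action (in-place restart) to the most expensive (full resubmission) is what keeps the average recovery near the 30-second mark rather than the 30-minute one.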
5.2 Triggered Checkpointing
Unlike fixed‑interval checkpoints, triggered checkpoints save model state upon specific events (faults, OOM, etc.), reducing unnecessary I/O and storage. Zero‑redundancy checkpoints (step‑wise saving) eliminate re‑computation but incur massive storage costs. A hybrid approach—triggered checkpoints with asynchronous, incremental dumps and occasional redundant backups—balances reliability and efficiency.
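The event-driven idea can be sketched with a small dispatcher that saves only on triggering events; the trigger set, class name, and callback shape are illustrative assumptions, and a real system would hook these events to an asynchronous, incremental dump:

```python
class TriggeredCheckpointer:
    """Save model state only on specific events (fault, OOM, preemption)
    instead of on a fixed interval; save_fn is any callable that
    persists state for a given training step."""
    TRIGGERS = {"fault", "oom", "preemption"}

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.saves = 0

    def on_event(self, event: str, step: int) -> bool:
        if event in self.TRIGGERS:
            self.save_fn(step)   # in practice: async, incremental dump
            self.saves += 1
            return True
        return False             # routine events cost no I/O

saved_steps = []
ckpt = TriggeredCheckpointer(saved_steps.append)
ckpt.on_event("heartbeat", 100)   # ignored: not a trigger
ckpt.on_event("oom", 200)         # persists state at step 200
print(saved_steps)  # [200]
```

Compared with fixed-interval saving, nothing is written while training is healthy; the trade-off, as the text notes, is that purely triggered saving must be backed by occasional redundant checkpoints in case the fault itself prevents a clean save.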
6. Business Requirements for Stability
AI training stability has become a precision engineering discipline, with Baidu Baige achieving 99.5% effective training time for thousand‑card and ten‑thousand‑card clusters, supporting flagship models such as the domestic mathematics model “Jiuzhang” and the Sora‑like model “Vidu”.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.