
From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

Baidu Geek Talk

Background: A Decade of Growth

In 2012 AlexNet’s breakthrough sparked modern AI, yet training clusters then comprised only a few servers. Over ten years, GPU farms expanded to thousands of cards, turning training stability from a simple operational concern into a core infrastructure challenge.

1. Early Small‑Model Era – Manual Operations

Before 2022, most jobs ran on a dozen GPUs using PyTorch or TensorFlow data parallelism. Engineers often preferred restarting a failed job over debugging. Monitoring resembled a car dashboard, showing only basic task status; hardware issues required on‑site visits with the "NVIDIA three‑toolset" (nvidia‑smi, dcgm, nsys).

2. Large‑Model Storm – New Requirements

The launch of ChatGPT forced the shift to thousand‑card clusters, exposing the inadequacy of legacy ops. Baidu Baige’s real‑world case illustrates the pain: a weekend hang caused hours of lost compute because the platform lacked fault perception and automatic recovery.

3. Defining "Invalid Training Time"

Invalid training time = (number of fault interruptions × fault‑recovery duration) + total checkpoint write time

Fault‑recovery duration includes detection latency, scheduling delay, task initialization, and re‑computation time. Reducing this metric requires focusing on infrastructure stability and task fault‑tolerance.
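To make the metric concrete, here is a back-of-the-envelope calculation with purely hypothetical numbers (the article does not give figures for an individual run):

```python
# Hypothetical numbers, chosen only to illustrate the invalid-training-time formula.
fault_interruptions = 8        # faults that interrupted the run
detection_latency_s = 120      # time to notice each fault
scheduling_delay_s = 180       # time to reschedule / replace nodes
task_init_s = 300              # framework restart and initialization
recomputation_s = 600          # work redone since the last checkpoint
checkpoint_writes = 200        # periodic checkpoint dumps over the run
checkpoint_write_s = 90        # wall-clock cost of each synchronous dump

fault_recovery_s = (detection_latency_s + scheduling_delay_s
                    + task_init_s + recomputation_s)            # 1200 s per fault
invalid_training_s = (fault_interruptions * fault_recovery_s
                      + checkpoint_writes * checkpoint_write_s)

print(f"Invalid training time: {invalid_training_s / 3600:.1f} h")   # -> 7.7 h
```

Every term in the sum is an optimization target: faster detection and scheduling shrink the first product, while asynchronous checkpointing shrinks the second.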

4. Fault‑Tolerance Architecture

The architecture spans four layers: Task Load → Framework → Communication → Infrastructure. Automatic anomaly perception, diagnosis, and recovery now cover >90% of failure scenarios, achieving sub‑second detection, minute‑level localization, and an average three‑minute self‑healing time.

4.1 Automatic Hang Perception

A hang typically manifests as a timeout error from the NCCL watchdog. Example log:

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.

Default timeouts (30 min for PyTorch's NCCL watchdog, 10 min for Megatron‑LM) are unacceptable at scale. Baidu Baige proposes four perception methods (a minimal sketch follows the list):

Log silence detection: If all workers stop logging for longer than a fraction of the timeout, a hang is likely.

Call‑stack stagnation: Periodic sampling of py‑spy / pystack reveals unchanged stacks across workers.

Metric anomalies: Zero RDMA traffic combined with 100% GPU utilization (or low SM utilization) indicates a blocked collective.

Communication‑library hooks: BCCL timestamps each collective; a 30‑second stall flags a hang.
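As a minimal sketch of how a monitoring agent might evaluate two of these signals (log silence and metric anomaly); the thresholds, paths, and metric sources below are illustrative assumptions rather than Baidu Baige's actual implementation:

```python
import os
import time

# Illustrative thresholds; a real agent would derive them from the configured NCCL timeout.
NCCL_TIMEOUT_S = 1800           # PyTorch's default collective timeout
LOG_SILENCE_FRACTION = 0.25     # "a fraction of the timeout" from the list above

def log_silent(worker_log_path: str, now: float | None = None) -> bool:
    """Log silence: the worker has written nothing for a sizeable fraction of the timeout."""
    now = now or time.time()
    return now - os.path.getmtime(worker_log_path) > LOG_SILENCE_FRACTION * NCCL_TIMEOUT_S

def metrics_anomalous(rdma_bytes_per_s: float, gpu_util_pct: float, sm_util_pct: float) -> bool:
    """Metric anomaly: zero RDMA traffic while the GPU reports busy suggests a blocked collective."""
    return rdma_bytes_per_s == 0 and gpu_util_pct >= 99.0 and sm_util_pct < 5.0

# Each agent samples these periodically and forwards raw signals to a master
# component; the final hang decision is made there (see 4.2).
```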

4.2 Automatic Hang Diagnosis

Any single perception signal on its own cannot confirm a hang. Baidu Baige therefore aggregates the probes in a master component and applies a 5‑minute window rule: if at least two of log silence, stack stagnation, metric anomaly, or BCCL stall are observed within the window, the task is declared hung.
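A minimal sketch of that quorum rule, assuming each probe forwards timestamped signals to the master component:

```python
from dataclasses import dataclass

WINDOW_S = 300   # the 5-minute aggregation window
QUORUM = 2       # at least two independent signal kinds must agree

@dataclass
class Signal:
    kind: str         # "log_silence", "stack_stagnation", "metric_anomaly", "bccl_stall"
    timestamp: float  # when the probe observed the anomaly

def is_hung(signals: list[Signal], now: float) -> bool:
    """Declare a hang only when >= QUORUM distinct signal kinds fired inside the window."""
    recent_kinds = {s.kind for s in signals if now - s.timestamp <= WINDOW_S}
    return len(recent_kinds) >= QUORUM
```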

Diagnosis then pinpoints the source (a sketch of the call‑stack heuristic follows this list):

BCCL Tracehang: Nodes that fail to complete communication across multiple groups are likely the root.

RDMA/GPU metric divergence: A rank with zero GPU utilization while others run at 100% suggests the culprit.

Call‑stack mismatch: Ranks stuck in non‑barrier functions differ from the majority.
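For the call‑stack heuristic, a majority comparison is enough in sketch form; sampled_stacks is assumed to map each rank to a hash or summary of its current stack, as sampled by py‑spy:

```python
from collections import Counter

def suspect_ranks(sampled_stacks: dict[int, str]) -> list[int]:
    """Ranks whose call stack differs from the majority are the hang suspects.

    Most ranks sit blocked inside the same collective (or a barrier), while the
    rank that never entered the call is stuck somewhere else entirely.
    """
    majority_stack, _ = Counter(sampled_stacks.values()).most_common(1)[0]
    return [rank for rank, stack in sampled_stacks.items() if stack != majority_stack]

# e.g. ranks 0..30 sampled inside "all_reduce", rank 31 sampled inside "dataloader.read"
# -> suspect_ranks(...) returns [31]
```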

4.3 eBPF‑Based Hidden‑Fault Detection

To capture kernel‑level anomalies, Baidu Baige deploys eBPF probes that monitor:

Training‑critical function latency (microsecond granularity).

Process‑state switches (detecting prolonged TASK_UNINTERRUPTIBLE periods).

CUDA runtime API latency via uprobe on libcuda.so.

RDMA verbs activity (e.g., ibv_post_send, ibv_poll_cq).

These data feed two analyses: baseline vs. real‑time anomaly detection, and cross‑rank consistency checks (e.g., outlier system‑call rates, NVLink bandwidth).
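To make the mechanism concrete, here is a minimal bcc (Python‑driven eBPF) sketch of the CUDA‑latency idea: a uprobe/uretprobe pair on a libcuda symbol records per‑call latency into a log2 histogram. The library path and the choice of cuLaunchKernel are assumptions for illustration; production probes would cover more functions and export the data to a time‑series backend rather than printing it.

```python
import time
from bcc import BPF   # requires the BPF Compiler Collection and root privileges

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start_ns, u32, u64);    // per-thread entry timestamp
BPF_HISTOGRAM(lat_log2_us);      // log2 histogram of call latency in microseconds

int on_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start_ns.update(&tid, &ts);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start_ns.lookup(&tid);
    if (tsp == 0)
        return 0;
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    lat_log2_us.increment(bpf_log2l(delta_us));
    start_ns.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
lib = "/usr/lib/x86_64-linux-gnu/libcuda.so.1"   # assumed path; adjust to the node image
b.attach_uprobe(name=lib, sym="cuLaunchKernel", fn_name="on_entry")
b.attach_uretprobe(name=lib, sym="cuLaunchKernel", fn_name="on_return")

print("Tracing cuLaunchKernel latency... Ctrl-C to dump the histogram")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
b["lat_log2_us"].print_log2_hist("usecs")
```

The same pattern, with scheduler tracepoints for prolonged TASK_UNINTERRUPTIBLE dwell time and uprobes on ibv_post_send / ibv_poll_cq for RDMA activity, covers the other probe points listed above.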

In practice, the eBPF pipeline reduced hidden‑fault detection latency from minutes to seconds and improved diagnosis accuracy by over 40%.

5. Reducing Fault‑Recovery Time

5.1 Multi‑Level Restart Strategy

Three tiers address different fault scopes:

Explicit single‑node fault → replace node and mask at cluster level.

Implicit single‑node fault → replace node and mask at task level.

Multi‑node fault → attempt in‑place restart; if unsuccessful, resubmit the entire job.

This approach shrank average recovery time from ~30 minutes to ~30 seconds with >95% success.
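A minimal sketch of the tiered decision, assuming the fault scope and the offending nodes arrive from the perception and diagnosis layers described earlier:

```python
from enum import Enum, auto

class FaultScope(Enum):
    EXPLICIT_SINGLE_NODE = auto()   # node reports the fault itself (e.g., ECC/Xid error)
    IMPLICIT_SINGLE_NODE = auto()   # hang localized to one node by diagnosis
    MULTI_NODE = auto()             # e.g., a switch failure affecting several nodes

def recovery_action(scope: FaultScope, bad_nodes: list[str]) -> str:
    """Pick the cheapest restart tier that matches the fault scope (illustrative only)."""
    if scope is FaultScope.EXPLICIT_SINGLE_NODE:
        return f"replace {bad_nodes} and mask them at the cluster level, then restart ranks"
    if scope is FaultScope.IMPLICIT_SINGLE_NODE:
        return f"replace {bad_nodes} and mask them at the task level, then restart ranks"
    # Multi-node faults: try the cheap path first, then fall back.
    return "attempt an in-place restart; if it fails, resubmit the entire job"
```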

5.2 Trigger‑Based Checkpointing

Traditional checkpoints use fixed intervals, leading to redundant storage and long re‑computation after failures. Trigger‑based checkpoints fire on specific events (e.g., out‑of‑memory, detected hang) and combine:

Integrated fault perception that automatically saves state before exit.

Asynchronous dump to shared memory, followed by RDMA‑accelerated transfer to a standby node.

Periodic redundant backups to guard against catastrophic crashes.

When paired with incremental checkpointing, this scheme cuts storage overhead while keeping re‑computation minimal.
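A minimal sketch of the trigger‑based idea for a PyTorch job: a Unix signal stands in for the platform's fault‑perception callback and triggers an asynchronous dump to node‑local shared memory, from which the state could later be shipped to a standby node. The paths, the signal choice, and the helper names are assumptions for illustration.

```python
import signal
import threading
import torch

SHM_CKPT_PATH = "/dev/shm/ckpt_emergency.pt"   # assumed node-local shared-memory path

def async_emergency_checkpoint(model, optimizer, step: int) -> threading.Thread:
    """Dump state to shared memory off the main thread; in the real pipeline this
    file would next be transferred over RDMA to a standby node."""
    def _dump():
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            SHM_CKPT_PATH,
        )
    worker = threading.Thread(target=_dump)
    worker.start()
    return worker

def install_fault_trigger(model, optimizer, get_step):
    """Stand-in for integrated fault perception: checkpoint on SIGTERM, then exit."""
    def handler(signum, frame):
        async_emergency_checkpoint(model, optimizer, get_step()).join()
        raise SystemExit(0)
    signal.signal(signal.SIGTERM, handler)
```

Pairing this with the periodic redundant backups mentioned above keeps a safety net even when the emergency dump itself is lost in a catastrophic crash.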

6. Outlook

As training clusters scale to tens of thousands of GPUs, future systems will need even finer‑grained, predictive fault‑avoidance mechanisms that balance sub‑second detection against petabyte‑scale storage costs. Baidu Baige’s current pipeline already achieves 99.5% effective training time for flagship models such as the Chinese mathematics LLM “Jiuzhang” and the Sora‑style video model “Vidu”.

AI training stability overview diagram
Tags: distributed systems, fault tolerance, eBPF, large models, infrastructure, AI training