How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis
Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training with real‑time bandwidth monitoring, hang diagnosis, and improved network fault tolerance. Built on RDMA networks and optimized for mainstream GPU chips, it raises effective training time to 98% and bandwidth utilization to 95%.
1. Collective Communication Is Crucial for Distributed Training
In distributed training each GPU processes only a part of the model or data. GPUs synchronize gradients and parameters through collective communication, allowing the whole cluster to act as a single accelerator.
If a GPU stalls during collective communication, all other GPUs wait, slowing the entire job.
Therefore, collective‑communication performance directly determines the speed of distributed tasks.
To maximize performance, clusters typically use high‑speed RDMA networks and accelerate with collective‑communication libraries.
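The workhorse collective in data-parallel training is all-reduce, which sums gradients across all ranks. The following is a minimal sketch of the classic ring all-reduce algorithm (the scheme NCCL-style libraries commonly use), simulated in plain Python with lists standing in for GPU buffers; it is an illustration of the algorithm, not BCCL's actual implementation:

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce; vectors[r] is rank r's local gradient."""
    n = len(vectors)                       # number of ranks in the ring
    dim = len(vectors[0])
    assert dim % n == 0, "sketch assumes length divisible by rank count"
    csize = dim // n
    # chunks[r][c] is rank r's copy of chunk c of the gradient vector
    chunks = [[list(v[c * csize:(c + 1) * csize]) for c in range(n)]
              for v in vectors]

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) % n to
    # its ring neighbour (r + 1) % n, which accumulates it. After n-1 steps
    # rank r holds the fully reduced chunk (r + 1) % n.
    for s in range(n - 1):
        outgoing = [chunks[r][(r - s) % n] for r in range(n)]  # snapshot sends
        for r in range(n):
            c = (r - s) % n
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], outgoing[r])]

    # Phase 2: all-gather. The reduced chunks circulate around the ring so
    # that every rank ends up with the complete summed vector.
    for s in range(n - 1):
        outgoing = [chunks[r][(r + 1 - s) % n] for r in range(n)]
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = list(outgoing[r])

    return [[x for c in range(n) for x in chunks[r][c]] for r in range(n)]
```

Each of the 2(n−1) steps moves only 1/n of the data per rank, which is why ring all-reduce keeps every link busy and why a single slow rank stalls the whole ring.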
2. Large Models Raise New Requirements for System Operability and Stability
Training large models can take weeks, with clusters of thousands of GPUs. Failures are frequent, reducing resource utilization and extending project timelines.
Insufficient operability and stability lower the "effective training time" and increase costly downtime. For example, a 30‑day job that loses 10 days to fault handling achieves an effective training time of only about 67%.
The collective‑communication library itself must be optimized for operability and stability in large‑model scenarios.
3. Overview of Baidu Collective Communication Library (BCCL)
BCCL, released by Baidu Intelligent Cloud, is a collective‑communication library tailored for large‑model training and a key component of Baidu’s Baige 3.0 platform.
Built on the open‑source NCCL, BCCL adds functionality and enhances capabilities for observability, fault diagnosis, stability, and performance on Baidu’s GPU chips.
Observability: real‑time bandwidth statistics.
Fault Diagnosis: ability to diagnose hangs in collective communication.
Stability: improved network stability and fault‑tolerance.
Performance Optimization: higher collective‑communication throughput on mainstream GPU chips.
We will now detail BCCL’s capabilities in these four areas.
4. Observability – Real‑Time Bandwidth Statistics
4.1 Background
During training, overall cluster performance may degrade without obvious causes, requiring comprehensive checks.
4.2 Problem
Existing monitoring tools cover storage, RDMA, and GPUs but lack direct, real‑time insight into collective‑communication performance. Current work‑arounds involve indirect RDMA traffic monitoring or stopping training to run nccl‑test, both costly and disruptive.
4.3 Features and Effects
BCCL provides real‑time bandwidth statistics for collective communication, even under complex communication patterns, enabling precise performance observation, fault isolation, and optimization decisions.
This data helps narrow down faulty communication groups and assess whether bandwidth is saturated.
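The source does not disclose how BCCL computes these statistics internally, but the standard accounting (used, for instance, by nccl-tests) derives an algorithmic bandwidth from bytes moved per unit time and scales it by a per-collective factor to get the bus bandwidth actually seen on the wire. A hypothetical sketch of such per-collective accounting:

```python
# Illustrative per-collective bandwidth accounting (names and structure are
# hypothetical, not BCCL's API). algbw = bytes / time; for all-reduce the
# wire-level "bus bandwidth" is algbw * 2*(n-1)/n, the nccl-tests convention.

class CommStats:
    def __init__(self, nranks):
        self.nranks = nranks
        self.records = []

    def record(self, op, nbytes, seconds):
        algbw = nbytes / seconds / 1e9           # GB/s moved by the algorithm
        if op == "allreduce":
            factor = 2 * (self.nranks - 1) / self.nranks
        elif op in ("allgather", "reducescatter"):
            factor = (self.nranks - 1) / self.nranks
        else:                                     # e.g. broadcast, reduce
            factor = 1.0
        busbw = algbw * factor                    # bandwidth seen on the wire
        self.records.append({"op": op, "algbw": algbw, "busbw": busbw})
        return busbw
```

Comparing the observed bus bandwidth against the NIC's line rate is what tells an operator whether a communication group is saturated or underperforming.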
5. Fault Diagnosis – Detecting Collective‑Communication Hangs
5.1 Background
GPU failures can cause training jobs to stop or hang silently, with no explicit error logs, making root‑cause identification difficult.
5.2 Problem
Because collective communication is synchronous, a faulty GPU leaves other GPUs waiting, and hangs do not produce logs, so engineers cannot quickly pinpoint the offending device.
Traditional tools like nccl‑test often cannot reproduce hangs, leading to lengthy manual investigations that may take days.
5.3 Features and Effects
BCCL continuously records internal communication states. When a hang occurs, it outputs per‑rank status, allowing engineers to quickly narrow down the faulty GPU and dramatically reduce troubleshooting time.
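The diagnostic principle behind such a per-rank status dump can be sketched as follows (the data layout is hypothetical, not BCCL's actual output format): because collectives are synchronous, healthy ranks block waiting for the straggler, so the rank whose last completed collective lags behind the others is the prime suspect.

```python
# Sketch: identify the likely faulty rank(s) from per-rank progress counters.
# Hypothetical helper; BCCL's real status dump is internal.

def find_suspect_ranks(last_completed):
    """last_completed[r] = sequence number of the last collective rank r finished.

    Ranks that have fallen behind the furthest-progressed rank are returned
    as suspects, since everyone else is blocked waiting on them.
    """
    furthest = max(last_completed.values())
    return sorted(r for r, seq in last_completed.items() if seq < furthest)
```

In a real hang, the dump would show most ranks stuck on collective N while one rank never entered it, narrowing a multi-day manual search down to a single device.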
6. Stability – Network Reliability and Fault Tolerance
6.1 Background
Transient port flaps (links briefly going down and coming back up) can kill processes and abort training jobs; such events are unavoidable in physical networks.
6.2 Features and Effects
BCCL adds retry mechanisms for both control‑plane and data‑plane failures, improving job startup resilience and handling RDMA retransmission limits, thereby enhancing overall training stability.
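The shape of such a retry mechanism can be sketched generically; the helper below is illustrative only (BCCL's actual retry policy and error handling are internal), showing a transient control-plane failure, such as a connection setup attempt hitting a port flap, being retried with exponential backoff instead of aborting the job:

```python
# Sketch: retry a transient failure with exponential backoff rather than
# failing the whole training job. Hypothetical helper, not BCCL's API.
import time

def with_retries(op, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run op(); on ConnectionError, back off and retry up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts:
                raise                              # give up: fault is not transient
            sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying
```

The same idea applies on the data plane: when an RDMA connection exhausts its hardware retransmission limit, re-establishing it transparently lets the job ride out a brief link flap.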
7. Performance Optimization – Enhancing Collective‑Communication Throughput
BCCL is deeply optimized for the mainstream GPU chips used in Baidu Intelligent Cloud. In a dual‑machine H800 test, BCCL achieved 10% higher bandwidth utilization than NCCL.
8. Summary
On 20 December 2023, Baidu released the Baige·AI Heterogeneous Computing Platform 3.0, a smart infrastructure for large‑model training.
With BCCL’s operability and stability improvements, the platform achieves 98% effective training time and 95% bandwidth utilization.
Baidu Intelligent Cloud Tech Hub