How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training with real‑time bandwidth monitoring, hang diagnosis, and improved network stability and performance. Built for RDMA networks with GPU‑specific optimizations, it raises effective training time to 98% and bandwidth utilization to 95%.

Baidu Intelligent Cloud Tech Hub

1. Collective Communication Is Crucial for Distributed Training

In distributed training, each GPU processes only a part of the model or data. GPUs synchronize gradients and parameters through collective communication, allowing the whole cluster to act as a single accelerator.

If a GPU stalls during collective communication, all other GPUs wait, slowing the entire job.

Therefore, collective‑communication performance directly determines the speed of distributed tasks.

To maximize performance, clusters typically use high‑speed RDMA networks and accelerate with collective‑communication libraries.
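To make the collective step concrete, the sketch below simulates a ring all‑reduce, the algorithm NCCL‑family libraries typically use to sum gradients across ranks. It is a pure‑Python illustration of the data movement, not BCCL’s or NCCL’s actual implementation; all names are illustrative.

```python
def ring_allreduce(data):
    """Simulate a ring all-reduce. data[r][c] is chunk c held by rank r
    (one number per chunk for simplicity); every rank ends up holding
    the chunk-wise sum across all ranks."""
    n = len(data)
    buf = [list(row) for row in data]

    # Reduce-scatter phase: n-1 steps. At step s, rank r sends chunk
    # (r - s) % n to its right neighbor, which accumulates it. Afterward
    # rank r holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        vals = {r: buf[r][(r - step) % n] for r in range(n)}  # snapshot before mutation
        for r in range(n):
            buf[(r + 1) % n][(r - step) % n] += vals[r]

    # All-gather phase: n-1 steps. Each rank forwards the fully reduced
    # chunk it holds around the ring, overwriting stale copies.
    for step in range(n - 1):
        vals = {r: buf[r][(r + 1 - step) % n] for r in range(n)}
        for r in range(n):
            buf[(r + 1) % n][(r + 1 - step) % n] = vals[r]
    return buf

# Three ranks, three chunks each: every rank ends with [12, 15, 18].
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Note how every step pairs each rank with a neighbor: if one rank is slow or stuck, its neighbors block waiting to receive, and the stall propagates around the ring, which is exactly why one faulty GPU slows the whole job.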

2. Large Models Raise New Requirements for System Operability and Stability

Training large models can take weeks, with clusters of thousands of GPUs. Failures are frequent, reducing resource utilization and extending project timelines.

Insufficient operability and stability lower the "effective training time" and increase costly downtime—for example, a 30‑day job may lose 10 days to fault handling.
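The metric implied by that example can be stated directly: effective training time is the fraction of wall‑clock job time actually spent training. This is an assumed reconstruction of the metric, not a formula published by Baidu.

```python
def effective_training_time(total_days, downtime_days):
    """Fraction of the job's wall-clock time spent actually training."""
    return (total_days - downtime_days) / total_days

# The 30-day job from the text that loses 10 days to fault handling:
ratio = effective_training_time(30, 10)  # ~0.667, i.e. only ~67% effective
```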

The collective‑communication library itself must be optimized for operability and stability in large‑model scenarios.

3. Overview of Baidu Collective Communication Library (BCCL)

BCCL, released by Baidu Intelligent Cloud, is a collective‑communication library tailored for large‑model training and a key component of Baidu’s Baige 3.0 platform.

Built on the open‑source NCCL, BCCL adds functionality and enhances capabilities for observability, fault diagnosis, stability, and performance on Baidu’s GPU chips.

Observability: real‑time bandwidth statistics.

Fault Diagnosis: ability to diagnose hangs in collective communication.

Stability: improved network stability and fault‑tolerance.

Performance Optimization: higher collective‑communication throughput on mainstream GPU chips.

We will now detail BCCL’s capabilities in these four areas.

4. Observability – Real‑Time Bandwidth Statistics

4.1 Background

During training, overall cluster performance may degrade without obvious causes, requiring comprehensive checks.

4.2 Problem

Existing monitoring tools cover storage, RDMA, and GPUs but lack direct, real‑time insight into collective‑communication performance. Current work‑arounds involve indirect RDMA traffic monitoring or stopping training to run nccl‑test, both costly and disruptive.

4.3 Features and Effects

BCCL provides real‑time bandwidth statistics for collective communication, even under complex communication patterns, enabling precise performance observation, fault isolation, and optimization decisions.

This data helps narrow down faulty communication groups and assess whether bandwidth is saturated.
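As a sketch of what such per‑operation statistics look like, the snippet below follows the convention used by nccl‑tests: algorithm bandwidth is bytes moved over elapsed time, and for all‑reduce the bus bandwidth scales it by 2(n−1)/n to reflect traffic actually placed on the links. BCCL’s internal accounting is not public; this is an assumed reconstruction for illustration.

```python
def allreduce_bandwidth(bytes_per_rank, seconds, n_ranks):
    """Per-operation bandwidth stats for an all-reduce, nccl-tests style.

    algbw: data size / elapsed time, as seen by the algorithm.
    busbw: algbw * 2*(n-1)/n, approximating bytes actually on the wire,
           so it can be compared against the link's physical peak.
    """
    algbw = bytes_per_rank / seconds
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Example: a 1 GiB all-reduce across 8 ranks completing in 25 ms.
algbw, busbw = allreduce_bandwidth(1 << 30, 0.025, 8)
```

Comparing the reported bus bandwidth against the NIC’s line rate is what lets an operator judge whether a communication group is saturated or underperforming without stopping the job.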

5. Fault Diagnosis – Detecting Collective‑Communication Hangs

5.1 Background

GPU failures can cause training jobs to stop or hang silently, with no explicit error logs, making root‑cause identification difficult.

5.2 Problem

Because collective communication is synchronous, a faulty GPU leaves other GPUs waiting, and hangs do not produce logs, so engineers cannot quickly pinpoint the offending device.

Traditional tools like nccl‑test often cannot reproduce hangs, leading to lengthy manual investigations that may take days.

5.3 Features and Effects

BCCL continuously records internal communication states. When a hang occurs, it outputs per‑rank status, allowing engineers to quickly narrow down the faulty GPU and dramatically reduce troubleshooting time.
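The diagnosis idea can be sketched as follows: if every rank records the sequence number of the last collective it entered, the rank whose counter lags the rest when a timeout fires is the prime suspect. The state format and function names here are hypothetical, not BCCL’s actual records.

```python
def find_suspect_ranks(last_entered_op, current_op_seq):
    """Given {rank: sequence number of the last collective that rank
    entered}, return the ranks that never reached the current operation
    -- the likely sources of a hang."""
    return sorted(r for r, seq in last_entered_op.items()
                  if seq < current_op_seq)

# Ranks 0-3 should all be inside all-reduce #42; rank 2 is stuck at #41,
# so every other rank is blocked waiting for it.
suspects = find_suspect_ranks({0: 42, 1: 42, 2: 41, 3: 42}, 42)
```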

6. Stability – Network Reliability and Fault Tolerance

6.1 Background

Transient network port up‑downs can cause process failures and abort training jobs, an unavoidable issue in physical networks.

6.2 Features and Effects

BCCL adds retry mechanisms for both control‑plane and data‑plane failures, improving job startup resilience and handling RDMA retransmission limits, thereby enhancing overall training stability.
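A minimal sketch of the control‑plane side of such a retry mechanism: transient failures during connection setup (e.g. a port flap) are retried with exponential backoff instead of aborting the whole job. The wrapper and its parameters are illustrative assumptions, not BCCL’s API.

```python
import time

def with_retries(connect, attempts=5, base_delay=0.1):
    """Call connect(), retrying transient ConnectionErrors with
    exponential backoff; re-raise only after all attempts fail."""
    for i in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Hypothetical flaky control-plane connect: fails twice, then succeeds.
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient port up-down")
    return "connected"

result = with_retries(flaky_connect, attempts=5, base_delay=0)
```

Without the wrapper, the first `ConnectionError` would propagate and kill the training process; with it, a brief port up‑down costs a short delay instead of a full job restart.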

7. Performance Optimization – Enhancing Collective‑Communication Throughput

BCCL is deeply optimized for the mainstream GPU chips used in Baidu Intelligent Cloud. In a dual‑machine H800 test, BCCL achieved 10% higher bandwidth utilization than NCCL.

8. Summary

On 20 December 2023 Baidu released Baige·AI Heterogeneous Computing Platform 3.0, a smart infrastructure for large‑model training.

With BCCL’s operability and stability improvements, the platform achieves 98% effective training time and 95% bandwidth utilization.
