How Baidu’s AIPod Network Powers Massive AI Model Training
This article explains the design and engineering of Baidu's AIPod high‑performance network, detailing the massive bandwidth, scalability, stability, and low‑latency requirements of large‑scale AI model training and the practical tools used to monitor and troubleshoot such workloads.
Welcome to the Baidu Intelligent Cloud AI Base public class series. This session focuses on the key technology behind large-scale AI training: the high-performance AI network.
1. Network Requirements for Large‑Model Training
Large models such as Baidu's 260-billion-parameter model require thousands of GPUs, leading to massive distributed parallel training demands. Training scales are classified by GPU count: fewer than 100 (small), 100–1,000 (medium), more than 1,000 (large), and more than 10,000 (ultra-large).
Three common parallel strategies are used:
Data parallelism: each GPU holds a full model replica; gradients are synchronized via Allreduce, with communication volume proportional to model size (a worked estimate follows this list).
Pipeline parallelism: model layers are split across GPUs; point‑to‑point communication of activations and gradients occurs many times per iteration but with modest bandwidth.
Tensor parallelism: GPUs jointly compute tensor operations (e.g., matrix multiplication); Allreduce of large tensor results dominates bandwidth needs.
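To make the bandwidth pressure concrete, here is a minimal sketch of the per-GPU traffic for pure data parallelism, assuming a ring Allreduce and fp16 gradients (neither detail is stated in the talk; the numbers are illustrative):

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one ring Allreduce:
    2 * (N - 1) / N * payload -- roughly 2x the payload for large N."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

# Illustrative assumptions: fp16 gradients for a 260-billion-parameter
# model -> ~520 GB of gradient payload per synchronization.
grad_bytes = 260e9 * 2
traffic = ring_allreduce_bytes_per_gpu(grad_bytes, n_gpus=64)
print(f"~{traffic / 1e9:.0f} GB sent per GPU per sync")   # ~1024 GB
# Even at the 20 GB/s per-GPU target cited in Section 2, that is ~51 s
# of pure communication per sync -- one reason pure data parallelism
# cannot scale to models of this size.
```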
In practice, a hybrid of all three strategies is employed: tensor parallelism within a single node (leveraging NVLink), pipeline parallelism across nodes, and data parallelism across groups of nodes (DP groups).
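A hybrid layout is usually expressed as a rank-to-coordinate mapping. A minimal sketch, with tensor parallelism innermost so its heavy Allreduce stays on NVLink within one 8-GPU node (the group sizes tp=8, pp=4, dp=16 are illustrative assumptions, not Baidu's actual configuration):

```python
def rank_to_groups(rank: int, tp: int = 8, pp: int = 4, dp: int = 16):
    """Map a global rank to (tp, pp, dp) coordinates. Tensor parallelism
    is innermost so TP peers land on the same node and use NVLink."""
    return (rank % tp,             # tensor-parallel index (intra-node)
            (rank // tp) % pp,     # pipeline stage
            rank // (tp * pp))     # data-parallel (DP group) index

# 8 * 4 * 16 = 512 GPUs; ranks 0-7 form one TP group on one 8-GPU server.
for r in (0, 7, 8, 511):
    print(r, rank_to_groups(r))
```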
2. AIPod High‑Performance Network Design
To meet the goals of ultra-scale, ultra-high bandwidth, and ultra-stable operation, Baidu designed the AIPod network as a three-layer non-convergent CLOS topology (Leaf-Spine-SuperSpine). Each server connects its eight GPUs to eight distinct Leaf switches, forming an aggregation group that supports up to 512 GPUs. Full-mesh connections among Spine switches, plus SuperSpine links, scale clusters beyond 16,000 GPUs.
The network uses an 8-channel architecture, non-convergent 1:1 uplink/downlink bandwidth, and top-tier switch chips (up to 51.2 Tbps). To achieve roughly 90% linear scaling efficiency, the design targets Allreduce bandwidth above 20 GB/s per GPU.
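The 512-GPU group size follows from the port arithmetic. A quick back-of-the-envelope check, assuming the 51.2 Tbps Leaf chips are broken out as 400G ports (the talk does not specify the breakout):

```python
# 51.2 Tbps per chip at 400 Gbps per port -> 128 ports per Leaf switch.
ports = int(51.2e3 / 400)        # 128 ports
down = ports // 2                # 1:1 non-convergent: half down, half up
gpus_per_group = 8 * down        # 8 Leaf channels, one GPU/NIC per port
print(ports, down, gpus_per_group)   # -> 128 64 512
```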
Hash‑based routing collisions across switches are mitigated by using multiple NCCL connections per GPU pair, increasing routing entropy. Network‑aware scheduling places tasks within the same aggregation group to avoid cross‑group traffic.
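For stock NCCL deployments, the "multiple connections" idea maps onto standard NCCL environment variables; a minimal sketch (the values below are illustrative, not Baidu's tuning):

```python
import os

# Raise path entropy by spreading each GPU pair's traffic over more
# connections/queue pairs, so ECMP hashing sees more distinct flows.
# These are standard NCCL 2.x environment variables.
os.environ["NCCL_IB_QPS_PER_CONNECTION"] = "4"   # multiple QPs per connection
os.environ["NCCL_MIN_NCHANNELS"] = "16"          # more channels -> more flows
# Must be set before NCCL initializes, e.g. before
# torch.distributed.init_process_group(backend="nccl").
```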
To eliminate the remaining collisions, AIPod implements dynamic load balancing (DLB) and adaptive routing, similar to InfiniBand's adaptive routing: packets may arrive out of order, and paths are selected based on queue depth and link utilization.
Stability is enhanced with fast fault recovery (milliseconds for upstream link failures, seconds for downstream) and a black-box probing mechanism that injects probe packets on every link each second, automatically isolating faulty components.
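The probing idea itself is simple. A heavily simplified application-level sketch (the production mechanism covers every physical link and runs far closer to the hardware; host, port, and thresholds here are hypothetical):

```python
import socket
import time

def probe_once(host: str, port: int, timeout: float = 0.2):
    """Time a TCP connect to a peer; treat a timeout as a loss/fault
    signal. Illustrative only -- real link probes are not TCP connects."""
    t0 = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - t0) * 1e6   # round-trip in microseconds
    except OSError:
        return None                                # candidate for isolation
```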
Lossless operation relies on PFC; AIPod monitors for PFC deadlocks and storm conditions using telemetry from Baidu’s custom switches, providing immediate visibility into any anomalies.
3. Low‑Latency and Storage Considerations
Although bandwidth dominates AI training, AIPod also achieves microsecond‑level latency by minimizing fiber length and optimizing switch queueing. Storage I/O is supported via high‑performance VPC networking, with RDMA‑accelerated parallel file systems (200 GB/s per client) and high‑throughput object storage (10 GB/s per client).
4. Practical Training Experience
Training runs at the 2,000-GPU scale on Baidu's Baige platform (on both RoCE and InfiniBand clusters) demonstrate sustained per-GPU communication above 100 GB/s. High-precision, second-level monitoring visualizes task-level traffic, revealing bursty patterns that coarser sampling would miss.
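A toy illustration, with synthetic numbers, of why sampling granularity matters: a short burst that saturates a link can vanish entirely in a coarse average.

```python
# Synthetic traffic: a 1-second, 40 GB/s burst every 30 s on an otherwise
# quiet link, sampled on a 1-second grid for 5 minutes.
samples = [40.0 if t % 30 == 0 else 0.5 for t in range(300)]

peak_1s = max(samples)
peak_30s_avg = max(sum(samples[i:i + 30]) / 30 for i in range(0, 300, 30))
print(f"1s peak: {peak_1s:.1f} GB/s, 30s-averaged peak: {peak_30s_avg:.2f} GB/s")
# -> 1s peak: 40.0 GB/s, 30s-averaged peak: 1.82 GB/s
```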
Diagnostic tools include a task‑level visualizer, a fault‑location engine that correlates NCCL timeouts with underlying GPU or network issues, and a slow‑node detection utility that bisects the cluster to pinpoint underperforming GPUs.
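The bisection idea behind slow-node detection can be sketched as follows; `bench` is a hypothetical stand-in for whatever collective benchmark the utility actually runs (e.g., a wrapper around nccl-tests):

```python
def find_slow_node(nodes: list[str], bench) -> str:
    """Bisect a cluster to locate a slow node. `bench(group)` must return
    the measured Allreduce bandwidth of that group of nodes."""
    while len(nodes) > 1:
        mid = len(nodes) // 2
        left, right = nodes[:mid], nodes[mid:]
        # The slow node drags down its half's collective bandwidth,
        # so keep searching in whichever half benchmarks worse.
        nodes = left if bench(left) < bench(right) else right
    return nodes[0]

# Usage (hypothetical benchmark wrapper):
#   slow = find_slow_node([f"node{i:02d}" for i in range(64)], run_allreduce_bench)
```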
5. Q&A Highlights
RoCE v2 is used, with hardware-assisted out-of-order packet reassembly (which the adaptive-routing scheme relies on).
Elastic RDMA (ERI) provides 200 GB/s bandwidth and 5 µs latency over the VPC network.
The non-convergent topology adds redundancy at modest extra cost, improving the overall performance-to-price ratio.
Overall, AIPod’s high‑performance network is a core enabler for efficient, cost‑effective large‑model training in the AI era.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.