How Baidu Built a 32,000‑Card AI Super‑Compute Cluster and Pushed Training MFU Past 50%
This article details Baidu Intelligent Cloud's journey in designing, building, and operating a 32,000‑card hybrid AI compute cluster. It covers the challenges in power, cooling, networking, multi‑cluster scheduling, and security, and explains how innovative hardware, software, and operational strategies achieved Model FLOPs Utilization (MFU) above 50% along with several industry‑first performance records.
On August 29, 2025, Baidu Intelligent Cloud's Hybrid Cloud Division General Manager Du Hai presented a case study of a 32,000‑card domestic AI compute cluster at the Baidu Cloud Intelligence Conference, describing the technical challenges of building and efficiently using ultra‑large‑scale AI infrastructure.
The cluster features:
Hard‑core foundation: domestically produced Kunlun chips for core‑technology independence.
Over 10 PFLOPS of compute with a 98% effective training rate.
Power usage effectiveness (PUE) of 1.199, meaning less than 20% of facility power goes to anything other than the IT load (a quick worked example follows this list).
Five‑star stability certification at ten‑thousand‑card scale.
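To make the two headline ratios concrete, here is a back‑of‑the‑envelope calculation. The PUE and effective training rate are the figures above; the IT load is a purely hypothetical placeholder, since the article does not state one.

```python
# Quick arithmetic on the two headline ratios. PUE and effective training
# rate come from the article; the IT load is a made-up placeholder.

pue = 1.199            # PUE = total facility power / IT equipment power
it_load_mw = 10.0      # hypothetical IT load in megawatts

facility_mw = pue * it_load_mw
overhead_mw = facility_mw - it_load_mw
print(f"Facility draw {facility_mw:.2f} MW: {overhead_mw:.2f} MW "
      f"({overhead_mw / it_load_mw:.1%} of the IT load) goes to cooling "
      "and power delivery")

effective_rate = 0.98  # fraction of wall-clock time spent actually training
hours = 30 * 24        # hours in a 30-day month
print(f"Of {hours} h/month, ~{effective_rate * hours:.0f} h are productive; "
      f"~{(1 - effective_rate) * hours:.0f} h go to faults and restarts")
```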
Four core product layers constitute the "Intelligent Native Hybrid Cloud" solution:
High‑efficiency AI Data Center (AIDC) as the hardware base.
ABC Stack – a high‑performance AI cloud foundation built on AIDC.
Baidu Baige AI Computing Platform on top of the cloud foundation.
BHCMP – a multi‑cluster compute‑operation platform.
Key infrastructure challenges include power capacity, cooling limits, and space layout for massive GPU parallelism. Baidu addressed them by:
Introducing an integrated power‑plus‑energy‑storage architecture with 750 V DC distribution, source‑grid‑load‑storage coordination, and direct green‑power supply.
Deploying a distributed cooling system (Du Bingchuan + Du Lingxi) combined with Supernode 2.0 cabinets to achieve liquid‑air hybrid cooling.
Adopting a network‑centric concentric layout that scales from floor‑level to multi‑building and cross‑campus configurations, keeping inter‑node latency stable.
For inter‑cluster connectivity, Baidu launched a cross‑campus RDMA long‑haul solution comprising custom high‑performance cache switches and an optimized RDMA protocol, ensuring lossless long‑distance transmission.
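The article does not specify link speeds or distances, but a rough bandwidth‑delay‑product (BDP) calculation, with an assumed 400 Gbps link and illustrative campus spans, shows why shallow‑buffer data‑center switches struggle with lossless long‑haul RDMA and why dedicated cache switches help:

```python
# Bandwidth-delay product (BDP): the in-flight data a lossless long-haul
# link must be able to buffer when the receiver pauses the sender.
# Link speed and distances are illustrative assumptions.

LINK_GBPS = 400                    # assumed per-port link speed
FIBER_KM_PER_MS = 200              # light travels ~200 km per ms in fiber

for distance_km in (10, 50, 100):  # hypothetical cross-campus spans
    rtt_ms = 2 * distance_km / FIBER_KM_PER_MS
    bdp_mb = LINK_GBPS * rtt_ms / 8   # Gbit/s * ms / 8 = megabytes in flight
    print(f"{distance_km:>4} km: RTT {rtt_ms:.2f} ms, "
          f"BDP {bdp_mb:.1f} MB of buffer per port")
```

At metro distances the in‑flight data per port reaches tens of megabytes, well beyond typical data‑center switch buffers, so the long‑haul boundary needs hardware that can park that data without dropping it.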
Network routing was streamlined by aggregating POD routes down to roughly 4,000 entries and by combining adaptive routing with multi‑plane topologies, yielding up to 20% higher throughput and sub‑second fault switchover.
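As a minimal sketch of what POD‑level aggregation buys, the following collapses per‑node host routes into a single POD prefix using Python's standard `ipaddress` module; the addressing plan is hypothetical, not Baidu's:

```python
# Collapse per-node /32 host routes into one prefix per POD so the
# fabric's routing tables stay small. The addressing plan is hypothetical.
import ipaddress

# Assume one POD owns the contiguous block 10.8.0.0/24, one /32 per node.
host_routes = [ipaddress.ip_network(f"10.8.0.{i}/32") for i in range(256)]

aggregated = list(ipaddress.collapse_addresses(host_routes))
print(f"{len(host_routes)} host routes -> {len(aggregated)} entry:", aggregated)
# 256 host routes -> 1 entry: [IPv4Network('10.8.0.0/24')]
```

One aggregate per POD means a fabric with a few thousand PODs needs only a few thousand routing entries, consistent with the ~4,000 figure above.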
Operationally, Baidu built a full‑stack monitoring and fault‑tolerance system that identifies over 280 GPU fault types, recovers 98% of them automatically, and provides real‑time task‑level diagnostics to pre‑empt soft failures.
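The article does not describe the mechanism, but signature‑based triage is one plausible shape for it. In the sketch below, the fault signatures and recovery actions are illustrative assumptions:

```python
# Illustrative signature-based fault triage with automated recovery.
# Signatures and actions are assumptions; the article only states that
# 280+ GPU fault types are recognized and 98% recover automatically.
from enum import Enum, auto

class Action(Enum):
    RESTART_TASK = auto()   # relaunch the job from its last checkpoint
    RESET_DEVICE = auto()   # reset the accelerator, then rejoin the job
    CORDON_NODE = auto()    # drain the node and page an operator

# Ordered rules: the first matching telemetry signature wins.
RULES = [
    ("ecc_uncorrectable", Action.CORDON_NODE),
    ("collective_timeout", Action.RESTART_TASK),
    ("device_lost", Action.RESET_DEVICE),
]

def triage(event: str) -> Action:
    """Map a fault event to a recovery action; unknowns escalate to humans."""
    for signature, action in RULES:
        if signature in event:
            return action
    return Action.CORDON_NODE

for event in ("collective_timeout rank=17", "ecc_uncorrectable gpu=3"):
    print(f"{event!r} -> {triage(event).name}")
```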
Security measures span hardware, data, and model‑compliance management, backed by a dedicated security‑operations workflow that tracks every identified risk through to closure.
To enable efficient multi‑cluster resource sharing, Baidu created a unified power‑aware scheduling platform that standardizes heterogeneous resources, supports one‑click pool onboarding within three minutes, and dynamically matches each workload to the optimal resource pool based on latency and load priorities (a simplified placement sketch follows).
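A greatly simplified version of such latency‑ and load‑aware placement might look like the following; the pool names, weights, and numbers are illustrative assumptions, not Baidu's actual policy:

```python
# Toy latency- and load-aware pool selection for a multi-cluster
# scheduler. Pool names, weights, and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    rtt_ms: float        # network latency from the workload's data/users
    utilization: float   # fraction of the pool already in use
    free_gpus: int

def score(pool: Pool, latency_weight: float = 0.6) -> float:
    """Lower is better: weighted blend of latency and load headroom."""
    return latency_weight * pool.rtt_ms + (1 - latency_weight) * 100 * pool.utilization

def place(pools: list[Pool], gpus_needed: int) -> Pool:
    eligible = [p for p in pools if p.free_gpus >= gpus_needed]
    if not eligible:
        raise RuntimeError("no pool has enough free GPUs")
    return min(eligible, key=score)

pools = [
    Pool("north-1", rtt_ms=4.0,  utilization=0.92, free_gpus=512),
    Pool("east-2",  rtt_ms=12.0, utilization=0.55, free_gpus=2048),
]
print("placed in:", place(pools, gpus_needed=256).name)  # -> east-2
```

In this toy policy, east-2 wins despite its higher latency because it has far more headroom; a production scheduler would also fold in data locality, hardware heterogeneity, and queue priorities.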
Since deployment, the system has grown to span five major regions in China, delivering high‑quality compute services nationwide. The 32,000‑card cluster set several records: 90 days for civil‑engineering construction, one month to light up the first ten thousand cards, four months to reach full production, and Model FLOPs Utilization (MFU) above 50%, meaning more than half of the hardware's peak compute is spent on useful model math.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.