How Baidu’s Kunlun Supernode Redefines AI Compute Density and Performance

This article explains how Baidu’s Kunlun supernode, built on high‑density liquid‑cooled cabinets and a modular 1U 4‑card design, breaks traditional 8‑card limits, boosts compute density four‑fold, improves power and cooling efficiency, and provides a scalable foundation for large‑model AI training and inference.

Baidu Intelligent Cloud Tech Hub

Amid the explosion of large‑model parameters and the need for both training and inference performance, “supernodes” have become a key direction for next‑generation AI infrastructure.

Unlike traditional AI servers, supernodes integrate stronger compute and data‑transfer capabilities, typically using a high‑bandwidth domain (HBD) to interconnect AI accelerator cards, breaking the 8‑card and 16‑card limits and enabling lossless scaling under extreme latency constraints.

1. Baidu’s AI Server Design and Deployment History

Baidu has over a decade of experience designing and deploying AI servers. Within the Open Compute Project (OCP), Baidu contributed its AI hardware expertise and co‑defined the OAM (OCP Accelerator Module) standard with Facebook and Microsoft; in 2011 it launched the first‑generation “North Pole” (also known as “Scorpio”) cabinet.

The company later introduced the X‑MAN super AI computer series to power its AI workloads. In 2022, Baidu built China’s first fully InfiniBand‑networked thousand‑card GPU cluster on X‑MAN 4.0, supporting the launch of Wenxin Yiyan (ERNIE Bot); X‑MAN 5.0 subsequently powered the 30,000‑card Kunlun P800 cluster.

2. Kunlun Supernode Modules

2.1 Cabinet

The Kunlun supernode is based on Baidu’s Tianchi high‑density liquid‑cooled cabinet, delivered as an integrated unit with a blind‑mate design for all three connection types: water, power, and network. Components can be inserted quickly and reliably without precise manual alignment, so ordinary line‑maintenance staff can install and service the system, dramatically shortening time‑to‑service compared with traditional air‑cooled servers.

In a typical 64‑card scenario, eight traditional 8U air‑cooled AI servers (8 cards each) occupy 64U of rack space, whereas the Kunlun supernode achieves the same capacity in only 28U (16 × 1U compute trays + 8 × 1U switch trays + 2 × 2U power shelves), more than doubling space‑utilization efficiency and improving deployment density and PUE.
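The rack‑space arithmetic above can be checked with a short sketch. The tray counts and heights are the ones quoted in this section; the helper function name is ours, not part of any Baidu tooling.

```python
def rack_units(trays: dict) -> int:
    """Sum rack units given {name: (count, height_in_U)}."""
    return sum(count * height for count, height in trays.values())

# Traditional layout: eight 8U air-cooled servers, 8 cards each = 64 cards.
traditional_u = rack_units({"ai_server": (8, 8)})

# Kunlun supernode layout for the same 64 cards (16 x 1U trays, 4 cards each).
supernode_u = rack_units({
    "compute_tray": (16, 1),
    "switch_tray": (8, 1),
    "power_shelf": (2, 2),
})

print(traditional_u, supernode_u)  # 64 28
print(f"space ratio: {traditional_u / supernode_u:.1f}x")  # roughly 2.3x
```

The ~2.3× ratio is what the text summarizes as “more than doubling space‑utilization efficiency.”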

2.2 Compute Node (Compute Tray)

Each compute node adopts a 1U 4‑card layout, delivering four times the compute density of conventional 8U 8‑card designs. The modular architecture decouples the CPU board, PCIe switch board, and GPU board, and includes dual PCIe Switch chips and dual uplinks for a 1:1 non‑blocking interconnect, enabling efficient scheduling and low‑latency communication.

The node supports a wide range of I/O configurations, including Baidu’s Taihang DPU, up to four NICs, four NVMe drives, two M.2 slots, and optional HBA or RAID cards, meeting diverse AI workload requirements while supporting domestic CPU platforms.

2.3 Switch Node (Switch Tray)

In AI infrastructure, the network interconnect is a performance‑critical component. The Kunlun supernode introduces a multi‑switch architecture that breaks the traditional single‑machine 8‑card interconnect limit. In a 32‑card configuration, four Switch Tray modules provide full interconnect, ensuring that any two XPUs can communicate over a single‑hop path, significantly reducing latency and improving bandwidth utilization in AllReduce, AlltoAll, and similar communication patterns.
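The one‑hop property of the 32‑card example can be illustrated with a toy topology model. This is our own sketch, assuming each XPU is wired to every switch tray (the article does not spell out the exact wiring); the point is only that two endpoints sharing a switch need a single hop.

```python
from itertools import combinations

NUM_XPUS = 32
NUM_SWITCH_TRAYS = 4

# Assumed wiring: every XPU has a link to each of the 4 switch trays.
xpu_links = {x: set(range(NUM_SWITCH_TRAYS)) for x in range(NUM_XPUS)}

def hops(a: int, b: int) -> int:
    """1 hop if the two XPUs share at least one switch tray."""
    return 1 if xpu_links[a] & xpu_links[b] else 2

# Check every pair of XPUs in the 32-card configuration.
worst_case = max(hops(a, b) for a, b in combinations(range(NUM_XPUS), 2))
print(f"worst-case XPU-to-XPU hops: {worst_case}")  # 1
```

With every pair one hop apart, collective operations such as AllReduce avoid the multi‑hop traversals that inflate latency in hierarchical topologies.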

The design also supports scale‑out, allowing hundreds to tens of thousands of XPU cards to be assembled into a unified compute pool, providing a robust network foundation for large‑model training.

2.4 Power Shelf

The power shelf centralizes the PSU modules, separating power delivery from the compute nodes. Each 2U shelf houses twelve 3300 W or 5500 W PSUs in a 10+2 redundancy scheme with dual‑input ATS technology, reducing the number of power units by 40 % compared with traditional designs. A single cabinet can deliver 33 kW–120 kW, supporting 500 W–1000 W per XPU/GPU, and offers AC + AC, AC + DC, and DC + DC redundant power modes.
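The 10+2 scheme means two of the twelve PSUs are held in reserve, so usable shelf capacity is ten times the PSU rating. A quick back‑of‑the‑envelope check, using only the figures quoted above:

```python
def shelf_capacity_w(psu_rating_w: int, psus: int = 12, redundant: int = 2) -> int:
    """Usable power with `redundant` PSUs held in reserve (N+2 scheme)."""
    return (psus - redundant) * psu_rating_w

low = shelf_capacity_w(3300)   # matches the 33 kW cabinet floor
high = shelf_capacity_w(5500)  # 55 kW per shelf; multiple shelves per
                               # cabinet would be needed to reach 120 kW
print(low, high)  # 33000 55000
```

Note that 10 × 3300 W lines up exactly with the 33 kW lower bound; the 120 kW upper bound presumably involves more than one shelf per cabinet.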

2.5 Cooling Module

The Kunlun supernode employs a hybrid liquid‑and‑air cooling architecture. CPU and XPU chips use micro‑channel cold plates with a parallel water‑flow design, lowering XPU temperatures by more than 20 °C and improving thermal stability and energy efficiency. NICs, memory, and SSDs remain air‑cooled. The liquid‑cooling loop can be deployed inside a traditional air‑cooled data center by installing Baidu’s self‑developed coolant distribution unit (CDU), “Tianji 1.0,” next to each cabinet.

2.6 Management Module

A dual‑layer out‑of‑band management architecture combines a rack‑level RMC (Rack Management Controller) with node‑level BMCs (Baseboard Management Controllers). This provides intelligent power management, liquid‑cooling monitoring, asset tracking, predictive maintenance, and comprehensive fault detection for CPU, memory, XPU, NIC/DPU, disks, fans, and motherboard, including leak detection and one‑click log analysis.

3. Conclusion

The launch of the Kunlun supernode marks a solid step forward for Baidu Cloud’s AI infrastructure, delivering a qualitative leap in compute density, energy efficiency, and deployment flexibility, and offering powerful support for large‑model training, inference, and other complex AI tasks.
