
Inside xAI’s 100k‑GPU Colossus: Supermicro Liquid‑Cooled Racks Explained

The article provides a detailed, step‑by‑step tour of xAI’s Colossus supercomputer, a multi‑billion‑dollar AI cluster built in 122 days with 100,000 NVIDIA H100 GPUs. It covers Supermicro liquid‑cooled 4U racks, coolant distribution units, power and water infrastructure, storage nodes, CPU servers, 400 GbE networking, and the operational challenges of scaling such a massive system.


Overview of the Colossus Cluster

The author walks through xAI’s Colossus data center, a multi‑billion‑dollar AI installation completed in 122 days that houses 100,000 NVIDIA H100 GPUs. The tour is based on a privileged on‑site visit approved by Elon Musk’s team and is supplemented by the ServeTheHome article (2024) that first revealed the cluster.

Supermicro Liquid‑Cooled Rack Architecture

Each rack is built from eight 4U Supermicro servers, each holding eight H100 GPUs, giving 64 GPUs per rack. The eight servers plus a Supermicro coolant distribution unit (CDU) form a single GPU compute rack. Eight such racks are grouped into a 512‑GPU block, and these blocks are combined into larger mini‑clusters.
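
To make the topology arithmetic concrete, here is a minimal Go sketch reproducing the counts above. The per‑server, per‑rack, and per‑block figures come from the tour; the number of 512‑GPU blocks needed to reach 100,000 GPUs is derived arithmetic, not a stated spec.

```go
// Back-of-the-envelope topology math for the rack layout described above.
// The 8-GPUs-per-server, 8-servers-per-rack, and 8-racks-per-block figures
// are from the article; the block count for 100,000 GPUs is just arithmetic.
package main

import "fmt"

func main() {
	const (
		gpusPerServer  = 8       // HGX H100 tray in each 4U Supermicro server
		serversPerRack = 8       // plus one CDU in the rack
		racksPerBlock  = 8       // the 512-GPU building block mentioned above
		targetGPUs     = 100_000 // full Colossus capacity
	)

	gpusPerRack := gpusPerServer * serversPerRack // 64
	gpusPerBlock := gpusPerRack * racksPerBlock   // 512

	fmt.Printf("GPUs per rack: %d\n", gpusPerRack)
	fmt.Printf("GPUs per 8-rack block: %d\n", gpusPerBlock)
	fmt.Printf("Blocks needed for %d GPUs: ~%d\n",
		targetGPUs, (targetGPUs+gpusPerBlock-1)/gpusPerBlock)
}
```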

The servers use Supermicro’s 4U universal GPU platform, chosen for its deep liquid‑cooling integration and maintainability. The design includes a custom liquid‑cooling block that cools all eight GPUs on a single HGX tray, plus a separate CPU cooling block.

Unlike many competing designs that bolt liquid‑cooling onto air‑cooled boards, Supermicro’s system is liquid‑cooled from the ground up, with all components—including the four Broadcom PCIe switches on the motherboard—cooled by a single custom liquid block.

Each CDU contains redundant pumps and power supplies, allowing hot‑swap replacement without shutting down the rack. The author recounts personally swapping a pump during a previous project, noting the same procedure would apply here.

Storage and CPU Nodes

Supermicro also supplies the storage tier. The data center uses 2.5‑inch NVMe trays in 1U storage nodes, reflecting the industry shift from disk‑based to flash storage for AI workloads because of lower power consumption and higher density. Although flash is costlier per petabyte, total cost of ownership favors it at this scale.
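
As a rough illustration of the density and power argument, the Go sketch below compares generic 3.5‑inch hard drives with high‑capacity 2.5‑inch NVMe drives. All per‑device capacities, wattages, and drives‑per‑U counts are placeholder assumptions, not figures from the Colossus deployment.

```go
// Rough density and power comparison behind the shift to flash noted above.
// Every per-device number here is a generic assumption for illustration.
package main

import "fmt"

type device struct {
	name  string
	tb    float64 // capacity per device, TB (assumed)
	watts float64 // active power per device, W (assumed)
	perU  float64 // devices that fit per rack unit (assumed)
}

func main() {
	devices := []device{
		{"3.5\" HDD", 20, 8, 12},
		{"2.5\" NVMe SSD", 61, 14, 16},
	}
	const petabyteTB = 1000.0

	for _, d := range devices {
		perPB := petabyteTB / d.tb
		fmt.Printf("%-15s ~%.0f devices/PB, ~%.0f W/PB, ~%.1f U/PB\n",
			d.name, perPB, perPB*d.watts, perPB/d.perU)
	}
}
```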

CPU compute nodes are 1U servers designed for balanced compute density and thermal output. Some of these servers feature an NVMe tray on the front and dedicated airflow paths for cooling.

High‑Performance Networking

The cluster’s most striking feature is its 400 GbE optical fabric. Each GPU server has nine 400 GbE links, delivering roughly 3.6 Tbps of bandwidth per server. The GPU fabric uses NVIDIA BlueField‑3 SuperNICs and Spectrum‑X Ethernet, providing RDMA capabilities that bypass traditional network‑stack bottlenecks.
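
The per‑server figure is simple link arithmetic; the sketch below also extrapolates to an aggregate server‑edge bandwidth under the assumption that every GPU server carries the same nine links, which the article implies but does not state as a cluster‑wide total.

```go
// Link arithmetic for the 400 GbE fabric described above. The per-server
// link count is from the article; the server count and the aggregate total
// are illustrative assumptions derived from 100,000 GPUs at 8 per server.
package main

import "fmt"

func main() {
	const (
		linkGbps       = 400.0       // per optical link
		linksPerServer = 9           // per the article
		gpuServers     = 100_000 / 8 // assumed: 12,500 servers at 8 GPUs each
	)

	perServerTbps := linkGbps * linksPerServer / 1000 // ~3.6 Tbps per server
	clusterPbps := perServerTbps * gpuServers / 1000  // aggregate edge bandwidth

	fmt.Printf("Per-server bandwidth: %.1f Tbps\n", perServerTbps)
	fmt.Printf("Aggregate server-edge bandwidth: ~%.0f Pbps\n", clusterPbps)
}
```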

CPU nodes also receive 400 GbE connections but via a separate switch hierarchy, a common design in large HPC clusters that separates GPU‑focused and CPU‑focused traffic.

All fiber cables are cut, terminated, and labeled on‑site; the author notes the meticulous organization required for hundreds of fibers.

Power, Water, and Facility Design

Because the racks are liquid‑cooled, the facility’s power and water distribution are critical. Large chilled‑water loops feed each CDU, which then circulates coolant through the GPU trays. Rear‑door heat exchangers, acting like automotive radiators, capture the remaining warm exhaust air and transfer that heat into the facility’s water loop.
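
For a sense of the flow rates this implies, here is a back‑of‑the‑envelope Go calculation using Q = ṁ · c_p · ΔT. The 70 kW per‑rack heat load and the 10 K supply/return temperature rise are illustrative assumptions, not figures from the tour.

```go
// Minimal coolant-loop sizing sketch: water flow needed to carry away one
// rack's heat at a given temperature rise. Heat load and delta-T are assumed.
package main

import "fmt"

func main() {
	const (
		rackHeatW = 70_000.0 // assumed IT load per GPU rack, W
		cpWater   = 4186.0   // specific heat of water, J/(kg*K)
		deltaT    = 10.0     // assumed loop temperature rise, K
	)

	massFlowKgS := rackHeatW / (cpWater * deltaT) // kg/s of water required
	litersPerMin := massFlowKgS * 60              // ~L/min, taking 1 kg ≈ 1 L

	fmt.Printf("Mass flow per rack: %.2f kg/s\n", massFlowKgS)
	fmt.Printf("Volumetric flow per rack: ~%.0f L/min\n", litersPerMin)
}
```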

To buffer the millisecond‑scale power spikes caused by GPU workloads, the team installed Tesla Megapack containers on site, providing fast‑response energy storage.
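
A toy model of why fast‑response storage helps: the grid supplies a steady average draw while the batteries absorb whatever the GPU load does above or below it. The load profile and the 115 MW grid figure are made‑up illustrative numbers, not Colossus measurements.

```go
// Toy power-buffering model: grid supplies the average, battery covers spikes.
// All numbers are invented for illustration only.
package main

import "fmt"

func main() {
	loadMW := []float64{100, 100, 150, 150, 100, 80, 140, 100} // hypothetical demand per interval
	const gridMW = 115.0                                       // assumed steady grid draw (the mean)

	cumulative := 0.0 // net energy supplied by the battery, in MW-intervals
	for i, l := range loadMW {
		delta := l - gridMW // >0: battery discharges, <0: battery recharges
		cumulative += delta
		fmt.Printf("t=%d load=%.0f MW battery=%+.0f MW (net %+.0f)\n", i, l, delta, cumulative)
	}
}
```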

Operational Insights and Challenges

The author notes that while the cluster is already running an initial 25,000‑GPU hall, expansion to the full 100,000‑GPU capacity is underway. Key lessons include the need for close coordination among multiple vendors, the importance of redundant cooling and power paths, and the emerging demand for liquid‑cooled optical switch modules to handle the power draw of next‑generation 800 GbE switches.

Overall, the tour demonstrates how a combination of fully liquid‑cooled hardware, massive Ethernet bandwidth, and robust facility engineering enables the construction of one of the world’s largest AI supercomputers.

Liquid cooling · High‑Performance Networking · Colossus · Data center architecture · Nvidia H100 · AI supercomputing · Supermicro
Written by BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
