How NVIDIA Builds 256‑GPU and 576‑GPU SuperPods with H100, GH200, and GB200 Interconnects
The article analyzes NVIDIA's DGX SuperPOD architectures across three GPU generations—H100, GH200, and GB200—detailing their NVLink/NVSwitch topologies, bandwidth calculations, scalability limits, and the practical challenges of constructing 256‑GPU and 576‑GPU supercomputing clusters.
Overview
In the era of large AI models, training on a single GPU is obsolete; enterprises now interconnect hundreds or thousands of GPUs to form a supercomputer. NVIDIA’s DGX SuperPOD series provides the next‑generation data‑center AI architecture for training, inference, high‑performance computing (HPC), and mixed workloads.
1. 256‑GPU SuperPod Based on H100
In a DGX A100 node, eight GPUs are linked by NVLink and NVSwitch, while nodes communicate over a 200 Gbps InfiniBand HDR (or RoCE) network. The DGX H100 extends NVLink across nodes using an NVLink-network Switch: NVSwitch handles intra-node traffic, and the NVLink-network Switch handles inter-node traffic, enabling a 256-GPU SuperPod with an all-reduce bandwidth of 450 GB/s per GPU, identical to what the eight GPUs achieve within a single node.
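As a quick check on the 450 GB/s figure, here is a minimal sketch in Python, assuming the commonly quoted NVLink 4.0 numbers of 18 links per H100 GPU at 50 GB/s bidirectional each; one reasonable reading is that the 450 GB/s all-reduce figure corresponds to a single GPU's per-direction NVLink bandwidth.

```python
# Hedged sketch: per-GPU NVLink 4.0 bandwidth on H100 (assumed figures).
LINKS_PER_GPU = 18          # NVLink 4.0 links per H100 GPU (assumption)
GB_S_PER_LINK_BIDIR = 50    # GB/s per link, both directions combined (assumption)

total_bidir = LINKS_PER_GPU * GB_S_PER_LINK_BIDIR   # 900 GB/s per GPU
per_direction = total_bidir // 2                     # 450 GB/s each way

print(f"Per-GPU NVLink bandwidth: {total_bidir} GB/s bidirectional")
print(f"Per direction:            {per_direction} GB/s (matches the quoted 450 GB/s)")
```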
However, the DGX H100 SuperPod suffers from limited inter-node connectivity: only 72 NVLink connections leave each node. The combined bidirectional bandwidth of these 72 links is 3.6 TB/s, half of the 7.2 TB/s aggregate NVLink bandwidth of the eight GPUs within a node, creating a 2:1 convergence (oversubscription) bottleneck at the NVSwitch layer.
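To put the bottleneck in numbers, the sketch below (still assuming 50 GB/s of bidirectional bandwidth per NVLink 4.0 link) compares the 72 cross-node links against the aggregate bandwidth of the eight GPUs inside a node.

```python
# Hedged sketch of the DGX H100 SuperPod convergence ratio.
GB_S_PER_LINK = 50          # assumed bidirectional GB/s per NVLink 4.0 link
GPUS_PER_NODE = 8
LINKS_PER_GPU = 18
CROSS_NODE_LINKS = 72       # NVLink connections leaving each node

intra_node = GPUS_PER_NODE * LINKS_PER_GPU * GB_S_PER_LINK / 1000   # 7.2 TB/s
inter_node = CROSS_NODE_LINKS * GB_S_PER_LINK / 1000                # 3.6 TB/s

print(f"Intra-node aggregate: {intra_node} TB/s")
print(f"Inter-node aggregate: {inter_node} TB/s")
print(f"Convergence ratio:    {intra_node / inter_node:.0f}:1")     # 2:1
```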
2. 256‑GPU SuperPod Based on GH200 and GH200 NVL32
In 2023 NVIDIA announced the DGX GH200, a generative-AI engine that pairs an H200 GPU with a Grace CPU (one Grace CPU per GPU). Besides the GPU-to-GPU NVLink 4.0 links, GPU-to-CPU communication also runs over NVLink, delivering 900 GB/s of CPU-GPU bandwidth. Within a rack, copper cabling can be used; between racks, fiber optics are typical.
A 256‑GPU GH200 cluster requires each GH200 to connect to nine 800 Gbps (100 GB/s) optical modules (two NVLink 4.0 links per module). The GH200 SuperPod differs from the H100 design in that both intra‑node and inter‑node connections rely on NVLink‑network Switches.
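The module count follows directly from the per-GPU link count; a small sketch, assuming 18 NVLink 4.0 links per GH200 and two links per 800 Gbps optical module as stated above:

```python
# Hedged sketch: optical modules needed per GH200 for the NVLink fabric.
NVLINK_LINKS_PER_GH200 = 18   # NVLink 4.0 links per GPU (assumption)
LINKS_PER_MODULE = 2          # links carried by each 800 Gbps optical module
MODULE_GB_S = 100             # 800 Gbps = 100 GB/s per module

modules = NVLINK_LINKS_PER_GH200 // LINKS_PER_MODULE   # 9 modules per GH200
optical_bandwidth = modules * MODULE_GB_S               # 900 GB/s

print(f"Optical modules per GH200:   {modules}")
print(f"Optical bandwidth per GH200: {optical_bandwidth} GB/s "
      "(matches the per-GPU NVLink bandwidth)")
```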
The DGX GH200 adopts a two-level Fat-tree topology: each node contains eight GH200 GPUs and three first-level NVLink-network Switches (each NVSwitch Tray carries two NVSwitch chips with 128 ports in total). Thirty-two such nodes are then interconnected through 36 second-level NVLink-network Switches, forming a 256-GPU SuperPod while preserving a non-convergent (non-blocking) fabric.
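A quick port-count check (assuming 128 ports per NVSwitch Tray, as stated above) suggests why the two-level topology stays non-convergent: each node's 144 GPU-facing links need 144 matching uplinks, and 36 second-level trays provide exactly the 4,608 ports that 32 nodes' uplinks consume.

```python
# Hedged sketch: port budget of the two-level GH200 Fat-tree.
GPUS_PER_NODE = 8
LINKS_PER_GPU = 18
L1_TRAYS_PER_NODE = 3
PORTS_PER_TRAY = 128
NODES = 32
L2_TRAYS = 36

downlinks_per_node = GPUS_PER_NODE * LINKS_PER_GPU       # 144 GPU-facing links
uplinks_per_node = downlinks_per_node                    # 144 for a non-convergent design
l1_ports_per_node = L1_TRAYS_PER_NODE * PORTS_PER_TRAY   # 384 available >= 288 used

l2_ports = L2_TRAYS * PORTS_PER_TRAY                     # 4608 second-level ports
total_uplinks = NODES * uplinks_per_node                 # 4608 node uplinks

print(f"L1 ports per node: {l1_ports_per_node} "
      f"(needs {downlinks_per_node + uplinks_per_node})")
print(f"L2 ports: {l2_ports} vs. total node uplinks: {total_uplinks}")
```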
3. 576‑GPU SuperPod Based on GB200 and GB200 NVL72
GB200 combines one Grace CPU with two Blackwell GPUs (whose performance is not directly comparable to a standalone B200). A GB200 Compute Tray, built on NVIDIA's MGX design, houses two GB200 modules, i.e., two Grace CPUs and four GPUs.
A GB200 NVL72 node comprises 18 Compute Trays (36 Grace CPUs, 72 GPUs) plus nine NVLink‑network Switch Trays. Each Blackwell GPU provides 18 NVLink ports; a fourth‑generation NVLink‑network Switch Tray offers 144 ports, so 72 × 18 / 144 = 9 Switch Trays are needed for full interconnect.
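The tray count is a port-matching exercise; a minimal sketch using the figures above:

```python
# Hedged sketch: NVLink-network Switch Trays required inside a GB200 NVL72 node.
GPUS = 72
NVLINK_PORTS_PER_GPU = 18
PORTS_PER_SWITCH_TRAY = 144    # per the figures quoted above

gpu_ports = GPUS * NVLINK_PORTS_PER_GPU                # 1296 GPU-side ports
switch_trays = gpu_ports // PORTS_PER_SWITCH_TRAY      # 9 trays

print(f"GPU NVLink ports:    {gpu_ports}")
print(f"Switch Trays needed: {switch_trays}")
```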
Official NVIDIA diagrams show eight GB200 NVL72 nodes forming a 576-GPU SuperPod, but analysis shows that all NVLink ports are already consumed by connections inside each NVL72, leaving no spare ports for a second switch level. Consequently, the 576-GPU SuperPod most likely relies on a Scale-Out RDMA network (InfiniBand or RoCE) between NVL72 nodes rather than a pure NVLink-based Scale-Up fabric. A fully NVLink-connected 576-GPU system would instead require 18 NVLink-network Switch Trays per 72 GPUs, which cannot fit in a single rack.
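The 18-tray figure falls out of the same arithmetic once half of each first-level tray's ports must be reserved as uplinks to a second switch level; a sketch under that assumption:

```python
# Hedged sketch: first-level Switch Trays needed per 72 GPUs in a two-level
# NVLink Fat-tree, where each tray splits its ports 50/50 between GPU-facing
# downlinks and second-level uplinks.
GPUS = 72
NVLINK_PORTS_PER_GPU = 18
PORTS_PER_TRAY = 144

gpu_links = GPUS * NVLINK_PORTS_PER_GPU        # 1296 downlinks required
downlinks_per_tray = PORTS_PER_TRAY // 2       # 72 downlinks, 72 uplinks per tray
l1_trays = gpu_links // downlinks_per_tray     # 18 trays per 72 GPUs

print(f"First-level Switch Trays per 72 GPUs: {l1_trays}")  # double the 9 of a single-level NVL72
```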
NVIDIA also mentions single-rack and dual-rack versions of NVL72. The dual-rack configuration may place one GB200 module (one Grace CPU and two GPUs) in each Compute Tray and use 18 NVLink-network Switch Trays across the two racks, satisfying the two-level interconnect requirement.
Conclusion
The three generations illustrate a progression from intra‑node NVLink/NVSwitch designs (H100) to more extensive NVLink‑network Switch fabrics (GH200) and finally to hybrid Scale‑Out RDMA solutions (GB200) when NVLink bandwidth and port availability become limiting factors. Understanding these topologies, bandwidth calculations, and convergence points is essential for architects planning large‑scale AI and HPC deployments.