What Is a SuperNode? Inside AI‑Optimized High‑Performance Compute Pods
The article explains the concept of SuperNode (SuperPod) as a new AI‑focused compute infrastructure, outlines its high‑density integration, ultra‑fast interconnects, and unified resource management, and compares three leading implementations from NVIDIA, Huawei, and the ETH‑X project.
SuperNode (SuperPod) Overview
SuperNode, also called SuperPod, is a newly developed compute‑infrastructure architecture designed to meet the training and inference demands of large AI models. It tightly integrates many accelerators (GPU, TPU, NPU, etc.) with high‑bandwidth interconnects to form a High‑Bandwidth Domain (HBD) that delivers near‑single‑machine performance at massive scale.
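As a rough mental model (not any vendor's API), a SuperNode can be pictured as a pool of accelerators sharing one scale‑up fabric that software can treat as a single large machine. The Python sketch below is illustrative only; the class and field names are invented, and the example figures are the NVL72 numbers discussed later in this article.

```python
from dataclasses import dataclass

@dataclass
class HighBandwidthDomain:
    """Illustrative model of a SuperNode: accelerators that share one
    high-bandwidth scale-up fabric (an HBD) and behave like one machine."""
    name: str
    accelerators: int        # GPUs/NPUs inside the domain
    scale_up_tbs: float      # aggregate intra-domain bandwidth, TB/s

    def bw_per_accelerator(self) -> float:
        """Average scale-up bandwidth each accelerator can count on (TB/s)."""
        return self.scale_up_tbs / self.accelerators

# Example using the GB200 NVL72 figures cited later in this article.
nvl72 = HighBandwidthDomain("GB200 NVL72", accelerators=72, scale_up_tbs=129.6)
print(f"{nvl72.bw_per_accelerator():.1f} TB/s per GPU")   # -> 1.8 TB/s
```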
Key Features
High‑density compute integration: Packs a large number of GPUs or other AI accelerators into a limited physical space, achieving extreme compute density.
High‑speed interconnect: Uses technologies such as NVLink and InfiniBand to provide high‑bandwidth, low‑latency communication between GPUs and between GPUs and the network, eliminating PCIe or standard Ethernet bottlenecks.
Deep compute‑network fusion: The network becomes part of the computation, enabling network‑aware computing, fused compute‑network workloads, and even computation‑driven network redesign.
Unified resource management and scheduling: Integrates compute, storage, and network resources under a single management and scheduling layer, improving utilization and operational efficiency.
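To make the last point concrete, here is a minimal, hypothetical sketch of topology‑aware placement: a scheduler that keeps a job inside one high‑bandwidth domain whenever possible, so its collectives stay on the scale‑up fabric, and spills across domains only when it must. The function and data shapes are invented for illustration, not any particular scheduler's API.

```python
def place_job(gpus_needed: int, free_by_domain: dict[str, int]) -> dict[str, int]:
    """Greedy, topology-aware placement: prefer a single high-bandwidth
    domain (SuperNode) first; spill to more domains only when necessary."""
    placement: dict[str, int] = {}
    # Try the domains with the most free GPUs first to minimise fragmentation.
    for domain, free in sorted(free_by_domain.items(), key=lambda kv: -kv[1]):
        if gpus_needed == 0:
            break
        take = min(free, gpus_needed)
        if take:
            placement[domain] = take
            gpus_needed -= take
    if gpus_needed:
        raise RuntimeError("not enough free GPUs in the cluster")
    return placement

# Example: a 96-GPU job on two NVL72-sized domains.
print(place_job(96, {"pod-a": 72, "pod-b": 60}))  # {'pod-a': 72, 'pod-b': 24}
```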
Representative Implementations
NVIDIA DGX SuperPOD (NVL72 example)
The DGX SuperPOD series is NVIDIA’s flagship AI supercomputing platform. The GB200 NVL72 SuperNode integrates 36 Grace CPUs and 72 Blackwell GPUs in a liquid‑cooled cabinet, using a “GPU‑GPU NVLink Scale‑Up + Node‑Node RDMA Scale‑Out” interconnect scheme.
Compute Tray: The system contains 18 Compute Trays; each tray holds two GB200 Grace Blackwell superchips, and each superchip pairs one Grace CPU with two Blackwell B200 GPUs, for a total of 72 B200 GPUs and 36 Grace CPUs. NVLink and NVLink‑C2C provide GPU‑GPU and GPU‑CPU high‑speed memory sharing, delivering 7.2 TB/s (single‑direction 28.8 Tb/s) per tray and 129.6 TB/s for the whole cabinet (a quick arithmetic check of these figures follows this list).
Switch Tray: Nine Switch Trays each embed two NVSwitch chips, for 18 NVSwitch chips in total. Backplane cables connect the Compute and Switch Trays, providing 14.4 TB/s (single‑direction 57.6 Tb/s) per tray and the same 129.6 TB/s for the cabinet, enabling full‑mesh GPU connectivity.
Scale‑Up: NVLink 5 and NVSwitch create a high‑bandwidth, low‑latency internal network, allowing all GPUs to access each other’s HBM and the Grace CPUs’ DDR memory as a unified memory space.
Scale‑Out: ConnectX‑8 (CX8) 800 Gbps RNICs connect the SuperNode to an InfiniBand RDMA Scale‑Out network, allowing multiple NVL72 SuperNodes to form larger SuperPODs (e.g., eight NVL72 cabinets yielding 576 B200 GPUs).
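A quick back‑of‑the‑envelope check of the figures above, assuming the quoted per‑tray numbers are bidirectional aggregates and using NVLink 5's 1.8 TB/s per GPU:

```python
# Back-of-the-envelope check of the NVL72 bandwidth and scale figures.
gpus_per_tray, trays = 4, 18        # 2 GB200 superchips x 2 B200 GPUs per tray
nvlink5_per_gpu_tbs = 1.8           # TB/s per GPU (bidirectional)

tray_bw = gpus_per_tray * nvlink5_per_gpu_tbs
cabinet_bw = trays * tray_bw
print(tray_bw, round(cabinet_bw, 1))   # -> 7.2 TB/s per tray, 129.6 TB/s per cabinet

# Scale-Out: eight NVL72 cabinets joined over the InfiniBand fabric.
print(8 * trays * gpus_per_tray)       # -> 576 B200 GPUs
```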
Huawei CloudMatrix 384
CloudMatrix 384 is Huawei’s ultra‑large AI supernode solution, comprising 384 Ascend 910C NPU chips interconnected in a full‑mesh topology. It introduces a peer‑compute architecture that extends the bus from inside a server to the entire cabinet and even across cabinets.
Compute Tray: Each tray holds eight 910C NPUs and seven L1‑HCCS‑SW switch chips (for Scale‑Up) plus one CDR switch chip (for Scale‑Out). Each 910C integrates two 910B dies and eight HBM2e memory stacks, delivering 781 TFLOPS (FP16) and 3.2 TB/s of memory bandwidth per card.
Switch Tray: Uses CloudEngine 16800 switches with 16 slots, each supporting up to 48 × 400G interfaces, forming an all‑to‑all topology that removes traditional bandwidth bottlenecks.
Scale‑Up: Provides up to 269 TB/s of bandwidth, about 2.1× that of the NVL72 (see the arithmetic sketch after this list), using 400G low‑power optical modules that omit DSP chips to reduce latency and power consumption.
Scale‑Out: Employs an 8‑rail spine‑leaf topology with 400G optics, achieving total inter‑node bandwidth 5.3× that of NVIDIA’s NVL72.
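The ratios above follow directly from the headline numbers; a quick check using only the figures quoted in this article:

```python
# Ratio check for the scale-up bandwidth comparison quoted above.
cloudmatrix_scale_up_tbs = 269.0    # TB/s, CloudMatrix 384 (as quoted)
nvl72_scale_up_tbs = 129.6          # TB/s, GB200 NVL72 (as quoted)
npus = 384

print(round(cloudmatrix_scale_up_tbs / nvl72_scale_up_tbs, 1))      # -> ~2.1x
print(round(cloudmatrix_scale_up_tbs / npus, 2), "TB/s per NPU")    # -> ~0.7 TB/s per NPU
```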
ETH‑X Project
The ETH‑X project, led by ODCC with partners such as the China Academy of Information and Communications Technology (CAICT) and Tencent, specifies a supernode with 64 GPUs per cabinet and adopts an open RoCE interconnect instead of NVIDIA’s proprietary NVLink.
Compute Tray: Each tray contains four GPUs and one x86 CPU, connected via a PCIe switch. The full cabinet hosts 64 GPUs and provides four NICs per tray for Scale‑Out expansion.
Switch Tray: Each tray includes a high‑performance 51.2 Tbps Ethernet switch supporting RoCE; eight switches per cabinet deliver a total of 409.6 Tbps of bandwidth (half for intra‑cabinet GPU connections, half for inter‑cabinet links).
With Intel Gaudi 3 accelerators, which provide 4.8 Tbps of Ethernet bandwidth each, the cabinet requires twelve Switch Trays (the arithmetic is sketched below). An alternative configuration omits the external Scale‑Out ports and uses all SerDes links for internal connectivity, reducing this to four 2U Switch Trays.
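The twelve‑tray figure is consistent with the split described above, where half of every switch faces the accelerators and half faces other cabinets; a sketch of that arithmetic follows. The four‑tray variant depends on how much switching capacity a 2U tray carries, which this summary does not spell out, so it is not reproduced here.

```python
import math

# Switch-count check for the Gaudi 3 configuration described above
# (integer Gbps keeps the arithmetic exact).
gpus = 64
gpu_eth_gbps = 4_800      # 4.8 Tbps of Ethernet per accelerator
switch_gbps = 51_200      # one 51.2 Tbps Ethernet switch tray

gpu_facing_demand = gpus * gpu_eth_gbps     # 307,200 Gbps toward the accelerators
# Half of every switch is reserved for inter-cabinet (Scale-Out) uplinks,
# so only half of each tray's capacity faces the accelerators:
trays = math.ceil(gpu_facing_demand / (switch_gbps // 2))
print(trays)                                # -> 12 Switch Trays
```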