Designing High‑Performance Cluster Networks for AI Large Models: InfiniBand vs RoCE
This article analyzes the networking challenges posed by super-large AI models, compares InfiniBand and RoCE, and presents design guidelines for ultra-scale, high-bandwidth, low-latency, and highly stable cluster interconnects that maximize GPU utilization and overall training efficiency.
According to OpenAI's "Scaling Laws for Neural Language Models", increasing model parameters improves performance, but the gains are bounded by available compute, making optimization of cluster compute capacity a core issue.
The effective compute power of a cluster is the product of per-GPU utilization and linear scaling efficiency. GPU utilization depends on chip process, memory and I/O bottlenecks, inter-GPU bandwidth, topology, and power, while linear scaling is governed by inter-node communication, the parallel training framework, and resource scheduling.
Designing an efficient cluster networking scheme that provides low latency, high bandwidth, and non‑blocking inter‑node communication is crucial for reducing data‑synchronization time and increasing the GPU effective compute ratio (GPU compute time / total training time).
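The GPU effective compute ratio defined above can be sketched in a few lines. The formula and the per-step timings below are illustrative assumptions, not figures from the article; the `overlap` parameter models how much communication a training framework can hide behind computation:

```python
def effective_compute_ratio(compute_s: float, comm_s: float, overlap: float = 0.0) -> float:
    """GPU effective compute ratio = GPU compute time / total training time.

    overlap: fraction of communication hidden behind computation (0..1).
    Illustrative model, not a formula from the article.
    """
    exposed_comm = comm_s * (1.0 - overlap)
    return compute_s / (compute_s + exposed_comm)

# Hypothetical step: 80 ms of compute, 40 ms of gradient synchronization.
print(effective_compute_ratio(0.080, 0.040))               # no overlap: ~0.67
print(effective_compute_ratio(0.080, 0.040, overlap=0.75)) # comm mostly hidden
```

The point of the sketch: cutting exposed communication time (via lower latency, higher bandwidth, or better overlap) raises the ratio directly, which is why interconnect design dominates cluster efficiency.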
According to China Mobile Research Institute's "AI Large‑Model Data Center Network Evolution Whitepaper", AI super‑large models impose new network requirements:
Ultra‑scale networking: Models with trillions of parameters demand massive compute and correspondingly scalable network capacity.
Ultra‑high bandwidth: Intra‑ and inter‑rack collective communications (e.g., All‑Reduce) can move hundreds of gigabytes per operation, requiring extremely high intra‑ and inter‑rack bandwidth.
Ultra‑low latency: Communication latency consists of static (chip forwarding and distance) and dynamic (switch queuing, packet loss, retransmission) components.
Ultra‑high stability & automation: As GPU counts grow, network reliability becomes the weakest link; failures directly affect node connectivity and resource utilization.
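The "hundreds of GB" bandwidth requirement above follows from the standard traffic model of a ring All‑Reduce, where each GPU sends and receives roughly 2·(N−1)/N times the message size. The model size and GPU count below are illustrative assumptions, not figures from the article:

```python
def ring_allreduce_bytes_per_gpu(message_bytes: float, n_gpus: int) -> float:
    """Per-GPU traffic of a ring All-Reduce: 2 * (N-1)/N * message size.

    Standard result for the ring algorithm (reduce-scatter + all-gather),
    used here as an illustrative traffic model.
    """
    return 2.0 * (n_gpus - 1) / n_gpus * message_bytes

# Hypothetical case: 10B-parameter model, fp16 gradients (2 bytes each), 8 GPUs.
grad_bytes = 10e9 * 2
per_gpu = ring_allreduce_bytes_per_gpu(grad_bytes, 8)
print(f"{per_gpu / 1e9:.0f} GB per GPU per All-Reduce")  # tens of GB every step
```

Since such a transfer happens every training step, even tens of GB per step translates into a sustained demand for hundreds of Gb/s of interconnect bandwidth per GPU.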
RDMA cuts end‑to‑end latency by letting NICs move data directly between application memories, bypassing the OS kernel; the two main RDMA solutions are InfiniBand and RoCEv2 (RDMA over Converged Ethernet v2).
The Ultra Ethernet Consortium (UEC), launched on July 19, 2023 by major cloud, networking, and semiconductor vendors, aims to provide an open, interoperable, high‑performance full‑stack Ethernet architecture for AI and HPC workloads.
InfiniBand, introduced in 2000, pioneered commercial RDMA and offers low latency, high bandwidth, and high reliability; after 2015 it became the dominant interconnect among TOP500 supercomputers. Mellanox is the primary supplier; Nvidia announced its acquisition of Mellanox in 2019 and completed it in 2020.
Using Nvidia's latest GB200 NVL72 platform as an example, the rack integrates 18 compute trays and 9 NVLink switch trays connected via copper cable cartridges and liquid cooling; Nvidia claims 25× the performance of air‑cooled H100 infrastructure at the same power.
Each compute tray houses two GB200 Grace‑Blackwell Superchips, four ConnectX‑800G InfiniBand SuperNICs, and one Bluefield‑3 DPU. The Superchip integrates two Blackwell GPUs and one Grace CPU, delivering ~20 petaFLOPS FP8 AI performance, 8 TB/s memory bandwidth, and 18 NVLink ports (1.8 TB/s bi‑directional).
ConnectX‑800G provides 800 Gb/s end‑to‑end bandwidth via PCIe 6.0, supporting OSFP 224 and QSFP112 connectors and NVIDIA Socket Direct.
Bluefield‑3 DPU offers 400 Gb/s Ethernet or 400 Gb/s InfiniBand connectivity, offloading and accelerating software‑defined networking, storage, security, and management.
Within a rack, NVLink 5th‑generation links deliver 1.8 TB/s bi‑directional bandwidth—double the previous generation and over 14× PCIe Gen5. This enables 1.8 TB/s GPU‑to‑GPU communication, making large‑scale GPU expansion feasible.
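The bandwidth comparisons above can be checked with simple arithmetic. The PCIe Gen5 x16 figure of ~128 GB/s bidirectional (~64 GB/s each direction) is an assumption supplied here, not stated in the article:

```python
# Published NVLink figures (per Blackwell GPU, bidirectional).
nvlink5_bidir_gbs = 1800   # NVLink 5: 1.8 TB/s across 18 links
nvlink4_bidir_gbs = 900    # previous generation (Hopper): 900 GB/s

# Assumption: PCIe Gen5 x16 at ~64 GB/s per direction, 128 GB/s bidirectional.
pcie_gen5_x16_bidir_gbs = 128

print(nvlink5_bidir_gbs / nvlink4_bidir_gbs)        # 2x the previous generation
print(nvlink5_bidir_gbs / pcie_gen5_x16_bidir_gbs)  # ~14x PCIe Gen5
```

Both ratios line up with the "double the previous generation" and "over 14× PCIe Gen5" claims in the text.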
For inter‑rack scaling, when the GPU count exceeds 72, a single NVLink domain per rack is insufficient. NVLink‑only clusters can connect up to 576 GPUs across multiple NVL72 racks, while InfiniBand clusters can use NVIDIA Quantum‑X800 Q3400 switches (144 × 800 Gb/s ports) to interconnect up to 10 368 GPUs — the capacity of a two‑layer fat tree of such switches (144²/2).
Network layer analysis shows that a 2‑layer architecture requires roughly 2.5 optical modules (1.6 T) per GPU, while a 3‑layer architecture needs about 3.5 modules per GPU.
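The per‑GPU module counts above can be reproduced with a simple counting model. The sharing assumptions below are hypothetical (supplied here to match the article's figures, not stated in it): each GPU drives one 800 Gb/s port, a 1.6 T module carries two 800 G flows so a shared switch‑side endpoint costs 0.5 module per GPU, the NIC end of the access link needs a full module, and each additional network layer adds one inter‑switch link per GPU with a shared module at both ends:

```python
def modules_per_gpu(layers: int) -> float:
    """Hypothetical counting model for 1.6T optical modules per GPU
    in a non-blocking fat tree (assumptions described in the lead-in)."""
    nic_end = 1.0               # dedicated module at the NIC end of the access link
    leaf_end = 0.5              # shared 1.6T endpoint on the leaf switch
    per_extra_tier = 0.5 + 0.5  # shared endpoints at both ends of each tier's link
    return nic_end + leaf_end + (layers - 1) * per_extra_tier

print(modules_per_gpu(2))  # 2.5, matching the article's 2-layer figure
print(modules_per_gpu(3))  # 3.5, matching the 3-layer figure
```

The model makes the cost of depth explicit: every additional switching layer adds roughly one 1.6 T module per GPU, which is why flattening the topology matters for optics cost.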
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.