Choosing the Right GPU Cluster Network: NVLink, InfiniBand, RoCE & DDC Explained
This article examines the key GPU/TPU cluster networking options—NVLink, InfiniBand, RoCE Ethernet, and emerging DDC full‑scheduling fabrics—detailing their latency, loss‑less transmission, congestion control, cost, power, and scalability considerations for large‑scale AI training deployments.
Modern AI training clusters rely on high‑performance interconnects to move gradients and model data between thousands of GPUs. Selecting the appropriate network technology requires balancing end‑to‑end latency, loss‑less transmission, congestion control, total cost, power consumption, and cooling.
Core Requirements for GPU Networks
End‑to‑end latency: Frequent GPU‑to‑GPU communication makes low latency essential for reducing overall training time.
Loss‑less transmission: Dropped packets carrying gradients or intermediate results can force training to roll back to the last checkpoint, which severely hurts performance.
Effective congestion control: In tree topologies, multiple nodes sending to a single node cause transient and persistent congestion, which can stall many GPUs.
Additional factors include total system cost, power draw, and cooling requirements.
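To make the latency requirement concrete, the sketch below estimates the duration of one ring all-reduce with the standard alpha-beta cost model (a per-step latency term plus a bandwidth term). The ring algorithm, the function name, and the example figures (1,024 GPUs, a 1 GB gradient buffer, 400 Gbps links, 2 µs versus 10 µs per hop) are illustrative assumptions, not values from this article.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, link_gbps, hop_latency_us):
    """Estimate ring all-reduce time with the alpha-beta cost model.

    Assumes a ring of num_gpus peers, each with one link of link_gbps,
    and a fixed per-step latency of hop_latency_us microseconds.
    """
    if num_gpus < 2:
        return 0.0
    steps = 2 * (num_gpus - 1)               # reduce-scatter + all-gather phases
    chunk_bytes = message_bytes / num_gpus   # each step moves one chunk per GPU
    bytes_per_second = link_gbps * 1e9 / 8
    latency_term = steps * hop_latency_us * 1e-6
    bandwidth_term = steps * chunk_bytes / bytes_per_second
    return latency_term + bandwidth_term


# Hypothetical comparison: same bandwidth, different per-hop latency.
for latency_us in (2, 10):
    t = ring_allreduce_seconds(1024, 1e9, 400, latency_us)
    print(f"{latency_us} us per hop: {t * 1e3:.2f} ms per all-reduce")
```

Because the latency term grows with the number of steps, per-hop latency matters more as the GPU count rises, even when link bandwidth is unchanged.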
NVLink Switch System
NVLink is a high‑speed point‑to‑point link designed specifically for GPU‑GPU communication. Nvidia demonstrated an NVSwitch‑based topology that connects up to 32 nodes (256 GPUs). The third‑generation NVSwitch provides 64 NVLink ports with up to 12.8 Tbps of switching capacity, and it supports multicast and in‑switch aggregation to reduce inter‑GPU traffic.
In GPT‑3 training, NVSwitch achieved roughly twice the speed of InfiniBand. However, its 12.8 Tbps switching capacity is only one quarter of the 51.2 Tbps offered by top‑tier Ethernet switches. Scaling beyond ~1,000 GPUs with NVSwitch is cost‑prohibitive and limited by protocol constraints, and because NVSwitches are sold only by Nvidia, they cannot be used to build mixed‑vendor GPU clusters.
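The capacity figures quoted above can be checked with a few lines of arithmetic. This is only a sketch that divides the stated numbers; the variable names are made up for illustration.

```python
# Figures quoted in the text, expressed in Gbps to keep the arithmetic exact.
NVSWITCH_CAPACITY_GBPS = 12_800   # 12.8 Tbps, third-generation NVSwitch
NVSWITCH_PORTS = 64
ETHERNET_CAPACITY_GBPS = 51_200   # 51.2 Tbps, top-tier Ethernet switch

# Capacity per NVLink port implied by the quoted totals.
per_port_gbps = NVSWITCH_CAPACITY_GBPS // NVSWITCH_PORTS

# How many times larger the Ethernet switch capacity is.
capacity_ratio = ETHERNET_CAPACITY_GBPS / NVSWITCH_CAPACITY_GBPS

print(f"NVSwitch capacity per port: {per_port_gbps} Gbps")
print(f"Ethernet / NVSwitch capacity ratio: {capacity_ratio:.1f}x")
```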
InfiniBand Network
InfiniBand (IB) has been a high‑performance alternative to Ethernet since 1999, offering low latency, loss‑less transmission, and RDMA capabilities. It is widely used in HPC, AI/ML clusters, and data centers. IB uses credit‑based flow control, which guarantees loss‑less delivery, and supports congestion notification similar to ECN.
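The idea behind credit-based flow control can be sketched with a toy model: the sender holds one credit per free receive buffer slot and stalls, rather than drops, when credits run out. The class and function names and the buffer sizes below are hypothetical; real IB links track credits per virtual lane in hardware.

```python
from collections import deque


class CreditLink:
    """Toy link: the sender may transmit only while it holds credits."""

    def __init__(self, receiver_buffer_slots):
        self.capacity = receiver_buffer_slots
        self.credits = receiver_buffer_slots  # one credit per free receive slot
        self.rx_buffer = deque()

    def try_send(self, packet):
        if self.credits == 0:
            return False                      # sender waits; nothing is dropped
        self.credits -= 1
        self.rx_buffer.append(packet)
        assert len(self.rx_buffer) <= self.capacity  # buffer can never overflow
        return True

    def drain_one(self):
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1                     # freed slot returns a credit
        return packet


def simulate(num_packets=20, buffer_slots=4, drain_every=3):
    """Sender offers one packet per tick; receiver drains one every few ticks."""
    link = CreditLink(buffer_slots)
    pending = deque(range(num_packets))
    delivered, stalls, tick = [], 0, 0
    while pending or link.rx_buffer:
        tick += 1
        if pending:
            if link.try_send(pending[0]):
                pending.popleft()
            else:
                stalls += 1                   # back-pressure instead of loss
        if tick % drain_every == 0:
            packet = link.drain_one()
            if packet is not None:
                delivered.append(packet)
    return delivered, stalls


delivered, stalls = simulate()
print(f"delivered {len(delivered)} packets in order, sender stalled {stalls} times")
```

The slow receiver shows up as sender stalls, not as missing packets, which is the property the text calls loss-less delivery.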
RDMA is native to IB: host channel adapters (HCAs) move data directly between GPU memories without CPU involvement, which boosts throughput and reduces latency.
However, IB’s closed architecture, higher cost, and the need for specialized HCAs and cables make it less popular than Ethernet for very large clusters. Major deployments include OpenAI’s 10,000‑GPU Azure cluster and Meta’s 16K‑GPU cluster using Nvidia A100 GPUs and Quantum‑2 IB switches.
RoCE Lossless Ethernet
Ethernet speeds now range from 1 Gbps to 800 Gbps, with future expectations of 1.6 Tbps. Compared with IB, Ethernet offers higher port speeds and total switching capacity, and its switches are generally cheaper per bandwidth unit due to intense competition among ASIC vendors.
High‑end Ethernet ASICs can provide up to 51.2 Tbps of switching capacity with 800 Gbps ports—twice the capacity of Nvidia’s Quantum‑2 IB platform. Ethernet also supports loss‑less transmission via Priority Flow Control (PFC), which provides eight service classes, each of which can be configured for loss‑less delivery.
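As a rough illustration of per-priority pausing, the sketch below models one receiving port with eight priority queues, only one of which is configured as loss-less. When that queue reaches an XOFF threshold the upstream sender is paused for that priority only, and it resumes once the queue drains to an XON threshold. The thresholds, traffic rates, and names are assumptions; the model also ignores the in-flight headroom that a real PFC deployment must reserve.

```python
class PfcQueue:
    """One priority class on a receiving port, with XOFF/XON thresholds."""

    def __init__(self, xoff, xon, lossless):
        assert xon < xoff
        self.xoff, self.xon = xoff, xon
        self.lossless = lossless
        self.depth = 0
        self.paused = False   # True while the upstream sender is paused

    def enqueue(self):
        """Add one packet; return True if this arrival triggers a pause."""
        self.depth += 1
        if self.lossless and not self.paused and self.depth >= self.xoff:
            self.paused = True    # a real port would send a pause for this priority
            return True
        return False

    def dequeue(self):
        if self.depth > 0:
            self.depth -= 1
        if self.paused and self.depth <= self.xon:
            self.paused = False   # a real port would signal resume here


def simulate(ticks=50, hot_priority=3):
    """Sender offers 2 packets per tick on one priority; receiver drains 1."""
    queues = [PfcQueue(xoff=8, xon=4, lossless=(p == hot_priority))
              for p in range(8)]
    hot = queues[hot_priority]
    max_depth = pause_events = 0
    for _ in range(ticks):
        for _ in range(2):
            if not hot.paused:            # sender honours the pause immediately
                pause_events += hot.enqueue()
                max_depth = max(max_depth, hot.depth)
        hot.dequeue()
    return max_depth, pause_events, queues


max_depth, pause_events, queues = simulate()
print(f"max queue depth {max_depth}, pause events {pause_events}")
print("other priorities paused:", any(q.paused for i, q in enumerate(queues) if i != 3))
```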
RDMA over Converged Ethernet (RoCEv2) encapsulates RDMA traffic in UDP/IP packets, enabling direct GPU memory writes without CPU intervention. Congestion control schemes such as DCQCN further reduce tail latency and packet loss.
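The sender-side behaviour of DCQCN can be sketched as follows. This is a heavily simplified model based on the rate-update rules in the published DCQCN design: a congestion notification packet (CNP) cuts the current rate in proportion to a congestion estimate alpha, and quiet periods decay alpha and move the rate halfway back toward the previous target. It merges the separate alpha and rate timers into one call and omits the additive and hyper increase stages, so treat it as an illustration, not as what any NIC implements. The class name and the 400 Gbps line rate are assumptions.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point (sender) state machine."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current_rate = line_rate_gbps   # RC: rate currently enforced
        self.target_rate = line_rate_gbps    # RT: rate to recover toward
        self.alpha = 1.0                     # congestion estimate in [0, 1]
        self.g = g                           # gain for the alpha moving average

    def on_cnp(self):
        """CNP received: remember the old rate, then cut multiplicatively."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP for one update period: decay alpha, recover halfway to target."""
        self.alpha = (1 - self.g) * self.alpha
        self.current_rate = (self.current_rate + self.target_rate) / 2


sender = DcqcnSender(line_rate_gbps=400)
sender.on_cnp()
print(f"after CNP: {sender.current_rate:.1f} Gbps")
for period in range(1, 4):
    sender.on_quiet_period()
    print(f"quiet period {period}: {sender.current_rate:.1f} Gbps")
```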
DDC Full‑Scheduling Fabric (VOQ Architecture)
Recent switch/router ASICs support full‑scheduling fabrics (also called AI Fabric), the basis of Distributed Disaggregated Chassis (DDC) designs, that employ Virtual Output Queues (VOQ). In a VOQ design, each ingress leaf switch maintains a separate queue for every possible egress port, allowing precise control of packet admission and preventing oversubscription.
When a leaf switch’s VOQ accumulates packets, it requests transmission from the egress leaf. The egress scheduler grants requests according to a strict scheduling hierarchy and its available buffer space, ensuring that no link is overloaded.
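The request/grant exchange can be sketched with a small simulation. Each ingress leaf keeps one queue per egress port, non-empty queues issue requests, and each egress grants no more packets per tick than it can accept. The rotating round-robin grant order used here is a stand-in assumption for the scheduling hierarchy described above, and the leaf names, capacities, and packet tuples are hypothetical.

```python
from collections import defaultdict, deque


def schedule_tick(voqs, egress_capacity, tick):
    """Run one request/grant round.

    voqs maps (ingress, egress) to a deque of packets.
    egress_capacity maps egress to the packets it can accept per tick.
    Returns {egress: [packets delivered this tick]}.
    """
    requests = defaultdict(list)
    for (ingress, egress), queue in voqs.items():
        if queue:
            requests[egress].append(ingress)   # non-empty VOQ asks for a grant

    delivered = defaultdict(list)
    for egress, ingresses in requests.items():
        order = deque(sorted(ingresses))
        order.rotate(-(tick % len(order)))     # rotate start point for fairness
        budget = egress_capacity[egress]
        while budget > 0 and order:
            ingress = order.popleft()
            queue = voqs[(ingress, egress)]
            delivered[egress].append(queue.popleft())   # grant one packet
            budget -= 1
            if queue:
                order.append(ingress)          # still has packets, may ask again
    return dict(delivered)


def drain(voqs, egress_capacity):
    """Repeat rounds until every VOQ is empty; return the per-tick deliveries."""
    per_tick = []
    while any(voqs.values()):
        per_tick.append(schedule_tick(voqs, egress_capacity, len(per_tick)))
    return per_tick


# Incast: three ingress leaves each hold 4 packets for E0; L0 also has 2 for E1.
voqs = {(leaf, "E0"): deque((leaf, "E0", i) for i in range(4))
        for leaf in ("L0", "L1", "L2")}
voqs[("L0", "E1")] = deque(("L0", "E1", i) for i in range(2))
capacity = {"E0": 2, "E1": 2}

for tick, result in enumerate(drain(voqs, capacity)):
    print(tick, {egress: len(pkts) for egress, pkts in result.items()})
```

Because L0 queues its E1 traffic separately, those packets are delivered in the first round even though E0 is congested, which is the head-of-line blocking that VOQs are meant to avoid.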
This approach eliminates head‑of‑line (HOL) blocking and dramatically reduces tail latency, even at massive scales (32K‑64K GPUs). However, VOQ systems require large ingress buffers proportional to the number of GPUs and priority queues, and vendor‑specific protocols can hinder multi‑vendor interoperability.
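The buffer cost mentioned above can be estimated with simple arithmetic. The sketch assumes one egress port per GPU and a fixed reservation per VOQ, both of which are illustrative assumptions; real ASICs typically share buffer memory dynamically, so this is closer to a worst-case bound than a sizing rule.

```python
def voq_count(num_egress_ports, num_priorities):
    """Each ingress leaf keeps one VOQ per (egress port, priority) pair."""
    return num_egress_ports * num_priorities


def ingress_buffer_bytes(num_egress_ports, num_priorities, bytes_per_voq):
    """Buffer needed if every VOQ had its own fixed reservation (assumption)."""
    return voq_count(num_egress_ports, num_priorities) * bytes_per_voq


# Hypothetical cluster: 32K GPUs, 8 priorities, 16 KiB reserved per VOQ.
queues = voq_count(32 * 1024, 8)
total = ingress_buffer_bytes(32 * 1024, 8, 16 * 1024)
print(f"{queues} VOQs per ingress leaf, {total / 1024**3:.0f} GiB if fully reserved")
```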
Summary of Mainstream GPU Cluster Networking Technologies
NVLink: Provides high‑speed GPU‑GPU links with low overhead, best suited for intra‑server or small‑scale multi‑server clusters; limited scalability beyond a few hundred GPUs.
InfiniBand: Offers native RDMA, ultra‑low latency, and loss‑less delivery; ideal for medium‑scale clusters where cost and closed ecosystem are acceptable.
RoCE Ethernet: Leverages the mature Ethernet ecosystem, delivering the lowest cost per bandwidth and rapid bandwidth iteration; well‑suited for large‑scale training clusters.
DDC (VOQ‑based) Fabric: Combines cell‑based switching with VOQ to solve Ethernet congestion; still in research/early‑adoption phase but shows promise for future ultra‑large clusters.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
