Which GPU Cluster Network Wins for LLM Training? NVLink, InfiniBand, RoCE & DDC Compared
This article analyzes the main GPU/TPU cluster networking options—NVLink, InfiniBand, RoCE Ethernet, and DDC full‑schedule fabrics—examining latency, lossless transmission, congestion control, cost, power, and scalability to determine their suitability for large‑scale LLM training.
Background and Key Requirements
Modern large‑language‑model (LLM) training relies on high‑performance GPU/TPU clusters. To achieve good training throughput, the inter‑GPU network must satisfy three core criteria: (1) low end‑to‑end latency, because frequent GPU‑to‑GPU communication directly impacts total training time; (2) lossless transmission, as any lost gradient or intermediate result forces a rollback to the previous checkpoint; and (3) effective congestion‑control mechanisms to avoid both transient and persistent congestion in tree‑topology networks, which can stall multiple GPUs.
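To illustrate why latency and bandwidth both matter, the minimal sketch below applies the textbook alpha-beta cost model to a ring all-reduce, a common collective for gradient exchange. The model choice, the function name ring_allreduce_seconds, and all example numbers are assumptions made for illustration; they are not taken from the article's sources.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, link_gbps, hop_latency_us):
    """Estimate ring all-reduce time with the alpha-beta cost model.

    A ring all-reduce takes 2*(N-1) steps. In each step every GPU sends
    message_bytes/N to its neighbour and pays one hop latency.
    """
    if num_gpus < 2:
        return 0.0
    steps = 2 * (num_gpus - 1)
    chunk_bits = 8 * message_bytes / num_gpus
    per_step = hop_latency_us * 1e-6 + chunk_bits / (link_gbps * 1e9)
    return steps * per_step


# Hypothetical comparison: a small and a large message on the same fabric.
small = ring_allreduce_seconds(256, 1_000, link_gbps=400, hop_latency_us=2)
large = ring_allreduce_seconds(256, 1_000_000_000, link_gbps=400, hop_latency_us=2)
```

In this model small messages are dominated by the per-hop latency term and large messages by the bandwidth term, which is why a training fabric has to be good at both.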
Beyond these technical factors, total system cost, power consumption, and cooling also influence the choice of networking technology.
1. NVLink Switch System
NVLink is a high‑speed point‑to‑point link designed specifically for GPU‑to‑GPU communication. At the Hot Chips 2022 conference, Nvidia demonstrated an NVSwitch‑based topology that can connect up to 32 nodes (256 GPUs). The third‑generation NVSwitch provides 64 NVLink ports with up to 12.8 Tbps of switching capacity and supports multicast and in‑switch aggregation, reducing the amount of data that must traverse the external network.
In GPT‑3 training, NVSwitch reportedly delivered roughly twice the speed of an InfiniBand network. Its switching capacity (≈12.8 Tbps), however, is only one quarter of the 51.2 Tbps offered by high‑end Ethernet switches. Scaling NVSwitch beyond ~1,000 GPUs is cost‑prohibitive and limited by protocol constraints, and Nvidia does not sell NVSwitches separately, so they cannot be used to build mixed‑vendor GPU clusters.
2. InfiniBand (IB) Network
InfiniBand, introduced in 1999, provides a high‑speed, low‑latency, lossless RDMA‑capable fabric widely used in HPC and AI clusters. Its advantages include sub‑microsecond latency, zero‑loss transmission, and remote direct memory access (RDMA) that moves data directly between GPU memories without CPU involvement.
However, IB switches are relatively expensive, require dedicated host channel adapters and cables, and are harder to configure and scale than Ethernet. InfiniBand suits small‑to‑medium clusters, but scaling beyond 32K GPUs can be challenging because of the centralized subnet manager and higher hardware costs.
3. RoCE Lossless Ethernet
Ethernet speeds now range from 1 Gbps to 800 Gbps, with future roadmaps targeting 1.6 Tbps. Compared with InfiniBand, Ethernet offers higher port speeds and total switching capacity, and its switches are generally cheaper per gigabit because of intense competition among ASIC vendors.
RoCE (RDMA over Converged Ethernet) provides lossless transmission via Priority Flow Control (PFC) and supports up to eight traffic classes, allowing certain classes to be marked lossless. Advanced end‑to‑end congestion‑control schemes such as DCQCN can further reduce congestion and packet loss. Load‑balancing is typically achieved with ECMP routing and can be enhanced with adaptive schemes that reroute traffic from congested paths.
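The ECMP behaviour described above can be sketched as a hash of the flow 5‑tuple modulo the number of equal‑cost paths. This is a minimal illustration only: real switch ASICs use vendor‑specific hash functions, and the CRC32 hash, the function names, and the example addresses here are assumptions. The destination port 4791 is the UDP port used by RoCEv2.

```python
import zlib


def ecmp_pick_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Pick an uplink index by hashing the flow 5-tuple (static ECMP)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_paths


def path_load(flows, num_paths):
    """Count how many flows the hash places on each path."""
    load = [0] * num_paths
    for flow in flows:
        load[ecmp_pick_path(*flow, num_paths)] += 1
    return load


# Eight hypothetical RoCEv2 flows (UDP, destination port 4791) over 4 uplinks.
flows = [("10.0.0.1", "10.0.1.1", 49152 + i, 4791, 17) for i in range(8)]
load = path_load(flows, 4)
```

Because the mapping is static, a handful of long‑lived, high‑bandwidth flows can hash onto the same uplink while others sit idle. Adaptive schemes address this by moving traffic away from congested paths.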
High‑end Ethernet ASICs can deliver up to 51.2 Tbps of switching capacity with 800 Gbps ports—twice the capacity of Nvidia’s Quantum‑2 IB platform—potentially halving the number of switches needed for a given GPU fabric.
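The switch‑count claim can be checked with a rough calculation. The sketch below assumes a non‑blocking two‑tier leaf‑spine fabric in which each leaf splits its ports evenly between GPUs and uplinks, one 400 Gbps port per GPU, and 25.6 Tbps for the InfiniBand platform (half of 51.2 Tbps, as stated above). The function name and the 2,048‑GPU example are illustrative assumptions.

```python
import math


def two_tier_switch_count(num_gpus, switch_gbps, port_gbps):
    """Return (leaves, spines, total) for a non-blocking two-tier fabric."""
    radix = switch_gbps // port_gbps   # ports per switch at this speed
    down = radix // 2                  # half the leaf ports face GPUs
    if num_gpus > radix * down:
        raise ValueError("cluster does not fit in two tiers at this radix")
    leaves = math.ceil(num_gpus / down)
    # Every leaf needs as many uplinks as downlinks; spines supply the ports.
    spines = math.ceil(leaves * down / radix)
    return leaves, spines, leaves + spines


ethernet = two_tier_switch_count(2048, 51_200, 400)  # (32, 16, 48)
half_cap = two_tier_switch_count(2048, 25_600, 400)  # (64, 32, 96)
```

Under these assumptions the 51.2 Tbps switch needs 48 devices for 2,048 GPUs against 96 for the 25.6 Tbps switch. Real designs may differ because of oversubscription, rail‑optimized layouts, or a third tier.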
4. DDC Full‑Schedule Fabric (VOQ Architecture)
Recent switch/router ASICs from vendors such as Juniper and Broadcom support full‑schedule fabrics, as used in a DDC (Distributed Disaggregated Chassis), that combine cell‑based switching with Virtual Output Queues (VOQ). In a VOQ system, each ingress leaf switch maintains a separate queue for every possible egress port, allowing precise per‑destination scheduling and eliminating most head‑of‑line (HOL) blocking.
Packets are buffered at the ingress leaf until the egress leaf grants transmission rights, so the egress link is never offered more traffic than it can carry. This request‑grant handshake adds a small round‑trip latency but dramatically reduces tail latency and eliminates incast‑induced congestion.
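The request‑grant idea can be sketched as a time‑slotted toy model. This is an assumption‑laden simplification and not any vendor's protocol: each egress port grants one cell per slot in round‑robin order, ingress‑side bandwidth limits are ignored, and the class name VoqFabric is hypothetical.

```python
from collections import deque


class VoqFabric:
    def __init__(self, num_ingress, num_egress):
        # voq[i][e] holds cells waiting at ingress i for egress e.
        self.voq = [[deque() for _ in range(num_egress)]
                    for _ in range(num_ingress)]
        # Round-robin pointer per egress port for fair granting.
        self.next_ingress = [0] * num_egress

    def enqueue(self, ingress, egress, cell):
        self.voq[ingress][egress].append(cell)

    def run_slot(self):
        """One time slot: each egress grants at most one waiting cell."""
        delivered = {}
        n = len(self.voq)
        for egress, start in enumerate(self.next_ingress):
            for offset in range(n):
                ingress = (start + offset) % n
                if self.voq[ingress][egress]:  # non-empty VOQ = pending request
                    delivered[egress] = self.voq[ingress][egress].popleft()
                    self.next_ingress[egress] = (ingress + 1) % n
                    break
        return delivered


# Incast: three ingress leaves each hold two cells for egress 0,
# while ingress 0 also holds one cell for the idle egress 1.
fabric = VoqFabric(num_ingress=3, num_egress=2)
for i in range(3):
    for k in range(2):
        fabric.enqueue(i, 0, (i, k))
fabric.enqueue(0, 1, "x")
first = fabric.run_slot()
```

In the first slot the cell for egress 1 is delivered even though it arrived behind two cells bound for the busy egress 0, which is the HOL‑blocking benefit. Egress 0 still receives only one cell per slot, so the incast is absorbed in the ingress queues instead of overflowing the egress link.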
Limitations include the need for large ingress buffers proportional to the number of GPUs and priority queues, and vendor lock‑in because each supplier uses proprietary VOQ protocols, making mixed‑vendor deployments difficult.
Comparative Summary
NVLink: Excellent intra‑server GPU communication with low latency and high bandwidth, but limited scalability and high cost for large clusters.
InfiniBand: Provides native RDMA, low latency, and lossless transmission; best for small‑to‑medium clusters where cost and closed architecture are acceptable.
RoCE Ethernet: Leverages the mature Ethernet ecosystem and offers the lowest cost per unit of bandwidth, the highest port speeds, and rapid bandwidth iteration, which makes it the most suitable option for medium‑to‑large GPU clusters.
DDC/VOQ Fabric: Promising approach that mitigates Ethernet congestion and HOL blocking, but still at an early stage, with vendor‑specific implementations and higher complexity.
Architects' Tech Alliance
