
Overview of Popular GPU/TPU Cluster Networking Technologies: NVLink, InfiniBand, RoCE, and DDC

This article reviews the main GPU/TPU cluster networking solutions—including NVLink, InfiniBand, RoCE Ethernet, and DDC full‑schedule fabrics—examining their latency, loss‑free transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

Architects' Tech Alliance

Popular GPU/TPU cluster networking technologies such as NVLink, InfiniBand, RoCE Ethernet, and DDC full‑schedule fabrics are introduced, with a focus on how they connect GPUs and support large‑scale LLM training.

Key performance requirements for GPU networks include low end‑to‑end latency, loss‑free transmission to avoid training rollbacks, and effective congestion control to prevent bottlenecks in tree topologies.
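
To see why bandwidth and per-hop latency both matter, the cost of one gradient sync can be sketched with a back-of-the-envelope model of a ring all-reduce, the collective that dominates data-parallel LLM training. The link speed, GPU count, and latency figures below are illustrative assumptions, not measurements:

```python
def ring_allreduce_seconds(payload_bytes, n_gpus, link_gbps, per_hop_latency_us):
    """Classic ring all-reduce: each GPU moves 2*(n-1)/n of the payload,
    in 2*(n-1) sequential steps, each step paying one hop of latency."""
    effective_bytes = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    bandwidth_term = effective_bytes * 8 / (link_gbps * 1e9)
    latency_term = 2 * (n_gpus - 1) * per_hop_latency_us * 1e-6
    return bandwidth_term + latency_term

# Illustrative: 10 GB of gradients across 256 GPUs on 400 Gbps links, 5 us/hop.
t = ring_allreduce_seconds(10e9, 256, 400, 5)
print(f"{t * 1000:.1f} ms per all-reduce")
```

The bandwidth term dominates for large gradient payloads, while the latency term grows linearly with GPU count, which is why both tail latency and lossless delivery matter at scale.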

NVLink Switch System: NVLink provides high-speed point-to-point links between GPUs, offering higher performance and lower protocol overhead than traditional networks. NVIDIA has demonstrated NVSwitch topologies connecting up to 32 nodes (256 GPUs). The third-generation NVSwitch provides 64 NVLink ports and 12.8 Tbps of switching capacity and supports multicast and in-network aggregation, but its bandwidth trails high-end 51.2 Tbps Ethernet switch ASICs, and scaling beyond roughly 1,000 GPUs becomes cost-prohibitive.
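
The quoted NVSwitch capacity can be sanity-checked against its port count. The 200 Gbps per-port figure below is an assumption inferred from the article's own totals, not a vendor datasheet:

```python
# 64 NVLink ports at an assumed 200 Gbps per port per direction
# reproduces the quoted 12.8 Tbps of switching capacity.
ports, gbps_per_port = 64, 200
total_tbps = ports * gbps_per_port / 1000
print(total_tbps)

# Ratio versus a high-end 51.2 Tbps Ethernet switch ASIC:
print(round(51.2 / total_tbps, 1))
```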

InfiniBand (IB) Network: Available since 1999, IB has been a high-performance, low-latency, lossless RDMA interconnect widely used in HPC and AI clusters. It provides credit-based link-level flow control, zero-loss transmission, and remote direct memory access, but its closed, largely single-vendor ecosystem and higher cost limit scalability and make large-scale deployments more challenging.
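
IB's losslessness comes from that link-level credit mechanism: the receiver advertises buffer credits, and when credits run out the sender stalls rather than drops. A minimal Python sketch of the idea (schematic only, not the actual semantics of the IB specification):

```python
from collections import deque

class CreditLink:
    """Toy model of IB-style credit-based flow control: the sender may
    transmit only while it holds credits, so the receiver's buffer can
    never overflow -- the link is lossless by construction."""
    def __init__(self, rx_buffer_slots):
        self.credits = rx_buffer_slots      # credits track free RX slots
        self.rx_queue = deque()

    def send(self, pkt):
        if self.credits == 0:
            return False                    # sender stalls; nothing is dropped
        self.credits -= 1
        self.rx_queue.append(pkt)
        return True

    def receiver_drain(self):
        if self.rx_queue:
            self.rx_queue.popleft()
            self.credits += 1               # credit returned to the sender

link = CreditLink(rx_buffer_slots=2)
assert link.send("p0") and link.send("p1")
assert not link.send("p2")   # out of credits: stall, never drop
link.receiver_drain()
assert link.send("p2")       # credit returned, transmission resumes
```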

RoCE Lossless Ethernet: Ethernet brings a broad ecosystem, higher port speeds (up to 800 Gbps), and a lower cost per unit of bandwidth. High-end Ethernet switch ASICs deliver up to 51.2 Tbps, double the capacity of NVIDIA's Quantum-2 InfiniBand switch. RoCEv2 runs RDMA over UDP/IP on Ethernet, using priority flow control (PFC) for loss-free transmission and congestion-control schemes such as DCQCN to reduce tail latency.
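
The core DCQCN idea fits in a few lines: the sender cuts its rate multiplicatively when a congestion notification packet (CNP) arrives, then recovers gradually during quiet periods. The constants and update rules below are simplified illustrations, not the values from the DCQCN paper or any NIC firmware:

```python
class DcqcnLikeSender:
    """Toy sketch of DCQCN-style sender reaction: multiplicative
    decrease on congestion notification, gradual recovery otherwise."""
    def __init__(self, line_rate_gbps):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps
        self.target = line_rate_gbps
        self.alpha = 1.0                        # congestion estimate

    def on_cnp(self):
        self.target = self.rate                 # remember the pre-cut rate
        self.rate *= (1 - self.alpha / 2)       # multiplicative decrease

    def on_quiet_period(self):
        self.alpha *= 0.5                       # congestion estimate decays
        self.rate = min((self.rate + self.target) / 2, self.line_rate)

s = DcqcnLikeSender(line_rate_gbps=400)
s.on_cnp()
print(s.rate)                 # rate cut sharply to relieve congestion
for _ in range(5):
    s.on_quiet_period()
print(round(s.rate, 1))       # climbing back toward line rate
```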

Load balancing techniques such as ECMP and adaptive routing are discussed, along with the need for sufficient buffering in VOQ architectures to handle congestion and avoid head‑of‑line blocking.
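
Classic ECMP pins each flow to one path by hashing its 5-tuple, which avoids packet reordering but lets a few long-lived elephant flows collide on the same uplink; this weakness is what makes adaptive routing attractive for AI traffic. A minimal sketch follows (the hash function and field encoding are illustrative choices; real switches use hardware hash functions, and RoCEv2 NICs vary the UDP source port to add entropy):

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Pick an uplink by hashing the flow's 5-tuple: every packet of a
    flow takes the same path, so there is no reordering, but distinct
    flows may still collide on one path."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# One flow always hashes to the same uplink (4791 is the RoCEv2 UDP port):
path = ecmp_next_hop("10.0.0.1", "10.0.1.1", 4791, 4791, "udp", 8)
assert path == ecmp_next_hop("10.0.0.1", "10.0.1.1", 4791, 4791, "udp", 8)
```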

DDC Full‑Schedule Fabric (VOQ): Recent switch/router chips support fully scheduled fabrics built on virtual output queues (VOQs). Packets are buffered once at the ingress leaf, and a request‑grant protocol coordinates transmission to egress leaves, providing strict scheduling, reduced tail latency, and elimination of incast‑induced congestion. Limitations include large buffer requirements, added per‑packet latency from the request‑grant handshake, and vendor lock‑in.
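
The request-grant mechanism can be illustrated with a toy simulation: each ingress leaf keeps one virtual output queue per egress port, and the egress scheduler grants only a bounded number of requests per cycle, so an incast burst queues at the ingress leaves instead of overflowing the egress buffer. This is a schematic sketch, not any vendor's actual DDC protocol:

```python
from collections import deque, defaultdict

class VoqIngress:
    """Ingress leaf with one virtual output queue (VOQ) per egress port."""
    def __init__(self):
        self.voqs = defaultdict(deque)

    def enqueue(self, egress_port, pkt):
        self.voqs[egress_port].append(pkt)   # buffered once, at ingress

    def has_request(self, egress_port):
        return len(self.voqs[egress_port]) > 0

class EgressScheduler:
    """Grants at most N requests per cycle, pacing senders so the
    egress link is never oversubscribed."""
    def __init__(self, grants_per_cycle=1):
        self.grants_per_cycle = grants_per_cycle

    def run_cycle(self, ingresses, egress_port):
        delivered, granted = [], 0
        for ing in ingresses:
            if granted >= self.grants_per_cycle:
                break
            if ing.has_request(egress_port):
                delivered.append(ing.voqs[egress_port].popleft())
                granted += 1
        return delivered

# Four ingress leaves all burst toward egress port 0 (incast):
leaves = [VoqIngress() for _ in range(4)]
for i, leaf in enumerate(leaves):
    leaf.enqueue(0, f"pkt-from-leaf{i}")
sched = EgressScheduler(grants_per_cycle=1)
out = [sched.run_cycle(leaves, 0) for _ in range(4)]
# Each cycle delivers exactly one packet: the burst is absorbed at ingress.
```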

Summary of Main Technologies: NVLink excels at intra‑server GPU communication but scales poorly beyond small clusters; InfiniBand offers native RDMA with low latency but at higher cost; RoCE Ethernet leverages a mature ecosystem and cost efficiency for medium‑to‑large clusters; DDC VOQ fabrics show promise for ultra‑large deployments but remain at an early, largely research stage.

References to additional articles and resources are provided for deeper exploration of each technology.

Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
