How InfiniBand Powers AI Training: Deep Dive into RDMA, RoCEv2, and High‑Speed Interconnects
This article explains how InfiniBand’s architecture, native RDMA, GPUDirect, and evolving bandwidth enable ultra-low-latency, high-throughput communication for AI model training; compares InfiniBand with Ethernet; and details the role of RoCEv2 and other high-performance interconnect technologies.
InfiniBand in AI Training Clusters
In AI clusters built for large‑model training, InfiniBand is the preferred high‑performance network because of its high bandwidth, low latency, and native RDMA capabilities, which make it the backbone for many vendors' training solutions.
1. IB Architecture and Protocol Stack
InfiniBand works with NVLink and NVSwitch to form a three‑tier communication architecture:
Intra‑node: NVLink/NVSwitch provide fast GPU‑to‑GPU links within a server.
Inter‑node: InfiniBand connects GPUs across servers, supporting distributed training.
Multi-node topology: InfiniBand switches (e.g., the NVIDIA Quantum series) build Fat-Tree or Dragonfly topologies for scalable performance (see the sizing sketch below).
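To give a feel for how far a Fat-Tree build can scale, the back-of-the-envelope sketch below computes how many end ports a non-blocking two- or three-level fat-tree can attach for a given switch radix. The radix of 64 is only an example value for a modern high-radix switch ASIC, not a statement about any particular product.

```c
#include <stdio.h>

/* Rough, illustrative sizing of non-blocking fat-tree fabrics built from
 * identical switches with r ports each. A two-level (leaf-spine) fat-tree
 * attaches r*r/2 end ports; a three-level fat-tree attaches r*r*r/4.
 * The radix below is an example value only. */
int main(void) {
    int r = 64;                              /* example switch radix */
    long two_tier   = (long)r * r / 2;       /* end ports, 2-level fat-tree */
    long three_tier = (long)r * r * r / 4;   /* end ports, 3-level fat-tree */
    printf("radix %d: 2-tier %ld end ports, 3-tier %ld end ports\n",
           r, two_tier, three_tier);
    return 0;
}
```

With a radix of 64 this gives 2,048 and 65,536 end ports respectively, which is why two or three switch tiers are usually enough for even very large training clusters.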
The protocol stack mirrors the OSI model but is optimized for high performance:
Physical layer: defines the high-speed serial interfaces (HDR, NDR, XDR) and their encoding.
Link layer: frames packets, provides credit-based flow control, CRC error detection, and virtual lanes (VLs) for traffic isolation.
Network layer: routes packets, supporting both static and adaptive routing for load balancing and fault tolerance.
Transport layer: offers several services, Reliable Connection (RC), Reliable Datagram (RD), Unreliable Connection (UC), and Unreliable Datagram (UD), to match different communication patterns (see the sketch below).
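In software, these transport services surface as queue-pair types in the verbs API (RC, UC, and UD map to IBV_QPT_RC, IBV_QPT_UC, and IBV_QPT_UD). Below is a minimal sketch of creating an RC queue pair with libibverbs; error handling and connection establishment are omitted, and the queue sizes are illustrative rather than tuned values.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Minimal sketch: create a Reliable Connection (RC) queue pair.
 * Error handling and the RESET->INIT->RTR->RTS transition are omitted;
 * queue depths are illustrative, not tuned values. */
int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,   /* Reliable Connection; UC/UD are analogous */
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created RC QP number %u\n", qp ? qp->qp_num : 0);

    /* Teardown in reverse order of creation. */
    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```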
2. Key Technologies and Bandwidth Evolution
InfiniBand’s core advantage is its native RDMA support: combined with GPUDirect RDMA, the NIC moves data to and from GPU memory without involving the CPU or staging through host memory, cutting latency to the microsecond range and freeing CPU cycles for computation.
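From the application’s point of view, GPUDirect RDMA amounts to registering GPU memory directly with the NIC so the HCA can DMA to and from it. The sketch below assumes a CUDA-capable GPU, libibverbs, and working GPUDirect peer-memory (or dmabuf) support in the kernel; buffer size and access flags are illustrative, and most error handling is omitted.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Sketch: register a CUDA device buffer for RDMA (GPUDirect RDMA).
 * Assumes GPUDirect peer-memory support is loaded so ibv_reg_mr() can
 * pin a device pointer; without it, the registration call fails. */
int main(void) {
    size_t len = 1 << 20;                    /* 1 MiB example buffer */
    void *gpu_buf = NULL;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) return 1;

    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* The NIC DMAs straight to/from GPU memory: no staging through host
     * RAM and no CPU copies on the data path. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr on GPU memory"); return 1; }
    printf("GPU buffer registered, rkey=0x%x\n", mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```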
Additional features include end-to-end reliability (packet sequencing, acknowledgments, retransmission), service levels (SLs) and virtual lanes (VLs) for multi-tenant isolation, adaptive path selection, forward error correction (FEC), and a Subnet Manager that handles topology discovery, routing, and QoS.
Per-port bandwidth has progressed from 10 Gb/s (SDR) in the early generations through 200 Gb/s (HDR) and 400 Gb/s (NDR) to 800 Gb/s (XDR) today.
3. InfiniBand vs. Ethernet
In practice, InfiniBand and Ethernet complement each other; their main differences lie in protocol architecture, performance stability, and management approach. With Ethernet adopting RoCEv2, and with complementary interconnects such as CXL emerging, the two are converging: NVIDIA’s ConnectX adapters, for example, can operate in either InfiniBand or Ethernet mode.
4. RoCEv2 Technology
RoCE (RDMA over Converged Ethernet) enables zero-copy data transfer by letting the NIC write directly into remote memory over Ethernet, achieving high bandwidth and low latency without kernel involvement. RoCEv2 extends RDMA to the network layer using UDP/IP encapsulation, so packets can be routed across standard IP networks.
Key benefits include:
Zero-copy: data moves directly between registered application buffers and the NIC, eliminating the intermediate copies of the kernel TCP/IP path and lowering CPU overhead (see the data-path sketch after this list).
Low latency: latencies on the order of a microsecond on 100 Gbps links, far lower than TCP/IP.
Efficient stack: UDP‑based, no TCP connection overhead, supports millions of queues for massive concurrency.
Kernel bypass: user‑space drivers save thousands of CPU cycles per operation.
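To make the zero-copy and kernel-bypass points concrete, here is a sketch of the RDMA data path: posting a one-sided RDMA WRITE with libibverbs. The helper name post_rdma_write is ours, and it assumes a connected RC queue pair, a locally registered memory region, and a peer address and rkey already exchanged out of band; only the data-path call is shown.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Sketch of the RDMA data path: a one-sided WRITE placed directly into the
 * remote node's registered memory. No kernel involvement and no intermediate
 * copies; the local NIC reads local_buf and the remote NIC writes the target.
 * Assumes: qp is an RC queue pair already in RTS, mr covers local_buf, and
 * remote_addr/remote_rkey were exchanged out of band. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```

Because the write is signaled, the sender later polls its completion queue (ibv_poll_cq) to learn that the data has been placed; the remote CPU is not interrupted at all, which is exactly the kernel-bypass, zero-copy behavior described above.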
5. High‑Performance RDMA Landscape
RoCEv1 (2010) kept InfiniBand’s network and transport layers and replaced only the link layer with Ethernet, so traffic could not be routed beyond a single Layer-2 domain. RoCEv2 (2014) moves RDMA onto UDP/IP at the network layer, enabling routing over existing IP infrastructure, and has become the dominant Ethernet RDMA protocol alongside InfiniBand.
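The routing difference is visible in the packet layout itself. The sketch below compares the two encapsulations and their approximate per-packet header overhead; the sizes are schematic (no VLAN tag, IPv4 assumed) rather than a byte-exact header definition.

```c
#include <stdio.h>

/* Schematic comparison of the two RoCE encapsulations (sizes in bytes).
 * RoCEv1: Ethernet (EtherType 0x8915) | IB GRH | IB BTH | payload | ICRC
 * RoCEv2: Ethernet | IPv4 | UDP (dst port 4791) | IB BTH | payload | ICRC
 * Because RoCEv2 rides on IP/UDP, ordinary IP routers can forward it;
 * RoCEv1's IB GRH is opaque to them, confining it to one L2 domain. */
int main(void) {
    const int eth = 14, grh = 40, ipv4 = 20, udp = 8, bth = 12, icrc = 4;
    printf("RoCEv1 header overhead: %d bytes per packet\n", eth + grh + bth + icrc);
    printf("RoCEv2 header overhead: %d bytes per packet\n", eth + ipv4 + udp + bth + icrc);
    return 0;
}
```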
