
RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies—including InfiniBand, RoCE, and iWARP—provide high‑throughput, low‑latency, kernel‑bypass data transfer for large‑scale generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies for optimizing AI‑focused data‑center networks.

Architects' Tech Alliance

In generative large‑model training, servers must exchange massive volumes of data frequently. Traditional TCP/IP networking requires multiple data copies between user space and kernel space, which limits transmission efficiency. RDMA lets applications access remote memory directly, bypassing the kernel, and delivers high throughput, low latency, and near‑zero CPU overhead, which greatly improves training efficiency and suits generative AI workloads.
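As a purely conceptual sketch (not the real libibverbs API), a one‑sided RDMA write can be pictured as the remote NIC depositing bytes straight into a pre‑registered application buffer, with no kernel socket buffers and no remote CPU involvement:

```python
# Conceptual model of one-sided RDMA semantics -- illustration only,
# not real verbs code; buffer size and payload are made up.
registered = bytearray(16)       # application registers this memory region
mr = memoryview(registered)      # the NIC holds a direct mapping to it

def rdma_write(mr, offset, payload):
    """One-sided write: bytes land in the target buffer directly,
    with no intermediate kernel copies and no remote CPU work."""
    mr[offset:offset + len(payload)] = payload

rdma_write(mr, 0, b"gradients")  # data is now in application memory
```

The point of the model is that the receiving application never issues a read call: the data is simply present in its registered memory when the transfer completes.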

RDMA comprises three protocol families: InfiniBand (IB), RDMA over Converged Ethernet (RoCE), and iWARP, which layers RDMA over TCP/IP. All three conform to the same RDMA programming model and expose the same upper‑layer verbs API to applications.

InfiniBand defines a dedicated stack from the link layer to the transport layer, designed for high‑performance computing. IB requires specialized hardware such as IB switches, NICs, and cables, providing high bandwidth, low latency, and lossless transmission, but incurs high procurement and maintenance costs and is not compatible with existing Ethernet equipment.

To broaden RDMA adoption on existing Ethernet deployments, the InfiniBand Trade Association (IBTA) defined the RoCE standard, allowing RDMA operations over standard Ethernet without dedicated IB hardware.

RoCE extends the Ethernet protocol stack to support RDMA. Two versions exist: RoCE v1 (link‑layer only, no routing or congestion control) and RoCE v2 (UDP‑based, supporting routing and ECN/CNP congestion control). RoCE v2 offers better cost‑performance and is widely deployed in large‑scale data centers.
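To make the RoCE v2 encapsulation concrete, the sketch below packs the 12‑byte InfiniBand Base Transport Header (BTH) that RoCE v2 carries inside a UDP datagram (IANA destination port 4791). Field widths follow the IBTA layout; the opcode and QP values here are illustrative:

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2

def build_bth(opcode, dest_qp, psn, pkey=0xFFFF):
    """Pack a 12-byte Base Transport Header (big-endian):
    opcode(8) | flags(8) | P_Key(16) | rsvd(8)+DestQP(24) |
    ackreq/rsvd(8)+PSN(24). Flags are left zero for simplicity."""
    word1 = (opcode << 24) | pkey
    word2 = dest_qp & 0xFFFFFF          # top 8 bits reserved
    word3 = psn & 0xFFFFFF              # top 8 bits: AckReq + reserved
    return struct.pack(">III", word1, word2, word3)

def parse_bth(raw):
    w1, w2, w3 = struct.unpack(">III", raw)
    return {"opcode": w1 >> 24, "pkey": w1 & 0xFFFF,
            "dest_qp": w2 & 0xFFFFFF, "psn": w3 & 0xFFFFFF}

bth = build_bth(opcode=0x04, dest_qp=0x1234, psn=7)  # 0x04 = RC SEND-only
fields = parse_bth(bth)
```

Because everything above the BTH is ordinary UDP/IP, RoCE v2 packets can be routed and ECMP‑hashed like any other IP traffic, which is what makes it deployable on standard data‑center fabrics.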

iWARP, proposed by the IETF, implements RDMA over TCP. Its reliability makes it more tolerant of lossy networks, but the large number of TCP connections consumes significant memory and the complexity of TCP flow control can degrade performance, limiting its widespread use.

In summary, InfiniBand delivers superior performance for high‑performance computing, while RoCE provides easier integration with existing Ethernet infrastructure and lower cost, making RoCE the mainstream choice for AI model‑training networks today.

In 2023, the Ultra Ethernet Consortium (UEC), formed by major cloud and networking vendors, introduced Ultra Ethernet Transport (UET), an IP‑based protocol designed for next‑generation AI and HPC networks. UET incorporates multipath routing, packet‑spraying load balancing, incast management, efficient rate‑control algorithms, and an API that tolerates out‑of‑order packets, aiming to simplify congestion‑control tuning and support million‑node‑scale networks.

Large‑scale RDMA deployments typically use a Fat‑Tree (Clos) architecture, employing many commodity switches to create multiple equal‑cost paths. Switches use ECMP to achieve load balancing, forming a non‑blocking network.
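The standard sizing arithmetic for a k‑ary Fat‑Tree built from k‑port switches can be sketched as follows (the well‑known k³/4 host count; function and key names are illustrative):

```python
def fat_tree_dimensions(k):
    """Sizing of a k-ary Fat-Tree (Clos) built from k-port switches.

    Each of the k pods holds k/2 edge and k/2 aggregation switches,
    and every host pair in different pods has (k/2)^2 equal-cost
    paths through the core -- the paths ECMP hashes traffic across.
    """
    assert k % 2 == 0, "k must be even"
    half = k // 2
    return {
        "pods": k,
        "core_switches": half * half,
        "edge_switches": k * half,
        "agg_switches": k * half,
        "hosts": k * half * half,          # k^3 / 4
        "equal_cost_core_paths": half * half,
    }
```

For example, 48‑port switches yield a fabric of 48³/4 = 27,648 hosts, which is why commodity switch radix directly bounds cluster size in this topology.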

AI training traffic is characterized by a few large "elephant" flows, which can cause ECMP hash polarization and uneven load distribution. Two optimization approaches address this: (1) increase flow entropy (e.g., Protective Load Balancing, PLB) to spread flows across more paths; (2) employ network‑state‑aware traffic engineering, via a centralized SDN controller or adaptive routing, to compute optimal paths from real‑time topology and traffic information.
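The collision problem, and the entropy‑based fix, can be demonstrated with a toy ECMP model that hashes a flow tuple onto one of four uplinks. The addresses and the PLB‑style entropy knob are illustrative; real switches hash the 5‑tuple in hardware:

```python
import hashlib
from itertools import count

def ecmp_path(flow, n_paths, entropy=0):
    """Pick an uplink by deterministically hashing the flow tuple,
    mixed with an optional entropy value a PLB-style host can vary."""
    key = repr((flow, entropy)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

N_PATHS = 4
base = ("10.0.0.1", "10.0.1.1", 49152, 4791)
target = ecmp_path(base, N_PATHS)

# Find a second elephant flow that ECMP maps onto the same uplink.
for port in count(49153):
    other = ("10.0.0.2", "10.0.1.2", port, 4791)
    if ecmp_path(other, N_PATHS) == target:
        break  # collision: both elephants now share one link

# PLB-style reaction: perturb the entropy input until the flow moves.
entropy = next(e for e in count(1)
               if ecmp_path(other, N_PATHS, e) != target)
```

With only a handful of elephant flows, such collisions are common (here a colliding flow is found within a few source ports), while changing one hash input is enough to repath the flow without any switch reconfiguration.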

Specific techniques include centralized traffic engineering (SDN controller computes shortest‑path constraints), network‑level load balancing (calculates optimal forwarding based on AI training traffic patterns), and adaptive routing (switches select the least‑congested egress port per packet). Because RoCE may deliver out‑of‑order packets, NICs must reorder data before handing it to applications.

Research also notes that native RoCE’s strict in‑order delivery hampers load balancing; future designs should adopt packet‑spraying and programmable switches or smart NICs to reorder packets, fully exploiting multiple available paths.
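A minimal simulation of that idea: the sender sprays packets round‑robin across two unequal‑latency paths, packets therefore arrive out of order, and a receiver‑side resequencer (the smart‑NIC role) restores PSN order before delivery. All delay numbers are illustrative:

```python
def spray(psns, n_paths):
    """Per-packet spraying: round-robin each packet onto a path."""
    return [(psn, psn % n_paths) for psn in psns]

def simulate_arrivals(sprayed, path_delay, gap=1.0):
    """Unequal path delays make arrival order differ from send order."""
    return [psn for psn, path in
            sorted(sprayed, key=lambda p: path_delay[p[1]] + gap * p[0])]

def resequence(arrivals):
    """Release packets strictly in PSN order, buffering early ones --
    the reordering a receiving NIC performs before delivery."""
    buffered, expected, delivered = set(), 0, []
    for psn in arrivals:
        buffered.add(psn)
        while expected in buffered:
            buffered.discard(expected)
            delivered.append(expected)
            expected += 1
    return delivered

sprayed = spray(range(8), n_paths=2)
arrivals = simulate_arrivals(sprayed, path_delay={0: 0.0, 1: 5.0})
in_order = resequence(arrivals)
```

The buffering cost of `resequence` is exactly the price of tolerating out‑of‑order delivery: the deeper the delay skew between paths, the more packets the NIC must hold before it can release an in‑order stream to the application.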


Tags: network architecture · High Performance Computing · RDMA · AI training · InfiniBand · RoCE · iWARP
Written by Architects' Tech Alliance

Sharing project experiences and insights into cutting‑edge architectures, with a focus on cloud computing, microservices, big data, hyper‑convergence, storage, data protection, artificial intelligence, and industry practices and solutions.
