Fundamentals

Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training

This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.

Architects' Tech Alliance

Remote Direct Memory Access (RDMA) is a high-speed networking technology that bypasses the operating-system kernel (sockets, the TCP/IP stack) and lets a NIC read from and write to another node's memory directly, sharply reducing CPU overhead and latency.

RDMA is implemented primarily through three technologies: InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP. InfiniBand and RoCE are the mainstream choices due to their superior performance and wide adoption, especially in bandwidth‑ and latency‑critical AI model training scenarios.

InfiniBand: The Bandwidth Champion

InfiniBand supports 100 Gbps (EDR) and 200 Gbps (HDR) links and is typically deployed in a fat-tree topology with core and access layers. Its performance is unmatched, but its cost is high (e.g., a 36-port aggregation switch plus cabling at roughly $13k per cable), limiting its use to supercomputing centers and research labs.
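To get a feel for how far those 36-port switches stretch, here is a minimal sizing sketch for a nonblocking two-tier fat-tree (a common simplification of the core/access layout described above; the function name and the even up/down port split are assumptions, not from the article):

```python
def max_hosts(k: int) -> int:
    """Maximum hosts in a nonblocking two-tier fat-tree of k-port switches.

    Each leaf dedicates k/2 ports to hosts and k/2 uplinks (one per spine);
    each of the k/2 spines has k ports, one per leaf, so up to k leaves fit.
    """
    leaves = k            # limited by the spine switches' port count
    hosts_per_leaf = k // 2
    return leaves * hosts_per_leaf

print(max_hosts(36))  # 36-port switches -> 648 hosts at full bisection bandwidth
```

Scaling beyond that host count requires a third switching tier, which is one reason large InfiniBand fabrics get expensive quickly.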

RoCE: A Cost‑Effective RDMA Alternative

RoCE brings RDMA to standard Ethernet, offering a more economical alternative to InfiniBand, though building a truly lossless Ethernet fabric still costs at least 50% of a comparable InfiniBand deployment.

GPUDirect RDMA for Large‑Scale Model Training

GPUDirect RDMA enables GPUs on different nodes to exchange data directly via InfiniBand NICs, bypassing CPU and system memory. This dramatically reduces communication latency for models stored in GPU memory, accelerating training of massive AI models.
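The latency win comes from removing the staging copy through host memory. A back-of-the-envelope model (the bandwidth figures and the store-and-forward assumption are illustrative, not measurements from the article):

```python
def transfer_time_s(size_gb: float, path_bw_gbytes_s: list) -> float:
    """Time to move size_gb gigabytes through a chain of hops,
    each with its own bandwidth in GB/s (store-and-forward model)."""
    return sum(size_gb / bw for bw in path_bw_gbytes_s)

SIZE = 1.0  # 1 GB of gradients (hypothetical message size)

# Without GPUDirect: GPU -> host memory, then host memory -> NIC,
# assuming ~25 GB/s per PCIe hop (roughly Gen4 x16).
staged = transfer_time_s(SIZE, [25.0, 25.0])

# With GPUDirect RDMA: the NIC reads GPU memory directly, one hop.
direct = transfer_time_s(SIZE, [25.0])

print(f"staged: {staged*1e3:.0f} ms, direct: {direct*1e3:.0f} ms")
```

Under these assumptions the direct path halves the transfer time, and it also frees the CPU and host-memory bandwidth that the bounce buffer would otherwise consume.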

Optimizing GPU‑InfiniBand Configurations

Reference designs such as NVIDIA’s DGX system pair each GPU with a dedicated InfiniBand NIC (1:1), scaling up to nine NICs per node. A more cost‑effective ratio (1 InfiniBand NIC to 4 GPUs) is also viable, but shared NICs can introduce contention.
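The contention trade-off is easy to quantify in the best case: GPUs sharing a NIC split its line rate. A small sketch (even split and zero protocol overhead are simplifying assumptions):

```python
def per_gpu_bandwidth_gbps(nic_speed_gbps: float, gpus_per_nic: int) -> float:
    """Best-case network bandwidth per GPU when gpus_per_nic GPUs
    share one NIC (assumes an even split, no protocol overhead)."""
    return nic_speed_gbps / gpus_per_nic

for ratio in (1, 2, 4):
    print(f"{ratio} GPU(s) per 400 Gbps NIC -> "
          f"{per_gpu_bandwidth_gbps(400, ratio):.0f} Gbps per GPU")
```

At the 1:4 ratio each GPU sees at most a quarter of the NIC's bandwidth, and real contention during synchronized all-reduce phases can make the effective share worse, which is why the 1:1 reference designs exist.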

Network designs often connect multiple InfiniBand NICs per node through PCIe switches into leaf-spine topologies, so bandwidth scales roughly linearly with NIC count (e.g., 8 × 400 Gbps NICs paired with 8 × H100 GPUs, delivering more than 12 GB/s per link).
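The gigabit-to-gigabyte conversion behind these figures is worth making explicit (the efficiency parameter is an assumption to account for protocol and encoding overhead, not a number from the article):

```python
def aggregate_gbytes_per_s(num_nics: int, nic_gbps: float,
                           efficiency: float = 1.0) -> float:
    """Aggregate node bandwidth in GB/s: NIC count times line rate,
    converted from gigabits to gigabytes, scaled by an efficiency
    factor for protocol/encoding overhead (tune to measurements)."""
    return num_nics * nic_gbps / 8 * efficiency

# 8 x 400 Gbps NICs, ideal case: 400 GB/s aggregate per node
print(aggregate_gbytes_per_s(8, 400))   # 400.0
# A single 400 Gbps link is 50 GB/s raw
print(aggregate_gbytes_per_s(1, 400))   # 50.0
```

Since a 400 Gbps link is 50 GB/s raw, a figure of more than 12 GB/s per link plausibly reflects effective application-level throughput after protocol and collective-communication overhead rather than line rate.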

Design Recommendations

For high‑performance, loss‑less environments, choose either InfiniBand or RoCE based on application requirements and existing infrastructure; both provide low latency, high throughput, and minimal CPU overhead, making them ideal for HPC and AI workloads.


Tags: AI, High Performance Computing, GPU, RDMA, InfiniBand, RoCE
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
