Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training
This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.
Remote Direct Memory Access (RDMA) is a high‑speed networking technology that bypasses the operating‑system kernel (sockets, the TCP/IP stack) and lets a NIC read and write a remote node’s memory directly, sharply reducing CPU overhead and latency.
RDMA is implemented primarily through three technologies: InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP. InfiniBand and RoCE are the mainstream choices due to their superior performance and wide adoption, especially in bandwidth‑ and latency‑critical AI model training scenarios.
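To make the kernel‑bypass model concrete, here is a rough pseudocode sketch of a one‑sided RDMA write in the verbs programming model (function names follow libibverbs; connection setup, flags, and error handling are omitted):

```
pd = ibv_alloc_pd(device)                  # protection domain
mr = ibv_reg_mr(pd, buf, len,              # pin and register local memory
                LOCAL_WRITE | REMOTE_WRITE)
qp = ibv_create_qp(pd, ...)                # queue pair; peers exchange
                                           # addresses and rkeys out of band
post_send(qp, opcode=RDMA_WRITE,           # one-sided: the remote CPU and
          local=mr, remote_addr, rkey)     # its kernel are never involved
poll_cq(cq)                                # completion = the write has landed
```

The key point is the last two lines: once memory is registered and the queue pair is connected, data movement is handled entirely by the NICs.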
InfiniBand: The Bandwidth Champion
InfiniBand supports 100 Gbps (EDR) and 200 Gbps (HDR) links and is typically deployed in a “fat‑tree” topology with a core (spine) layer and an access (leaf) layer. While its performance is unmatched, the cost is high (a single 36‑port aggregation switch plus its cables can run well past $13 k), which has largely limited it to supercomputing centers and research labs.
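The scale of such a fabric follows directly from switch port counts. A minimal sketch of the standard two‑tier (leaf/spine) sizing math, using the 36‑port switch mentioned above:

```python
def two_tier_fat_tree_capacity(ports_per_switch: int) -> dict:
    """Estimate the size of a non-blocking two-tier (leaf/spine) fabric.

    Each leaf splits its ports evenly: half face hosts, half face the
    spine. A full fabric then uses one leaf per spine port and one
    spine per leaf uplink.
    """
    down = ports_per_switch // 2        # host-facing ports per leaf
    leaves = ports_per_switch           # one leaf per spine port
    spines = ports_per_switch // 2      # one uplink from each leaf to each spine
    return {
        "leaf_switches": leaves,
        "spine_switches": spines,
        "max_hosts": leaves * down,     # ports^2 / 2 for even port counts
    }

print(two_tier_fat_tree_capacity(36))
# {'leaf_switches': 36, 'spine_switches': 18, 'max_hosts': 648}
```

So a fabric of 36‑port switches tops out at 648 hosts in two tiers; going beyond that requires a third switching layer, which is where costs climb quickly.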
RoCE: A Cost‑Effective RDMA Alternative
RoCE brings RDMA to standard Ethernet, offering a more economical alternative to InfiniBand, though building a truly lossless Ethernet fabric (PFC‑capable switches, careful ECN tuning) still costs at least 50 % of a comparable InfiniBand deployment.
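As an illustration of what “lossless” involves in practice, the configuration fragment below sketches Priority Flow Control setup on a Mellanox/NVIDIA NIC using the mlnx_qos tool that ships with MLNX_OFED (the interface name and priority value are assumptions for illustration, not a recommended production config):

```
# Trust DSCP markings and enable PFC on priority 3, a queue RoCE
# traffic is commonly mapped to (illustrative values only):
mlnx_qos -i eth1 --trust dscp --pfc 0,0,0,1,0,0,0,0

# ECN/DCQCN must additionally be enabled on both the NICs and every
# switch hop for congestion control to work end to end.
```

Getting this tuning right across an entire fabric is the hidden operational cost that narrows the price gap with InfiniBand.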
GPUDirect RDMA for Large‑Scale Model Training
GPUDirect RDMA enables GPUs on different nodes to exchange data directly via InfiniBand NICs, bypassing the CPU and system memory entirely. Because the parameters and activations of large models live in GPU memory, this dramatically reduces communication latency and accelerates training of massive AI models.
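In verbs terms, the only difference from the ordinary RDMA flow is where the registered buffer lives. A pseudocode sketch (assumes GPUDirect support, e.g. the nvidia-peermem kernel module, is present):

```
gpu_buf = cudaMalloc(size)                 # buffer lives in GPU HBM
mr = ibv_reg_mr(pd, gpu_buf, size, flags)  # with GPUDirect, the NIC can
                                           # DMA to/from this pointer
post_send(qp, opcode=RDMA_WRITE,           # data flows NIC <-> GPU over
          local=mr, remote_addr, rkey)     # PCIe, never touching host DRAM
```

Without GPUDirect, the same transfer would require a staging copy through a pinned host buffer, doubling PCIe traffic and adding CPU work.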
Optimizing GPU‑InfiniBand Configurations
Reference designs such as NVIDIA’s DGX system pair each GPU with a dedicated InfiniBand NIC (1:1), scaling up to nine NICs per node. A more cost‑effective ratio (1 InfiniBand NIC to 4 GPUs) is also viable, but shared NICs can introduce contention.
Network designs often employ multiple InfiniBand NICs per node, connected through PCIe switches and leaf‑spine fabrics, to achieve near‑linear bandwidth scaling (e.g., 8 × 400 Gbps NICs paired with 8 × H100 GPUs provide 3.2 Tbps, i.e., 400 GB/s, of aggregate node bandwidth).
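A quick sanity check of the aggregate numbers (decimal units, 8 bits per byte, and ignoring protocol overhead):

```python
def node_bandwidth(nics: int, gbps_per_nic: float) -> tuple[float, float]:
    """Aggregate node bandwidth: total Gbps and its GB/s equivalent."""
    total_gbps = nics * gbps_per_nic
    return total_gbps, total_gbps / 8   # Gbps -> GB/s

gbps, gbytes = node_bandwidth(8, 400)
print(f"{gbps:.0f} Gbps total, {gbytes:.0f} GB/s")
# 3200 Gbps total, 400 GB/s
```

With one NIC per GPU, each H100 gets a dedicated 400 Gbps (50 GB/s) path, so inter‑node collectives scale with GPU count instead of contending for a shared link.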
Design Recommendations
For high‑performance, lossless environments, choose either InfiniBand or RoCE based on application requirements and existing infrastructure; both provide low latency, high throughput, and minimal CPU overhead, making them ideal for HPC and AI workloads.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.