Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training
This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.
Remote Direct Memory Access (RDMA) is a high‑speed networking technology that bypasses the operating‑system kernel (sockets, the TCP/IP stack) and lets a NIC read and write a remote node’s memory directly, sharply reducing CPU overhead and latency.
RDMA is implemented primarily through three technologies: InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP. InfiniBand and RoCE are the mainstream choices due to their superior performance and wide adoption, especially in bandwidth‑ and latency‑critical AI model training scenarios.
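To make the kernel‑bypass model concrete, here is a rough pseudocode sketch of a one‑sided RDMA write in the verbs programming model (function names follow libibverbs; connection setup, flags, and error handling are omitted):

```
pd = ibv_alloc_pd(device)                  # protection domain
mr = ibv_reg_mr(pd, buf, len,              # pin and register local memory
                LOCAL_WRITE | REMOTE_WRITE)
qp = ibv_create_qp(pd, ...)                # queue pair; peers exchange
                                           # addresses and rkeys out of band
post_send(qp, opcode=RDMA_WRITE,           # one-sided: the remote CPU and
          local=mr, remote_addr, rkey)     # its kernel are never involved
poll_cq(cq)                                # completion = the write has landed
```

The key point is the last two lines: once memory is registered and the queue pair is connected, data movement is handled entirely by the NICs.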
InfiniBand: The Bandwidth Champion
InfiniBand supports 100 Gbps (EDR) and 200 Gbps (HDR) links and is typically deployed in a “fat‑tree” topology with a core (spine) layer and an access (leaf) layer. While its performance is unmatched, the cost is high (a single 36‑port aggregation switch plus its cables can run well past $13 k), which has largely limited it to supercomputing centers and research labs.
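The scale of such a fabric follows directly from switch port counts. A minimal sketch of the standard two‑tier (leaf/spine) sizing math, using the 36‑port switch mentioned above:

```python
def two_tier_fat_tree_capacity(ports_per_switch: int) -> dict:
    """Estimate the size of a non-blocking two-tier (leaf/spine) fabric.

    Each leaf splits its ports evenly: half face hosts, half face the
    spine. A full fabric then uses one leaf per spine port and one
    spine per leaf uplink.
    """
    down = ports_per_switch // 2        # host-facing ports per leaf
    leaves = ports_per_switch           # one leaf per spine port
    spines = ports_per_switch // 2      # one uplink from each leaf to each spine
    return {
        "leaf_switches": leaves,
        "spine_switches": spines,
        "max_hosts": leaves * down,     # ports^2 / 2 for even port counts
    }

print(two_tier_fat_tree_capacity(36))
# {'leaf_switches': 36, 'spine_switches': 18, 'max_hosts': 648}
```

So a fabric of 36‑port switches tops out at 648 hosts in two tiers; going beyond that requires a third switching layer, which is where costs climb quickly.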
RoCE: A Cost‑Effective RDMA Alternative
RoCE brings RDMA to standard Ethernet, offering a more economical alternative to InfiniBand, though building a truly lossless Ethernet fabric (PFC‑capable switches, careful ECN tuning) still costs at least 50 % of a comparable InfiniBand deployment.
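As an illustration of what “lossless” involves in practice, the configuration fragment below sketches Priority Flow Control setup on a Mellanox/NVIDIA NIC using the mlnx_qos tool that ships with MLNX_OFED (the interface name and priority value are assumptions for illustration, not a recommended production config):

```
# Trust DSCP markings and enable PFC on priority 3, a queue RoCE
# traffic is commonly mapped to (illustrative values only):
mlnx_qos -i eth1 --trust dscp --pfc 0,0,0,1,0,0,0,0

# ECN/DCQCN must additionally be enabled on both the NICs and every
# switch hop for congestion control to work end to end.
```

Getting this tuning right across an entire fabric is the hidden operational cost that narrows the price gap with InfiniBand.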
GPUDirect RDMA for Large‑Scale Model Training
GPUDirect RDMA enables GPUs on different nodes to exchange data directly via InfiniBand NICs, bypassing the CPU and system memory entirely. Because the parameters and activations of large models live in GPU memory, this dramatically reduces communication latency and accelerates training of massive AI models.
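In verbs terms, the only difference from the ordinary RDMA flow is where the registered buffer lives. A pseudocode sketch (assumes GPUDirect support, e.g. the nvidia-peermem kernel module, is present):

```
gpu_buf = cudaMalloc(size)                 # buffer lives in GPU HBM
mr = ibv_reg_mr(pd, gpu_buf, size, flags)  # with GPUDirect, the NIC can
                                           # DMA to/from this pointer
post_send(qp, opcode=RDMA_WRITE,           # data flows NIC <-> GPU over
          local=mr, remote_addr, rkey)     # PCIe, never touching host DRAM
```

Without GPUDirect, the same transfer would require a staging copy through a pinned host buffer, doubling PCIe traffic and adding CPU work.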
Optimizing GPU‑InfiniBand Configurations
Reference designs such as NVIDIA’s DGX system pair each GPU with a dedicated InfiniBand NIC (1:1), scaling up to nine NICs per node. A more cost‑effective ratio (1 InfiniBand NIC to 4 GPUs) is also viable, but shared NICs can introduce contention.
Network designs often employ multiple InfiniBand NICs per node, connected through PCIe switches and leaf‑spine fabrics, to achieve near‑linear bandwidth scaling (e.g., 8 × 400 Gbps NICs paired with 8 × H100 GPUs provide 3.2 Tbps, i.e., 400 GB/s, of aggregate node bandwidth).
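A quick sanity check of the aggregate numbers (decimal units, 8 bits per byte, and ignoring protocol overhead):

```python
def node_bandwidth(nics: int, gbps_per_nic: float) -> tuple[float, float]:
    """Aggregate node bandwidth: total Gbps and its GB/s equivalent."""
    total_gbps = nics * gbps_per_nic
    return total_gbps, total_gbps / 8   # Gbps -> GB/s

gbps, gbytes = node_bandwidth(8, 400)
print(f"{gbps:.0f} Gbps total, {gbytes:.0f} GB/s")
# 3200 Gbps total, 400 GB/s
```

With one NIC per GPU, each H100 gets a dedicated 400 Gbps (50 GB/s) path, so inter‑node collectives scale with GPU count instead of contending for a shared link.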
Design Recommendations
For high‑performance, lossless environments, choose either InfiniBand or RoCE based on application requirements and existing infrastructure; both provide low latency, high throughput, and minimal CPU overhead, making them ideal for HPC and AI workloads.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.