Understanding GPUDirect RDMA: Principles, Implementation, and Performance
This article explains the background of GPU communication, introduces DMA and RDMA fundamentals, describes how GPUDirect RDMA enables direct GPU-to-GPU memory access across machines, and presents performance results showing reduced latency and increased bandwidth for distributed deep‑learning training.
In recent deep‑learning workloads, single‑node GPU servers can no longer meet the computational demands, making multi‑node, multi‑GPU training essential; consequently, inter‑node communication becomes a critical performance factor.
1. Background
The previous articles covered GPUDirect P2P and NVLink, which accelerate intra-node GPU communication. This article focuses on GPUDirect RDMA, a technology that accelerates communication between GPUs on different machines.
2. RDMA Overview
RDMA (Remote Direct Memory Access) allows network data transfer to bypass the CPU, providing high‑throughput, low‑latency communication.
2.1 DMA Principle
Direct Memory Access (DMA) offloads data movement from the CPU to a dedicated controller, enabling hardware‑only transfers between memory and I/O devices, reducing CPU overhead and improving efficiency.
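The CPU-offload benefit can be sketched with a toy cost model. The bandwidth and setup figures below are illustrative assumptions, not measurements:

```python
# Toy model: CPU time consumed moving a buffer with and without DMA.
# All parameters are illustrative assumptions.
BYTES = 64 * 1024 * 1024   # 64 MiB buffer
CPU_COPY_BW = 8e9          # assumed CPU memcpy bandwidth, bytes/s
DMA_SETUP_US = 5.0         # assumed one-time descriptor setup cost, us

# Programmed I/O: the CPU performs every copy itself.
cpu_busy_pio_us = BYTES / CPU_COPY_BW * 1e6

# DMA: the CPU only programs the controller, then is free for other work
# while the controller moves the data.
cpu_busy_dma_us = DMA_SETUP_US

print(f"CPU busy (PIO): {cpu_busy_pio_us:.0f} us")
print(f"CPU busy (DMA): {cpu_busy_dma_us:.0f} us")
```

Under these assumptions the CPU is busy for milliseconds per buffer without DMA, versus microseconds of setup with it; the transfer itself takes the same wall-clock time either way.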
2.2 RDMA Principle
RDMA extends this concept to network communication: the NIC reads and writes registered application memory on the remote host directly, bypassing the kernel network stack on the data path and achieving zero-copy transfers without involving the remote CPU.
In practice, RDMA combines smart NIC hardware with optimized software stacks, employing zero‑copy and kernel bypass techniques to minimize latency and CPU involvement.
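The latency effect of kernel bypass can be sketched with a simple one-way latency model: wire time plus per-message software overhead. The overhead figures below are rough order-of-magnitude assumptions, not benchmark results:

```python
# Toy one-way latency model: wire time plus per-message software overhead.
# Overhead figures are assumed orders of magnitude, not measurements.
def one_way_latency_us(msg_bytes, link_bw_gbps, sw_overhead_us):
    # bits / (Gbit/s) yields nanoseconds per Gbit; divide by 1e3 for us
    wire_us = msg_bytes * 8 / (link_bw_gbps * 1e3)
    return wire_us + sw_overhead_us

msg = 4096  # 4 KiB message on a 100 Gb/s link
tcp_us  = one_way_latency_us(msg, 100, sw_overhead_us=30.0)  # kernel TCP path
rdma_us = one_way_latency_us(msg, 100, sw_overhead_us=1.5)   # kernel-bypass path
print(f"TCP : {tcp_us:.2f} us")
print(f"RDMA: {rdma_us:.2f} us")
```

The point of the sketch: for small messages the wire time is a fraction of a microsecond, so per-message software overhead dominates, which is exactly what zero-copy and kernel bypass attack.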
2.3 RDMA Implementations
RDMA can be realized over InfiniBand or Ethernet; the Ethernet variants are iWARP (RDMA over TCP/IP) and RoCE (v1 runs directly on the Ethernet link layer, while v2 runs over UDP/IP and is therefore routable). InfiniBand offers the highest performance but requires dedicated, expensive hardware, while RoCE and iWARP reuse Ethernet infrastructure and are more cost-effective.
3. GPUDirect RDMA
3.1 Principle
GPUDirect RDMA enables a GPU on one machine to directly access the memory of a GPU on another machine, eliminating the intermediate copies between GPU memory and system memory, thereby reducing communication latency.
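The saving from eliminating the bounce copy through host memory can be sketched with a serialized (unpipelined) transfer-time model. The bandwidth figures are illustrative assumptions, roughly PCIe Gen3 x16 and a 100 Gb/s link:

```python
# Toy transfer-time model for sending a GPU buffer to a remote node.
# Bandwidths are illustrative assumptions; real transfers pipeline stages.
PCIE_BW = 12e9    # assumed effective PCIe bandwidth, bytes/s
NET_BW  = 12.5e9  # assumed 100 Gb/s network wire bandwidth, bytes/s

def staged_us(n):
    # Without GPUDirect RDMA: GPU -> host bounce buffer over PCIe,
    # then the NIC DMAs the data out of host memory.
    return (n / PCIE_BW + n / PCIE_BW + n / NET_BW) * 1e6

def direct_us(n):
    # With GPUDirect RDMA: the NIC DMAs straight out of GPU memory.
    return (n / PCIE_BW + n / NET_BW) * 1e6

n = 16 * 1024 * 1024  # 16 MiB message
print(f"staged: {staged_us(n):.0f} us, direct: {direct_us(n):.0f} us")
```

In this sketch the direct path saves exactly one PCIe traversal per message; the model ignores pipelining, so it overstates absolute times, but the eliminated copy is the real mechanism behind the latency reduction described above.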
3.2 Usage Requirements
To use GPUDirect RDMA, the GPU and the RDMA NIC must reside under the same PCIe root complex (ideally under the same PCIe switch); peer-to-peer traffic that has to cross the inter-socket interconnect (QPI/UPI) performs poorly or is not supported at all.
3.3 Performance
Mellanox NICs support GPUDirect RDMA over both InfiniBand and RoCE. Benchmarks using OSU micro‑benchmarks show significant latency reduction and bandwidth increase when GPUDirect RDMA is enabled.
Real‑world HPC applications, such as particle dynamics simulations with HOOMD, demonstrate up to a 2× performance boost as the number of nodes increases when GPUDirect RDMA is employed.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.