
Remote Direct Memory Access (RDMA): Principles, Comparisons, and Implementation

This article provides a comprehensive overview of Remote Direct Memory Access (RDMA), explaining its underlying principles, performance advantages over traditional TCP/IP, the protocol families InfiniBand, RoCE, and iWARP, and practical implementation considerations for high‑performance networking.

Architects' Tech Alliance

Abstract: Remote Direct Memory Access (RDMA) transfers data directly from the memory of one computer to another without operating-system intervention. This article introduces the technology and refers readers to the full ebook for a detailed analysis.

RDMA first appeared in InfiniBand networks for high‑performance computing clusters. Traditional socket‑based TCP/IP communication requires data to be copied between DRAM, CPU cache, and NIC buffers, consuming CPU cycles and memory bandwidth and increasing latency. For example, a 40 Gbps TCP/IP flow can saturate a server’s CPU, whereas RDMA can cut CPU usage from near 100 % to roughly 5 % and reduce latency from milliseconds to under 10 µs.

RDMA enables computers to access remote memory directly without involving either host’s processor, moving data quickly while bypassing the operating system entirely. The principle and the comparison with TCP/IP are illustrated in the diagram below.

In essence, RDMA uses specialized hardware and network technology so that NICs can read each other's memory directly, achieving high bandwidth, low latency, and low resource utilization. Applications simply specify memory addresses, start the transfer, and wait for completion.
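The flow described above – register memory, post a transfer, wait for completion – can be sketched as a toy model. This is a conceptual simulation only, not a real verbs/libibverbs binding; the `Nic` class and its method names are illustrative assumptions:

```python
# Conceptual model of the RDMA application flow: the application only
# registers memory, posts a work request, and polls a completion queue.
# Illustrative sketch -- not a real verbs/libibverbs binding.

class Nic:
    """Toy RNIC: moves bytes between registered regions, no host-CPU copies."""
    def __init__(self):
        self.cq = []            # completion queue
        self.regions = {}       # rkey -> registered buffer
        self.next_rkey = 1

    def register_mr(self, buf):
        # Pin a buffer and hand back a remote key (what ibv_reg_mr models).
        rkey = self.next_rkey
        self.next_rkey += 1
        self.regions[rkey] = buf
        return rkey

    def post_write(self, local, remote_nic, rkey, offset=0):
        # One work request: NIC-to-NIC transfer into the remote region.
        dst = remote_nic.regions[rkey]
        dst[offset:offset + len(local)] = local
        self.cq.append(("WRITE_COMPLETE", len(local)))

    def poll_cq(self):
        return self.cq.pop(0) if self.cq else None

# Usage: the "remote" host registers memory; the local host writes into it.
server_nic, client_nic = Nic(), Nic()
server_buf = bytearray(16)
rkey = server_nic.register_mr(server_buf)

client_nic.post_write(b"hello rdma", server_nic, rkey)
event = client_nic.poll_cq()
print(event)                   # ('WRITE_COMPLETE', 10)
print(bytes(server_buf[:10]))  # b'hello rdma'
```

The point of the sketch is the shape of the API: after registration, the application’s only work is posting a request and polling for a completion.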

Initially implemented on InfiniBand, RDMA was expensive and limited to Mellanox and Intel solutions. Later, vendors ported RDMA to Ethernet, reducing cost. On Ethernet, RDMA appears as iWARP and RoCE, with RoCE further divided into RoCEv1 (link‑layer) and RoCEv2 (IP‑layer supporting routing). The protocol stacks are compared in the following diagram.

InfiniBand: a next‑generation RDMA‑capable network protocol requiring RDMA‑enabled NICs and switches.

RoCE: enables RDMA over Ethernet; uses Ethernet headers at the lower layer and InfiniBand headers at the higher layer; requires special NICs but works on standard Ethernet switches.

iWARP: enables RDMA over TCP; does not require special NICs (a software implementation is possible) but sacrifices most of RDMA’s performance benefits.

The transport interface between RDMA applications and RNICs is called Verbs (RDMA API), which includes two main types:

Memory Verbs (one‑sided RDMA): RDMA Read, Write, and Atomic operations – no remote CPU involvement.

Messaging Verbs (two‑sided RDMA): RDMA Send and Receive – require remote CPU participation.
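The practical difference between the two verb types is who has to act on the remote side. A minimal sketch, using a hypothetical `Host` model rather than real verbs calls: a two-sided Send fails unless the receiver’s application has pre-posted a buffer, while a one-sided Read needs nothing from the remote CPU at all:

```python
# Toy contrast between one-sided and two-sided verbs.
# Hypothetical model; real verbs post work via ibv_post_send/ibv_post_recv.

class Host:
    def __init__(self, memory):
        self.memory = memory     # "registered" memory, readable by peers
        self.recv_queue = []     # buffers the application has pre-posted

    # Two-sided: the remote application must have posted a receive first.
    def post_recv(self, size):
        self.recv_queue.append(bytearray(size))

    def send(self, peer, data):
        if not peer.recv_queue:
            raise RuntimeError("RNR: receiver has not posted a buffer")
        buf = peer.recv_queue.pop(0)
        buf[:len(data)] = data
        return buf

    # One-sided: the peer's CPU is not involved at all.
    def read(self, peer, offset, length):
        return bytes(peer.memory[offset:offset + length])

local = Host(bytearray(16))
remote = Host(bytearray(b"remote data page"))

# One-sided READ succeeds with no action by the remote application:
msg = local.read(remote, 0, 6)
print(msg)            # b'remote'

# Two-sided SEND only works once the remote has posted a receive buffer:
remote.post_recv(4)
reply = bytes(local.send(remote, b"ping"))
print(reply)          # b'ping'
```

This is also why one-sided verbs achieve the lowest latency: the remote CPU never enters the data path.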

RDMA over TCP (iWARP) works on standard TCP/IP Ethernet, allowing various transport types (network, I/O, file system, block storage, processor) to share the same physical connection.

The iWARP protocol stack consists of three upper layers – RDMAP, DDP, and MPA – stacked on TCP to ensure high‑speed network interoperability.

RoCE (RDMA over Converged Ethernet) allows remote memory access over Ethernet, with two versions: RoCEv1 (link‑layer, limited to the same broadcast domain) and RoCEv2 (IP‑layer, supports routing). RoCE can also operate on non‑converged Ethernet.
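RoCEv2’s routability comes from carrying the InfiniBand transport header inside an ordinary UDP/IP datagram, identified by the IANA-assigned UDP destination port 4791. The sketch below packs such a datagram; the function name is illustrative, and the 12-byte Base Transport Header is reduced to a few fields (a real packet also carries the full BTH layout and an ICRC trailer):

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def rocev2_datagram(opcode, dest_qp, psn, data, src_port=0xC000):
    """Build the UDP datagram RoCEv2 places inside an ordinary IP packet.

    Simplified sketch: the 12-byte Base Transport Header is reduced here
    to opcode, destination QP, and packet sequence number, and the ICRC
    trailer is omitted.
    """
    bth = struct.pack("!B3xII", opcode, dest_qp & 0xFFFFFF, psn & 0xFFFFFF)
    payload = bth + data
    udp_len = 8 + len(payload)
    udp_header = struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, udp_len, 0)
    return udp_header + payload

pkt = rocev2_datagram(opcode=0x04, dest_qp=0x12, psn=1, data=b"payload")
# The UDP destination port (bytes 2-3) is what identifies RoCEv2 traffic:
print(struct.unpack("!H", pkt[2:4])[0])   # 4791
```

Because everything above the UDP header is opaque to routers, RoCEv2 traffic crosses Layer‑3 boundaries like any other IP flow – which is exactly what RoCEv1, framed directly in Ethernet, cannot do.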

Although InfiniBand, Ethernet RoCE, and Ethernet iWARP share a unified API, they differ in physical and link layers. RoCE generally offers lower latency, higher throughput, and lower CPU load compared to iWARP, and is supported by many mainstream solutions, including Windows.

RDMA provides a messaging service that lets applications directly access remote virtual memory, enabling inter‑process communication, remote server communication, and data transfer to storage devices via upper‑layer protocols such as iSER, SRP, SMB, Samba, Lustre, and ZFS.

RoCE and InfiniBand each define how RDMA runs on Ethernet or IB networks respectively; RoCE aims to migrate IB cluster workloads to converged Ethernet, while IB still offers higher bandwidth and lower latency for certain applications.

Congestion Control: RoCE relies on lossless Ethernet flow control or PFC; RoCEv2 adds ECN marking and CNP frames. IB uses a credit‑based algorithm.

Latency: IB switches typically have lower port‑to‑port latency (≈100 ns) than Ethernet switches (≈230 ns).

Configuration: Setting up a DCB (Data Center Bridging) Ethernet network is considerably more complex than configuring an IB network.
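The ECN/CNP mechanism mentioned above forms a closed control loop: switches mark packets under congestion, the receiver reflects CNP frames to the sender, and the sender throttles its transmit rate. A deliberately simplified, DCQCN-inspired sketch – the constants are illustrative, and real NICs adapt the cut factor with a running estimate of marking state rather than a fixed value:

```python
# Simplified, DCQCN-inspired sender rate control for RoCEv2.
# On each CNP (congestion notification packet) the sender cuts its rate
# multiplicatively; otherwise it recovers additively toward line rate.
# Constants are illustrative, not values real NIC firmware would use.

def adjust_rate(rate_gbps, cnp_received,
                cut_factor=0.5, recover_step=1.0, line_rate=100.0):
    if cnp_received:
        return max(rate_gbps * cut_factor, 1.0)      # multiplicative decrease
    return min(rate_gbps + recover_step, line_rate)  # additive recovery

rate = 100.0
history = []
for cnp in [True, True, False, False, False]:
    rate = adjust_rate(rate, cnp)
    history.append(rate)

print(history)   # [50.0, 25.0, 26.0, 27.0, 28.0]
```

Contrast this reactive loop with InfiniBand’s credit-based scheme, where a sender simply cannot transmit until the receiver has advertised buffer credits, so congestion is prevented rather than reacted to.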

RoCE uses a connection‑less UDP‑based protocol, while iWARP uses a connection‑oriented TCP‑based protocol. RoCEv1 is limited to a single Layer‑2 broadcast domain; RoCEv2 and iWARP support Layer‑3 routing. iWARP’s many TCP connections can consume significant memory resources, whereas RoCE supports multicast.

Intel’s acquisition of QLogic’s InfiniBand business led to the “True Scale Fabric” solution, encompassing IB and Omni‑Path, with programming interfaces Verbs and PSM (Performance Scaled Messaging) for MPI communication.

Intel integrated Omni‑Path functionality into its CPUs, improving communication efficiency but tying the network to the CPU architecture.

Following Intel’s acquisition of Cray’s interconnect division, Omni‑Path introduced a “1.5‑layer” Link Transport Layer based on Cray’s Aries technology, providing reliable two‑layer packet delivery, flow control, and single‑link control.

The author has compiled these RDMA principles, comparisons, and implementation details into an ebook, with additional content and a more refined organization than previous posts.


Seek knowledge with thirst, remain humble with curiosity

Tags: High Performance Computing, Network Performance, RDMA, InfiniBand, RoCE, iWARP
Written by Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.