
An Overview of Remote Direct Memory Access (RDMA): Principles, Comparisons, and Implementations

This article provides a comprehensive overview of Remote Direct Memory Access (RDMA), detailing its underlying principles, performance advantages over traditional TCP/IP, various protocol families such as InfiniBand, RoCE, and iWARP, and their respective hardware and software requirements.

Architects' Tech Alliance

Remote Direct Memory Access (RDMA) is a technology that transfers data directly between the memory of two computers without involving the operating system, aiming to reduce CPU usage, memory bandwidth consumption, and latency.

RDMA was originally introduced in InfiniBand networks for high‑performance computing clusters. Traditional socket‑based TCP/IP communication requires data to be copied between DRAM, CPU caches, and NIC buffers, consuming significant CPU resources: a 40 Gbps TCP/IP flow can saturate a server's CPU, whereas RDMA can lower CPU usage to around 5% and reduce latency to below 10 µs.
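The copy overhead can be illustrated locally with a toy Python analogy (this is not RDMA itself): slicing a `bytes` object copies data, while a `memoryview` exposes the same buffer with zero copies, similar in spirit to a NIC accessing application memory directly instead of staging it through intermediate kernel buffers.

```python
# Toy local analogy for the copy overhead RDMA avoids (not real RDMA):
# a bytes() slice copies data; a memoryview is a zero-copy window onto
# the same underlying buffer.
buf = bytearray(b"payload " * 1024)  # pretend this is an application send buffer

copied = bytes(buf[:64])        # a copy: mutating buf will not affect it
view = memoryview(buf)[:64]     # zero-copy: a window onto the same memory

buf[0:7] = b"PAYLOAD"           # the application updates its buffer in place

print(copied[:7])               # b'payload' -- the copy is stale
print(bytes(view[:7]))          # b'PAYLOAD' -- the view sees the update
```

Every extra copy in the TCP/IP path burns CPU cycles and memory bandwidth exactly like the `bytes()` slice above; RDMA's zero-copy path behaves like the `memoryview`.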

RDMA enables servers to read remote memory directly via specialized NICs, achieving high bandwidth, low latency, and low resource utilization without application‑level involvement beyond specifying memory addresses and initiating transfers.

While early RDMA implementations were limited to InfiniBand hardware from vendors like Mellanox and Intel, the technology has been ported to Ethernet, resulting in two main Ethernet‑based families: iWARP (RDMA over TCP) and RoCE (RDMA over Converged Ethernet), with RoCE further split into RoCEv1 (link‑layer) and RoCEv2 (IP‑layer supporting routing).

RDMA’s programming interface, known as Verbs or the RDMA API, provides two categories of operations: Memory Verbs (one‑sided RDMA such as Reads, Writes, Atomics) that require no remote CPU involvement, and Messaging Verbs (two‑sided RDMA such as Send and Receive) that involve the remote CPU.
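The semantic difference between the two Verbs categories can be sketched with a toy in-process model (all class and method names here are illustrative, not the real libibverbs API, which requires RDMA-capable hardware): a one-sided Read or Write touches registered remote memory with no remote-side code running, while a two-sided Send only completes if the remote CPU has posted a matching Receive buffer.

```python
# Toy in-process model of the two Verbs categories. Names are
# illustrative only -- the real API is the libibverbs C interface.

class ToyRemoteNode:
    def __init__(self):
        self.memory = {}          # "registered" memory regions, keyed by address
        self.recv_queue = []      # receive buffers posted by the remote CPU

    # --- memory verbs (one-sided): remote CPU is not involved ---
    def rdma_read(self, addr):
        return self.memory[addr]  # NIC-style direct access to registered memory

    def rdma_write(self, addr, data):
        self.memory[addr] = data

    # --- messaging verbs (two-sided): remote CPU must post a receive ---
    def post_recv(self, buf):
        self.recv_queue.append(buf)

    def send(self, data):
        if not self.recv_queue:
            raise RuntimeError("receiver not ready: no posted receive buffer")
        buf = self.recv_queue.pop(0)
        buf[: len(data)] = data

node = ToyRemoteNode()
node.memory[0x1000] = b"remote data"
print(node.rdma_read(0x1000))   # one-sided: succeeds without any receive posted

buf = bytearray(16)
node.post_recv(buf)             # two-sided: remote CPU participates
node.send(b"hello")
print(bytes(buf[:5]))
```

The failure mode modeled by the `RuntimeError` mirrors real two-sided RDMA, where a Send arriving before a Receive has been posted triggers a receiver-not-ready condition; one-sided operations have no such dependency.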

iWARP operates over standard TCP/IP and is independent of the underlying physical layer, so it can run on any TCP/IP infrastructure. RoCE, by contrast, relies on lossless Ethernet via Priority Flow Control (PFC), and RoCEv2 adds congestion control based on Explicit Congestion Notification (ECN) marking and Congestion Notification Packets (CNPs).
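RoCEv2's feedback loop can be sketched with a greatly simplified, DCQCN-like toy model (all constants and the drain rate below are illustrative assumptions, not values from any specification): a switch marks packets with ECN when its queue is deep, the receiver echoes each mark back to the sender as a CNP, and the sender cuts its rate multiplicatively on a CNP while recovering additively otherwise.

```python
# Toy sketch of the RoCEv2 ECN/CNP congestion-control loop (DCQCN-like,
# greatly simplified; all constants are illustrative).

LINK_RATE = 100.0        # Gbps, illustrative
QUEUE_THRESHOLD = 50.0   # switch marks ECN above this queue depth (KB)

def sender_step(rate, got_cnp):
    if got_cnp:
        return rate * 0.5                     # multiplicative decrease on CNP
    return min(LINK_RATE, rate + 5.0)         # additive increase otherwise

rate = 100.0   # sender starts at line rate
queue = 80.0   # congested: switch queue is above the marking threshold
for step in range(6):
    ecn_marked = queue > QUEUE_THRESHOLD          # switch marks packets with ECN
    rate = sender_step(rate, got_cnp=ecn_marked)  # receiver echoes the mark as a CNP
    queue = max(0.0, queue + (rate - 60.0))       # assumed drain capacity: 60 Gbps
    print(f"step {step}: rate={rate:.1f} Gbps, queue={queue:.1f} KB")
```

Running the loop, the rate halves twice while the queue is over the threshold, the queue drains, and the rate then climbs back additively, which is the qualitative behavior the ECN/CNP mechanism is designed to produce.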

Comparatively, RoCE generally offers lower latency and higher throughput than iWARP, though iWARP can be implemented in software and is more flexible on standard Ethernet without requiring specialized NICs.

Intel's acquisitions of QLogic's InfiniBand business and, later, Cray's interconnect division led to the Omni‑Path architecture, which adds a "layer 1.5" transport model and integrates tightly with Intel CPUs, further broadening the field of high‑performance RDMA interconnects.

The article also notes that despite unified RDMA APIs, the underlying physical and link layers differ among InfiniBand, RoCE, and iWARP, influencing deployment choices based on performance, cost, and infrastructure compatibility.

Tags: High Performance Computing, network protocols, Low Latency, RDMA, InfiniBand, RoCE, iWARP
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
