
Understanding RDMA: Principles, Advantages, and Implementation Details

This article explains the challenges of high‑performance computing and big‑data workloads on traditional TCP/IP stacks, introduces RDMA technology, its variants (InfiniBand, RoCE, iWARP), key protocols, hardware components, and how it achieves ultra‑low latency and high throughput with minimal CPU involvement.

Architects' Tech Alliance

Jan 10, 2019


High-performance computing, big-data analytics, and bursty, highly concurrent I/O all demand low latency. The existing TCP/IP hardware/software stack cannot meet these demands because its processing is CPU-intensive: multiple memory copies, interrupt handling, context switches, and complex TCP/IP protocol processing all add delay, and packet loss adds further latency.

RDMA (Remote Direct Memory Access) is a technology that enables direct memory access between remote endpoints. It originally belonged to the InfiniBand architecture, but with the emergence of RoCE and iWARP under the network‑convergence trend, RDMA can now be deployed over the widely used Ethernet while offering ultra‑low latency and minimal CPU usage.

The RDMA Consortium (RDMAC) and the InfiniBand Trade Association (IBTA) drive RDMA development. RDMAC, which complements the work of the IETF, mainly defines iWARP and iSER, while IBTA maintains all InfiniBand specifications and has contributed to the standardization of RoCE v1 and v2. The Verbs interface and data-structure prototypes are defined by the OpenFabrics Alliance (OFA).

Compared with traditional DMA on internal buses, RDMA transfers buffers directly over the network between application processes, bypassing the operating system and protocol stack. This enables ultra‑low latency, high‑throughput transfers with virtually no CPU or OS involvement.

InfiniBand achieves sub‑microsecond network latency by using cut‑through forwarding, credit‑based flow control (loss‑free), hardware offload, and minimal buffer sizes to reduce queuing delays.
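Credit-based flow control can be illustrated with a toy model. The sketch below is only a simulation of the idea, not InfiniBand's actual link-layer mechanism; the class `CreditedLink` and its methods are hypothetical names invented for this example. The point it shows is that the sender transmits only while it holds a credit, so the receiver's buffer can never overflow and nothing has to be dropped.

```python
from collections import deque


class CreditedLink:
    """Toy link: the receiver grants one credit per free buffer slot."""

    def __init__(self, receiver_buffers: int) -> None:
        self.capacity = receiver_buffers
        self.credits = receiver_buffers   # credits currently held by the sender
        self.rx_buffer = deque()          # packets waiting at the receiver

    def try_send(self, packet: str) -> bool:
        """Transmit only if a credit is available; otherwise hold the packet back."""
        if self.credits == 0:
            return False                  # sender waits, nothing is dropped
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def receiver_drain(self) -> str:
        """Receiver consumes one packet and returns one credit to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


link = CreditedLink(receiver_buffers=2)
sent = [link.try_send(f"pkt{i}") for i in range(3)]  # the third send has no credit
drained = link.receiver_drain()                      # frees a slot, returns a credit
retry = link.try_send("pkt2")                        # now succeeds
```

In this model a packet is delayed at the sender rather than lost in the network, which is the property the text calls loss-free.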

iWARP (RDMA over TCP/IP) leverages mature IP networks and inherits RDMA’s benefits, though hardware implementation costs are high and packet loss on traditional IP networks can degrade performance.

RoCE delivers performance comparable to InfiniBand, relies on Data Center Bridging (DCB) for loss‑free transmission, and requires Ethernet devices that support DCB; Ethernet switches typically add slightly higher latency than InfiniBand switches.

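RoCEv2 improves upon RoCE by adding an IP header, which makes the traffic routable and so allows multi-layer (Layer 3) networks, and a UDP header, whose fields give switches the per-flow variation they need for ECMP load balancing.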
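How a UDP header helps ECMP can be sketched in a few lines. This is an illustration only: real switches use their own hash functions and field selections, and the CRC32 hash, the addresses, the source-port values, and the helper names `ecmp_pick` and `paths_used` are all assumptions made for the example. The only protocol constant used is the RoCEv2 UDP destination port, 4791.

```python
import zlib

ROCEV2_UDP_DPORT = 4791  # UDP destination port assigned to RoCEv2


def ecmp_pick(src_ip: str, dst_ip: str, src_port: int, dst_port: int, n_paths: int) -> int:
    """Pick an equal-cost path by hashing the flow tuple (illustrative hash only)."""
    key = f"{src_ip}|{dst_ip}|udp|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths


def paths_used(src_ports, n_paths: int = 4) -> set:
    """Return the set of path indices chosen for flows between one pair of hosts."""
    return {
        ecmp_pick("10.0.0.1", "10.0.1.1", port, ROCEV2_UDP_DPORT, n_paths)
        for port in src_ports
    }


# Packets of one flow share a source port, so they always take the same path.
same_flow = paths_used([49152, 49152, 49152])

# Different flows (assumed here to use different source ports) can spread out.
many_flows = paths_used(range(49152, 49152 + 64))
```

Keeping one flow on one path avoids reordering within the flow, while varying the source port across flows lets the fabric use several equal-cost paths between the same two hosts.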

Native RDMA over InfiniBand was standardized in 2000. iWARP (RDMA over TCP/IP) became a standard in 2007 and comprises the MPA, DDP, and RDMAP sub-protocols. RoCE (RDMA over Ethernet) was standardized in 2010; it runs InfiniBand's transport layer over enhanced Ethernet.

OFED (OpenFabrics Enterprise Distribution), released by the OpenFabrics Alliance, provides RDMA stacks for Linux and Windows that integrate with existing applications and can deliver substantial performance gains.

The software‑transport interface between applications and the RNIC (RDMA‑aware NIC) is called Verbs. OFA defines a set of Verbs APIs and the OFED stack supports multiple RDMA transport protocols.

Beyond providing basic queue services (RNIC, LLP), OFED also offers Upper Layer Protocols (ULPs) that allow applications to use RDMA without directly invoking Verbs APIs, enabling legacy software to run over RDMA transparently.

In the InfiniBand/RDMA model, the core goal is to achieve the simplest, most efficient direct communication between applications. RDMA uses message‑queue‑based point‑to‑point communication, allowing each application to access its messages without OS or protocol‑stack involvement.

Message services are built on Channel-IO connections created between the local and remote applications. Each channel is terminated by a Queue Pair (QP) at each end, and each QP contains a Send Queue (SQ) and a Receive Queue (RQ). QPs are mapped into the application's virtual address space, so the application can reach the RNIC directly. A Completion Queue (CQ) notifies the user when work requests have been processed.

RDMA provides a software transport interface for creating Work Requests (WR). A WR describes the data to be transferred and is placed into a Work Queue (WQ). The WR is transformed into a Work Queue Element (WQE) that the RNIC schedules asynchronously, fetching the payload from the buffer referenced by the WQE.
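The post-then-poll pattern described above can be sketched as a small simulation. This is not the Verbs API; `WorkRequest`, `ToyQueuePair`, and their methods are hypothetical names chosen for illustration, and the "NIC" is just a method the example calls by hand. What it aims to show is the asynchrony: posting a WR returns at once, the data moves only when the NIC processes the WQE, and the application learns of it through the CQ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int          # caller-chosen tag, echoed back in the completion
    buffer: bytearray   # application buffer holding the payload
    length: int         # number of bytes to transfer


class ToyQueuePair:
    def __init__(self) -> None:
        self.send_queue = deque()        # WQEs waiting for the NIC
        self.completion_queue = deque()  # CQEs waiting for the application
        self.wire = []                   # stands in for the network

    def post_send(self, wr: WorkRequest) -> None:
        """Turn the WR into a WQE and return immediately; no data moves yet."""
        self.send_queue.append({"wr_id": wr.wr_id, "buf": wr.buffer, "len": wr.length})

    def nic_process_one(self) -> None:
        """What the NIC does later: fetch the payload through the WQE, then complete it."""
        wqe = self.send_queue.popleft()
        self.wire.append(bytes(wqe["buf"][: wqe["len"]]))
        self.completion_queue.append({"wr_id": wqe["wr_id"], "status": "success"})

    def poll_cq(self):
        """Return the next completion, or None if nothing has finished."""
        return self.completion_queue.popleft() if self.completion_queue else None


qp = ToyQueuePair()
payload = bytearray(b"hello rdma")
qp.post_send(WorkRequest(wr_id=7, buffer=payload, length=5))
before = qp.poll_cq()   # nothing completed yet: the NIC has not run
qp.nic_process_one()
after = qp.poll_cq()    # completion carrying wr_id 7
```

Because the WQE references the buffer instead of copying it, the application must leave the buffer untouched until the matching completion arrives.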

RDMA SEND/RECEIVE are two-sided operations that require the remote side's participation, whereas READ and WRITE are one-sided and complete without remote application involvement. Consequently, SEND/RECEIVE are typically used for control messages, while bulk data transfers use READ/WRITE. A two-sided SEND from host A to host B proceeds as follows:

1. Both hosts create and initialize their QPs and CQs.

2. Each host posts a WQE to its WQ (A to its SQ for sending, B to its RQ for receiving), pointing to the appropriate buffer.

3. A’s RNIC processes the SEND WQE, directly transmitting data from A’s buffer to B; B’s RNIC consumes the corresponding WQE and stores the data.

4. Upon completion, both sides generate CQEs in their CQs indicating send and receive completion.
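The four steps above can be sketched as a toy simulation. It models only the queue bookkeeping, not real hardware or the Verbs API; the `Host` class and the `rnic_deliver` function are hypothetical names for this example, and raising an error when no receive has been posted is a simplification of this model.

```python
from collections import deque


class Host:
    def __init__(self, name: str) -> None:
        # Step 1: each host creates its queues.
        self.name = name
        self.sq = deque()   # Send Queue
        self.rq = deque()   # Receive Queue
        self.cq = deque()   # Completion Queue

    def post_send(self, wr_id: int, buffer: bytearray) -> None:
        self.sq.append((wr_id, buffer))

    def post_recv(self, wr_id: int, buffer: bytearray) -> None:
        self.rq.append((wr_id, buffer))


def rnic_deliver(sender: Host, receiver: Host) -> None:
    """Model both NICs handling one SEND: one WQE is consumed on each side."""
    send_id, src = sender.sq.popleft()
    if not receiver.rq:
        raise RuntimeError("receiver has not posted a receive WQE")
    recv_id, dst = receiver.rq.popleft()
    if len(src) > len(dst):
        raise ValueError("receive buffer too small")
    dst[: len(src)] = src                            # step 3: data lands in B's buffer
    sender.cq.append(("send", send_id))              # step 4: completion on A
    receiver.cq.append(("recv", recv_id, len(src)))  # step 4: completion on B


a, b = Host("A"), Host("B")
recv_buf = bytearray(16)
b.post_recv(1, recv_buf)            # step 2 on B: receive WQE in the RQ
a.post_send(2, bytearray(b"ping"))  # step 2 on A: send WQE in the SQ
rnic_deliver(a, b)
```

Note that B must post its receive before A's data arrives, which is exactly the remote participation that makes the operation two-sided.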

Two-sided operations resemble traditional network buffer pools but differ by providing zero-copy and kernel-bypass, making them suitable for short control messages.

For one‑sided operations (e.g., storage), the flow is:

1. A and B establish a connection with initialized QPs.

2. A registers its buffer address (VA) with the RNIC and obtains a local key and a remote key.

3. A sends the VA and the remote key to B, and also posts a receive WR for the status message B will return.

4. B receives the VA and remote key, then performs an RDMA READ that copies data from A's VA to B's buffer (VB) without any software involvement on A's side.

5. B returns a status message to A after the transfer completes.
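The one-sided flow above can be sketched as a toy simulation. It is not the Verbs API: `ToyRnic`, `register`, and `serve_read` are hypothetical names, the address value is made up, and the key that A shares with B is called `rkey` here. The sketch shows the two ideas that matter: memory must be registered before it can be accessed remotely, and the read is served from the registration table without A's application running any code.

```python
import secrets


class ToyRnic:
    """Toy NIC holding a table of registered memory regions."""

    def __init__(self) -> None:
        self.regions = {}   # rkey -> (base address, buffer)

    def register(self, base_addr: int, buffer: bytearray) -> int:
        """Register a buffer and return an opaque key that authorizes remote access."""
        rkey = secrets.randbits(32)
        self.regions[rkey] = (base_addr, buffer)
        return rkey

    def serve_read(self, addr: int, rkey: int, length: int) -> bytes:
        """Serve a remote READ; the owning application is not involved."""
        if rkey not in self.regions:
            raise PermissionError("invalid remote key")
        base, buf = self.regions[rkey]
        offset = addr - base
        if offset < 0 or offset + length > len(buf):
            raise PermissionError("access outside the registered region")
        return bytes(buf[offset: offset + length])


nic_a = ToyRnic()
buf_a = bytearray(b"block-0042 payload")
VA = 0x7F000000                                  # made-up address for A's buffer

rkey = nic_a.register(VA, buf_a)                 # step 2: register and get the key
advert = {"addr": VA, "rkey": rkey, "len": len(buf_a)}   # step 3: sent to B

buf_b = bytearray(advert["len"])                 # B's destination buffer (VB)
buf_b[:] = nic_a.serve_read(advert["addr"], advert["rkey"], advert["len"])  # step 4
status = "ok"                                    # step 5: B reports back to A
```

A request with a wrong key or an out-of-range address is rejected in this model, which reflects why the key and address have to be exchanged over a two-sided message first.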

One‑sided operations are the primary distinction between RDMA and traditional networking, allowing direct remote memory access without remote application participation, which is ideal for bulk data transfers.

Simple Summary

InfiniBand's success stems from two factors. First, host-side RDMA cuts data-processing latency from tens of microseconds to a few microseconds while consuming almost no CPU. Second, InfiniBand networks provide high bandwidth (40/56 Gbit/s), sub-microsecond latency, and loss-free transmission.

As Ethernet evolves to offer comparable bandwidth and loss-free capabilities, RoCE (RDMA over Ethernet) is a natural next step that offers lower deployment cost. RDMA products based on RoCE, iWARP, and InfiniBand can all be expected to keep advancing.
