
Understanding DMA and RDMA: High‑Performance Direct Memory Access Explained

This article explains the principles of Direct Memory Access (DMA) and Remote Direct Memory Access (RDMA), compares them with traditional TCP I/O, outlines RDMA's features, protocol standards, communication paths, and queue mechanisms, and walks through example code that sets up an RDMA connection over RoCEv2.


DMA and RDMA

DMA (Direct Memory Access) allows PCIe I/O devices to read and write main memory without CPU involvement. The driver maps a host buffer to a bus address and hands that address to the device's DMA engine, which then moves data to or from main memory on its own; in Linux, buffers for legacy DMA-limited devices are allocated from the ZONE_DMA region.

Before DMA, NICs and the CPU exchanged Ethernet frames by copying data from the NIC Rx/Tx queues into kernel space and then again into user space, costing two CPU copies per frame. With DMA, the NIC's DMA engine transfers frames between its queues and main memory directly, eliminating the first copy and reducing CPU load. On x86 Linux, ZONE_DMA covers only the first 16 MiB of physical memory (a legacy ISA limit), and 64-bit kernels add ZONE_DMA32 for devices that can only address the first 4 GiB.
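As a concrete illustration, a driver hands a buffer to a device's DMA engine roughly as follows. This is a minimal kernel-side sketch using the Linux DMA-mapping API; the function and buffer names are illustrative, not taken from a real driver:

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Sketch: map one Tx frame so the NIC's DMA engine can fetch it.
 * 'dev' is the PCIe device; 'frame'/'len' describe the frame buffer. */
static int nic_tx_map(struct device *dev, void *frame, size_t len,
                      dma_addr_t *bus_addr)
{
    /* Translate the kernel virtual address into a bus address the
     * device can use, syncing CPU caches as required. */
    *bus_addr = dma_map_single(dev, frame, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, *bus_addr))
        return -ENOMEM;

    /* The driver now writes *bus_addr into a Tx descriptor; the NIC
     * reads the frame from main memory with no CPU copy involved. */
    return 0;
}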

[Figure: DMA diagram]

RDMA Features and Advantages

RDMA characteristics:

Remote: Data is transferred peer-to-peer between two servers over the network.

Direct: The kernel is bypassed on the data path; control signaling and data transfer are offloaded to the RNIC.

Memory: Data moves directly between the virtual memory of applications on the two servers, achieving sub-10 µs latency.

Access: Supports Send/Receive, Read, and Write operations (their verbs opcodes are sketched below).
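In the libibverbs API these operations map onto explicit work-request opcodes (Receive is the exception: it is posted via ibv_post_recv rather than an opcode). A small illustration:

#include <infiniband/verbs.h>

/* The two-sided and one-sided operations as libibverbs opcodes. */
static const enum ibv_wr_opcode rdma_ops[] = {
    IBV_WR_SEND,        /* two-sided Send (peer must post a Receive) */
    IBV_WR_RDMA_WRITE,  /* one-sided Write into remote memory        */
    IBV_WR_RDMA_READ,   /* one-sided Read from remote memory         */
};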

[Figure: RDMA feature diagram]

High Bandwidth

In a 100 Gbps scenario, CPU utilization drops from 100 % (TCP) to about 10 % with RDMA, making the NIC hardware the bandwidth bottleneck rather than CPU processing.

By the common rule of thumb of 1 MHz of CPU per 1 Mbps of TCP throughput, 100 Gbps (100,000 Mbps) would consume roughly 100 GHz of CPU, i.e. about 40 cores running at 2.5 GHz.

RDMA eliminates interrupt handling for packet transmission, reducing latency and CPU usage.

Low Latency

RDMA reduces end-to-end network latency from the tens-of-microseconds-to-milliseconds range typical of TCP to below 10 µs.

TCP incurs multiple memory copies, interrupt handling, and context switches, adding tens of microseconds of fixed latency.

RDMA uses the Verbs API directly from user space, avoiding system‑call overhead and enabling zero‑copy processing.
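This is visible in code: ibv_poll_cq reads completions from memory the RNIC writes directly, so the hot path below is a pure user-space busy-poll with no system calls. A minimal sketch:

#include <infiniband/verbs.h>

/* Busy-poll the completion queue entirely in user space. */
static int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        /* No syscall here: the CQ lives in memory mapped from the RNIC. */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}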

RDMA Protocol Stack Standards

The two main organizations behind RDMA standards are the IBTA (InfiniBand Trade Association) and the IETF, which standardized the work of the RDMA Consortium (RDMAC).

IBTA: Originated RDMA with InfiniBand, publishing the first RDMA specifications in 2000; InfiniBand requires specialized NICs and switches. The IBTA later defined RoCE and RoCEv2, which carry RDMA over Ethernet.

RDMAC/IETF: The RDMA Consortium developed iWARP, which layers RDMA over TCP/IP and was standardized through the IETF.

[Figure: RDMA standards diagram]

RDMA Operation Principles

Communication Paths

RDMA separates control and data paths:

Control Path: Involves the kernel; a socket-like connection-management API (rdma_cm) creates and manages resources such as channels, Queue Pairs (QP), and Memory Regions (MR). See the sketch after this list.

Data Path: Bypasses the kernel and uses the Verbs API for the actual data transfer.
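A sketch of the split in code, using the rdma_cm connection-manager calls that also appear in the server/client logs later in this article (the address argument and timeout are illustrative):

#include <rdma/rdma_cma.h>

/* Control path (kernel-assisted): create an event channel and a CM id,
 * then start resolving the peer's address. */
static struct rdma_cm_id *control_path_begin(struct sockaddr *server_addr)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;

    rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
    rdma_resolve_addr(id, NULL, server_addr, 2000 /* ms timeout */);
    /* The caller then consumes CM events, resolves the route, and calls
     * rdma_connect(); after that, the data path is just ibv_post_send()
     * and ibv_post_recv() on id->qp, with no kernel involvement. */
    return id;
}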

Communication Model

RDMA uses a message‑queue‑based full‑duplex model with two main queue types:

Work Queue (WQ): Applications post Work Requests (WR) to the WQ; the RNIC executes them.

Completion Queue (CQ): After processing a WR, the RNIC writes a Completion Queue Entry (CQE) to the CQ for the application to poll.

Because RDMA supports full‑duplex, the WQ is split into Send Queue (SQ) and Receive Queue (RQ), forming a Queue Pair (QP). Each QP is uniquely identified by a Global Identifier (GID) and a QP number (QPN).
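A minimal sketch of these queues in code, assuming an already opened device context ctx and an allocated protection domain pd (error handling omitted):

#include <infiniband/verbs.h>

/* One CQ serves both halves of the Queue Pair here; separate CQs for
 * the SQ and RQ are equally valid. */
static struct ibv_qp *make_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 32, NULL, NULL, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,           /* CQEs for Send Queue work    */
        .recv_cq = cq,           /* CQEs for Receive Queue work */
        .qp_type = IBV_QPT_RC,   /* reliable connected service  */
        .cap = { .max_send_wr = 32, .max_recv_wr = 32,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };

    return ibv_create_qp(pd, &attr);  /* SQ + RQ together form the QP */
}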

Communication Types

RDMA supports two major communication types:

Bidirectional (Messaging verbs)

Send‑Receive API: Both sides must post matching WRs; the receiver posts a Receive WR before the sender posts a Send WR.

Unidirectional (Memory verbs)

Write API: The sender writes directly to remote memory without notifying the receiver.

Read API: The sender reads remote memory directly.

Before using Read/Write, a Send‑Receive exchange is typically performed to share QP configuration, memory region information, and rkeys.
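Once the peer's buffer address and rkey have been learned through that exchange, posting a one-sided Write looks roughly like this (a sketch; mr is the locally registered source buffer):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: write the local MR's contents into remote memory that the
 * peer advertised as (raddr, rkey). The remote CPU is not involved. */
static int post_write(struct ibv_qp *qp, struct ibv_mr *mr,
                      uint64_t raddr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a local CQE */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}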

Memory Registration

Two key concepts are Protection Domain (PD) and Memory Region (MR):

MR: A registered memory buffer that the RNIC can access; it provides virtual-to-physical translation, access control via lkey/rkey, and page-locking.

PD: Groups resources (QP, MR) for isolation; only QPs within the same PD can access the MR.

MR attributes include context, address, length, lkey, and rkey. The RNIC uses MPT (Memory Protection Table) and MTT (Memory Translation Table) to translate virtual addresses to physical addresses.
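In code, registration is a single call. A sketch assuming an existing protection domain pd:

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Sketch: allocate and register a buffer. Registration pins the pages
 * and returns the lkey/rkey pair guarded by the MPT/MTT lookup. */
static struct ibv_mr *make_mr(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);

    /* Access flags define what local WRs (via lkey) and remote peers
     * (via rkey) are allowed to do with this region. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}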

[Figure: Memory registration diagram]

QP Establishment

Creating a Channel between two QPs involves negotiating parameters such as GID, QPN, virtual address, rkey, and qkey (used for unreliable datagram services).
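The exchange itself happens out of band (for example over TCP, or in rdma_cm private data). A hedged sketch of the per-side metadata; the struct layout is illustrative, not a wire standard:

#include <stdint.h>

/* Illustrative connection metadata each side shares with its peer. */
struct conn_info {
    uint8_t  gid[16];  /* Global Identifier of the local port    */
    uint32_t qpn;      /* QP number                              */
    uint64_t addr;     /* virtual address of the exposed buffer  */
    uint32_t rkey;     /* remote access key for that buffer      */
    uint32_t qkey;     /* only meaningful for UD services        */
};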

Typical RDMA Workflow

1. The application posts a Work Request to the SQ or RQ.

2. The RNIC processes the WR and generates a CQE in the CQ.

3. The application polls the CQ to confirm completion.
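Step 1 of that workflow, posting a Send WR, looks like this in a sketch (completion polling is the wait_completion loop shown earlier):

#include <infiniband/verbs.h>
#include <stdint.h>

/* Workflow step 1: post a Send WR to the SQ; the RNIC executes it
 * (step 2) and delivers a CQE for the application to poll (step 3). */
static int post_send_msg(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
        .sg_list    = &sge,
        .num_sge    = 1,
    };
    struct ibv_send_wr *bad = NULL;

    return ibv_post_send(qp, &wr, &bad);
}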

Bidirectional Send‑Receive Flow

1. App B posts a Receive WR to its RQ.

2. RNIC B fetches the WR and prepares to receive data.

3. App A posts a Send WR to its SQ.

4. RNIC A reads the WR, fetches the payload from main memory via DMA, and builds a packet.

5. RNIC A sends the packet to RNIC B.

6. RNIC B validates the packet and sends an ACK back to RNIC A.

7. RNIC B writes the payload to the target memory via DMA and posts a CQE.

8. App B polls its CQ and receives the CQE.

9. RNIC A receives the ACK, generates a CQE, and posts it to its CQ.

10. App A polls its CQ and receives the CQE.
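App B's side of step 1, sketched below: the Receive WR must already be in the RQ when App A's Send arrives, or RNIC B has no landing buffer for the payload:

#include <infiniband/verbs.h>
#include <stdint.h>

/* Step 1 on the receiving side: give RNIC B a landing buffer before
 * the sender's packet arrives. */
static int post_recv_buf(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad = NULL;

    return ibv_post_recv(qp, &wr, &bad);
}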

Unidirectional Write Flow

1. The local app posts a Write WR to its SQ.

2. The local RNIC fetches the WR, translates the virtual address to a physical address, reads the data, and sends a packet.

3. The remote RNIC receives the packet, translates the address, and writes the payload to main memory via DMA.

4. The remote RNIC sends an ACK to the local RNIC.

5. The local RNIC posts a CQE to its CQ.

6. The local app polls the CQ for completion.

Unidirectional Read Flow

1. The local app posts a Read WR to its SQ.

2. The local RNIC fetches the WR and sends a request packet.

3. The remote RNIC receives the request, reads the requested data from main memory, and returns it in a packet.

4. The local RNIC receives the data packet, writes the payload to the designated memory via DMA, and posts a CQE.

5. The local app polls the CQ for completion.
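A Read posting is the mirror image of the Write sketch above; the behavioral difference worth noting is that the local CQE also means the remote data has already landed in the local buffer:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: pull remote memory (raddr, rkey) into the local MR. */
static int post_read(struct ibv_qp *qp, struct ibv_mr *mr,
                     uint64_t raddr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* only change vs. Write */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}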

RDMA Verbs API Programming

Basic Network Connectivity (RoCEv2 example)

RoCEv2 uses UDP/IP for addressing, so standard Ethernet tools apply.

# HostA
ifconfig eth2
eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4200
        inet 25.0.0.162  netmask 255.255.255.224  broadcast 25.0.0.191
        inet6 fe80::5aa2:e1ff:fe2d:8578  prefixlen 64  scopeid 0x20<link>
        ether 58:a2:e1:2d:85:78  txqueuelen 1000  (Ethernet)
        RX packets 151  bytes 10282 (10.0 KiB)
        TX packets 231  bytes 17350 (16.9 KiB)

# HostB
ifconfig eth2
eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4200
        inet 25.0.0.34  netmask 255.255.255.224  broadcast 25.0.0.63
        inet6 fe80::966d:aeff:fefd:1c2c  prefixlen 64  scopeid 0x20<link>
        ether 94:6d:ae:fd:1c:2c  txqueuelen 1000  (Ethernet)
        RX packets 130  bytes 9416 (9.1 KiB)
        TX packets 180  bytes 13688 (13.3 KiB)

# HostA ping B
ping -I eth2 25.0.0.34
PING 25.0.0.34 (25.0.0.34) from 25.0.0.162 eth2: 56(84) bytes of data.
64 bytes from 25.0.0.34: icmp_seq=1 ttl=63 time=0.108 ms
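IP reachability alone does not prove the verbs stack can see the adapter. A minimal check, assuming rdma-core/libibverbs is installed (compile with -libverbs); this mirrors what the ibv_devices utility prints:

#include <stdio.h>
#include <infiniband/verbs.h>

/* List the RDMA-capable devices visible to libibverbs. */
int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);

    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("device: %s\n", ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}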

RDMA Client/Server Programs

Source code repository: https://github.com/JmilkFan/rdma-example.git

# RDMA Server
./bin/rdma_server -a 172.16.0.4 -p 20886
RDMA connection management CM event channel is created successfully at 0x22b83a0
A RDMA connection id (36414192) for the server is created
Server RDMA CM id is successfully binded
Server is listening successfully at: 172.16.0.4 , port: 20886

A new RDMA client connection id is stored at 0x22bdda0
A new protection domain is allocated at 0x22bc7f0
An I/O completion event channel is created at 0x22b8380
Completion queue (CQ) is created at 0x22be000 with 31 elements
Client QP created at 0x22be268
Receive buffer pre-posting is successful
Waiting for: RDMA_CM_EVENT_ESTABLISHED event
A new connection is accepted from 172.16.0.102
Client side buffer information is received...
---------------------------------------------------------
buffer attr, addr: 0x15f43a0 , len: 9 , stag : 0x1ff8b7
---------------------------------------------------------
The client has requested buffer length of : 9 bytes
Local buffer metadata has been sent to the client
Waiting for cm event: RDMA_CM_EVENT_DISCONNECTED
A disconnect event is received from the client...
Server shut-down is complete

# RDMA Client
./rdma_client -a 172.16.0.4 -p 20886 -s rdma_test
Passed string is : rdma_test , with count 9
RDMA CM event channel is created at : 0x15f4500
waiting for cm event: RDMA_CM_EVENT_ADDR_RESOLVED
RDMA address is resolved
waiting for cm event: RDMA_CM_EVENT_ROUTE_RESOLVED
Trying to connect to server at : 172.16.0.4 port: 20886
protection domain allocated at 0x15f48f0
completion event channel created at : 0x15f48b0
CQ created at 0x15f9d30 with 31 elements
QP created at 0x15fb018
Receive buffer pre-posting is successful
waiting for cm event: RDMA_CM_EVENT_ESTABLISHED
The client is connected successfully
Server sent us its buffer location and credentials, showing
---------------------------------------------------------
buffer attr, addr: 0x22bc850 , len: 9 , stag : 0x1ff9b8
---------------------------------------------------------
Client side WRITE is complete
Client side READ is complete
SUCCESS, source and destination buffers match
Client resource clean up is complete

This article provides a comprehensive overview of DMA and RDMA concepts, their performance benefits, protocol standards, internal mechanisms, and practical programming examples for high‑performance network communication.
