Why RDMA Is the Secret to Lightning‑Fast Data Transfer in Modern Data Centers
This article explains the fundamentals of Remote Direct Memory Access (RDMA), its low‑latency, zero‑copy architecture, core principles, programming interfaces, and how it transforms data‑center networking, high‑performance computing, and distributed storage by bypassing the CPU and kernel.
1. Introduction to RDMA
Many teams see high CPU/GPU utilization yet poor data transfer rates between nodes: 100 GbE links rarely reach half their rated speed, and AI training stalls during parameter synchronization. The root cause is often the overhead of traditional TCP/IP communication.
Traditional networking requires multiple copies between user space, kernel space, and the NIC, consuming CPU cycles and adding latency. RDMA (Remote Direct Memory Access) eliminates these copies by allowing the NIC to read/write remote memory directly, achieving zero‑copy and low‑latency transfers.
2. Core Principles of RDMA
RDMA uses kernel bypass and zero‑copy techniques to reduce latency and CPU usage. It provides a set of Verbs interfaces that operate from user space, allowing direct access to remote virtual memory.
CPU Offload: the remote CPU takes no part in reads and writes of its memory; the RNIC performs them directly.
Kernel Bypass: applications talk to the NIC through a dedicated Verbs API rather than the TCP/IP socket stack, avoiding system calls and context switches.
Zero Copy: data moves directly between application buffers and the network, with no intermediate kernel copies.
The overall RDMA architecture centers on an RNIC (RDMA Network Interface Card) that caches page-table entries mapping virtual pages to physical pages, so it can translate application virtual addresses without involving the host CPU.
3. RDMA Programming Details
3.1 Transfer Operations
RDMA supports two basic operation families:
Memory verbs : read, write, and atomic operations.
Messaging verbs : send and receive.
3.2 Transfer Modes
Four transport modes are defined, based on reliability and connection type; three are in common use:
Reliable Connected (RC) – connection-oriented with acknowledged, in-order delivery, similar to TCP.
Unreliable Connected (UC) – connection-oriented but without retransmission, so packets may be lost.
Unreliable Datagram (UD) – connectionless, no ordering guarantees, supports multicast (similar to UDP).
The fourth mode, Reliable Datagram (RD), is defined by the InfiniBand specification but rarely implemented in hardware.
3.3 Programming Interfaces
The primary APIs are the Verbs API and the RDMA Connection Manager (CM) API. Typical steps include:
Query and open an RDMA device (ibv_get_device_list, ibv_open_device).
Allocate a protection domain (ibv_alloc_pd).
Register memory (ibv_reg_mr) to obtain a local key (lkey) and remote key (rkey).
Create a completion queue (ibv_create_cq) and a queue pair (ibv_create_qp) with send and receive queues.
Post send/receive work requests (ibv_post_send, ibv_post_recv).
Poll the completion queue for CQEs to determine operation success.
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct ibv_context *ctx;
struct ibv_pd *pd;
struct ibv_mr *mr;
struct ibv_qp *qp;
struct ibv_cq *cq;

/* Obtained from the peer during connection setup (e.g. via the RDMA CM). */
uint64_t remote_addr;
uint32_t remote_rkey;

void init_rdma() {
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no RDMA device found\n");
        exit(1);
    }
    ctx = ibv_open_device(dev_list[0]);
    ibv_free_device_list(dev_list);
    pd = ibv_alloc_pd(ctx);
    char *buf = malloc(1024);
    /* A peer that is written to must register with IBV_ACCESS_REMOTE_WRITE. */
    mr = ibv_reg_mr(pd, buf, 1024, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    cq = ibv_create_cq(ctx, 10, NULL, NULL, 0);
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {.max_send_wr = 10, .max_recv_wr = 10, .max_send_sge = 1, .max_recv_sge = 1},
        .qp_type = IBV_QPT_RC
    };
    qp = ibv_create_qp(pd, &qp_attr);
    /* Before posting work requests, the QP must be driven through
     * RESET -> INIT -> RTR -> RTS with ibv_modify_qp (omitted for brevity). */
}

void send_data() {
    struct ibv_send_wr wr = {0}, *bad_wr;
    struct ibv_sge sge = {0};
    sge.addr = (uintptr_t)mr->addr;
    sge.length = 1024;
    sge.lkey = mr->lkey;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;  /* request a CQE for this work request */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = remote_rkey;
    if (ibv_post_send(qp, &wr, &bad_wr))
        fprintf(stderr, "ibv_post_send failed\n");
}

void poll_cq() {
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc)) {
        if (wc.status == IBV_WC_SUCCESS) {
            printf("RDMA operation completed successfully\n");
        } else {
            /* wc.status is not errno, so report it via ibv_wc_status_str */
            fprintf(stderr, "RDMA operation failed: %s\n", ibv_wc_status_str(wc.status));
        }
    }
}
4. Application Domains
4.1 Data‑Center Networking
RDMA dramatically reduces latency and CPU overhead in large‑scale data centers, accelerating cloud storage uploads/downloads and enabling higher bandwidth utilization.
4.2 High‑Performance Computing (HPC)
In scientific simulations, weather modeling, and genome sequencing, RDMA provides the low‑latency, high‑throughput interconnects required for fast data exchange between compute nodes.
4.3 Distributed Storage
Systems such as Ceph and GlusterFS use RDMA to bypass the kernel, achieving faster reads/writes and improving reliability during large‑scale data migrations and backups.
5. RDMA Communication Process
Typical RDMA operations follow these steps:
The application posts a read/write work request from user space; the NICs then carry out the transfer without further CPU involvement.
The NIC reads the local buffer and transmits it over the network.
The remote NIC validates the memory key and writes directly into the remote buffer.
Completion is reported via a CQE in the completion queue.
Read/Write are one‑sided operations; Send/Receive are two‑sided and are usually used for control messages.