
Why RDMA Is the Secret to Lightning‑Fast Data Transfer in Modern Data Centers

This article explains the fundamentals of Remote Direct Memory Access (RDMA), its low‑latency, zero‑copy architecture, core principles, programming interfaces, and how it transforms data‑center networking, high‑performance computing, and distributed storage by bypassing the CPU and kernel.


1. Introduction to RDMA

Many engineers see high CPU/GPU utilization yet poor data-transfer rates between nodes: 100 GbE links rarely reach half their rated speed, and AI training stalls during parameter synchronization. The root cause is the inefficiency of traditional TCP/IP communication.

Traditional networking requires multiple copies between user space, kernel space, and the NIC, consuming CPU cycles and adding latency. RDMA (Remote Direct Memory Access) eliminates these copies by allowing the NIC to read/write remote memory directly, achieving zero‑copy and low‑latency transfers.

2. Core Principles of RDMA

RDMA uses kernel bypass and zero‑copy techniques to reduce latency and CPU usage. It provides a set of Verbs interfaces that operate from user space, allowing direct access to remote virtual memory.

CPU Offload: remote memory reads and writes complete without involving the remote host's CPU.

Kernel Bypass: applications issue operations through the user-space Verbs API instead of the kernel's TCP/IP socket stack, avoiding system calls and context switches on the data path.

Zero Copy: data moves directly between application buffers and the network without intermediate copies.

The overall RDMA architecture includes an RNIC (RDMA Network Interface Card) that caches page‑table entries mapping the virtual pages of registered memory regions to their physical pages, so it can DMA directly into and out of application buffers.

3. RDMA Programming Details

3.1 Transfer Operations

RDMA supports two basic operation families (the sketch after this list maps them to Verbs work-request opcodes):

Memory verbs: read, write, and atomic operations, all one‑sided.

Messaging verbs: send and receive, which are two‑sided.
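
As a quick orientation, the sketch below maps the two families onto the work-request opcodes defined in <infiniband/verbs.h>; the array names are only illustrative.

#include <infiniband/verbs.h>

/* Memory verbs: one-sided operations executed by the remote RNIC
 * without involving the remote CPU. */
enum ibv_wr_opcode memory_verbs[] = {
    IBV_WR_RDMA_READ,            /* read remote memory into a local buffer   */
    IBV_WR_RDMA_WRITE,           /* write a local buffer into remote memory  */
    IBV_WR_ATOMIC_CMP_AND_SWP,   /* atomic compare-and-swap on remote memory */
    IBV_WR_ATOMIC_FETCH_AND_ADD, /* atomic fetch-and-add on remote memory    */
};

/* Messaging verbs: two-sided operations; the receiver must pre-post a
 * receive work request with ibv_post_recv() before the send arrives. */
enum ibv_wr_opcode messaging_verbs[] = {
    IBV_WR_SEND,                 /* plain send into the peer's posted receive */
    IBV_WR_SEND_WITH_IMM,        /* send carrying 32-bit immediate data       */
};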

3.2 Transfer Modes

Four transport modes exist, distinguished by reliability and connection type (the sketch after this list shows how a mode is selected at queue-pair creation):

Reliable Connected (RC) – connection-oriented with acknowledged, in-order delivery and retransmission, similar to TCP.

Unreliable Connected (UC) – connection-oriented but unacknowledged; packets may be lost and are not retransmitted.

Unreliable Datagram (UD) – connection‑less, no ordering guarantees, supports multicast (similar to UDP).

Reliable Datagram (RD) – connection‑less but reliable; defined by the InfiniBand specification but rarely implemented in practice.
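
The transport mode is fixed when the queue pair is created, through the qp_type field; a minimal sketch follows (RD is omitted because commodity RNICs generally do not implement it):

#include <infiniband/verbs.h>

/* Illustrative only: .send_cq, .recv_cq and .cap would be filled in as in
 * the full example in section 3.3. */
struct ibv_qp_init_attr mode_attr = {
    .qp_type = IBV_QPT_RC,    /* Reliable Connected   */
    /* .qp_type = IBV_QPT_UC,    Unreliable Connected */
    /* .qp_type = IBV_QPT_UD,    Unreliable Datagram  */
};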

3.3 Programming Interfaces

The primary APIs are the Verbs API and the RDMA Connection Manager (CM) API. Typical steps include:

Query and open an RDMA device (ibv_get_device_list, ibv_open_device).

Allocate a protection domain (ibv_alloc_pd).

Register memory (ibv_reg_mr) to obtain a local key (lkey) and remote key (rkey).

Create a completion queue (ibv_create_cq) and a queue pair (ibv_create_qp) with send and receive queues.

Exchange QP numbers, buffer addresses, and rkeys with the peer out of band (for example via the RDMA CM or a plain TCP socket), and transition the QP through the INIT, RTR, and RTS states with ibv_modify_qp.

Post send/receive work requests (ibv_post_send, ibv_post_recv).

Poll the completion queue for CQEs to determine operation success.
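
A minimal sketch of these steps with the Verbs API follows; error handling, the out-of-band exchange of addresses and keys, and the QP state transitions are omitted for brevity.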

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct ibv_context *ctx;
struct ibv_pd *pd;
struct ibv_mr *mr;
struct ibv_qp *qp;
struct ibv_cq *cq;

void init_rdma() {
    /* Open the first RDMA device; the device list can be freed afterwards. */
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    ctx = ibv_open_device(dev_list[0]);
    ibv_free_device_list(dev_list);

    /* Protection domain: groups the resources (MRs, QPs) allowed to interact. */
    pd = ibv_alloc_pd(ctx);

    /* Register (pin) a 1 KiB buffer; the returned MR carries lkey and rkey. */
    char *buf = malloc(1024);
    mr = ibv_reg_mr(pd, buf, 1024, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

    /* Completion queue and a reliable-connected queue pair that shares it. */
    cq = ibv_create_cq(ctx, 10, NULL, NULL, 0);
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {.max_send_wr = 10, .max_recv_wr = 10, .max_send_sge = 1, .max_recv_sge = 1},
        .qp_type = IBV_QPT_RC
    };
    qp = ibv_create_qp(pd, &qp_attr);
    /* Before posting work requests, the QP must still be connected to the
     * peer and moved to the RTS state with ibv_modify_qp (see below). */
}

/* Remote buffer address and rkey; in a real program these are obtained
 * from the peer out of band (e.g., over a TCP socket or the RDMA CM). */
uint64_t remote_addr;
uint32_t remote_rkey;

void send_data() {
    struct ibv_send_wr wr = {}, *bad_wr;
    struct ibv_sge sge = {};
    sge.addr = (uint64_t)mr->addr;     /* local source buffer              */
    sge.length = 1024;
    sge.lkey = mr->lkey;               /* local key from ibv_reg_mr        */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_RDMA_WRITE;     /* one-sided write to remote memory */
    wr.send_flags = IBV_SEND_SIGNALED; /* request a CQE when it completes  */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = remote_rkey;
    ibv_post_send(qp, &wr, &bad_wr);
}

void poll_cq() {
    struct ibv_wc wc;
    /* Drain whatever completions are currently in the CQ. */
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS) {
            printf("RDMA operation completed successfully\n");
        } else {
            /* ibv_poll_cq does not set errno; report the WC status instead. */
            fprintf(stderr, "RDMA operation failed: %s\n",
                    ibv_wc_status_str(wc.status));
        }
    }
}
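
One piece the sketch above leaves out is connecting the QP: before send_data() can post work requests, an RC QP must be moved through the INIT, RTR, and RTS states with ibv_modify_qp. The sketch below assumes an InfiniBand-style fabric addressed by LID (RoCE would need GID/GRH fields instead); remote_qpn and remote_lid are placeholders for values learned from the peer during connection setup.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int connect_qp(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t remote_lid) {
    struct ibv_qp_attr attr;

    /* RESET -> INIT: bind the QP to a port and set remote access rights. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = 1;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR (ready to receive): point the QP at the remote peer. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_1024;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.port_num   = 1;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS (ready to send): enable transmission and retries. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}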

4. Application Domains

4.1 Data‑Center Networking

RDMA dramatically reduces latency and CPU overhead in large‑scale data centers, accelerating cloud storage uploads/downloads and enabling higher bandwidth utilization.

4.2 High‑Performance Computing (HPC)

In scientific simulations, weather modeling, and genome sequencing, RDMA provides the low‑latency, high‑throughput interconnects required for fast data exchange between compute nodes.

4.3 Distributed Storage

Systems such as Ceph and GlusterFS use RDMA to bypass the kernel, achieving faster reads/writes and improving reliability during large‑scale data migrations and backups.

5. RDMA Communication Process

Typical RDMA operations follow these steps:

The application posts a read/write request from user space; the RNIC handles the request without further CPU involvement.

The NIC reads the local buffer and transmits it over the network.

The remote NIC validates the memory key and writes directly into the remote buffer.

Completion is reported via a CQE in the completion queue.

Read/Write are one‑sided operations; Send/Receive are two‑sided and are usually used for control messages.
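
As a minimal sketch of the two-sided path (buffer, MR, and QP assumed to be set up as in section 3.3): the receiver pre-posts a receive work request, and the sender posts a matching SEND, which needs no remote address or rkey.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver side: post a receive WR before the peer's SEND arrives. */
void post_recv(struct ibv_qp *qp, struct ibv_mr *mr) {
    struct ibv_sge sge = {
        .addr   = (uint64_t)mr->addr,
        .length = 1024,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 }, *bad_wr;
    ibv_post_recv(qp, &wr, &bad_wr);
}

/* Sender side: a two-sided SEND carrying a control message; unlike an
 * RDMA WRITE it needs only a receive posted on the peer. */
void post_send_msg(struct ibv_qp *qp, struct ibv_mr *mr) {
    struct ibv_sge sge = {
        .addr   = (uint64_t)mr->addr,
        .length = 1024,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,  /* generate a CQE on completion */
    }, *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);
}

Both sides then learn the outcome by polling their completion queues, just as in poll_cq() above.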

RDMA read diagram
RDMA write diagram
RDMA send/receive diagram
Tags: high-performance computing, networking, RDMA, kernel bypass
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and the Linux kernel.
