Unlocking Ultra‑Low Latency: How RDMA Transforms High‑Performance Networking

This article explains the fundamentals of Remote Direct Memory Access (RDMA), its low‑latency, zero‑copy and kernel‑bypass mechanisms, programming interfaces, and real‑world applications in data‑center networks, high‑performance computing, and distributed storage, providing developers with practical guidance and code examples.

What Is RDMA?

Remote Direct Memory Access (RDMA) is a high‑performance networking technology that enables a computer to read or write memory on a remote machine directly, bypassing the operating‑system kernel and eliminating memory copies. This reduces latency to the microsecond level and lowers CPU overhead.

Core Principles

Direct Memory Access

Traditional TCP/IP communication requires copying data between user space and kernel space and performing context switches. RDMA removes these steps by letting the NIC transfer data directly between the memories of the two hosts.

Zero‑Copy and Kernel Bypass

RDMA uses zero‑copy so data moves between application buffers and the network without intermediate copies. Kernel bypass keeps the data path in user space, avoiding costly kernel‑mode transitions.

RDMA Transport Protocols

InfiniBand (IB): Requires dedicated IB NICs and switches; provides the highest bandwidth and lowest latency.

RDMA over Converged Ethernet (RoCE): Runs over standard Ethernet but requires RoCE‑capable NICs and a fabric configured for lossless operation (Priority Flow Control, PFC).

iWARP: Implements RDMA over TCP/IP, offering broader compatibility at the cost of higher latency.

All three expose the same Verbs API while differing in physical and link‑layer requirements.

Programming Model

Key APIs

The primary interfaces are the Verbs API and the RDMA Connection Manager (CM) API. A typical Verbs workflow is:

Query and open an RDMA device (ibv_get_device_list, ibv_open_device).

Allocate a protection domain (ibv_alloc_pd).

Register memory regions (ibv_reg_mr) to obtain local (lkey) and remote (rkey) keys.

Create a completion queue (ibv_create_cq).

Create and configure a queue pair (ibv_create_qp) with send and receive queues.

Post send or receive work requests (ibv_post_send, ibv_post_recv).

Poll the completion queue (ibv_poll_cq) for completion events.

Release resources (ibv_dereg_mr, ibv_destroy_qp, etc.).
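
The sketch below strings these calls together into minimal setup, send, receive, poll, and cleanup helpers. It is illustrative rather than complete: connection establishment and the queue‑pair state transitions (INIT, RTR, RTS) are omitted.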

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Global RDMA objects
struct ibv_context *ctx;
struct ibv_pd *pd;
struct ibv_mr *mr;
struct ibv_qp *qp;
struct ibv_cq *cq;
char *buf;   // registered data buffer, freed in cleanup()

void init_rdma() {
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list) { perror("Failed to get RDMA device list"); exit(1); }
    ctx = ibv_open_device(dev_list[0]);
    if (!ctx) { perror("Failed to open RDMA device"); exit(1); }
    ibv_free_device_list(dev_list);
    pd = ibv_alloc_pd(ctx);
    if (!pd) { perror("Failed to allocate protection domain"); exit(1); }
    buf = malloc(1024);
    if (!buf) { perror("Failed to allocate buffer"); exit(1); }
    // Register the buffer; the flags also let the peer read from and write into it.
    mr = ibv_reg_mr(pd, buf, 1024,
                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                    IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("Failed to register memory region"); exit(1); }
    cq = ibv_create_cq(ctx, 10, NULL, NULL, 0);
    if (!cq) { perror("Failed to create completion queue"); exit(1); }
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 10, .max_recv_wr = 10,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC
    };
    qp = ibv_create_qp(pd, &qp_attr);
    if (!qp) { perror("Failed to create queue pair"); exit(1); }
    // Note: before traffic can flow, the QP must still be moved through the
    // INIT, RTR, and RTS states with ibv_modify_qp (or via the RDMA CM).
}

void send_data(uint64_t remote_addr, uint32_t remote_rkey) {
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    memset(&wr, 0, sizeof(wr));
    sge.addr = (uint64_t)(uintptr_t)mr->addr;
    sge.length = 1024;
    sge.lkey = mr->lkey;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;   // request a completion for this WR
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = remote_rkey;
    int ret = ibv_post_send(qp, &wr, &bad_wr);
    if (ret) {
        fprintf(stderr, "Failed to post send: %s\n", strerror(ret)); exit(1);
    }
}

void receive_data() {
    struct ibv_recv_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;
    memset(&wr, 0, sizeof(wr));
    sge.addr = (uint64_t)(uintptr_t)mr->addr;
    sge.length = 1024;
    sge.lkey = mr->lkey;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    int ret = ibv_post_recv(qp, &wr, &bad_wr);
    if (ret) {
        fprintf(stderr, "Failed to post receive: %s\n", strerror(ret)); exit(1);
    }
}

void poll_cq() {
    struct ibv_wc wc;
    int n;
    // Busy-poll until one completion is available, then check its status.
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n < 0) {
        fprintf(stderr, "ibv_poll_cq failed\n"); exit(1);
    }
    if (wc.status == IBV_WC_SUCCESS) {
        printf("RDMA operation completed successfully\n");
    } else {
        fprintf(stderr, "RDMA operation failed: %s\n",
                ibv_wc_status_str(wc.status));
    }
}

void cleanup() {
    ibv_dereg_mr(mr);
    free(buf);
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
}
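
In practice, before send_data() can succeed the two sides must exchange their QP numbers, LIDs or GIDs, buffer addresses, and rkeys out of band (for example over a TCP socket or through the RDMA CM), and each QP must be transitioned through the INIT, RTR, and RTS states with ibv_modify_qp.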

Memory Registration and Queue Pair

Memory must be registered with ibv_reg_mr to obtain an lkey for local access and an rkey for remote access. Registration pins the region in physical memory, preventing paging and ensuring DMA‑compatible addresses.
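
For the keys to be useful, the address and rkey of the registered region have to reach the peer. Below is a minimal sketch of the information that is typically exchanged out of band; the struct and helper are illustrative and not part of the Verbs API.

// Hypothetical descriptor sent to the peer, e.g. over a TCP socket or as
// RDMA CM private data; not defined by libibverbs itself.
struct rdma_buffer_info {
    uint64_t addr;    // virtual address of the registered buffer
    uint32_t rkey;    // remote key returned by ibv_reg_mr()
    uint32_t length;  // usable length of the region
};

// Build the descriptor from the global mr registered in init_rdma().
struct rdma_buffer_info describe_local_buffer(void) {
    struct rdma_buffer_info info = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,
        .rkey   = mr->rkey,
        .length = 1024
    };
    return info;   // the peer uses addr/rkey as remote_addr/remote_rkey in its WRs
}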

A Queue Pair (QP) consists of a Send Queue (SQ) and a Receive Queue (RQ). Work Requests (WR) are posted to these queues; the NIC processes them and generates Completion Queue Elements (CQE) in the Completion Queue (CQ) to signal success or failure.

Transport Modes

Reliable Connected (RC): Provides ordered, reliable delivery similar to TCP.

Unreliable Connected (UC): No retransmission; loss must be handled by the application.

Unreliable Datagram (UD): Connectionless, supports multicast, no ordering guarantees.
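
The mode is chosen when the queue pair is created, via the qp_type field of ibv_qp_init_attr. A brief sketch, reusing the global pd and cq from the earlier example; note that RDMA Read is only available on RC, while UD supports only Send/Receive with messages up to the path MTU.

// Selecting the transport mode at QP creation time
struct ibv_qp *create_qp_with_mode(enum ibv_qp_type type) {
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 10, .max_recv_wr = 10,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = type   // IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD
    };
    return ibv_create_qp(pd, &attr);
}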

Typical Communication Patterns

RDMA Read: The initiator reads a remote memory region without remote CPU involvement.

RDMA Write: The initiator writes data into a remote memory region; the remote side is unaware unless an immediate value is used.

Send/Receive: Paired operations where the sender posts a Send WR and the receiver posts a matching Receive WR; useful for control messages.
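
A short sketch of the other two patterns, reusing the global qp and mr from the earlier example; remote_addr and remote_rkey are assumed to have been exchanged beforehand.

// RDMA Read: pull remote memory into the local registered buffer
void read_remote(uint64_t remote_addr, uint32_t remote_rkey) {
    struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)mr->addr,
                           .length = 1024, .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_RDMA_READ;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = remote_rkey;
    if (ibv_post_send(qp, &wr, &bad_wr))
        fprintf(stderr, "Failed to post RDMA read\n");
}

// Send: consumed by a matching Receive WR posted on the peer
void send_message(void) {
    struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)mr->addr,
                           .length = 1024, .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    if (ibv_post_send(qp, &wr, &bad_wr))
        fprintf(stderr, "Failed to post send\n");
}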

Application Domains

Data‑center networking: Replaces TCP/IP to achieve multi‑gigabit throughput with microsecond latency, reducing CPU load.

High‑Performance Computing (HPC): Enables fast exchange of terabytes of data between compute nodes, accelerating simulations.

Distributed storage (e.g., Ceph, GlusterFS): Bypasses the kernel stack for low‑latency reads/writes and efficient replication.
