
Unlocking Linux Performance: A Deep Dive into io_uring and Its Advantages

This guide explains why traditional I/O models become bottlenecks in high‑performance computing and introduces the modern io_uring framework with its submission and completion queues. It walks through io_uring's design goals, core concepts, workflow, performance comparisons, optimization tips, and real‑world use cases, and provides complete C examples for practical adoption.


Why traditional I/O becomes a bottleneck

Blocking I/O stalls a thread until the operation finishes, tying up CPU and memory while it waits. Non‑blocking I/O avoids the stall but forces the application to poll repeatedly, wasting cycles. Multiplexing mechanisms such as select, poll, or epoll only report readiness: the application still issues one system call to harvest events and another to transfer the data, with a copy between kernel and user space each time, which limits scalability in high‑performance computing and big‑data analytics.
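To make the per-event cost concrete, here is a minimal readiness-based sketch using epoll over a pipe. The function name and the pipe setup are illustrative choices, not from the original article; the point is that every round trip costs at least two system calls (epoll_wait plus read) and a kernel-to-user copy.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Readiness-based I/O: one syscall to learn the fd is ready,
 * a second syscall (and a copy) to actually move the data. */
static int epoll_pipe_demo(void)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    write(pipefd[1], "ping", 4);               /* make the read end readable */

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, 1000);   /* system call #1: readiness */
    char buf[16];
    int r = (n == 1) ? (int)read(out.data.fd, buf, sizeof buf) : -1;  /* #2: data */

    close(epfd); close(pipefd[0]); close(pipefd[1]);
    return r;   /* bytes read */
}
```

With io_uring, the submission and the data transfer are described once in shared memory, so the readiness-then-read pattern above collapses into a single submitted request.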

What is io_uring

Added to the Linux kernel in version 5.1, io_uring provides a unified asynchronous I/O interface that reduces system‑call overhead through shared ring buffers and, combined with registered buffers and zero‑copy operations, eliminates unnecessary copies for both file and network operations.

Key data structures

Submission Queue (SQ): a ring buffer in shared memory where the application places I/O requests (io_uring_sqe entries).

Completion Queue (CQ): a ring buffer in shared memory where the kernel posts results (io_uring_cqe entries).

io_uring_sqe: describes a single I/O operation (opcode, file descriptor, buffer address, length, offset, user_data).

io_uring_cqe: contains the result of an operation (res: bytes transferred, or -errno on failure) and the original user_data.

Typical workflow

Initialization

#include <liburing.h>
struct io_uring ring;
int ret = io_uring_queue_init(128, &ring, 0);
if (ret < 0) { perror("io_uring_queue_init"); exit(1); }

Prepare and submit a request

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) { /* SQ is full: submit pending entries and retry */ }
io_uring_prep_read(sqe, fd, buf, BUFFER_SIZE, 0);
io_uring_sqe_set_data(sqe, ctx);   /* round-trips back in cqe->user_data */
io_uring_submit(&ring);

Wait for completion

struct io_uring_cqe *cqe;
int rc = io_uring_wait_cqe(&ring, &cqe);
if (rc == 0) {
    if (cqe->res >= 0) {
        /* success */
    } else {
        /* error */
    }
    io_uring_cqe_seen(&ring, cqe);
}

Core advantages over epoll

Batch submission reduces the number of system calls to one per batch of requests.

The submission and completion queues live in memory shared between kernel and user space, so request and completion metadata never needs to be copied across the boundary; registered buffers extend this saving to the data path. IORING_SETUP_SQPOLL enables kernel‑side polling of the SQ, removing even the submission system call.

A single API handles both network and storage I/O, simplifying code.

Performance tips

Queue depth: choose a power‑of‑two size that matches the workload (e.g., 128–1024 for high‑throughput servers, 64–128 for memory‑constrained environments).

SQPOLL: enable IORING_SETUP_SQPOLL for ultra‑low latency; optionally bind the poll thread to a specific CPU and set an idle timeout.

Registered buffers: call io_uring_register_buffers once and reuse the buffers to avoid per‑request mapping overhead.

Multithreading: a single ring is not safe for concurrent submission from multiple threads without synchronization; the common pattern is one ring per thread, which keeps submissions and completions local and lock‑free.

Real‑world adoption

High‑performance servers such as Nginx (≥ 1.19.0) and Kong API Gateway report ~30 % higher throughput under 10 k concurrent connections. The Rust‑based Limbo database gains ~40 % transaction throughput. The wcp file‑copy tool achieves up to 70 % speedup over the traditional cp command.

Common pitfalls and mitigation

Kernel version : io_uring requires Linux ≥ 5.1; provide a fallback path for older kernels.

Error handling: always inspect cqe->res; a negative value is -errno and can be translated with strerror(-cqe->res).

Complexity : use the liburing helper functions or higher‑level wrappers to reduce boilerplate.

Minimal example (file read)

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strerror */

int main() {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1; }

    int fd = open("example.txt", O_RDONLY);
    if (fd < 0) { perror("open"); io_uring_queue_exit(&ring); return 1; }

    char *buf = malloc(1024);
    if (!buf) { close(fd); io_uring_queue_exit(&ring); return 1; }

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 1024, 0);   /* read up to 1024 bytes at offset 0 */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        if (cqe->res >= 0)
            printf("Read %d bytes: %.*s\n", cqe->res, cqe->res, buf);
        else
            fprintf(stderr, "Read error: %s\n", strerror(-cqe->res));
        io_uring_cqe_seen(&ring, cqe);
    }
    close(fd);
    free(buf);
    io_uring_queue_exit(&ring);
    return 0;
}
Tags: performance optimization, io_uring, Linux, C programming, asynchronous I/O
Written by

Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
