
Why io_uring Is the Game‑Changer for Linux Asynchronous I/O (And How to Master It)

This article provides a comprehensive, step‑by‑step analysis of Linux's io_uring, covering its architecture, design principles, workflow, performance advantages over traditional models like epoll, practical C code examples, optimization techniques, real‑world use cases, and the challenges developers may face when adopting it.


Introduction

Asynchronous I/O is a key differentiator for high‑performance Linux servers. The io_uring interface, added in Linux 5.1, provides a low‑overhead, truly asynchronous I/O model.

Traditional I/O Models

Blocking I/O

A read/write call blocks the process until the operation completes, limiting concurrency.

Non‑blocking I/O

File descriptors are set with O_NONBLOCK. Calls return EAGAIN when data is not ready, requiring the application to poll.

I/O Multiplexing (select/poll/epoll)

epoll is the most efficient Linux multiplexing API, separating the monitoring and waiting phases and maintaining a ready‑list to avoid scanning all descriptors.

io_uring Basics

io_uring uses two shared ring buffers mapped into user space:

Submission Queue (SQ) : the application writes io_uring_sqe entries describing I/O requests.

Completion Queue (CQ) : the kernel writes io_uring_cqe entries with results.

Typical structures (simplified; the real definitions in <linux/io_uring.h> contain unions and additional fields):

struct io_uring_sqe {
    __u8  opcode;
    __u8  flags;
    __u16 ioprio;
    __s32 fd;
    __u64 offset;
    __u64 addr;   // user buffer address
    __u32 len;
    __u64 user_data;
};

struct io_uring_cqe {
    __u64 user_data; // matches SQE user_data
    __s32 res;       // bytes transferred or -errno
    __u32 flags;
};

The queues are created with io_uring_queue_init() (or io_uring_queue_init_params()) and are accessed via mmap, eliminating extra copies.

Core Mechanics

Shared Ring Buffers

Both SQ and CQ reside in a memory region shared between user space and the kernel. The application writes SQEs; the kernel reads them directly, performs the I/O, and writes CQEs.

Batch Submission

Multiple SQEs can be prepared and submitted with a single io_uring_submit() call, reducing the number of system calls from N to 1.

Kernel Processing

The kernel consumes SQ entries, executes the requested operation (read, write, accept, send, etc.), and posts a CQE containing the result and the original user_data for correlation.

Completion Handling

Applications retrieve CQEs with io_uring_wait_cqe() (blocking) or io_uring_peek_cqe() (non‑blocking). After processing, io_uring_cqe_seen() marks the entry as consumed.

Comparison with epoll

System‑call count : with epoll, each ready event still requires its own read/write system call afterwards; io_uring can batch many complete operations into a single syscall.

Asynchronous capability : epoll only notifies readiness; io_uring performs the I/O in the kernel without further calls.

Supported operations : io_uring can directly issue open, fsync, read, write, accept, send, recv, etc.

Performance Optimizations

Queue depth : Choose a depth that matches workload and memory constraints (e.g., 1024 for high‑throughput servers, 64‑128 for memory‑constrained environments).

SQPoll mode : Enable IORING_SETUP_SQPOLL to let a kernel thread continuously poll the SQ; while that thread is awake, submissions require no io_uring_enter() system call at all.

Buffer registration : Register buffers with io_uring_register_buffers() so the kernel can access them directly, avoiding per‑request copies.

Multithreading : a single io_uring instance is not thread‑safe; prefer one ring per thread, or protect a shared ring with a mutex around SQE acquisition and submission.

Typical C Example (File Read)

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <liburing.h>

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int fd, ret;

    fd = open("example.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) { perror("io_uring_queue_init"); close(fd); return 1; }

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) { fprintf(stderr, "Could not get sqe\n"); io_uring_queue_exit(&ring); close(fd); return 1; }

    char *buf = malloc(1024);
    if (!buf) { perror("malloc"); io_uring_queue_exit(&ring); close(fd); return 1; }
    io_uring_prep_read(sqe, fd, buf, 1024, 0);
    io_uring_submit(&ring);

    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) { perror("io_uring_wait_cqe"); }
    else if (cqe->res < 0) {
        fprintf(stderr, "Async read failed: %s\n", strerror(-cqe->res));
    } else {
        printf("Read %d bytes: %.*s\n", cqe->res, cqe->res, buf);
    }

    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}

Design Principles

Reduce system‑call overhead by batching submissions.

Eliminate unnecessary data copies through shared memory.

Provide true asynchronous execution where the kernel completes I/O independently of the caller.

Multithreading Considerations

When multiple threads submit requests, protect the shared io_uring instance with a mutex or, better, give each thread its own ring. liburing's helpers assume a single submitter, so unsynchronized access to a shared SQ is not safe.

Workflow Overview

Initialization : Call io_uring_queue_init() (or *_params) to allocate SQ/CQ and map them.

Submit I/O : Obtain an SQE with io_uring_get_sqe(), fill it (e.g., io_uring_prep_read()), optionally set user_data, then call io_uring_submit() (or rely on SQPoll).

Kernel processing : The kernel consumes SQEs, performs the operation, and writes CQEs.

Completion handling : Retrieve CQEs via io_uring_wait_cqe() or io_uring_peek_cqe(), check cqe->res, use cqe->user_data to match the request, and finally call io_uring_cqe_seen().

Real‑World Use Cases

High‑performance web servers (e.g., Nginx) achieve ~30 % higher throughput and ~20 % lower latency at 10 k concurrent connections.

API gateways (e.g., Kong) reduce request processing time by >15 %.

Modern databases (e.g., Limbo) gain ~40 % higher transaction throughput and ~35 % lower latency.

Large‑scale file copy tools (e.g., wcp) can be up to 70 % faster than the traditional cp command.

Challenges and Mitigation

Programming complexity : The asynchronous API requires careful tracking of user_data and ordering. Using the liburing helper library or higher‑level wrappers reduces boilerplate.

Compatibility : io_uring requires Linux 5.1+. Ensure the target system runs a recent kernel or provide a fallback to epoll/POSIX‑AIO.

Error handling : Errors are reported in cqe->res. Always check this field, translate negative values with strerror(), and log sufficient context (e.g., request ID from user_data).

Conclusion

io_uring eliminates per‑operation system calls, reduces data copies, and enables true kernel‑side batching. Proper tuning—appropriate queue depth, optional SQPoll, buffer registration, and multithreaded design—delivers substantial latency and throughput gains for servers, databases, and file‑intensive workloads.

Written by

Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
