
Understanding and Using io_uring for High‑Performance Asynchronous I/O in Linux

This article introduces Linux's io_uring framework, explains its design goals and advantages over traditional I/O models, details its core components and system calls, provides step‑by‑step implementation examples for file and network operations, and discusses performance comparisons and practical application scenarios.

Deepin Linux

Linux I/O performance has evolved from simple blocking I/O to non‑blocking I/O, I/O multiplexing, and now to the revolutionary io_uring framework introduced in kernel 5.1, which dramatically improves asynchronous I/O efficiency.

io_uring is a high‑performance asynchronous I/O framework that addresses the system‑call overhead and data‑copy costs of traditional models such as epoll and POSIX AIO. It does so by sharing a pair of ring buffers—the Submission Queue (SQ) and the Completion Queue (CQ)—between user space and the kernel.

The core components are:

Submission Queue (SQ) and Submission Queue Entry (SQE): a ring buffer where the application places I/O requests. Each SQE contains operation type, file descriptor, buffer address, length, offset, etc.

Completion Queue (CQ) and Completion Queue Entry (CQE): a ring buffer where the kernel posts the results of completed I/O operations, including return value, status code and user data.

SQ Ring and CQ Ring: the ring‑buffer structures that hold head/tail indices and size information, enabling lock‑free communication.

Only three system calls are needed:

SYSCALL_DEFINE2(io_uring_setup, u32, entries, struct io_uring_params __user *, params)

Initialises an io_uring instance; the application then maps the SQ ring, CQ ring and SQE array into its address space with mmap.

SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, u32, min_complete, u32, flags, const void __user *, argp, size_t, argsz)

Submits pending SQEs to the kernel and optionally waits for a minimum number of completions.

SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, void __user *, arg, unsigned int, nr_args)

Registers files, buffers or eventfds with the io_uring instance to avoid per‑request setup overhead.

The typical workflow is:

Create an io_uring object with io_uring_setup (or the liburing helper io_uring_queue_init).

Obtain an SQE from the ring via io_uring_get_sqe, then fill it with the io_uring_prep_* helpers (e.g., io_uring_prep_read, io_uring_prep_write, io_uring_prep_accept).

Submit the queue with io_uring_submit (which internally calls io_uring_enter).

Wait for completions using io_uring_wait_cqe (blocking) or io_uring_peek_batch_cqe (non‑blocking).

Process each CQE, checking cqe->res for success or error, and use the user data field to identify the original request.

Mark the CQE as seen with io_uring_cqe_seen or advance the CQ head with io_uring_cq_advance .

Example 1 – Simple file read/write:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int fd, ret;

    fd = open("example.txt", O_RDONLY);
    if (fd < 0) { perror("Failed to open file"); return 1; }

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) { fprintf(stderr, "queue_init failed: %s\n", strerror(-ret)); close(fd); return 1; }

    sqe = io_uring_get_sqe(&ring);
    char *buf = malloc(1025);                 /* +1 byte for the terminating NUL */
    io_uring_prep_read(sqe, fd, buf, 1024, 0);
    io_uring_submit(&ring);

    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) { fprintf(stderr, "wait_cqe failed: %s\n", strerror(-ret)); }
    else if (cqe->res < 0) { fprintf(stderr, "Async read failed: %s\n", strerror(-cqe->res)); }
    else { buf[cqe->res] = '\0'; printf("Read %d bytes: %s\n", cqe->res, buf); }
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd); free(buf); return 0;
}

Example 2 – TCP echo server using io_uring:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <liburing.h>

#define ENTRIES_LENGTH 4096
#define MAX_CONNECTIONS 1024
#define BUFFER_LENGTH 1024

char buf_table[MAX_CONNECTIONS][BUFFER_LENGTH] = {0};

enum { READ, WRITE, ACCEPT };
struct conninfo { int connfd; int type; };    /* packs into the 8-byte user_data field */

void set_read_event(struct io_uring *ring, int fd, void *buf, size_t len, int flags) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, fd, buf, len, flags);
    struct conninfo ci = {.connfd = fd, .type = READ};
    memcpy(&sqe->user_data, &ci, sizeof(ci));
}

/* similar helpers for write and accept omitted for brevity */
void set_accept_event(struct io_uring *ring, int fd,
                      struct sockaddr *addr, socklen_t *len, int flags);

int main() {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in servaddr, clientaddr;
    memset(&servaddr, 0, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(9999);
    bind(listenfd, (struct sockaddr *)&servaddr, sizeof(servaddr));
    listen(listenfd, 10);

    struct io_uring ring;
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));       /* must point at a real struct */
    io_uring_queue_init_params(ENTRIES_LENGTH, &ring, &params);

    socklen_t clilen = sizeof(clientaddr);
    set_accept_event(&ring, listenfd, (struct sockaddr *)&clientaddr, &clilen, 0);

    while (1) {
        io_uring_submit(&ring);
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        /* process CQEs, re-arm events, close connections, etc. */
        io_uring_cq_advance(&ring, 1);
    }
    return 0;
}

Performance tests comparing blocking I/O, non‑blocking I/O, epoll and io_uring on a Xeon‑E5 server show that io_uring consistently achieves the highest throughput and lowest latency for both file and network workloads, thanks to its reduced system‑call count and the shared rings (plus optional registered buffers) that cut per‑request copying and setup costs.

Typical application scenarios include high‑throughput database engines (e.g., MySQL, PostgreSQL), high‑performance web servers (e.g., Nginx, Caddy), distributed storage systems (e.g., Ceph) and any workload that requires massive concurrent I/O with minimal CPU overhead.

Future directions point to deeper kernel integration, support for emerging storage class memory, expanded device drivers, and broader adoption in IoT edge devices, big‑data processing frameworks (Hadoop, Spark) and AI training pipelines.

Tags: io_uring, Linux kernel, high performance, network programming, asynchronous I/O
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.