
Why io_uring Beats epoll: Unlocking Ultra‑Fast Asynchronous I/O on Linux

This article traces the evolution of Linux I/O from blocking and non‑blocking models through epoll to the modern io_uring framework, explains its design goals, core concepts, system‑call workflow, and provides practical code examples and performance comparisons for high‑concurrency network and storage applications.

Deepin Linux

Part 1: The Past and Present of Linux I/O

In the Linux world, I/O efficiency has always been a key factor influencing system performance. From the early simple blocking I/O to later non‑blocking I/O and the emergence of I/O multiplexing, each technological shift has broken performance bottlenecks. Among many I/O techniques, epoll once represented high‑performance I/O with its event‑driven model, widely used in Nginx, Redis and other projects. However, with explosive data growth and increasingly complex scenarios, epoll shows limitations in extreme high‑concurrency cases.

At this point, io_uring arrives as a revolutionary Linux kernel asynchronous I/O framework, aiming to break the performance constraints of traditional async I/O models with a disruptive design.

What makes io_uring unique, and how does it achieve breakthroughs compared to epoll? Let’s explore the journey from epoll to io_uring.

1.1 Blocking I/O

Blocking I/O is the most basic and intuitive model. When an application performs an I/O operation (e.g., read or write), the process is blocked until the operation completes. This is like waiting at a restaurant table for food; you cannot do anything else while waiting.

Blocking I/O is easy to program, but in high‑concurrency servers each request may block a thread, leading to low concurrency and poor resource utilization.
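To make the model concrete, here is a minimal sketch (illustrative, not from the original article; the pipe and the "hello" payload are invented for the demo) in which read() suspends the caller until data is available:

```c
#include <string.h>
#include <unistd.h>

// Demonstrates blocking I/O: read() does not return until data is available.
// Returns the number of bytes read from the pipe, or -1 on error.
ssize_t blocking_read_demo(char *buf, size_t len) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    // Pre-fill the pipe; with an empty pipe this read() would block forever.
    if (write(fds[1], "hello", 5) != 5) return -1;

    // The calling thread is suspended until the kernel has data to deliver.
    ssize_t n = read(fds[0], buf, len);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

One blocked thread per in-flight request is exactly the cost the text describes.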

1.2 Non‑blocking I/O

To solve the blocking problem, non‑blocking I/O was introduced. If data is not ready, the kernel returns an error (EWOULDBLOCK or EAGAIN) immediately, allowing the application to continue other work and poll later. This improves concurrency, but frequent polling consumes CPU and adds programming complexity.
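A minimal sketch of the same idea (illustrative, not from the article): with O_NONBLOCK set, reading an empty pipe fails immediately with EAGAIN/EWOULDBLOCK instead of blocking:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Demonstrates non-blocking I/O: with O_NONBLOCK set, read() on an empty
// pipe fails immediately instead of suspending the caller.
// Returns 1 if the read would have blocked, 0 if data came back, -1 on error.
int nonblocking_read_demo(void) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    // Switch the read end into non-blocking mode.
    int fl = fcntl(fds[0], F_GETFL);
    fcntl(fds[0], F_SETFL, fl | O_NONBLOCK);

    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof(buf));  // the pipe is empty
    int would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(fds[0]);
    close(fds[1]);
    return would_block ? 1 : (n >= 0 ? 0 : -1);
}
```

The application must now decide when to retry, which is the polling cost the text mentions.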

1.3 I/O Multiplexing

I/O multiplexing builds on non‑blocking I/O, allowing a process to monitor multiple file descriptors (e.g., sockets) simultaneously. When any descriptor becomes ready, the kernel notifies the process. Technologies include select, poll, and epoll. This is like ordering multiple dishes and being notified when each is ready, reducing thread consumption but still facing scaling issues.
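As an illustrative sketch (the pipe and "ping" payload are invented for the demo), a single epoll instance can watch a descriptor and wake only when it is ready:

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

// Demonstrates I/O multiplexing: epoll_wait() reports which of the watched
// descriptors are ready, so one thread can service many of them.
// Returns the number of bytes read from the ready descriptor, or -1 on error.
ssize_t epoll_demo(char *buf, size_t len) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

    write(fds[1], "ping", 4);                   // make the read end readable

    struct epoll_event ready;
    int n = epoll_wait(epfd, &ready, 1, 1000);  // wake when any fd is ready
    ssize_t got = -1;
    if (n == 1 && (ready.events & EPOLLIN))
        got = read(ready.data.fd, buf, len);

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Note that epoll only reports readiness; the application still issues the read() itself, a point the io_uring comparison below turns on.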

1.4 Limitations of Traditional I/O Models

Large system‑call overhead: each I/O operation requires a transition from user to kernel space, which becomes costly under high concurrency.

Multiple data copies: data often moves between user and kernel space repeatedly, increasing latency.

Limited asynchronous capability: non‑blocking and multiplexing still require the application to poll or wait for events, preventing full hardware utilization.

To overcome these limits, Linux introduced io_uring, a new asynchronous I/O model offering higher efficiency and stronger capabilities.

Part 2: What Is io_uring

2.1 Definition and Origin

io_uring is a high‑performance asynchronous I/O framework introduced in Linux 5.1 by Jens Axboe. Before io_uring, traditional async models like epoll or POSIX AIO suffered from high system‑call overhead, many data copies, and limited async capability. io_uring aims to provide a more efficient solution.

2.2 Design Goals and Features

Unified network and disk async I/O: previously network and disk I/O used different mechanisms; io_uring provides a single interface for both.

Complete and unified async API: simplifies async programming by offering a consistent set of functions.

Support for asynchronous, polled, lock‑free, zero‑copy operation: reduces system‑call count, avoids lock contention, and minimizes data copies between user and kernel space.

2.3 io_uring Design Ideas

(1) Reducing system‑call overhead: batching many operations into a small, fixed number of system calls makes the per‑operation syscall cost effectively constant.

(2) Reducing copy overhead: a pair of ring buffers (the submission queue and the completion queue) shared between user and kernel space eliminates unnecessary copies.

(3) Fixing the unfriendly API: consolidating multiple system calls into one and driving behavior through parameters makes the API easier to use.

3.1 Core Concepts

(1) Ring Buffers: io_uring uses two ring buffers, the Submission Queue (SQ) and the Completion Queue (CQ), shared between kernel and user space via mmap.

Submission Queue holds I/O request entries (SQE) containing operation type, file descriptor, buffer address, length, etc.

Completion Queue holds results (CQE) with return values and user data.

The producer‑consumer model lets the application produce SQEs while the kernel consumes them, and the kernel produces CQEs while the application consumes them, reducing system calls and context switches.
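The index scheme can be sketched in plain C (an illustration of the head/tail discipline only; the real shared rings also need memory barriers between user space and the kernel, which a single-threaded demo does not):

```c
#include <stdint.h>

// A simplified single-producer/single-consumer ring, illustrating the
// head/tail index scheme io_uring's SQ and CQ use over shared memory.
#define RING_ENTRIES 8                    // must be a power of two
#define RING_MASK (RING_ENTRIES - 1)

struct demo_ring {
    uint32_t head;                        // advanced by the consumer
    uint32_t tail;                        // advanced by the producer
    int entries[RING_ENTRIES];
};

// Producer side: analogous to filling an SQE and bumping the SQ tail.
int ring_push(struct demo_ring *r, int value) {
    if (r->tail - r->head == RING_ENTRIES) return -1;   // full
    r->entries[r->tail & RING_MASK] = value;
    r->tail++;                            // publish the new entry
    return 0;
}

// Consumer side: analogous to reading a CQE and bumping the CQ head.
int ring_pop(struct demo_ring *r, int *value) {
    if (r->head == r->tail) return -1;                  // empty
    *value = r->entries[r->head & RING_MASK];
    r->head++;                            // mark the entry as consumed
    return 0;
}
```

Because the indices only ever grow and are masked on access, producer and consumer never write the same field, which is what lets the two sides run without locks.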

(2) Asynchronous I/O Operations: after submitting a request to the SQ, the kernel processes it asynchronously while the application continues executing other tasks. When the operation finishes, the kernel writes a CQE to the CQ and notifies the application (e.g., via epoll). This allows high concurrency without blocking.

(3) Batch Operations and Extended Support: io_uring supports batch submission and processing, further reducing system‑call overhead. It also supports many operation types: read, write, open, close, send, recv, accept, connect, fsync, fdatasync, etc., enabling high‑performance network servers and storage services.

3.2 Working Principle

(1) Submission Queue Workflow

Obtain a free SQE via io_uring_get_sqe.

Set request parameters (opcode, fd, offset, buffer address, length, flags, user data).

Place the SQE index into SQ and update the tail pointer.

(2) Completion Queue Workflow

Kernel writes a CQE with result and user data, updates the tail pointer.

Application retrieves the CQE via io_uring_wait_cqe (blocking) or io_uring_peek_cqe (non‑blocking).

Application processes the result.

Mark the CQE as seen with io_uring_cqe_seen, advancing the head pointer.

(3) Kernel‑User Interaction: Shared memory (mmap) maps the SQ and CQ into both address spaces, eliminating most data copies. Only one system call (io_uring_enter) is needed to notify the kernel of new submissions, drastically reducing overhead.

3.3 System Calls Detail

io_uring uses only three syscalls: io_uring_setup, io_uring_enter, and io_uring_register.

io_uring_setup initializes the io_uring instance, allocating SQ, CQ, and control structures, and returns a file descriptor.

SYSCALL_DEFINE2(io_uring_setup, u32, entries, struct io_uring_params __user *, params) {
    return io_uring_setup(entries, params);
}
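From user space, the same entry point can be exercised with a raw syscall (a sketch assuming a kernel with io_uring support, i.e., 5.1 or newer; no liburing needed):

```c
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Calling io_uring_setup directly: the kernel fills `params` with the
// ring sizes and mmap offsets and returns a file descriptor for the new
// io_uring instance. Returns 0 on success, -1 on failure.
int raw_io_uring_setup_demo(void) {
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    // Ask for an 8-entry submission queue; the kernel rounds sizes up
    // and reports the actual geometry back through `params`.
    int ring_fd = (int)syscall(__NR_io_uring_setup, 8, &params);
    if (ring_fd >= 0)
        close(ring_fd);
    return ring_fd >= 0 ? 0 : -1;
}
```

In practice liburing's io_uring_queue_init wraps this call plus the subsequent mmap of both rings.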

io_uring_enter submits and optionally waits for I/O operations.

SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, u32, min_complete, u32, flags, const void __user *, argp, size_t, argsz)

io_uring_register registers files, buffers, or eventfds with the ring.

SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, void __user *, arg, unsigned int, nr_args)

3.4 Deep Workflow

(1) Create an io_uring object via io_uring_setup, mmap the rings, and obtain a file descriptor.

(2) Prepare I/O requests using the io_uring_prep_* helpers (e.g., io_uring_prep_read).

(3) Submit requests with io_uring_submit (which internally calls io_uring_enter).

(4) Wait for completions using io_uring_wait_cqe or io_uring_peek_batch_cqe.

(5) Retrieve results from CQE, handle success or error.

(6) Release the CQE with io_uring_cqe_seen so the kernel can reuse the slot.

Part 3: io_uring Application Examples

4.1 io_uring Application Scenarios

(1) High‑performance network services: with io_uring, web servers and proxy servers can handle massive numbers of concurrent connections with fewer threads, lower latency, and zero‑copy data transfer, outperforming traditional epoll‑based designs.

(2) Database systems: databases like PostgreSQL can batch write requests, reduce copy overhead, and achieve higher throughput for both reads and writes.

(3) Large‑scale file system operations: object storage, block storage, and distributed file systems benefit from async batch I/O, reducing response time for massive file uploads and downloads.

4.2 io_uring Case Studies

① Simple File Read/Write

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>
int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int fd = open("example.txt", O_RDONLY);
    if (fd < 0) return 1;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;
    sqe = io_uring_get_sqe(&ring);
    char *buf = malloc(1025);           /* one extra byte for the '\0' */
    io_uring_prep_read(sqe, fd, buf, 1024, 0);
    io_uring_submit(&ring);
    if (io_uring_wait_cqe(&ring, &cqe) < 0) return 1;
    if (cqe->res < 0) {
        fprintf(stderr, "Async read failed: %s\n", strerror(-cqe->res));
    } else {
        buf[cqe->res] = '\0';           /* terminate before printing as a string */
        printf("Read %d bytes: %s\n", cqe->res, buf);
    }
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}

② Network Programming (TCP Echo Server)

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <liburing.h>
#define ENTRIES_LENGTH 4096
#define MAX_CONNECTIONS 1024
#define BUFFER_LENGTH 1024
char buf_table[MAX_CONNECTIONS][BUFFER_LENGTH] = {0};
enum { READ, WRITE, ACCEPT };
struct conninfo { int connfd; int type; };
void set_read_event(struct io_uring *ring, int fd, void *buf, size_t len, int flags) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, fd, buf, len, flags);
    struct conninfo ci = {.connfd = fd, .type = READ};
    memcpy(&sqe->user_data, &ci, sizeof(ci));
}
void set_write_event(struct io_uring *ring, int fd, const void *buf, size_t len, int flags) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, fd, buf, len, flags);
    struct conninfo ci = {.connfd = fd, .type = WRITE};
    memcpy(&sqe->user_data, &ci, sizeof(ci));
}
void set_accept_event(struct io_uring *ring, int fd, struct sockaddr *cliaddr, socklen_t *clilen, unsigned flags) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_accept(sqe, fd, cliaddr, clilen, flags);
    struct conninfo ci = {.connfd = fd, .type = ACCEPT};
    memcpy(&sqe->user_data, &ci, sizeof(ci));
}
int main() {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in servaddr, clientaddr;
    memset(&servaddr, 0, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(9999);
    bind(listenfd, (struct sockaddr *)&servaddr, sizeof(servaddr));
    listen(listenfd, 10);
    struct io_uring_params params = {0};
    struct io_uring ring = {0};
    io_uring_queue_init_params(ENTRIES_LENGTH, &ring, &params);
    socklen_t clilen = sizeof(clientaddr);
    set_accept_event(&ring, listenfd, (struct sockaddr *)&clientaddr, &clilen, 0);
    while (1) {
        struct io_uring_cqe *cqe;
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        struct io_uring_cqe *cqes[10];
        int cqecount = io_uring_peek_batch_cqe(&ring, cqes, 10);
        unsigned count = 0;
        for (int i = 0; i < cqecount; i++) {
            cqe = cqes[i];
            count++;
            struct conninfo ci;
            memcpy(&ci, &cqe->user_data, sizeof(ci));
            if (ci.type == ACCEPT) {
                int connfd = cqe->res;
                char *buffer = buf_table[connfd];
                set_read_event(&ring, connfd, buffer, 1024, 0);
                set_accept_event(&ring, listenfd, (struct sockaddr *)&clientaddr, &clilen, 0);
            } else if (ci.type == READ) {
                int bytes = cqe->res;
                if (bytes <= 0) close(ci.connfd);
                else {
                    char *buffer = buf_table[ci.connfd];
                    set_write_event(&ring, ci.connfd, buffer, bytes, 0);
                }
            } else if (ci.type == WRITE) {
                char *buffer = buf_table[ci.connfd];
                set_read_event(&ring, ci.connfd, buffer, 1024, 0);
            }
        }
        io_uring_cq_advance(&ring, count);
    }
    return 0;
}

4.3 Performance Comparison Tests

(1) Test Environment: a server with an Intel Xeon E5‑2682 v4 @ 2.5 GHz, 16 GB RAM, a Linux 5.10 kernel, an NVMe SSD, and a 1 GbE network.

(2) Methodology: fio drives file I/O (blocking, non‑blocking, epoll, io_uring) across varying file sizes and access patterns; an echo server compares epoll vs. io_uring under increasing concurrent connections (100–1000).

(3) Results: for small files (1 MiB), blocking I/O yields ~50 MiB/s, non‑blocking ~80 MiB/s, and epoll ~120 MiB/s, while io_uring reaches higher throughput and lower latency, especially as concurrency grows.

Part 4: Comparing io_uring with Other I/O Models

5.1 Comparison with Blocking I/O

Blocking I/O blocks the thread until the operation finishes, leading to low concurrency and poor CPU utilization. In a high‑concurrency web server, each client would need a dedicated thread, causing performance collapse. io_uring allows the thread to submit requests and continue working, dramatically improving resource utilization.

5.2 Comparison with Non‑blocking I/O

Non‑blocking I/O returns immediately with EAGAIN/EWOULDBLOCK, requiring the application to poll repeatedly, which wastes CPU cycles and adds complexity. io_uring eliminates polling by using shared SQ/CQ rings; the kernel notifies completion, reducing system‑call overhead and CPU usage.

5.3 Comparison with epoll

epoll monitors many file descriptors and copies ready events from kernel to user space. io_uring instead uses two shared ring buffers, avoiding most copies and supporting batch submission. In high‑concurrency scenarios (≥1000 connections), io_uring outperforms epoll, reaching roughly 240k QPS per core versus epoll's ~200k QPS. The advantage can shrink in some environments, e.g., on kernels without Spectre/Meltdown mitigations, where system calls are cheaper to begin with.

Part 5: Caveats and Challenges of Using io_uring

6.1 Kernel Version Requirements

io_uring requires Linux 5.10 or newer for full feature support. Older kernels may lack stability or performance. Upgrading the kernel should be done carefully, with backups and testing.

6.2 Programming Complexity

Direct use of the io_uring syscalls involves managing the SQ/CQ rings, setting the right parameters, and handling results, which is error‑prone. The liburing library provides higher‑level wrappers (e.g., io_uring_queue_init, io_uring_get_sqe, io_uring_submit) that simplify development.

6.3 Application Migration Difficulty

Migrating existing codebases (e.g., epoll‑based servers) to io_uring often requires substantial redesign of I/O handling logic, careful testing, and may expose compatibility issues with other libraries.

Tags: Performance Optimization, io_uring, system programming, Linux I/O, asynchronous I/O
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
