
Unlock Ultra‑Fast Linux I/O: How io_uring Revolutionizes Asynchronous Operations

This article explores the evolution of Linux I/O models—from blocking and non‑blocking to epoll—and introduces io_uring as a high‑performance asynchronous framework that reduces system‑call overhead, eliminates data copies, and unifies network and disk I/O for modern high‑concurrency applications.


In computer systems, I/O performance is a key factor affecting overall system throughput. Whether it is file read/write, network communication, or database access, efficient I/O handling is crucial. On Linux, the I/O model has evolved from early blocking I/O to the powerful io_uring, each iteration bringing higher efficiency and flexibility for developers.

1. Pain Points of Traditional I/O Models

Before diving into io_uring, let’s review traditional I/O models and the challenges they face under high concurrency and performance demands.

1.1 Blocking I/O

Blocking I/O is the most basic model. When an application performs an I/O operation (e.g., read or write), the process is blocked until the operation completes. This is like waiting at a restaurant table for your food; you cannot do anything else while waiting.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#define BUFFER_SIZE 1024
int main() {
    int fd = open("example.txt", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    char buffer[BUFFER_SIZE];
    ssize_t bytes_read = read(fd, buffer, BUFFER_SIZE);
    if (bytes_read == -1) {
        perror("read");
        close(fd);
        return 1;
    }
    printf("Read %zd bytes: %.*s\n", bytes_read, (int)bytes_read, buffer);
    close(fd);
    return 0;
}

In a high‑concurrency web server built on blocking I/O, each client connection requires a dedicated thread. As connections increase, thread resources are exhausted and performance drops sharply: every blocked thread consumes stack space and register state, and thread creation and destruction add further overhead.

1.2 Non‑Blocking I/O

Non‑blocking I/O avoids blocking by returning an error (EWOULDBLOCK or EAGAIN) when data is not ready, allowing the application to continue other work and poll later. This is similar to being told to wait for a delivery and checking back periodically.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#define BUFFER_SIZE 1024
int main() {
    /* Note: O_NONBLOCK has no effect on regular files (reads never return
       EAGAIN); the pattern below is what you would write for sockets or pipes. */
    int fd = open("example.txt", O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    char buffer[BUFFER_SIZE];
    ssize_t bytes_read;
    while (1) {
        bytes_read = read(fd, buffer, BUFFER_SIZE);
        if (bytes_read == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                usleep(1000);   /* data not ready yet: back off briefly, then retry */
                continue;
            } else {
                perror("read");
                close(fd);
                return 1;
            }
        }
        break;
    }
    printf("Read %zd bytes: %.*s\n", bytes_read, (int)bytes_read, buffer);
    close(fd);
    return 0;
}

While non‑blocking I/O improves concurrency, frequent polling consumes CPU cycles and adds programming complexity due to extensive error handling.

1.3 I/O Multiplexing

I/O multiplexing (select, poll, epoll) builds on non‑blocking I/O, allowing a process to monitor multiple descriptors and act when any become ready. For example, select notifies the application when one of several sockets has data, similar to a waiter announcing which dishes are ready.

#include <stdio.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#define BUFFER_SIZE 1024
int main() {
    /* Regular files are used here for simplicity; select() reports them as
       always readable, so sockets are the typical real-world use case. */
    int fd1 = open("file1.txt", O_RDONLY | O_NONBLOCK);
    int fd2 = open("file2.txt", O_RDONLY | O_NONBLOCK);
    if (fd1 == -1 || fd2 == -1) {
        perror("open");
        return 1;
    }
    fd_set read_fds;
    FD_ZERO(&read_fds);
    FD_SET(fd1, &read_fds);
    FD_SET(fd2, &read_fds);
    int max_fd = (fd1 > fd2) ? fd1 : fd2;
    int ret = select(max_fd + 1, &read_fds, NULL, NULL, NULL);
    if (ret == -1) {
        perror("select");
        close(fd1);
        close(fd2);
        return 1;
    } else if (ret > 0) {
        char buffer[BUFFER_SIZE];
        if (FD_ISSET(fd1, &read_fds)) {
            ssize_t bytes_read = read(fd1, buffer, BUFFER_SIZE);
            if (bytes_read == -1) perror("read fd1");
            else printf("Read from fd1: %.*s\n", (int)bytes_read, buffer);
        }
        if (FD_ISSET(fd2, &read_fds)) {
            ssize_t bytes_read = read(fd2, buffer, BUFFER_SIZE);
            if (bytes_read == -1) perror("read fd2");
            else printf("Read from fd2: %.*s\n", (int)bytes_read, buffer);
        }
    }
    close(fd1);
    close(fd2);
    return 0;
}

Even with epoll, the kernel must still copy data between user and kernel space, and each I/O operation incurs system‑call overhead.

2. io_uring Enters the Stage

2.1 What Is io_uring?

io_uring is a high‑performance asynchronous I/O framework introduced in Linux 5.1 (2019) by Jens Axboe. It addresses the inefficiencies of traditional async models (epoll, POSIX AIO) by providing low‑latency, low‑overhead, fully asynchronous I/O.

The core concepts are Submission Queue (SQ), Completion Queue (CQ), Submission Queue Entry (SQE), and Completion Queue Entry (CQE).

Submission Queue (SQ): a ring buffer in shared memory where the user places I/O requests (SQEs).

Completion Queue (CQ): a ring buffer where the kernel posts completed request results (CQEs).

SQE: describes a single I/O operation (type, fd, buffer address, length, offset, etc.).

CQE: contains the result of an I/O operation (return value, status, user data).

2.2 Design Goals and Features

io_uring unifies network and disk asynchronous I/O behind a complete asynchronous API. It supports polled, lock‑free, and zero‑copy operation, and it reduces system‑call overhead by sharing memory (via mmap) between user space and the kernel.

2.3 Design Ideas

(1) Reduce system‑call overhead: batch many logical operations into a constant number of system calls.

(2) Eliminate copy overhead: share memory between user space and the kernel, avoiding unnecessary data copies.

(3) Friendly API: consolidate multiple system calls into a single interface where possible.

The shared memory serves both directions: one region carries kernel‑to‑user communication (the CQ) and another carries user‑to‑kernel communication (the SQ). The application produces SQEs and the kernel consumes them; symmetrically, the kernel produces CQEs and the application consumes them.

3. Implementation Details of io_uring

io_uring implements asynchronous I/O using a producer‑consumer model:

User process produces I/O requests into the Submission Queue (SQ).

Kernel consumes SQ entries, performs I/O, and places results into the Completion Queue (CQ).

User process consumes CQ entries to retrieve results.

Both queues are created during io_uring_setup and are memory‑mapped into user space to avoid extra copies.

3.1 Core Component Analysis

The SQ and SQE store request details (operation type, fd, buffer address, length, offset). The CQ and CQE store completion results (bytes transferred or error code, user data).

3.2 System Calls

Only three syscalls are needed:

io_uring_setup: initializes an io_uring instance, allocating the SQ and CQ structures and returning a file descriptor.

io_uring_enter: submits pending SQEs and optionally waits for completions.

io_uring_register: registers files, buffers, or eventfds with the ring to avoid per‑request setup.

3.3 Working Flow

Initialization: call io_uring_setup and mmap the SQ and CQ.

Prepare I/O: obtain an SQE via io_uring_get_sqe and fill it with the io_uring_prep_* helpers.

Submit: call io_uring_submit (which invokes io_uring_enter).

Kernel processing: the kernel consumes SQEs and performs the I/O.

Completion notification: the kernel writes CQEs and updates the CQ tail.

User retrieval: call io_uring_wait_cqe or io_uring_peek_cqe, process the CQE, then mark it consumed with io_uring_cqe_seen.

Repeat: continue submitting and retrieving as needed.

4. Comparison with Other I/O Models

4.1 Blocking I/O vs. io_uring

Blocking I/O ties up a thread per request, leading to high context‑switch overhead under load. io_uring allows a single thread to handle many requests asynchronously, improving resource utilization and reducing latency.

4.2 Non‑Blocking I/O vs. io_uring

Non‑blocking I/O requires active polling, wasting CPU cycles. io_uring’s completion queue notifies the application only when I/O finishes, eliminating unnecessary polling.

4.3 epoll vs. io_uring

epoll still requires a system call per event and separate read/write calls, while io_uring can batch many operations into a single submit and receive completions in bulk, reducing syscall count and enabling true kernel‑side async processing.

5. Application Scenarios

5.1 High‑Performance Network Services

Projects like the Nginx io_uring module replace epoll with io_uring, allowing the server to submit many accept/read/write requests at once, dramatically lowering latency during traffic spikes such as flash sales.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <liburing.h>
#define PORT 8080
#define BUFFER_SIZE 1024
#define QUEUE_DEPTH 1024
/* request context struct omitted for brevity */
int main() {
    /* socket setup */
    /* io_uring initialization */
    /* accept loop using io_uring_get_sqe, io_uring_prep_accept, io_uring_submit */
    /* handle CQEs for accept, recv, send */
    return 0;
}

5.2 Database Systems

Databases benefit from reduced I/O latency. Ceph, for example, sees 20‑30% higher IOPS and lower latency when io_uring is enabled. Transaction processing can use io_uring to batch reads, writes, and log appends.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <liburing.h>
#define QUEUE_DEPTH 256
#define BLOCK_SIZE 4096
#define LOG_ENTRY_SIZE 512
#define MAX_TRANSACTIONS 100
/* db_request struct and helper functions omitted for brevity */
int main() {
    struct io_uring ring;
    /* init io_uring, open db and log files */
    /* submit read/write/log requests for each transaction */
    /* wait for completions and handle them */
    return 0;
}

5.3 Large‑Scale File Transfer

io_uring’s zero‑copy and batch submission make it ideal for moving multi‑gigabyte media files. The following example registers buffers and streams data from source to destination with minimal CPU overhead.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <liburing.h>
#define QUEUE_DEPTH 128
#define BLOCK_SIZE (1024*1024)
/* transfer_request struct and helper functions omitted for brevity */
int main(int argc, char *argv[]) {
    /* parse source/destination pairs, init io_uring, start transfers */
    return 0;
}

6. Code Practice

6.1 Environment Setup

io_uring requires Linux kernel 5.1+. Verify with uname -r. Install liburing from source:

git clone https://git.kernel.dk/liburing
cd liburing
./configure --cc=gcc --cxx=g++
make -j$(nproc)
sudo make install

The library installs its headers under /usr/local/include and the shared object to /usr/local/lib.

6.2 Simple File‑Read Example

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define QUEUE_DEPTH 1
#define BUFFER_SIZE 4096
int main() {
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    struct io_uring_sqe *sqe;
    int ret, fd;
    char buffer[BUFFER_SIZE];
    fd = open("testfile.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) { perror("io_uring_queue_init"); return 1; }
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) { fprintf(stderr, "io_uring_get_sqe failed\n"); return 1; }
    io_uring_prep_read(sqe, fd, buffer, BUFFER_SIZE, 0);
    ret = io_uring_submit(&ring);
    if (ret < 0) { perror("io_uring_submit"); return 1; }
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) { perror("io_uring_wait_cqe"); return 1; }
    if (cqe->res < 0) {
        fprintf(stderr, "I/O error: %s\n", strerror(-cqe->res));
        return 1;
    }
    write(STDOUT_FILENO, buffer, cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}

This program opens a file, initializes an io_uring instance, submits a read request, waits for completion, prints the data, and cleans up.

6.3 Common Issues and Fixes

Initialization failure: ensure the kernel version is 5.1 or newer and raise the file‑descriptor limit (ulimit -n).

Submit errors: verify the SQ depth and that each SQE is correctly populated (valid fd, buffer address, length).

CQE retrieval failure: check kernel logs for errors; avoid signal interference in multithreaded programs.

Memory‑lock limits: increase ulimit -l or edit /etc/security/limits.conf to allow unlimited locked memory for applications that register large buffers.

io_uring architecture diagram

The diagram above illustrates the shared memory layout between user space and kernel space for SQ and CQ.

Tags: Performance, io_uring, zero-copy, Linux kernel, system calls, asynchronous I/O
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
