Backend Development 45 min read

Why epoll Beats select and poll: A Deep Dive into Linux’s High‑Performance I/O Multiplexing

This article explains the limitations of traditional I/O models such as blocking I/O, non‑blocking I/O, select and poll, introduces epoll’s design, core data structures, operation modes, system‑call interface, and provides practical code examples and optimisation tips for high‑concurrency network servers.

Open Source Tech Hub

Jul 16, 2025

Why epoll Beats select and poll: A Deep Dive into Linux’s High‑Performance I/O Multiplexing

1. epoll Overview

In modern internet services, servers often face massive concurrent connections, for example during a flash‑sale or an online game with thousands of players. Efficient I/O handling becomes critical, and traditional models (blocking I/O, non‑blocking I/O, select/poll) show serious drawbacks under high load.

1.1 What is epoll?

epoll is the Linux kernel’s enhanced version of poll, introduced in kernel 2.5.44 to solve high‑concurrency I/O inefficiencies. It keeps a shared memory area between kernel and user space, maintains an event table, and returns only ready file descriptors via epoll_wait, avoiding the full descriptor scan required by select/poll.

1.2 Comparison with select/poll

Select is limited to ~1024 descriptors and copies the descriptor set between user and kernel on each call, causing high overhead. Poll removes the hard limit but still copies the entire array and traverses it linearly (O(n)). epoll removes the descriptor limit, uses a red‑black tree for registration and a ready‑list for events, achieving O(1) retrieval of ready descriptors.

1.3 Design ideas

epoll separates two steps that select combines: maintaining the wait queue and blocking the process. The kernel stores the wait queue in a red‑black tree (for fast insert/delete/search) and a doubly‑linked list for ready descriptors, allowing efficient event delivery.

1.4 Working modes

LT (Level‑Triggered) : The default mode. As long as a descriptor remains readable or writable, epoll_wait keeps notifying the application. This is similar to select/poll and is easier to program.

ET (Edge‑Triggered) : The kernel notifies only when the state changes from not‑ready to ready. The application must read/write until it receives EAGAIN/EWOULDBLOCK, otherwise further notifications are suppressed. ET offers higher performance but is harder to use correctly.

1.5 epoll API

int epoll_create(int size);

size

is ignored after Linux 2.6.8 but must be >0. epoll_create1(int flags) adds optional flags such as EPOLL_CLOEXEC.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

op

can be EPOLL_CTL_ADD, EPOLL_CTL_MOD, or EPOLL_CTL_DEL. The event structure contains an events mask (e.g., EPOLLIN, EPOLLOUT, EPOLLERR, EPOLLHUP, EPOLLET, EPOLLONESHOT) and a user data union.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

The call blocks until at least one event is ready, returns the number of ready events, or 0 on timeout. timeout is in milliseconds ( -1 for infinite, 0 for non‑blocking).

2. Core Implementation Details

2.1 Eventpoll data structures

When epoll_create is called, the kernel allocates an eventpoll object containing:

A spin‑lock and a mutex to protect the ready list.

Two wait queues: one for epoll_wait and one for file->poll.

A doubly‑linked list rdllist that holds ready epitem objects.

A red‑black tree rbr that indexes all registered file descriptors.

Each registered descriptor is represented by an epitem that links the descriptor to both the red‑black tree and the ready list.

2.2 Ready‑list handling

The ready list is a doubly‑linked list because it allows O(1) insertion and removal of descriptors that become ready. When an interrupt handler marks a socket as ready, it adds the corresponding epitem to rdllist. epoll_wait simply checks whether rdllist is empty; if not, it copies the entries to user space and returns.

2.3 Index structure

The red‑black tree provides O(log N) lookup, insertion, and deletion of descriptors, preventing duplicate registrations and enabling fast modifications.

2.4 Adding and removing descriptors

When epoll_ctl(EPOLL_CTL_ADD) is invoked, the kernel inserts the descriptor into the red‑black tree and attaches the epitem to the socket’s wait queue. Deleting removes the epitem from both structures. Closing a descriptor automatically triggers eventpoll_release, which cleans up the associated entries.

2.5 Blocking and waking up processes

A process that calls epoll_wait is placed on the eventpoll ’s wait queue. When an interrupt marks a socket ready, the kernel adds the socket to rdllist and wakes one or more waiting processes (depending on the mode). The awakened process then reads the ready list and continues execution.

3. Practical Usage Example

The following complete program demonstrates a simple TCP echo server built with epoll. It shows socket creation, non‑blocking setup, registration of the listening socket, acceptance of new connections, and read/write handling.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>

#define PORT 8888
#define MAX_EVENTS 10
#define BUFFER_SIZE 1024

int setnonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

int main() {
    int listen_fd, conn_fd, epoll_fd;
    struct sockaddr_in server_addr, client_addr;
    socklen_t client_len = sizeof(client_addr);
    struct epoll_event event, events[MAX_EVENTS];
    char buffer[BUFFER_SIZE];

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd == -1) { perror("socket"); return 1; }

    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = INADDR_ANY;
    server_addr.sin_port = htons(PORT);
    if (bind(listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) == -1) {
        perror("bind"); close(listen_fd); return 1; }
    if (listen(listen_fd, 10) == -1) { perror("listen"); close(listen_fd); return 1; }

    epoll_fd = epoll_create1(0);
    if (epoll_fd == -1) { perror("epoll_create1"); close(listen_fd); return 1; }

    if (setnonblocking(listen_fd) == -1) { perror("setnonblocking"); close(listen_fd); close(epoll_fd); return 1; }

    event.events = EPOLLIN;
    event.data.fd = listen_fd;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &event) == -1) {
        perror("epoll_ctl add listen_fd"); close(listen_fd); close(epoll_fd); return 1; }

    while (1) {
        int nfds = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
        if (nfds == -1) { perror("epoll_wait"); break; }
        for (int i = 0; i < nfds; ++i) {
            if (events[i].data.fd == listen_fd) {
                conn_fd = accept(listen_fd, (struct sockaddr *)&client_addr, &client_len);
                if (conn_fd == -1) { perror("accept"); continue; }
                if (setnonblocking(conn_fd) == -1) { perror("setnonblocking conn"); close(conn_fd); continue; }
                event.events = EPOLLIN;
                event.data.fd = conn_fd;
                if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, conn_fd, &event) == -1) {
                    perror("epoll_ctl add conn_fd"); close(conn_fd); }
            } else {
                conn_fd = events[i].data.fd;
                ssize_t count = read(conn_fd, buffer, sizeof(buffer));
                if (count == -1) {
                    if (errno != EAGAIN && errno != EWOULDBLOCK) {
                        perror("read"); close(conn_fd); epoll_ctl(epoll_fd, EPOLL_CTL_DEL, conn_fd, NULL);
                    }
                } else if (count == 0) {
                    printf("Client closed connection
");
                    close(conn_fd); epoll_ctl(epoll_fd, EPOLL_CTL_DEL, conn_fd, NULL);
                } else {
                    buffer[count] = '\0';
                    printf("Received: %s
", buffer);
                    if (write(conn_fd, buffer, count) == -1) perror("write");
                }
            }
        }
    }
    close(listen_fd);
    close(epoll_fd);
    return 0;
}

4. Performance Tips and Pitfalls

4.1 Common issues

In ET mode, failing to read until EAGAIN leaves data in the kernel buffer without further notifications, causing data loss. The correct pattern is to loop on read until it returns -1 with EAGAIN.

“Epoll thundering‑herd” occurs when multiple processes or threads wait on the same socket; all are awakened but only one can handle the event, wasting CPU cycles. Using the EPOLLEXCLUSIVE flag (Linux 4.5+) or SO_REUSEPORT with separate sockets mitigates this.

4.2 Optimisation recommendations

Choose an appropriate timeout for epoll_wait based on latency requirements; a small timeout reduces latency but increases system calls.

Batch‑process events returned by epoll_wait to minimise per‑event overhead; consider a thread‑pool or coroutine model for parallel handling.

Use EPOLLONESHOT for sockets shared among multiple workers to guarantee that only one worker processes a given event at a time.

By understanding epoll’s internal structures and correctly applying its API, developers can build scalable, low‑latency network services capable of handling millions of concurrent connections.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Linux I/O Multiplexing Network programming Event-driven epoll

Written by

Open Source Tech Hub

Sharing cutting-edge internet technologies and practical AI resources.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.