Mastering epoll: High‑Performance Event‑Driven I/O for Millions of Connections

This article explains how epoll solves the inefficiencies of select/poll by using a kernel‑side red‑black tree and ready‑list, details its API, internal structures, trigger modes, reactor model, and provides a complete C demo for building a scalable TCP server.

Liangxu Linux

Imagine a server that must keep one million TCP connections open while only a few hundred are active at any moment; processing such a workload efficiently requires an event‑driven mechanism that avoids scanning all sockets on each poll.

Why select/poll fail at scale

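select and poll, the only options before epoll arrived in the Linux 2.5.44 development kernel, require the application to pass the entire set of file descriptors to the kernel on every call. Each call therefore copies the whole set from user space to kernel space and forces the kernel to walk every descriptor, so with one million sockets the cost is O(N) per call even when only a handful are ready. select is additionally capped by FD_SETSIZE, typically 1024. In practice this limits concurrency to a few thousand connections.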
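The per-call scan can be made concrete with a small sketch. It assumes pipes as stand-ins for sockets, and the helper name poll_scan and the constant NFDS are illustrative, not part of the original article.

```c
#include <poll.h>
#include <unistd.h>

#define NFDS 64

/* Hypothetical demo: watches NFDS pipes with poll(), makes `active` of them
   readable, and records how many array slots were inspected to find them.
   Returns the number of ready fds found, or -1 on setup failure. */
int poll_scan(int active, int *slots_inspected)
{
    int fds[NFDS][2];
    struct pollfd pfds[NFDS];
    int created = 0, found = -1;

    for (; created < NFDS; ++created) {
        if (pipe(fds[created]) < 0)
            goto out;
        pfds[created].fd = fds[created][0];
        pfds[created].events = POLLIN;
        pfds[created].revents = 0;
    }
    for (int i = 0; i < active && i < NFDS; ++i)
        if (write(fds[i][1], "x", 1) != 1)
            goto out;

    /* The whole array is handed to the kernel on every call. */
    if (poll(pfds, NFDS, 0) < 0)
        goto out;

    /* poll() returns only a count, so this sketch walks every slot
       to learn which descriptors are actually ready. */
    found = 0;
    *slots_inspected = 0;
    for (int i = 0; i < NFDS; ++i) {
        ++*slots_inspected;
        if (pfds[i].revents & POLLIN)
            ++found;
    }
out:
    for (int i = 0; i < created; ++i) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    return found;
}
```

Calling poll_scan(3, &inspected) finds three ready descriptors but inspects all 64 slots; with a million descriptors the same walk happens on every call, in both the kernel and the application.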

epoll’s three‑step design

epoll introduces three system calls:

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Call epoll_create once to obtain an epoll object. The size argument has been ignored since Linux 2.6.8 but must still be greater than zero.

Use epoll_ctl to add, modify, or delete the sockets you want to monitor.

Call epoll_wait to retrieve only the sockets that actually have events.

Because the kernel stores the registered sockets in a red‑black tree and keeps a separate ready list of sockets with pending events, epoll_wait does not scan the whole set: its cost depends on the number of ready sockets, not on the number registered. That is what makes it suitable for millions of connections.
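The three steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: a pipe stands in for a socket so the code runs without a network peer, epoll_create1(0) is used as the modern equivalent of epoll_create, and the function name three_step_demo is hypothetical.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if epoll_wait reports the pipe's read end,
   0 if it does not, -1 on setup failure. */
int three_step_demo(void)
{
    int pipefd[2];
    int result = -1;

    if (pipe(pipefd) < 0)
        return -1;

    /* Step 1: create the epoll object once. */
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        /* Step 2: register interest in readability of the read end. */
        struct epoll_event ev;
        memset(&ev, 0, sizeof ev);
        ev.events = EPOLLIN;
        ev.data.fd = pipefd[0];

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
            write(pipefd[1], "x", 1) == 1) {
            /* Step 3: retrieve only the descriptors that have events. */
            struct epoll_event ready[8];
            int n = epoll_wait(epfd, ready, 8, 1000);
            result = (n == 1 && ready[0].data.fd == pipefd[0]) ? 1 : 0;
        }
        close(epfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

Registration happens once in step 2; only step 3 is repeated in a server's main loop, which is where the saving over select/poll comes from.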

Internal structures

When an epoll object is created, the kernel allocates an eventpoll structure:

struct eventpoll {
    struct rb_root rbr;      // red‑black tree of all registered events
    struct list_head rdllist; // list of ready events to return to userspace
    ...
};

Each monitored socket is represented by an epitem:

struct epitem {
    struct rb_node rbn;          // node in the red‑black tree
    struct list_head rdllink;   // node in the ready list
    struct epoll_filefd ffd;    // file descriptor information
    struct eventpoll *ep;       // back‑pointer to the owning eventpoll
    struct epoll_event event;   // the events we are interested in
    ...
};

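When epoll_wait is invoked, the kernel checks whether rdllist contains any epitem entries. If it does, the kernel copies the corresponding events into the caller's events array (an ordinary kernel‑to‑user copy, not shared memory) and returns the count. If the list is empty, the call sleeps until an event arrives or the timeout expires.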
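The effect of the ready list is visible from user space: however many descriptors are registered, epoll_wait hands back only the ready ones. The sketch below assumes pipes in place of sockets; count_ready and NPIPES are illustrative names.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 64

/* Hypothetical demo: registers NPIPES pipes, makes `active` of them readable,
   and returns how many events epoll_wait reports (-1 on setup failure). */
int count_ready(int active)
{
    int fds[NPIPES][2];
    struct epoll_event ev;
    struct epoll_event events[NPIPES];
    int created = 0, result = -1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    for (; created < NPIPES; ++created) {
        if (pipe(fds[created]) < 0)
            goto out;
        memset(&ev, 0, sizeof ev);
        ev.events = EPOLLIN;
        ev.data.fd = fds[created][0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[created][0], &ev) < 0) {
            close(fds[created][0]);
            close(fds[created][1]);
            goto out;
        }
    }
    for (int i = 0; i < active && i < NPIPES; ++i)
        if (write(fds[i][1], "x", 1) != 1)
            goto out;

    /* Timeout 0: return whatever is on the ready list right now. */
    result = epoll_wait(epfd, events, NPIPES, 0);
out:
    for (int i = 0; i < created; ++i) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return result;
}
```

With 64 pipes registered and 3 written to, the events array receives exactly 3 entries, so the application's loop runs 3 times rather than 64.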

Trigger modes: LT vs. ET

LT (Level‑Triggered) – the default. As long as data remains readable, the socket is reported on every epoll_wait call.

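ET (Edge‑Triggered) – “high‑speed” mode, enabled with the EPOLLET flag. The socket is reported only when its state changes, for example when new data arrives. The application must keep reading until recv returns -1 with errno set to EAGAIN (or EWOULDBLOCK); otherwise data left in the buffer will not trigger another notification until more data arrives.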

ET reduces the number of wake‑ups for sockets that generate a lot of ready notifications, but it requires non‑blocking I/O and careful handling of partial reads/writes.
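The difference between the two modes, and the read-until-EAGAIN rule, can be sketched as follows. The helper names reports_without_reading and drain are hypothetical, and a pipe again stands in for a socket (read is used instead of recv for that reason).

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: leaves one unread byte in a pipe and counts how many of
   `rounds` consecutive epoll_wait calls report it. Pass 0 for LT, EPOLLET for ET. */
int reports_without_reading(uint32_t mode, int rounds)
{
    int p[2], hits = -1;

    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev, out;
        memset(&ev, 0, sizeof ev);
        ev.events = EPOLLIN | mode;
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "x", 1) == 1) {
            hits = 0;
            for (int i = 0; i < rounds; ++i)
                if (epoll_wait(epfd, &out, 1, 0) == 1)
                    ++hits;
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return hits;
}

/* Reads a non-blocking fd until it is empty, as ET mode requires.
   Returns the number of bytes consumed, or -1 on a real error. */
long drain(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                       /* end of file / peer closed */
        if (errno == EINTR)
            continue;                           /* interrupted: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                       /* buffer is empty */
        return -1;
    }
}
```

In LT mode every round reports the unread byte; in ET mode only the first does, which is why an ET handler should call something like drain before returning to epoll_wait.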

Reactor model with epoll

The classic epoll flow is:

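int efd = epoll_create(1);                 // size must be > 0 but is otherwise ignored
epoll_ctl(efd, EPOLL_CTL_ADD, lfd, &epv);  // register the listening fd (lfd, epv are placeholders); client fds are added as they connect
while (1) {
    int n = epoll_wait(efd, events, MAX_EVENTS, timeout);
    if (n < 0 && errno == EINTR)
        continue;                          // interrupted by a signal: retry
    if (n < 0)
        break;                             // real error: leave the loop
    for (int i = 0; i < n; ++i) {
        // data.ptr carries the per-connection context set at registration
        struct myevent_s *ev = (struct myevent_s *)events[i].data.ptr;
        // dispatch only if the reported event is one this connection asked for
        if ((events[i].events & EPOLLIN) && (ev->events & EPOLLIN))
            ev->call_back(ev->fd, events[i].events, ev->arg);
        if ((events[i].events & EPOLLOUT) && (ev->events & EPOLLOUT))
            ev->call_back(ev->fd, events[i].events, ev->arg);
    }
}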

Each connection is represented by a custom myevent_s structure that stores the fd, interested events, a callback, status flags, buffers, and timestamps. The callbacks handle accepting new connections, reading data, and sending responses, while eventadd and eventdel manage registration in the epoll tree.
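A possible shape for this structure and its helpers is sketched below. Only fd, events, arg and call_back appear in the loop above; the remaining field names, the buffer size, and the helpers on_readable and dispatch_once are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

#define BUFLEN 128

/* Assumed layout: fields other than fd, events, arg and call_back are guesses. */
struct myevent_s {
    int fd;                                   /* monitored descriptor */
    int events;                               /* EPOLLIN or EPOLLOUT */
    void *arg;                                /* passed back to call_back */
    void (*call_back)(int fd, int events, void *arg);
    int status;                               /* 1 while registered with epoll */
    char buf[BUFLEN];
    int len;
    time_t last_active;
};

/* Initialise a context without registering it. */
void eventset(struct myevent_s *ev, int fd,
              void (*call_back)(int, int, void *), void *arg)
{
    memset(ev, 0, sizeof *ev);
    ev->fd = fd;
    ev->call_back = call_back;
    ev->arg = arg;
    ev->last_active = time(NULL);
}

/* Register (or re-arm) the context; data.ptr lets the loop find it again. */
int eventadd(int efd, int events, struct myevent_s *ev)
{
    struct epoll_event epv;
    memset(&epv, 0, sizeof epv);
    epv.data.ptr = ev;
    epv.events = (uint32_t)events;
    int op = ev->status ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
    if (epoll_ctl(efd, op, ev->fd, &epv) < 0)
        return -1;
    ev->events = events;
    ev->status = 1;
    return 0;
}

/* Remove the context from the epoll set. */
int eventdel(int efd, struct myevent_s *ev)
{
    struct epoll_event epv;
    if (!ev->status)
        return 0;
    memset(&epv, 0, sizeof epv);
    if (epoll_ctl(efd, EPOLL_CTL_DEL, ev->fd, &epv) < 0)
        return -1;
    ev->status = 0;
    return 0;
}

/* Example callback: read whatever is available into the context's buffer. */
void on_readable(int fd, int events, void *arg)
{
    struct myevent_s *ev = arg;
    (void)events;
    ssize_t n = read(fd, ev->buf, sizeof ev->buf);
    ev->len = n > 0 ? (int)n : 0;
    ev->last_active = time(NULL);
}

/* One iteration of the reactor loop; returns what epoll_wait returned. */
int dispatch_once(int efd, int timeout_ms)
{
    struct epoll_event events[16];
    int n = epoll_wait(efd, events, 16, timeout_ms);
    for (int i = 0; i < n; ++i) {
        struct myevent_s *ev = events[i].data.ptr;
        if (events[i].events & (uint32_t)ev->events)
            ev->call_back(ev->fd, (int)events[i].events, ev->arg);
    }
    return n;
}
```

Storing the context pointer in data.ptr is the design choice that makes the reactor work: the loop needs no lookup table from fd to connection, because the kernel returns the pointer with each event.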

Complete C demo

The article provides a full example that:

Creates an epoll instance with epoll_create.

Initialises a non‑blocking listening socket.

Registers the listening fd with eventadd (monitoring EPOLLIN).

Implements acceptconn to accept new clients, set them non‑blocking, and add them to the epoll set.

Implements recvdata to read client data, then switches the fd to EPOLLOUT for echoing.

Implements senddata to write the echo back and re‑arm the fd for reading.

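Provides utility functions eventset, eventadd, and eventdel. eventset initialises a myevent_s, while eventadd and eventdel wrap epoll_ctl, which is what actually inserts into or removes from the kernel's red‑black tree. User code never touches the ready list directly.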

The main loop continuously calls epoll_wait with a 1‑second timeout, processes the returned events, and exits on error.
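The read-then-echo cycle described above might look like the sketch below. It is not the article's demo: struct conn is a trimmed-down, hypothetical stand-in for myevent_s, the signatures of recvdata and senddata are assumed, and set_nonblocking, rearm and echo_step are illustrative helpers.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

struct conn {                 /* hypothetical per-connection state */
    int fd;
    char buf[1024];
    ssize_t len;              /* bytes waiting to be echoed */
};

int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Switch the fd between EPOLLIN and EPOLLOUT. */
static int rearm(int efd, struct conn *c, uint32_t events)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = events;
    ev.data.ptr = c;
    return epoll_ctl(efd, EPOLL_CTL_MOD, c->fd, &ev);
}

static int drop(int efd, struct conn *c)
{
    epoll_ctl(efd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    return -1;
}

/* Read one chunk, then wait for the socket to become writable. */
int recvdata(int efd, struct conn *c)
{
    ssize_t n = recv(c->fd, c->buf, sizeof c->buf, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return 0;                              /* nothing to read yet */
    if (n <= 0)
        return drop(efd, c);                   /* peer closed or hard error */
    c->len = n;
    return rearm(efd, c, EPOLLOUT);
}

/* Echo the buffered data, then go back to waiting for input. */
int senddata(int efd, struct conn *c)
{
    ssize_t n = send(c->fd, c->buf, (size_t)c->len, MSG_NOSIGNAL);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return 0;                              /* try again on next EPOLLOUT */
    if (n < 0)
        return drop(efd, c);
    if (n < c->len) {                          /* partial write: keep the rest */
        memmove(c->buf, c->buf + n, (size_t)(c->len - n));
        c->len -= n;
        return 0;
    }
    c->len = 0;
    return rearm(efd, c, EPOLLIN);
}

/* One loop iteration: pending output means send, otherwise receive. */
int echo_step(int efd, int timeout_ms)
{
    struct epoll_event events[16];
    int n = epoll_wait(efd, events, 16, timeout_ms);
    for (int i = 0; i < n; ++i) {
        struct conn *c = events[i].data.ptr;
        if (c->len > 0)
            senddata(efd, c);
        else
            recvdata(efd, c);
    }
    return n;
}
```

A real server would also register a listening socket whose callback accepts clients and adds each one with EPOLL_CTL_ADD; that part is omitted here to keep the sketch short.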

Key takeaways:

epoll_create allocates the eventpoll object, with its empty red‑black tree and ready list, once; subsequent operations only modify these structures.

epoll_ctl adds a socket to the tree and registers a kernel callback on it; when the socket becomes ready, that callback links the socket's epitem into the ready list.

epoll_wait returns only the entries on the ready list, avoiding the O(N) scan of select/poll; if the list is empty, it blocks until an event arrives or the timeout expires.

Choosing EPOLLET reduces unnecessary wake‑ups but demands careful non‑blocking I/O handling.
