Mastering epoll: High‑Performance Event‑Driven I/O for Millions of Connections
This article explains how epoll solves the inefficiencies of select/poll by using a kernel‑side red‑black tree and ready‑list, details its API, internal structures, trigger modes, reactor model, and provides a complete C demo for building a scalable TCP server.
Imagine a server that must keep one million TCP connections open while only a few hundred are active at any moment; processing such a workload efficiently requires an event‑driven mechanism that avoids scanning all sockets on each poll.
Why select/poll fail at scale
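select and poll require the caller to pass the entire set of file descriptors to the kernel on every call, which was the only option before epoll arrived in Linux 2.5.44. This forces a user‑to‑kernel copy of the whole set and makes the kernel iterate over all one million sockets each time, even when only a few hundred are active. select is additionally capped by FD_SETSIZE (typically 1024), so practical concurrency stays at a few thousand connections.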
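The per-call cost described above can be seen in a minimal poll sketch. The helper names count_readable and poll_scan_demo are hypothetical and not part of the original article; the pipes merely stand in for sockets.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns how many of the nfds descriptors are readable.
 * Two O(N) costs show up: the whole array is handed to the kernel on every
 * call, and the caller has to scan every entry again afterwards. */
int count_readable(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    assert(fds != NULL);
    int ready = poll(fds, nfds, timeout_ms);
    if (ready <= 0)
        return ready;                       /* 0 = timeout, -1 = error */

    int readable = 0;
    for (nfds_t i = 0; i < nfds; ++i)       /* scan all entries, ready or not */
        if (fds[i].revents & POLLIN)
            ++readable;
    return readable;
}

/* Demo: two pipes are watched but only the second one has data. */
int poll_scan_demo(void)
{
    int a[2], b[2], n = -1;
    if (pipe(a) != 0)
        return -1;
    if (pipe(b) == 0) {
        struct pollfd fds[2] = {
            { .fd = a[0], .events = POLLIN },
            { .fd = b[0], .events = POLLIN },
        };
        if (write(b[1], "x", 1) == 1)
            n = count_readable(fds, 2, 0);
        close(b[0]);
        close(b[1]);
    }
    close(a[0]);
    close(a[1]);
    return n;
}
```

With two descriptors the scan is trivial; with a million registered sockets the same loop runs a million times per call, in both the kernel and the application.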
epoll’s three‑step design
epoll introduces three system calls:
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
Call epoll_create once to obtain an epoll object.
Use epoll_ctl to add, modify, or delete the sockets you want to monitor.
Call epoll_wait to retrieve only the sockets that actually have events.
Because the kernel stores the registered sockets in a red‑black tree and keeps a ready‑list of events, epoll_wait only has to inspect the ready list rather than scan the whole set. Its cost scales with the number of ready sockets, not the number of registered ones, which makes it suitable for millions of connections.
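The three steps can be sketched end to end as follows. This is a minimal illustration rather than the article's demo: the function name epoll_three_step_demo is hypothetical, and a pipe stands in for a socket so the example needs no network setup.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: watches the read end of a pipe and returns the number
 * of events epoll_wait reports after one byte is written (-1 on failure). */
int epoll_three_step_demo(void)
{
    int p[2], n = -1;
    if (pipe(p) != 0)
        return -1;

    /* Step 1: size is ignored by modern kernels but must be greater than 0. */
    int epfd = epoll_create(1);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
        struct epoll_event ready[8];

        /* Step 2: register the fd once; the kernel remembers it. */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "x", 1) == 1) {
            /* Step 3: only fds with pending events are returned. */
            n = epoll_wait(epfd, ready, 8, 1000);
            if (n == 1)
                assert(ready[0].data.fd == p[0] && (ready[0].events & EPOLLIN));
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return n;
}
```

Unlike poll, the interest set is passed to the kernel once through epoll_ctl, so each epoll_wait call carries only the output array.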
Internal structures
When an epoll object is created, the kernel allocates an eventpoll structure:
struct eventpoll {
    struct rb_root rbr;        // red‑black tree of all registered events
    struct list_head rdllist;  // list of ready events to return to userspace
    ...
};
Each monitored socket is represented by an epitem:
struct epitem {
    struct rb_node rbn;        // node in the red‑black tree
    struct list_head rdllink;  // node in the ready list
    struct epoll_filefd ffd;   // file descriptor information
    struct eventpoll *ep;      // back‑pointer to the owning eventpoll
    struct epoll_event event;  // the events we are interested in
    ...
};
When epoll_wait is invoked, the kernel checks whether rdllist contains any epitem entries; if so, it copies the corresponding events into the caller's events array and returns the count. No shared memory is involved: the copy is small because it covers only the ready entries. If the list is empty, the call sleeps until an event arrives or the timeout expires.
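One visible consequence of the epitem keeping its own copy of struct epoll_event is that whatever the application stores in the data union at registration comes back unchanged from epoll_wait. The sketch below illustrates this with arbitrary tags; the function name epoll_tag_demo and the values 100 and 200 are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: tags two pipe read ends with arbitrary u64 values and
 * returns the tag reported for the one that became readable (0 on failure). */
uint64_t epoll_tag_demo(void)
{
    int a[2], b[2];
    uint64_t tag = 0;

    if (pipe(a) != 0)
        return 0;
    if (pipe(b) != 0) {
        close(a[0]);
        close(a[1]);
        return 0;
    }

    int epfd = epoll_create(1);
    if (epfd >= 0) {
        struct epoll_event ea = { .events = EPOLLIN, .data.u64 = 100 };
        struct epoll_event eb = { .events = EPOLLIN, .data.u64 = 200 };
        struct epoll_event out[2];

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, a[0], &ea) == 0 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, b[0], &eb) == 0 &&
            write(b[1], "x", 1) == 1 &&            /* only b becomes readable */
            epoll_wait(epfd, out, 2, 1000) == 1) {
            assert(out[0].events & EPOLLIN);
            tag = out[0].data.u64;                 /* stored value handed back as-is */
        }
        close(epfd);
    }
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return tag;
}
```

The same mechanism is what lets a reactor store a pointer to its per-connection state in data.ptr and recover it directly from each returned event.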
Trigger modes: LT vs. ET
LT (Level‑Triggered) – the default. As long as data remains readable, the socket is reported on every epoll_wait call.
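ET (Edge‑Triggered) – “high‑speed” mode. The socket is reported only when a new event arrives, that is, when its readiness state changes. The application must therefore keep reading until recv fails with EAGAIN (or EWOULDBLOCK); otherwise data left in the buffer will not generate another notification until more data arrives.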
ET reduces the number of wake‑ups for sockets that generate a lot of ready notifications, but it requires non‑blocking I/O and careful handling of partial reads/writes.
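The difference between the two modes can be observed directly. In this sketch, count_reports writes once, never reads, and counts how many of two consecutive epoll_wait calls report the descriptor, while drain_fd shows the read-until-EAGAIN loop an ET handler needs. Both function names are hypothetical, and a pipe again stands in for a socket, so read replaces recv.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Reads until the kernel buffer is empty, as an ET handler must.
 * Returns the total bytes read, or -1 on a real error. fd must be
 * non-blocking, otherwise the final read would block. */
long drain_fd(int fd)
{
    char buf[512];
    long total = 0;
    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                       /* EOF: peer closed */
        if (errno == EINTR)
            continue;                           /* interrupted, try again */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                       /* buffer drained */
        return -1;
    }
}

/* Hypothetical demo: extra_flags is 0 for LT or EPOLLET for ET.
 * Returns how many of two back-to-back waits reported the fd. */
int count_reports(uint32_t extra_flags)
{
    int p[2], reports = -1;
    if (pipe(p) != 0)
        return -1;

    int epfd = epoll_create(1);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | extra_flags, .data.fd = p[0] };
        struct epoll_event out;

        if (fcntl(p[0], F_SETFL, O_NONBLOCK) == 0 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "abc", 3) == 3) {
            reports  = (epoll_wait(epfd, &out, 1, 0) == 1);
            reports += (epoll_wait(epfd, &out, 1, 0) == 1);  /* nothing read in between */
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return reports;
}
```

Under LT the unread data is reported on both calls; under ET it is reported once, which is why the handler has to drain the descriptor before returning to epoll_wait.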
Reactor model with epoll
The classic epoll flow is:
epoll_create();  // create the epoll instance once
epoll_ctl();     // register the listening fd and client fds
while (1) {
    int n = epoll_wait(efd, events, MAX_EVENTS, timeout);
    for (int i = 0; i < n; ++i) {
        // data.ptr was set at registration time and points at our own state
        struct myevent_s *ev = (struct myevent_s *)events[i].data.ptr;
        // dispatch only when the reported event is one we asked for
        if ((events[i].events & EPOLLIN) && (ev->events & EPOLLIN))
            ev->call_back(ev->fd, events[i].events, ev->arg);
        if ((events[i].events & EPOLLOUT) && (ev->events & EPOLLOUT))
            ev->call_back(ev->fd, events[i].events, ev->arg);
    }
}
Each connection is represented by a custom myevent_s structure that stores the fd, interested events, a callback, status flags, buffers, and timestamps. The callbacks handle accepting new connections, reading data, and sending responses, while eventadd and eventdel manage registration in the epoll tree.
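A compact version of that structure and its helpers might look like the sketch below. The field layout of myevent_s, the helper signatures, and the names on_readable and reactor_demo are assumptions reconstructed from the description above, not the article's exact code; a pipe replaces the client socket.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

#define BUFLEN 128

/* Assumed layout, reconstructed from the description in the text. */
struct myevent_s {
    int fd;
    int events;                                        /* events we asked for */
    void *arg;                                         /* passed to call_back */
    void (*call_back)(int fd, int events, void *arg);
    int status;                                        /* 1 = registered with epoll */
    char buf[BUFLEN];
    int len;
    time_t last_active;
};

void eventset(struct myevent_s *ev, int fd,
              void (*call_back)(int, int, void *), void *arg)
{
    memset(ev, 0, sizeof *ev);
    ev->fd = fd;
    ev->call_back = call_back;
    ev->arg = arg;
    ev->last_active = time(NULL);
}

int eventadd(int efd, int events, struct myevent_s *ev)
{
    struct epoll_event epv = { .events = (uint32_t)events, .data.ptr = ev };
    int op = ev->status ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;   /* re-arm if already added */
    ev->events = events;
    if (epoll_ctl(efd, op, ev->fd, &epv) < 0)
        return -1;
    ev->status = 1;
    return 0;
}

int eventdel(int efd, struct myevent_s *ev)
{
    struct epoll_event epv = { 0 };
    if (!ev->status)
        return 0;
    ev->status = 0;
    return epoll_ctl(efd, EPOLL_CTL_DEL, ev->fd, &epv);
}

static void on_readable(int fd, int events, void *arg)
{
    struct myevent_s *ev = arg;
    (void)events;
    ev->len = (int)read(fd, ev->buf, sizeof ev->buf);
}

/* Demo: one pass of the reactor loop; returns the bytes the callback read. */
int reactor_demo(void)
{
    int p[2];
    struct myevent_s ev;
    struct epoll_event ready[4];

    if (pipe(p) != 0)
        return -1;
    eventset(&ev, p[0], on_readable, &ev);

    int efd = epoll_create(1);
    if (efd >= 0 && eventadd(efd, EPOLLIN, &ev) == 0 && write(p[1], "ping", 4) == 4) {
        int n = epoll_wait(efd, ready, 4, 1000);
        for (int i = 0; i < n; ++i) {
            struct myevent_s *e = ready[i].data.ptr;
            assert(e == &ev);
            if ((ready[i].events & EPOLLIN) && (e->events & EPOLLIN))
                e->call_back(e->fd, (int)ready[i].events, e->arg);
        }
        eventdel(efd, &ev);
    }
    if (efd >= 0)
        close(efd);
    close(p[0]);
    close(p[1]);
    return ev.len;
}
```

Storing the structure's address in data.ptr is the design choice that makes dispatch a single pointer dereference, with no fd-to-state lookup table.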
Complete C demo
The article provides a full example that:
Creates an epoll instance with epoll_create.
Initialises a non‑blocking listening socket.
Registers the listening fd with eventadd (monitoring EPOLLIN).
Implements acceptconn to accept new clients, set them non‑blocking, and add them to the epoll set.
Implements recvdata to read client data, then switches the fd to EPOLLOUT for echoing.
Implements senddata to write the echo back and re‑arm the fd for reading.
Provides utility functions eventset, eventadd, and eventdel; the latter two go through epoll_ctl, which is what actually updates the kernel's red‑black tree and ready list.
The main loop continuously calls epoll_wait with a 1‑second timeout, processes the returned events, and exits on error.
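The read, switch to EPOLLOUT, write, re-arm cycle that recvdata and senddata perform can be sketched on its own. This is not the article's demo: echo_demo, rearm, and set_nonblocking are hypothetical names, and an AF_UNIX socketpair replaces the accepted TCP connection so the example runs without a listener.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Switches an already registered fd to a different event set. */
static int rearm(int efd, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events, .data.fd = fd };
    return epoll_ctl(efd, EPOLL_CTL_MOD, fd, &ev);
}

/* Hypothetical demo: sv[0] plays the client, sv[1] the server-side
 * connection. Copies the echoed bytes into out and returns their count. */
int echo_demo(const char *msg, char *out, size_t outlen)
{
    int sv[2], efd, echoed = -1;
    char buf[128];
    ssize_t len;
    struct epoll_event ev, got;

    assert(msg != NULL && out != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    efd = epoll_create(1);
    if (efd < 0 || set_nonblocking(sv[1]) != 0)
        goto done;
    ev.events = EPOLLIN;
    ev.data.fd = sv[1];
    if (epoll_ctl(efd, EPOLL_CTL_ADD, sv[1], &ev) != 0)
        goto done;
    if (send(sv[0], msg, strlen(msg), 0) < 0)
        goto done;

    /* "recvdata": read the request, then wait for writability instead. */
    if (epoll_wait(efd, &got, 1, 1000) != 1 || !(got.events & EPOLLIN))
        goto done;
    len = recv(sv[1], buf, sizeof buf, 0);
    if (len <= 0 || rearm(efd, sv[1], EPOLLOUT) != 0)
        goto done;

    /* "senddata": echo the bytes back, then re-arm for the next read. */
    if (epoll_wait(efd, &got, 1, 1000) != 1 || !(got.events & EPOLLOUT))
        goto done;
    if (send(sv[1], buf, (size_t)len, 0) != len || rearm(efd, sv[1], EPOLLIN) != 0)
        goto done;

    echoed = (int)recv(sv[0], out, outlen, 0);
done:
    if (efd >= 0)
        close(efd);
    close(sv[0]);
    close(sv[1]);
    return echoed;
}
```

A production server would also handle partial sends and keep the unsent remainder buffered until the next EPOLLOUT; the sketch assumes the short message fits in one call.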
Key takeaways:
epoll_create builds the red‑black tree and ready list once; subsequent operations only modify these structures.
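epoll_ctl adds a socket to the tree and registers a kernel callback that moves the socket's epitem onto the ready list when an event occurs.
epoll_wait only has to check the ready list, so it returns as soon as events are pending (or when the timeout expires), avoiding the O(N) scan of select/poll.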
Choosing EPOLLET reduces unnecessary wake‑ups but demands careful non‑blocking I/O handling.
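How quickly epoll_wait returns depends on its timeout argument: 0 polls and returns at once, a positive value waits at most that many milliseconds, and -1 blocks until an event arrives. The sketch below shows the first and last cases together with a common retry-on-EINTR wrapper; epoll_wait_retry and idle_then_ready_demo are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical wrapper: retries when a signal interrupts the wait.
 * Note that each retry restarts the full timeout. */
int epoll_wait_retry(int epfd, struct epoll_event *events, int maxevents, int timeout_ms)
{
    int n;
    assert(maxevents > 0);      /* epoll_wait rejects maxevents <= 0 with EINVAL */
    do {
        n = epoll_wait(epfd, events, maxevents, timeout_ms);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* Demo: returns the event count once the pipe is readable, or -1 if the
 * earlier idle poll did not come back empty. */
int idle_then_ready_demo(void)
{
    int p[2], n = -1;
    struct epoll_event ev, out[4];

    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create(1);
    if (epfd >= 0) {
        ev.events = EPOLLIN;
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            epoll_wait_retry(epfd, out, 4, 0) == 0 &&    /* nothing ready: 0 at once */
            write(p[1], "x", 1) == 1)
            n = epoll_wait_retry(epfd, out, 4, -1);      /* -1 = block until an event */
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return n;
}
```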
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)