How epoll Scales to Millions of Connections Using Red‑Black Trees and Ready Lists
This article explains the inner workings of Linux's epoll mechanism, detailing how red‑black trees manage registered file descriptors and how a ready‑list delivers events in O(1) time, covering core APIs, data structures, insertion‑deletion logic, and the full kernel‑space to user‑space flow.
1. Core design of epoll
epoll achieves scalable I/O multiplexing by maintaining two kernel data structures: a red‑black tree that stores every registered file descriptor (fd) and a double‑linked ready list that holds only the fds whose events have become ready. The tree provides O(log n) insertion, deletion and lookup, while the ready list enables O(1) retrieval of ready events regardless of the total number of fds.
1.1 Red‑black tree (connection management)
Each monitored fd is wrapped in an epitem structure. The rbn member links the epitem into a red‑black tree ( eventpoll.rbr). Operations such as epoll_ctl(EPOLL_CTL_ADD) and EPOLL_CTL_DEL walk the tree using the fd as the key, insert or remove the node with rb_link_node / rb_insert_color, and rebalance the tree. All these operations run in O(log n) time.
1.2 Ready list (event dispatch)
When the kernel detects an I/O event on a monitored fd, the poll subsystem invokes the callback registered by epoll_ctl. The callback stores the actual event mask in epitem.revents and, if the rdllink field is not already in the list, adds the epitem to eventpoll.rdllist with list_add_tail. The waiting process is then awakened via wake_up_locked(&eventpoll.wq). epoll_wait simply iterates over the ready list, copies the epoll_event structures to user space, and optionally removes the node for edge‑triggered registrations.
1.3 Core system calls
epoll_create1 – allocates an eventpoll object, initializes the red‑black tree, the ready list, mutexes and wait queues, and returns a dedicated fd.
epoll_ctl – adds, modifies or deletes a monitored fd. It creates an epitem, inserts it into the tree, and registers a poll callback for the fd.
epoll_wait – blocks on eventpoll.wq until the ready list is non‑empty or a timeout expires, then copies up to maxevents entries to the user‑provided array.
2. Key data structures
2.1 struct epitem
struct epitem {
struct rb_node rbn; /* node in the red‑black tree */
struct list_head rdllink; /* node in the ready list */
struct epoll_filefd ffd; /* fd and file information */
struct eventpoll *ep; /* back‑pointer to the epoll instance */
struct epoll_event event; /* user‑requested mask (EPOLLIN, …) */
__poll_t revents; /* actual events reported by the kernel */
/* additional fields omitted for brevity */
};2.2 struct eventpoll
struct eventpoll {
struct mutex mtx; /* protects internal state */
wait_queue_head_t wq; /* processes waiting in epoll_wait */
struct list_head rdllist; /* ready list (double‑linked) */
struct rb_root_cached rbr; /* red‑black tree of all epitems */
struct user_struct *user; /* for mmap zero‑copy support */
bool closed;
/* other fields omitted */
};3. Insertion and deletion flow
Call fget(fd) to obtain the file object.
Allocate an epitem and fill fd, event and data.
Lock the epoll instance, walk the red‑black tree comparing fd values, and insert the node with rb_link_node / rb_insert_color.
Register a poll callback via ep_ptable_queue_proc, which attaches ep_poll_callback to the file’s wait queue.
Deletion performs the reverse: locate the node by fd, remove it from the tree, deregister the callback, and free the epitem.
4. Event detection and dispatch
When a monitored fd becomes ready, the kernel’s poll subsystem calls the previously registered ep_poll_callback:
Lock the epoll instance.
Store the event mask in epitem.revents.
If the item is not already on the ready list, add it with list_add_tail and wake up any process sleeping in epoll_wait.
Unlock the instance. epoll_wait then:
Checks whether the ready list is empty; if so and a timeout is specified, the calling process sleeps on eventpoll.wq.
When woken, iterates over rdllist, copying each epitem ’s event and data to the user‑provided array (using __put_user).
For edge‑triggered registrations ( EPOLLET) the node is removed from the ready list after copying; level‑triggered entries remain.
5. mmap zero‑copy optimization
When an application maps the epoll file descriptor, the kernel links eventpoll.user to a user‑space page that mirrors the kernel’s ready‑event buffer. epoll_wait can then return events without an extra copy, which is especially beneficial when thousands of events are delivered in a single call.
6. Red‑black tree properties
The tree satisfies the classic five properties (node colour, root black, red nodes have black children, every leaf (NIL) is black, and all paths from a node to its leaves contain the same number of black nodes). These guarantees keep the height bounded by 2 × log₂ n, ensuring O(log n) complexity for insert, delete and lookup.
7. Comparison with select/poll
Select/poll must copy the entire fd set into the kernel on each call and then scan all fds to find ready ones – O(n) work per call.
epoll registers each fd once (via the red‑black tree) and relies on callbacks; epoll_wait processes only the ready list – O(1) per ready fd, independent of the total number of registered fds.
8. Full workflow example (C)
int epfd = epoll_create1(0);
struct epoll_event ev;
/* monitor a listening socket for readability */
ev.events = EPOLLIN;
ev.data.fd = listenfd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
/* event loop */
while (1) {
struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1);
for (int i = 0; i < n; ++i) {
if (events[i].events & EPOLLIN) {
/* handle readable fd */
}
/* other event handling … */
}
}This design—logarithmic management of all fds via a red‑black tree and constant‑time delivery of ready events via a double‑linked list—gives epoll its superior performance for high‑concurrency network servers.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
