Deep Dive into the Linux epoll Mechanism and Its Kernel Implementation
This article dissects Linux’s epoll I/O multiplexing mechanism, tracing the flow from socket creation with accept, through epoll_create, epoll_ctl registration, and epoll_wait sleeping. It details the kernel’s eventpoll object, red‑black tree, and per‑socket wait‑queue callbacks, which together enable O(log N) registration and O(1) event delivery across tens of thousands of connections.
When a process needs to handle thousands of TCP connections efficiently, Linux provides the I/O multiplexing mechanism known as epoll. This article dissects the internal workings of epoll, starting from the creation of sockets via accept, through the establishment of the epoll object, to the event notification flow.
1. accept – creating a new socket
The accept system call creates a new struct socket and a corresponding struct file. The running example used throughout this article (the includes were mangled in the original; sys/socket.h and sys/epoll.h cover the calls shown):

#include <sys/socket.h>
#include <sys/epoll.h>

int main() {
    listen(lfd, ...);
    cfd1 = accept(...);
    cfd2 = accept(...);
    efd = epoll_create(...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);
    epoll_wait(efd, ...);
}

After allocation, the new socket’s ops pointer is copied from the listening socket, and a file object is attached via sock_alloc_file. The socket_file_ops structure defines callbacks such as .poll = sock_poll.
2. epoll_create – creating the eventpoll object
Calling epoll_create allocates a struct eventpoll and registers it in the process’s file table. The core fields are:
struct eventpoll {
    wait_queue_head_t wq;      // wait queue for epoll_wait
    struct list_head rdllist;  // list of ready descriptors
    struct rb_root rbr;        // red-black tree of registered fds
    ...
};

Initialization is performed in ep_alloc, which zeroes the structure and sets up the wait queue, the ready list, and the red‑black tree root.
3. epoll_ctl – registering sockets
When EPOLL_CTL_ADD is invoked, the kernel allocates an epitem (a node in the red‑black tree) and links it to the target socket’s file descriptor:
struct epitem {
    struct rb_node rbn;        // tree node
    struct epoll_filefd ffd;   // fd and file pointer
    struct eventpoll *ep;      // back-reference to eventpoll
    struct list_head pwqlist;  // list of poll wait queue entries
};

The function ep_insert performs three steps:

1. Allocate and initialize the epitem.
2. Register a poll callback on the socket’s wait queue using init_poll_funcptr(&epq.pt, ep_ptable_queue_proc).
3. Insert the epitem into the eventpoll’s red‑black tree via ep_rbtree_insert.

The poll callback ep_ptable_queue_proc creates a wait_queue_t entry whose .func is ep_poll_callback and adds it to the socket’s wait queue.
4. epoll_wait – waiting for events
epoll_wait checks eventpoll->rdllist. If the list is empty, it creates a wait‑queue entry for the calling process and sleeps on eventpoll->wq:

init_waitqueue_entry(&wait, current);
__add_wait_queue_exclusive(&ep->wq, &wait);
schedule_hrtimeout_range(...);

When an I/O event occurs, the kernel wakes the process via the chain of callbacks described next.
5. Data arrival – the callback chain
Incoming packets are processed in tcp_v4_rcv, which locates the socket and calls tcp_rcv_established. The data is queued into sk->sk_receive_queue and the socket’s sk_data_ready function (sock_def_readable) is invoked.
sock_def_readable checks the socket’s wait queue and calls wake_up_interruptible_sync_poll, which ultimately executes the ep_poll_callback registered earlier:
static int ep_poll_callback(wait_queue_t *wait, unsigned int mode, int sync, void *key)
{
    struct epitem *epi = ep_item_from_wait(wait);   // recover the epitem
    struct eventpoll *ep = epi->ep;

    list_add_tail(&epi->rdllink, &ep->rdllist);     // queue on the ready list

    if (waitqueue_active(&ep->wq))                  // anyone blocked in epoll_wait?
        wake_up_locked(&ep->wq);
    return 1;
}

The final wake‑up uses default_wake_function, which calls try_to_wake_up on the sleeping process, moving it back to the runnable queue. When epoll_wait resumes, it copies the ready events from rdllist to user space.
Conclusion
The epoll mechanism combines three kernel data structures (an eventpoll object, a red‑black tree of registered file descriptors, and per‑socket wait‑queue entries) to achieve O(log N) registration and O(1) per‑event retrieval. Understanding the full callback chain (sock_def_readable → ep_poll_callback → default_wake_function) clarifies why epoll scales to tens of thousands of connections while keeping CPU usage low.
Tencent Cloud Developer