Deep Dive into Linux epoll: Kernel Implementation and Mechanism
This article thoroughly explains how Linux epoll works internally, covering the creation of sockets via accept, the epoll data structures, the implementation of epoll_create, epoll_ctl, and epoll_wait, and how kernel callbacks propagate I/O readiness from network packets to user space.
Linux processes that need to handle thousands of TCP connections must efficiently discover which sockets are readable or writable; the kernel provides this via the I/O multiplexing mechanism known as epoll. The article walks through the complete kernel path, from socket creation with accept to event notification with epoll_wait, illustrating each step with real source snippets.
1. accept creates a new socket
int main(){
    listen(lfd, ...);
    cfd1 = accept(...);
    cfd2 = accept(...);
    efd = epoll_create(...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);
    epoll_wait(efd, ...);
}
Each accept call in this program enters the kernel function SYSCALL_DEFINE4(accept4, int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen, int flags). It allocates a new struct socket, copies the listening socket's protocol operations (ops) to it, creates a struct file for it via sock_alloc_file, takes an established connection off the listening socket's accept queue, and finally installs the new file in the process's file descriptor table.
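Seen from user space, the result of this kernel path is simply a new descriptor that is distinct from the listening one. The following is a minimal sketch under stated assumptions: the helper name accept_returns_new_fd is hypothetical, it is not taken from the article, and it uses a loopback connection to itself so that no external peer is needed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if accept() hands back a new descriptor
 * distinct from the listening one, 0 if not, -1 on any setup error. */
int accept_returns_new_fd(void) {
    int result = -1;
    int lfd = -1, client = -1, cfd = -1;
    struct sockaddr_in addr = {0};
    socklen_t len = sizeof(addr);

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) goto out;

    /* Bind to 127.0.0.1 with port 0 so the kernel picks a free port. */
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (listen(lfd, 8) < 0) goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) goto out;

    /* The handshake completes inside the kernel; the connection then waits
     * in the listening socket's accept queue until accept() is called. */
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0) goto out;
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;

    /* accept() returns a descriptor backed by a newly allocated socket. */
    cfd = accept(lfd, NULL, NULL);
    if (cfd < 0) goto out;
    result = (cfd != lfd && cfd != client) ? 1 : 0;

out:
    if (cfd >= 0) close(cfd);
    if (client >= 0) close(client);
    if (lfd >= 0) close(lfd);
    return result;
}
```

Calling accept_returns_new_fd() from a small main should yield 1 on a Linux host where loopback networking is available; the listening descriptor stays usable for further accept calls.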
2. epoll_create allocates an eventpoll object
SYSCALL_DEFINE1(epoll_create1, int flags) {
    struct eventpoll *ep = NULL;
    /* Allocate and initialise the eventpoll object. */
    error = ep_alloc(&ep);
}
ep_alloc zero-initialises the structure and sets up a wait queue (init_waitqueue_head(&ep->wq)), an empty ready list, and a red-black tree (ep->rbr = RB_ROOT) to store registered sockets.
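The empty ready list and empty tree can be observed indirectly from user space. The sketch below is an illustration only, with a hypothetical helper name (fresh_epoll_is_empty): on a new instance, a non-blocking epoll_wait should report no events, and removing a descriptor that was never added should fail with ENOENT.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a new epoll instance reports no ready
 * events and no registered descriptors, 0 otherwise, -1 on setup error. */
int fresh_epoll_is_empty(void) {
    int pipefd[2];
    struct epoll_event out;

    int efd = epoll_create1(0);
    if (efd < 0) return -1;
    if (pipe(pipefd) < 0) {
        close(efd);
        return -1;
    }

    /* Timeout 0: look at the ready list once and return immediately. */
    int nready = epoll_wait(efd, &out, 1, 0);

    /* Deleting a descriptor that was never added fails with ENOENT. */
    int rc = epoll_ctl(efd, EPOLL_CTL_DEL, pipefd[0], NULL);
    int del_errno = (rc < 0) ? errno : 0;

    close(pipefd[0]);
    close(pipefd[1]);
    close(efd);
    return (nready == 0 && rc == -1 && del_errno == ENOENT) ? 1 : 0;
}
```

The pipe here only supplies a valid descriptor to ask about; any open file that supports polling would serve the same purpose.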
3. epoll_ctl registers a socket
SYSCALL_DEFINE4(epoll_ctl, int epfd, int op, int fd, struct epoll_event __user *event) {
    struct eventpoll *ep;
    /* Look up the epoll instance behind epfd. */
    struct file *file = fget(epfd);
    ep = file->private_data;
    /* Look up the target file (here, the socket) behind fd. */
    struct file *tfile = fget(fd);
    switch (op) {
    case EPOLL_CTL_ADD:
        error = ep_insert(ep, event, tfile, fd);
        break;
    }
}
ep_insert allocates an epitem, records the socket's file and descriptor in it, registers a poll callback (ep_poll_callback) on the socket's wait queue, and inserts the epitem into the epoll object's red-black tree.
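One user-visible consequence of keeping registrations in a keyed tree is that the same descriptor cannot be added twice to one epoll instance. A short sketch, assuming a pipe as the monitored file and a hypothetical helper name (duplicate_add_errno), shows the error reported by a second EPOLL_CTL_ADD:

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno produced by adding the same
 * descriptor a second time (EEXIST is expected), 0 if the second add
 * unexpectedly succeeds, or -1 on setup error. */
int duplicate_add_errno(void) {
    int pipefd[2];
    int result = -1;
    struct epoll_event ev;

    int efd = epoll_create1(0);
    if (efd < 0) return -1;
    if (pipe(pipefd) < 0) {
        close(efd);
        return -1;
    }

    ev.events = EPOLLIN;       /* interested in readability */
    ev.data.fd = pipefd[0];    /* cookie handed back by epoll_wait */

    /* First add: creates the registration for this descriptor. */
    if (epoll_ctl(efd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
        /* Second add: the descriptor is already registered. */
        if (epoll_ctl(efd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
            result = errno;
        else
            result = 0;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(efd);
    return result;
}
```

To change the event mask of an existing registration, EPOLL_CTL_MOD is the intended operation rather than a second add.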
4. epoll_wait blocks until an event is ready
SYSCALL_DEFINE4(epoll_wait, int epfd, struct epoll_event __user *events, int maxevents, int timeout) {
    /* ep is the eventpoll object looked up from epfd (lookup omitted). */
    error = ep_poll(ep, events, maxevents, timeout);
}
ep_poll checks the ready list (ep->rdllist). If the list is empty, ep_poll creates a wait-queue entry for the current task, adds it to ep->wq, and puts the process to sleep with schedule_hrtimeout_range.
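The two outcomes of the ready-list check can be seen from user space with a socket pair. This is a sketch under assumptions, not the article's code: the helper name wait_before_and_after_write is hypothetical, and an AF_UNIX socket pair stands in for a TCP connection so the example needs no network.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: stores the epoll_wait result before any data is
 * written in *before and the result after one byte is written in *after.
 * Returns 0 on success, -1 on error. */
int wait_before_and_after_write(int *before, int *after) {
    int sv[2];
    struct epoll_event ev, out;

    int efd = epoll_create1(0);
    if (efd < 0) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(efd);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    int rc = epoll_ctl(efd, EPOLL_CTL_ADD, sv[0], &ev);
    if (rc == 0) {
        /* Nothing written yet: with timeout 0 the call returns at once. */
        *before = epoll_wait(efd, &out, 1, 0);

        if (write(sv[1], "x", 1) != 1) {
            rc = -1;
        } else {
            /* Data is pending on sv[0], so this returns without sleeping
             * for the full timeout of 1000 ms. */
            *after = epoll_wait(efd, &out, 1, 1000);
            if (*after == 1 && out.data.fd != sv[0])
                rc = -1;
        }
    }

    close(sv[0]);
    close(sv[1]);
    close(efd);
    return rc;
}
```

A timeout of -1 instead of 0 would make the first call block until data arrives, which is the sleeping path described above.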
5. Data arrival triggers the callback chain
When a TCP packet is received, tcp_v4_rcv finds the matching socket, the data is queued onto sk->sk_receive_queue, and sk->sk_data_ready is called. sock_init_data set this pointer to sock_def_readable, which invokes wake_up_interruptible_sync_poll on the socket's wait queue. The entry on that queue is the one registered during epoll_ctl, so its callback, ep_poll_callback, runs: it moves the epitem to the epoll ready list and wakes any process waiting on ep->wq via default_wake_function. The woken process resumes inside ep_poll, finds the ready list non-empty, and returns the event to user space.
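The callback chain can be modelled in a few lines of user-space C. The sketch below is a toy, not kernel code: every toy_ name is hypothetical, locking is left out, the red-black tree is not modelled, and the wake-up is reduced to a counter (the kernel only wakes when a task is actually sleeping on ep->wq). It only shows how a function pointer stored on the socket side links data arrival to the ready list.

```c
#include <stddef.h>

struct toy_epitem;

/* Toy stand-in for struct eventpoll: a ready list plus a wake-up counter. */
struct toy_eventpoll {
    struct toy_epitem *rdllist_head;
    struct toy_epitem *rdllist_tail;
    int wakeups;                  /* wake-up attempts on the toy "wq" */
};

/* Toy stand-in for struct epitem: one registered descriptor. */
struct toy_epitem {
    int fd;
    int on_ready_list;
    struct toy_eventpoll *ep;
    struct toy_epitem *next;
};

/* Toy stand-in for a socket: one wait-queue entry holding a callback. */
struct toy_sock {
    int rx_bytes;
    void (*wait_cb)(void *priv);
    void *wait_priv;
};

/* Plays the role of ep_poll_callback: queue the item once, then wake. */
void toy_ep_poll_callback(void *priv) {
    struct toy_epitem *epi = priv;
    struct toy_eventpoll *ep = epi->ep;
    if (!epi->on_ready_list) {
        epi->next = NULL;
        if (ep->rdllist_tail) ep->rdllist_tail->next = epi;
        else ep->rdllist_head = epi;
        ep->rdllist_tail = epi;
        epi->on_ready_list = 1;
    }
    ep->wakeups++;
}

/* Plays the role of ep_insert: hook the callback onto the socket. */
void toy_ep_insert(struct toy_eventpoll *ep, struct toy_epitem *epi,
                   struct toy_sock *sk, int fd) {
    epi->fd = fd;
    epi->ep = ep;
    sk->wait_cb = toy_ep_poll_callback;
    sk->wait_priv = epi;
}

/* Plays the role of the receive path ending in sk_data_ready. */
void toy_packet_arrives(struct toy_sock *sk, int len) {
    sk->rx_bytes += len;
    if (sk->wait_cb) sk->wait_cb(sk->wait_priv);
}

/* Plays the role of ep_poll's fast path: drain up to maxevents items. */
int toy_ep_poll(struct toy_eventpoll *ep, int *fds, int maxevents) {
    int n = 0;
    while (ep->rdllist_head && n < maxevents) {
        struct toy_epitem *epi = ep->rdllist_head;
        ep->rdllist_head = epi->next;
        if (!ep->rdllist_head) ep->rdllist_tail = NULL;
        epi->next = NULL;
        epi->on_ready_list = 0;
        fds[n++] = epi->fd;
    }
    return n;
}
```

With two sockets registered and two packets arriving on only one of them, toy_ep_poll reports that single descriptor once. The cost of collecting events follows the number of ready sockets rather than the number of registered ones.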
Conclusion
epoll's efficiency stems from three design choices: per-socket wait-queue callbacks that report readiness as it happens, a red-black tree that keeps adding, finding, and removing registrations at O(log N), and a ready list that lets epoll_wait return immediately when work is available. Together they avoid the costly loop over every monitored socket that older mechanisms such as select and poll perform on each call.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
