
Deep Dive into Linux epoll: Kernel Implementation and Mechanism

This article thoroughly explains how Linux epoll works internally, covering the creation of sockets via accept, the epoll data structures, the implementation of epoll_create, epoll_ctl, and epoll_wait, and how kernel callbacks propagate I/O readiness from network packets to user space.

Refining Core Development Skills

Linux processes that need to handle thousands of TCP connections must efficiently discover which sockets are readable or writable; the kernel provides this via the I/O multiplexing mechanism known as epoll. The article walks through the complete kernel path, from socket creation with accept to event notification with epoll_wait, illustrating each step with real source snippets.

1. accept creates a new socket

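A typical epoll-based server follows this outline (arguments elided):

int main(){
    listen(lfd, ...);                          /* lfd is the listening socket */
    cfd1 = accept(...);                        /* each accept returns a new connected socket */
    cfd2 = accept(...);
    efd = epoll_create(...);                   /* create the epoll instance */
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);  /* register both connections */
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);
    epoll_wait(efd, ...);                      /* block until one of them is ready */
}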
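The outline above elides its arguments, so here is a minimal runnable sketch of the same sequence. It assumes Linux and uses socketpair as a stand-in for the two accepted TCP connections; the function name wait_for_first_ready is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Registers two sockets with one epoll instance, makes the second one
 * readable, and returns 1 or 2 for the socket epoll_wait reports
 * (-1 on any error). socketpair() is an assumption of this sketch: it
 * replaces the accept()ed connections of the outline. */
int wait_for_first_ready(void)
{
    int a[2], b[2];
    int result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, b) < 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    int cfd1 = a[0], cfd2 = b[0];

    int efd = epoll_create1(0);
    if (efd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = cfd1 } };
        int ok = epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, &ev) == 0;
        ev.data.fd = cfd2;
        ok = ok && epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, &ev) == 0;

        /* Simulate the peer of cfd2 sending one byte. */
        ok = ok && write(b[1], "x", 1) == 1;

        if (ok) {
            struct epoll_event ready[2];
            int n = epoll_wait(efd, ready, 2, 1000); /* 1 s safety timeout */
            assert(n <= 2); /* never more than maxevents */
            if (n == 1)
                result = (ready[0].data.fd == cfd2) ? 2 : 1;
        }
        close(efd);
    }
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return result;
}
```

Calling wait_for_first_ready() from a small main should report the second socket, because only its peer wrote data.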

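Each accept call enters the kernel through accept4 (plain accept is accept4 with flags set to 0):

SYSCALL_DEFINE4(accept4, int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen, int flags)

This function allocates a new struct socket, copies the listening socket's protocol operations (ops) into it, creates a struct file for it via sock_alloc_file, calls the protocol's accept handler to take an established connection off the listening socket's queue, and finally installs the new file in the process's file descriptor table with fd_install.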
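The effect of this kernel path can be observed from user space. The sketch below is an illustration, not kernel code: it assumes Linux with a usable loopback interface, and the names accept_result and accept_demo are hypothetical. It checks that accept hands back a descriptor distinct from the listening one, that only the listening socket reports SO_ACCEPTCONN, and that the peer address returned by accept matches the client's local port.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct accept_result {
    int ok;                 /* 1 if every call in the demo succeeded */
    int distinct_fd;        /* accepted descriptor differs from the listener */
    int listener_listening; /* SO_ACCEPTCONN on the listening socket */
    int accepted_listening; /* SO_ACCEPTCONN on the accepted socket */
    int ports_match;        /* peer port from accept == client's local port */
};

struct accept_result accept_demo(void)
{
    struct accept_result r;
    memset(&r, 0, sizeof r);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int client = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = -1;

    do {
        if (lfd < 0 || client < 0)
            break;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0; /* let the kernel pick a free port */
        socklen_t len = sizeof addr;

        if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0) break;
        if (listen(lfd, 8) < 0) break;
        if (getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) break;
        if (connect(client, (struct sockaddr *)&addr, sizeof addr) < 0) break;

        struct sockaddr_in peer, local;
        socklen_t plen = sizeof peer, llen = sizeof local;
        cfd = accept(lfd, (struct sockaddr *)&peer, &plen);
        if (cfd < 0) break;
        if (getsockname(client, (struct sockaddr *)&local, &llen) < 0) break;

        socklen_t optlen = sizeof(int);
        if (getsockopt(lfd, SOL_SOCKET, SO_ACCEPTCONN,
                       &r.listener_listening, &optlen) < 0) break;
        optlen = sizeof(int);
        if (getsockopt(cfd, SOL_SOCKET, SO_ACCEPTCONN,
                       &r.accepted_listening, &optlen) < 0) break;

        assert(plen == sizeof peer); /* an IPv4 peer address was filled in */
        r.distinct_fd = (cfd != lfd);
        r.ports_match = (peer.sin_port == local.sin_port);
        r.ok = 1;
    } while (0);

    if (cfd >= 0) close(cfd);
    if (client >= 0) close(client);
    if (lfd >= 0) close(lfd);
    return r;
}
```

The listening socket stays in the listening state after the call, which is why the same lfd can be passed to accept again for the next connection.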

2. epoll_create allocates an eventpoll object

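SYSCALL_DEFINE1(epoll_create1, int flags) {
    int error;
    struct eventpoll *ep = NULL;

    /* Allocate and initialise the eventpoll object. */
    error = ep_alloc(&ep);
    /* ... remainder elided in this excerpt ... */
}

ep_alloc zero-initialises the structure and sets up a wait queue (init_waitqueue_head(&ep->wq)) for tasks blocked in epoll_wait, an empty ready list (ep->rdllist), and a red-black tree (ep->rbr = RB_ROOT) that stores the registered sockets.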

3. epoll_ctl registers a socket

SYSCALL_DEFINE4(epoll_ctl, int epfd, int op, int fd, struct epoll_event __user *event) {
    int error;
    struct eventpoll *ep;
    struct file *file, *tfile;

    file = fget(epfd);          /* the epoll instance's file */
    ep = file->private_data;    /* its eventpoll object */
    tfile = fget(fd);           /* the target socket's file */

    switch (op) {
    case EPOLL_CTL_ADD:
        error = ep_insert(ep, event, tfile, fd);
        break;
    }
}

ep_insert allocates an epitem that records the target file and descriptor together with the owning eventpoll, registers a poll callback (ep_poll_callback) on the socket's wait queue, and inserts the epitem into the epoll object's red-black tree.
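Because every registration lives in the red-black tree, the kernel can tell whether a descriptor is already registered. The sketch below shows the user-visible consequence documented in the epoll_ctl manual page: a second EPOLL_CTL_ADD for the same descriptor fails with EEXIST, and EPOLL_CTL_DEL for an unregistered descriptor fails with ENOENT. It assumes Linux, uses a pipe instead of a socket for brevity, and the names ctl_result and epoll_ctl_demo are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

struct ctl_result {
    int ok;               /* 1 if the pipe and epoll instance were created */
    int first_add;        /* return value of the first EPOLL_CTL_ADD */
    int second_add_errno; /* errno after adding the same fd again */
    int del;              /* return value of the first EPOLL_CTL_DEL */
    int del_again_errno;  /* errno after deleting the fd a second time */
};

struct ctl_result epoll_ctl_demo(void)
{
    struct ctl_result r;
    memset(&r, 0, sizeof r);

    int p[2];
    if (pipe(p) < 0)
        return r;
    int efd = epoll_create1(0);
    if (efd < 0) {
        close(p[0]);
        close(p[1]);
        return r;
    }
    r.ok = 1;

    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = p[0] } };

    /* First registration: a new entry is created for p[0]. */
    r.first_add = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);

    /* Second registration of the same fd: the lookup finds the entry. */
    errno = 0;
    int rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    r.second_add_errno = errno;
    assert(rc == -1 || r.second_add_errno == 0);

    /* Removal succeeds once, then the entry is gone. */
    r.del = epoll_ctl(efd, EPOLL_CTL_DEL, p[0], &ev);
    errno = 0;
    epoll_ctl(efd, EPOLL_CTL_DEL, p[0], &ev);
    r.del_again_errno = errno;

    close(efd);
    close(p[0]);
    close(p[1]);
    return r;
}
```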

4. epoll_wait blocks until an event is ready

SYSCALL_DEFINE4(epoll_wait, int epfd, struct epoll_event __user *events, int maxevents, int timeout) {
    /* ep is the eventpoll object looked up from epfd (lookup elided) */
    error = ep_poll(ep, events, maxevents, timeout);
}

ep_poll first checks the ready list (ep->rdllist). If the list is empty, it creates a wait-queue entry for the current task, adds it to ep->wq, and puts the process to sleep with schedule_hrtimeout_range until a callback wakes it or the timeout expires.
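The two outcomes of that ready-list check are visible from user space. In the sketch below, which assumes Linux and uses a pipe as the monitored descriptor, epoll_wait returns 0 when nothing is ready (immediately with a timeout of 0, or after sleeping for the timeout), and returns 1 without waiting for the full timeout when data is already pending. The helper name ready_count is hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe, optionally makes it readable, and
 * returns what epoll_wait reports: the number of ready descriptors,
 * 0 on timeout, or -1 on error. */
int ready_count(int make_readable, int timeout_ms)
{
    int p[2];
    int n = -1;

    if (pipe(p) < 0)
        return -1;

    int efd = epoll_create1(0);
    if (efd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = p[0] } };
        struct epoll_event out;

        int ok = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev) == 0;
        if (ok && make_readable)
            ok = write(p[1], "x", 1) == 1; /* data is pending before the wait */

        if (ok) {
            n = epoll_wait(efd, &out, 1, timeout_ms);
            if (n == 1)
                assert(out.data.fd == p[0]); /* the event carries our fd */
        }
        close(efd);
    }
    close(p[0]);
    close(p[1]);
    return n;
}
```

For example, ready_count(0, 0) and ready_count(0, 50) both report 0, while ready_count(1, 1000) reports 1 well before the one-second timeout.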

5. Data arrival triggers the callback chain

When a TCP segment arrives, tcp_v4_rcv finds the matching socket, queues the data on sk->sk_receive_queue, and calls sk->sk_data_ready. sock_init_data set this pointer to sock_def_readable, which invokes wake_up_interruptible_sync_poll on the socket's wait queue. That runs the ep_poll_callback registered by ep_insert, which appends the epitem to the epoll ready list (ep->rdllist) and wakes any process sleeping on ep->wq through default_wake_function. The woken process resumes inside ep_poll, finds the ready list non-empty, and copies the pending events to user space (ep_send_events) before epoll_wait returns.
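The callback chain can be pictured with a small user-space model. This is a deliberately simplified sketch, not kernel code: every toy_ name is hypothetical, the ready list is a fixed array instead of a linked list, locking is omitted, and the sleeping task is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_READY 8

struct toy_epitem;

/* Stand-in for struct eventpoll: a ready list plus one sleeping waiter. */
struct toy_eventpoll {
    struct toy_epitem *ready[TOY_MAX_READY];
    int nready;
    int waiter_asleep; /* 1 while a task is blocked in "epoll_wait" */
    int wakeups;       /* how many times that task was woken */
};

/* Stand-in for struct epitem: ties one socket to one epoll instance. */
struct toy_epitem {
    int fd;
    int on_ready_list;
    struct toy_eventpoll *ep;
};

/* Stand-in for a socket: received bytes plus one wait-queue entry. */
struct toy_sock {
    int rx_bytes;
    void (*wait_cb)(void *arg); /* plays the role of ep_poll_callback */
    void *wait_arg;
};

/* Role of ep_poll_callback: queue the item once, then wake the waiter. */
static void toy_poll_callback(void *arg)
{
    struct toy_epitem *epi = arg;
    assert(epi != NULL && epi->ep != NULL);
    struct toy_eventpoll *ep = epi->ep;

    if (!epi->on_ready_list && ep->nready < TOY_MAX_READY) {
        ep->ready[ep->nready++] = epi;
        epi->on_ready_list = 1;
    }
    if (ep->waiter_asleep) {
        ep->waiter_asleep = 0;
        ep->wakeups++;
    }
}

/* Role of ep_insert: hook the callback into the socket's wait queue. */
void toy_register(struct toy_eventpoll *ep, struct toy_epitem *epi,
                  struct toy_sock *sk, int fd)
{
    epi->fd = fd;
    epi->on_ready_list = 0;
    epi->ep = ep;
    sk->wait_cb = toy_poll_callback;
    sk->wait_arg = epi;
}

/* Role of tcp_v4_rcv plus sk_data_ready: queue data, run the callback. */
void toy_receive(struct toy_sock *sk, int nbytes)
{
    sk->rx_bytes += nbytes;
    if (sk->wait_cb)
        sk->wait_cb(sk->wait_arg);
}

/* Role of the tail of ep_poll: hand ready descriptors to the caller. */
int toy_collect(struct toy_eventpoll *ep, int *fds, int maxevents)
{
    int n = ep->nready < maxevents ? ep->nready : maxevents;
    for (int i = 0; i < n; i++) {
        fds[i] = ep->ready[i]->fd;
        ep->ready[i]->on_ready_list = 0;
    }
    for (int i = n; i < ep->nready; i++)
        ep->ready[i - n] = ep->ready[i];
    ep->nready -= n;
    return n;
}

/* Two sockets registered, data arrives twice on the second one.
 * Returns the single fd reported, or -1 if the model misbehaves. */
int toy_demo(void)
{
    struct toy_eventpoll ep = {0};
    struct toy_epitem e1, e2;
    struct toy_sock s1 = {0}, s2 = {0};
    int fds[4];

    toy_register(&ep, &e1, &s1, 5);
    toy_register(&ep, &e2, &s2, 6);
    ep.waiter_asleep = 1;  /* a task is blocked waiting for events */
    toy_receive(&s2, 100); /* first segment: queue item, wake waiter */
    toy_receive(&s2, 50);  /* second segment: item already queued */

    if (ep.wakeups != 1 || toy_collect(&ep, fds, 4) != 1)
        return -1;
    return fds[0];
}
```

In this model toy_demo() reports descriptor 6 exactly once even though data arrived twice, which mirrors how an item that is already on the ready list is not queued again.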

Conclusion

epoll's efficiency stems from three design choices: per-socket wait-queue callbacks that report readiness as soon as data arrives, a red-black tree that keeps adding, finding, and removing registered sockets at O(log N), and a ready list that lets epoll_wait return immediately when work is available. Together they avoid the costly loop over every registered socket that older mechanisms such as select and poll perform on each call.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Refining Core Development Skills

Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
