Unveiling Linux epoll: How the Kernel Detects Ready Sockets in Microseconds
This article provides a deep, step‑by‑step analysis of Linux's epoll mechanism, covering socket creation with accept, the internal structures of eventpoll, how epoll_ctl registers sockets, the wait‑queue interactions, and the exact code paths that move a TCP packet from the NIC to a user‑space process.
1. accept creates a new socket
When a server calls accept, the kernel allocates a new struct socket and a corresponding struct file, copies the protocol operations from the listening socket, and inserts the new file descriptor into the process's file table.
int main() {
    /* Pseudocode: arguments, declarations and error handling are elided. */
    listen(lfd, ...);
    cfd1 = accept(...);
    cfd2 = accept(...);
    efd = epoll_create(...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);
    epoll_wait(efd, ...);
}
The three functions involved in the demo are:
epoll_create: creates an epoll object and returns a file descriptor for it.
epoll_ctl: registers a file descriptor with the epoll object.
epoll_wait: blocks until at least one registered descriptor becomes ready or the timeout expires.
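The demo above is pseudocode. As a minimal runnable sketch of the same three calls, the following uses two UNIX socket pairs in place of accepted TCP connections (an assumption made so that the example needs no network setup). The helper name demo_first_ready and the 0/1 tags stored in data.u32 are hypothetical.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: registers one end of two socket pairs with a single
 * epoll object, makes only the second pair readable, and returns the tag
 * (0 or 1) reported by epoll_wait, or -1 on any failure. */
int demo_first_ready(void)
{
    int a[2], b[2];
    int result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, b) < 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }

    int efd = epoll_create1(0);                 /* create the epoll object */
    if (efd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        struct epoll_event out;

        ev.data.u32 = 0;                        /* tag for the first socket */
        int ok = epoll_ctl(efd, EPOLL_CTL_ADD, a[0], &ev) == 0;
        ev.data.u32 = 1;                        /* tag for the second socket */
        ok = ok && epoll_ctl(efd, EPOLL_CTL_ADD, b[0], &ev) == 0;

        /* Write to the peer of b[0] so that only b[0] becomes readable,
         * then wait up to one second for a single event. */
        if (ok && write(b[1], "x", 1) == 1 &&
            epoll_wait(efd, &out, 1, 1000) == 1)
            result = (int)out.data.u32;
        close(efd);
    }

    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return result;
}
```

Calling demo_first_ready() should report the tag of the socket that was written to, without the caller ever inspecting the idle one.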
2. epoll_create implementation
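Calling epoll_create allocates a struct eventpoll and installs a file descriptor for it in the caller's file table. The core fields are a wait queue (wq) that holds tasks blocked in epoll_wait, a ready list (rdllist) of registered descriptors with pending events, and a red-black tree (rbr) that indexes every registered descriptor.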
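From user space the eventpoll object is reachable only through that descriptor, which behaves like any other entry in the file table. A small sketch, assuming a Linux system with epoll_create1 available; the helper name epoll_fd_has_cloexec is hypothetical:

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: creates an epoll object with the given flags and
 * reports whether its descriptor carries FD_CLOEXEC.
 * Returns 1 if set, 0 if not set, -1 on error. */
int epoll_fd_has_cloexec(int flags)
{
    int efd = epoll_create1(flags);
    if (efd < 0)
        return -1;

    /* Ordinary descriptor operations work on the epoll fd. */
    int fdflags = fcntl(efd, F_GETFD);

    /* Closing the last reference releases the kernel object. */
    close(efd);

    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

Passing EPOLL_CLOEXEC sets the close-on-exec flag at creation time, which avoids a separate fcntl call in programs that fork and exec.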
SYSCALL_DEFINE1(epoll_create1, int, flags) {
    int error;
    struct eventpoll *ep = NULL;
    /* allocate and initialise the eventpoll object */
    error = ep_alloc(&ep);
    ...
}
The allocation routine ep_alloc zero-initialises the structure and sets up the wait queue, ready list, and red-black tree root.
static int ep_alloc(struct eventpoll **pep) {
    struct eventpoll *ep;
    ep = kzalloc(sizeof(*ep), GFP_KERNEL);
    init_waitqueue_head(&ep->wq);
    INIT_LIST_HEAD(&ep->rdllist);
    ep->rbr = RB_ROOT;
    ...
}
3. epoll_ctl adds a socket
When EPOLL_CTL_ADD is used, the kernel performs three actions:
Allocate an epitem (a red‑black‑tree node).
Register a wait‑queue entry on the socket; the callback is ep_poll_callback.
Insert the epitem into the epoll object's red‑black tree.
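One user-visible consequence of keeping registrations in a tree is that the kernel can look up an existing entry before inserting, so adding the same descriptor twice fails with EEXIST. A minimal sketch that observes this, using a pipe as the monitored file (an assumption for brevity; a socket behaves the same way here). The helper name second_add_errno is hypothetical.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers the read end of a pipe twice with the
 * same epoll object. Returns the errno of the second EPOLL_CTL_ADD,
 * 0 if it unexpectedly succeeded, or -1 if the setup failed. */
int second_add_errno(void)
{
    int p[2];
    int result = -1;

    if (pipe(p) < 0)
        return -1;

    int efd = epoll_create1(0);
    if (efd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = p[0];

        /* First registration creates the entry for this descriptor. */
        if (epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
            /* Second registration finds the existing entry and is refused. */
            if (epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev) == 0)
                result = 0;
            else
                result = errno;
        }
        close(efd);
    }

    close(p[0]);
    close(p[1]);
    return result;
}
```

To change the event mask of a descriptor that is already registered, EPOLL_CTL_MOD is the intended operation.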
static int ep_insert(struct eventpoll *ep,
                     struct epoll_event *event,
                     struct file *tfile, int fd) {
    struct epitem *epi;
    struct ep_pqueue epq;
    /* 1. allocate the epitem */
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd);
    /* 2. set up the socket wait queue: the poll call on tfile (elided here)
     * runs ep_ptable_queue_proc, which hooks ep_poll_callback onto the
     * socket's wait queue */
    epq.epi = epi;
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    /* 3. insert into the red-black tree */
    ep_rbtree_insert(ep, epi);
    ...
}
The helper ep_set_ffd stores the file pointer and descriptor number inside the epitem:
static inline void ep_set_ffd(struct epoll_filefd *ffd,
                              struct file *file, int fd) {
    ffd->file = file;
    ffd->fd = fd;
}
4. epoll_wait waits for events
epoll_wait first checks the ready list (rdllist). If it is empty, the current task is added to the epoll object's wait queue and put to sleep until an event arrives or the timeout expires.
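The ready-list check can be observed from user space by calling epoll_wait with a timeout of 0, which returns immediately instead of sleeping. The sketch below assumes a pipe as the monitored file; the helper name poll_before_and_after is hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: polls an epoll object without blocking, once before
 * and once after data is written to the monitored pipe. Stores the two
 * epoll_wait return values in *before and *after.
 * Returns 0 on success, -1 if the setup failed. */
int poll_before_and_after(int *before, int *after)
{
    int p[2];
    int rc = -1;

    if (pipe(p) < 0)
        return -1;

    int efd = epoll_create1(0);
    if (efd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN };
        struct epoll_event out;
        ev.data.fd = p[0];

        if (epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
            /* Nothing has been written yet: no event is pending. */
            *before = epoll_wait(efd, &out, 1, 0);

            /* Make the read end readable, then poll again. */
            if (write(p[1], "x", 1) == 1) {
                *after = epoll_wait(efd, &out, 1, 0);
                rc = 0;
            }
        }
        close(efd);
    }

    close(p[0]);
    close(p[1]);
    return rc;
}
```

With a timeout of 0 the first call reports no events and the second reports one, which mirrors the two outcomes of the ready-list check.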
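static int ep_poll(struct eventpoll *ep,
                   struct epoll_event __user *events,
                   int maxevents, long timeout) {
    /* declarations of wait, to, slack and timed_out are elided */
    if (!ep_events_available(ep)) {
        /* nothing ready: queue the current task on ep->wq and sleep */
        init_waitqueue_entry(&wait, current);
        __add_wait_queue_exclusive(&ep->wq, &wait);
        for (;;) {
            if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                timed_out = 1;
            ...
        }
    }
    /* copy ready events to user buffer */
    ep_send_events(ep, events, maxevents);
}
The helper ep_events_available returns true when the ready list is non-empty or when ovflist is in use (ovflist != EP_UNACTIVE_PTR), which is the case while a scan of the ready list is in progress and newly ready items are parked on the overflow list.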
static inline int ep_events_available(struct eventpoll *ep) {
    return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
}
5. Data arrives – the kernel path
When a TCP segment arrives, the NIC driver and the IP layer pass it up the stack in softirq context until it reaches tcp_v4_rcv, which looks up the matching socket and calls tcp_v4_do_rcv. For an established connection the data is queued with tcp_queue_rcv and then sk_data_ready is invoked.
int tcp_v4_rcv(struct sk_buff *skb) {
    struct sock *sk = __inet_lookup_skb(&tcp_hashinfo, skb,
                                        th->source, th->dest);
    if (!sock_owned_by_user(sk)) {
        if (!tcp_prequeue(sk, skb))
            ret = tcp_v4_do_rcv(sk, skb);
    }
}
Inside tcp_v4_do_rcv for the ESTABLISHED state:
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) {
    if (sk->sk_state == TCP_ESTABLISHED) {
        /* fast path; the kernel jumps to its reset path on a
         * non-zero return, which is elided here */
        tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len);
        return 0;
    }
    ...
}
tcp_rcv_established queues the data and calls sk->sk_data_ready:
int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
                        const struct tcphdr *th, unsigned int len) {
    eaten = tcp_queue_rcv(sk, skb, tcp_header_len, &fragstolen);
    sk->sk_data_ready(sk, 0);
    ...
}
The socket’s sk_data_ready pointer was set by sock_init_data to sock_def_readable. That function wakes the socket’s wait queue, which contains the entry for ep_poll_callback registered by epoll_ctl:
static void sock_def_readable(struct sock *sk, int len) {
    struct socket_wq *wq = rcu_dereference(sk->sk_wq);
    if (wq_has_sleeper(wq))
        wake_up_interruptible_sync_poll(&wq->wait,
            POLLIN | POLLPRI | POLLRDNORM | POLLRDBAND);
    sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
}
The wake-up walks the socket’s wait queue, invoking ep_poll_callback:
static int ep_poll_callback(wait_queue_t *wait, unsigned int mode,
                            int sync, void *key) {
    /* locking is elided in this excerpt */
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;
    /* put the epitem on the epoll object's ready list */
    list_add_tail(&epi->rdllink, &ep->rdllist);
    /* wake any task blocked in epoll_wait */
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    return 0;
}
If a process is sleeping in epoll_wait, its wait-queue entry’s private pointer points to the task struct. The wake-up eventually calls default_wake_function, which runs try_to_wake_up on that task, making the process runnable again.
int default_wake_function(wait_queue_t *curr, unsigned int mode,
                          int wake_flags, void *key) {
    return try_to_wake_up(curr->private, mode, wake_flags);
}
When the process resumes, ep_poll copies the events from rdllist back to user space and returns.
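The chain of callbacks can be mimicked in a few lines of ordinary C. What follows is a toy user-space model, not kernel code: every toy_ name is hypothetical, the ready list is a fixed array, locking is ignored, and "waking" a task just sets a flag. It only illustrates how two wait queues with different callbacks link a socket to a sleeping task.

```c
#include <stddef.h>

/* Toy model: each wait-queue entry carries a callback and a private pointer. */
struct toy_wait_entry;
typedef void (*toy_wake_fn)(struct toy_wait_entry *entry);

struct toy_wait_entry {
    toy_wake_fn func;             /* callback run when the queue is woken */
    void *private_data;           /* an epitem or a task, depending on the queue */
    struct toy_wait_entry *next;
};

struct toy_task { int runnable; };

struct toy_eventpoll {
    struct toy_wait_entry *wq;    /* tasks sleeping in epoll_wait */
    int ready[16];                /* stand-in for rdllist */
    int nready;
};

struct toy_epitem { struct toy_eventpoll *ep; int fd; };
struct toy_sock { struct toy_wait_entry *wq; };

/* Walk a wait queue and run every entry's callback. */
static void toy_wake_all(struct toy_wait_entry *head)
{
    for (struct toy_wait_entry *e = head; e != NULL; e = e->next)
        e->func(e);
}

/* Plays the role of default_wake_function: mark the task runnable. */
static void toy_default_wake(struct toy_wait_entry *entry)
{
    struct toy_task *task = entry->private_data;
    task->runnable = 1;
}

/* Plays the role of ep_poll_callback: queue the item, then wake sleepers. */
static void toy_ep_poll_callback(struct toy_wait_entry *entry)
{
    struct toy_epitem *epi = entry->private_data;
    struct toy_eventpoll *ep = epi->ep;

    if (ep->nready < 16)
        ep->ready[ep->nready++] = epi->fd;
    toy_wake_all(ep->wq);
}

/* Plays the role of sock_def_readable: wake the socket's wait queue. */
static void toy_sock_data_ready(struct toy_sock *sk)
{
    toy_wake_all(sk->wq);
}

/* Wires the pieces together and simulates data arriving on descriptor 7
 * (an arbitrary number). Returns the descriptor found on the ready list
 * if the task was woken, otherwise -1. */
int toy_demo(void)
{
    struct toy_task task = { 0 };
    struct toy_eventpoll ep = { 0 };
    struct toy_epitem epi = { &ep, 7 };
    struct toy_wait_entry task_wait = { toy_default_wake, &task, NULL };
    struct toy_wait_entry sock_wait = { toy_ep_poll_callback, &epi, NULL };
    struct toy_sock sk = { &sock_wait };

    ep.wq = &task_wait;           /* epoll_wait: the task sleeps on ep.wq */
    toy_sock_data_ready(&sk);     /* data arrives on the socket */

    return (task.runnable && ep.nready == 1) ? ep.ready[0] : -1;
}
```

The point of the model is that the socket never needs to know about the task: it only wakes its own queue, and the epoll entry on that queue forwards the event.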
Summary
Linux epoll combines three kernel mechanisms: a red-black tree for O(log n) insertion and lookup of registered sockets, a ready list that hands back pending events without scanning every registered socket, and per-socket wait queues whose callbacks chain (sock_def_readable → ep_poll_callback → default_wake_function) to wake a task sleeping in epoll_wait. Because the cost of epoll_wait depends on the number of ready sockets rather than the number registered, a single process can efficiently monitor tens of thousands of TCP connections with little overhead per event.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.