How epoll Handles Millions of Connections Efficiently: A Deep Dive into Linux I/O Multiplexing
This article explains why traditional select/poll struggle with massive connections, how epoll's event-driven design using a red‑black tree and ready‑list dramatically improves scalability, details its two trigger modes, and provides a complete C demo illustrating a high‑performance reactor model.
Imagine a scenario where a server maintains one million TCP connections, but at any moment only a few hundred are active. With traditional select/poll, the application must pass the entire descriptor set to the kernel on every call, and the kernel then scans all one million sockets. The cost of copying and scanning grows with the number of connections rather than the number of active ones, which in practice limits such servers to a few thousand connections.
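To make that cost concrete, here is a minimal sketch of a poll()-based check. The helper name count_ready_with_poll, the NFDS constant, and the use of pipes in place of TCP sockets are all assumptions made for illustration. The point is that the caller hands the whole array to the kernel and then walks every entry itself, even though only a few descriptors are ready.

```c
#include <poll.h>
#include <unistd.h>

#define NFDS 64   /* illustrative size; stands in for "one million" */

/* Hypothetical helper: creates NFDS pipes, makes `active` of them readable,
 * then polls them all. Returns the number of readable descriptors and stores
 * in *scanned how many array entries user space had to inspect.
 * Returns -1 on failure. */
int count_ready_with_poll(int active, int *scanned)
{
    int pipes[NFDS][2];
    struct pollfd pfds[NFDS];
    int created = 0, ready = 0, result = -1;

    *scanned = 0;
    if (active < 0 || active > NFDS)
        return -1;

    for (; created < NFDS; ++created) {
        if (pipe(pipes[created]) < 0)
            goto cleanup;
        pfds[created].fd = pipes[created][0];
        pfds[created].events = POLLIN;
        pfds[created].revents = 0;
    }

    /* Only the first `active` pipes receive data. */
    for (int i = 0; i < active; ++i)
        if (write(pipes[i][1], "x", 1) != 1)
            goto cleanup;

    /* The full array crosses into the kernel on every call. */
    if (poll(pfds, NFDS, 0) < 0)
        goto cleanup;

    /* User space must then inspect every entry to find the ready ones. */
    for (int i = 0; i < NFDS; ++i) {
        ++*scanned;
        if (pfds[i].revents & POLLIN)
            ++ready;
    }
    result = ready;

cleanup:
    for (int i = 0; i < created; ++i) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    return result;
}
```

Calling count_ready_with_poll(3, &scanned) from a main function reports three ready descriptors while scanned ends up equal to NFDS, which is the imbalance described above.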
epoll solves this by splitting monitoring into three separate calls on one kernel object, which the process refers to through a file descriptor: creating an epoll instance, registering sockets, and waiting for events. The process creates a single epoll instance at startup and adds or removes sockets as connections come and go, so epoll_wait can retrieve ready events without traversing every connection.
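The three steps can be sketched as follows. This is only an illustration under stated assumptions: the helper name epoll_three_steps is hypothetical, and a pipe stands in for a network socket so the example needs no peer.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper showing the three steps on a pipe instead of a socket.
 * Returns 1 if epoll_wait reported the registered descriptor, 0 if it did
 * not, and -1 on setup failure. */
int epoll_three_steps(void)
{
    int fds[2];
    int result = -1;
    int n;
    struct epoll_event ev = {0};
    struct epoll_event ready;

    /* Step 1: create the epoll instance once. The size argument is
     * ignored by modern kernels but must be greater than zero. */
    int epfd = epoll_create(1);
    if (epfd < 0)
        return -1;
    if (pipe(fds) < 0) {
        close(epfd);
        return -1;
    }

    /* Step 2: register the descriptor; this happens once per connection,
     * not once per wait. */
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0)
        goto done;

    /* Make the read end readable so there is something to report. */
    if (write(fds[1], "x", 1) != 1)
        goto done;

    /* Step 3: wait; only ready descriptors come back. */
    n = epoll_wait(epfd, &ready, 1, 1000);
    if (n < 0)
        goto done;
    result = (n == 1 && ready.data.fd == fds[0]) ? 1 : 0;

done:
    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return result;
}
```

In a real server, step 1 runs once at startup, step 2 runs when a connection is accepted or closed, and only step 3 sits on the hot path.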
epoll Principle Explained
When a program calls epoll_create, the Linux kernel allocates an eventpoll structure. Its key members are a red-black tree (rbr) that stores all registered events and a doubly-linked list (rdllist) that holds events ready to be returned to user space.
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
The internal structures look like this:
struct eventpoll {
    /* root of the red-black tree storing all added events */
    struct rb_root rbr;
    /* list of events ready to be returned */
    struct list_head rdllist;
    ...
};
Each registered event is represented by an epitem structure:
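struct epitem {
    struct rb_node rbn;        // node in the red-black tree
    struct list_head rdllink;  // node in the ready list
    struct epoll_filefd ffd;   // the monitored file: its struct file pointer and fd number
    struct eventpoll *ep;      // back-reference to the owning eventpoll object
    struct epoll_event event;  // event mask and user data
    ...
};
When a descriptor is registered with epoll_ctl, the kernel also installs a callback (ep_poll_callback) on the file's wait queue; when the file becomes ready, the callback links its epitem into rdllist. When epoll_wait is invoked, the kernel therefore only needs to check whether rdllist contains any epitem entries. If the list is non-empty, the ready events are copied into the caller's buffer in user space and their count is returned. No shared memory is involved, but the copy is proportional to the number of ready events rather than the number of registered sockets, which keeps the wait fast even with millions of sockets.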
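The effect of the ready list can be observed from user space. The sketch below is illustrative only: count_ready_with_epoll and NPIPES are hypothetical names, and pipes stand in for sockets. It registers many descriptors, makes a few readable, and checks that epoll_wait hands back just those few, each tagged with the index stored at registration time.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 64   /* illustrative number of registered descriptors */

/* Hypothetical helper: registers NPIPES pipe read ends, makes the first
 * `active` of them readable, and returns how many events epoll_wait
 * reports. Returns -1 on failure or if an unexpected descriptor shows up. */
int count_ready_with_epoll(int active)
{
    int pipes[NPIPES][2];
    struct epoll_event ev = {0};
    struct epoll_event ready[NPIPES];
    int created = 0, result = -1, n;

    if (active < 0 || active > NPIPES)
        return -1;

    int epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    for (; created < NPIPES; ++created) {
        if (pipe(pipes[created]) < 0)
            goto cleanup;
        ev.events = EPOLLIN;
        ev.data.u32 = (unsigned)created;   /* user data: index of this pipe */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[created][0], &ev) < 0) {
            ++created;                     /* this pipe exists; close it too */
            goto cleanup;
        }
    }

    for (int i = 0; i < active; ++i)
        if (write(pipes[i][1], "x", 1) != 1)
            goto cleanup;

    /* Only entries from the ready list are copied into `ready`. */
    n = epoll_wait(epfd, ready, NPIPES, 0);
    if (n < 0)
        goto cleanup;

    /* Every returned event must belong to a pipe that was written to. */
    for (int i = 0; i < n; ++i)
        if (ready[i].data.u32 >= (unsigned)active)
            goto cleanup;
    result = n;

cleanup:
    for (int i = 0; i < created; ++i) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);
    return result;
}
```

The application never touches the idle descriptors: the loop over the results runs n times, not NPIPES times.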
epoll Trigger Modes
epoll supports two trigger modes:
LT (Level‑Triggered) – the default mode. As long as a file descriptor has readable data, every call to epoll_wait will return it, prompting the application to read.
ET (Edge-Triggered) – “high-speed” mode. The kernel notifies the application only when a new event arrives. After receiving a notification, the application must keep reading until recv returns -1 with errno set to EAGAIN (or EWOULDBLOCK), which requires the socket to be non-blocking; otherwise, no further notifications will be generated for the remaining data.
ET avoids repeated wake-ups for descriptors that simply remain readable, so each epoll_wait call reports only descriptors with new activity.
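The difference between the two modes can be demonstrated with a small experiment. This is a sketch under assumptions: drain_nonblocking and count_notifications are hypothetical helper names, and a pipe replaces the socket, so read is used where a socket program would call recv. Data is written once, epoll_wait is called several times without reading, and the helper counts how often the descriptor is reported.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Reads until the kernel reports EAGAIN, as edge-triggered mode requires.
 * Returns the number of bytes drained, or -1 on a real error. */
long drain_nonblocking(int fd)
{
    char buf[16];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            total += n;
        else if (n == 0)
            return total;                          /* writer closed */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                          /* fully drained */
        else if (errno != EINTR)
            return -1;
    }
}

/* Hypothetical experiment: write 4 bytes once, then call epoll_wait `calls`
 * times without reading. Pass 0 for level-triggered or EPOLLET for
 * edge-triggered. Returns how many calls reported the descriptor,
 * or -1 on failure. */
int count_notifications(unsigned extra_flags, int calls)
{
    int fds[2], hits = 0;
    struct epoll_event ev = {0};
    struct epoll_event ready;

    int epfd = epoll_create(1);
    if (epfd < 0)
        return -1;
    if (pipe(fds) < 0) {
        close(epfd);
        return -1;
    }

    /* Non-blocking mode is what lets the drain loop stop at EAGAIN. */
    fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL, 0) | O_NONBLOCK);

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0 ||
        write(fds[1], "ping", 4) != 4) {
        hits = -1;
        goto done;
    }

    for (int i = 0; i < calls; ++i) {
        int n = epoll_wait(epfd, &ready, 1, 0);
        if (n < 0) {
            hits = -1;
            goto done;
        }
        if (n == 1)
            ++hits;   /* LT reports every time; ET only the first time */
    }

    /* The unread data is still there and must be drained explicitly. */
    if (drain_nonblocking(fds[0]) != 4)
        hits = -1;

done:
    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return hits;
}
```

With three calls, the level-triggered run reports the descriptor three times and the edge-triggered run reports it once, which is why an ET handler has to drain the descriptor before returning to epoll_wait.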
epoll Reactor Model
The classic epoll flow is:
epoll_create(); // create the red‑black tree
epoll_ctl(); // add listening fd
epoll_wait(); // wait for events
// on accept: create client fd, add to epoll, handle read/write callbacks
The reactor model extends this by dynamically switching callbacks based on the current state of each connection (read → write → read …), allowing a single thread to manage thousands of concurrent connections.
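The callback switching at the heart of the reactor can be shown in isolation. The following is a minimal sketch, not the demo program itself: struct conn, switch_state, on_readable, on_writable, and reactor_echo_once are hypothetical names, and a socketpair replaces a real TCP connection so one process can play both client and server.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

struct conn;
typedef void (*handler_fn)(int epfd, struct conn *c);

/* Hypothetical per-connection state; the handler pointer is the state. */
struct conn {
    int fd;
    handler_fn handler;
    char buf[128];
    ssize_t len;
    int done;               /* set after one read -> write cycle */
};

static void on_readable(int epfd, struct conn *c);
static void on_writable(int epfd, struct conn *c);

/* Point the connection at a new handler and change the watched event. */
static void switch_state(int epfd, struct conn *c, handler_fn next,
                         unsigned events)
{
    struct epoll_event ev = {0};
    ev.events = events;
    ev.data.ptr = c;
    c->handler = next;
    epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
}

static void on_readable(int epfd, struct conn *c)
{
    c->len = recv(c->fd, c->buf, sizeof(c->buf), 0);
    if (c->len > 0)
        switch_state(epfd, c, on_writable, EPOLLOUT);   /* read -> write */
    else
        c->done = 1;                                    /* closed or error */
}

static void on_writable(int epfd, struct conn *c)
{
    send(c->fd, c->buf, (size_t)c->len, 0);             /* echo it back */
    switch_state(epfd, c, on_readable, EPOLLIN);        /* write -> read */
    c->done = 1;
}

/* Runs one echo round trip. Returns the reply length, or -1 on failure. */
ssize_t reactor_echo_once(const char *msg, char *reply, size_t cap)
{
    int sv[2];
    struct conn c;
    struct epoll_event ev = {0};
    struct epoll_event ready;
    ssize_t got = -1;

    int epfd = epoll_create(1);
    if (epfd < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(epfd);
        return -1;
    }

    memset(&c, 0, sizeof(c));
    c.fd = sv[0];
    c.handler = on_readable;
    ev.events = EPOLLIN;
    ev.data.ptr = &c;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, c.fd, &ev) < 0)
        goto done;
    if (send(sv[1], msg, strlen(msg), 0) < 0)
        goto done;

    /* Event loop: dispatch to whichever handler the connection holds now. */
    while (!c.done) {
        if (epoll_wait(epfd, &ready, 1, 1000) <= 0)
            goto done;
        struct conn *cur = ready.data.ptr;
        cur->handler(epfd, cur);
    }
    got = recv(sv[1], reply, cap, 0);

done:
    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return got;
}
```

The event loop never inspects what kind of event it received; it only calls the handler currently stored in the connection, and each handler decides what the next state is.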
A complete demonstration program is provided below. It defines a myevent_s structure to hold per‑connection state, registers callbacks for accept, read, and write, and runs an event loop that continuously calls epoll_wait to dispatch events.
#include <stdio.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#define MAX_EVENTS 1024 /* maximum number of connections */
#define BUFLEN 4096 /* buffer size */
#define SERV_PORT 6666 /* listening port */
struct myevent_s {
    int fd;              // socket descriptor
    int events;          // EPOLLIN or EPOLLOUT
    void *arg;           // user data
    void (*call_back)(int, int, void *);
    int status;          // 1 = in epoll tree, 0 = not
    char buf[BUFLEN];
    int len;
    long last_active;    // timestamp of last activity
};
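int g_efd; // global epoll fd
struct myevent_s g_events[MAX_EVENTS+1]; // the last slot is reserved for the listening socket
/* forward declarations: these callbacks are referenced before they are defined */
void recvdata(int fd, int events, void *arg);
void senddata(int fd, int events, void *arg);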
void eventset(struct myevent_s *ev, int fd, void (*cb)(int,int,void*), void *arg) {
    ev->fd = fd;
    ev->call_back = cb;
    ev->events = 0;
    ev->arg = arg;
    ev->status = 0;
    ev->len = 0;
    ev->last_active = time(NULL);
}
void eventadd(int efd, int events, struct myevent_s *ev) {
    struct epoll_event epv = {0, {0}};
    int op = (ev->status == 0) ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
    epv.data.ptr = ev;
    epv.events = ev->events = events;
    if (epoll_ctl(efd, op, ev->fd, &epv) < 0) {
        printf("event add failed [fd=%d], events[%0X]\n", ev->fd, events);
    } else {
        ev->status = 1; // mark the slot as registered only once epoll_ctl succeeds
        printf("event add OK [fd=%d], events[%0X]\n", ev->fd, events);
    }
}
void eventdel(int efd, struct myevent_s *ev) {
    if (ev->status != 1) return;
    struct epoll_event epv = {0, {0}};
    ev->status = 0;
    epv.data.ptr = NULL;
    epoll_ctl(efd, EPOLL_CTL_DEL, ev->fd, &epv);
}
void acceptconn(int lfd, int events, void *arg) {
    struct sockaddr_in cin;
    socklen_t len = sizeof(cin);
    int cfd = accept(lfd, (struct sockaddr *)&cin, &len);
    if (cfd < 0) { perror("accept"); return; }
    // add O_NONBLOCK without discarding the descriptor's existing flags
    fcntl(cfd, F_SETFL, fcntl(cfd, F_GETFL, 0) | O_NONBLOCK);
    int i;
    // find a free slot; the last slot belongs to the listening socket
    for (i = 0; i < MAX_EVENTS; ++i) {
        if (g_events[i].status == 0) break;
    }
    if (i == MAX_EVENTS) { printf("max connections reached\n"); close(cfd); return; }
    eventset(&g_events[i], cfd, recvdata, &g_events[i]);
    eventadd(g_efd, EPOLLIN, &g_events[i]);
    printf("new connection [%s:%d]\n", inet_ntoa(cin.sin_addr), ntohs(cin.sin_port));
}
void recvdata(int fd, int events, void *arg) {
    struct myevent_s *ev = (struct myevent_s *)arg;
    // leave one byte free so the buffer can be NUL-terminated below
    int len = recv(fd, ev->buf, sizeof(ev->buf) - 1, 0);
    eventdel(g_efd, ev);
    if (len > 0) {
        ev->buf[len] = '\0';
        printf("C[%d]: %s\n", fd, ev->buf);
        eventset(ev, fd, senddata, ev);
        ev->len = len; // set after eventset, which resets len to 0
        eventadd(g_efd, EPOLLOUT, ev);
    } else {
        close(fd);
        printf("connection closed [fd=%d]\n", fd);
    }
}
void senddata(int fd, int events, void *arg) {
    struct myevent_s *ev = (struct myevent_s *)arg;
    int len = send(fd, ev->buf, ev->len, 0);
    eventdel(g_efd, ev);
    if (len > 0) {
        printf("sent [%d] bytes to fd=%d\n", len, fd);
        // switch back to the read state for the next request
        eventset(ev, fd, recvdata, ev);
        eventadd(g_efd, EPOLLIN, ev);
    } else {
        close(fd);
        printf("send error on fd=%d\n", fd);
    }
}
void initlistensocket(int efd, short port) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) { perror("socket"); exit(1); }
    fcntl(lfd, F_SETFL, O_NONBLOCK);
    struct sockaddr_in sin = {0};
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);
    if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) < 0) { perror("bind"); exit(1); }
    if (listen(lfd, 20) < 0) { perror("listen"); exit(1); }
    // the listening socket uses the reserved last slot
    eventset(&g_events[MAX_EVENTS], lfd, acceptconn, &g_events[MAX_EVENTS]);
    eventadd(efd, EPOLLIN, &g_events[MAX_EVENTS]);
}
int main(void) {
    g_efd = epoll_create(MAX_EVENTS + 1);
    if (g_efd < 0) { perror("epoll_create"); return -1; } // epoll_create returns -1 on failure
    initlistensocket(g_efd, SERV_PORT);
    struct epoll_event events[MAX_EVENTS + 1];
    printf("server running on port %d\n", SERV_PORT);
    while (1) {
        int nfd = epoll_wait(g_efd, events, MAX_EVENTS + 1, 1000);
        if (nfd < 0) {
            if (errno == EINTR) continue; // interrupted by a signal; retry
            perror("epoll_wait");
            break;
        }
        for (int i = 0; i < nfd; ++i) {
            struct myevent_s *ev = (struct myevent_s *)events[i].data.ptr;
            // dispatch only if the reported event matches what this slot is waiting for
            if ((events[i].events & EPOLLIN) && (ev->events & EPOLLIN))
                ev->call_back(ev->fd, events[i].events, ev->arg);
            if ((events[i].events & EPOLLOUT) && (ev->events & EPOLLOUT))
                ev->call_back(ev->fd, events[i].events, ev->arg);
        }
    }
    return 0;
}
In summary, epoll combines a red-black tree for fast registration/deregistration with a ready-list for instant event retrieval, and its edge-triggered mode further reduces unnecessary wake-ups, enabling servers to efficiently handle hundreds of thousands to millions of concurrent connections.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
