Unlocking Epoll: A Deep Dive into Linux’s High‑Performance I/O Mechanism
This article explores Linux’s epoll interface in depth, covering its core architecture, LT and ET trigger modes, underlying red‑black tree and linked‑list data structures, callback workflow, practical code examples, and best‑practice guidelines for high‑concurrency network applications.
Epoll Core Working Principle
Epoll is a high‑performance I/O multiplexing interface available since Linux 2.6. In high‑concurrency scenarios it outperforms select/poll because the kernel tracks readiness through event callbacks instead of rescanning every registered descriptor on each call.
1.1 What is Epoll
Epoll enhances select/poll by notifying the application only about active file descriptors, avoiding linear scans of all descriptors. This dramatically reduces CPU usage when many descriptors are idle.
1.2 Core Interfaces
Epoll provides three essential system calls:
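epoll_create / epoll_create1 – creates an epoll instance and returns a file descriptor that refers to it. The size argument of epoll_create has been ignored since Linux 2.6.8 but must be greater than zero; epoll_create1 takes flags such as EPOLL_CLOEXEC instead.
epoll_ctl – adds, modifies, or removes a file descriptor and the events to monitor (EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL).
epoll_wait – blocks until at least one registered event occurs, the timeout expires, or a signal interrupts the call.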
#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
Example of creating an epoll instance:
int epfd = epoll_create1(0);
if (epfd == -1) {
    perror("epoll_create1");
    return 1;
}
Adding a descriptor:
struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = STDIN_FILENO;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
    perror("epoll_ctl");
    close(epfd);
    return 1;
}
Waiting for events:
struct epoll_event events[10];
int nfds = epoll_wait(epfd, events, 10, -1);
if (nfds == -1) {
    perror("epoll_wait");
    close(epfd);
    return 1;
}
1.3 Underlying Data Structures
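Epoll relies on a red‑black tree to manage all registered file descriptors and a doubly linked list to hold the descriptors that are ready. The same node participates in both structures, so inserting, deleting, or looking up a registration costs O(log n), while moving a node onto the ready list costs O(1). The structures listed in this section are a simplified user‑space model (note the pthread primitives), not the kernel's own definitions, but the overall layout is analogous.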
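The registration lookup described above can be observed from user space through the error codes documented for epoll_ctl: adding a descriptor that is already registered fails with EEXIST, and deleting one that is not registered fails with ENOENT. The following is a minimal sketch of that behavior; the function name registry_demo and the use of a pipe as the monitored descriptor are assumptions made for illustration, not part of the original article.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: records the errno of a duplicate ADD and of a DEL
 * on an unregistered descriptor. Returns 0 on success, -1 if setup fails. */
int registry_demo(int *dup_add_errno, int *missing_del_errno)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    int rc = -1;

    /* The first ADD inserts the descriptor into the interest list. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto out;

    /* The second ADD finds the existing entry and is rejected. */
    *dup_add_errno =
        (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1) ? errno : 0;

    /* DEL removes the entry; a second DEL no longer finds it. */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL) == -1)
        goto out;
    *missing_del_errno =
        (epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL) == -1) ? errno : 0;
    rc = 0;
out:
    close(epfd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

On a typical Linux system the two recorded values should be EEXIST and ENOENT respectively, which is the visible side of the per-descriptor lookup.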
struct epitem {
    RB_ENTRY(epitem) rbn; /* red‑black tree node */
    LIST_ENTRY(epitem) rdlink; /* ready list node */
    int rdy; /* in ready list */
    int sockfd;
    struct epoll_event event;
};
struct eventpoll {
    struct rb_root rbr; /* red‑black tree root */
    LIST_HEAD(, epitem) rdlist; /* ready list head */
    int rbcnt;
    int rdnum;
    pthread_mutex_t mtx; /* protect tree */
    pthread_spinlock_t lock; /* protect ready list */
    pthread_cond_t cond; /* wake up waiters */
};
LT vs ET Modes Detailed
2.1 LT (Level‑Triggered) Mode
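LT is the default mode. It notifies the application for as long as a descriptor remains readable or writable: the event is reported on every epoll_wait call until the condition clears, so a handler may read only part of the data and pick up the rest on the next notification. This makes LT simple to use correctly.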
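To make the repeated reporting concrete, here is a small sketch that reads a pipe in deliberately small chunks under level-triggered mode. The function name lt_partial_read_demo, the 8-byte payload, and the 4-byte read size are all assumptions chosen for the illustration.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: writes 8 bytes into a pipe, then reads them back
 * 4 bytes per notification. Returns how many times epoll_wait reported
 * the read end before it was empty, or -1 on setup failure. */
int lt_partial_read_demo(void)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* No EPOLLET flag, so this registration is level-triggered. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    int notifications = -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "abcdefgh", 8) == 8) {
        struct epoll_event out;
        char buf[4];
        notifications = 0;
        /* Timeout 0 polls without blocking; the loop ends once the
         * pipe is empty and nothing is reported. */
        while (epoll_wait(epfd, &out, 1, 0) == 1) {
            if (read(out.data.fd, buf, sizeof buf) <= 0)
                break;
            notifications++;
        }
    }
    close(epfd);
    close(p[0]);
    close(p[1]);
    return notifications;
}
```

With these sizes the function returns 2: the partial read leaves 4 bytes behind, and LT reports the descriptor again without any new write.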
#include <sys/epoll.h>
#define MAX_EVENTS 10
#define BUFFER_SIZE 1024
int main() {
    int epfd = epoll_create1(0);
    // add listening socket with EPOLLIN
    // loop: nfds = epoll_wait(...);
    // handle readable events, possibly partial reads
}
2.2 ET (Edge‑Triggered) Mode
ET triggers only when the state changes (e.g., from empty to non‑empty). The application must read/write until EAGAIN/EWOULDBLOCK to avoid missing data.
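The drain-until-EAGAIN rule can be sketched as follows. This is an illustrative example rather than a reference implementation: drain_fd and et_drain_demo are hypothetical names, a pipe stands in for a socket, and read is used in place of recv so the code works on a pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads fd until it would block or reaches EOF.
 * Returns the number of bytes read, or -1 on a real error. */
static ssize_t drain_fd(int fd)
{
    char buf[1024];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += n; continue; }
        if (n == 0) return total;              /* EOF */
        if (errno == EINTR) continue;          /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                      /* fully drained */
        return -1;
    }
}

/* Hypothetical demo: sends `payload` bytes (at most 4096) through a pipe
 * and drains them after a single edge-triggered notification. */
ssize_t et_drain_demo(size_t payload)
{
    char data[4096];
    int p[2];
    if (payload > sizeof data || pipe(p) == -1)
        return -1;
    memset(data, 'x', sizeof data);

    ssize_t drained = -1;
    int epfd = epoll_create1(0);
    int flags = fcntl(p[0], F_GETFL, 0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = p[0] };
    struct epoll_event out;

    if (epfd != -1 && flags != -1 &&
        fcntl(p[0], F_SETFL, flags | O_NONBLOCK) == 0 &&   /* ET needs non-blocking I/O */
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], data, payload) == (ssize_t)payload &&
        epoll_wait(epfd, &out, 1, 0) == 1)
        drained = drain_fd(out.data.fd);

    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return drained;
}
```

Calling et_drain_demo(3000) should return 3000 even though the buffer holds only 1024 bytes per read, because the loop keeps reading until the kernel reports EAGAIN.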
#include <fcntl.h>
#include <sys/epoll.h>
/* Returns 0 on success, -1 on failure (errno is set by fcntl). */
int setnonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
int main() {
    int epfd = epoll_create1(0);
    // add socket with EPOLLIN | EPOLLET
    // in event loop, read in a while loop until recv returns -1 && (errno==EAGAIN || errno==EWOULDBLOCK)
    return 0;
}
2.3 Comparison
LT may generate multiple notifications for the same unread data; ET generates one notification per readiness change.
LT is easier to code; ET requires non‑blocking I/O and draining the buffer completely on each notification.
ET can reduce wake‑ups and epoll_wait calls in high‑concurrency workloads with small payloads, at the cost of stricter handling.
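The difference in notification counts can be measured directly. The sketch below writes a single byte, never reads it, and polls several times; count_notifications is a hypothetical helper name and the pipe is an assumption standing in for a socket.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: writes one byte, never reads it, and polls `rounds`
 * times. Pass 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns how many polls reported the descriptor, or -1 on setup failure. */
int count_notifications(uint32_t mode_flag, int rounds)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    int count = -1;
    struct epoll_event ev = { .events = EPOLLIN | mode_flag, .data.fd = p[0] };
    struct epoll_event out;

    if (epfd != -1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        count = 0;
        for (int i = 0; i < rounds; i++) {
            /* Timeout 0: report what is ready right now, never block. */
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                count++;
        }
    }
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return count;
}
```

With three rounds, count_notifications(0, 3) returns 3 while count_notifications(EPOLLET, 3) returns 1: the unread byte keeps the level high, but the edge occurred only once.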
Callback Mechanism Explained
When a monitored descriptor becomes ready, the kernel invokes a registered callback (e.g., ep_poll_callback) that inserts the corresponding epitem into the ready list, allowing epoll_wait to return the event without scanning the entire tree.
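As a rough user-space model of that flow (not kernel code), the sketch below shows a callback appending an item to a ready list in O(1) and a harvest step that hands the ready descriptors back to the caller. Every name here (ready_item, poller, on_fd_ready, harvest) is hypothetical and chosen only for illustration; locking and wait-queue wakeups are omitted.

```c
#include <stddef.h>

/* Hypothetical, simplified model of one registered descriptor. */
struct ready_item {
    int fd;
    int on_ready_list;              /* plays the role of the rdy flag */
    struct ready_item *next_ready;
};

struct poller {
    struct ready_item *ready_head;
    struct ready_item *ready_tail;
    int ready_count;
};

/* Plays the role of the readiness callback: O(1) append, no duplicates. */
void on_fd_ready(struct poller *p, struct ready_item *it)
{
    if (it->on_ready_list)
        return;                     /* already queued, nothing to do */
    it->on_ready_list = 1;
    it->next_ready = NULL;
    if (p->ready_tail)
        p->ready_tail->next_ready = it;
    else
        p->ready_head = it;
    p->ready_tail = it;
    p->ready_count++;
}

/* Plays the role of the harvest step in epoll_wait: pops up to `max`
 * ready items in arrival order and returns how many were copied. */
int harvest(struct poller *p, int *fds, int max)
{
    int n = 0;
    while (n < max && p->ready_head) {
        struct ready_item *it = p->ready_head;
        p->ready_head = it->next_ready;
        if (!p->ready_head)
            p->ready_tail = NULL;
        it->on_ready_list = 0;
        it->next_ready = NULL;
        p->ready_count--;
        fds[n++] = it->fd;
    }
    return n;
}
```

The point of the design is that the cost of harvest depends on the number of ready items, not on the number of registered ones, which is why idle descriptors are cheap.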
// Simplified handling after epoll_wait
struct epoll_event events[MAX_EVENTS];
int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < nfds; ++i) {
    if (events[i].events & EPOLLIN) {
        char buf[BUFFER_SIZE];
        ssize_t n = recv(events[i].data.fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("Received: %s\n", buf);
        } else if (n == 0) {
            close(events[i].data.fd); /* peer closed the connection */
        }
    }
}
Application Scenarios & Selection Strategy
4.1 When to Use LT
Suitable for simple services with low concurrency (e.g., small web servers, internal APIs) where ease of development outweighs raw performance.
4.2 When to Use ET
Ideal for high‑throughput servers (e.g., Nginx, real‑time trading systems) where minimizing wake‑ups and system calls is critical.
4.3 Choosing Between Them
Consider concurrency level, latency requirements, and developer expertise: LT for low‑complexity, ET for performance‑critical workloads.
Usage Tips & Performance Optimizations
5.1 Common Pitfalls
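In ET mode, failing to drain the socket completely leaves unread data stranded: no further notification arrives until new data does, which can look like lost data or a hung connection. Always loop on recv until it fails with EAGAIN or EWOULDBLOCK. The epoll “thundering herd” problem, where one event wakes many waiting threads or processes, can be mitigated with EPOLLEXCLUSIVE (Linux 4.5+) or with SO_REUSEPORT, which gives each worker its own load‑balanced listening socket.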
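A minimal sketch of registering a shared descriptor with EPOLLEXCLUSIVE follows. The helper names add_exclusive and exclusive_demo are hypothetical, and the demo uses a pipe instead of a listening socket purely to stay self-contained; in a real server each worker would call the helper on its own epoll instance with the shared listening socket.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers fd so that a readiness event wakes only
 * one or a few of the epoll instances waiting on it. Falls back to plain
 * EPOLLIN if the system headers predate the flag. */
int add_exclusive(int epfd, int fd)
{
    struct epoll_event ev = { .data.fd = fd };
#ifdef EPOLLEXCLUSIVE
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
#else
    ev.events = EPOLLIN;
#endif
    /* EPOLLEXCLUSIVE is only accepted with EPOLL_CTL_ADD, not with MOD. */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical demo: returns 0 if the registration succeeded. */
int exclusive_demo(void)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    int rc = (epfd == -1) ? -1 : add_exclusive(epfd, p[0]);
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

Because the flag cannot be changed later with EPOLL_CTL_MOD, a descriptor registered this way has to be removed and added again if its event mask needs to change.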
5.2 Optimization Recommendations
Set an appropriate epoll_wait timeout based on workload latency needs.
Batch‑process multiple ready events to reduce overhead.
Use EPOLLONESHOT in multithreaded designs to avoid duplicate handling.
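The EPOLLONESHOT recommendation can be illustrated with a short sketch: after one delivery the descriptor is disarmed, so no second thread can pick it up, and the handler re-arms it with EPOLL_CTL_MOD when it is done. The function name oneshot_demo and the pipe are assumptions for the example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: fills results[0..2] with the return values of three
 * non-blocking epoll_wait calls: after the write, after delivery
 * (disarmed), and after re-arming. Returns 0 on success, -1 on failure. */
int oneshot_demo(int results[3])
{
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    int rc = -1;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = p[0] };
    struct epoll_event out;

    if (epfd != -1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        results[0] = epoll_wait(epfd, &out, 1, 0);  /* delivered once */
        results[1] = epoll_wait(epfd, &out, 1, 0);  /* disarmed, data still unread */
        /* The worker that finished handling the fd re-arms it. */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) == 0) {
            results[2] = epoll_wait(epfd, &out, 1, 0);
            rc = 0;
        }
    }
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

The three recorded values are 1, 0, and 1. In a thread pool, the re-arm call is typically the last step of the handler, after it has finished reading.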
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
