
Understanding epoll: High‑Performance I/O Multiplexing in Linux

This article explains the principles, advantages, and implementation details of Linux's epoll I/O multiplexing mechanism, compares it with select and poll, describes its level‑ and edge‑triggered modes, and provides practical C and Python examples for building high‑concurrency network servers.


In Linux, traditional I/O multiplexing methods like select and poll become inefficient when handling a large number of concurrent connections, whereas epoll is designed to excel in high‑concurrency scenarios.

1. Introduction to epoll

epoll is an enhanced version of poll that significantly improves CPU utilization when only a few descriptors are active among many, because it only traverses the ready list instead of the entire descriptor set.

It supports both level‑triggered (LT) and edge‑triggered (ET) modes; edge triggering in particular lets a user‑space program cache I/O readiness itself and thereby make fewer epoll_wait / epoll_pwait calls.

1.1 First impression of epoll

Compared with select (limited to 1024 descriptors and requiring full user‑kernel copying) and poll (still traverses all descriptors), epoll uses a red‑black tree for fast O(log n) insert/delete operations and a ready list (usually a doubly linked list) for O(1) access to active descriptors.

1.2 Why use epoll?

File descriptor limits : epoll has no fixed ceiling like select's FD_SETSIZE; the practical maximum is the system's open‑file limit (see /proc/sys/fs/file-max), which scales with available memory.

Efficiency : epoll processes only ready descriptors, avoiding linear scans of all descriptors.

Memory handling : epoll copies only the ready events between kernel and user space on each call, rather than the whole descriptor set. (The often‑repeated claim that epoll shares memory via mmap is incorrect, as section 3 below shows.)
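On a Linux box, the limits mentioned above are easy to inspect directly. A quick sketch (assumes Linux, since /proc/sys/fs/file-max is Linux‑specific):

```python
import resource

# System-wide ceiling on open file descriptors (Linux-specific path).
with open('/proc/sys/fs/file-max') as f:
    system_max = int(f.read().split()[0])

# This process's own descriptor limit (soft and hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(system_max, soft, hard)
```

Both values, not select's compile‑time FD_SETSIZE, bound how many descriptors an epoll instance can usefully monitor.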

2. Core principles of epoll

2.1 Working modes

epoll provides two modes:

LT (Level Triggered) : default mode; the kernel notifies the application as long as a descriptor remains ready.

ET (Edge Triggered) : notifies only when the state changes; the application must drain the descriptor completely to avoid missing events.
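The difference between the two modes can be observed directly with Python's select.epoll. In this sketch (using a pipe for brevity; the helper name ready_counts is illustrative), the same undrained data is polled three times: LT reports it every time, while ET reports it only on the initial edge.

```python
import os
import select

def ready_counts(flags):
    """Poll an undrained readable pipe three times and count notifications."""
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, flags)
    os.write(w, b'x')           # make the read end readable, but never read it
    hits = 0
    for _ in range(3):
        if ep.poll(timeout=0):  # non-blocking check of the ready list
            hits += 1
    ep.close()
    os.close(r)
    os.close(w)
    return hits

lt = ready_counts(select.EPOLLIN)                   # level-triggered (default)
et = ready_counts(select.EPOLLIN | select.EPOLLET)  # edge-triggered
print(lt, et)  # LT keeps reporting; ET fires once per state change
```

This is exactly why ET code must drain the descriptor completely: once the single notification is consumed, no further event arrives until new data causes another edge.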

2.2 Using epoll

Typical usage requires three system calls:

#include <sys/epoll.h>
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

After creating an epoll instance with epoll_create (whose size argument has been ignored since kernel 2.6.8; epoll_create1 is the modern replacement), file descriptors are added with epoll_ctl , and ready events are retrieved with epoll_wait .
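The three calls map one‑to‑one onto Python's select.epoll object, which the later examples use. A minimal sketch of the whole cycle, using a pipe instead of a socket for brevity:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()                 # epoll_create: new epoll instance
ep.register(r, select.EPOLLIN)      # epoll_ctl(EPOLL_CTL_ADD): watch read end

os.write(w, b'hello')               # make the read end ready

data = b''
for fd, mask in ep.poll(timeout=1): # epoll_wait: fetch ready events
    if mask & select.EPOLLIN:
        data = os.read(fd, 16)

ep.unregister(r)                    # epoll_ctl(EPOLL_CTL_DEL)
ep.close()
os.close(r)
os.close(w)
print(data)
```

register/modify/unregister correspond to the EPOLL_CTL_ADD/MOD/DEL operations of epoll_ctl, and poll wraps epoll_wait.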

2.3 Implementation details

When an epoll instance is created, the kernel allocates an eventpoll structure containing a red‑black tree (to store all monitored descriptors) and a ready list (to store descriptors with pending events). Each monitored descriptor is represented by an epitem that links the descriptor to the tree and the ready list.

The kernel registers a poll callback for each descriptor; when the descriptor becomes ready, the callback adds the corresponding epitem to the ready list and wakes any process sleeping in epoll_wait .

2.4 epoll workflow

epoll_create creates an anonymous file descriptor representing the epoll instance.

epoll_ctl adds, modifies, or removes descriptors, inserting the associated epitem into the red‑black tree.

epoll_wait checks the ready list; if empty, the calling process sleeps on a wait queue until the poll callback wakes it.

When awakened, epoll_wait copies ready events to user space and, for LT mode, may re‑insert the descriptor into the ready list.

3. epoll source code overview

The kernel source shows that epoll does not rely on shared memory; instead, it uses copy_from_user and __put_user for kernel‑user data transfer. Key structures include eventpoll , epitem , and various wait‑queue helpers.

4. Practical examples

4.1 Python example

import socket
import select

EOL1 = b'\n\n'
EOL2 = b'\n\r\n'
response = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\nContent-Type: text/plain\r\nContent-Length: 13\r\n\r\nHello, world!'

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
serversocket.bind(('0.0.0.0', 8080))
serversocket.listen(1)
serversocket.setblocking(0)

epoll = select.epoll()
epoll.register(serversocket.fileno(), select.EPOLLIN)

try:
    connections = {}
    requests = {}
    responses = {}
    while True:
        events = epoll.poll(1)
        for fileno, event in events:
            if fileno == serversocket.fileno():
                connection, address = serversocket.accept()
                connection.setblocking(0)
                epoll.register(connection.fileno(), select.EPOLLIN)
                connections[connection.fileno()] = connection
                requests[connection.fileno()] = b''
                responses[connection.fileno()] = response
            elif event & select.EPOLLIN:
                requests[fileno] += connections[fileno].recv(1024)
                if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
                    epoll.modify(fileno, select.EPOLLOUT)
                    print('-' * 40 + '\n' + requests[fileno].decode()[:-2])
            elif event & select.EPOLLOUT:
                byteswritten = connections[fileno].send(responses[fileno])
                responses[fileno] = responses[fileno][byteswritten:]
                if len(responses[fileno]) == 0:
                    epoll.modify(fileno, 0)
                    connections[fileno].shutdown(socket.SHUT_RDWR)
            elif event & select.EPOLLHUP:
                epoll.unregister(fileno)
                connections[fileno].close()
                del connections[fileno]
finally:
    epoll.unregister(serversocket.fileno())
    epoll.close()
    serversocket.close()

This demonstrates a minimal HTTP‑like server using epoll for non‑blocking I/O.

4.2 Integration with Tornado

Tornado, a high‑performance Python async framework, uses epoll on Linux (kqueue on BSD/macOS). The following snippet shows how Tornado registers a socket with its IOLoop, which internally relies on epoll.

import errno
import functools
import tornado.ioloop
import socket

def handle_connection(connection, address):
    data = connection.recv(1024)
    print(data)
    connection.send(data)

def connection_ready(sock, fd, events):
    while True:
        try:
            connection, address = sock.accept()
        except socket.error as e:
            if e.args[0] not in (errno.EWOULDBLOCK, errno.EAGAIN):
                raise
            return
        connection.setblocking(0)
        handle_connection(connection, address)

if __name__ == '__main__':
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setblocking(0)
    sock.bind(("", 5000))
    sock.listen(128)
    io_loop = tornado.ioloop.IOLoop.current()
    callback = functools.partial(connection_ready, sock)
    io_loop.add_handler(sock.fileno(), callback, io_loop.READ)
    io_loop.start()

5. Building a TCP server with epoll

The article provides a complete C example that creates a Unix‑domain socket, wraps it with an epoll instance, registers the listening socket, accepts new connections, and reads data from multiple clients concurrently. The server prints received messages and demonstrates how to add newly accepted client descriptors to the epoll set.

Running the server with three client threads shows the epoll loop handling connection events and data events efficiently, confirming that a single thread can manage many simultaneous connections.
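The C source itself is not reproduced here, but the same structure can be sketched in Python under a few assumptions (Unix‑domain socket path, message contents, and the serve/client helpers are all illustrative). The epoll loop registers the listening socket, adds each accepted client to the set, and collects data from all of them in a single thread:

```python
import os
import select
import socket
import tempfile
import threading

SOCK_PATH = os.path.join(tempfile.mkdtemp(), 'epoll_demo.sock')

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)
server.listen(8)
server.setblocking(False)

def serve(n_messages):
    """Run one epoll loop until n_messages client payloads have arrived."""
    ep = select.epoll()
    ep.register(server.fileno(), select.EPOLLIN)
    conns, received = {}, []
    while len(received) < n_messages:
        for fd, mask in ep.poll(timeout=1):
            if fd == server.fileno():        # connection event: accept and watch it
                conn, _ = server.accept()
                conn.setblocking(False)
                ep.register(conn.fileno(), select.EPOLLIN)
                conns[conn.fileno()] = conn
            elif mask & select.EPOLLIN:      # data event on an existing client
                data = conns[fd].recv(1024)
                if data:
                    received.append(data)
                else:                        # EOF: drop the descriptor
                    ep.unregister(fd)
                    conns.pop(fd).close()
    ep.close()
    return received

def client(msg):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect(SOCK_PATH)
    c.sendall(msg)
    c.close()

results = []
t = threading.Thread(target=lambda: results.extend(serve(3)))
t.start()
for i in range(3):                           # three concurrent clients
    threading.Thread(target=client, args=(b'msg%d' % i,)).start()
t.join()
server.close()
os.unlink(SOCK_PATH)
print(sorted(results))
```

As in the article's C version, one thread multiplexes the listening socket and every client connection through a single epoll instance.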

Tags: linux, I/O multiplexing, network programming, epoll, edge-triggered

Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.