Mastering epoll: Deep Dive into Linux I/O Multiplexing
This article thoroughly examines Linux's epoll mechanism, detailing its SLAB memory management, middle‑layer design, edge and level triggering, comparison with select/poll, and related advanced polling technologies such as /dev/poll and kqueue, while also discussing C10K/C10M challenges and practical solutions.
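Supplementary Notes on epoll
1. SLAB Memory Management
Characteristics of SLAB memory management
Objects such as epitem and eventpoll are stored in contiguous memory, which avoids fragmentation.
Freed epitem and eventpoll objects go back into an object pool for reuse, which reduces the cost of creating and destroying them.
The allocation works as follows:
Source code where epoll creates its objects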
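// eventpoll.c
ep = kzalloc(sizeof(*ep), GFP_KERNEL);
// slab.h
static inline void *kzalloc(size_t size, gfp_t gfp)
{
    return kmalloc(size, gfp | __GFP_ZERO);
}
/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
/* Slab cache used to allocate "struct eppoll_entry" */
static struct kmem_cache *pwq_cache __read_mostly;
epoll creates its objects through the SLAB allocator, which avoids fragmentation and uses object pooling to improve performance. The eventpoll object is zero-allocated with kzalloc, while epitem and eppoll_entry objects come from the dedicated caches epi_cache and pwq_cache.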
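The object-pool idea can be illustrated in user space. What follows is only a minimal sketch of the principle, not kernel code: struct item, struct item_pool, pool_init, pool_alloc, and pool_free are hypothetical names, and a fixed array stands in for a slab.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_CAPACITY 4

/* Hypothetical stand-in for a kernel object such as struct epitem. */
struct item {
    int fd;
    struct item *next_free; /* links free objects together */
};

/* One contiguous "slab" of objects plus a free list threaded through it. */
struct item_pool {
    struct item slab[POOL_CAPACITY];
    struct item *free_list;
};

static void pool_init(struct item_pool *pool)
{
    assert(pool != NULL);
    pool->free_list = NULL;
    /* Push in reverse order so that slab[0] is handed out first. */
    for (int i = POOL_CAPACITY - 1; i >= 0; i--) {
        pool->slab[i].next_free = pool->free_list;
        pool->free_list = &pool->slab[i];
    }
}

/* Pop a zeroed object from the free list, or return NULL if exhausted. */
static struct item *pool_alloc(struct item_pool *pool)
{
    struct item *it = pool->free_list;
    if (it == NULL)
        return NULL;
    pool->free_list = it->next_free;
    memset(it, 0, sizeof(*it));
    return it;
}

/* Return an object to the pool so the next allocation can reuse it. */
static void pool_free(struct item_pool *pool, struct item *it)
{
    it->next_free = pool->free_list;
    pool->free_list = it;
}
```

Allocating, freeing, and allocating again returns the same address, which is the reuse behavior described above; no new memory is requested in between.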
2. epoll Design Philosophy
epoll adopts a middle-layer design
Partial source code of struct eventpoll and struct epitem
struct eventpoll {
    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    struct list_head rdllist;
    /* Lock which protects rdllist and ovflist */
    rwlock_t lock;
    /* RB tree root used to store monitored fd structs */
    struct rb_root_cached rbr;
    /* Single linked list of epitem that happened while transferring ready events */
    struct epitem *ovflist;
};
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };
    /* List header used to link this structure to the eventpoll ready list */
    struct list_head rdllink;
    /* The file descriptor information this item refers to */
    struct epoll_filefd ffd;
    /* The "container" of this item */
    struct eventpoll *ep;
    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;
    /* The structure that describes the interested events and the source fd */
    struct epoll_event event;
};
epoll uses a middle layer that binds each monitored socket to an epitem. Monitored descriptors are kept in a red-black tree (rbr), ready items are linked on the ready list (rdllist), and the singly linked ovflist collects events that occur while ready events are being transferred to user space.
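From user space, this middle layer is reached through three system calls: epoll_create1 creates the eventpoll object, epoll_ctl adds an epitem for a descriptor, and epoll_wait collects items from the ready list. The sketch below walks through that sequence with a pipe standing in for a socket; wait_for_pipe_data is a hypothetical helper name, and error handling is reduced to a single return code.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: registers the read end of a pipe with a fresh epoll
 * instance, writes one byte, and checks that epoll_wait reports that fd.
 * Returns 0 on success and -1 on any failure.
 */
static int wait_for_pipe_data(int timeout_ms)
{
    assert(timeout_ms >= 0); /* this sketch does not support waiting forever */

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int ok = -1;
    int epfd = epoll_create1(0); /* creates the eventpoll object */
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
        struct epoll_event ready;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 && /* adds an epitem */
            write(fds[1], "x", 1) == 1 &&                       /* makes fds[0] readable */
            epoll_wait(epfd, &ready, 1, timeout_ms) == 1 &&     /* collects the ready item */
            (ready.events & EPOLLIN) && ready.data.fd == fds[0])
            ok = 0;
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

A call such as wait_for_pipe_data(1000) should return 0 on a Linux system, because the byte written to the pipe makes the registered read end ready before epoll_wait is called.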
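3. Other epoll Technical Points
Edge and level triggering
Edge-triggered: an event fires when new data arrives in the socket buffer. Level-triggered: the descriptor keeps being reported as readable for as long as the buffer is non-empty.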
// Level-triggered is the default; EPOLLET selects edge-triggered mode, and EPOLLONESHOT disarms the item after one event
list_for_each_entry_safe(epi, tmp, head, rdllink) {
    if (esed->res >= esed->maxevents)
        break;
    // Handle the wakeup source
    ws = ep_wakeup_source(epi);
    if (ws) {
        if (ws->active)
            __pm_stay_awake(ep->ws);
        __pm_relax(ws);
    }
    // Remove the epitem from the ready list
    list_del_init(&epi->rdllink);
    // Poll the item again to collect the events that are actually ready
    revents = ep_item_poll(epi, &pt, 1);
    if (!revents)
        continue;
    // Copy the ready event to user space
    if (__put_user(revents, &uevent->events) ||
        __put_user(epi->event.data, &uevent->data)) {
        list_add(&epi->rdllink, head);
        ep_pm_stay_awake(epi);
        if (!esed->res)
            esed->res = -EFAULT;
        return 0;
    }
    esed->res++;
    uevent++;
    if (epi->event.events & EPOLLONESHOT)
        epi->event.events &= EP_PRIVATE_BITS;
    else if (!(epi->event.events & EPOLLET)) {
        // Level-triggered: put the item back so the next epoll_wait re-checks it
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);
    }
}
#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
In level-triggered mode the item is put back on the ready list, so every epoll_wait call re-checks it and reports it again while unread data remains. In edge-triggered mode the item is reported only when new data arrives.
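The difference is easy to observe from user space. The following sketch, in which count_reports is a hypothetical helper name, writes a single byte into a pipe, deliberately never reads it, and counts how many consecutive epoll_wait calls report the descriptor.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes one byte into a pipe, never reads it, and
 * counts how many of `rounds` consecutive epoll_wait calls report the read
 * end. Pass 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns the count, or -1 on setup failure.
 */
static int count_reports(unsigned int extra_flags, int rounds)
{
    assert(rounds > 0);

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int reports = -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | extra_flags,
                                  .data.fd = fds[0] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
            write(fds[1], "x", 1) == 1) {
            reports = 0;
            for (int i = 0; i < rounds; i++) {
                struct epoll_event ready;
                /* Timeout 0: return at once whether or not anything is ready. */
                if (epoll_wait(epfd, &ready, 1, 0) == 1)
                    reports++;
            }
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return reports;
}
```

With three rounds, the level-triggered registration is reported on every call because the byte is still unread, while the EPOLLET registration is reported only once. This is why edge-triggered code normally reads until EAGAIN on a non-blocking descriptor before waiting again.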
Advanced polling techniques
/dev/poll
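struct dvpoll {
    struct pollfd *dp_fds; // pointer to the array that receives the returned pollfd entries
    int dp_nfds;           // number of entries dp_fds can hold
    int dp_timeout;        // timeout in milliseconds
};
wfd = open("/dev/poll", O_RDWR, 0);
write(wfd, pollfd, sizeof(struct pollfd) * MAX_SIZE); // pollfd is an array of MAX_SIZE struct pollfd entries
ioctl(wfd, DP_POLL, &dvpoll);
/dev/poll provides scalable polling on Solaris: the descriptor list is registered once up front, after which the program loops waiting for events.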
kqueue
// Returns a new kqueue descriptor
int kqueue(void);
// Registers changes and/or retrieves pending events
int kevent(int kq,
           const struct kevent *changelist, int nchanges,
           struct kevent *eventlist, int nevents,
           const struct timespec *timeout);
// Initializes a kevent structure (EV_SET is a macro, shown here in prototype form)
void EV_SET(struct kevent *kev, uintptr_t ident, short filter,
            u_short flags, u_int fflags, intptr_t data, void *udata);
// The kevent structure
struct kevent {
    uintptr_t ident;
    short filter;
    u_short flags;
    u_int fflags;
    intptr_t data;
    void *udata;
};
kqueue works on a principle similar to epoll but supports more event types; it is used mainly on FreeBSD.
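The C10K problem and solutions
C10K refers to serving ten thousand concurrent connections. Common solutions include a single thread with I/O multiplexing (select/poll/epoll/kqueue), edge triggering, AIO, thread pools, and frameworks such as Nginx, libevent, and Netty.
Mature solutions such as Nginx, libevent, and Netty are already widely used in high-concurrency scenarios.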
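The "single thread with I/O multiplexing" pattern can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not how Nginx or libevent are implemented: socketpair connections stand in for accepted TCP sockets, every message is a single byte, and echo_all_connections and MAX_CONNS are hypothetical names. A real server would also use non-blocking sockets and handle partial reads and writes.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CONNS 16

/*
 * Hypothetical demo: one thread serves num_conns socketpair "connections".
 * Each client end sends one byte and the event loop echoes it back.
 * Returns the number of correct echoes, or -1 on failure.
 */
static int echo_all_connections(int num_conns)
{
    assert(num_conns > 0 && num_conns <= MAX_CONNS);

    int server[MAX_CONNS], client[MAX_CONNS];
    int opened = 0, handled = 0, echoed = -1;
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    /* Create the connections, register each server end, send one byte. */
    while (opened < num_conns) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            goto cleanup;
        char byte = (char)('a' + opened);
        server[opened] = sv[0];
        client[opened] = sv[1];
        opened++;
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) != 0 ||
            write(sv[1], &byte, 1) != 1)
            goto cleanup;
    }

    /* Event loop: a single epoll_wait call watches every connection. */
    while (handled < num_conns) {
        struct epoll_event ready[MAX_CONNS];
        int n = epoll_wait(epfd, ready, MAX_CONNS, 1000);
        if (n <= 0)
            goto cleanup;
        for (int i = 0; i < n; i++) {
            char byte;
            int fd = ready[i].data.fd;
            if (read(fd, &byte, 1) != 1 || write(fd, &byte, 1) != 1)
                goto cleanup;
            handled++;
        }
    }

    /* Every byte was echoed, so these reads cannot block. */
    echoed = 0;
    for (int i = 0; i < num_conns; i++) {
        char byte = 0;
        if (read(client[i], &byte, 1) == 1 && byte == (char)('a' + i))
            echoed++;
    }

cleanup:
    for (int i = 0; i < opened; i++) {
        close(server[i]);
        close(client[i]);
    }
    close(epfd);
    return echoed;
}
```

Calling echo_all_connections(8) serves all eight connections from one thread. The cost of each epoll_wait call depends on how many descriptors are ready rather than how many are registered, which is what makes the pattern attractive at C10K scale.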
Xiaokun's Architecture Exploration Notes
10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
