How io_uring Integration Boosts Netpoll Throughput and Slashes Latency
This article examines the integration of Linux io_uring into ByteDance's high‑performance Netpoll NIO library, detailing architectural changes, receive/send workflows, benchmarking methodology, and results that show over 10% higher throughput and 20‑40% lower latency while cutting system calls to near zero.
Introduction
Netpoll is a high‑performance NIO network library developed by ByteDance on top of epoll, focused on RPC scenarios. Compared with Go's native net package, Netpoll provides finer control over the network layer, enabling optimizations before data reaches the business logic.
Why io_uring?
io_uring, introduced in Linux 5.1, reduces system calls through batch processing and offers a flexible asynchronous I/O framework that can handle many I/O types, improving scalability.
Integration Design
The integration replaces traditional epoll‑based I/O with io_uring on both the receive and send paths. The design keeps Netpoll's poller‑context model (one poller per 20 CPUs) and adds dedicated io_uring instances: one receive ring and multiple send rings, each created with SQ‑poll kernel threads.
Poller Contexts
During initialization, Netpoll creates a main server poller that accepts new connections and distributes them across a set of poller contexts, each handling a subset of connections.
Receive Flow
New connections trigger EPOLLIN events; the poller allocates input buffers, performs recv, and dispatches a goroutine from the pool to run the user‑registered handler.
Send Flow
Applications call connection.Flush to transmit data. The send path uses sendmsg either directly in the user context or via a kernel SQ‑poll thread, depending on the configuration.
io_uring Model
Each io_uring instance consists of a Submission Queue (SQ), a Completion Queue (CQ), and an optional provided‑buffer ring from which the kernel picks receive buffers. Entries can be singleshot (one submission, one completion) or multishot (one submission producing multiple completions), the latter being especially useful for receive operations.
Batching Differences
Both epoll and io_uring ultimately rely on vfs_poll to detect readiness, but io_uring processes batches of completions inside the kernel via a task‑work chain, reducing latency and system‑call overhead.
Benchmark Setup
A Go echo server (1 KB messages) was built with Netpoll using io_uring. The server ran on a 30‑CPU machine (19 CPUs for the app, remaining for SQ‑poll threads). Tests covered 10–1000 concurrent connections, 50 million operations per run, and measured throughput, latency (P99, P9999), and system‑call counts.
Results
Throughput increased 10‑15% for >200 connections, while latency dropped 20‑40% (P99) and 10‑20% (P9999) compared with epoll. System‑call counts dropped to zero in SQ‑poll mode and were up to 15× lower otherwise.
Conclusion
Integrating io_uring into Netpoll yields a measurable performance boost: >10% higher throughput, 20‑40% lower latency, and near‑zero system calls, all while preserving Netpoll's lightweight goroutine‑based design.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.