Does Netty’s io_uring Make the 2× CPU Thread Rule Obsolete?
A benchmark on an 8‑core Linux 6.6 system shows that switching Netty from epoll to io_uring lets a half‑sized thread pool achieve 6.5 % higher throughput, more than double the per‑thread efficiency, and a 68 % reduction in CPU migrations, at the cost of higher latency. The result challenges the traditional rule of sizing the thread pool at twice the core count.
Background
Inspired by an article claiming that halving the worker thread count can raise CPU utilisation and throughput, the author set out to verify the claim on Netty 4.2, comparing the traditional epoll transport with the newer io_uring transport introduced in that release.
Test Configurations
Three modes were benchmarked on openEuler 24.03, Linux 6.6, JDK 25, and an 8‑core Xeon Cascade Lake server:
epoll-2n: epoll transport with CPU cores × 2 worker threads (16 threads).
io_uring-n: io_uring transport with CPU cores × 1 worker threads (8 threads).
io_uring-2n: io_uring transport with CPU cores × 2 worker threads (16 threads).
All tests used a client/server model with 200 concurrent long‑lived connections, three server containers bound to the same CPU core, and a 300‑second run (including a 10‑second warm‑up). The binary protocol consisted of a 4‑byte length field, a 4‑byte CRC32, an 8‑byte sequence ID, and a payload of 64–1024 bytes.
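To make the wire format concrete, here is a minimal sketch of how one request frame could be assembled with the JDK alone. The class name FrameSketch is hypothetical, and two details are assumptions the article does not confirm: that the length field counts the bytes following it, and that the CRC32 covers the sequence ID plus the payload (which matches what the server handler computes).

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical helper: builds one frame in the layout described above. */
final class FrameSketch {
    static final int LENGTH_FIELD = 4;
    static final int CRC_FIELD = 4;
    static final int SEQ_FIELD = 8;

    /**
     * Assumptions (not confirmed by the article): the length field counts the
     * bytes that follow it, and the CRC32 covers sequence ID + payload.
     */
    static ByteBuffer encode(long seqId, byte[] payload) {
        int bodyLen = CRC_FIELD + SEQ_FIELD + payload.length;
        ByteBuffer frame = ByteBuffer.allocate(LENGTH_FIELD + bodyLen); // big-endian by default

        // Checksum the sequence ID first, then the payload.
        CRC32 crc = new CRC32();
        ByteBuffer seq = ByteBuffer.allocate(SEQ_FIELD).putLong(seqId);
        seq.flip();
        crc.update(seq);
        crc.update(payload);

        frame.putInt(bodyLen);
        frame.putInt((int) crc.getValue());
        frame.putLong(seqId);
        frame.put(payload);
        frame.flip();
        return frame;
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode(42L, new byte[64]);
        System.out.println("frame bytes = " + frame.remaining());
    }
}
```

Under these assumptions the smallest request (64‑byte payload) occupies 80 bytes on the wire and the largest (1024‑byte payload) 1040 bytes, so the fixed header is a small fraction of each message.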
Key Implementation Details
Server Architecture
The server entry point ServerMain selects the transport via the --mode flag. For io_uring a custom IoUringIoHandlerConfig is built, configuring ring size, CQ size, multishot flags, and a buffer‑ring for zero‑copy.
    IoHandlerFactory ioHandlerFactory;
    if (config.useIoUring()) {
        IoUringIoHandlerConfig ioUringConfig = new IoUringIoHandlerConfig();
        int defaultRingSize = Math.max(256,
                Integer.highestOneBit(Math.max(1, 200 / workerThreads) * 8) << 1);
        int ringSize = Integer.getInteger("bench.ringSize", Math.min(4096, defaultRingSize));
        ioUringConfig.setRingSize(ringSize);
        boolean multishotActive = IoUring.isAcceptMultishotEnabled()
                || IoUring.isRecvMultishotEnabled()
                || IoUring.isPollAddMultishotEnabled();
        if (multishotActive) {
            ioUringConfig.setCqSize(ringSize * 4);
        }
        if (IoUring.isRegisterBufferRingSupported()) {
            ioUringConfig.setBufferRingConfig(IoUringBufferRingConfig.builder()
                    .bufferGroupId((short) 0)
                    .allocator(new IoUringAdaptiveBufferRingAllocator(ByteBufAllocator.DEFAULT, 128, 512, 2048, true))
                    .bufferRingSize((short) 2048)
                    .batchAllocation(true).batchSize(1024).build());
        }
        ioHandlerFactory = IoUringIoHandler.newFactory(ioUringConfig);
    } else {
        ioHandlerFactory = EpollIoHandler.newFactory();
    }

The two transports differ in factory creation, channel type (EpollServerSocketChannel vs IoUringServerSocketChannel), system‑call model (epoll_wait vs shared‑memory SQ/CQ ring with io_uring_enter), and tunable parameters (ring size, CQ size, multishot, buffer ring, etc.).
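As a worked example of the ring‑size arithmetic in the ServerMain snippet, the sketch below evaluates the same formula for the two thread counts used in the benchmark. The class and method names are hypothetical. It assumes the bench.ringSize system property was left unset, which the article does not state.

```java
/** Worked example of the default ring-size formula from the ServerMain snippet. */
final class RingSizeSketch {
    static final int CONNECTIONS = 200; // benchmark connection count, hard-coded in the snippet

    /** Mirrors the default that applies when -Dbench.ringSize is not set (an assumption). */
    static int defaultRingSize(int workerThreads) {
        int perThread = Math.max(1, CONNECTIONS / workerThreads);
        // highestOneBit(x) << 1 rounds up to a power of two strictly above x.
        int rounded = Integer.highestOneBit(perThread * 8) << 1;
        return Math.min(4096, Math.max(256, rounded));
    }

    public static void main(String[] args) {
        for (int threads : new int[] {8, 16}) {
            int ring = defaultRingSize(threads);
            System.out.println(threads + " threads -> ring " + ring
                    + ", CQ with multishot " + (ring * 4));
        }
    }
}
```

With 8 threads the formula gives 25 connections per thread, 200 after the ×8 factor, and a rounded value of 256. With 16 threads the rounded value is 128, which the 256 floor lifts back to 256. Under the stated assumption, both io_uring modes therefore ran with the same ring size per thread, so ring size is unlikely to explain the difference between them.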
Thread Model
A single EventLoopGroup is used for both boss and worker duties; each worker thread owns its own ring (for io_uring) or epoll fd. No dedicated boss thread means accept events are distributed among workers, avoiding a dedicated ring per boss.
Server Handler
The BenchmarkHandler extends SimpleChannelInboundHandler<ByteBuf>. It reuses a single java.util.zip.CRC32 instance per handler (no concurrency) and a pre‑allocated 8‑byte ByteBuffer for the sequence ID. CRC calculation is performed directly on the underlying ByteBuffer to avoid extra byte‑array copies.
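    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.SimpleChannelInboundHandler;
    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    public class BenchmarkHandler extends SimpleChannelInboundHandler<ByteBuf> {
        private static final int HEADER_SIZE = 4 + 8; // CRC32 + SeqId
        // One instance per handler; only the channel's event-loop thread touches them.
        private final CRC32 crc32 = new CRC32();
        private final ByteBuffer seqBuf = ByteBuffer.allocate(8);

        @Override
        protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) {
            // The 4-byte length field is assumed to be stripped by an upstream frame decoder.
            int receivedCrc = msg.readInt();
            long seqId = msg.readLong();
            // Both header fields are already consumed, so what remains is the payload.
            int payloadLen = msg.readableBytes();
            crc32.reset();
            seqBuf.clear();
            seqBuf.putLong(seqId);
            seqBuf.flip();
            crc32.update(seqBuf);
            if (payloadLen > 0) {
                // Checksum the payload in place, without copying it into a byte[].
                crc32.update(msg.nioBuffer(msg.readerIndex(), payloadLen));
            }
            long computedCrc = crc32.getValue();
            boolean match = computedCrc == (receivedCrc & 0xFFFFFFFFL);
            // Response: CRC32, sequence ID, and a status byte (0 = CRC matched, 1 = mismatch).
            ByteBuf response = ctx.alloc().buffer(HEADER_SIZE + 1);
            response.writeInt((int) computedCrc);
            response.writeLong(seqId);
            response.writeByte(match ? 0 : 1);
            ctx.writeAndFlush(response);
        }
    }

Client Design (not the focus)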
Uses NioSocketChannel with a thread count equal to CPU cores.
Pre‑generates 2000 messages (64–1024 B payload) and reuses them via retainedDuplicate() for zero‑allocation sends.
Writes are batched and flushed once per loop to collapse syscalls.
Back‑pressure is applied with Thread.yield() when in‑flight requests exceed 256 × connections.
Latency is measured per‑sequence ID using a ring buffer and LongAdder histograms.
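The latency bookkeeping in the last point can be sketched with the JDK alone. This is an illustration, not the author's client code: the class LatencySketch, its fixed‑width buckets, and the one‑recorder‑per‑connection usage are all assumptions. Send timestamps live in a power‑of‑two ring indexed by sequence ID, and completed latencies are counted in LongAdder buckets.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical latency recorder: send timestamps in a ring, counts in LongAdder buckets. */
final class LatencySketch {
    private final long[] sendNanos;    // ring buffer indexed by seqId & mask
    private final int mask;
    private final LongAdder[] buckets; // bucket i counts latencies in [i, i + 1) * bucketMicros
    private final long bucketMicros;

    LatencySketch(int ringSizePowerOfTwo, int bucketCount, long bucketMicros) {
        this.sendNanos = new long[ringSizePowerOfTwo];
        this.mask = ringSizePowerOfTwo - 1;
        this.buckets = new LongAdder[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = new LongAdder();
        }
        this.bucketMicros = bucketMicros;
    }

    /** Remember when a request was sent; the slot is reused once the ring wraps. */
    void onSend(long seqId, long nowNanos) {
        sendNanos[(int) (seqId & mask)] = nowNanos;
    }

    /** Look up the send time for this sequence ID and count the latency. */
    void onResponse(long seqId, long nowNanos) {
        long micros = (nowNanos - sendNanos[(int) (seqId & mask)]) / 1_000;
        int idx = (int) Math.min(buckets.length - 1, micros / bucketMicros);
        buckets[idx].increment();
    }

    /** Upper bound (in µs) of the bucket that contains the given percentile, e.g. 0.99. */
    long percentileMicros(double p) {
        long total = 0;
        for (LongAdder b : buckets) {
            total += b.sum();
        }
        long target = (long) Math.ceil(total * p);
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i].sum();
            if (seen >= target) {
                return (i + 1) * bucketMicros;
            }
        }
        return buckets.length * bucketMicros;
    }
}
```

The ring only gives correct answers while the number of in‑flight requests per recorder stays below the ring size, which is one reason a cap on in‑flight requests matters. The plain long[] is safe only if a single event‑loop thread calls onSend and onResponse for a given recorder; the LongAdder buckets can be summed from any reporting thread.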
JVM Options
    # -XX:+UseZGC: generational ZGC, <2% pause
    # -XX:+PreserveFramePointer: keep frame pointers for perf
    ENTRYPOINT ["java", "-XX:+UseZGC", "-XX:+PreserveFramePointer", "-Xms4g", "-Xmx4g", "--enable-native-access=ALL-UNNAMED", "-jar", "/app/server.jar"]

Benchmark Results
Throughput and Latency
Metric           epoll-2n       io_uring-n     io_uring-2n
----------------------------------------------------------
Total Requests   193,966,200    206,501,000    200,146,600
QPS              ~646,554       ~688,337       ~667,155
Avg Latency      1,835 µs       2,126 µs       2,112 µs
P50 Latency      1,545 µs       1,610 µs       1,611 µs
P90 Latency      3,386 µs       4,178 µs       4,183 µs
P99 Latency      6,989 µs       8,885 µs       9,161 µs
P999 Latency     21,868 µs      25,150 µs      21,451 µs

Throughput: io_uring-n leads epoll-2n by 6.5 %. Latency: epoll-2n beats io_uring-n at every percentile (P50 4 % lower, P90 19 % lower, P99 21 % lower, P999 13 % lower). The only exception across the three modes is P999, where io_uring-2n is marginally lower than epoll-2n.
Per‑Thread Efficiency
Metric              epoll-2n (16 thr)   io_uring-n (8 thr)   io_uring-2n (16 thr)
---------------------------------------------------------------------------------
Total Requests      193,966,200         206,501,000          200,146,600
Requests / Thread   ~12.1 M             ~25.8 M              ~12.5 M
Efficiency Ratio    1.00×               2.13×                1.03×

Halving the thread count for io_uring more than doubles the number of requests each thread handles, which the author attributes to multishot recv and the buffer‑ring zero‑copy path.
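The derived figures in the two tables above follow directly from the raw request totals. The sketch below recomputes them; the class and method names are hypothetical, and it assumes QPS is the total divided by the full 300‑second run, which is consistent with the published QPS values.

```java
/** Recomputes the derived throughput and per-thread figures from the raw totals. */
final class EfficiencySketch {
    static final int RUN_SECONDS = 300; // run length stated in the test configuration

    static double qps(long totalRequests) {
        return (double) totalRequests / RUN_SECONDS;
    }

    static double requestsPerThread(long totalRequests, int threads) {
        return (double) totalRequests / threads;
    }

    /** Relative difference of a against baseline b, in percent. */
    static double pctDiff(double a, double b) {
        return (a - b) / b * 100.0;
    }

    public static void main(String[] args) {
        long epoll2n = 193_966_200L;
        long uringN = 206_501_000L;
        long uring2n = 200_146_600L;
        double base = requestsPerThread(epoll2n, 16);
        System.out.println("QPS gain, io_uring-n vs epoll-2n (%): "
                + pctDiff(qps(uringN), qps(epoll2n)));
        System.out.println("per-thread ratio, io_uring-n:  "
                + requestsPerThread(uringN, 8) / base);
        System.out.println("per-thread ratio, io_uring-2n: "
                + requestsPerThread(uring2n, 16) / base);
    }
}
```

Rounded, these come out at a 6.5 % throughput gain, a 2.13× per‑thread ratio for io_uring-n, and roughly 1.03× for io_uring-2n. Note that the per‑thread ratio is mostly a consequence of dividing a similar total by half as many threads: the 2× comes from the thread count, and the remaining 6.5 % comes from the throughput gain.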
Context Switches
Mode          Total Switches   Switches/s   CPU Migrations   Migrations/s   Migration %
---------------------------------------------------------------------------------------
epoll-2n      1,348,957        1,249        66,122           61.2           4.9%
io_uring-n    2,322,700        2,150        21,341           19.8           0.92%
io_uring-2n   2,854,520        2,643        54,251           50.2           1.9%

Although io_uring-n performs more switches per second, its CPU‑migration rate is only 32 % of epoll-2n's, reducing the expensive cross‑CPU cache/TLB flushes.
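The percentages quoted for migrations can be checked from the raw counters in the table. In this sketch the names are hypothetical, and the Migration % column is taken to mean migrations divided by total switches, which reproduces the published values.

```java
/** Recomputes the derived migration figures from the raw counters in the table above. */
final class MigrationSketch {
    /** CPU migrations as a percentage of total context switches. */
    static double migrationShare(long migrations, long switches) {
        return 100.0 * migrations / switches;
    }

    /** Migration count of one mode as a percentage of a baseline mode. */
    static double relativeToBaseline(long migrations, long baselineMigrations) {
        return 100.0 * migrations / baselineMigrations;
    }

    public static void main(String[] args) {
        long epoll2n = 66_122L;
        long uringN = 21_341L;
        long uring2n = 54_251L;
        System.out.println("io_uring-n migration share (%): "
                + migrationShare(uringN, 2_322_700L));
        System.out.println("io_uring-n vs epoll-2n (%):     "
                + relativeToBaseline(uringN, epoll2n));
        System.out.println("io_uring-2n vs io_uring-n (%):  "
                + relativeToBaseline(uring2n, uringN));
    }
}
```

io_uring-n ends up at about 32 % of the epoll-2n migration count, a reduction of roughly 68 %, while io_uring-2n sits at about 254 % of io_uring-n, an increase of roughly 154 %.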
Scheduling Overhead (__schedule frames)
Mode          __schedule frames / total frames   %
------------------------------------------------------
epoll-2n      148 / 7,270                        2.04%
io_uring-n    70 / 4,174                         1.68%
io_uring-2n   127 / 5,300                        2.40%

The lower frame share for io_uring-n reflects fewer costly migrations despite a higher switch count.
Flame‑Graph Insights
Full‑stack flame graphs show epoll_pwait in only 2.7 % of unique stacks, whereas io_uring_enter appears in about 48 %. This is consistent with a single io_uring_enter syscall doing more work per call (SQE submission, CQE consumption, and DEFER_TASKRUN task work).
Conclusions
On an 8‑core, 200‑connection TCP request‑response workload, io_uring-n yields 6.5 % higher throughput than the traditional epoll-2n configuration.
Latency is still better with epoll-2n (up to 21 % lower at the 99th percentile).
Per‑thread efficiency more than doubles for io_uring-n because multishot receive and buffer‑ring eliminate per‑request SQE reposts.
CPU migrations drop from 61.2 /s with epoll-2n to 19.8 /s with io_uring-n (‑68 %). This is a likely contributor to the throughput and per‑thread efficiency gains, although it does not translate into lower latency.
Doubling the io_uring thread count (16 thr) hurts both throughput (‑3 %) and latency (worse P99) while raising CPU migrations by 154 % and increasing __schedule frame share from 1.68 % to 2.40 %.
The findings suggest that the long‑standing rule “use 2 × CPU cores threads” is no longer optimal for Netty with io_uring; a 1 × core thread pool often delivers better overall performance.
All measurements were taken with the same binary protocol, identical JVM flags, and the same hardware configuration. The analysis demonstrates how kernel‑level I/O mechanisms, thread‑affinity, and scheduling behaviour interact to shape real‑world server performance.
