Does Netty’s io_uring Make the 2× CPU Thread Rule Obsolete?

A benchmark on an 8‑core Linux 6.6 system shows that switching Netty from epoll to io_uring lets a half‑sized thread pool achieve 6.5 % higher throughput, more than double the per‑thread efficiency, and 68 % fewer CPU migrations, at the cost of higher latency. The result challenges the traditional rule of sizing the worker pool at twice the core count.

Tech Musings

Background

Inspired by an article claiming that halving the worker thread count can raise CPU utilisation and throughput, the author set out to verify the claim on Netty 4.2, using both the traditional epoll transport and the io_uring transport introduced in that release.

Test Configurations

Three modes were benchmarked on openEuler 24.03, Linux 6.6, JDK 25, and an 8‑core Xeon Cascade Lake server:

epoll-2n: epoll transport with CPU cores × 2 worker threads (16 threads).

io_uring-n: io_uring transport with CPU cores worker threads (8 threads).

io_uring-2n: io_uring transport with CPU cores × 2 worker threads (16 threads).

All tests used a client/server model with 200 concurrent long‑lived connections, three server containers bound to the same CPU core, and a 300‑second run (including a 10‑second warm‑up). The binary protocol consisted of a 4‑byte length field, a 4‑byte CRC32, an 8‑byte sequence ID, and a payload of 64–1024 bytes.
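
The frame layout can be sketched with plain java.nio, without Netty. This is a minimal illustration, not the author's codec: it assumes big-endian fields, a length field that counts the bytes following it, and a CRC32 computed over the sequence ID plus payload (which is what the server handler shown later checks). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class FrameSketch {
    static final int LENGTH_BYTES = 4, CRC_BYTES = 4, SEQ_BYTES = 8;

    // CRC32 over the 8-byte sequence ID followed by the payload (assumed coverage).
    static long checksum(long seqId, byte[] payload) {
        CRC32 crc = new CRC32();
        ByteBuffer seq = ByteBuffer.allocate(SEQ_BYTES).putLong(seqId);
        seq.flip();
        crc.update(seq);
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Layout: [length:4][crc32:4][seqId:8][payload:64..1024]
    static ByteBuffer encode(long seqId, byte[] payload) {
        int bodyLen = CRC_BYTES + SEQ_BYTES + payload.length;
        ByteBuffer frame = ByteBuffer.allocate(LENGTH_BYTES + bodyLen); // big-endian by default
        frame.putInt(bodyLen);
        frame.putInt((int) checksum(seqId, payload));
        frame.putLong(seqId);
        frame.put(payload);
        frame.flip();
        return frame;
    }

    // Re-computes the checksum of a complete frame without consuming it.
    static boolean verify(ByteBuffer frame) {
        ByteBuffer in = frame.duplicate();
        int bodyLen = in.getInt();
        if (bodyLen != in.remaining()) {
            return false; // truncated or over-long frame
        }
        long receivedCrc = in.getInt() & 0xFFFFFFFFL; // treat the wire value as unsigned
        long seqId = in.getLong();
        byte[] payload = new byte[in.remaining()];
        in.get(payload);
        return checksum(seqId, payload) == receivedCrc;
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode(42L, new byte[64]);
        System.out.println("frame bytes = " + frame.remaining() + ", valid = " + verify(frame));
    }
}
```

With the smallest payload of 64 bytes the frame is 80 bytes on the wire; with the largest (1024 bytes) it is 1040 bytes.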

Key Implementation Details

Server Architecture

The server entry point ServerMain selects the transport via the --mode flag. For io_uring a custom IoUringIoHandlerConfig is built, configuring ring size, CQ size, multishot flags, and a buffer‑ring for zero‑copy.

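// Pick the I/O handler factory according to the --mode flag.
IoHandlerFactory ioHandlerFactory;
if (config.useIoUring()) {
    IoUringIoHandlerConfig ioUringConfig = new IoUringIoHandlerConfig();
    // Size each ring from the connections per worker (200 connections in total):
    // the next power of two above 8 entries per connection, with a floor of 256.
    int defaultRingSize = Math.max(256,
        Integer.highestOneBit(Math.max(1, 200 / workerThreads) * 8) << 1);
    // Overridable with -Dbench.ringSize; the default is capped at 4096 entries.
    int ringSize = Integer.getInteger("bench.ringSize", Math.min(4096, defaultRingSize));
    ioUringConfig.setRingSize(ringSize);
    // Multishot operations produce many completions per submission,
    // so the completion queue is made 4x the ring size when any are enabled.
    boolean multishotActive = IoUring.isAcceptMultishotEnabled()
        || IoUring.isRecvMultishotEnabled()
        || IoUring.isPollAddMultishotEnabled();
    if (multishotActive) {
        ioUringConfig.setCqSize(ringSize * 4);
    }
    // Register a buffer ring only when the running kernel supports it.
    if (IoUring.isRegisterBufferRingSupported()) {
        ioUringConfig.setBufferRingConfig(IoUringBufferRingConfig.builder()
            .bufferGroupId((short)0)
            .allocator(new IoUringAdaptiveBufferRingAllocator(ByteBufAllocator.DEFAULT, 128, 512, 2048, true))
            .bufferRingSize((short)2048)
            .batchAllocation(true).batchSize(1024).build());
    }
    ioHandlerFactory = IoUringIoHandler.newFactory(ioUringConfig);
} else {
    ioHandlerFactory = EpollIoHandler.newFactory();
}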

The two transports differ in factory creation, channel type (EpollServerSocketChannel vs IoUringServerSocketChannel), system‑call model (epoll_wait vs shared‑memory SQ/CQ ring with io_uring_enter), and tunable parameters (ring size, CQ size, multishot, buffer ring, etc.).
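
To see what the ring-size expression in the server code evaluates to, the arithmetic can be run on its own. The sketch below copies that expression into a hypothetical helper (the class and method names are not from the source) and applies it to the two thread counts used in the benchmark, with 200 connections.

```java
class RingSizing {
    // Same expression as in the server code: next power of two above
    // (connections per worker * 8), with a floor of 256 entries.
    static int defaultRingSize(int connections, int workerThreads) {
        int perWorker = Math.max(1, connections / workerThreads);
        return Math.max(256, Integer.highestOneBit(perWorker * 8) << 1);
    }

    public static void main(String[] args) {
        for (int threads : new int[] {8, 16}) {
            int ring = Math.min(4096, defaultRingSize(200, threads));
            // The CQ is 4x the ring size when multishot is active.
            System.out.println(threads + " threads -> ring " + ring
                + ", CQ with multishot " + ring * 4);
        }
    }
}
```

For both 8 and 16 workers the floor of 256 applies, so unless bench.ringSize is overridden, the io_uring-n and io_uring-2n runs use the same ring size (256) and the same multishot CQ size (1024).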

Thread Model

A single EventLoopGroup serves both boss and worker duties; each worker thread owns its own ring (for io_uring) or epoll fd. With no dedicated boss thread, accepting connections is handled on the worker event loops, so no ring is reserved for a boss.

Server Handler

The BenchmarkHandler extends SimpleChannelInboundHandler<ByteBuf>. It reuses a single java.util.zip.CRC32 instance per handler (no concurrency) and a pre‑allocated 8‑byte ByteBuffer for the sequence ID. CRC calculation is performed directly on the underlying ByteBuffer to avoid extra byte‑array copies.

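import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class BenchmarkHandler extends SimpleChannelInboundHandler<ByteBuf> {
    private static final int HEADER_SIZE = 4 + 8; // CRC32 + SeqId
    private final CRC32 crc32 = new CRC32();      // one per handler, never shared across threads
    private final ByteBuffer seqBuf = ByteBuffer.allocate(8);

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) {
        int receivedCrc = msg.readInt();
        long seqId = msg.readLong();
        // Both header fields are already consumed, so what remains is the payload.
        int payloadLen = msg.readableBytes();
        // The checksum covers the sequence ID followed by the payload.
        crc32.reset();
        seqBuf.clear();
        seqBuf.putLong(seqId);
        seqBuf.flip();
        crc32.update(seqBuf);
        if (payloadLen > 0) {
            // Feed the NIO view directly to avoid copying into a byte[].
            crc32.update(msg.nioBuffer(msg.readerIndex(), payloadLen));
        }
        long computedCrc = crc32.getValue();
        // readInt() is signed; mask it to compare against the unsigned CRC value.
        boolean match = computedCrc == (receivedCrc & 0xFFFFFFFFL);
        ByteBuf response = ctx.alloc().buffer(HEADER_SIZE + 1);
        response.writeInt((int) computedCrc);
        response.writeLong(seqId);
        response.writeByte(match ? 0 : 1); // status byte: 0 = CRC ok, 1 = mismatch
        ctx.writeAndFlush(response);
    }
}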
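
The `& 0xFFFFFFFFL` mask in the handler matters because CRC32.getValue() returns an unsigned 32-bit value in a long, while the value read from the wire is a signed int. A small stdlib-only sketch, using the standard CRC-32 check string "123456789" (whose checksum is 0xCBF43926), shows the difference; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class CrcCompare {
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue(); // always in the range 0 .. 2^32 - 1
    }

    // Correct comparison: widen the signed wire value without sign extension.
    static boolean matches(int receivedCrc, byte[] data) {
        return crcOf(data) == (receivedCrc & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        long computed = crcOf(data);
        int onWire = (int) computed; // top bit set, so this int is negative

        System.out.println(Long.toHexString(computed)); // prints "cbf43926"
        System.out.println(computed == onWire);         // prints "false" (sign extension)
        System.out.println(matches(onWire, data));      // prints "true"
    }
}
```

Without the mask, roughly half of all checksums (those with the top bit set) would be reported as mismatches.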

Client Design (not the focus)

Uses NioSocketChannel with a thread count equal to CPU cores.

Pre‑generates 2000 messages (64–1024 B payload) and reuses them via retainedDuplicate() for zero‑allocation sends.

Writes are batched and flushed once per loop to collapse syscalls.

Back‑pressure is applied with Thread.yield() when in‑flight requests exceed 256 × connections.

Latency is measured per‑sequence ID using a ring buffer and LongAdder histograms.
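
The latency bookkeeping described in the last item can be sketched as follows. This is an illustration under assumptions, not the author's client: it uses a power-of-two ring of send timestamps indexed by sequence ID and a fixed-width histogram of LongAdder buckets. The class name, bucket width, and percentile method are hypothetical, and the ring must be larger than the in-flight window so that slots are not overwritten before the response arrives.

```java
import java.util.concurrent.atomic.LongAdder;

class LatencySketch {
    private final long[] sentAtNanos;   // ring buffer of send timestamps, indexed by seqId
    private final int mask;             // ringSize - 1, valid because ringSize is a power of two
    private final LongAdder[] buckets;  // fixed-width latency histogram
    private final long bucketWidthMicros;

    LatencySketch(int ringSize, int bucketCount, long bucketWidthMicros) {
        if (ringSize <= 0 || Integer.bitCount(ringSize) != 1) {
            throw new IllegalArgumentException("ringSize must be a power of two");
        }
        this.sentAtNanos = new long[ringSize];
        this.mask = ringSize - 1;
        this.buckets = new LongAdder[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = new LongAdder();
        }
        this.bucketWidthMicros = bucketWidthMicros;
    }

    void onSend(long seqId, long nowNanos) {
        sentAtNanos[(int) (seqId & mask)] = nowNanos;
    }

    void onResponse(long seqId, long nowNanos) {
        long micros = (nowNanos - sentAtNanos[(int) (seqId & mask)]) / 1_000;
        // Latencies beyond the last bucket are clamped into it.
        int bucket = (int) Math.min(buckets.length - 1, micros / bucketWidthMicros);
        buckets[bucket].increment();
    }

    // Upper bound (in microseconds) of the bucket containing the p-th quantile, 0 < p <= 1.
    long percentileMicros(double p) {
        long total = 0;
        for (LongAdder b : buckets) {
            total += b.sum();
        }
        long target = (long) Math.ceil(total * p);
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i].sum();
            if (seen >= target) {
                return (i + 1) * bucketWidthMicros;
            }
        }
        return buckets.length * bucketWidthMicros;
    }

    public static void main(String[] args) {
        LatencySketch lat = new LatencySketch(1024, 200, 100);
        lat.onSend(1L, 0L);
        lat.onResponse(1L, 1_545_000L); // response arrives 1,545 microseconds later
        System.out.println("p50 upper bound (us): " + lat.percentileMicros(0.5));
    }
}
```

LongAdder keeps the histogram cheap to update from several event-loop threads; the timestamp array itself is unsynchronized, which is only safe if the send and the response for a given sequence ID are handled on the same thread.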

JVM Options

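# -XX:+UseZGC: ZGC, which is always generational on JDK 25 (source note: <2% pause)
# -XX:+PreserveFramePointer: keep frame pointers so perf can unwind stacks
ENTRYPOINT ["java", \
  "-XX:+UseZGC", \
  "-XX:+PreserveFramePointer", \
  "-Xms4g", \
  "-Xmx4g", \
  "--enable-native-access=ALL-UNNAMED", \
  "-jar", \
  "/app/server.jar"]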

Benchmark Results

Throughput and Latency

Metric           epoll-2n       io_uring-n     io_uring-2n
----------------------------------------------------------
Total Requests   193,966,200    206,501,000    200,146,600
QPS              ~646,554       ~688,337       ~667,155
Avg Latency      1,835 µs       2,126 µs       2,112 µs
P50 Latency      1,545 µs       1,610 µs       1,611 µs
P90 Latency      3,386 µs       4,178 µs       4,183 µs
P99 Latency      6,989 µs       8,885 µs       9,161 µs
P999 Latency     21,868 µs      25,150 µs      21,451 µs

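Throughput: io_uring-n leads epoll-2n by 6.5 %. Latency: epoll-2n beats io_uring-n at every percentile (P50 4 % lower, P90 19 % lower, P99 21 % lower). The one exception across the three modes is P999, where io_uring-2n is slightly lower than epoll-2n (21,451 µs vs 21,868 µs).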
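
The headline percentages follow directly from the table. The sketch below recomputes them, assuming QPS is total requests divided by the full 300‑second run (which reproduces the QPS row); the class and method names are hypothetical.

```java
class ThroughputMath {
    static final int RUN_SECONDS = 300; // run length stated in the test setup

    static double qps(long totalRequests) {
        return (double) totalRequests / RUN_SECONDS;
    }

    // Relative change of 'value' against 'baseline', in percent.
    static double percentChange(double value, double baseline) {
        return (value - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double epoll2n = qps(193_966_200L);
        double uringN = qps(206_501_000L);
        System.out.printf("QPS: epoll-2n %.0f, io_uring-n %.0f%n", epoll2n, uringN);
        System.out.printf("throughput gain: %.1f %%%n", percentChange(uringN, epoll2n));
        // Latency of epoll-2n relative to io_uring-n (microseconds from the table).
        System.out.printf("P50 %.0f %%, P90 %.0f %%, P99 %.0f %%%n",
            percentChange(1545, 1610), percentChange(3386, 4178), percentChange(6989, 8885));
    }
}
```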

Per‑Thread Efficiency

Metric               epoll-2n (16 thr)   io_uring-n (8 thr)   io_uring-2n (16 thr)
-----------------------------------------------------------------------------------
Total Requests       193,966,200         206,501,000          200,146,600
Requests / Thread    ~12.1 M             ~25.8 M              ~12.5 M
Efficiency Ratio     1.00×               2.13×                1.03×

Because io_uring-n completes slightly more requests with half the threads, per‑thread request handling more than doubles. The author attributes this to multishot recv and the buffer‑ring zero‑copy path.
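
The efficiency ratio is simply requests per thread, normalised to epoll-2n. A short sketch with hypothetical names, using the totals from the table:

```java
class PerThreadEfficiency {
    static double requestsPerThread(long totalRequests, int threads) {
        return (double) totalRequests / threads;
    }

    // Per-thread throughput of one mode relative to a baseline mode.
    static double ratio(long total, int threads, long baseTotal, int baseThreads) {
        return requestsPerThread(total, threads) / requestsPerThread(baseTotal, baseThreads);
    }

    public static void main(String[] args) {
        long epoll2n = 193_966_200L, uringN = 206_501_000L, uring2n = 200_146_600L;
        System.out.printf("io_uring-n : %.2fx%n", ratio(uringN, 8, epoll2n, 16));
        System.out.printf("io_uring-2n: %.2fx%n", ratio(uring2n, 16, epoll2n, 16));
    }
}
```

The ratios come out at about 2.13× for io_uring-n and about 1.03× for io_uring-2n. Note that the 2.13× figure is mostly the factor of two from halving the thread count; the remaining 6.5 % is the actual throughput gain.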

Context Switches

Mode          Total Switches   Switches/s   CPU Migrations   Migrations/s   Migration %
----------------------------------------------------------------------------------------
epoll-2n      1,348,957        1,249        66,122           61.2           4.9%
io_uring-n    2,322,700        2,150        21,341           19.8           0.92%
io_uring-2n   2,854,520        2,643        54,251           50.2           1.9%

Although io_uring-n performs more switches per second, its CPU‑migration rate is only 32 % of epoll-2n's, which reduces the expensive cache and TLB refills that follow a cross‑CPU move.
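
The migration figures can be checked the same way. In this sketch (hypothetical names, counts taken from the table), "Migration %" is migrations divided by total context switches, and the comparisons between modes are plain percentage changes.

```java
class MigrationMath {
    // Share of context switches that moved the thread to another CPU, in percent.
    static double sharePercent(long migrations, long switches) {
        return 100.0 * migrations / switches;
    }

    static double percentChange(double value, double baseline) {
        return (value - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("epoll-2n   migration share: %.2f %%%n", sharePercent(66_122, 1_348_957));
        System.out.printf("io_uring-n migration share: %.2f %%%n", sharePercent(21_341, 2_322_700));
        System.out.printf("io_uring-n vs epoll-2n    : %.0f %%%n", percentChange(21_341, 66_122));
        System.out.printf("io_uring-2n vs io_uring-n : +%.0f %%%n", percentChange(54_251, 21_341));
    }
}
```

Going from epoll-2n to io_uring-n cuts migrations by about 68 %, while going from io_uring-n to io_uring-2n raises them by about 154 %.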

Scheduling Overhead (__schedule frames)

Mode          __schedule frames / total frames   Share
-------------------------------------------------------
epoll-2n      148 / 7,270                        2.04%
io_uring-n    70 / 4,174                         1.68%
io_uring-2n   127 / 5,300                        2.40%

The lower frame share for io_uring-n reflects fewer costly migrations despite a higher switch count.

Flame‑Graph Insights

Full‑stack flame graphs show that epoll_pwait occupies only 2.7 % of unique stacks, while io_uring_enter appears in ~48 % of stacks. This is consistent with a single io_uring_enter syscall doing more work per call (SQE submission, CQE consumption, DEFER_TASKRUN).

Conclusions

On an 8‑core, 200‑connection TCP request‑response workload, io_uring-n yields 6.5 % higher throughput than the traditional epoll-2n configuration.

Latency is still better with epoll-2n (up to 21 % lower at the 99th percentile).

Per‑thread efficiency more than doubles for io_uring-n because multishot receive and buffer‑ring eliminate per‑request SQE reposts.

CPU migrations drop from 61.2 /s with epoll-2n to 19.8 /s with io_uring-n (a 68 % reduction), which the author sees as the main reason for the throughput and per‑thread efficiency advantage.

Doubling the io_uring thread count (16 thr) hurts both throughput (‑3 %) and latency (worse P99) while raising CPU migrations by 154 % and increasing __schedule frame share from 1.68 % to 2.40 %.

The findings suggest that the long‑standing rule “use 2 × CPU cores threads” is no longer optimal for Netty with io_uring; in this benchmark a 1 × core thread pool delivered higher throughput, though at higher latency than epoll-2n.

All measurements were taken with the same binary protocol, identical JVM flags, and the same hardware configuration. The analysis demonstrates how kernel‑level I/O mechanisms, thread‑affinity, and scheduling behaviour interact to shape real‑world server performance.
