Implementing and Optimizing a High‑Concurrency Long‑Connection Service with Netty
This article explains how to build a scalable long‑connection server with Netty, examines the underlying bottlenecks (Linux kernel limits, CPU hotspots, data structures, and GC), and walks through practical code and tuning techniques for reaching hundreds of thousands of connections at high QPS.
Push Service Background
About a year and a half ago we needed an Android push service. Unlike iOS, China has no unified push platform, so we relied on polling before adopting JPush's long‑connection solution, which at the time handled 500k‑1M concurrent connections.
Two years later we were tasked with optimizing our own long‑connection server.
What Is Netty
Netty (http://netty.io/) is an asynchronous, event‑driven network application framework. Its headline features include:
High‑performance, highly scalable architecture.
Zero‑Copy to minimize memory copying.
Native Linux socket support.
Works with Java 1.7 NIO2 and earlier NIO.
Pooled Buffers reduce pressure on buffer allocation and release.
Bottlenecks
The two main goals of a long‑connection service are more connections and higher QPS. The real bottlenecks are not in Netty code but in OS configuration (max open files, process limits) and later in CPU, data structures, and GC.
More Connections
Both Java NIO and Netty can handle millions of connections because they use non‑blocking I/O and do not create a thread per connection.
Java NIO Example
ServerSocketChannel ssc = ServerSocketChannel.open();
Selector sel = Selector.open();
ssc.configureBlocking(false);
ssc.socket().bind(new InetSocketAddress(8080));
ssc.register(sel, SelectionKey.OP_ACCEPT);
while (true) {
    sel.select();
    Iterator<SelectionKey> it = sel.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey skey = it.next();
        it.remove();
        if (skey.isAcceptable()) {
            SocketChannel ch = ssc.accept(); // accept and otherwise ignore the connection
        }
    }
}

This code only accepts connections and does nothing else, illustrating the basic NIO pattern.
Netty Example
NioEventLoopGroup bossGroup = new NioEventLoopGroup();
NioEventLoopGroup workerGroup = new NioEventLoopGroup();
ServerBootstrap bootstrap = new ServerBootstrap();
bootstrap.group(bossGroup, workerGroup);
bootstrap.channel(NioServerSocketChannel.class);
bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
    @Override
    protected void initChannel(SocketChannel ch) throws Exception {
        ChannelPipeline pipeline = ch.pipeline();
        // todo: add handlers
    }
});
bootstrap.bind(8080).sync();

Again, the Netty bootstrap is straightforward and does not require special tricks to reach a million connections.
Where the Real Bottleneck Lies
With non‑blocking I/O the bottleneck moves to Linux kernel configuration: the default limits on maximum open files and per‑process resources must be raised.
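As an illustration (the exact values depend on the target scale; these numbers are examples, not the original production settings), the usual knobs look like this:

```shell
# Illustrative limits for a million-connection target; tune to your hardware.

# Per-process open-file limit: every TCP connection consumes one descriptor.
ulimit -n 1048576

# System-wide file handle ceiling.
sysctl -w fs.file-max=1048576

# Make the per-user limit permanent via /etc/security/limits.conf:
#   *    soft    nofile    1048576
#   *    hard    nofile    1048576
```

Without these, the server hits "too many open files" long before Netty itself becomes the limiting factor.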
How to Verify Capacity
We built a Netty client that opens up to 60,000 connections to the server (roughly the ceiling imposed by the client's local port numbers) and connects in a loop:
NioEventLoopGroup workerGroup = new NioEventLoopGroup();
Bootstrap b = new Bootstrap();
b.group(workerGroup);
b.channel(NioSocketChannel.class);
b.handler(new ChannelInitializer<SocketChannel>() {
    @Override
    public void initChannel(SocketChannel ch) throws Exception {
        ChannelPipeline pipeline = ch.pipeline();
        // todo: add handler
    }
});
for (int k = 0; k < 60000; k++) {
    b.connect("127.0.0.1", 8080);
}

Running this client on a machine with tuned kernel parameters validates the server's ability to hold many connections.
Finding More Machines
Since a single client host can open only ~60k connections to one server address, we needed multiple client hosts. Using virtual machines with bridged networking and several VMs per physical server, we reached the million‑connection target with only four physical machines.
Trick to Inflate Connection Count
By disabling TCP keep‑alive on the server, then repeatedly crashing a client VM, changing its MAC address, and reconnecting, the server sees brand‑new connections while the dead ones are never cleaned up, artificially inflating the connection count.
Higher QPS
Because Netty and NIO are non‑blocking, QPS does not degrade with more connections as long as memory is sufficient. The real QPS bottleneck is often the data‑structure design.
Data‑Structure Optimization
Complex projects require careful selection and combination of collections. For example, frequent calls to ConcurrentLinkedQueue.size() caused a CPU hotspot because the method traverses the whole list each time. Replacing it with an AtomicInteger counter eliminated the issue while preserving eventual consistency.
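As a sketch of this pattern (the class and method names here are illustrative, not from the original project), the queue can maintain its own counter alongside each mutation:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: wrap ConcurrentLinkedQueue with an AtomicInteger so size checks
// are O(1) instead of traversing the entire linked list on every call.
class CountedQueue<E> {
    private final ConcurrentLinkedQueue<E> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger count = new AtomicInteger();

    public boolean offer(E e) {
        boolean added = queue.offer(e);
        if (added) {
            count.incrementAndGet(); // maintain the counter with the queue
        }
        return added;
    }

    public E poll() {
        E e = queue.poll();
        if (e != null) {
            count.decrementAndGet();
        }
        return e;
    }

    // O(1); eventually consistent with the queue under concurrency, which is
    // acceptable for monitoring and flow-control checks.
    public int size() {
        return count.get();
    }
}
```

The counter can briefly disagree with the queue's true length under concurrent access, but for hot-path size checks that eventual consistency is a fair trade for removing the O(n) traversal.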
CPU Bottleneck Diagnosis
Use VisualVM (Sample mode) to identify methods with the highest self‑time. In our case, ConcurrentLinkedQueue.size() was the top offender.
GC Bottleneck
Excessive old‑generation GC was observed with the default young‑to‑old ratio (-XX:NewRatio=2, i.e. young:old = 1:2). Raising -XX:NewRatio to enlarge the old generation reduced old‑GC frequency. In production, where many long‑lived connection objects exist, allocating a larger old generation is advisable.
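For example (the flag values and the jar name `push-server.jar` are placeholders, not the original production configuration), a launch command along these lines gives the old generation more headroom and logs GC activity for verification:

```shell
# Illustrative JVM sizing for a Java 6/7-era push server.
# -XX:NewRatio=3 makes the old generation three times the size of the young
# generation, leaving room for long-lived per-connection objects.
java -Xms8g -Xmx8g \
     -XX:NewRatio=3 \
     -Xloggc:gc.log -XX:+PrintGCDetails \
     -jar push-server.jar
```

Checking gc.log before and after the change is the simplest way to confirm the old‑GC frequency actually dropped.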
Other Optimizations
Refer to "Netty Best Practices" and the book "Netty in Action" for additional tweaks that boosted our overall QPS.
Running on a 16‑core machine with 120 GB of RAM but only an 8 GB JVM heap, Java 1.6 achieved 600k connections and 200k QPS; further gains should be possible with a larger heap and Java 1.7+.
Final Outcome
After weeks of stress testing and tuning, we reached 600k concurrent connections and 200k QPS on a single server with low system load, indicating that the remaining bottleneck lies in I/O rather than CPU or memory.