Why Your Java Server Hits Connection Errors: Understanding TCP Backlog and Queue Overflows
This article explains how the TCP three‑way handshake, the half‑ and full‑connection queues, and the backlog setting can cause intermittent client‑server connection failures, and shows how to detect and resolve queue overflows using netstat, ss, and proper backlog configuration.
Problem Description
Scenario: Java client and server communicate via NIO sockets. The server uses a single selector. Intermittently, the client completes the three‑way handshake but the server's selector does not see the connection.
Symptoms include:
Three‑way handshake completes on the client side, but the server does not register the connection.
Many connections exhibit this issue simultaneously.
The selector is never destroyed or rebuilt; the same one is reused.
The problem appears at startup and then intermittently.
Analysis of the Problem
Normal TCP three‑way handshake
Step 1: client sends SYN to server.
Step 2: server replies with SYN+ACK.
Step 3: client replies with ACK; the connection is now established.
The observed behavior resembles a full‑connection (accept) queue overflow. To verify, the author ran:
netstat -s | egrep "listen"
The output showed the overflowed counter increasing continuously, confirming that the server's accept queue was full.
Further investigation of how the OS handles this case pointed to the kernel parameter tcp_abort_on_overflow. When it is set to 0 and the accept queue is full at the third handshake step, the server discards the client's ACK and treats the connection as never established, even though the client already considers it open.
To confirm the link, tcp_abort_on_overflow was changed to 1, which makes the server send a RST packet when the accept queue is full. The client then reported "connection reset by peer" errors, tying the client‑side exceptions to the accept‑queue overflow.
Increasing the Java backlog (default 50) and re‑running a 12‑hour stress test eliminated the error and stopped the overflow counter from growing.
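As a minimal sketch of the fix described above, the backlog can be passed explicitly when an NIO server channel is bound. The class name BacklogServer, the ephemeral port, and the value 1024 are illustrative assumptions, not the values used in the original stress test.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical helper: opens a non-blocking listener with an explicit backlog.
class BacklogServer {

    // Binds to the given port (0 = any free port) with the requested backlog
    // and registers the channel with the selector for OP_ACCEPT.
    static ServerSocketChannel open(Selector selector, int port, int backlog) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        // The second argument is the backlog handed to listen(). A value <= 0
        // makes the JDK use its implementation-specific default instead.
        server.bind(new InetSocketAddress(port), backlog);
        server.register(selector, SelectionKey.OP_ACCEPT);
        return server;
    }

    public static void main(String[] args) throws IOException {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = open(selector, 0, 1024)) { // 1024 is an assumed value
            System.out.println("Listening on " + server.getLocalAddress());
        }
    }
}
```

The kernel still caps the effective queue length at somaxconn, so a larger backlog in the application only helps if the system-wide limit is at least as large.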
Deep Dive into TCP Handshake Queues
The handshake involves two queues:
Syn‑queue (half‑connection queue)
Accept queue (full‑connection queue)
During step 1, the server places the SYN information into the half‑connection queue and replies with SYN+ACK. In step 3, if the accept queue is not full, the connection moves from the half‑connection queue to the accept queue; otherwise, the behavior follows tcp_abort_on_overflow.
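The queue transitions just described can be sketched as a toy model. This is not kernel code: the class HandshakeQueues, its method names, and the result values are hypothetical, and the real kernel handles more cases (for example, how a full accept queue affects incoming SYNs).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a listening socket's two queues; NOT how the kernel is implemented.
class HandshakeQueues {
    enum AckResult { ESTABLISHED, ACK_DROPPED, RESET_SENT, UNKNOWN_CONNECTION }

    private final Deque<String> synQueue = new ArrayDeque<>();    // half-connection queue
    private final Deque<String> acceptQueue = new ArrayDeque<>(); // full-connection queue
    private final int acceptQueueLimit;
    private final boolean abortOnOverflow; // models tcp_abort_on_overflow = 1
    private long overflows;                // models the "overflowed" counter

    HandshakeQueues(int acceptQueueLimit, boolean abortOnOverflow) {
        this.acceptQueueLimit = acceptQueueLimit;
        this.abortOnOverflow = abortOnOverflow;
    }

    // Step 1: remember the half-open connection (a SYN+ACK would be sent here).
    void onSyn(String client) {
        synQueue.addLast(client);
    }

    // Step 3: the client's ACK arrives.
    AckResult onAck(String client) {
        if (!synQueue.contains(client)) {
            return AckResult.UNKNOWN_CONNECTION;
        }
        if (acceptQueue.size() >= acceptQueueLimit) {
            overflows++;
            if (abortOnOverflow) {
                synQueue.remove(client);   // connection is torn down with a RST
                return AckResult.RESET_SENT;
            }
            return AckResult.ACK_DROPPED;  // stays half-open; SYN+ACK is retried later
        }
        synQueue.remove(client);
        acceptQueue.addLast(client);
        return AckResult.ESTABLISHED;
    }

    // accept(): the application takes the oldest established connection, or null.
    String accept() {
        return acceptQueue.pollFirst();
    }

    long overflows() {
        return overflows;
    }
}
```

In this model a client whose ACK was dropped can still be established later, once the application has called accept() and freed a slot, which mirrors why a slow accept loop shows up as intermittent rather than permanent failures.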
If the accept queue is full and tcp_abort_on_overflow is 0, the server drops the client's ACK and, after a timeout, retransmits the SYN+ACK. A client with a short timeout will then encounter an exception.
The number of SYN+ACK retransmissions is controlled by tcp_synack_retries, which defaulted to 2 on the examined OS (CentOS defaults to 5).
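To get a feel for what these retry counts mean in wall-clock time, the following sketch computes a retransmission timeline. It assumes an initial retransmission timeout of 1 second that doubles after every attempt, which is the behavior the Linux documentation describes for tcp_synack_retries; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Worked arithmetic for SYN+ACK retransmissions under exponential backoff.
class SynAckRetryTimeline {

    // Seconds after the first SYN+ACK at which each retransmission is sent.
    static long[] retransmitTimes(int retries, long initialRtoSeconds) {
        long[] times = new long[retries];
        long elapsed = 0;
        long rto = initialRtoSeconds;
        for (int i = 0; i < retries; i++) {
            elapsed += rto;   // wait one timeout, then retransmit
            times[i] = elapsed;
            rto *= 2;         // assumed: timeout doubles each time
        }
        return times;
    }

    // Seconds until the server gives up: the last retransmission must also time out.
    static long giveUpAfter(int retries, long initialRtoSeconds) {
        long elapsed = 0;
        long rto = initialRtoSeconds;
        for (int i = 0; i <= retries; i++) {
            elapsed += rto;
            rto *= 2;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        for (int retries : new int[] {2, 5}) {
            System.out.println("retries=" + retries
                    + " retransmits at " + Arrays.toString(retransmitTimes(retries, 1))
                    + " s, gives up after " + giveUpAfter(retries, 1) + " s");
        }
        // prints:
        // retries=2 retransmits at [1, 3] s, gives up after 7 s
        // retries=5 retransmits at [1, 3, 7, 15, 31] s, gives up after 63 s
    }
}
```

Under these assumptions, a client whose read or request timeout is shorter than a few seconds will fail well before the server stops retrying.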
Metrics to Detect Queue Overflows
Key indicators:
netstat -s shows the overflowed counter (e.g., 667399 times); a steadily rising value indicates frequent accept‑queue overflow.
ss -lnt shows, for each listening socket, Send‑Q (the maximum accept‑queue size) and Recv‑Q (the number of connections currently waiting in the accept queue).
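For monitoring from inside a Java process, the same counters can be read without shelling out. On Linux, the listen-queue figures that netstat -s reports come from the ListenOverflows and ListenDrops fields of the TcpExt rows in /proc/net/netstat. The sketch below parses that file's header-line/value-line layout; the class name ListenCounters and the key format are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that reads listen-queue counters from /proc/net/netstat.
class ListenCounters {

    // The file consists of line pairs: a header line with field names and a
    // value line with numbers, both starting with the same prefix, e.g.
    //   TcpExt: SyncookiesSent ListenOverflows ListenDrops
    //   TcpExt: 0 667399 667399
    // Keys in the result look like "TcpExt:ListenOverflows".
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counters = new HashMap<>();
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            String[] names = lines.get(i).trim().split("\\s+");
            String[] values = lines.get(i + 1).trim().split("\\s+");
            if (!names[0].equals(values[0])) {
                continue; // not a matching header/value pair
            }
            for (int j = 1; j < names.length && j < values.length; j++) {
                counters.put(names[0] + names[j], Long.parseLong(values[j]));
            }
        }
        return counters;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/proc/net/netstat");
        if (!Files.isReadable(file)) {
            System.out.println(file + " is not available on this system");
            return;
        }
        Map<String, Long> counters = parse(Files.readAllLines(file));
        System.out.println("ListenOverflows=" + counters.get("TcpExt:ListenOverflows"));
        System.out.println("ListenDrops=" + counters.get("TcpExt:ListenDrops"));
    }
}
```

Sampling these values periodically and alerting on the difference between two samples is more useful than looking at the absolute number, since the counters only ever grow.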
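The accept‑queue size is min(backlog, somaxconn). backlog is the value the application passes to listen() (in Java, the backlog argument used when the server socket is bound); somaxconn (net.core.somaxconn) is a system‑wide limit.
The half‑connection queue size is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog), although the exact rule differs between kernel versions.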
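The accept-queue formula above has a practical consequence: raising the backlog in the application has no effect beyond somaxconn. A small sketch, with a hypothetical class name and an assumed fallback of 128 for systems where the sysctl file cannot be read:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for the min(backlog, somaxconn) rule.
class AcceptQueueSize {

    // Effective accept-queue size as described in the text.
    static int effective(int backlog, int somaxconn) {
        return Math.min(backlog, somaxconn);
    }

    // Reads a single integer from a sysctl file; returns the fallback if the
    // file is missing or unparsable (e.g., on a non-Linux system).
    static int readSomaxconn(Path sysctlFile, int fallback) {
        try {
            byte[] raw = Files.readAllBytes(sysctlFile);
            return Integer.parseInt(new String(raw, StandardCharsets.US_ASCII).trim());
        } catch (IOException | NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // 128 is an assumed fallback, not a value taken from the article.
        int somaxconn = readSomaxconn(Paths.get("/proc/sys/net/core/somaxconn"), 128);
        for (int backlog : new int[] {50, 1024}) {
            System.out.println("backlog=" + backlog + " somaxconn=" + somaxconn
                    + " -> effective accept queue " + effective(backlog, somaxconn));
        }
    }
}
```

With somaxconn at 128, for instance, a requested backlog of 1024 still yields an effective queue of 128, so both values usually need to be raised together.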
Practical Verification
Reducing Java's backlog to 10 caused the accept queue to overflow quickly during stress testing. The ss output showed a maximum queue size of 10 with 11 connections queued or waiting to be queued, confirming the overflow.
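A similar experiment can be approximated on a single machine. The sketch below, whose class and method names are hypothetical, opens a listener with a tiny backlog that never calls accept() and counts how many client connects complete within a timeout. It is only an illustration: the exact numbers depend on the operating system, kernel version, and settings such as tcp_abort_on_overflow.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: a listener that never accepts, so its accept queue fills up.
class TinyBacklogDemo {

    // Returns how many of the attempted connects completed from the client's
    // point of view within the given timeout.
    static int countEstablished(int backlog, int attempts, int connectTimeoutMillis)
            throws IOException {
        List<Socket> clients = new ArrayList<>();
        int established = 0;
        try (ServerSocket server =
                     new ServerSocket(0, backlog, InetAddress.getLoopbackAddress())) {
            InetSocketAddress target = new InetSocketAddress(
                    InetAddress.getLoopbackAddress(), server.getLocalPort());
            for (int i = 0; i < attempts; i++) {
                Socket client = new Socket();
                clients.add(client); // keep it open so it keeps occupying the queue
                try {
                    client.connect(target, connectTimeoutMillis);
                    established++;
                } catch (IOException e) {
                    // Timed out or reset: the queue was presumably full.
                }
            }
        } finally {
            for (Socket client : clients) {
                try {
                    client.close();
                } catch (IOException ignored) {
                    // best-effort cleanup
                }
            }
        }
        return established;
    }

    public static void main(String[] args) throws IOException {
        int established = countEstablished(1, 5, 300);
        System.out.println("established " + established + " of 5 attempts");
    }
}
```

While such a test runs, ss -lnt on the listening port shows Recv-Q climbing to the limit, which is the same signal the stress test above relied on.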
Tomcat and Nginx Accept Queue Parameters
Typical defaults:
Tomcat (Ali‑tomcat) Accept count: 200 (default Tomcat: 100)
Nginx accept queue (listen backlog): 511
Nginx runs multiple worker processes, each listening on the same port to reduce context switches.
Summary
Accept‑queue and half‑connection‑queue overflows are easy to overlook but critical, especially for short‑lived connections (e.g., Nginx, PHP‑FPM). When the accept queue overflows, the server may appear healthy while client‑side latency spikes and connections fail.
Runtimes and frameworks such as the JDK and Netty often use a small default backlog (50 for the JDK's server sockets), which can limit performance under load.
Understanding the TCP handshake, queue mechanics, and relevant metrics helps quickly diagnose and resolve such issues.
Thought‑Provoking Questions
Does a full accept queue affect the half‑connection queue?
What is the relationship between the overflowed and ignored counters shown by netstat -s?
If the client completes the third handshake step and believes the connection is established, but the server has not yet placed the connection into the accept queue, how does the server handle data sent by the client?
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
