Analyzing TCP Connection States and Resolving TIME_WAIT, CLOSE_WAIT, and SYN_RECV Issues in a Java/Tomcat/HBase System
This article walks through a real‑world incident where sudden traffic drops were traced to abnormal TCP states—TIME_WAIT, CLOSE_WAIT, and SYN_RECV—by examining monitoring data, explaining the TCP handshake, reviewing relevant kernel parameters, and debugging Java/ZooKeeper/HBase code to identify and fix the root cause.
1. Fault Phenomenon
Monitoring showed that business traffic suddenly dropped. A restart restored it, but after about 30 minutes traffic dropped again, with each low point coinciding with a server restart.
Business Monitoring
Machine metrics revealed several abnormal indicators:
TCP TIME_WAIT Count
TCP CLOSE_WAIT Count
TCP SYN_RECV Count
Other metrics (JVM, CPU, load) were normal. Can these indicators help us locate the problem?
2. Problem Localization
We approach the issue from the perspective of TCP connection states to see if they can help pinpoint the cause.
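A quick way to reproduce what the monitors above report is to count sockets by state. A minimal sketch, assuming `ss` is available (the `count_tcp_states` helper name is ours, not part of the incident tooling); it reads `ss -tan`-style output from stdin so it also works on saved captures:

```shell
# Count TCP sockets by state, as the monitoring charts above do.
# For a live snapshot:  ss -tan | count_tcp_states
count_tcp_states() {
  awk 'NR > 1 { counts[$1]++ }        # column 1 of `ss -tan` is the state
       END    { for (s in counts) print counts[s], s }' |
  sort -rn
}
```

Running this periodically and watching which states grow (TIME-WAIT, CLOSE-WAIT, SYN-RECV) gives the same signal as the dashboards.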
2.1 Problem Statement
Based on the monitoring data we raise several questions:
Is the large number of TIME_WAIT connections normal?
Is the continuous growth of CLOSE_WAIT normal?
Why, once CLOSE_WAIT rises to about 200, does SYN_RECV start increasing while TIME_WAIT begins to drop?
Is the increase of SYN_RECV normal?
2.2 Answer
Understanding these questions requires a review of the TCP three‑way handshake and four‑way termination process, summarized in the classic TCP state‑transition diagram.
2.2.1 Is a Large TIME_WAIT Count Normal?
TIME_WAIT is a normal state kept by the side that actively closes the connection; it persists for 2*MSL (typically 1 minute on Linux) before the socket is released.
Why Keep TIME_WAIT for 2*MSL?
One reason is that the client cannot guarantee the final ACK is successfully delivered. If the server does not receive the ACK, it will retransmit FIN. Keeping TIME_WAIT allows the client to resend the ACK and prevents old packets from interfering with new connections.
How to Optimize TIME_WAIT Quantity?
Note! Many online guides recommend changing kernel parameters, but we do not recommend those settings here.
Modify /etc/sysctl.conf:
net.ipv4.tcp_tw_reuse = 1 # reuse TIME_WAIT sockets for new outbound connections (default 0)
net.ipv4.tcp_tw_recycle = 1 # fast recycling (default 0; unsafe behind NAT, removed in Linux 4.12)
net.ipv4.tcp_fin_timeout = 30
Be aware that tcp_fin_timeout is not the MSL; MSL is hard‑coded in the kernel (30 s on Linux, giving a 60 s TIME_WAIT). The parameter actually controls how long an orphaned connection may stay in FIN_WAIT_2 before the kernel destroys it.
What Is a Normal TIME_WAIT Quantity?
On a server, the number of TIME_WAIT sockets should stabilize, and each one consumes little memory. On a client, the count must stay below the number of available ephemeral ports (check with cat /proc/sys/net/ipv4/ip_local_port_range ; e.g., 32768‑61000 gives roughly 28 k ports). On a server behind connection tracking, it should also stay below ip_conntrack_max (default 65536). If TIME_WAIT sockets are created faster than they are reclaimed, they accumulate.
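To make "within limits" concrete, compare the TIME_WAIT count against the ephemeral‑port budget. A hedged sketch (the helper name and its overridable arguments are ours; they exist so the arithmetic can be checked against saved data rather than only a live box):

```shell
# Compare the TIME_WAIT count against the ephemeral port budget.
#   $1 = port-range file (default /proc/sys/net/ipv4/ip_local_port_range)
#   $2 = TIME_WAIT count (default: counted live via ss)
time_wait_headroom() {
  range_file="${1:-/proc/sys/net/ipv4/ip_local_port_range}"
  tw_count="${2:-$(ss -tan state time-wait | tail -n +2 | wc -l)}"
  read -r low high < "$range_file"
  echo "ephemeral ports: $((high - low + 1)), TIME_WAIT: $tw_count"
}
```

If TIME_WAIT approaches the port count on a client, outbound connections will start failing with address-in-use errors.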
How to Solve Continuous TIME_WAIT Growth?
Increase the number of four‑tuples (add more IPs or ports).
Prefer long‑lived connections and configure reasonable keepalive where possible.
Conclusion
High TIME_WAIT is normal as long as it stays within reasonable limits.
Avoid blindly setting reuse/recycle; 2*MSL is reasonable.
2.2.2 CLOSE_WAIT Continuous Growth
If a socket stays in CLOSE_WAIT, the remote side has closed its half of the connection (the kernel has already ACKed its FIN), but the local application has never called close(), so the socket's own FIN is never sent.
How to Solve CLOSE_WAIT Growth?
Inspect the code!
What Causes CLOSE_WAIT?
Resource leaks due to bugs (e.g., forgetting to release HTTP resources).
Client timeout closes the connection, but the server does not release it promptly.
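Before reading code, it helps to attribute the CLOSE_WAIT sockets to their owning process. A sketch (the helper name is ours; the -p flag needs sufficient privileges to show socket owners):

```shell
# Group CLOSE_WAIT sockets by owning process name so the leaking
# component stands out. Live usage:
#   ss -tnp state close-wait | close_wait_by_process
close_wait_by_process() {
  awk 'NR > 1 && match($0, /"[^"]+"/) {   # name inside users:(("java",...))
         print substr($0, RSTART + 1, RLENGTH - 2)
       }' |
  sort | uniq -c | sort -rn
}
```

If one process dominates the list, that is where to look for unclosed connections.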
In our case, an exception in ZooKeeper caused the client to keep retrying for 5 minutes, exhausting threads and leading to CLOSE_WAIT accumulation.
ZooKeeperWatcher.keeperException:445[keeperException] quorum=${ZKHOST}:${ZKPORT}, baseZNode=/hbase/xxxx Received unexpected KeeperException, re‑throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: ConnectionLoss for /hbase/xxxx
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:683)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.blockUntilAvailable(ZKUtil.java:1835)
    at org.apache.hadoop.hbase.zookeeper.MetaRegionTracker.blockUntilAvailable(MetaRegionTracker.java:183)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1087)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1181)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1090)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:888)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:780)
The root cause turned out to be a temporary ZooKeeper blacklist entry for the service IP, which prevented configuration retrieval; the client retried for 5 minutes, exhausting Tomcat threads and accumulating CLOSE_WAIT sockets.
Conclusion
CLOSE_WAIT indicates a bug or misuse in the server code.
It can quickly fill up server threads and make the service unavailable.
2.2.3 Why Does CLOSE_WAIT Rise to 200, SYN_RECV Increases, and TIME_WAIT Decreases?
Tomcat’s connector configuration (maxThreads=200) explains the 200‑thread limit. When the thread pool is full, new requests cannot be accepted, leading to the observed behavior.
<Connector port="8080" protocol="HTTP/1.1" maxThreads="200"
           connectionTimeout="20000" enableLookups="false"
           compression="on" redirectPort="8443" URIEncoding="UTF-8" />
Conclusion
CLOSE_WAIT keeps connections from being released.
When the Tomcat thread pool is saturated, no new requests are accepted.
TIME_WAIT sockets are released after 2*MSL (1 min by default on Linux); with no new connections completing and closing, the count drops toward zero.
2.2.4 SYN_RECV Increase – Is It Normal?
A socket in SYN_RECV has received a SYN and sent its SYN‑ACK but has not yet completed the handshake; such embryonic connections sit in the kernel's half‑connection (SYN) queue. When the full‑connection (accept) queue is full, the kernel cannot promote completed handshakes out of the SYN queue, so SYN_RECV entries accumulate.
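Both queue bounds live in kernel tunables. A sketch for inspecting them (the helper name and the overridable proc root are ours, so the logic can be exercised against a saved copy of /proc):

```shell
# Print the kernel limits that bound the half-open (SYN) queue and the
# accept queue. $1 lets the reader point at a saved copy of /proc.
syn_queue_limits() {
  proc="${1:-/proc}"
  for f in sys/net/ipv4/tcp_max_syn_backlog sys/net/core/somaxconn; do
    printf '%s = %s\n' "$f" "$(cat "$proc/$f")"
  done
}
```

The effective accept‑queue length for a listener is min(the backlog passed to listen(), somaxconn), so the application's configured backlog is also capped by the kernel value.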
How to Detect a Full Accept Queue?
>>> ss -lnt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 50 *:20077 *:*
LISTEN 0 100 *:8080 *:*
LISTEN 0 50 *:20889 *:*
For listening sockets, Send-Q shows the accept‑queue limit and Recv-Q shows how many connections are currently waiting in it. When the accept queue overflows, the kernel increments "listen queue overflow" counters, visible via netstat -s | grep listen .
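The relevant counters can be filtered out of `netstat -s` output. A sketch (the helper name is ours; it reads from stdin so saved output works too):

```shell
# Extract the accept-queue / SYN-drop counters from `netstat -s` output.
# Live usage:  netstat -s | overflow_counters
overflow_counters() {
  grep -iE 'listen queue of a socket overflowed|SYNs to LISTEN sockets'
}
```

If these counters keep climbing between two samples, the accept queue is overflowing right now, not just historically.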
Relation Between Accept Queue and Service Threads
Connection establishment: SYN → SYN‑ACK (half‑queue) → ACK (full‑queue) → service thread.
If the full‑queue backs up, service threads are saturated and new connections cannot be processed.
Conclusion
SYN_RECV accumulation indicates the accept queue is full and service threads are blocked.
Use ss -lnt and netstat -s | grep listen to diagnose.
Adjust the accept queue size if necessary, but the core issue is thread‑level throughput.
3. Fault Summary
We traced the problem from TCP states, identified the root cause, and summarized the findings.
3.1 Fault Manifestation
Business traffic drops.
CLOSE_WAIT continuously rises, TIME_WAIT falls, SYN_RECV rises.
3.2 Fault Cause
ZooKeeper temporarily blacklisted the service IP, preventing configuration retrieval.
After failing to get data, the client kept retrying for 5 minutes (its retry timeout).
Long retry held Tomcat threads, causing CLOSE_WAIT accumulation.
Upstream system timed out after 2 s, closing the connection and creating more CLOSE_WAIT sockets.
When Tomcat’s 200‑thread pool filled, new requests were rejected, leading to traffic drop.
3.3 Fault Resolution
Set a reasonable ZK connection timeout and avoid excessive retry periods.
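The fix can be expressed in the HBase client configuration. A hedged sketch of an hbase-site.xml fragment (the values are illustrative and property availability should be verified against your HBase version; the idea is to cap retries so a ZooKeeper outage fails fast instead of holding a Tomcat thread for 5 minutes):

```xml
<!-- Illustrative values: cap ZK/HBase client retries so failures surface
     in seconds rather than minutes. Verify against your HBase version. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>
</property>
<property>
  <name>zookeeper.recovery.retry</name>
  <value>1</value> <!-- retries of failed ZooKeeper operations -->
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value> <!-- far lower than the default, which retries for minutes -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>5000</value>
</property>
```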
4. Conclusion
The article demonstrates how to use TCP state monitoring to diagnose server‑side problems, understand the semantics of TIME_WAIT, CLOSE_WAIT, and SYN_RECV, and apply appropriate kernel and application‑level fixes.
TIME_WAIT is normal if it stays within a stable range; do not blindly enable reuse or recycle.
CLOSE_WAIT is a serious sign of server‑side bugs; investigate network‑related code.
Excessive CLOSE_WAIT can exhaust server threads and cause outages.
SYN_RECV growth indicates a saturated accept queue and blocked service threads.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.