
Analyzing TCP Connection States and Resolving TIME_WAIT, CLOSE_WAIT, and SYN_RECV Issues in a Java/Tomcat/HBase System

This article walks through a real‑world incident where sudden traffic drops were traced to abnormal TCP states—TIME_WAIT, CLOSE_WAIT, and SYN_RECV—by examining monitoring data, explaining the TCP handshake, reviewing relevant kernel parameters, and debugging Java/ZooKeeper/HBase code to identify and fix the root cause.

Qunar Tech Salon

1. Fault Phenomenon

Monitoring showed that business traffic would suddenly drop. A restart restored it, but after about 30 minutes traffic dropped again, with each low point recurring roughly 30 minutes after the restart.

Business Monitoring

Machine metrics revealed several abnormal monitors:

TCP TIME_WAIT Count

TCP CLOSE_WAIT Count

TCP SYN_RECV Count

Other metrics (JVM, CPU, load) were normal. Can these monitors help us locate the problem?

2. Problem Localization

We approach the issue from the perspective of TCP connection states to see if they can help pinpoint the cause.

2.1 Problem Statement

Based on the monitoring data we raise several questions:

Is the large number of TIME_WAIT connections normal?

Is the continuous growth of CLOSE_WAIT normal?

Why, once CLOSE_WAIT rises to about 200, does SYN_RECV start increasing and TIME_WAIT begin to drop?

Is the increase of SYN_RECV normal?

2.2 Answer

Understanding these questions requires a review of the TCP three‑way handshake and four‑way termination process, summarized by the classic TCP state‑transition diagram.

2.2.1 Is a Large TIME_WAIT Count Normal?

TIME_WAIT is a normal state kept by the side that actively closes the connection; it persists for 2*MSL (typically 1 minute on Linux) before the socket is released.
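On a live host you can tally how many sockets sit in each state with a quick one‑liner (assuming a Linux host with iproute2's ss installed):

```shell
# Tally TCP sockets by state; column 1 of `ss -ant` output is the state,
# and NR > 1 skips the header line.
ss -ant | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }'
```

Running this periodically (or graphing it, as our monitoring does) is what surfaces the TIME_WAIT, CLOSE_WAIT, and SYN_RECV trends discussed below.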

Why Keep TIME_WAIT for 2*MSL?

One reason is that the side that closed cannot guarantee its final ACK was delivered: if the peer never receives that ACK, it retransmits its FIN, and the lingering TIME_WAIT socket can resend the ACK. The wait also lets stale segments from the old connection die out, preventing them from corrupting a new connection that reuses the same four‑tuple.

How to Optimize TIME_WAIT Quantity?

Note: many online guides recommend the following kernel‑parameter changes. We list them for completeness, but as discussed below we do not recommend applying them blindly.

Modify /etc/sysctl.conf :

net.ipv4.tcp_tw_reuse = 1   # reuse TIME_WAIT sockets for new outbound connections (default 0)
net.ipv4.tcp_tw_recycle = 1 # fast recycling; breaks clients behind NAT and was removed in Linux 4.12 (default 0)
net.ipv4.tcp_fin_timeout = 30

Be aware that tcp_fin_timeout is not the MSL; on Linux the MSL is effectively hard‑coded (the TIME_WAIT interval is a fixed 60 s, i.e. MSL = 30 s). tcp_fin_timeout instead controls how long an orphaned connection may remain in FIN_WAIT_2 before the kernel abandons it.

What Is a Normal TIME_WAIT Quantity?

On a server, the number of TIME_WAIT sockets should stabilize, and each one consumes little memory. For a client, the count is bounded by the number of ephemeral ports (check cat /proc/sys/net/ipv4/ip_local_port_range ; the common default 32768‑61000 gives about 28 k ports). On a server behind connection tracking it should also stay below ip_conntrack_max (default 65536). If TIME_WAIT sockets are created faster than they expire, they accumulate.
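The client‑side bound can be computed directly from the kernel setting (a minimal sketch, assuming a Linux host):

```shell
# Read the ephemeral (client) port range and compute how many four-tuples a
# single client IP can open against one server IP:port.
read low high < /proc/sys/net/ipv4/ip_local_port_range
echo "ephemeral ports available: $((high - low + 1))"
```

With the default range 32768‑61000 this yields 28233 ports, matching the "≈ 28 k" figure above.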

How to Solve Continuous TIME_WAIT Growth?

Increase the number of four‑tuples (add more IPs or ports).

Prefer long‑lived connections and configure reasonable keepalive where possible.
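Where long‑lived connections are used, TCP keepalive behavior can also be tuned at the kernel level. These sysctl values are illustrative, not a recommendation for every workload:

```
# /etc/sysctl.conf — example keepalive tuning (illustrative values)
net.ipv4.tcp_keepalive_time = 600   # idle seconds before the first probe (default 7200)
net.ipv4.tcp_keepalive_intvl = 30   # seconds between unanswered probes (default 75)
net.ipv4.tcp_keepalive_probes = 5   # probes before the connection is declared dead (default 9)
```

Application‑level keepalive (e.g. HTTP keep‑alive in the client pool) is usually the better first lever, since the kernel settings apply host‑wide.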

Conclusion

High TIME_WAIT is normal as long as it stays within reasonable limits.

Avoid blindly setting tcp_tw_reuse / tcp_tw_recycle; the 2*MSL wait exists for good reasons.

2.2.2 CLOSE_WAIT Continuous Growth

If a socket stays in CLOSE_WAIT, the remote side has sent its FIN and our TCP stack has already ACKed it, but our application has never called close(), so our side never sends its own FIN. In practice this almost always means the application failed to close the socket.

How to Solve CLOSE_WAIT Growth?

Inspect the code!

What Causes CLOSE_WAIT?

Resource leaks due to bugs (e.g., forgetting to release HTTP resources).

Client timeout closes the connection, but the server does not release it promptly.
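To find which process is leaking, list the CLOSE_WAIT sockets together with their owning process (run ss -antp as root to see process names; the filter below works on plain ss output):

```shell
# Show only CLOSE-WAIT sockets from `ss -antp` output (-p needs root).
# NR > 1 skips the header; ss prints the state with a hyphen: CLOSE-WAIT.
ss -antp | awk 'NR > 1 && $1 == "CLOSE-WAIT"'
```

The local port column then points at the leaking listener, and with -p the process/thread that owns the socket.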

In our case, an exception in ZooKeeper caused the client to keep retrying for 5 minutes, exhausting threads and leading to CLOSE_WAIT accumulation.

ZooKeeperWatcher.keeperException:445[keeperException] quorum=${ZKHOST}:${ZKPORT}, baseZNode=/hbase/xxxx Received unexpected KeeperException, re‑throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: ConnectionLoss for /hbase/xxxx
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:683)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.blockUntilAvailable(ZKUtil.java:1835)
at org.apache.hadoop.hbase.zookeeper.MetaRegionTracker.blockUntilAvailable(MetaRegionTracker.java:183)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1087)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1181)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1090)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:888)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:780)

Investigation confirmed the root cause: the service's IP had been temporarily blacklisted by ZooKeeper, so configuration could not be retrieved. Each request then retried for 5 minutes, tying up Tomcat threads and leaving CLOSE_WAIT sockets behind.

Conclusion

CLOSE_WAIT indicates a bug or misuse in the server code.

It can quickly fill up server threads and make the service unavailable.

2.2.3 Why Does CLOSE_WAIT Rise to 200, SYN_RECV Increases, and TIME_WAIT Decreases?

Tomcat’s connector configuration (maxThreads="200") explains the ceiling of 200: once the thread pool is exhausted, new requests can no longer be processed, which produces the observed behavior.

<Connector port="8080" protocol="HTTP/1.1" maxThreads="200" connectionTimeout="20000" enableLookups="false" compression="on" redirectPort="8443" URIEncoding="UTF-8" />

Conclusion

CLOSE_WAIT keeps connections from being released.

When the Tomcat thread pool is saturated, no new requests are accepted.

TIME_WAIT sockets are released after 2*MSL (default 1 min on Linux), so the count drops to zero.

2.2.4 SYN_RECV Increase – Is It Normal?

SYN_RECV sockets live in the kernel’s half‑connection (SYN) queue. When the full‑connection (accept) queue is full, because the application is not accepting connections fast enough, completed handshakes cannot be moved out of the SYN queue, so SYN_RECV accumulates.
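Two kernel knobs bound these queues (values below are illustrative, not recommendations):

```
# /etc/sysctl.conf — connection-queue sizing (illustrative values)
net.ipv4.tcp_max_syn_backlog = 1024   # half-connection (SYN) queue
net.core.somaxconn = 1024             # upper bound on the accept (full) queue
```

Note the effective accept‑queue size is the smaller of somaxconn and the backlog the application passes to listen(); in Tomcat that backlog is the connector’s acceptCount attribute.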

How to Detect a Full Accept Queue?

$ ss -lnt
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       50      *:20077             *:*
LISTEN  0       100     *:8080              *:*
LISTEN  0       50      *:20889             *:*

For a listening socket, Send-Q shows the configured accept‑queue limit and Recv-Q its current occupancy; when Recv-Q reaches Send-Q the queue is full. Overflows are also counted by the kernel: netstat -s | grep -i listen reports lines such as "N times the listen queue of a socket overflowed", and a growing counter means the queue is overflowing right now.

Relation Between Accept Queue and Service Threads

Connection establishment: SYN → SYN‑ACK (half‑queue) → ACK (full‑queue) → service thread.

If the full‑queue backs up, service threads are saturated and new connections cannot be processed.

Conclusion

SYN_RECV accumulation indicates the accept queue is full and service threads are blocked.

Use ss -lnt and netstat -s | grep listen to diagnose.

Adjust the accept queue size if necessary, but the core issue is thread‑level throughput.

3. Fault Summary

We traced the problem from TCP states, identified the root cause, and summarized the findings.

3.1 Fault Manifestation

Business traffic drops.

CLOSE_WAIT continuously rises, TIME_WAIT falls, SYN_RECV rises.

3.2 Fault Cause

ZooKeeper temporarily blacklisted the service’s IP, preventing configuration retrieval.

After failing to get data, the client kept retrying for 5 minutes (its TIME_OUT setting).

Long retry held Tomcat threads, causing CLOSE_WAIT accumulation.

Upstream system timed out after 2 s, closing the connection and creating more CLOSE_WAIT sockets.

When Tomcat’s 200‑thread pool filled, new requests were rejected, leading to traffic drop.

3.3 Fault Resolution

Set a reasonable ZK connection timeout and avoid excessive retry periods.
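As a sketch, the retry budget can be capped in the client’s hbase-site.xml. The property names are real HBase/ZooKeeper client settings, but the values below are illustrative and must be tuned against the upstream 2 s timeout:

```xml
<!-- hbase-site.xml (client side) — illustrative values, tune for your SLA -->
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value>   <!-- the high default is what produced the 5-minute retry -->
</property>
<property>
  <name>zookeeper.recovery.retry</name>
  <value>1</value>   <!-- retries of the embedded ZooKeeper client -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>2000</value> <!-- ms; keep at or below the upstream 2 s timeout -->
</property>
```

The key principle: the total worst‑case retry time (retries × pause × ZK recovery retries) must be shorter than the caller’s timeout, so a failing dependency sheds load instead of pinning threads.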

4. Conclusion

The article demonstrates how to use TCP state monitoring to diagnose server‑side problems, understand the semantics of TIME_WAIT, CLOSE_WAIT, and SYN_RECV, and apply appropriate kernel and application‑level fixes.

TIME_WAIT is normal if it stays within a stable range; do not blindly enable tcp_tw_reuse or tcp_tw_recycle.

CLOSE_WAIT is a serious sign of server‑side bugs; investigate network‑related code.

Excessive CLOSE_WAIT can exhaust server threads and cause outages.

SYN_RECV growth indicates a saturated accept queue and blocked service threads.


Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
