Why Enabling tcp_tw_recycle Can Crash Your Web Service and How to Fix It
This article explains how unstable response times behind public load balancers, eventually traced to the kernel parameters net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse on the application hosts, led to frequent monitoring alerts. It details the diagnostic steps taken and provides concrete remediation recommendations.
Problem Phenomenon
In daily work the author observed that the company website sometimes responded instantly and other times took more than 8‑9 seconds, causing user‑experience issues and frequent false alarms from the availability monitoring system, which had a 5‑second timeout.
Environment Introduction and Analysis
Environment Introduction
Each business host runs an Nginx instance behind a public load balancer (LB) that terminates SSL and distributes traffic to the hosts. A second public LB with a different SSL certificate also forwards traffic to the same hosts. The monitoring system runs inside a container cluster.
Analysis
The author considered several possible causes:
Public LB configuration error
One application host processing requests too slowly
Network issues in the cluster where the monitoring system is deployed
Application host system parameters (eventually identified as the root cause)
Testing
Public LB configuration error
Checking LB timeout and cache settings yielded no result. The author examined HAProxy logs and found entries such as:
<code>Jun 15 16:45:29 18.19.1.12 haproxy[30952]: 139.1.2.3:61653 [15/Jun/2018:16:45:08.784] lbl-ckv7ynro~ lbl-ckv7ynro_default/lbb-izjpmxrh 327/15003/-1/-1/20331 503 213 - - sCNN 4/3/0/0/+3 0/0 "HEAD /sessions/auth?return_to=%2F HTTP/1.1"</code>
The log fields were explained in detail, highlighting the significance of the 4/3/0/0/+3 and sCNN values.
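To make the two highlighted values easier to read, here is a minimal sketch that splits the log line above into named fields. It assumes the line follows HAProxy's standard HTTP log format, in which the five slash-separated timers are Tq/Tw/Tc/Tr/Tt (in milliseconds) and the last of the five connection counters is the retry count, prefixed with "+" when the request was redispatched to another server. The function name and the regular expression are illustrative, not part of HAProxy.

```python
import re

LOG_LINE = (
    'Jun 15 16:45:29 18.19.1.12 haproxy[30952]: 139.1.2.3:61653 '
    '[15/Jun/2018:16:45:08.784] lbl-ckv7ynro~ lbl-ckv7ynro_default/lbb-izjpmxrh '
    '327/15003/-1/-1/20331 503 213 - - sCNN 4/3/0/0/+3 0/0 '
    '"HEAD /sessions/auth?return_to=%2F HTTP/1.1"'
)

# timers, status, bytes, two cookie captures, termination state, connection counters
PATTERN = re.compile(
    r'(?P<timers>-?\d+/-?\d+/-?\d+/-?\d+/\+?-?\d+) '
    r'(?P<status>\d{3}) (?P<bytes>\+?\d+) \S+ \S+ '
    r'(?P<term>\S{4}) '
    r'(?P<conns>\d+/\d+/\d+/\d+/\+?\d+) '
)


def parse_haproxy_line(line):
    """Extract timers, status, termination state and retries from one log line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not an HAProxy HTTP log line in the assumed format")
    names = ("Tq", "Tw", "Tc", "Tr", "Tt")
    values = (int(v.lstrip("+")) for v in m["timers"].split("/"))
    retries = m["conns"].split("/")[4]
    return {
        "timers_ms": dict(zip(names, values)),
        "status": int(m["status"]),
        "termination_state": m["term"],
        "retries": int(retries.lstrip("+")),
        "redispatched": retries.startswith("+"),
    }


if __name__ == "__main__":
    print(parse_haproxy_line(LOG_LINE))
```

Read this way, Tc = -1 says the connection to the backend was never established, the leading "s" and "C" in sCNN say a server-side timeout fired while the proxy was still waiting for that connection, and +3 says HAProxy retried three times and redispatched. All three point at the backend host not answering the SYN, which is consistent with the root cause found later.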
One application host processing requests too slowly
Disabling all but one backend host during low‑traffic periods did not eliminate the problem, indicating the issue was not isolated to a single host.
Monitoring system cluster network issue
External AB tests were performed from various networks (China Unicom, China Telecom, Google Cloud, internal VPC). Results showed that LB1 performed poorly in the VPC while LB2 performed normally, suggesting a problem between the LB and the internal network.
Further investigation revealed that the application hosts had the kernel parameters net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse enabled to accelerate TIME‑WAIT socket reuse.
Root Cause Analysis
Enabling tcp_tw_recycle makes the kernel aggressively drop SYN packets whose timestamps are not strictly increasing for a given client IP. In NAT environments many clients share a single public IP, so the kernel may incorrectly consider legitimate connections as replayed and drop them, leading to intermittent connection failures and monitoring alerts.
Why enable tcp_tw_recycle?
The parameters were set to improve TCP connection performance under high concurrency by reducing the TIME‑WAIT duration, but they introduced the described issues.
Why does it block some clients?
When multiple NAT clients connect simultaneously, the server sees only one source IP. If the timestamp of a new SYN is not greater than the last recorded timestamp for that IP, the kernel discards the packet, causing silent connection drops that appear as time‑outs in the load balancer.
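The rule described above can be sketched as a toy model. This is not kernel code: it follows the article's wording (a SYN is dropped when its timestamp is not greater than the last one recorded for the same source IP), and it assumes the recorded timestamp is only remembered for 60 seconds. The class name, the IP addresses (taken from documentation ranges) and the timestamp values are all hypothetical.

```python
from dataclasses import dataclass, field

PAWS_MSL_SECONDS = 60  # assumption: how long the per-IP timestamp is remembered


@dataclass
class RecycleModel:
    """Toy model of the per-source-IP timestamp check (not kernel code)."""
    last_seen: dict = field(default_factory=dict)  # src_ip -> (tsval, arrival time)

    def accept_syn(self, src_ip, tsval, now):
        previous = self.last_seen.get(src_ip)
        if previous is not None:
            prev_tsval, prev_time = previous
            still_remembered = (now - prev_time) < PAWS_MSL_SECONDS
            if still_remembered and tsval <= prev_tsval:
                return False  # SYN is silently dropped; the client sees a hang
        self.last_seen[src_ip] = (tsval, now)
        return True


if __name__ == "__main__":
    nat_ip = "203.0.113.10"          # two clients share this public address
    model = RecycleModel()
    print(model.accept_syn(nat_ip, 5_000_000, now=0.0))   # client A, long uptime
    print(model.accept_syn(nat_ip, 1_200_000, now=1.0))   # client B, booted later
    print(model.accept_syn(nat_ip, 1_200_000, now=70.0))  # client B, after the window
```

Client B's timestamp clock is simply behind client A's, yet from the server's point of view both are the same IP, so B's SYN looks like an old duplicate until the remembered value expires.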
Verification
After disabling the parameters with:
<code>sysctl -w net.ipv4.tcp_tw_recycle=0</code>
AB tests from inside the VPC returned to normal, and monitoring graphs showed stable response times.
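Because sysctl -w only changes the running kernel, it is worth confirming the values on every host, including after a reboot. The helper below is a minimal sketch that reads the two settings from /proc/sys. The function name and the proc_root parameter are illustrative. A missing file is reported as None, since tcp_tw_recycle was removed from the kernel in Linux 4.12 and does not exist on newer hosts.

```python
from pathlib import Path


def read_tw_settings(proc_root="/proc/sys"):
    """Return the current TIME-WAIT sysctl values, or None where a file is absent."""
    settings = {}
    for name in ("tcp_tw_recycle", "tcp_tw_reuse"):
        path = Path(proc_root) / "net" / "ipv4" / name
        try:
            settings[f"net.ipv4.{name}"] = int(path.read_text().strip())
        except FileNotFoundError:
            # Parameter not present on this kernel (or not a Linux host)
            settings[f"net.ipv4.{name}"] = None
    return settings


if __name__ == "__main__":
    for key, value in read_tw_settings().items():
        print(f"{key} = {value}")
```

A value of 1 for net.ipv4.tcp_tw_recycle on any host behind the load balancer would be the thing to look for.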
Summary and Recommendations
Do not enable tcp_tw_recycle (repeat three times).
Maintain proper configuration management to track system changes.
When troubleshooting, consider not only software and network layers but also host kernel settings. Useful tools include tcpdump, ss, and log analysis.
Set a reasonable net.ipv4.ip_local_port_range (e.g., 10000‑63000).
Configure the load balancer with more client IPs.
Assign multiple external IPs to the load balancer.
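The last three recommendations all enlarge the pool of source address and port pairs available for connections to one backend address. A rough, worst-case calculation is sketched below. It assumes the port range is inclusive, that every closed connection lingers in TIME‑WAIT for 60 seconds on the side that opened it, and that nothing else competes for the ports. The function names are illustrative.

```python
TIME_WAIT_SECONDS = 60  # assumption: Linux TIME-WAIT duration


def ephemeral_ports(low, high):
    """Number of local ports in an inclusive ip_local_port_range."""
    return high - low + 1


def max_concurrent(low, high, source_ips=1):
    """Upper bound on simultaneous connections to a single backend ip:port."""
    return ephemeral_ports(low, high) * source_ips


def sustained_rate(low, high, source_ips=1):
    """Worst-case new connections per second if every port waits out TIME-WAIT."""
    return max_concurrent(low, high, source_ips) / TIME_WAIT_SECONDS


if __name__ == "__main__":
    print(ephemeral_ports(10000, 63000))
    print(max_concurrent(10000, 63000, source_ips=2))
    print(round(sustained_rate(10000, 63000)))
```

With the example range of 10000‑63000 this gives 53001 ports per source IP, and each additional source IP multiplies that figure, which is why widening the range and adding IPs are safer ways to gain headroom than recycling TIME‑WAIT sockets.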
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.