
Why Enabling tcp_tw_recycle Can Crash Your Web Service and How to Fix It

This article explains how intermittently slow responses behind two public load balancers, eventually traced to the kernel parameters net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse on the application hosts, caused frequent monitoring alerts. It walks through the diagnostic steps and gives concrete remediation recommendations.

Efficient Ops

Jul 14, 2020


Problem Phenomenon

In day-to-day operations, the author noticed that the company website sometimes responded instantly and at other times took 8 to 9 seconds or longer. This hurt the user experience and triggered frequent alerts from the availability monitoring system, which uses a 5-second timeout.

Environment Introduction and Analysis

Environment Introduction

Each business host runs an Nginx instance behind a public load balancer (LB) that terminates SSL and distributes traffic to the hosts. A second public LB with a different SSL certificate also forwards traffic to the same hosts. The monitoring system runs inside a container cluster.

Analysis

The author considered several possible causes:

Public LB configuration error

One application host processing requests too slowly

Network issues in the cluster where the monitoring system is deployed

Application host system parameters (eventually identified as the root cause)

Testing

Public LB configuration error

Checking the LB timeout and cache settings revealed nothing abnormal. The author then examined the HAProxy logs and found entries such as:

Jun 15 16:45:29 18.19.1.12 haproxy[30952]: 139.1.2.3:61653 [15/Jun/2018:16:45:08.784] lbl-ckv7ynro~ lbl-ckv7ynro_default/lbb-izjpmxrh 327/15003/-1/-1/20331 503 213 - - sCNN 4/3/0/0/+3 0/0 "HEAD /sessions/auth?return_to=%2F HTTP/1.1"

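The fields that matter most in this entry are the termination state sCNN and the connection counters 4/3/0/0/+3. In HAProxy's log format, sC means the connection to the backend server timed out before it could be established, and the +3 at the end of the counters means the connection was retried three times and then redispatched.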
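The fields of the log entry above can be pulled apart mechanically. The following is a minimal sketch, assuming HAProxy's standard HTTP log format, in which the five slash-separated timers are Tq/Tw/Tc/Tr/Tt (newer versions name them TR/Tw/Tc/Tr/Ta) and the five counters are actconn/feconn/beconn/srv_conn/retries. The function name and the regular expression are illustrative, not part of HAProxy.

```python
import re

# The log entry quoted in the article, unchanged.
LOG = ('Jun 15 16:45:29 18.19.1.12 haproxy[30952]: 139.1.2.3:61653 '
       '[15/Jun/2018:16:45:08.784] lbl-ckv7ynro~ lbl-ckv7ynro_default/lbb-izjpmxrh '
       '327/15003/-1/-1/20331 503 213 - - sCNN 4/3/0/0/+3 0/0 '
       '"HEAD /sessions/auth?return_to=%2F HTTP/1.1"')

# Assumed layout: timers, status, bytes, two cookie fields, termination
# state, connection counters, queue counters.
PATTERN = re.compile(
    r'(?P<timers>-?\d+/-?\d+/-?\d+/-?\d+/\+?-?\d+) '
    r'(?P<status>\d{3}) (?P<bytes>\+?\d+) \S+ \S+ '
    r'(?P<state>\S{4}) '
    r'(?P<conns>\d+/\d+/\d+/\d+/\+?\d+) '
    r'(?P<queues>\d+/\d+)'
)

def parse_haproxy_http_log(line):
    """Extract timers, status, termination state and retry info."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not an HAProxy HTTP log line")
    tq, tw, tc, tr, tt = (int(x) for x in m.group("timers").split("/"))
    retries = m.group("conns").split("/")[4]
    return {
        "timers_ms": {"Tq": tq, "Tw": tw, "Tc": tc, "Tr": tr, "Tt": tt},
        "status": int(m.group("status")),
        "termination_state": m.group("state"),
        "retries": int(retries.lstrip("+")),
        # A leading '+' marks a redispatch after the retries were used up.
        "redispatched": retries.startswith("+"),
        # Tc == -1 means no connection to the backend was ever established.
        "connect_failed": tc == -1,
    }

if __name__ == "__main__":
    print(parse_haproxy_http_log(LOG))
```

For this entry the sketch reports a connect timer of -1, three retries with a redispatch, and a 503 status. That pattern fits a backend that does not answer connection attempts rather than one that answers slowly.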

One application host processing requests too slowly

Disabling all but one backend host during low‑traffic periods did not eliminate the problem, indicating the issue was not isolated to a single host.

Network issues in the cluster where the monitoring system is deployed

Tests with ab (ApacheBench) were run from several networks: China Unicom, China Telecom, Google Cloud, and the internal VPC. LB1 performed poorly when tested from inside the VPC, while LB2 performed normally, which pointed to a problem between the LB and the internal network.

Further investigation revealed that the application hosts had the kernel parameters net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse enabled to accelerate TIME‑WAIT socket reuse.

Root Cause Analysis

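With tcp_tw_recycle enabled (and TCP timestamps on), the kernel remembers the most recent TCP timestamp seen from each client IP and drops SYN packets from that IP whose timestamp is older than the remembered one. In NAT environments many clients share a single public IP, but each client generates timestamps from its own clock, so the values arriving from that IP are not monotonic. The kernel then mistakes legitimate new connections for stale duplicates and drops them, leading to intermittent connection failures and monitoring alerts.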

Why enable tcp_tw_recycle?

The parameters were set to improve TCP connection performance under high concurrency by reclaiming TIME‑WAIT sockets sooner, but they introduced the issues described above.

Why does it block some clients?

When several clients behind the same NAT device connect within a short period, the server sees only one source IP. If the timestamp in a new SYN is older than the last timestamp recorded for that IP, the kernel discards the packet without sending any reply, causing silent connection drops that appear as time‑outs in the load balancer.
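The drop decision described above can be illustrated with a toy model. This is a simplified sketch, not kernel code: it assumes a 60-second validity window for the remembered timestamp and a tolerance of one tick, it updates the remembered value on every accepted SYN (the kernel records it at a different point in the connection's life), and it ignores 32-bit timestamp wraparound. The class name, the clients A and B, their timestamp values, and the address 203.0.113.10 are all hypothetical.

```python
PAWS_VALIDITY_SECONDS = 60  # assumed lifetime of a remembered timestamp
PAWS_TOLERANCE_TICKS = 1    # assumed tolerated backwards step

class PerHostTimestampCache:
    """Toy model of the per-source-IP timestamp check under tcp_tw_recycle."""

    def __init__(self):
        self._last = {}  # source IP -> (timestamp value, time last seen)

    def accept_syn(self, src_ip, tsval, now):
        """Return True if the SYN is accepted, False if silently dropped."""
        seen = self._last.get(src_ip)
        if seen is not None:
            last_tsval, last_seen = seen
            still_valid = now - last_seen < PAWS_VALIDITY_SECONDS
            if still_valid and last_tsval - tsval > PAWS_TOLERANCE_TICKS:
                return False  # looks like an old duplicate: drop, no reply
        self._last[src_ip] = (tsval, now)
        return True

# Two clients behind one NAT address. Client A has been up for a long time
# (large timestamp), client B booted recently (small timestamp).
NAT_IP = "203.0.113.10"
events = [
    ("A", 5_000_000, 0),   # first SYN from this IP
    ("B", 10_000, 1),      # much older timestamp, same IP
    ("B", 10_001, 2),      # B's retry
    ("A", 5_003_000, 3),   # A again, timestamp moved forward
    ("B", 79_000, 70),     # B after the IP has been quiet for over 60 s
]

cache = PerHostTimestampCache()
results = [cache.accept_syn(NAT_IP, tsval, now) for _, tsval, now in events]

if __name__ == "__main__":
    for (client, tsval, now), ok in zip(events, results):
        print(f"t={now:>2}s client {client} tsval={tsval}: "
              f"{'accepted' if ok else 'dropped'}")
```

In this model client B is locked out for as long as client A keeps refreshing the remembered timestamp, which matches the symptom of some requests succeeding instantly while others stall until a retry gets through.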

Verification

After disabling the parameter with:

sysctl -w net.ipv4.tcp_tw_recycle=0

ab tests from inside the VPC returned to normal, and monitoring graphs showed stable response times.
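To confirm the state of these settings on a host, the values can be read from /proc/sys. The sketch below is one way to do that; the function names are illustrative. It treats a missing file as "not available", which is what to expect on Linux 4.12 and later, where tcp_tw_recycle was removed from the kernel.

```python
from pathlib import Path

def read_sysctl(name, proc_sys="/proc/sys"):
    """Return a sysctl value as a string, or None if the kernel lacks it."""
    path = Path(proc_sys) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None

def time_wait_settings(proc_sys="/proc/sys"):
    """Collect the two parameters discussed in the article."""
    names = ("net.ipv4.tcp_tw_recycle", "net.ipv4.tcp_tw_reuse")
    return {name: read_sysctl(name, proc_sys) for name in names}

def recycle_enabled(proc_sys="/proc/sys"):
    """True only when tcp_tw_recycle exists and is set to 1."""
    return read_sysctl("net.ipv4.tcp_tw_recycle", proc_sys) == "1"

if __name__ == "__main__":
    print(time_wait_settings())
    print("tcp_tw_recycle enabled:", recycle_enabled())
```

Note that sysctl -w changes only the running kernel. If the setting is also present in /etc/sysctl.conf or a file under /etc/sysctl.d/, it comes back after a reboot unless it is removed there as well.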

Summary and Recommendations

Do not enable tcp_tw_recycle. This is important enough to say three times. Maintain proper configuration management so that system changes can be tracked. When troubleshooting, look beyond the software and network layers to the host's kernel settings. Useful tools include tcpdump, ss, and log analysis.
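As a rough stand-in for counting socket states with ss, the sketch below tallies states from text in the /proc/net/tcp format, where the fourth column is a hexadecimal state code (06 is TIME_WAIT). The function name and the sample rows are made up for illustration, and only the columns the function reads are shown.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as shown in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per state; unknown codes are kept as raw hex."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is a header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

# Hypothetical sample, truncated to the first four columns.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:1F90 00000000:0000 0A
   1: 0100007F:1F90 0100007F:C350 06
   2: 0100007F:1F90 0100007F:C351 06
   3: 0100007F:1F90 0100007F:C352 01
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE))
    # On a Linux host: count_tcp_states(open("/proc/net/tcp").read())
```

A large and growing TIME_WAIT count is the usual reason people reach for tcp_tw_recycle, so measuring it first shows whether there is a problem to solve at all.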

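If the goal is to cope with large numbers of TIME‑WAIT sockets, safer options are:

Set a reasonable net.ipv4.ip_local_port_range (e.g., 10000 to 63000).

Configure the load balancer with more client IPs, so that its connections to the backends are spread over more source addresses.

Assign multiple external IPs to the load balancer.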
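The effect of these recommendations can be estimated with simple arithmetic. The sketch below assumes that the side opening the connections also closes them, that each closed connection holds its local port in TIME-WAIT for 60 seconds, and that all connections go to a single destination IP and port. Under those assumptions the sustainable rate of new connections is roughly the number of local ports times the number of source IPs, divided by 60. The function names are illustrative, and the result is a rough upper bound, not a measurement.

```python
TIME_WAIT_SECONDS = 60  # assumed TIME-WAIT hold time on Linux

def ephemeral_ports(low, high):
    """Number of local ports in an inclusive ip_local_port_range."""
    return high - low + 1

def max_sustained_conn_rate(low, high, source_ips=1,
                            time_wait=TIME_WAIT_SECONDS):
    """Rough upper bound on new connections per second to one destination."""
    return ephemeral_ports(low, high) * source_ips / time_wait

if __name__ == "__main__":
    # The range suggested in the article.
    print(ephemeral_ports(10000, 63000))
    print(max_sustained_conn_rate(10000, 63000))
    # The same range with four source IPs on the connecting side.
    print(max_sustained_conn_rate(10000, 63000, source_ips=4))
```

With the suggested range there are 53001 usable ports, and every additional source IP multiplies the bound, which is why widening the port range and adding IPs relieve TIME-WAIT pressure without touching tcp_tw_recycle.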



