Why Your API Service Hits 200k TIME_WAIT Connections and How to Fix It
This article explains why high‑traffic Linux services can exhaust sockets and ephemeral ports through massive TIME_WAIT and CLOSE_WAIT buildup, shows how to diagnose the problem using netstat/ss commands, and provides concrete kernel‑parameter tweaks, connection‑pool strategies, and monitoring scripts to restore stability.
Overview
A sudden alert about massive API timeouts turned out to be caused by over 200,000 TIME_WAIT sockets, preventing new connections. The incident motivated a deep dive into the Linux kernel TCP stack, its connection states, relevant sysctl parameters, and production‑grade monitoring practices.
TCP Connection State Quick Review
Understanding the TCP state machine (CLOSED, SYN_SENT, ESTABLISHED, FIN_WAIT, TIME_WAIT, CLOSE_WAIT, etc.) is essential before tuning, because each state requires a different mitigation strategy.
Environment Information
Operating Systems: CentOS 7.x / RHEL 8.x / Ubuntu 20.04+
Kernel Versions: 3.10+, 4.x, 5.x
Applicable Scenarios: high‑concurrency web services, API gateways, micro‑service architectures
Connection‑State Diagnosis
Quick Diagnostic Commands
# Count connections per state
netstat -ant | awk '/^tcp/ {++state[$NF]} END {for (k in state) print k, state[k]}' | sort -k2 -rn
# Faster alternative (NR>1 skips the ss header line)
ss -ant | awk 'NR>1 {++state[$1]} END {for (k in state) print k, state[k]}' | sort -k2 -rn
Typical output shows the distribution of ESTABLISHED, TIME_WAIT, CLOSE_WAIT, SYN_RECV, etc.
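To see what the counting pipeline actually does, here is the same awk run on a canned sample instead of live ss output (state names follow ss's abbreviated form; the data is made up):

```shell
# Run the state-counting awk on sample data instead of live `ss` output
printf '%s\n' ESTAB ESTAB ESTAB TIME-WAIT TIME-WAIT CLOSE-WAIT |
  awk '{++state[$1]} END {for (k in state) print k, state[k]}' |
  sort -k2 -rn
# ESTAB 3
# TIME-WAIT 2
# CLOSE-WAIT 1
```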
Normal Ranges and Warning Signals
ESTABLISHED : depends on business load; a sudden surge indicates connection leaks.
TIME_WAIT : should stay < 50 k; > 100 k signals short‑connection overload or insufficient port range.
CLOSE_WAIT : normally < 100; continuous growth points to application bugs that never close sockets.
SYN_RECV : < 1 k; a spike usually means a SYN‑Flood attack.
FIN_WAIT2 : < 500; persistent accumulation suggests the peer never sends FIN.
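These thresholds are rules of thumb, not kernel limits; as a sketch, a tiny helper can classify a TIME_WAIT count against them:

```shell
# Classify a TIME_WAIT socket count against the rough thresholds above
tw_status() {
  if [ "$1" -gt 100000 ]; then
    echo overload   # short-connection overload or insufficient port range
  elif [ "$1" -gt 50000 ]; then
    echo warning    # above the comfortable range; keep watching
  else
    echo ok
  fi
}
tw_status 30000    # ok
tw_status 150000   # overload
```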
TIME_WAIT Deep Dive
The TIME_WAIT state guarantees that delayed packets from an old connection are not mistakenly accepted by a new one. Linux holds each TIME_WAIT socket for a fixed 60 s (TCP_TIMEWAIT_LEN, intended as 2 × MSL; it is not tunable via sysctl). At 10 k short connections per second, 60 s of hold time yields 600 k TIME_WAIT sockets, quickly exhausting the default local‑port range (32768‑60999, about 28 k ports).
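The arithmetic is worth making explicit; a one-line estimate of steady-state TIME_WAIT load (the connection rate is the example's assumption):

```shell
# Steady-state TIME_WAIT count ~= new short connections per second x 60 s hold time
rate=10000   # connections/s (example assumption)
hold=60      # Linux TCP_TIMEWAIT_LEN, seconds
echo $(( rate * hold ))   # 600000 sockets -- far beyond the ~28k default port range
```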
# View local port range
cat /proc/sys/net/ipv4/ip_local_port_range
# Check TIME_WAIT count (tail skips the header line)
ss -ant state time-wait | tail -n +2 | wc -l
# Show some TIME_WAIT entries
ss -ant state time-wait | head -20
CLOSE_WAIT Diagnosis
CLOSE_WAIT is more dangerous because it indicates that the remote side has closed the connection but the application has never called close(). A buggy Java client that fails to release connections can accumulate tens of thousands of CLOSE_WAIT sockets and eventually exhaust file descriptors and memory (OOM).
# Find processes holding CLOSE_WAIT sockets
ss -antp state close-wait
# Example output (Java process)
# CLOSE-WAIT 1 0 10.0.0.100:8080 10.0.0.50:45678 users:("java",pid=12345,fd=89)
# Further analysis
lsof -p 12345 | grep CLOSE_WAIT | wc -l
Kernel Parameter Tuning in Practice
TIME_WAIT Tuning
# /etc/sysctl.conf
# Enable reuse of TIME_WAIT sockets for outbound connections (requires tcp_timestamps)
net.ipv4.tcp_tw_reuse = 1
# Increase bucket limit (how many TIME_WAIT can be kept)
net.ipv4.tcp_max_tw_buckets = 200000
# Shorten the orphaned FIN_WAIT2 timeout (does NOT shorten TIME_WAIT, which is fixed at 60 s)
net.ipv4.tcp_fin_timeout = 15
# Expand local port range
net.ipv4.ip_local_port_range = 1024 65535
Note: tcp_tw_recycle is deprecated (removed entirely in kernel 4.12) and breaks clients behind NAT; never enable it.
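Widening the port range matters more than it may look; counting usable ephemeral ports before and after the tweak:

```shell
# Usable ephemeral ports: default 32768-60999 vs expanded 1024-65535
default=$(( 60999 - 32768 + 1 ))
expanded=$(( 65535 - 1024 + 1 ))
echo "default=$default expanded=$expanded"   # default=28232 expanded=64512
```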
Connection‑Establishment Tuning
# Increase SYN backlog (half‑open queue)
net.ipv4.tcp_max_syn_backlog = 65535
# Increase accept queue size
net.core.somaxconn = 65535
# Reduce SYN retry counts
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
# Enable SYN cookies to mitigate SYN‑Flood
net.ipv4.tcp_syncookies = 1
Remember that the effective accept queue size is min(backlog, somaxconn), where backlog is the value passed to listen() by the application.
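The min(backlog, somaxconn) rule means a large somaxconn is useless if the application passes a small backlog, and vice versa; a quick sketch with illustrative values:

```shell
# Effective accept queue = min(application listen() backlog, net.core.somaxconn)
backlog=511       # e.g. a common web-server listen backlog (illustrative)
somaxconn=128     # old kernel default for net.core.somaxconn
echo $(( backlog < somaxconn ? backlog : somaxconn ))   # 128 -- the kernel clamps it
```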
Keepalive Parameters
# TCP keepalive settings (effective only if SO_KEEPALIVE is set)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 3
With these values, a dead peer is declared down after 600 + 15 × 3 = 645 seconds.
Memory and Buffer Settings
# Memory limits (values are in pages, usually 4 KB each)
net.ipv4.tcp_mem = 262144 524288 1048576
# Per‑socket receive buffer
net.ipv4.tcp_rmem = 4096 87380 16777216
# Per‑socket send buffer
net.ipv4.tcp_wmem = 4096 65536 16777216
# System‑wide limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
Complete Production Configuration Example
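Note that tcp_mem values, here and in the full configuration below, are in pages rather than bytes; a quick conversion (assuming 4 KB pages) shows the actual memory budget:

```shell
# Convert a tcp_mem threshold from 4 KB pages to megabytes
pressure=524288   # middle tcp_mem value above, in pages
echo $(( pressure * 4096 / 1024 / 1024 ))   # 2048 MB before TCP enters memory pressure
```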
# /etc/sysctl.d/99-tcp-tuning.conf
fs.file-max = 2000000
fs.nr_open = 2000000
net.netfilter.nf_conntrack_max = 2000000
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_established = 1200
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 200000
net.ipv4.tcp_fin_timeout = 15
net.ipv4.ip_local_port_range = 1024 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 786432 1048576 1572864
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.route.gc_timeout = 100
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
Apply with sysctl --system and verify with sysctl -a | grep -E "tcp_tw_reuse|somaxconn|tcp_max_syn".
Special‑Scenario Handling
Scenario 1: Short‑Connection API Gateways
Prefer long‑lived connections; configure Nginx upstream keepalive (e.g., keepalive 1000;).
Let the server close first to avoid the client holding TIME_WAIT (adjust keepalive_timeout).
Use HTTP client connection pools in every language.
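A minimal Nginx sketch of the upstream keepalive setup described above (server addresses and sizes are placeholders, not recommendations):

```nginx
# Reuse upstream connections instead of opening one per request
upstream api_backend {
    server 10.0.0.11:8080;   # placeholder backend
    server 10.0.0.12:8080;   # placeholder backend
    keepalive 1000;          # idle connections cached per worker
}
server {
    listen 80;
    location / {
        proxy_pass http://api_backend;
        proxy_http_version 1.1;        # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";
    }
}
```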
Scenario 2: Micro‑service Mesh
Monitor inter‑service connection counts with ss -ant | awk 'NR>1 {print $4, $5}' | sort | uniq -c | sort -rn.
Check for missing connection pools, unreasonable pool sizes, or leaks.
Scenario 3: Proxy Servers (Nginx/HAProxy)
Both client and server sides consume ports; double the pressure.
Expand the local port range or use multiple source IPs.
Configure upstream blocks with multiple backend IPs to spread the load.
Monitoring and Alerting
Shell Script for Real‑Time TCP Metrics
#!/bin/bash
# Poll TCP state counts every 10 s; tail -n +2 skips the ss header line
while true; do
  ts=$(date '+%Y-%m-%d %H:%M:%S')
  est=$(ss -ant state established | tail -n +2 | wc -l)
  tw=$(ss -ant state time-wait | tail -n +2 | wc -l)
  cw=$(ss -ant state close-wait | tail -n +2 | wc -l)
  sr=$(ss -ant state syn-recv | tail -n +2 | wc -l)
  echo "$ts ESTABLISHED=$est TIME_WAIT=$tw CLOSE_WAIT=$cw SYN_RECV=$sr"
  if [ "$tw" -gt 100000 ]; then echo "ALERT: TIME_WAIT exceeds 100000!"; fi
  if [ "$cw" -gt 1000 ]; then echo "ALERT: CLOSE_WAIT exceeds 1000, check application!"; fi
  sleep 10
done
Prometheus Alert Rules
groups:
  - name: tcp_alerts
    rules:
      # Metric names below are illustrative; with node_exporter's tcpstat
      # collector the equivalent series is node_tcp_connection_states{state="time_wait"}.
      - alert: HighTimeWaitConnections
        expr: node_netstat_Tcp_TimeWait > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TIME_WAIT connections too high"
          description: "Current TIME_WAIT: {{ $value }}"
      - alert: CloseWaitAccumulation
        expr: node_netstat_Tcp_CloseWait > 1000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CLOSE_WAIT accumulation"
          description: "Possible connection leak, current CLOSE_WAIT: {{ $value }}"
      - alert: SynFloodSuspected
        # SYN_RECV is a gauge, so compare the level directly rather than rate()
        expr: node_netstat_Tcp_SynRecv > 10000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Possible SYN Flood attack"
Quick‑Check Checklist
# 1. View connection‑state distribution
ss -ant | awk 'NR>1 {++state[$1]} END {for(k in state) print k, state[k]}'
# 2. Look for queue overflows
netstat -s | grep -E "overflow|prune|SYN"
# 3. Inspect local port usage
ss -ant | awk '{print $4}' | grep -oP ':\d+$' | sort | uniq -c | sort -rn | head -10
# 4. Check file‑descriptor usage
cat /proc/sys/fs/file-nr
# 5. If using conntrack, examine its counters
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# 6. Verify current kernel parameters
sysctl -a | grep -E "somaxconn|tcp_max_syn_backlog|tcp_tw|tcp_fin"
Failure Case Reviews
Case 1: E‑commerce Spike – TIME_WAIT Exhaustion
During a Double‑11 sale, API latency jumped from 50 ms to 5 s, with many 504 errors. Diagnostics revealed > 210 k TIME_WAIT sockets and a fully exhausted local port range.
# Emergency fixes
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sysctl -w net.ipv4.tcp_fin_timeout=15
# Long‑term: enable keepalive connections in Nginx upstream
Case 2: Java Application – CLOSE_WAIT Leak
After a week of running, a Java service OOM‑crashed. Investigation showed > 85 k CLOSE_WAIT sockets, all pointing to a Redis client pool that failed to return connections on error.
// Correct usage with try‑with‑resources
try (Jedis jedis = jedisPool.getResource()) {
// use jedis
} // automatically returns to pool
Case 3: SYN Flood Attack
New connections stopped establishing while existing ones stayed alive. SYN_RECV count hit the backlog limit (65 k). The source IPs were few, indicating a targeted SYN‑Flood.
# Emergency mitigation
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=262144
# Rate‑limit SYN packets at the firewall
iptables -A INPUT -p tcp --syn -m limit --limit 100/s --limit-burst 200 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP
Conclusion
Key Parameter Quick Reference
tcp_tw_reuse : enable TIME_WAIT reuse – set to 1 (short‑connection services).
tcp_max_tw_buckets : upper limit for TIME_WAIT – 200 000 (high‑concurrency).
tcp_fin_timeout : FIN timeout – 15‑30 s (all scenarios).
somaxconn : accept queue size – 65 535 (high‑concurrency).
tcp_max_syn_backlog : SYN backlog – 65 535 (high‑concurrency).
ip_local_port_range : local port range – 1024‑65535 (proxy servers).
Tuning Principles
Monitor first; base changes on data.
Change one parameter at a time and observe the effect.
Backup original configurations before any modification.
Test on the target kernel version; behavior may differ.
Application‑level optimizations (keep‑alive, connection pools) outweigh kernel tweaks.
Advanced Directions
Use eBPF for deep TCP analysis.
Tune congestion‑control algorithms (BBR, CUBIC).
Explore user‑space stacks like DPDK.
Specialize tuning for container networking.