Why Did TCP Connections Skyrocket from 15K to 65K? Full Diagnosis and Fix
This postmortem details a production outage caused by a sudden surge of TCP connections from 15 K to 65 K, explains how to reproduce the issue, walks through step‑by‑step diagnostics, root‑cause analysis, and permanent remediation using Linux kernel tuning, connection‑pool configuration, and monitoring alerts.
Applicable Scenarios & Prerequisites
Applicable scenarios :
High‑concurrency Web/API services (QPS 5000+)
Improper TCP connection‑pool management or connection leaks
Large numbers of TIME_WAIT/CLOSE_WAIT connections
Traffic spikes or slow‑client attacks causing connection buildup
Prerequisites :
Linux kernel 4.4+ (5.10+ recommended for eBPF diagnostics)
Root or sudo privileges to view system parameters and network statistics
Installed diagnostic tools: ss, netstat, tcpdump, strace, perf
Access to monitoring (Prometheus/Grafana) or application logs
Environment & Version Matrix
Key components include RHEL 8+/Ubuntu 22.04+, kernel 5.10+, Nginx 1.20+, Java 11+, HikariCP connection pool, Prometheus 2.40+, and eBPF‑enabled tools.
Quick Checklist
Step 1: Confirm symptoms (abnormal connections, CPU/memory spikes, request timeouts)
Step 2: Capture real‑time TCP connection distribution
Step 3: Identify leak source (application/Nginx/database)
Step 4: Analyze kernel parameters and system limits
Step 5: Emergency stop‑gap (restart services, rate‑limit, scale out)
Step 6: Locate root cause (code/config/architecture)
Step 7: Implement permanent fix (parameter tuning, code changes, architecture improvements)
Step 8: Set up monitoring alerts and load‑test verification
Incident Timeline
Incident window : 2025‑10‑15, 14:32 to 14:55 (23 minutes), affecting all API services.
Implementation Steps
Step 1: Confirm Failure Symptoms
Goal: Quickly determine whether the issue is TCP‑related.
# View total TCP connections
ss -s
# Example output during failure:
# Total: 65535
# TCP: 58324 (estab 2341, closed 35123, orphaned 2048, timewait 18234)
# Compare with normal (<20000)
# TCP: 15234 (estab 8432, closed 3421, orphaned 12, timewait 2987)
# Count connections on the application port (8080, served by PID 12345)
ss -Hant '( sport = :8080 )' | wc -l
# Failure output: 42318 (abnormally high)
# System load and CPU
uptime
# load average: 45.32, 38.21, 25.67 (normal <5)
top -bn1 | head -20
# CPU: 78%us, 12%sy, 0%ni, 2%id, 8%wa (high iowait)
Key metric thresholds :
closed > 30000 (normal < 5000)
orphaned > 1000 (indicates orphan sockets)
timewait > 15000 (slow TIME_WAIT recycling)
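The thresholds above can be checked mechanically. The following is a minimal sketch, not part of the original runbook: the function name `check_tcp_summary` is hypothetical, and it assumes the `TCP:` line of `ss -s` has the layout shown in the sample output.

```bash
#!/bin/bash
# check_tcp_summary: read `ss -s` output on stdin and flag counters that exceed
# the thresholds quoted above (closed > 30000, orphaned > 1000, timewait > 15000).
check_tcp_summary() {
  local line pair name limit value status=0
  line=$(grep -m1 '^TCP:')
  for pair in closed:30000 orphaned:1000 timewait:15000; do
    name=${pair%%:*}
    limit=${pair##*:}
    # Pull the number that follows the counter name, e.g. "closed 35123".
    value=$(printf '%s\n' "$line" | sed -n "s/.*${name} \([0-9][0-9]*\).*/\1/p")
    if [ -z "$value" ]; then
      continue
    fi
    if [ "$value" -gt "$limit" ]; then
      echo "ALERT ${name}=${value} (threshold ${limit})"
      status=1
    else
      echo "ok ${name}=${value}"
    fi
  done
  return "$status"
}

# Usage on a live host: ss -s | check_tcp_summary
```

The function returns non-zero when any counter is over its threshold, so it can gate a cron job or a health-check script.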
Step 2: Capture Real‑Time TCP State Distribution
Goal: Pinpoint which connection states are abnormal.
# Count connections per state (-H drops the header row)
ss -Hant | awk '{print $1}' | sort | uniq -c | sort -rn
# Sample output during failure (ss spells state names with hyphens):
# 35123 CLOSE-WAIT ← application not closing sockets
# 18234 TIME-WAIT
# 2341 ESTAB
# 1021 FIN-WAIT-2
# 412 SYN-RECV
# Show details of CLOSE_WAIT connections (top 10)
ss -Hantp state close-wait | head -10
# Example line (ss omits the State column when a single state is selected):
# 0 1 192.168.1.10:8080 192.168.1.100:52341 users:(("java",pid=12345,fd=8923))
# Identify remote IPs with most connections (possible attack)
# With a single-state filter the peer address is column 4 (IPv4 addresses assumed)
ss -Hant state established | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
# Sample:
# 8234 192.168.1.100 ← 8K+ connections from a single IP
# 3421 192.168.1.101
CLOSE_WAIT interpretation :
Definition : Remote side closed (FIN) but local process has not called close().
Impact : Sockets linger, consuming file descriptors and memory.
Root cause : Application code fails to close connections (common in HTTP clients).
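Because CLOSE_WAIT points at a local process that has not called close(), it helps to rank processes by how many such sockets they hold. A minimal sketch, assuming the `users:((...,pid=N,fd=M))` annotation that `ss -p` prints when run with sufficient privileges; the function name `close_wait_by_pid` is hypothetical.

```bash
#!/bin/bash
# close_wait_by_pid: read `ss -Hantp state close-wait` output on stdin and
# print "<count> <pid>" per owning process, busiest first.
# A socket shared by several processes is counted once for each of them.
close_wait_by_pid() {
  grep -o 'pid=[0-9]*' \
    | cut -d= -f2 \
    | sort \
    | uniq -c \
    | sort -rn \
    | sed 's/^ *//'
}

# Usage on a live host (root is needed to see other users' processes):
#   ss -Hantp state close-wait | close_wait_by_pid
```

A single PID holding nearly all CLOSE_WAIT sockets is a strong hint that the leak sits in that application rather than in a proxy or in the kernel settings.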
Step 3: Identify Leak Source
Goal: Determine whether the leak originates from the application, Nginx, or the database pool.
Check application process file descriptors
# Count open file descriptors for PID 12345
ls /proc/12345/fd | wc -l
# Failure output: 45231 (near ulimit)
# Count socket descriptors
ls -l /proc/12345/fd | awk '{print $11}' | grep socket | wc -l
# Output: 42318 (almost all sockets)
# Check the fd limit of the current shell (the application's own limit is read from /proc below)
ulimit -n
# Output: 65535
# Show process limits
cat /proc/12345/limits | grep "open files"
# Max open files 65535 65535 files
Inspect application connection‑pool configuration (Spring Boot example)
# Grep Hikari/ datasource settings
grep -A5 "hikari\|datasource" /app/config/application.yml
# Output excerpt:
# spring:
# datasource:
# hikari:
# maximum-pool-size: 50 ← pool too small
# connection-timeout: 30000 ← 30 s too long
# idle-timeout: 600000 ← idle kept 10 min
# Search application logs for connection errors
tail -1000 /app/logs/application.log | grep -i "connection\|socket\|timeout"
# Sample lines:
# [ERROR] HikariPool-1 - Connection is not available, request timed out after 30000ms
# [WARN] java.net.SocketException: Too many open files
# [ERROR] Failed to obtain JDBC Connection: Could not open connection
Check Nginx connection pool
# Show upstream connections (ss prints the state as ESTAB, so filter by state instead of grepping for ESTABLISHED)
ss -Hantp state established | grep -c nginx
# Output: 3421 (backend connections normal)
# Verify keepalive settings
grep -A5 "upstream\|keepalive" /etc/nginx/nginx.conf
# Sample:
# upstream backend {
# server 192.168.1.10:8080;
# keepalive 32; ← only 32 connections (recommend 128+)
# keepalive_timeout 60s;
# }
Step 4: Analyze Kernel Parameters & System Limits
Goal: Confirm whether system‑level settings cause bottlenecks.
# Show key TCP sysctl values
sysctl -a | grep -E "tcp_max_orphans|tcp_fin_timeout|tcp_tw_reuse|net.ipv4.ip_local_port_range"
# Output:
# net.ipv4.tcp_max_orphans = 8192 ← low orphan limit (recommend 65536)
# net.ipv4.tcp_fin_timeout = 60 ← FIN_WAIT2 timeout too long (recommend 10‑30)
# net.ipv4.tcp_tw_reuse = 0 ← TIME_WAIT reuse disabled
# net.ipv4.ip_local_port_range = 32768 60999 ← only ~28 K ports available (recommend 10000‑65000)
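# Current TIME_WAIT count (ss prints the state as TIME-WAIT, so filter by state instead of grepping)
ss -Hant state time-wait | wc -l
# Output: 18234 (high; TIME_WAIT sockets held by the connecting side tie up ephemeral ports for new outbound connections)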
# System‑wide file descriptor limit
cat /proc/sys/fs/file-max
# Output: 1048576 (sufficient, issue is per‑process ulimit)
# Conntrack table (NAT environments)
cat /proc/sys/net/netfilter/nf_conntrack_max
# Output: 65536
cat /proc/sys/net/netfilter/nf_conntrack_count
# Output: 62341 (near limit, may cause NAT drops)
Step 5: Emergency Stop‑Gap Measures
Goal: Quickly restore service and prevent further damage.
Option 1: Force‑close CLOSE_WAIT sockets (high risk)
# List PIDs owning CLOSE_WAIT sockets
ss -Hantp state close-wait | grep -oP 'pid=\K\d+' | sort -u
# Example output: 12345 12346 12347
# Use gdb to close only the CLOSE_WAIT sockets of PID 12345 (requires gdb, use with caution).
# Looping over everything in /proc/12345/fd would also close log files and healthy connections.
for fd in $(ss -Hantp state close-wait | grep -oP 'pid=12345,fd=\K\d+' | sort -un); do
  gdb -p 12345 -batch -ex "call (int)close($fd)" 2>/dev/null
done
# Verify connection count drops
ss -antp | grep :8080 | wc -l
Option 2: Restart application (recommended)
# Rolling restart in Kubernetes
kubectl rollout restart deployment/api-service -n production
# Or classic systemd restart
systemctl restart api-service
# Verify connections after 30 s
sleep 30
ss -s | grep TCP
# Expected: TCP: 12345 (estab 8000, closed 2000, ...)
Option 3: Temporarily raise system limits
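# Increase per‑process fd limit for new login sessions (PAM limits; systemd services need LimitNOFILE= in the unit file instead)
echo "* soft nofile 100000" >> /etc/security/limits.conf
echo "* hard nofile 100000" >> /etc/security/limits.conf
# Immediate limit for the current shell and the processes started from it
ulimit -n 100000
# Raise the limit of the already running process without a restart (prlimit from util-linux)
prlimit --pid 12345 --nofile=100000:100000
# Raise kernel TCP parameters
sysctl -w net.ipv4.tcp_max_orphans=65536
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.ip_local_port_range="10000 65000"
Step 6: Root‑Cause Localization (Code Review)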
Goal: Find the exact code path leaking connections.
Use jstack to analyze thread stacks
# Capture thread dump for PID 12345
jstack 12345 > /tmp/jstack-$(date +%s).txt
# Search for threads waiting on sockets
grep -A10 "waiting\|blocked" /tmp/jstack-*.txt | grep -i "socket\|connection"
# Sample line:
# "http-nio-8080-exec-234" #456 daemon prio=5 waiting on condition
# java.net.SocketInputStream.socketRead0(Native Method)
# com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
# at com.example.service.UserService.queryUser(UserService.java:45)
Problematic code snippet (connection leak)
public User queryUser(Long userId) {
    Connection conn = dataSource.getConnection(); // ← acquire
    PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users WHERE id = ?");
    stmt.setLong(1, userId);
    ResultSet rs = stmt.executeQuery();
    if (rs.next()) {
        return new User(rs.getString("name"), rs.getString("email"));
    }
    // BUG: rs/stmt/conn not closed → leak
    return null;
}
Fixed version using try‑with‑resources
public User queryUser(Long userId) {
    String sql = "SELECT * FROM users WHERE id = ?";
    try (Connection conn = dataSource.getConnection();
         PreparedStatement stmt = conn.prepareStatement(sql)) {
        stmt.setLong(1, userId);
        try (ResultSet rs = stmt.executeQuery()) {
            if (rs.next()) {
                return new User(rs.getString("name"), rs.getString("email"));
            }
        }
    } catch (SQLException e) {
        log.error("Failed to query user", e);
        throw new RuntimeException(e);
    }
    return null;
}
Use eBPF to trace socket creation/close (kernel 5.10+)
# bpftrace one‑liner (run as root)
bpftrace -e 'tracepoint:syscalls:sys_enter_socket { @socket_count[comm] = count(); }
tracepoint:syscalls:sys_enter_close { @close_count[comm] = count(); }'
# After 10 s (Ctrl+C) sample output:
# @socket_count[java]: 15234
# @close_count[java]: 8921 ← sockets created far more than closed
# Note: close() is also called on regular files, and socket() does not cover accepted connections, so treat the gap as a rough signal only
Step 7: Permanent Fix Implementation
Goal: Resolve the issue at code, configuration, and architecture layers.
Code fix (already shown in Step 6)
# Commit the corrected UserService
git add src/main/java/com/example/service/UserService.java
git commit -m "Fix connection leak in UserService.queryUser()"
git push origin hotfix/connection-leak
# Deploy hotfix
./deploy.sh --env production --version hotfix-20251015
Connection‑pool configuration improvement
spring:
  datasource:
    hikari:
      maximum-pool-size: 200 # raised from 50; validate under load, since HikariCP's own sizing guidance favors pools of only a few connections per CPU core
      minimum-idle: 20
      connection-timeout: 5000 # 5 s
      idle-timeout: 300000 # 5 min
      max-lifetime: 1800000 # 30 min
      leak-detection-threshold: 60000 # warn if held >60 s
Explanation of leak‑detection:
If a connection is held longer than leak-detection-threshold, HikariCP logs a stack trace.
Check logs with grep "Connection leak detection" /app/logs/application.log.
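Once leak detection is on, grouping the warnings by thread shows which request paths hold connections too long. The sketch below assumes the warning text "Connection leak detection triggered for <conn> on thread <name>, stack trace follows", which recent HikariCP versions emit; check it against your own logs, since the surrounding log layout depends on your logging configuration. The function name `leaks_by_thread` is hypothetical.

```bash
#!/bin/bash
# leaks_by_thread: read application log lines on stdin and count HikariCP
# leak-detection warnings per thread name, most frequent first.
leaks_by_thread() {
  grep 'Connection leak detection triggered' \
    | sed -n 's/.* on thread \([^,]*\), stack trace follows.*/\1/p' \
    | sort \
    | uniq -c \
    | sort -rn \
    | sed 's/^ *//'
}

# Usage: leaks_by_thread < /app/logs/application.log
```

The stack trace that follows each warning in the log still has to be read by hand; this only tells you where to look first.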
Persist kernel parameters
# Append to /etc/sysctl.d/99-tcp-tuning.conf
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 8192
# The two conntrack keys below exist only when the nf_conntrack module is loaded
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
# Apply immediately
sysctl -p /etc/sysctl.d/99-tcp-tuning.conf
Nginx connection‑pool tuning
# /etc/nginx/nginx.conf (relevant section)
http {
    upstream backend {
        server 192.168.1.10:8080;
        server 192.168.1.11:8080;
        keepalive 256; # larger pool
        keepalive_requests 1000; # max requests per connection
        keepalive_timeout 60s; # idle timeout
    }
    server {
        location /api/ {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection ""; # enable reuse
            proxy_connect_timeout 5s;
            proxy_send_timeout 10s;
            proxy_read_timeout 10s;
        }
    }
}
# Test and reload
nginx -t && nginx -s reload
Step 8: Monitoring, Alerting & Load‑Test Validation
Prometheus alert rules
# prometheus-rules.yaml
groups:
  - name: tcp_connection_alerts
    interval: 30s
    rules:
      # CurrEstab counts sockets in ESTABLISHED or CLOSE_WAIT, so this rule also fires on a CLOSE_WAIT buildup
      - alert: TCPConnectionSurge
        expr: node_netstat_Tcp_CurrEstab > 30000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TCP ESTABLISHED connections exceed 30K"
      # node_sockstat_TCP_tw counts TIME_WAIT sockets. A per-state CLOSE_WAIT count is only exported
      # by the optional tcpstat collector, as node_tcp_connection_states{state="close_wait"}.
      - alert: TCPTimeWaitHigh
        expr: node_sockstat_TCP_tw > 20000
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "TIME_WAIT connections exceed 20K"
      - alert: FileDescriptorNearLimit
        expr: process_open_fds / process_max_fds > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Process FD usage > 80%"
Grafana panels (example queries)
# TCP connection state trend (needs node_exporter started with --collector.tcpstat; PromQL has no metric-name wildcards)
sum by (state) (node_tcp_connection_states)
# Application pool usage
hikaricp_connections_active{pool="HikariPool-1"} / hikaricp_connections{pool="HikariPool-1"} * 100
# New connections per second
rate(node_netstat_Tcp_PassiveOpens[1m]) + rate(node_netstat_Tcp_ActiveOpens[1m])
Load‑test to verify fix
# wrk high‑concurrency test (10 threads, 1000 connections, 5 min)
wrk -t10 -c1000 -d300s --latency http://api.example.com/health
# Monitor connections during test
watch -n1 'ss -s | grep TCP'
# Expected results:
# - TCP connections stable at 15K‑20K
# - CLOSE_WAIT < 500
# - P99 latency < 500 ms
# - No "Too many open files" errors
Monitoring & Alerts
Node Exporter TCP metrics
# Install Node Exporter (example version 1.6.0)
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xf node_exporter-1.6.0.linux-amd64.tar.gz
sudo cp node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
# Assumes a node_exporter.service unit file already exists; the release tarball does not ship one
sudo systemctl enable --now node_exporter
# Verify metrics
curl -s http://localhost:9100/metrics | grep node_netstat_Tcp
# Example keys: node_netstat_Tcp_CurrEstab, node_netstat_Tcp_InSegs, node_netstat_Tcp_OutSegs, node_netstat_Tcp_RetransSegs
Real‑time monitoring script
#!/bin/bash
# tcp-monitor.sh – continuous TCP stats
while true; do
  echo "=== TCP Connection Stats $(date) ==="
  ss -s | grep TCP
  echo -e "\n=== Top 5 Connection States ==="
  ss -Hant | awk '{print $1}' | sort | uniq -c | sort -rn | head -5
  echo -e "\n=== Top 5 Remote IPs by Connection Count ==="
  # With a single-state filter ss omits the State column, so the peer address is column 4
  ss -Hant state established | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -5
  echo -e "\n=== Application FD Usage ==="
  APP_PID=$(pgrep -f java | head -1)
  if [ -n "$APP_PID" ]; then
    FD_COUNT=$(ls /proc/$APP_PID/fd 2>/dev/null | wc -l)
    FD_LIMIT=$(awk '/open files/ {print $4}' /proc/$APP_PID/limits 2>/dev/null)
    echo "PID $APP_PID: $FD_COUNT / $FD_LIMIT FDs"
  fi
  echo "========================================="
  sleep 10
done
Performance & Capacity Planning
Connection capacity calculation
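For outbound connections, the theoretical maximum is the local port range multiplied by the number of distinct remote endpoints; inbound connections to a listening port are not bounded by the local port range. The practical limit is min(system fd limit, kernel limits, pool configuration, memory at ≈4 KB per socket). Example on an 8 CPU / 16 GB server: port range 10000‑65000 (≈55 K ports per destination), memory allows ~2.8 M sockets, fd limit 100 K. The fd limit is the binding constraint, so plan for 50 K concurrent connections, which leaves 50 % headroom.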
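The arithmetic behind that example can be sketched in a few lines of shell. The function name `estimate_capacity` is hypothetical, and the 70 % share of RAM available to sockets is an assumption chosen to land near the ~2.8 M figure; the article does not state how that figure was derived.

```bash
#!/bin/bash
# estimate_capacity PORT_LOW PORT_HIGH MEM_BYTES FD_LIMIT
# Prints the intermediate limits and a recommended ceiling with 50% headroom.
estimate_capacity() {
  local port_low=$1 port_high=$2 mem_bytes=$3 fd_limit=$4
  local per_socket=4096   # ~4 KB per socket, the article's rough figure
  local ports mem_sockets practical
  ports=$(( port_high - port_low + 1 ))
  # Assumption: about 70% of RAM may be spent on socket buffers.
  mem_sockets=$(( mem_bytes * 70 / 100 / per_socket ))
  # The practical limit is the smaller of the fd limit and the memory bound.
  practical=$fd_limit
  if [ "$mem_sockets" -lt "$practical" ]; then
    practical=$mem_sockets
  fi
  echo "ports_per_destination=$ports"
  echo "memory_bound_sockets=$mem_sockets"
  echo "practical_limit=$practical"
  echo "recommended=$(( practical / 2 ))"
}

# The article's example host: ports 10000-65000, 16 GB RAM, fd limit 100000
estimate_capacity 10000 65000 $((16 * 1024 * 1024 * 1024)) 100000
```

For the example host the fd limit is the binding constraint, so the recommended value comes out at 50000, matching the figure in the text.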
Benchmarking
# Test max concurrent connections with tcpkali
tcpkali -c 50000 -T 60s --connect-rate 1000 192.168.1.10:8080
# Expected: Total connections 50000, traffic ~12.5 Gbps, 0 errors
# Test slow‑connection handling with wrk
wrk -t4 -c1000 -d60s --latency --timeout 60s http://api.example.com/slow-api
# Monitor CLOSE_WAIT growth (ss prints the state as CLOSE-WAIT, so filter by state instead of grepping):
watch -n1 'ss -Hant state close-wait | wc -l'
Security & Compliance
SYN Flood protection
# Enable SYN cookies
sysctl -w net.ipv4.tcp_syncookies=1
# Increase SYN backlog and socket backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.core.somaxconn=8192
# Reduce SYN‑ACK retries
sysctl -w net.ipv4.tcp_synack_retries=2
Connection rate limiting with iptables
# Limit concurrent connections per IP (prevent slow‑loris)
iptables -A INPUT -p tcp --dport 8080 -m connlimit --connlimit-above 100 -j REJECT
# Limit new connections per second per IP
iptables -A INPUT -p tcp --dport 8080 -m state --state NEW -m recent --set
iptables -A INPUT -p tcp --dport 8080 -m state --state NEW -m recent --update --seconds 1 --hitcount 20 -j DROP
Common Issues & Troubleshooting Table
Key symptoms, diagnostic commands, possible causes, quick fixes, and permanent solutions are summarized (e.g., CLOSE_WAIT buildup → restart app or fix code; TIME_WAIT excess → enable tcp_tw_reuse and keep‑alive; “Too many open files” → raise ulimit -n).
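The symptom-to-fix mapping summarized above lends itself to a small triage helper. This is only a sketch: the function name `triage_hint` is hypothetical, it covers just the two state-based cases named in the summary, and it expects "<count> <STATE>" lines such as those produced by `ss -Hant | awk '{print $1}' | sort | uniq -c`.

```bash
#!/bin/bash
# triage_hint: read "<count> <STATE>" lines on stdin and print the quick fix
# and permanent fix for whichever state has the most sockets.
triage_hint() {
  local top state
  top=$(sort -rn | head -1)
  state=$(printf '%s\n' "$top" | awk '{print $2}')
  case "$state" in
    CLOSE-WAIT|CLOSE_WAIT)
      echo "CLOSE_WAIT dominates: restart the app now, then fix the code path that skips close()" ;;
    TIME-WAIT|TIME_WAIT)
      echo "TIME_WAIT dominates: enable tcp_tw_reuse and keep-alive" ;;
    *)
      echo "no rule for dominant state '${state}'; inspect manually" ;;
  esac
}

# Usage: ss -Hant | awk '{print $1}' | sort | uniq -c | triage_hint
```

File descriptor exhaustion ("Too many open files") is not visible in the state counts and still needs the per-process checks from Step 3.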
Change & Rollback Scripts
Apply TCP tuning
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/root/tcp-config-backup-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
echo "==> Backing up current sysctl settings"
# sysctl -a can exit non-zero on unreadable keys; do not let pipefail abort the backup
{ sysctl -a 2>/dev/null || true; } | grep -E "tcp_|ip_local_port" > "$BACKUP_DIR/sysctl-before.txt"
cp /etc/sysctl.d/99-tcp-tuning.conf "$BACKUP_DIR/" 2>/dev/null || true
echo "==> Applying new configuration"
sysctl -p /etc/sysctl.d/99-tcp-tuning.conf
echo "==> Verifying configuration"
sysctl net.ipv4.tcp_tw_reuse | grep -q "= 1" || { echo "tcp_tw_reuse not applied"; exit 1; }
sysctl net.ipv4.tcp_fin_timeout | grep -q "= 15" || { echo "tcp_fin_timeout not applied"; exit 1; }
echo "==> Configuration applied successfully, backup stored at $BACKUP_DIR"
Rollback script
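#!/bin/bash
BACKUP_FILE="/root/tcp-config-backup-20251015/sysctl-before.txt"
while IFS='=' read -r key value; do
  # The backup holds "key = value" lines; strip the whitespace around "="
  key="${key//[[:space:]]/}"
  value="$(echo "$value" | xargs)"
  # Read-only keys captured by the backup are reported and skipped
  sysctl -w "${key}=${value}" || echo "skipped ${key}"
done < "$BACKUP_FILE"
echo "Configuration rolled back"
Best Practices (10 items)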
Force connection‑pool usage for all external resources (DB/Redis/HTTP); never create raw sockets.
Three‑layer timeout protection: connection 5 s, read 10 s, total 30 s.
Enable HikariCP leak detection (leak-detection-threshold=60000) and review logs regularly.
Standardize system TCP parameters: tcp_tw_reuse=1, tcp_fin_timeout=15, port range 10000‑65000.
Set per‑process ulimit -n to at least twice the expected concurrent connections.
Monitor four key metrics: ESTABLISHED, CLOSE_WAIT, TIME_WAIT, and FD usage; set P90 alerts.
Validate every config or code change with load‑testing (wrk/tcpkali).
Protect against slow clients in Nginx: client_body_timeout 10s, send_timeout 10s.
Enable HTTP keep‑alive on clients and Nginx upstream pools.
Weekly health checks: ss -s, FD usage per process, and dmesg | grep -i tcp.
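The rule about sizing ulimit -n at twice the expected concurrency is easy to verify automatically. A minimal sketch, assuming the standard layout of /proc/<pid>/limits; the function name `fd_headroom` is hypothetical.

```bash
#!/bin/bash
# fd_headroom LIMITS_FILE EXPECTED_CONNECTIONS
# Succeeds when the soft "Max open files" limit is at least twice the
# expected number of concurrent connections.
fd_headroom() {
  local limits_file=$1 expected=$2 soft
  # Field 4 of the "Max open files" row is the soft limit.
  soft=$(awk '/^Max open files/ {print $4}' "$limits_file")
  if [ -z "$soft" ]; then
    echo "could not read soft limit from $limits_file" >&2
    return 2
  fi
  if [ "$soft" = "unlimited" ] || [ "$soft" -ge $(( expected * 2 )) ]; then
    echo "ok: soft limit $soft covers 2 x $expected"
  else
    echo "too low: soft limit $soft < $(( expected * 2 ))"
    return 1
  fi
}

# Usage: fd_headroom /proc/12345/limits 20000
```

Passing the limits file as an argument keeps the check usable against a saved copy of the file as well as against a live process.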
Appendix: Full Diagnostic Script
#!/bin/bash
# tcp-full-diagnostic.sh – comprehensive TCP diagnostics
echo "=== System Info ==="
uname -r
uptime
echo -e "\n=== TCP Connection Summary ==="
ss -s
echo -e "\n=== Connection State Distribution ==="
ss -Hant | awk '{print $1}' | sort | uniq -c | sort -rn
echo -e "\n=== Top 10 Remote IPs ==="
# With a single-state filter ss omits the State column, so the peer address is column 4
ss -Hant state established | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
echo -e "\n=== Application FD Usage ==="
# pgrep takes an extended regex, so alternation is a plain "|"
for pid in $(pgrep -f "java|nginx|python"); do
  comm=$(ps -p "$pid" -o comm=)
  fd_count=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
  fd_limit=$(awk '/open files/ {print $4}' /proc/$pid/limits 2>/dev/null)
  echo "$comm (PID $pid): $fd_count / $fd_limit FDs"
done
echo -e "\n=== System Limits ==="
ulimit -n
cat /proc/sys/fs/file-max
echo -e "\n=== Key TCP Parameters ==="
sysctl net.ipv4.tcp_max_orphans net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range
echo -e "\n=== Network Errors ==="
netstat -s | grep -E "failed|dropped|error" | head -10
echo -e "\n=== Recent Kernel Messages ==="
dmesg | grep -i tcp | tail -20
Root‑cause summary : The outage was triggered by a code defect (missing close() in UserService.queryUser()), an undersized HikariCP pool (max 50, timeout 30 s), and kernel parameters that let TIME_WAIT sockets linger and restricted the local port range. After fixing the code, enlarging the pool, and applying TCP tuning, connections stabilized at 15 K‑18 K, P99 latency dropped to ~300 ms, and CLOSE_WAIT fell below 200.
MaGe Linux Operations
