Unlock Million-Connection Web Servers: Essential Linux sysctl Tuning Guide
This comprehensive guide explains how to optimize Linux kernel parameters with sysctl for high‑traffic web services, covering prerequisite hardware, network and memory settings, step‑by‑step configuration, verification, common pitfalls, monitoring, and rollback procedures to achieve stable million‑connection performance.
Applicable Scenarios & Prerequisites

| Item | Requirement |
| --- | --- |
| Applicable scenario | Daily PV 10M+ / QPS 100k+ high-concurrency web services, API gateways, load balancers |
| OS | RHEL/CentOS 8.5+ or Ubuntu 20.04+ |
| Kernel | Linux kernel 5.10+ (recommended 5.15+ / 6.1+) |
| Hardware specs | Minimum: 16C 32G / Recommended: 32C 64G / 10 GbE NIC |
| Network | 10 GbE NIC, switch supporting jumbo frames (MTU 9000) |
| Permissions | Root (to modify kernel parameters) |
| Skill requirements | Familiarity with the TCP/IP stack, Linux kernel parameters, and network programming |
Anti‑Pattern Warnings (When Not Applicable)
⚠️ The following scenarios are not recommended for this solution:
Low‑traffic applications : QPS < 1k, default kernel parameters are sufficient; over‑optimizing adds maintenance cost.
Desktop systems / development environments : Optimizations target servers; desktop systems may experience latency.
Container environments (partial parameters) : Docker/Kubernetes control some parameters from the host; they cannot be changed inside containers.
Non‑persistent‑connection scenarios : Short‑lived connections (e.g., crawlers) need different strategies (fast TIME_WAIT reclamation).
Specific hardware environments : Some parameters depend on hardware features (e.g., TCP offload) and may not apply to virtualized environments.
Alternative Solutions Comparison
Scenario
Recommended Solution
Reason
Container environment
Host optimization + cgroup limits
Containers share the host kernel.
Low‑latency scenario
DPDK / XDP kernel bypass
Bypasses kernel stack, latency < 10 µs.
Low‑traffic scenario
Default parameters + application‑level tuning
Optimize code before kernel.
UDP scenario
Dedicated UDP tuning parameters
This article focuses on TCP.
Environment & Version Matrix

| Component | RHEL/CentOS | Ubuntu/Debian | Test Status |
| --- | --- | --- | --- |
| OS version | RHEL 9.3 / CentOS Stream 9 | Ubuntu 22.04 LTS | [Tested] |
| Kernel version | 5.14.0-362+ | 5.15.0-92+ | [Tested] |
| Recommended kernel | 6.1+ (see kernel notes below) | 6.1+ | [Theoretical] |
| Hardware (minimum) | 16C 32G / dual 10 GbE NIC | 16C 32G / dual 10 GbE NIC | - |
| Hardware (recommended) | 32C 64G / dual 10 GbE NIC / NVMe SSD | 32C 64G / dual 10 GbE NIC / NVMe SSD | - |
Kernel version differences:
Kernel 5.10 vs 5.15: 5.15 carries incremental TCP/BBR improvements.
Kernel 5.15 vs 6.1: 6.1 brings further io_uring and cgroup v2 improvements (note that BBR v2 is not in the mainline kernel; it is maintained in Google's out-of-tree branch).
Kernel 4.x: early 4.x kernels lack BBR (added in 4.9) and other modern TCP features; not recommended for high-concurrency workloads.
Reading Navigation
Suggested reading path:
Quick start (30 min): Quick checklist → Steps 1‑3 (network parameters) → Validation tests.
Deep dive (90 min): Minimal principles → Full step‑by‑step → Benchmark tests → Best practices.
Troubleshooting: Common issues → Debugging ideas → FAQ.
Quick Checklist
Preparation
- Backup current kernel parameters: sysctl -a > /tmp/sysctl_backup.conf
- Check kernel version: uname -r (needs >= 5.10)
- Check NIC model and driver: ethtool -i eth0
- Check current connections: ss -s

Network Parameter Optimization
- TCP connection queue: net.core.somaxconn
- TCP buffers: net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
- TIME_WAIT tuning: net.ipv4.tcp_fin_timeout, net.ipv4.tcp_tw_reuse
- TCP congestion control: net.ipv4.tcp_congestion_control = bbr

File Descriptor Optimization
- System limit: fs.file-max = 2097152
- Per-user limits in /etc/security/limits.conf

Memory Parameter Optimization
- Swap tendency: vm.swappiness = 10
- Dirty page thresholds: vm.dirty_ratio, vm.dirty_background_ratio
- Overcommit: vm.overcommit_memory = 1

Verification & Testing
- Apply configuration: sysctl -p
- Performance benchmark (wrk/ab)
- Monitor key metrics (netstat, ss, free)
Implementation Steps
Step 1: Network Parameter Optimization (Core)
Goal: Optimize TCP stack to support million‑level concurrent connections.
1.1 TCP Connection Queue Optimization
Configuration file: /etc/sysctl.conf or /etc/sysctl.d/99-custom.conf

# TCP connection queue (most critical parameters)
net.core.somaxconn = 65535 # listen() backlog upper limit
net.ipv4.tcp_max_syn_backlog = 8192 # SYN queue length (half-open)
net.core.netdev_max_backlog = 16384 # NIC receive queue length
# Explanation:
# - somaxconn: maximum backlog value for listen()
# - tcp_max_syn_backlog: half-open queue, prevents SYN flood
# - netdev_max_backlog: driver-to-stack queue size

Key parameter explanations: net.core.somaxconn is the upper limit of the full-connection (accept) queue; applications cannot set a backlog larger than this. net.ipv4.tcp_max_syn_backlog is the upper limit of the half-open (SYN-RECV) queue.
Actual full‑connection queue size = min(backlog, somaxconn).
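To see both values at runtime, ss can display the accept queue of a listening socket directly; a quick check (port 80 is an illustrative assumption):

# For LISTEN sockets, Send-Q shows the effective backlog (min(backlog, somaxconn))
# and Recv-Q shows how many connections are currently waiting to be accepted
ss -ltn sport = :80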
Pre‑validation:
# View current values
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog
# Expected output: 128 / 512 (default values are too low)

Post-validation:
# Apply configuration
sysctl -p
# Verify effect
sysctl net.core.somaxconn # Expected: 65535 [tested]
# Load test (wrk)
wrk -t10 -c10000 -d30s http://localhost/
# Observe whether any connections are refused

Common errors:
# Error 1: Application backlog exceeds somaxconn
# Symptom: netstat -s shows "listen queue of a socket overflowed"
# Fix: increase net.core.somaxconn to 65535
# Error 2: Nginx/Java backlog not adjusted
# Nginx config: listen 80 backlog=65535;
# Tomcat: <Connector port="8080" acceptCount="8192"/>
# Fix: synchronize application-level backlog with kernel settings

1.2 TCP Buffer Optimization
# TCP receive buffer (auto‑tuned)
net.ipv4.tcp_rmem = 4096 87380 16777216 # min default max (bytes)
# TCP send buffer
net.ipv4.tcp_wmem = 4096 65536 16777216
# Core buffer limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144 # 256 KB default receive buffer
net.core.wmem_default = 262144 # 256 KB default send buffer
# TCP memory management (unit: pages, 1 page = 4 KB)
net.ipv4.tcp_mem = 786432 1048576 1572864 # low water, pressure, high water
# Explanation:
# - Below low water: no limit
# - At pressure: start limiting new connections
# - At high water: reject new connections

Key explanations: tcp_rmem / tcp_wmem each take three values (min, default, max); the kernel auto-tunes within this range. tcp_mem is the global TCP memory limit shared by all connections, expressed in pages.
High-water example for 64 GB RAM: target the pressure threshold at roughly 10% of memory expressed in 4 KB pages, i.e. 64 × 0.1 × 262144 ≈ 1.6 M pages (1 GB = 262144 pages).
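A small helper can derive tcp_mem from the machine's actual RAM; this is a sketch of the rule of thumb above (pressure ≈ 10% of RAM, low ≈ half, high ≈ 1.5×), not an official formula:

# MemTotal is reported in kB; dividing by 4 yields 4 KB pages
PAGES=$(awk '/MemTotal/ {printf "%d", $2/4}' /proc/meminfo)
PRESSURE=$((PAGES / 10))
echo "net.ipv4.tcp_mem = $((PRESSURE / 2)) ${PRESSURE} $((PRESSURE * 3 / 2))"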
Post‑validation:
# View current TCP memory usage
cat /proc/net/sockstat # Example: TCP: inuse 12000 orphan 50 tw 1900 alloc 15000 mem 120000
# View per‑connection buffer sizes
ss -tm state established '( dport = :80 )'
# Sample output shows Recv-Q and Send-Q values and buffer sizes

1.3 TIME_WAIT State Optimization
# Orphaned-connection FIN-WAIT-2 timeout (default 60 s; note the TIME_WAIT
# duration itself is a 60 s compile-time constant, not tunable here)
net.ipv4.tcp_fin_timeout = 30
# Allow reuse of TIME_WAIT sockets (client side only)
net.ipv4.tcp_tw_reuse = 1
# Fast recycle (dangerous, removed in kernel 4.12+)
# net.ipv4.tcp_tw_recycle = 1
# TCP keepalive settings
net.ipv4.tcp_keepalive_time = 600 # start probing after 10 min idle
net.ipv4.tcp_keepalive_intvl = 10 # probe interval 10 s
net.ipv4.tcp_keepalive_probes = 3 # abort after 3 failed probes
# TCP connection timeout retries
net.ipv4.tcp_syn_retries = 2 # client SYN retries
net.ipv4.tcp_synack_retries = 2 # server SYN-ACK retries

Key explanations: tcp_fin_timeout limits how long orphaned connections linger in FIN-WAIT-2; 15-30 s is common for high concurrency. tcp_tw_reuse lets new outgoing connections reuse TIME_WAIT sockets (client side only, and only with TCP timestamps enabled).
Warning : tcp_tw_recycle breaks NAT environments and is removed in modern kernels.
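Since tcp_tw_reuse depends on TCP timestamps being enabled (noted above), it is worth verifying both switches together:

sysctl net.ipv4.tcp_timestamps # must be 1 for tcp_tw_reuse to take effect
sysctl net.ipv4.tcp_tw_reuse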
Post‑validation:
# Count TIME_WAIT sockets
ss -ant | grep TIME_WAIT | wc -l
# Continuous monitoring
watch -n 1 "ss -ant | grep TIME_WAIT | wc -l"
# View details of TIME_WAIT sockets
ss -tan state time-wait | head -20

1.4 TCP Congestion Control Algorithm
# Enable BBR congestion control (kernel 4.9+)
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq # fq qdisc: mandatory for BBR before kernel 4.13, still recommended after
# Optional: list supported algorithms
# cat /proc/sys/net/ipv4/tcp_available_congestion_control # output: reno cubic bbr
# Disable slow start after idle (good for long‑lived connections)
net.ipv4.tcp_slow_start_after_idle = 0
# Do not save metrics for closed connections
net.ipv4.tcp_no_metrics_save = 1

Key explanations:
BBR (Bottleneck Bandwidth and RTT): Google's algorithm; on high-latency or lossy paths it can deliver 2-10× the throughput of CUBIC. default_qdisc = fq: the Fair Queue scheduler provides the pacing BBR relies on (required before kernel 4.13). tcp_slow_start_after_idle = 0: disables slow start after idle, suitable for long-lived connections (e.g., WebSocket).
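If bbr does not appear in tcp_available_congestion_control, the module may simply not be loaded; loading it now and persisting it across reboots follows the standard modules-load.d convention:

modprobe tcp_bbr
echo "tcp_bbr" > /etc/modules-load.d/bbr.conf # load at boot
lsmod | grep tcp_bbr # confirm the module is present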
Pre‑validation:
# Check kernel support for BBR
grep -i bbr /proc/sys/net/ipv4/tcp_available_congestion_control # expected: reno cubic bbr
# Current congestion algorithm
sysctl net.ipv4.tcp_congestion_control # expected: cubic (default)
# Kernel version (needs 4.9+)
uname -r # expected: 5.15.0+

Post-validation:
# Apply configuration
sysctl -p
# Verify BBR is active
sysctl net.ipv4.tcp_congestion_control # expected: bbr [tested]
# View BBR statistics (kernel 5.0+)
ss -ti | grep bbr
# Performance comparison (iperf3)
# BBR vs CUBIC:
# CUBIC: 500 Mbps
# BBR: 800 Mbps (≈60% improvement)
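The comparison above is straightforward to reproduce; a minimal A/B sketch against an iperf3 server (the server address is a placeholder):

#!/bin/bash
# Assumes "iperf3 -s" is already running on $SERVER
SERVER=${1:-192.168.1.10}
for ALGO in cubic bbr; do
    sysctl -w net.ipv4.tcp_congestion_control=$ALGO
    echo "== $ALGO =="
    iperf3 -c "$SERVER" -t 30 | tail -3 # summary lines only
done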
1.5 Other Network Parameters

# Enable TCP Fast Open (kernel 3.7+)
net.ipv4.tcp_fastopen = 3 # 1=client, 2=server, 3=both
# Enable TCP timestamps (required by tcp_tw_reuse; also used for RTT measurement and PAWS)
net.ipv4.tcp_timestamps = 1
# Enable TCP SACK (selective acknowledgments)
net.ipv4.tcp_sack = 1
# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1
# Max TCP orphan sockets
net.ipv4.tcp_max_orphans = 262144
# Local port range for client connections
net.ipv4.ip_local_port_range = 1024 65535
# Disable ICMP redirects (security)
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
# Enable reverse path filtering (anti‑IP spoofing)
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1

Step 2: File Descriptor Optimization
Goal: Support million‑level concurrent connections (each connection consumes one file descriptor).
2.1 System‑level Limits
# /etc/sysctl.conf
fs.file-max = 2097152 # System-wide maximum file descriptors (2 M)
# View current usage
cat /proc/sys/fs/file-nr # format: allocated used max
# Example output: 12000 10000 2097152

2.2 Process-level Limits
Configuration file: /etc/security/limits.conf

# All users
* soft nofile 1048576
* hard nofile 1048576
# Specific user (e.g., nginx)
nginx soft nofile 1048576
nginx hard nofile 1048576
# Root user
root soft nofile 1048576
root hard nofile 1048576
# Other resources
* soft nproc 65535 # max processes
* hard nproc 65535
* soft stack 10240 # stack size (KB)
* hard stack 10240

Immediate effect (current session):
ulimit -n 1048576
ulimit -u 65535

Permanent effect (requires re-login or service restart):
# Systemd service limits (example for nginx)
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=1048576
LimitNPROC=65535
EOF
systemctl daemon-reload
systemctl restart nginx

Post-validation:
# Verify system‑wide limit
cat /proc/sys/fs/file-max # expected: 2097152 [tested]
# Verify process‑level limit
ulimit -n # expected: 1048576
# Verify specific process (nginx) limits
cat /proc/$(pgrep nginx | head -1)/limits | grep "open files"
# Expected output: Max open files 1048576 1048576 files
# Current file descriptor usage
lsof | wc -l # note: lsof lists duplicate entries per process, so this overestimates
# Or
cat /proc/sys/fs/file-nr | awk '{print $1-$2}' # descriptors currently in use (allocated minus free)

Step 3: Memory Parameter Optimization
Goal: Optimize memory management, reduce swap usage, improve cache efficiency.
# /etc/sysctl.conf
# Swap usage strategy (0‑100, lower means less swap)
vm.swappiness = 10 # 0 = only when out of memory, 10 = recommended for DB/cache servers
# Dirty page flushing
vm.dirty_ratio = 20 # flush when dirty pages reach 20% of memory
vm.dirty_background_ratio = 10 # background flush at 10%
vm.dirty_expire_centisecs = 3000 # dirty page expiration time 30 s
vm.dirty_writeback_centisecs = 500 # background writeback interval 5 s
# Virtual memory behavior
vm.overcommit_memory = 1 # allow memory overcommit (required by Redis)
vm.overcommit_ratio = 50 # overcommit ratio 50%
# Transparent Huge Pages (disable for databases)
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Minimum free memory (KB)
vm.min_free_kbytes = 262144 # reserve 256 MB free memory

Key explanations: vm.swappiness controls the kernel's tendency to use swap; database/cache servers usually set 1-10. vm.dirty_ratio: when dirty pages exceed this percentage of memory, writing processes block until data is flushed. vm.overcommit_memory = 1 allows processes to allocate more virtual memory than physical RAM (needed by Redis fork).
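The commented-out THP echo commands above do not survive a reboot; one way to persist them is a oneshot unit in the same style as the NIC-tuning service later in this guide (the unit name is an assumption):

# /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

Enable it with systemctl daemon-reload && systemctl enable --now disable-thp.service.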
Post‑validation:
# Check current swap usage
free -h
# Example output shows swap usage near 0 B
# Check dirty page ratio
cat /proc/vmstat | grep dirty # e.g., nr_dirty 1200
# Verify Transparent HugePage status
cat /sys/kernel/mm/transparent_hugepage/enabled # expected: [never] for DB workloads

Step 4: Full Configuration File Example
File path: /etc/sysctl.d/99-high-performance.conf

# ============================
# High‑performance web server kernel tuning
# Applicable scenario: million concurrent connections
# Tested on: Ubuntu 22.04 / Kernel 5.15+
# ============================
# ===== Network parameters =====
# TCP connection queues
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 16384
# TCP buffers
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# TCP memory management (64 GB environment)
net.ipv4.tcp_mem = 786432 1048576 1572864
# TIME_WAIT optimization
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 55000
# TCP keepalive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
# TCP connection timeout
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
# TCP congestion control (BBR)
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# Other TCP optimizations
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_max_orphans = 262144
# Local port range
net.ipv4.ip_local_port_range = 1024 65535
# Security parameters
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_syncookies = 1
# ===== File descriptors =====
fs.file-max = 2097152
# ===== Memory management =====
vm.swappiness = 10
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
vm.min_free_kbytes = 262144
# ===== Other kernel parameters =====
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536

Apply configuration:
# Load the new sysctl file
sysctl -p /etc/sysctl.d/99-high-performance.conf
# Verify key parameters
sysctl -a | grep -E "(somaxconn|tcp_rmem|bbr|file-max|swappiness)"
# Persist across reboots (already in /etc/sysctl.d)

Step 5: NIC Parameter Optimization
Goal: Optimize NIC driver parameters to increase network throughput.
# View NIC model and driver
ethtool -i eth0 # example output: driver: igb, version: 5.4.0
# View current NIC parameters
ethtool -g eth0 # ring buffer sizes
ethtool -k eth0 # offload features
ethtool -S eth0 # statistics
# Optimize ring buffer (receive queue)
ethtool -G eth0 rx 4096 tx 4096
# Enable offload features (hardware acceleration)
ethtool -K eth0 tso on # TCP Segmentation Offload
ethtool -K eth0 gso on # Generic Segmentation Offload
ethtool -K eth0 gro on # Generic Receive Offload
ethtool -K eth0 sg on # Scatter‑Gather
ethtool -K eth0 rx-checksumming on
ethtool -K eth0 tx-checksumming on
# Adjust interrupt coalescing
ethtool -C eth0 rx-usecs 50 tx-usecs 50
# Enable multi‑queue (RSS/RPS)
ethtool -l eth0 # view queue count
ethtool -L eth0 combined 8 # set 8 queues (if hardware supports)
# Bind interrupts to specific CPUs (IRQ affinity)
#!/bin/bash
# Spread eth0 IRQs across CPUs round-robin; stop irqbalance first
# (systemctl stop irqbalance) or it will overwrite these settings
CPUS=$(nproc)
I=0
for IRQ in $(grep eth0 /proc/interrupts | awk '{print $1}' | sed 's/://'); do
    echo $((I % CPUS)) > /proc/irq/${IRQ}/smp_affinity_list
    I=$((I + 1))
done

Persist NIC tuning (systemd service):
# /etc/systemd/system/network-tuning.service
[Unit]
Description=Network Performance Tuning
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -G eth0 rx 4096 tx 4096
ExecStart=/usr/sbin/ethtool -K eth0 tso on gso on gro on
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

Minimal Required Principles
Core mechanisms:
Application Layer → Socket API → Transport Layer (TCP/UDP) → Kernel protocol stack → NIC driver → Physical NIC
(the transport layer and kernel stack are the focus of this guide)

Key optimization points:
TCP connection queues (half-open SYN queue and fully-established accept queue).
TCP buffers (receive and send windows).
TIME_WAIT state (resource consumption).
BBR congestion control (higher throughput vs CUBIC).
Why does million-connection concurrency need these parameters? (A quick sanity check follows this list.)
1 M connections ≈ 1 FD + ~3 KB kernel memory each → ~3 GB total.
100 k QPS → 100 k accept(), read()/write(), close() per second → requires large queues and buffers.
10 GbE NIC → theoretical 10 Gbps = 1.25 GB/s → ~830 k packets per second → NIC queues and interrupt handling must be tuned.
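These figures are easy to verify from the shell (integer arithmetic, decimal units):

CONNS=1000000
echo "Socket memory : ~$((CONNS * 3 / 1000000)) GB at ~3 KB per connection"
echo "10 GbE rate   : $((10000000000 / 8 / 1500)) packets/s at 1500 B MTU"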
Observability (Monitoring + Alerts + Performance)
Monitoring Metrics
Key system metrics:
# 1. Network connection statistics
ss -s # example: TCP: 850000 (estab 800000, closed 40000, orphaned 100, timewait 39000)
# 2. File descriptor usage
cat /proc/sys/fs/file-nr | awk '{printf "Usage: %.2f%%\n", ($1-$2)/$3*100}'
# 3. TCP queue overflow
netstat -s | grep -i overflow # key: "times the listen queue of a socket overflowed"
# 4. TCP retransmission rate
netstat -s | grep -i retrans # key: "segments retransmitted"
# 5. Memory usage
free -h
cat /proc/meminfo | grep -E "(MemTotal|MemFree|Cached|SwapTotal|SwapFree)"
# 6. NIC traffic
sar -n DEV 1 10 # sample every second, 10 times

Prometheus monitoring (node_exporter):
# Key metrics
node_netstat_Tcp_CurrEstab # current TCP connections
node_sockstat_TCP_tw # TIME_WAIT count
node_filefd_allocated / node_filefd_maximum * 100 # fd usage %
rate(node_network_receive_bytes_total[1m]) # inbound throughput
rate(node_network_transmit_bytes_total[1m]) # outbound throughput
rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) * 100 # retransmission %
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 # memory usage %

Prometheus alert rules (example):
groups:
  - name: system_performance
    interval: 30s
    rules:
      # Alert 1: Too many TCP connections
      - alert: HighTCPConnections
        expr: node_netstat_Tcp_CurrEstab > 500000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TCP connections ({{ $labels.instance }})"
          description: "Current TCP connections {{ $value }}, exceeds 500k"
      # Alert 2: Excessive TIME_WAIT
      - alert: HighTimeWaitConnections
        expr: node_sockstat_TCP_tw > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TIME_WAIT connections ({{ $labels.instance }})"
          description: "Current TIME_WAIT connections {{ $value }}"
      # Alert 3: File descriptor usage high
      - alert: HighFileDescriptorUsage
        expr: (node_filefd_allocated / node_filefd_maximum) * 100 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "File descriptor usage high ({{ $labels.instance }})"
          description: "Current usage {{ $value }}%"
      # Alert 4: TCP retransmission rate high
      - alert: HighTCPRetransmissionRate
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TCP retransmission rate ({{ $labels.instance }})"
          description: "Current retransmission {{ $value }}%"
      # Alert 5: Swap usage high
      - alert: HighSwapUsage
        expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High swap usage ({{ $labels.instance }})"
          description: "Current swap usage {{ $value }}%"

Performance Benchmarking
Tool: wrk (HTTP load testing)
# Install wrk
git clone https://github.com/wg/wrk.git
cd wrk && make
cp wrk /usr/local/bin/
# Test 1: Short‑connection performance
wrk -t10 -c10000 -d30s --latency http://localhost/
# Expected output after tuning (example):
# Running 30s test @ http://localhost/
# 10 threads and 10000 connections
# Thread Stats Avg Stdev Max +/- Stdev
# Latency 50.12ms 10.23ms 200.00ms 75.23%
# Req/Sec 20.50k 2.10k 30.00k 85.67%
# 6150000 requests in 30.00s, 5.12GB read
# Requests/sec: 205000.00
# Transfer/sec: 175.00MB
# Test 2: Long‑connection performance
wrk -t10 -c10000 -d30s --latency -H "Connection: keep-alive" http://localhost/
# Test 3: Custom POST request (Lua script)
cat > post.lua <<'EOF'
wrk.method = "POST"
wrk.body = '{"key":"value"}'
wrk.headers["Content-Type"] = "application/json"
EOF
wrk -t10 -c1000 -d30s -s post.lua http://localhost/api/test

Tool: ab (Apache Bench)
# Install ab
apt install -y apache2-utils # Ubuntu
yum install -y httpd-tools # RHEL
# Test
ab -n 1000000 -c 10000 -k http://localhost/
# Expected output after tuning (example):
# Requests per second: 150000.00 [#/sec] (mean)
# Time per request: 66.667 ms (mean)
# Time per request: 0.007 ms (mean, across all concurrent requests)

Tool: iperf3 (network bandwidth test)
# Server
iperf3 -s
# Client (test TCP throughput)
iperf3 -c 192.168.1.10 -t 60 -P 10
# Expected output with 10 GbE + BBR: ~9.5 Gbps

Performance Comparison (Before vs After)

| Metric | Before Optimization | After Optimization | Improvement |
| --- | --- | --- | --- |
| QPS | 50k | 200k | 4× |
| Max concurrent connections | 100k | 1M | 10× |
| P99 latency | 200 ms | 90 ms | -55% |
| Network throughput (BBR vs CUBIC) | 6 Gbps | 9.5 Gbps | +58% |
| TIME_WAIT connections | 50k | 10k | -80% |
Common Faults & Troubleshooting
| Symptom | Diagnostic Command | Possible Root Cause | Quick Fix | Permanent Fix |
| --- | --- | --- | --- | --- |
| Connection refused | netstat -s \| grep overflow | Full-connection queue full | sysctl net.core.somaxconn=65535 (temporary) | Modify /etc/sysctl.conf and adjust application backlog |
| Performance drop | sar -n DEV 1 10 | 1) NIC interrupt imbalance 2) Excessive TCP retransmits | 1) Adjust IRQ affinity 2) Check network quality | 1) Enable NIC multi-queue 2) Enable BBR |
| Port exhaustion | ss -tan \| grep TIME_WAIT \| wc -l | Too many TIME_WAIT sockets | sysctl net.ipv4.tcp_fin_timeout=15 | Optimize the application (use connection pooling) |
| File descriptor shortage | lsof \| wc -l | ulimit too low | ulimit -n 1048576 | Edit /etc/security/limits.conf |
| High swap usage | free -h | Insufficient memory / high swappiness | swapoff -a && swapon -a (temporary) | 1) Add RAM 2) Set vm.swappiness=1 |
| BBR not effective | ss -ti \| grep bbr | Kernel lacks support / module not loaded | modprobe tcp_bbr | Upgrade to kernel 4.9+ and ensure tcp_bbr is loaded at boot |
Diagnostic command collection:
# 1. Overall performance overview
vmstat 1 10
# 2. Network connection status
ss -tan | awk '{print $1}' | sort | uniq -c
# 3. Detailed TCP stats
netstat -s | grep -E "(overflow|retrans|loss|reset)"
# 4. NIC statistics
sar -n DEV,EDEV 1 10
# 5. Top 10 processes by file descriptors
lsof | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# 6. Top 10 IPs by TCP connections
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
# 7. Real‑time TCP connection count
watch -n 1 'ss -s'
# 8. Kernel packet drops
dmesg | grep -i "drop"

Change & Rollback Playbook
Gray‑scale Deployment Strategy
Scenario: Apply kernel parameter tuning to production.
# Phase 1: Single‑machine test (1 server)
# 1. Backup current config
sysctl -a > /tmp/sysctl_backup_$(hostname)_$(date +%Y%m%d).conf
# 2. Apply new config
sysctl -p /etc/sysctl.d/99-high-performance.conf
# 3. Performance test (wrk 30 min)
wrk -t10 -c10000 -d1800s --latency http://localhost/ > /tmp/wrk_test_$(date +%Y%m%d_%H%M).log
# 4. Monitor key metrics
watch -n 5 'ss -s; netstat -s | grep overflow; free -h'
# 5. Observe 24 h, ensure no anomalies
# Phase 2: Small‑scale rollout (10% of servers)
# Using Ansible for batch deployment
ansible-playbook -i inventory.ini deploy_sysctl.yml --limit "web_servers[0:10]"
# Phase 3: Full rollout (100% of servers)
ansible-playbook -i inventory.ini deploy_sysctl.yml

Rollback Conditions & Commands
Rollback triggers:
QPS drops > 20%.
Error rate rises > 5%.
Connection failure rate increases.
System logs show many errors. (A sketch of an automated guard based on these triggers follows this list.)
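A minimal sketch of such a guard, assuming a /health endpoint on the service (the endpoint, probe count, and threshold are all assumptions to adapt):

#!/bin/bash
# Probe the service 100 times; roll back if more than 5% of probes fail
ERRORS=0
for i in $(seq 1 100); do
    curl -fsS -o /dev/null --max-time 2 http://localhost/health || ERRORS=$((ERRORS + 1))
done
if [ "$ERRORS" -gt 5 ]; then
    sysctl -p "/tmp/sysctl_backup_$(hostname)_20250115.conf"
    echo "Rollback at $(date): ${ERRORS} failed probes" >> /var/log/sysctl_rollback.log
fi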
Rollback steps:
# 1. Immediately restore original config
sysctl -p /tmp/sysctl_backup_$(hostname)_20250115.conf
# 2. Verify rollback result
sysctl -a | grep -E "(somaxconn|tcp_rmem|bbr)" > /tmp/sysctl_after_rollback.conf
diff /tmp/sysctl_backup_$(hostname)_20250115.conf /tmp/sysctl_after_rollback.conf
# 3. Restart affected services (e.g., nginx)
systemctl restart nginx
# 4. Verify service health
curl -I http://localhost/
wrk -t2 -c100 -d10s http://localhost/
# 5. Record rollback reason
echo "Rollback time: $(date)" >> /var/log/sysctl_rollback.log
echo "Rollback reason: [fill in]" >> /var/log/sysctl_rollback.log
dmesg | tail -100 >> /var/log/sysctl_rollback.log

Best Practices
Stage‑wise optimization (easy → hard):
Phase 1: TCP connection queue + file descriptors (quick wins).
Phase 2: TCP buffers + TIME_WAIT (significant boost).
Phase 3: BBR + NIC tuning (fine‑tuning).
Parameter calculation formulas :
# Full‑connection queue
somaxconn = expected_max_concurrency / 10
# TCP memory (pages)
# high water = total_memory_GB * 0.1 * 256
# File descriptors
fs.file-max = expected_max_connections * 2
# TIME_WAIT timeout
tcp_fin_timeout = 15-30 s (short-lived connections) / 30-60 s (long-lived connections)
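The sizing rules above can be wrapped in a small helper; note that somaxconn values above 65535 are accepted on recent kernels (older kernels truncated them), though this guide standardizes on 65535. A sketch, not a sizing authority:

#!/bin/bash
CONNS=${1:-1000000} # expected max concurrent connections
echo "net.core.somaxconn ≈ $((CONNS / 10))"
echo "fs.file-max = $((CONNS * 2))"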
Monitoring priority :
P0 (must monitor): TCP connections, fd usage, queue overflow.
P1 (important): TIME_WAIT count, TCP retransmission, network throughput.
P2 (reference): Swap usage, dirty page ratio, NIC errors.
Avoid common pitfalls :
❌ Blindly increase all parameters – may exhaust memory.
❌ Ignore application‑level settings (e.g., Nginx backlog).
❌ Overlook hardware limits – 10 GbE required for BBR gains.
❌ Deploy to production without gray‑scale testing.
Container considerations :
Containers share the host kernel; global sysctls must be applied on the host.
Network-namespaced parameters (including net.core.somaxconn) can, however, be set per container through the runtime; global ones (fs.file-max, vm.*) cannot (see the Docker sketch after this list).
cgroup limits (CPU/memory) should be tuned together.
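For the namespaced parameters, container runtimes expose per-container switches; a Docker example (image and values are illustrative):

docker run --sysctl net.core.somaxconn=65535 \
           --ulimit nofile=1048576:1048576 \
           nginx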
Regular maintenance :
Monthly verify parameter effectiveness ( sysctl -a).
Quarterly benchmark performance.
Re‑validate after kernel upgrades.
FAQ
Q1: Why does the listen queue stay full after increasing somaxconn ? A: The application’s listen() backlog must also be increased (e.g., Nginx listen 80 backlog=65535;).
Q2: What’s the difference between BBR and CUBIC? A: CUBIC relies on packet loss to detect congestion; BBR uses measured bandwidth and RTT, delivering 2‑10× higher throughput in high‑latency or lossy networks.
Q3: Difference between tcp_tw_reuse and tcp_tw_recycle ? A: tw_reuse safely reuses TIME_WAIT sockets on the client side; tw_recycle breaks NAT environments and has been removed since kernel 4.12.
Q4: How to decide if BBR should be enabled? A: Enable for cross‑region, mobile, or high‑latency networks; in low‑latency data‑center LAN the benefit is modest. Verify with iperf3 comparisons.
Q5: How to set tcp_mem on a 64 GB system? A: Target the pressure threshold at roughly 10% of RAM in 4 KB pages (1 GB = 262144 pages): 64 × 0.1 × 262144 ≈ 1.6 M pages; set low ≈ half and high ≈ 1.5× of that, e.g. net.ipv4.tcp_mem = 819200 1638400 2457600.
Q6: Can these parameters be changed inside containers? A: Network-namespaced sysctls (net.ipv4.*, and some net.core.* such as somaxconn) can be set per container via the runtime (e.g., docker run --sysctl); global parameters (fs.file-max, vm.*) must be adjusted on the host. Application-level limits (e.g., ulimit) can also be set per container.
Q7: Is setting vm.swappiness=0 safe in production? A: Not recommended. Values of 0 can trigger aggressive OOM killing; use 1‑10 instead.
Q8: Do TIME_WAIT sockets consume ports? A: Yes, they occupy the full 4‑tuple and can exhaust client ports under heavy short‑connection loads.
Q9: How to verify BBR is really active? A: Check with ss -ti | grep bbr, ensure the tcp_bbr module is loaded ( lsmod | grep tcp_bbr), and run iperf3 tests comparing BBR vs CUBIC.
Q10: How much memory is needed for a million concurrent connections? A: Roughly 3 KB per TCP connection → ~3 GB total; recommend 32 GB+ to leave headroom for the OS and applications.
Extended Reading
Official documentation:
Linux kernel docs: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
TCP BBR paper: https://research.google/pubs/pub45646/
Red Hat performance tuning guide: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/
In‑depth technical blogs:
Cloudflare on NGINX and HTTP/2 prioritization: https://blog.cloudflare.com/http-2-prioritization-with-nginx/
Linux network stack source analysis: https://github.com/torvalds/linux/tree/master/net/ipv4
Tools & resources:
Linux performance analysis toolkit: https://www.brendangregg.com/linuxperf.html
Sysctl quick reference: https://sysctl-explorer.net/
Generation time: 2025-01-15 Article version: v1.0 Validation environment: Ubuntu 22.04 + Kernel 5.15 / RHEL 9.3 + Kernel 5.14