Unlock Million-Connection Web Servers: Essential Linux sysctl Tuning Guide
This comprehensive guide explains how to optimize Linux kernel parameters with sysctl for high‑traffic web services, covering prerequisite hardware, network and memory settings, step‑by‑step configuration, verification, common pitfalls, monitoring, and rollback procedures to achieve stable million‑connection performance.
Applicable Scenarios & Prerequisites

| Item | Requirement |
| --- | --- |
| Applicable scenario | Daily PV 10M+ / QPS 100k+ high-concurrency web services, API gateways, load balancers |
| OS | RHEL/CentOS 8.5+ or Ubuntu 20.04+ |
| Kernel | Linux kernel 5.10+ (recommended 5.15+ / 6.1+) |
| Hardware specs | Minimum: 16C 32G / Recommended: 32C 64G / 10 GbE NIC |
| Network | 10 GbE NIC, switch supporting jumbo frames (MTU 9000) |
| Permissions | Root (to modify kernel parameters) |
| Skill requirements | Familiarity with the TCP/IP stack, Linux kernel parameters, and network programming |
Anti‑Pattern Warnings (When Not Applicable)
⚠️ The following scenarios are not recommended for this solution:
Low‑traffic applications : QPS < 1k, default kernel parameters are sufficient; over‑optimizing adds maintenance cost.
Desktop systems / development environments : Optimizations target servers; desktop systems may experience latency.
Container environments (partial parameters) : Docker/Kubernetes control some parameters from the host; they cannot be changed inside containers.
Non‑persistent‑connection scenarios : Short‑lived connections (e.g., crawlers) need different strategies (fast TIME_WAIT reclamation).
Specific hardware environments : Some parameters depend on hardware features (e.g., TCP offload) and may not apply to virtualized environments.
Alternative Solutions Comparison
Scenario
Recommended Solution
Reason
Container environment
Host optimization + cgroup limits
Containers share the host kernel.
Low‑latency scenario
DPDK / XDP kernel bypass
Bypasses kernel stack, latency < 10 µs.
Low‑traffic scenario
Default parameters + application‑level tuning
Optimize code before kernel.
UDP scenario
Dedicated UDP tuning parameters
This article focuses on TCP.
Environment & Version Matrix

| Component | RHEL/CentOS | Ubuntu/Debian | Test Status |
| --- | --- | --- | --- |
| OS version | RHEL 9.3 / CentOS Stream 9 | Ubuntu 22.04 LTS | [Tested] |
| Kernel version | 5.14.0-362+ | 5.15.0-92+ | [Tested] |
| Recommended kernel | 6.1+ (see kernel notes below) | 6.1+ | [Theoretical] |
| Hardware (minimum) | 16C 32G / dual 10 GbE NIC | 16C 32G / dual 10 GbE NIC | - |
| Hardware (recommended) | 32C 64G / dual 10 GbE NIC / NVMe SSD | 32C 64G / dual 10 GbE NIC / NVMe SSD | - |
Kernel version differences:
Kernel 5.10 vs 5.15: 5.15 carries incremental TCP/BBR improvements.
Kernel 5.15 vs 6.1: 6.1 brings further io_uring and cgroup v2 improvements (note that BBR v2 is not in the mainline kernel; it is maintained in Google's out-of-tree branch).
Kernel 4.x: early 4.x kernels lack BBR (added in 4.9) and other modern TCP features; not recommended for high-concurrency workloads.
Reading Navigation
Suggested reading path:
Quick start (30 min): Quick checklist → Steps 1‑3 (network parameters) → Validation tests.
Deep dive (90 min): Minimal principles → Full step‑by‑step → Benchmark tests → Best practices.
Troubleshooting: Common issues → Debugging ideas → FAQ.
Quick Checklist
Preparation
- Backup current kernel parameters: sysctl -a > /tmp/sysctl_backup.conf
- Check kernel version: uname -r (needs >= 5.10)
- Check NIC model and driver: ethtool -i eth0
- Check current connections: ss -s

Network Parameter Optimization
- TCP connection queue: net.core.somaxconn
- TCP buffers: net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
- TIME_WAIT tuning: net.ipv4.tcp_fin_timeout, net.ipv4.tcp_tw_reuse
- TCP congestion control: net.ipv4.tcp_congestion_control = bbr

File Descriptor Optimization
- System limit: fs.file-max = 2097152
- Per-user limits in /etc/security/limits.conf

Memory Parameter Optimization
- Swap tendency: vm.swappiness = 10
- Dirty page thresholds: vm.dirty_ratio, vm.dirty_background_ratio
- Overcommit: vm.overcommit_memory = 1

Verification & Testing
- Apply configuration: sysctl -p
- Performance benchmark (wrk/ab)
- Monitor key metrics (netstat, ss, free)
Implementation Steps
Step 1: Network Parameter Optimization (Core)
Goal: Optimize TCP stack to support million‑level concurrent connections.
1.1 TCP Connection Queue Optimization
Configuration file: /etc/sysctl.conf or /etc/sysctl.d/99-custom.conf

# TCP connection queue (most critical parameters)
net.core.somaxconn = 65535 # listen() backlog upper limit
net.ipv4.tcp_max_syn_backlog = 8192 # SYN queue length (half-open)
net.core.netdev_max_backlog = 16384 # NIC receive queue length
# Explanation:
# - somaxconn: maximum backlog value for listen()
# - tcp_max_syn_backlog: half-open queue, prevents SYN flood
# - netdev_max_backlog: driver-to-stack queue size

Key parameter explanations: net.core.somaxconn is the upper limit of the full-connection (accept) queue; applications cannot set a backlog larger than this. net.ipv4.tcp_max_syn_backlog is the upper limit of the half-open (SYN-RECV) queue.
Actual full‑connection queue size = min(backlog, somaxconn).
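To see both values at runtime, ss can display the accept queue of a listening socket directly; a quick check (port 80 is an illustrative assumption):

# For LISTEN sockets, Send-Q shows the effective backlog (min(backlog, somaxconn))
# and Recv-Q shows how many connections are currently waiting to be accepted
ss -ltn sport = :80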
Pre‑validation:
# View current values
sysctl net.core.somaxconn
sysctl net.ipv4.tcp_max_syn_backlog
# Expected output: 128 / 512 (default values are too low)

Post-validation:
# Apply configuration
sysctl -p
# Verify effect
sysctl net.core.somaxconn # Expected: 65535 [tested]
# Load test (wrk)
wrk -t10 -c10000 -d30s http://localhost/
# Observe whether any connections are refused

Common errors:
# Error 1: Application backlog exceeds somaxconn
# Symptom: netstat -s shows "listen queue of a socket overflowed"
# Fix: increase net.core.somaxconn to 65535
# Error 2: Nginx/Java backlog not adjusted
# Nginx config: listen 80 backlog=65535;
# Tomcat: <Connector port="8080" acceptCount="8192"/>
# Fix: synchronize application-level backlog with kernel settings

1.2 TCP Buffer Optimization
# TCP receive buffer (auto‑tuned)
net.ipv4.tcp_rmem = 4096 87380 16777216 # min default max (bytes)
# TCP send buffer
net.ipv4.tcp_wmem = 4096 65536 16777216
# Core buffer limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144 # 256 KB default receive buffer
net.core.wmem_default = 262144 # 256 KB default send buffer
# TCP memory management (unit: pages, 1 page = 4 KB)
net.ipv4.tcp_mem = 786432 1048576 1572864 # low water, pressure, high water
# Explanation:
# - Below low water: no limit
# - At pressure: start limiting new connections
# - At high water: reject new connections

Key explanations: tcp_rmem / tcp_wmem each take three values (min, default, max); the kernel auto-tunes within this range. tcp_mem is the global TCP memory limit shared by all connections, expressed in pages.
High-water example for 64 GB RAM: target the pressure threshold at roughly 10% of memory expressed in 4 KB pages, i.e. 64 × 0.1 × 262144 ≈ 1.6 M pages (1 GB = 262144 pages).
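A small helper can derive tcp_mem from the machine's actual RAM; this is a sketch of the rule of thumb above (pressure ≈ 10% of RAM, low ≈ half, high ≈ 1.5×), not an official formula:

# MemTotal is reported in kB; dividing by 4 yields 4 KB pages
PAGES=$(awk '/MemTotal/ {printf "%d", $2/4}' /proc/meminfo)
PRESSURE=$((PAGES / 10))
echo "net.ipv4.tcp_mem = $((PRESSURE / 2)) ${PRESSURE} $((PRESSURE * 3 / 2))"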
Post‑validation:
# View current TCP memory usage
cat /proc/net/sockstat # Example: TCP: inuse 12000 orphan 50 tw 1900 alloc 15000 mem 120000
# View per‑connection buffer sizes
ss -tm state established '( dport = :80 )'
# Sample output shows Recv-Q and Send-Q values and buffer sizes

1.3 TIME_WAIT State Optimization
# Orphaned-connection FIN-WAIT-2 timeout (default 60 s; note the TIME_WAIT
# duration itself is a 60 s compile-time constant, not tunable here)
net.ipv4.tcp_fin_timeout = 30
# Allow reuse of TIME_WAIT sockets (client side only)
net.ipv4.tcp_tw_reuse = 1
# Fast recycle (dangerous, removed in kernel 4.12+)
# net.ipv4.tcp_tw_recycle = 1
# TCP keepalive settings
net.ipv4.tcp_keepalive_time = 600 # start probing after 10 min idle
net.ipv4.tcp_keepalive_intvl = 10 # probe interval 10 s
net.ipv4.tcp_keepalive_probes = 3 # abort after 3 failed probes
# TCP connection timeout retries
net.ipv4.tcp_syn_retries = 2 # client SYN retries
net.ipv4.tcp_synack_retries = 2 # server SYN-ACK retries

Key explanations: tcp_fin_timeout limits how long orphaned connections linger in FIN-WAIT-2; 15-30 s is common for high concurrency. tcp_tw_reuse lets new outgoing connections reuse TIME_WAIT sockets (client side only, and only with TCP timestamps enabled).
Warning : tcp_tw_recycle breaks NAT environments and is removed in modern kernels.
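Since tcp_tw_reuse depends on TCP timestamps being enabled (noted above), it is worth verifying both switches together:

sysctl net.ipv4.tcp_timestamps # must be 1 for tcp_tw_reuse to take effect
sysctl net.ipv4.tcp_tw_reuse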
Post‑validation:
# Count TIME_WAIT sockets
ss -ant | grep TIME_WAIT | wc -l
# Continuous monitoring
watch -n 1 "ss -ant | grep TIME_WAIT | wc -l"
# View details of TIME_WAIT sockets
ss -tan state time-wait | head -20

1.4 TCP Congestion Control Algorithm
# Enable BBR congestion control (kernel 4.9+)
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq # fq qdisc: mandatory for BBR before kernel 4.13, still recommended after
# Optional: list supported algorithms
# cat /proc/sys/net/ipv4/tcp_available_congestion_control # output: reno cubic bbr
# Disable slow start after idle (good for long‑lived connections)
net.ipv4.tcp_slow_start_after_idle = 0
# Do not save metrics for closed connections
net.ipv4.tcp_no_metrics_save = 1

Key explanations:
BBR (Bottleneck Bandwidth and RTT): Google's algorithm; on high-latency or lossy paths it can deliver 2-10× the throughput of CUBIC. default_qdisc = fq: the Fair Queue scheduler provides the pacing BBR relies on (required before kernel 4.13). tcp_slow_start_after_idle = 0: disables slow start after idle, suitable for long-lived connections (e.g., WebSocket).
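If bbr does not appear in tcp_available_congestion_control, the module may simply not be loaded; loading it now and persisting it across reboots follows the standard modules-load.d convention:

modprobe tcp_bbr
echo "tcp_bbr" > /etc/modules-load.d/bbr.conf # load at boot
lsmod | grep tcp_bbr # confirm the module is present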
Pre‑validation:
# Check kernel support for BBR
grep -i bbr /proc/sys/net/ipv4/tcp_available_congestion_control # expected: reno cubic bbr
# Current congestion algorithm
sysctl net.ipv4.tcp_congestion_control # expected: cubic (default)
# Kernel version (needs 4.9+)
uname -r # expected: 5.15.0+

Post-validation:
# Apply configuration
sysctl -p
# Verify BBR is active
sysctl net.ipv4.tcp_congestion_control # expected: bbr [tested]
# View BBR statistics (kernel 5.0+)
ss -ti | grep bbr
# Performance comparison (iperf3)
# BBR vs CUBIC:
# CUBIC: 500 Mbps
# BBR: 800 Mbps (≈60% improvement)
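The comparison above is straightforward to reproduce; a minimal A/B sketch against an iperf3 server (the server address is a placeholder):

#!/bin/bash
# Assumes "iperf3 -s" is already running on $SERVER
SERVER=${1:-192.168.1.10}
for ALGO in cubic bbr; do
    sysctl -w net.ipv4.tcp_congestion_control=$ALGO
    echo "== $ALGO =="
    iperf3 -c "$SERVER" -t 30 | tail -3 # summary lines only
done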
1.5 Other Network Parameters

# Enable TCP Fast Open (kernel 3.7+)
net.ipv4.tcp_fastopen = 3 # 1=client, 2=server, 3=both
# Enable TCP timestamps (required by tcp_tw_reuse; also used for RTT measurement and PAWS)
net.ipv4.tcp_timestamps = 1
# Enable TCP SACK (selective acknowledgments)
net.ipv4.tcp_sack = 1
# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1
# Max TCP orphan sockets
net.ipv4.tcp_max_orphans = 262144
# Local port range for client connections
net.ipv4.ip_local_port_range = 1024 65535
# Disable ICMP redirects (security)
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
# Enable reverse path filtering (anti‑IP spoofing)
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1

Step 2: File Descriptor Optimization
Goal: Support million‑level concurrent connections (each connection consumes one file descriptor).
2.1 System‑level Limits
# /etc/sysctl.conf
fs.file-max = 2097152 # System-wide maximum file descriptors (2 M)
# View current usage
cat /proc/sys/fs/file-nr # format: allocated used max
# Example output: 12000 10000 2097152

2.2 Process-level Limits
Configuration file: /etc/security/limits.conf

# All users
* soft nofile 1048576
* hard nofile 1048576
# Specific user (e.g., nginx)
nginx soft nofile 1048576
nginx hard nofile 1048576
# Root user
root soft nofile 1048576
root hard nofile 1048576
# Other resources
* soft nproc 65535 # max processes
* hard nproc 65535
* soft stack 10240 # stack size (KB)
* hard stack 10240

Immediate effect (current session):
ulimit -n 1048576
ulimit -u 65535

Permanent effect (requires re-login or service restart):
# Systemd service limits (example for nginx)
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=1048576
LimitNPROC=65535
EOF
systemctl daemon-reload
systemctl restart nginx

Post-validation:
# Verify system‑wide limit
cat /proc/sys/fs/file-max # expected: 2097152 [tested]
# Verify process‑level limit
ulimit -n # expected: 1048576
# Verify specific process (nginx) limits
cat /proc/$(pgrep nginx | head -1)/limits | grep "open files"
# Expected output: Max open files 1048576 1048576 files
# Current file descriptor usage
lsof | wc -l # note: lsof lists duplicate entries per process, so this overestimates
# Or
cat /proc/sys/fs/file-nr | awk '{print $1-$2}' # descriptors currently in use (allocated minus free)

Step 3: Memory Parameter Optimization
Goal: Optimize memory management, reduce swap usage, improve cache efficiency.
# /etc/sysctl.conf
# Swap usage strategy (0‑100, lower means less swap)
vm.swappiness = 10 # 0 = only when out of memory, 10 = recommended for DB/cache servers
# Dirty page flushing
vm.dirty_ratio = 20 # flush when dirty pages reach 20% of memory
vm.dirty_background_ratio = 10 # background flush at 10%
vm.dirty_expire_centisecs = 3000 # dirty page expiration time 30 s
vm.dirty_writeback_centisecs = 500 # background writeback interval 5 s
# Virtual memory behavior
vm.overcommit_memory = 1 # allow memory overcommit (required by Redis)
vm.overcommit_ratio = 50 # overcommit ratio 50%
# Transparent Huge Pages (disable for databases)
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Minimum free memory (KB)
vm.min_free_kbytes = 262144 # reserve 256 MB free memory

Key explanations: vm.swappiness controls the kernel's tendency to use swap; database/cache servers usually set 1-10. vm.dirty_ratio: when dirty pages exceed this percentage of memory, writing processes block until data is flushed. vm.overcommit_memory = 1 allows processes to allocate more virtual memory than physical RAM (needed by Redis fork).
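The commented-out THP echo commands above do not survive a reboot; one way to persist them is a oneshot unit in the same style as the NIC-tuning service later in this guide (the unit name is an assumption):

# /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

Enable it with systemctl daemon-reload && systemctl enable --now disable-thp.service.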
Post‑validation:
# Check current swap usage
free -h
# Example output shows swap usage near 0 B
# Check dirty page ratio
cat /proc/vmstat | grep dirty # e.g., nr_dirty 1200
# Verify Transparent HugePage status
cat /sys/kernel/mm/transparent_hugepage/enabled # expected: [never] for DB workloads

Step 4: Full Configuration File Example
File path: /etc/sysctl.d/99-high-performance.conf

# ============================
# High‑performance web server kernel tuning
# Applicable scenario: million concurrent connections
# Tested on: Ubuntu 22.04 / Kernel 5.15+
# ============================
# ===== Network parameters =====
# TCP connection queues
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 16384
# TCP buffers
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# TCP memory management (64 GB environment)
net.ipv4.tcp_mem = 786432 1048576 1572864
# TIME_WAIT optimization
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 55000
# TCP keepalive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
# TCP connection timeout
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
# TCP congestion control (BBR)
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# Other TCP optimizations
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_max_orphans = 262144
# Local port range
net.ipv4.ip_local_port_range = 1024 65535
# Security parameters
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_syncookies = 1
# ===== File descriptors =====
fs.file-max = 2097152
# ===== Memory management =====
vm.swappiness = 10
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
vm.min_free_kbytes = 262144
# ===== Other kernel parameters =====
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536

Apply configuration:
# Load the new sysctl file
sysctl -p /etc/sysctl.d/99-high-performance.conf
# Verify key parameters
sysctl -a | grep -E "(somaxconn|tcp_rmem|bbr|file-max|swappiness)"
# Persist across reboots (already in /etc/sysctl.d)

Step 5: NIC Parameter Optimization
Goal: Optimize NIC driver parameters to increase network throughput.
# View NIC model and driver
ethtool -i eth0 # example output: driver: igb, version: 5.4.0
# View current NIC parameters
ethtool -g eth0 # ring buffer sizes
ethtool -k eth0 # offload features
ethtool -S eth0 # statistics
# Optimize ring buffer (receive queue)
ethtool -G eth0 rx 4096 tx 4096
# Enable offload features (hardware acceleration)
ethtool -K eth0 tso on # TCP Segmentation Offload
ethtool -K eth0 gso on # Generic Segmentation Offload
ethtool -K eth0 gro on # Generic Receive Offload
ethtool -K eth0 sg on # Scatter‑Gather
ethtool -K eth0 rx-checksumming on
ethtool -K eth0 tx-checksumming on
# Adjust interrupt coalescing
ethtool -C eth0 rx-usecs 50 tx-usecs 50
# Enable multi‑queue (RSS/RPS)
ethtool -l eth0 # view queue count
ethtool -L eth0 combined 8 # set 8 queues (if hardware supports)
# Bind interrupts to specific CPUs (IRQ affinity)
#!/bin/bash
# Spread eth0 IRQs across CPUs round-robin; stop irqbalance first
# (systemctl stop irqbalance) or it will overwrite these settings
CPUS=$(nproc)
I=0
for IRQ in $(grep eth0 /proc/interrupts | awk '{print $1}' | sed 's/://'); do
    echo $((I % CPUS)) > /proc/irq/${IRQ}/smp_affinity_list
    I=$((I + 1))
done

Persist NIC tuning (systemd service):
# /etc/systemd/system/network-tuning.service
[Unit]
Description=Network Performance Tuning
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -G eth0 rx 4096 tx 4096
ExecStart=/usr/sbin/ethtool -K eth0 tso on gso on gro on
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

Minimal Required Principles
Core mechanisms:
Application Layer → Socket API → Transport Layer (TCP/UDP) → Kernel protocol stack → NIC driver → Physical NIC
(the transport layer and kernel stack are the focus of this guide)

Key optimization points:
TCP connection queues (half-open SYN queue and fully-established accept queue).
TCP buffers (receive and send windows).
TIME_WAIT state (resource consumption).
BBR congestion control (higher throughput vs CUBIC).
Why does million-connection concurrency need these parameters? (A quick sanity check follows this list.)
1 M connections ≈ 1 FD + ~3 KB kernel memory each → ~3 GB total.
100 k QPS → 100 k accept(), read()/write(), close() per second → requires large queues and buffers.
10 GbE NIC → theoretical 10 Gbps = 1.25 GB/s → ~830 k packets per second → NIC queues and interrupt handling must be tuned.
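These figures are easy to verify from the shell (integer arithmetic, decimal units):

CONNS=1000000
echo "Socket memory : ~$((CONNS * 3 / 1000000)) GB at ~3 KB per connection"
echo "10 GbE rate   : $((10000000000 / 8 / 1500)) packets/s at 1500 B MTU"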
Observability (Monitoring + Alerts + Performance)
Monitoring Metrics
Key system metrics:
# 1. Network connection statistics
ss -s # example: TCP: 850000 (estab 800000, closed 40000, orphaned 100, timewait 39000)
# 2. File descriptor usage
cat /proc/sys/fs/file-nr | awk '{printf "Usage: %.2f%%\n", ($1-$2)/$3*100}'
# 3. TCP queue overflow
netstat -s | grep -i overflow # key: "times the listen queue of a socket overflowed"
# 4. TCP retransmission rate
netstat -s | grep -i retrans # key: "segments retransmitted"
# 5. Memory usage
free -h
cat /proc/meminfo | grep -E "(MemTotal|MemFree|Cached|SwapTotal|SwapFree)"
# 6. NIC traffic
sar -n DEV 1 10 # sample every second, 10 times

Prometheus monitoring (node_exporter):
# Key metrics
node_netstat_Tcp_CurrEstab # current TCP connections
node_sockstat_TCP_tw # TIME_WAIT count
node_filefd_allocated / node_filefd_maximum * 100 # fd usage %
rate(node_network_receive_bytes_total[1m]) # inbound throughput
rate(node_network_transmit_bytes_total[1m]) # outbound throughput
rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) * 100 # retransmission %
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 # memory usage %

Prometheus alert rules (example):
groups:
  - name: system_performance
    interval: 30s
    rules:
      # Alert 1: Too many TCP connections
      - alert: HighTCPConnections
        expr: node_netstat_Tcp_CurrEstab > 500000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TCP connections ({{ $labels.instance }})"
          description: "Current TCP connections {{ $value }}, exceeds 500k"
      # Alert 2: Excessive TIME_WAIT
      - alert: HighTimeWaitConnections
        expr: node_sockstat_TCP_tw > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TIME_WAIT connections ({{ $labels.instance }})"
          description: "Current TIME_WAIT connections {{ $value }}"
      # Alert 3: File descriptor usage high
      - alert: HighFileDescriptorUsage
        expr: (node_filefd_allocated / node_filefd_maximum) * 100 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "File descriptor usage high ({{ $labels.instance }})"
          description: "Current usage {{ $value }}%"
      # Alert 4: TCP retransmission rate high
      - alert: HighTCPRetransmissionRate
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TCP retransmission rate ({{ $labels.instance }})"
          description: "Current retransmission {{ $value }}%"
      # Alert 5: Swap usage high
      - alert: HighSwapUsage
        expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High swap usage ({{ $labels.instance }})"
          description: "Current swap usage {{ $value }}%"

Performance Benchmarking
Tool: wrk (HTTP load testing)
# Install wrk
git clone https://github.com/wg/wrk.git
cd wrk && make
cp wrk /usr/local/bin/
# Test 1: Short‑connection performance
wrk -t10 -c10000 -d30s --latency http://localhost/
# Expected output after tuning (example):
# Running 30s test @ http://localhost/
# 10 threads and 10000 connections
# Thread Stats Avg Stdev Max +/- Stdev
# Latency 50.12ms 10.23ms 200.00ms 75.23%
# Req/Sec 20.50k 2.10k 30.00k 85.67%
# 6150000 requests in 30.00s, 5.12GB read
# Requests/sec: 205000.00
# Transfer/sec: 175.00MB
# Test 2: Long‑connection performance
wrk -t10 -c10000 -d30s --latency -H "Connection: keep-alive" http://localhost/
# Test 3: Custom POST request (Lua script)
cat > post.lua <<'EOF'
wrk.method = "POST"
wrk.body = '{"key":"value"}'
wrk.headers["Content-Type"] = "application/json"
EOF
wrk -t10 -c1000 -d30s -s post.lua http://localhost/api/test

Tool: ab (Apache Bench)
# Install ab
apt install -y apache2-utils # Ubuntu
yum install -y httpd-tools # RHEL
# Test
ab -n 1000000 -c 10000 -k http://localhost/
# Expected output after tuning (example):
# Requests per second: 150000.00 [#/sec] (mean)
# Time per request: 66.667 ms (mean)
# Time per request: 0.007 ms (mean, across all concurrent requests)

Tool: iperf3 (network bandwidth test)
# Server
iperf3 -s
# Client (test TCP throughput)
iperf3 -c 192.168.1.10 -t 60 -P 10
# Expected output with 10 GbE + BBR: ~9.5 Gbps

Performance Comparison (Before vs After)

| Metric | Before Optimization | After Optimization | Improvement |
| --- | --- | --- | --- |
| QPS | 50k | 200k | 4× |
| Max concurrent connections | 100k | 1M | 10× |
| P99 latency | 200 ms | 90 ms | -55% |
| Network throughput (BBR vs CUBIC) | 6 Gbps | 9.5 Gbps | +58% |
| TIME_WAIT connections | 50k | 10k | -80% |
Common Faults & Troubleshooting
| Symptom | Diagnostic Command | Possible Root Cause | Quick Fix | Permanent Fix |
| --- | --- | --- | --- | --- |
| Connection refused | netstat -s \| grep overflow | Full-connection queue full | sysctl net.core.somaxconn=65535 (temporary) | Modify /etc/sysctl.conf and adjust application backlog |
| Performance drop | sar -n DEV 1 10 | 1) NIC interrupt imbalance 2) Excessive TCP retransmits | 1) Adjust IRQ affinity 2) Check network quality | 1) Enable NIC multi-queue 2) Enable BBR |
| Port exhaustion | ss -tan \| grep TIME_WAIT \| wc -l | Too many TIME_WAIT sockets | sysctl net.ipv4.tcp_fin_timeout=15 | Optimize the application (use connection pooling) |
| File descriptor shortage | lsof \| wc -l | ulimit too low | ulimit -n 1048576 | Edit /etc/security/limits.conf |
| High swap usage | free -h | Insufficient memory / high swappiness | swapoff -a && swapon -a (temporary) | 1) Add RAM 2) Set vm.swappiness=1 |
| BBR not effective | ss -ti \| grep bbr | Kernel lacks support / module not loaded | modprobe tcp_bbr | Upgrade to kernel 4.9+ and ensure tcp_bbr is loaded at boot |
Diagnostic command collection:
# 1. Overall performance overview
vmstat 1 10
# 2. Network connection status
ss -tan | awk '{print $1}' | sort | uniq -c
# 3. Detailed TCP stats
netstat -s | grep -E "(overflow|retrans|loss|reset)"
# 4. NIC statistics
sar -n DEV,EDEV 1 10
# 5. Top 10 processes by file descriptors
lsof | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# 6. Top 10 IPs by TCP connections
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
# 7. Real‑time TCP connection count
watch -n 1 'ss -s'
# 8. Kernel packet drops
dmesg | grep -i "drop"

Change & Rollback Playbook
Gray‑scale Deployment Strategy
Scenario: Apply kernel parameter tuning to production.
# Phase 1: Single‑machine test (1 server)
# 1. Backup current config
sysctl -a > /tmp/sysctl_backup_$(hostname)_$(date +%Y%m%d).conf
# 2. Apply new config
sysctl -p /etc/sysctl.d/99-high-performance.conf
# 3. Performance test (wrk 30 min)
wrk -t10 -c10000 -d1800s --latency http://localhost/ > /tmp/wrk_test_$(date +%Y%m%d_%H%M).log
# 4. Monitor key metrics
watch -n 5 'ss -s; netstat -s | grep overflow; free -h'
# 5. Observe 24 h, ensure no anomalies
# Phase 2: Small‑scale rollout (10% of servers)
# Using Ansible for batch deployment
ansible-playbook -i inventory.ini deploy_sysctl.yml --limit "web_servers[0:10]"
# Phase 3: Full rollout (100% of servers)
ansible-playbook -i inventory.ini deploy_sysctl.yml

Rollback Conditions & Commands
Rollback triggers:
QPS drops > 20%.
Error rate rises > 5%.
Connection failure rate increases.
System logs show many errors. (A sketch of an automated guard based on these triggers follows this list.)
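A minimal sketch of such a guard, assuming a /health endpoint on the service (the endpoint, probe count, and threshold are all assumptions to adapt):

#!/bin/bash
# Probe the service 100 times; roll back if more than 5% of probes fail
ERRORS=0
for i in $(seq 1 100); do
    curl -fsS -o /dev/null --max-time 2 http://localhost/health || ERRORS=$((ERRORS + 1))
done
if [ "$ERRORS" -gt 5 ]; then
    sysctl -p "/tmp/sysctl_backup_$(hostname)_20250115.conf"
    echo "Rollback at $(date): ${ERRORS} failed probes" >> /var/log/sysctl_rollback.log
fi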
Rollback steps:
# 1. Immediately restore original config
sysctl -p /tmp/sysctl_backup_$(hostname)_20250115.conf
# 2. Verify rollback result
sysctl -a | grep -E "(somaxconn|tcp_rmem|bbr)" > /tmp/sysctl_after_rollback.conf
diff /tmp/sysctl_backup_$(hostname)_20250115.conf /tmp/sysctl_after_rollback.conf
# 3. Restart affected services (e.g., nginx)
systemctl restart nginx
# 4. Verify service health
curl -I http://localhost/
wrk -t2 -c100 -d10s http://localhost/
# 5. Record rollback reason
echo "Rollback time: $(date)" >> /var/log/sysctl_rollback.log
echo "Rollback reason: [fill in]" >> /var/log/sysctl_rollback.log
dmesg | tail -100 >> /var/log/sysctl_rollback.log

Best Practices
Stage‑wise optimization (easy → hard):
Phase 1: TCP connection queue + file descriptors (quick wins).
Phase 2: TCP buffers + TIME_WAIT (significant boost).
Phase 3: BBR + NIC tuning (fine‑tuning).
Parameter calculation formulas :
# Full‑connection queue
somaxconn = expected_max_concurrency / 10
# TCP memory (pages)
# high water = total_memory_GB * 0.1 * 256
# File descriptors
fs.file-max = expected_max_connections * 2
# TIME_WAIT timeout
tcp_fin_timeout = 15-30 s (short-lived connections) / 30-60 s (long-lived connections)
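The sizing rules above can be wrapped in a small helper; note that somaxconn values above 65535 are accepted on recent kernels (older kernels truncated them), though this guide standardizes on 65535. A sketch, not a sizing authority:

#!/bin/bash
CONNS=${1:-1000000} # expected max concurrent connections
echo "net.core.somaxconn ≈ $((CONNS / 10))"
echo "fs.file-max = $((CONNS * 2))"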
Monitoring priority :
P0 (must monitor): TCP connections, fd usage, queue overflow.
P1 (important): TIME_WAIT count, TCP retransmission, network throughput.
P2 (reference): Swap usage, dirty page ratio, NIC errors.
Avoid common pitfalls :
❌ Blindly increase all parameters – may exhaust memory.
❌ Ignore application‑level settings (e.g., Nginx backlog).
❌ Overlook hardware limits – 10 GbE required for BBR gains.
❌ Deploy to production without gray‑scale testing.
Container considerations :
Containers share the host kernel; global sysctls must be applied on the host.
Network-namespaced parameters (including net.core.somaxconn) can, however, be set per container through the runtime; global ones (fs.file-max, vm.*) cannot (see the Docker sketch after this list).
cgroup limits (CPU/memory) should be tuned together.
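For the namespaced parameters, container runtimes expose per-container switches; a Docker example (image and values are illustrative):

docker run --sysctl net.core.somaxconn=65535 \
           --ulimit nofile=1048576:1048576 \
           nginx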
Regular maintenance :
Monthly verify parameter effectiveness ( sysctl -a).
Quarterly benchmark performance.
Re‑validate after kernel upgrades.
FAQ
Q1: Why does the listen queue stay full after increasing somaxconn ? A: The application’s listen() backlog must also be increased (e.g., Nginx listen 80 backlog=65535;).
Q2: What’s the difference between BBR and CUBIC? A: CUBIC relies on packet loss to detect congestion; BBR uses measured bandwidth and RTT, delivering 2‑10× higher throughput in high‑latency or lossy networks.
Q3: Difference between tcp_tw_reuse and tcp_tw_recycle ? A: tw_reuse safely reuses TIME_WAIT sockets on the client side; tw_recycle breaks NAT environments and has been removed since kernel 4.12.
Q4: How to decide if BBR should be enabled? A: Enable for cross‑region, mobile, or high‑latency networks; in low‑latency data‑center LAN the benefit is modest. Verify with iperf3 comparisons.
Q5: How to set tcp_mem on a 64 GB system? A: Target the pressure threshold at roughly 10% of RAM in 4 KB pages (1 GB = 262144 pages): 64 × 0.1 × 262144 ≈ 1.6 M pages; set low ≈ half and high ≈ 1.5× of that, e.g. net.ipv4.tcp_mem = 819200 1638400 2457600.
Q6: Can these parameters be changed inside containers? A: Network-namespaced sysctls (net.ipv4.*, and some net.core.* such as somaxconn) can be set per container via the runtime (e.g., docker run --sysctl); global parameters (fs.file-max, vm.*) must be adjusted on the host. Application-level limits (e.g., ulimit) can also be set per container.
Q7: Is setting vm.swappiness=0 safe in production? A: Not recommended. Values of 0 can trigger aggressive OOM killing; use 1‑10 instead.
Q8: Do TIME_WAIT sockets consume ports? A: Yes, they occupy the full 4‑tuple and can exhaust client ports under heavy short‑connection loads.
Q9: How to verify BBR is really active? A: Check with ss -ti | grep bbr, ensure the tcp_bbr module is loaded ( lsmod | grep tcp_bbr), and run iperf3 tests comparing BBR vs CUBIC.
Q10: How much memory is needed for a million concurrent connections? A: Roughly 3 KB per TCP connection → ~3 GB total; recommend 32 GB+ to leave headroom for the OS and applications.
Extended Reading
Official documentation:
Linux kernel docs: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
TCP BBR paper: https://research.google/pubs/pub45646/
Red Hat performance tuning guide: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/
In‑depth technical blogs:
Cloudflare on NGINX and HTTP/2 prioritization: https://blog.cloudflare.com/http-2-prioritization-with-nginx/
Linux network stack source analysis: https://github.com/torvalds/linux/tree/master/net/ipv4
Tools & resources:
Linux performance analysis toolkit: https://www.brendangregg.com/linuxperf.html
Sysctl quick reference: https://sysctl-explorer.net/
Generation time: 2025-01-15 Article version: v1.0 Validation environment: Ubuntu 22.04 + Kernel 5.15 / RHEL 9.3 + Kernel 5.14