Boost 100k+ Concurrent Linux Systems: 300% QPS Increase & 50% Latency Reduction
This guide details step‑by‑step Linux kernel and network‑stack tuning techniques, including sysctl tweaks, TCP congestion control, interrupt affinity, and memory and disk optimizations, that raised QPS from 12,000 to 38,000 req/s (about 3.2× the baseline) and cut P99 latency by half in the test environment. They target high‑concurrency web, API, and database services.
Linux Kernel and Network Stack Tuning: Boost 100k+ Concurrent Systems QPS by 300% and Reduce Latency by 50%
Applicable Scenarios & Prerequisites
Applicable Scenarios : High‑concurrency Web/API services (QPS 50K+), network‑intensive applications (CDN/Proxy/Gateway), database/cache servers (MySQL/Redis/Kafka), container platforms (Kubernetes node optimization).
Prerequisites : Linux kernel 4.18+ (recommended 5.10+ LTS), root access for sysctl, understanding of current bottlenecks (CPU/Network/Disk), established performance baseline (load test data).
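The kernel prerequisite can be checked up front. The following is a minimal sketch in POSIX shell; `kernel_at_least` is a hypothetical helper name, not part of any standard tool, and it compares only the major and minor version numbers.

```shell
#!/bin/sh
# kernel_at_least VERSION MAJOR MINOR
# Returns 0 when VERSION (in `uname -r` format) is at least MAJOR.MINOR.
# Hypothetical helper for illustration; patch level and distro suffixes are ignored.
kernel_at_least() {
    ver=$1
    maj=${ver%%.*}            # text before the first dot, e.g. "5"
    rest=${ver#*.}            # text after the first dot, e.g. "15.0-89-generic"
    min=${rest%%[!0-9]*}      # leading digits only, e.g. "15"
    [ "$maj" -gt "$2" ] || { [ "$maj" -eq "$2" ] && [ "$min" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 4 18; then
    echo "kernel $(uname -r): meets the 4.18+ prerequisite"
else
    echo "kernel $(uname -r): older than 4.18"
fi
```

Changing the last two arguments to `5 10` checks against the recommended LTS line instead of the minimum.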
Environment & Version Matrix
Component | Version Requirement | Key Feature Dependency | Test Environment
OS | RHEL 8+ / Ubuntu 22.04+ | - | RHEL 8.6 / Ubuntu 22.04
Kernel | 5.10+ LTS | BBR congestion control, eBPF tracing | 5.15.0-89
CPU | 8C+ | Multi‑queue NIC support | 16C (Intel Xeon)
Network | 10Gbps+ | RSS/RPS/XPS support | 25Gbps NIC
Memory | 16GB+ | Conntrack table size | 64GB
Quick Checklist
Step 1: Establish performance baseline (load test & monitoring)
Step 2: TCP stack optimization (connections, buffers, congestion control)
Step 3: Network layer optimization (backlog, conntrack, port range)
Step 4: Interrupt and CPU affinity tuning
Step 5: Memory and file‑descriptor tuning
Step 6: Disk I/O and filesystem tuning
Step 7: Application‑layer coordination (Nginx/JVM)
Step 8: Verify results and persist configuration
Implementation Steps
Step 1: Establish Performance Baseline
Goal : Record pre‑tuning key metrics for comparison.
# System load and CPU
uptime
# Output: load average: 15.2, 12.8, 10.3 (baseline)
mpstat -P ALL 1 10 | tee /tmp/cpu-baseline.txt
# Network throughput and connections
sar -n DEV 1 10 | tee /tmp/network-baseline.txt
ss -s
netstat -s | grep -E "retransmit|timeout"
wrk -t8 -c1000 -d60s --latency http://192.168.1.10:8080/
Baseline Example :
Pre‑tuning baseline:
- QPS: 12000 req/s
- P99 latency: 850ms
- Retransmission rate: 1.2%
- CPU %sys: 35%
- TCP ESTABLISHED: 8500
Step 2: TCP Stack Optimization
Goal : Improve TCP connection establishment, data transfer, and teardown efficiency.
TCP Connection Queue Tuning
# /etc/sysctl.d/10-tcp-tuning.conf
# sysctl.conf treats only lines starting with # or ; as comments, so notes sit on their own lines
# SYN queue length (default 1024)
net.ipv4.tcp_max_syn_backlog = 16384
# Upper bound for the listen() backlog
net.core.somaxconn = 16384
# Enable SYN cookies
net.ipv4.tcp_syncookies = 1
# Fewer SYN-ACK retries (default 5)
net.ipv4.tcp_synack_retries = 2
sysctl -p /etc/sysctl.d/10-tcp-tuning.conf
Validate SYN queue overflow :
netstat -s | grep -i "SYNs to LISTEN"
# Expected near 0 drops after tuning
TCP Buffer Tuning
# TCP read/write buffers
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_moderate_rcvbuf = 1
# Global socket limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
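sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
Parameter Explanation :
tcp_rmem / tcp_wmem: each triple is the minimum, default, and maximum buffer size in bytes.
tcp_rmem[2]: maximum receive buffer per socket; it caps the receive window and therefore download throughput on high‑latency links.
tcp_wmem[2]: maximum send buffer per socket; it limits upload throughput in the same way.
tcp_moderate_rcvbuf: enables receive‑buffer auto‑tuning, which grows the buffer toward tcp_rmem[2] based on RTT and bandwidth.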
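A common way to size the maximum buffer is the bandwidth‑delay product (BDP): the number of bytes that must be in flight to keep a link full. The sketch below assumes that rule of thumb. `bdp_bytes` is a hypothetical helper, and the link figures are illustrative rather than measurements from the test environment.

```shell
#!/bin/sh
# bdp_bytes BANDWIDTH_MBPS RTT_MS
# Bandwidth-delay product in bytes:
#   (Mbit/s * 1,000,000 / 8) bytes/s * (RTT_MS / 1000) s = Mbit/s * RTT_MS * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Illustrative link: 1 Gbit/s with a 100 ms round trip
bdp=$(bdp_bytes 1000 100)
max_buf=16777216   # the 16 MB maximum used for tcp_rmem[2] / tcp_wmem[2] in this guide
echo "BDP: $bdp bytes, configured max buffer: $max_buf bytes"
if [ "$bdp" -le "$max_buf" ]; then
    echo "max buffer covers this path"
else
    echo "max buffer is smaller than the BDP; throughput will be window-limited"
fi
```

For 1 Gbit/s at 100 ms the BDP is 12,500,000 bytes, just under the 16 MB maximum. A 10 Gbit/s path at the same RTT would need roughly ten times that, so the maximum is worth recomputing for long‑distance, high‑bandwidth links.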
TCP TIME_WAIT Optimization
# Allow reuse of TIME_WAIT sockets for new outgoing connections (requires tcp_timestamps)
net.ipv4.tcp_tw_reuse = 1
# FIN_WAIT_2 timeout for orphaned connections (default 60s)
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_max_orphans = 65536
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15
Note : tcp_tw_recycle was removed in kernel 4.12; use tcp_tw_reuse instead.
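Whether the TIME_WAIT settings matter on a given host depends on how many sockets actually sit in that state. Here is a minimal sketch for counting sockets per state, assuming the usual `ss -tan` layout (one header line, state in the first column). `count_tcp_states` is a hypothetical helper name.

```shell
#!/bin/sh
# count_tcp_states: reads `ss -tan` output on stdin and prints "STATE COUNT"
# for each TCP state, sorted by state name. Hypothetical helper for illustration.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live host (requires iproute2):
#   ss -tan | count_tcp_states
```

A TIME-WAIT count that approaches the size of ip_local_port_range on a host that opens many outgoing connections is the situation tcp_tw_reuse is meant to relieve.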
TCP Congestion Control
# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# Switch to BBR (kernel 4.9+)
modprobe tcp_bbr
echo "tcp_bbr" >> /etc/modules-load.d/bbr.conf
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
# Persist settings
cat <<EOF >> /etc/sysctl.d/10-tcp-tuning.conf
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
EOF
BBR vs Cubic :
Cubic (default): loss‑based; throughput drops sharply once packet loss rises.
BBR : model‑based; it estimates bottleneck bandwidth and round‑trip time instead of reacting to loss, and can improve throughput by 10‑25% in lossy environments.
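Before relying on BBR, it helps to confirm that the kernel actually offers it. The sketch below checks for an algorithm name in a space‑separated list such as the value of net.ipv4.tcp_available_congestion_control. `has_cc_algo` is a hypothetical helper name.

```shell
#!/bin/sh
# has_cc_algo "LIST" ALGO
# Returns 0 when ALGO appears as a whole word in the space-separated LIST,
# e.g. the value of net.ipv4.tcp_available_congestion_control.
has_cc_algo() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on a live host:
#   has_cc_algo "$(sysctl -n net.ipv4.tcp_available_congestion_control)" bbr \
#       || echo "bbr not available; load tcp_bbr first"
has_cc_algo "reno cubic bbr" bbr && echo "bbr available"
```

Padding both strings with spaces makes the match whole‑word, so a list containing only `bbr2` would not be mistaken for `bbr`.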
Step 3: Network Layer Optimization
IP Layer Parameters
# Local port range for client connections
net.ipv4.ip_local_port_range = 10000 65000
# Conntrack table size (NAT/firewall)
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 600
# ARP cache thresholds
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
sysctl -p /etc/sysctl.d/10-tcp-tuning.conf
Validate conntrack usage :
cat /proc/sys/net/netfilter/nf_conntrack_count # current usage
cat /proc/sys/net/netfilter/nf_conntrack_max # limit
dmesg | grep "nf_conntrack: table full" # overflow warning
NIC Multi‑Queue and Interrupt Affinity
# Check NIC queues
ethtool -l eth0
# Increase queues (hardware dependent)
ethtool -L eth0 combined 16
# View interrupt distribution
cat /proc/interrupts | grep eth0
# Manually bind interrupts to CPUs (example; smp_affinity_list takes CPU numbers, not a bitmask)
echo 0 > /proc/irq/125/smp_affinity_list # CPU 0
echo 1 > /proc/irq/126/smp_affinity_list # CPU 1
Automation script (irqbalance alternative) :
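#!/bin/bash
# Spread the NIC's IRQs round-robin across CPUs.
# Stop irqbalance first, otherwise it will overwrite these settings.
IFACE=eth0
CPU_COUNT=$(nproc)
i=0
# IRQ names in /proc/interrupts are driver-specific (e.g., eth0-TxRx-0); adjust the match if needed
for IRQ in $(awk -v ifc="$IFACE" '$NF ~ ifc { sub(/:$/, "", $1); print $1 }' /proc/interrupts); do
    CPU=$((i % CPU_COUNT))
    echo "$CPU" > "/proc/irq/$IRQ/smp_affinity_list"
    echo "IRQ $IRQ -> CPU $CPU"
    i=$((i + 1))
done
RPS/RFS Software Multi‑Queue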
# Enable RPS (for NICs with fewer queues than CPUs); ffff is the bitmask for CPUs 0-15
for i in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo ffff > "$i"; done
# Enable RFS (flow steering to application CPU)
sysctl -w net.core.rps_sock_flow_entries=32768
# rps_flow_cnt per queue = rps_sock_flow_entries / number of RX queues (32768 / 16)
for i in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do echo 2048 > "$i"; done
Step 4: Interrupt & CPU Affinity
Goal : Reduce context switches and soft‑interrupt overhead.
Soft‑Interrupt Monitoring
mpstat -P ALL 1 5 # watch %soft column (<10% ideal)
cat /proc/softirqs # NET_RX per‑CPU counts
watch -n1 'cat /proc/softirqs | grep NET_RX'
Application Process CPU Binding
# Check current affinity
taskset -cp <PID>
# Bind Nginx workers to specific CPUs (nginx.conf); one mask per worker, so 8 workers here
worker_processes 8;
worker_cpu_affinity 00000001 00000010 00000100 00001000 00010000 00100000 01000000 10000000;
Step 5: Memory & File Descriptor Tuning
Memory Management
# Dirty page writeback
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.dirty_writeback_centisecs = 100
# Overcommit policy (Redis/MongoDB)
vm.overcommit_memory = 1
# Swappiness (reduce swapping)
vm.swappiness = 10
cat <<EOF >> /etc/sysctl.d/20-memory.conf
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 10
EOF
sysctl -p /etc/sysctl.d/20-memory.conf
File Descriptor Limits
# System-wide limit
fs.file-max = 2097152
# Per‑process limits (limits.conf)
cat <<EOF >> /etc/security/limits.conf
* soft nofile 100000
* hard nofile 100000
root soft nofile 100000
root hard nofile 100000
EOF
ulimit -n 100000
# Verify usage
cat /proc/sys/fs/file-nr # used, free, max
Step 6: Disk I/O & Filesystem
I/O Scheduler
# Current scheduler
cat /sys/block/sda/queue/scheduler # e.g., [mq-deadline] kyber bfq none
# For SSD, use none (the multi-queue replacement for noop, which was removed in kernel 5.0)
echo none > /sys/block/sda/queue/scheduler
# Persist via udev rule
cat <<'EOF' > /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
EOF
Filesystem Mount Options (XFS example)
# /etc/fstab (XFS dropped the nobarrier option in kernel 4.19; noatime already implies nodiratime)
/dev/sda1 /data xfs defaults,noatime 0 0
# Remount to apply
mount -o remount /data
mount | grep /data
Step 7: Application‑Layer Coordination
Nginx Configuration
# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
worker_rlimit_nofile 100000;
worker_cpu_affinity auto;
events {
    use epoll;
    worker_connections 10000;
    multi_accept on;
}
http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;
    upstream backend {
        server 192.168.1.21:8080;
        # Upstream keepalive only takes effect when the proxying location sets
        # proxy_http_version 1.1 and clears the Connection header
        keepalive 256;
        keepalive_requests 1000;
        keepalive_timeout 60s;
    }
}
Java JVM Tuning
# JVM start options
JAVA_OPTS="-Xms8g -Xmx8g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:+ParallelRefProcEnabled \
-XX:+UnlockExperimentalVMOptions \
-XX:G1NewSizePercent=30 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:+DisableExplicitGC \
-Djava.net.preferIPv4Stack=true"
# HikariCP pool example
spring.datasource.hikari.maximum-pool-size=200
spring.datasource.hikari.minimum-idle=20
spring.datasource.hikari.connection-timeout=5000
Step 8: Verify Results & Persist Configuration
Performance Validation
# Repeat the same wrk test after tuning
wrk -t8 -c1000 -d60s --latency http://192.168.1.10:8080/
# Expected results after tuning:
# QPS: 38000 req/s (+217%, about 3.2x the 12000 req/s baseline)
# P99 latency: 420ms (-50%)
# Retransmission rate: 0.3% (-75%)
# CPU %sys: 18% (-48%)
Persist Configuration Checks
# Reload every snippet under /etc/sysctl.d/ plus /etc/sysctl.conf, as done at boot.
# Do not overwrite /etc/sysctl.conf with the concatenated snippets.
sysctl --system
# Verify services start on boot
systemctl list-unit-files | grep -E "nginx|network" | grep enabled
Monitoring & Alerting
Key Performance Indicators
#!/bin/bash
# Real‑time monitoring script (perf-monitor.sh)
while true; do
    echo "=== $(date) ==="
    ss -s | grep TCP
    # Matches both the "retransmited" and "retransmitted" spellings printed by different net-tools versions
    netstat -s | grep -E "segments retransmit"
    # Column 8 of the all-CPU Average line is %soft
    mpstat -P ALL 1 1 | awk '$1 == "Average:" && $2 == "all" {print "CPU Soft IRQ: " $8 "%"}'
    sar -n DEV 1 1 | grep eth0 | tail -1 | awk '{print "RX: " $5 " KB/s, TX: " $6 " KB/s"}'
    echo "---"
    sleep 5
done
Prometheus Metrics Collection
# node‑exporter metrics of interest
node_netstat_Tcp_RetransSegs # TCP retransmissions
node_netstat_TcpExt_TCPTimeouts # TCP timeouts
node_softnet_dropped_total # Soft‑IRQ drops
node_network_receive_drop_total # NIC receive drops
Alert Rules :
groups:
  - name: kernel_network_alerts
    rules:
      - alert: HighTCPRetrans
        expr: rate(node_netstat_Tcp_RetransSegs[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TCP retransmission rate abnormal (>100/s)"
      - alert: SoftnetDropped
        expr: rate(node_softnet_dropped_total[1m]) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Soft‑IRQ drop indicates kernel network stack overload"
Performance & Capacity Planning
Sample capacity formula:
Single‑machine theoretical concurrency = local ports × target endpoints (each connection needs a unique source port per destination IP:port)
Sustainable new‑connection rate = theoretical concurrency / average port hold time (s, including TIME_WAIT)
Actual usable concurrency = min(
    theoretical concurrency,
    file‑descriptor limit,
    memory limit (≈4KB per connection),
    conntrack table size
)
# Example (16C 64G server):
# Port range: 10000‑65000 (55,000 ports)
# Targets: 10 backends
# Theoretical: 550,000
# Avg port hold time: 1s → at most about 550,000 new connections/s
# fd limit: 100,000 → recommend 80,000 concurrent connections
Security & Compliance
SYN Flood Protection
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_synack_retries = 2
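# Rate-limit SYNs per source IP; a global "-m limit" rule would cap the whole service at 100 new connections/s
iptables -A INPUT -p tcp --dport 80 --syn -m hashlimit --hashlimit-name syn-http --hashlimit-mode srcip --hashlimit-above 100/sec --hashlimit-burst 200 -j DROP
IP Spoofing Protection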
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
Common Issues & Troubleshooting
Soft‑IRQ CPU >30% : Increase NIC queues or balance interrupts (ethtool, irq affinity, RPS/RFS).
TCP retransmission rate >5% : Switch to BBR, increase buffers, check packet loss.
conntrack table full : Raise nf_conntrack_max and reduce timeout.
Excessive TIME_WAIT : Enable tcp_tw_reuse and use keep‑alive.
Port exhaustion : Expand ip_local_port_range.
File descriptor exhaustion : Raise fs.file-max and per‑process limits.
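Two of the issues above, a full conntrack table and file descriptor exhaustion, can be caught early by comparing current usage with the configured limit. The sketch below reads the same /proc files used earlier in this guide. The helper names and the 80% default threshold are assumptions for illustration.

```shell
#!/bin/sh
# usage_pct USED LIMIT -> integer percentage of LIMIT in use (rounded down)
usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# check_usage NAME USED LIMIT [THRESHOLD]
# Prints one status line; THRESHOLD defaults to 80 (an assumed value, tune as needed).
check_usage() {
    pct=$(usage_pct "$2" "$3")
    if [ "$pct" -ge "${4:-80}" ]; then
        echo "WARN $1 at ${pct}% of limit"
    else
        echo "OK $1 at ${pct}% of limit"
    fi
}

# Conntrack table (these files exist only when nf_conntrack is loaded)
ct=/proc/sys/net/netfilter
if [ -r "$ct/nf_conntrack_count" ]; then
    check_usage conntrack "$(cat "$ct/nf_conntrack_count")" "$(cat "$ct/nf_conntrack_max")"
fi

# File handles: first field of file-nr is allocated, third is the maximum
if [ -r /proc/sys/fs/file-nr ]; then
    read -r allocated _unused maximum < /proc/sys/fs/file-nr
    check_usage file-handles "$allocated" "$maximum"
fi
```

Run from cron or a monitoring agent, a WARN line gives time to raise nf_conntrack_max or fs.file-max before packets or connections are dropped.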
Best Practices (10 Items)
Layered optimization: TCP → IP → NIC → Application, validate each layer.
Enable BBR by default on production (kernel 4.9+).
Allow buffer auto‑adjustment via tcp_moderate_rcvbuf.
Force connection reuse: configure connection pools, enable Nginx keepalive.
Balance NIC interrupts across CPUs to avoid single‑core bottlenecks.
Monitor four key metrics: QPS, P99 latency, retransmission rate, soft‑IRQ CPU share.
Validate each tuning step with wrk/ab load tests against a baseline.
Adjust parameters incrementally (1‑3 at a time) and observe stability.
Control TIME_WAIT via tcp_tw_reuse but keep fin_timeout reasonable.
For container environments, additionally tune conntrack, iptables rule count, and overlay MTU.
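Validating each step against the baseline comes down to a percent‑change calculation per metric. The sketch below uses awk for the floating‑point arithmetic. `pct_change` is a hypothetical helper, and the inputs are the baseline and post‑tuning figures quoted in this guide.

```shell
#!/bin/sh
# pct_change BEFORE AFTER -> signed percent change with one decimal place
# awk handles the floating-point arithmetic that POSIX shell lacks.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%+.1f\n", (after - before) / before * 100 }'
}

# Figures from the baseline and post-tuning runs in this guide
echo "P99 latency:      $(pct_change 850 420)%"
echo "Retransmit rate:  $(pct_change 1.2 0.3)%"
echo "CPU %sys:         $(pct_change 35 18)%"
```

With these inputs the three lines report -50.6%, -75.0%, and -48.6%. Note that a percent change is measured relative to the baseline, so a metric that triples is a +200% change, not +300%.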
Appendix: Full sysctl Configuration
# /etc/sysctl.d/99-production-tuning.conf
### TCP stack
net.ipv4.tcp_max_syn_backlog = 16384
net.core.somaxconn = 16384
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
### Network layer
net.ipv4.ip_local_port_range = 10000 65000
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.core.netdev_max_backlog = 8192
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
### Memory
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 10
vm.overcommit_memory = 1
### Filesystem
fs.file-max = 2097152
### Security
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
One‑Click Apply Script
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/root/kernel-tuning-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Backup current settings
sysctl -a > "$BACKUP_DIR/sysctl-before.txt"
cp -r /etc/sysctl.d "$BACKUP_DIR/"
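# Load required modules first so their sysctl keys exist when the config is applied
modprobe tcp_bbr
modprobe nf_conntrack
echo "tcp_bbr" > /etc/modules-load.d/bbr.conf
# Apply new config (-f makes curl fail on HTTP errors instead of saving an error page)
curl -fsSO https://example.com/99-production-tuning.conf
mv 99-production-tuning.conf /etc/sysctl.d/
sysctl -p /etc/sysctl.d/99-production-tuning.conf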
# Verify critical parameters
sysctl net.ipv4.tcp_congestion_control | grep -q bbr || { echo "BBR not active"; exit 1; }
sysctl net.ipv4.tcp_tw_reuse | grep -q "= 1" || { echo "tcp_tw_reuse not active"; exit 1; }
echo "Optimization applied, backup saved at $BACKUP_DIR"
echo "Reboot recommended to verify persistence"
Tested on: 2025‑10, RHEL 8.6 / Ubuntu 22.04, Kernel 5.15, 16C 64G.
MaGe Linux Operations