Boost 100k+ Concurrent Linux Systems: 300% QPS Increase & 50% Latency Reduction

This guide walks through step-by-step Linux kernel and network-stack tuning (sysctl tweaks, TCP congestion control, interrupt affinity, memory and disk optimizations) for high-concurrency web, API, and database services. In the author's test environment, QPS rose from 12,000 to 38,000 req/s (about 3.2× the baseline) and P99 latency was cut roughly in half.

MaGe Linux Operations

Linux Kernel and Network Stack Tuning: Boost 100k+ Concurrent Systems QPS by 300% and Reduce Latency by 50%

Applicable Scenarios & Prerequisites

Applicable Scenarios : High‑concurrency Web/API services (QPS 50K+), network‑intensive applications (CDN/Proxy/Gateway), database/cache servers (MySQL/Redis/Kafka), container platforms (Kubernetes node optimization).

Prerequisites : Linux kernel 4.18+ (recommended 5.10+ LTS), root access for sysctl, understanding of current bottlenecks (CPU/Network/Disk), established performance baseline (load test data).

Environment & Version Matrix

Component   Version Requirement       Key Feature Dependency                 Test Environment
OS          RHEL 8+ / Ubuntu 22.04+   -                                      RHEL 8.6 / Ubuntu 22.04
Kernel      5.10+ LTS                 BBR congestion control, eBPF tracing   5.15.0-89
CPU         8C+                       Multi-queue NIC support                16C (Intel Xeon)
Network     10Gbps+                   RSS/RPS/XPS support                    25Gbps NIC
Memory      16GB+                     Conntrack table size                   64GB

Quick Checklist

Step 1: Establish performance baseline (load test & monitoring)

Step 2: TCP stack optimization (connections, buffers, congestion control)

Step 3: Network layer optimization (backlog, conntrack, port range)

Step 4: Interrupt and CPU affinity tuning

Step 5: Memory and file‑descriptor tuning

Step 6: Disk I/O and filesystem tuning

Step 7: Application‑layer coordination (Nginx/JVM)

Step 8: Verify results and persist configuration

Implementation Steps

Step 1: Establish Performance Baseline

Goal : Record pre‑tuning key metrics for comparison.

# System load and CPU
uptime
# Output: load average: 15.2, 12.8, 10.3 (baseline)
mpstat -P ALL 1 10 | tee /tmp/cpu-baseline.txt
# Network throughput and connections
sar -n DEV 1 10 | tee /tmp/network-baseline.txt
ss -s
netstat -s | grep -E "retransmit|timeout"
wrk -t8 -c1000 -d60s --latency http://192.168.1.10:8080/

Baseline Example :

Pre-tuning baseline:
- QPS: 12000 req/s
- P99 latency: 850ms
- Retransmission rate: 1.2%
- CPU %sys: 35%
- TCP ESTABLISHED: 8500

Step 2: TCP Stack Optimization

Goal : Improve TCP connection establishment, data transfer, and teardown efficiency.

TCP Connection Queue Tuning

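# /etc/sysctl.d/10-tcp-tuning.conf
# sysctl config files only treat lines starting with # or ; as comments,
# so keep comments on their own lines rather than after a value
# SYN (half-open) queue length; the default depends on memory, often 1024
net.ipv4.tcp_max_syn_backlog = 16384
# Upper bound on the listen() accept-queue backlog
net.core.somaxconn = 16384
# Fall back to SYN cookies when the SYN queue overflows
net.ipv4.tcp_syncookies = 1
# Fewer SYN-ACK retries (default 5)
net.ipv4.tcp_synack_retries = 2
# Apply the file
sysctl -p /etc/sysctl.d/10-tcp-tuning.conf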

Validate SYN queue overflow :

netstat -s | grep -i "SYNs to LISTEN"
# The counter is cumulative since boot; after tuning it should stop increasing under load
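The same drop counters can be read without netstat. The sketch below assumes the two-line TcpExt layout of /proc/net/netstat (a row of counter names followed by a row of values); the function name tcpext_counter is illustrative and not part of any tool.

```shell
# Print one TcpExt counter by name from /proc/net/netstat (or a saved copy).
# Usage: tcpext_counter ListenOverflows [file]
tcpext_counter() {
  local name="$1" file="${2:-/proc/net/netstat}"
  awk -v name="$name" '
    $1 == "TcpExt:" && !seen {            # first TcpExt line: column names
      for (i = 2; i <= NF; i++) col[$i] = i
      seen = 1
      next
    }
    $1 == "TcpExt:" {                     # second TcpExt line: values
      if (name in col) print $(col[name])
      exit
    }
  ' "$file"
}
```

Reading ListenOverflows and ListenDrops before and after a load test, then comparing the two samples, shows whether the accept queue still overflows.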

TCP Buffer Tuning

# TCP read/write buffers
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_moderate_rcvbuf = 1
# Global socket limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

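Parameter Explanation : tcp_rmem and tcp_wmem each take three values in bytes (min, default, max). The third tcp_rmem value caps the per-socket receive buffer, which bounds the advertised receive window and therefore inbound throughput on high-latency paths. The third tcp_wmem value caps the per-socket send buffer and bounds outbound throughput in the same way. tcp_moderate_rcvbuf enables receive-buffer autotuning within those bounds.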
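A common rule of thumb for sizing the maximum buffer is the bandwidth-delay product (BDP): link speed multiplied by round-trip time. The helper below is a minimal sketch; the name bdp_bytes and the sample link speeds and RTTs are assumptions for illustration, not measurements from the article.

```shell
# Bandwidth-delay product in bytes: link speed (Mbit/s) x round-trip time (ms).
# Usage: bdp_bytes <mbit_per_s> <rtt_ms>
bdp_bytes() {
  # Mbit/s to bytes/s is x 1,000,000 / 8; ms to s is / 1000
  echo $(( $1 * 1000000 / 8 * $2 / 1000 ))
}

bdp_bytes 10000 10    # 10 Gbit/s at 10 ms RTT: 12500000
bdp_bytes 25000 10    # 25 Gbit/s at 10 ms RTT: 31250000
```

Under these assumptions the 16 MiB maximum (16777216 bytes) covers a single flow on a 10 Gbit/s path at 10 ms RTT, but not a 25 Gbit/s path at the same RTT.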

TCP TIME_WAIT Optimization

# Allow new outbound connections to reuse sockets in TIME_WAIT
net.ipv4.tcp_tw_reuse = 1
# Time an orphaned socket may stay in FIN_WAIT_2 (default 60s)
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_max_orphans = 65536
# Apply at runtime
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15

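Note : tcp_tw_recycle was removed in kernel 4.12; use tcp_tw_reuse instead. tcp_tw_reuse only affects outbound connections (for example, a proxy connecting to its backends) and relies on TCP timestamps (net.ipv4.tcp_timestamps = 1, the default).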
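To see whether TIME_WAIT is actually a problem before changing anything, it helps to count sockets per state. This is a small sketch that reads `ss -tan` output on stdin; the function name count_tcp_states is hypothetical.

```shell
# Summarise TCP sockets per state from `ss -tan` output read on stdin.
# The first line of ss output is a header, so it is skipped.
count_tcp_states() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a live host:
#   ss -tan | count_tcp_states
```

A TIME-WAIT count that approaches the size of ip_local_port_range on a proxy or client host is the usual sign that reuse or connection pooling is worth enabling.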

TCP Congestion Control

# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# Switch to BBR (kernel 4.9+)
modprobe tcp_bbr
echo "tcp_bbr" >> /etc/modules-load.d/bbr.conf
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
# Persist settings
cat <<EOF >> /etc/sysctl.d/10-tcp-tuning.conf
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
EOF

BBR vs Cubic :

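Cubic (default): loss-based; it treats packet loss as a congestion signal, so throughput drops on lossy links.

BBR : models bottleneck bandwidth and round-trip time instead of reacting to loss; the article reports 10-25% higher throughput in lossy environments.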

Step 3: Network Layer Optimization

IP Layer Parameters

# Local port range for client connections
net.ipv4.ip_local_port_range = 10000 65000
# Conntrack table size (NAT/firewall); these keys exist only while the nf_conntrack module is loaded
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 600
# ARP cache thresholds
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
sysctl -p /etc/sysctl.d/10-tcp-tuning.conf

Validate conntrack usage :

cat /proc/sys/net/netfilter/nf_conntrack_count   # current usage
cat /proc/sys/net/netfilter/nf_conntrack_max     # limit
dmesg | grep "nf_conntrack: table full"          # overflow warning
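The two counters above can be folded into a single usage percentage for scripting. A minimal sketch, assuming the nf_conntrack module is loaded when no arguments are given; the function name and the 80% threshold in the usage comment are illustrative choices, not values from the article.

```shell
# Integer percentage of the conntrack table in use.
# With no arguments, reads the live counters from /proc.
conntrack_usage_pct() {
  local count max
  count="${1:-$(cat /proc/sys/net/netfilter/nf_conntrack_count)}"
  max="${2:-$(cat /proc/sys/net/netfilter/nf_conntrack_max)}"
  echo $(( count * 100 / max ))
}

# Example: warn when the table is more than 80% full
#   [ "$(conntrack_usage_pct)" -ge 80 ] && echo "conntrack table above 80%"
```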

NIC Multi‑Queue and Interrupt Affinity

# Check NIC queues
ethtool -l eth0
# Increase queues (hardware dependent)
ethtool -L eth0 combined 16
# View interrupt distribution
cat /proc/interrupts | grep eth0
# Manually bind interrupts to CPUs (example IRQ numbers; smp_affinity_list takes CPU indexes, not a bitmask)
echo 0 > /proc/irq/125/smp_affinity_list   # CPU 0
echo 1 > /proc/irq/126/smp_affinity_list   # CPU 1
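The two affinity files take different formats: smp_affinity expects a hexadecimal CPU bitmask, while smp_affinity_list expects CPU indexes or ranges such as 0-3. The helper below, with the hypothetical name cpu_to_mask, sketches the conversion from an index to the bitmask form.

```shell
# Convert a CPU index into the hex bitmask expected by /proc/irq/<n>/smp_affinity.
# Valid for CPU indexes 0-62 with 64-bit shell arithmetic.
cpu_to_mask() {
  printf '%x\n' $(( 1 << $1 ))
}

cpu_to_mask 0    # 1
cpu_to_mask 1    # 2
cpu_to_mask 4    # 10
```

So writing 2 to smp_affinity selects CPU 1, whereas writing 2 to smp_affinity_list selects CPU 2.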

Automation script (irqbalance alternative) :

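#!/bin/bash
# Pin each NIC interrupt to a CPU round-robin.
# Stop irqbalance first, otherwise it will rewrite these settings.
IFACE=eth0
CPU_COUNT=$(nproc)
# The IRQ number is the first column of /proc/interrupts. Per-queue vector names
# are driver dependent (for example eth0-TxRx-0), so verify the match first.
mapfile -t IRQS < <(awk -v ifc="$IFACE" '$NF ~ ifc { sub(":", "", $1); print $1 }' /proc/interrupts)
i=0
for IRQ in "${IRQS[@]}"; do
  CPU=$((i % CPU_COUNT))
  echo "$CPU" > "/proc/irq/$IRQ/smp_affinity_list"
  echo "IRQ $IRQ -> CPU $CPU"
  i=$((i + 1))
done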

RPS/RFS Software Multi‑Queue

# Enable RPS (for NICs with fewer queues than CPUs); ffff is the hex mask for CPUs 0-15
for i in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo ffff > "$i"; done
# Enable RFS (flow steering to application CPU)
sysctl -w net.core.rps_sock_flow_entries=32768
# Per-queue value = rps_sock_flow_entries / number of RX queues (32768 / 16 = 2048)
for i in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do echo 2048 > "$i"; done
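The values ffff and 2048 follow from the 16-CPU, 16-queue test machine. On other hardware they can be derived rather than copied; the sketch below uses two hypothetical helpers, rps_mask and rps_flow_cnt, to show the arithmetic.

```shell
# Hex mask covering CPUs 0..(n-1), for rps_cpus. Valid for up to 32 CPUs;
# larger masks must be written as comma-separated 32-bit groups.
rps_mask() {
  printf '%x\n' $(( (1 << $1) - 1 ))
}

# Per-queue flow count: global rps_sock_flow_entries divided by RX queue count.
rps_flow_cnt() {
  echo $(( $1 / $2 ))
}

rps_mask 16              # ffff
rps_flow_cnt 32768 16    # 2048
```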

Step 4: Interrupt & CPU Affinity

Goal : Reduce context switches and soft‑interrupt overhead.

Soft‑Interrupt Monitoring

mpstat -P ALL 1 5   # watch %soft column (<10% ideal)
cat /proc/softirqs   # NET_RX per‑CPU counts
watch -n1 'cat /proc/softirqs | grep NET_RX'

Application Process CPU Binding

# Check current affinity
taskset -cp <PID>
# nginx.conf: bind 8 Nginx workers to CPUs 0-7, one mask per worker
# (with worker_processes auto, use worker_cpu_affinity auto instead)
worker_processes 8;
worker_cpu_affinity 00000001 00000010 00000100 00001000 00010000 00100000 01000000 10000000;

Step 5: Memory & File Descriptor Tuning

Memory Management

# Dirty page writeback
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.dirty_writeback_centisecs = 100
# Overcommit policy (Redis/MongoDB)
vm.overcommit_memory = 1
# Swappiness (reduce swapping)
vm.swappiness = 10
cat <<EOF >> /etc/sysctl.d/20-memory.conf
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 10
EOF
sysctl -p /etc/sysctl.d/20-memory.conf

File Descriptor Limits

# System-wide limit
fs.file-max = 2097152
# Per‑process limits (limits.conf)
cat <<EOF >> /etc/security/limits.conf
* soft nofile 100000
* hard nofile 100000
root soft nofile 100000
root hard nofile 100000
EOF
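# Current shell only; systemd services need LimitNOFILE= in the unit file instead of limits.conf
ulimit -n 100000
# Verify usage
cat /proc/sys/fs/file-nr   # allocated, allocated-but-unused, max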
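Because limits.conf is applied at login through PAM, a running daemon may still have the old limit. One way to check is to read the limit the process actually has. This is a sketch, assuming the standard /proc/<pid>/limits layout; the function name nofile_soft_limit is hypothetical.

```shell
# Print the soft "Max open files" limit of a process from its /proc limits file.
# Usage: nofile_soft_limit <pid>   (or pass a saved limits file as 2nd argument)
nofile_soft_limit() {
  local file="${2:-/proc/$1/limits}"
  awk '/^Max open files/ { print $4 }' "$file"
}

# Example: check every Nginx process (pgrep is part of procps)
#   for pid in $(pgrep nginx); do echo "$pid $(nofile_soft_limit "$pid")"; done
```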

Step 6: Disk I/O & Filesystem

I/O Scheduler

# Current scheduler
cat /sys/block/sda/queue/scheduler   # e.g., [mq-deadline] kyber bfq none
# For SSD, use none (named noop on pre-5.0 kernels that still use the legacy block layer)
echo none > /sys/block/sda/queue/scheduler
# Persist via udev rule
cat <<EOF > /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
EOF

Filesystem Mount Options (XFS example)

# /etc/fstab entry (noatime already implies nodiratime; XFS removed the nobarrier option in kernel 4.19)
/dev/sda1 /data xfs defaults,noatime 0 0
# Remount to apply
mount -o remount /data
mount | grep /data

Step 7: Application‑Layer Coordination

Nginx Configuration

# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
worker_rlimit_nofile 100000;
worker_cpu_affinity auto;

events {
  use epoll;
  worker_connections 10000;
  multi_accept on;
}

http {
  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 65;
  keepalive_requests 1000;
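  upstream backend {
    server 192.168.1.21:8080;
    # Upstream keepalive only takes effect when the proxying location also sets
    # proxy_http_version 1.1; and proxy_set_header Connection "";
    keepalive 256;
    keepalive_requests 1000;
    keepalive_timeout 60s;
  }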
}

Java JVM Tuning

# JVM start options
JAVA_OPTS="-Xms8g -Xmx8g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+ParallelRefProcEnabled \
  -XX:+UnlockExperimentalVMOptions \
  -XX:G1NewSizePercent=30 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+DisableExplicitGC \
  -Djava.net.preferIPv4Stack=true"
# HikariCP pool example
spring.datasource.hikari.maximum-pool-size=200
spring.datasource.hikari.minimum-idle=20
spring.datasource.hikari.connection-timeout=5000

Step 8: Verify Results & Persist Configuration

Performance Validation

# Repeat the same wrk test after tuning
wrk -t8 -c1000 -d60s --latency http://192.168.1.10:8080/
# Expected results after tuning:
# QPS: 38000 req/s  (+217%, about 3.2x the baseline)
# P99 latency: 420ms (-50%)
# Retransmission rate: 0.3% (-75%)
# CPU %sys: 18% (-48%)
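Percentage deltas are easy to misstate when comparing runs by hand, so it can help to compute them. The helper below is a minimal sketch with a hypothetical name, pct_change, applied to the baseline and tuned figures quoted in this article.

```shell
# Signed percentage change from a baseline value to a tuned value.
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%+.1f%%\n", (after - before) / before * 100 }'
}

pct_change 12000 38000   # QPS: +216.7%
pct_change 850 420       # P99 latency (ms): -50.6%
```

A change of +216.7% is the same result as 3.17 times the baseline throughput.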

Persist Configuration Checks

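# Files under /etc/sysctl.d are loaded at boot already; do not overwrite /etc/sysctl.conf with them.
# Reload every sysctl config file in boot order to confirm they all apply cleanly
sysctl --system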
# Verify services start on boot
systemctl list-unit-files | grep -E "nginx|network" | grep enabled

Monitoring & Alerting

Key Performance Indicators

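#!/bin/bash
# Real-time monitoring script (perf-monitor.sh)
while true; do
  echo "=== $(date) ==="
  ss -s | grep TCP
  # The spelling of "retransmitted" varies between net-tools releases, so match the common prefix
  netstat -s | grep -E "segments retransmit"
  # Print only the all-CPU average row, not the header or per-CPU rows
  mpstat -P ALL 1 1 | awk '/Average/ && $2 == "all" {print "CPU Soft IRQ: " $8 "%"}'
  sar -n DEV 1 1 | grep eth0 | tail -1 | awk '{print "RX: " $5 " KB/s, TX: " $6 " KB/s"}'
  echo "---"
  sleep 5
done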
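The retransmission rate tracked throughout this guide can also be derived from the kernel counters directly. The sketch below assumes the two-line Tcp layout of /proc/net/snmp (names, then values) and divides RetransSegs by OutSegs; the function name tcp_retrans_pct is hypothetical.

```shell
# Retransmitted segments as a percentage of segments sent, from /proc/net/snmp.
# Counters are cumulative since boot, so this is a lifetime average.
tcp_retrans_pct() {
  awk '
    $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    $1 == "Tcp:" {
      if (("OutSegs" in col) && ("RetransSegs" in col) && $(col["OutSegs"]) > 0)
        printf "%.2f\n", $(col["RetransSegs"]) * 100 / $(col["OutSegs"])
      exit
    }
  ' "${1:-/proc/net/snmp}"
}
```

For a rate over a test window rather than since boot, save a copy of the file before and after the run and compare the differences of the two counters.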

Prometheus Metrics Collection

# node‑exporter metrics of interest
node_netstat_Tcp_RetransSegs          # TCP retransmissions
node_netstat_TcpExt_TCPTimeouts       # TCP timeouts
node_softnet_dropped_total            # Soft‑IRQ drops
node_network_receive_drop_total        # NIC receive drops

Alert Rules :

groups:
- name: kernel_network_alerts
  rules:
  - alert: HighTCPRetrans
    expr: rate(node_netstat_Tcp_RetransSegs[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "TCP retransmission rate abnormal (>100/s)"
  - alert: SoftnetDropped
    expr: rate(node_softnet_dropped_total[1m]) > 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Soft‑IRQ drop indicates kernel network stack overload"

Performance & Capacity Planning

Sample capacity formula:

Theoretical max concurrent outbound connections = local ports × distinct destination IP:port pairs
Sustainable new-connection rate (per second) ≈ theoretical max / average time a port is held (s)
Actual usable concurrency = min(
    theoretical concurrency,
    file-descriptor limit,
    memory limit (≈4KB per connection),
    conntrack table size
)
# Example (16C 64G server):
# Port range: 10000-65000 (55,001 ports)
# Targets: 10 backends
# Theoretical: about 550,000 concurrent outbound connections
# Avg connection: 1s → about 550,000 new connections/s (ignoring TIME_WAIT)
# fd limit: 100,000 → recommend 80,000 concurrent connections
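The example can be reproduced with shell arithmetic. This is a sketch under the article's own figures (port range 10000-65000, 10 backends, roughly 4 KB per connection on a 64 GB host); the helper names port_count and min_of are hypothetical.

```shell
# Number of ports in an inclusive range such as "10000 65000".
port_count() { echo $(( $2 - $1 + 1 )); }

# Smallest of the integer arguments.
min_of() {
  local m="$1" v
  for v in "$@"; do
    if [ "$v" -lt "$m" ]; then m="$v"; fi
  done
  echo "$m"
}

ports=$(port_count 10000 65000)           # 55001
theoretical=$(( ports * 10 ))             # 10 backends: 550010
mem_limit=$(( 64 * 1024 * 1024 / 4 ))     # 64 GB in KB / 4 KB per connection
echo "usable: $(min_of "$theoretical" 100000 "$mem_limit" 1048576)"
```

With these inputs the file-descriptor limit of 100,000 is the binding constraint, which is why the article recommends planning for about 80,000 concurrent connections.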

Security & Compliance

SYN Flood Protection

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_synack_retries = 2
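# Note: -m limit is a single bucket shared by all clients, so size it above the legitimate
# peak new-connection rate (100/s is only an example); -m hashlimit can limit per source IP
iptables -A INPUT -p tcp --dport 80 --syn -m limit --limit 100/s --limit-burst 200 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 --syn -j DROP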

IP Spoofing Protection

net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0

Common Issues & Troubleshooting

Soft‑IRQ CPU >30% : Increase NIC queues or balance interrupts (ethtool, irq affinity, RPS/RFS).

TCP retransmission rate >5% : Switch to BBR, increase buffers, check packet loss.

conntrack table full : Raise nf_conntrack_max and reduce timeout.

Excessive TIME_WAIT : Enable tcp_tw_reuse and use keep‑alive.

Port exhaustion : Expand ip_local_port_range.

File descriptor exhaustion : Raise fs.file-max and per‑process limits.

Best Practices (10 Items)

Layered optimization: TCP → IP → NIC → Application, validate each layer.

Enable BBR by default on production (kernel 4.9+).

Allow buffer auto‑adjustment via tcp_moderate_rcvbuf.

Force connection reuse: configure connection pools, enable Nginx keepalive.

Balance NIC interrupts across CPUs to avoid single‑core bottlenecks.

Monitor four key metrics: QPS, P99 latency, retransmission rate, soft‑IRQ CPU share.

Validate each tuning step with wrk/ab load tests against a baseline.

Adjust parameters incrementally (1‑3 at a time) and observe stability.

Control TIME_WAIT via tcp_tw_reuse but keep fin_timeout reasonable.

For container environments, additionally tune conntrack, iptables rule count, and overlay MTU.

Appendix: Full sysctl Configuration

# /etc/sysctl.d/99-production-tuning.conf
### TCP stack
net.ipv4.tcp_max_syn_backlog = 16384
net.core.somaxconn = 16384
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_max_tw_buckets = 50000
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
### Network layer
net.ipv4.ip_local_port_range = 10000 65000
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.core.netdev_max_backlog = 8192
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
### Memory
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 10
vm.overcommit_memory = 1
### Filesystem
fs.file-max = 2097152
### Security
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1

One‑Click Apply Script

#!/bin/bash
set -euo pipefail
BACKUP_DIR="/root/kernel-tuning-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Backup current settings
sysctl -a > "$BACKUP_DIR/sysctl-before.txt"
cp -r /etc/sysctl.d "$BACKUP_DIR/"
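# Load modules first so the bbr and nf_conntrack keys exist when sysctl runs
modprobe tcp_bbr
modprobe nf_conntrack
echo "tcp_bbr" > /etc/modules-load.d/bbr.conf
echo "nf_conntrack" > /etc/modules-load.d/nf_conntrack.conf
# Apply new config (example.com is a placeholder; -f makes curl fail on HTTP errors)
curl -fsSO https://example.com/99-production-tuning.conf
mv 99-production-tuning.conf /etc/sysctl.d/
sysctl -p /etc/sysctl.d/99-production-tuning.conf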
# Verify critical parameters
sysctl net.ipv4.tcp_congestion_control | grep -q bbr || { echo "BBR not active"; exit 1; }
sysctl net.ipv4.tcp_tw_reuse | grep -q "= 1" || { echo "tcp_tw_reuse not active"; exit 1; }
echo "Optimization applied, backup saved at $BACKUP_DIR"
echo "Reboot recommended to verify persistence"
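After a reboot, one way to confirm persistence is to compare the file against the live kernel values. The sketch below is an assumption-laden illustration: sysctl_conf_pairs and sysctl_drift are hypothetical helper names, and the parser expects comments on their own lines.

```shell
# Print "key value" pairs from a sysctl config file, skipping blank lines and
# lines that start with # or ; and trimming whitespace around "=".
sysctl_conf_pairs() {
  sed -e '/^[[:space:]]*$/d' -e '/^[[:space:]]*[#;]/d' \
      -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
      -e 's/[[:space:]]*=[[:space:]]*/ /' "$1"
}

# Report keys whose live value differs from the file (needs the sysctl binary).
sysctl_drift() {
  sysctl_conf_pairs "$1" | while read -r key want; do
    # Multi-value keys are tab-separated in sysctl output; normalise to spaces
    have=$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
    [ "$have" = "$want" ] || echo "DRIFT $key: want '$want', have '${have:-<missing>}'"
  done
}

# Usage:
#   sysctl_drift /etc/sysctl.d/99-production-tuning.conf
```

No output means every key in the file matches the running kernel.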

Tested on: 2025‑10, RHEL 8.6 / Ubuntu 22.04, Kernel 5.15, 16C 64G.
