
How to Boost Nginx to Over 1 Million QPS: Real‑World High‑Concurrency Tuning

This article presents a step‑by‑step, production‑tested guide for tuning Nginx from roughly 100 k requests per second to over a million, covering worker process settings, TCP tweaks, kernel parameters, caching, compression, SSL optimization, monitoring, and extreme‑case techniques such as DPDK and LuaJIT.

Liangxu Linux

Introduction

Nginx is the dominant reverse proxy and load balancer in micro‑service architectures, but its default settings cannot sustain extreme traffic. Systematic tuning can raise throughput by an order of magnitude.

Performance before and after tuning

Optimization Stage   QPS     Response Time (ms)   CPU Usage   Memory Usage
Default Config       80k     125                  85%         2.1 GB
Basic Tuning         250k    45                   68%         1.8 GB
Deep Tuning          600k    18                   45%         1.5 GB
Extreme Tuning       1.2M+   8                    35%         1.2 GB

Phase 1: Basic configuration tuning

1.1 Worker process tuning

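# Set worker processes based on CPU cores
worker_processes auto;
worker_cpu_affinity auto;        # Pin each worker to its own core
worker_rlimit_nofile 655350;     # Per-worker file-descriptor ceiling; must exceed worker_connections

events {
    worker_connections 65535;    # Counts client and upstream connections alike
    use epoll;                   # Linux epoll model (selected by default on Linux)
    multi_accept on;             # Accept all pending connections per wake-up instead of one
}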

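On a 32-core server, worker_processes auto resolves to 32 workers, the same count as a fixed setting of 32; Nginx does not inspect NUMA topology when choosing it. The ~15 % throughput gain over a plain fixed 32-process setup is more plausibly explained by worker_cpu_affinity auto, which pins each worker to its own core and so reduces CPU migrations and cache misses.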
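
As a rough capacity check for these settings, the connection ceiling is the worker count multiplied by worker_connections, and a reverse proxy spends two connections per request (one to the client, one to the upstream), which halves the client-facing figure. A minimal sketch; the function name max_clients is illustrative and not part of Nginx:

```shell
#!/bin/sh
# max_clients WORKERS CONNECTIONS [proxy]
# Prints the theoretical ceiling on simultaneous client connections.
# Pass "proxy" as the third argument when Nginx forwards to an upstream,
# because each client then also consumes one upstream connection.
max_clients() {
    workers=$1
    conns=$2
    total=$((workers * conns))
    if [ "${3:-}" = "proxy" ]; then
        total=$((total / 2))
    fi
    echo "$total"
}

max_clients 32 65535          # static serving: prints 2097120
max_clients 32 65535 proxy    # reverse proxy:  prints 1048560
```

The result is an upper bound only; file-descriptor limits and memory usually cap the real figure earlier.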

1.2 TCP connection tuning

http {
    tcp_nodelay on;               # Disable Nagle's algorithm to cut small‑packet latency
    tcp_nopush on;                # Send headers and start of file in one packet (takes effect only with sendfile on)
    keepalive_timeout 65;
    keepalive_requests 10000;
    client_max_body_size 20m;
    client_body_buffer_size 128k;
    client_header_buffer_size 4k;
    large_client_header_buffers 8 8k;
}

Phase 2: Kernel parameter tuning

2.1 System‑level tuning

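# /etc/sysctl.conf
# Accept and SYN queue depths (Nginx must also request a deeper queue: listen ... backlog=65535)
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_max_syn_backlog = 65535

# Reuse TIME_WAIT sockets for new outbound (upstream) connections
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Socket buffer ceilings in bytes; tcp_rmem/tcp_wmem are min, default, max
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# System-wide file-handle limit
fs.file-max = 6815744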
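
After running sysctl -p it is worth confirming that the kernel really holds the new values. The sketch below reads them back from /proc/sys; the helper names sysctl_path and check_sysctl are made up for this example, and the check assumes a Linux host with /proc mounted:

```shell
#!/bin/sh
# sysctl_path KEY: map a sysctl key to its file under /proc/sys.
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# check_sysctl KEY WANT: compare the live value with the wanted one.
check_sysctl() {
    path=$(sysctl_path "$1")
    if [ ! -r "$path" ]; then
        echo "SKIP $1 (not readable on this host)"
        return 0
    fi
    # Collapse tabs between multi-value fields (e.g. tcp_rmem) to single spaces.
    have=$(awk '{ $1 = $1; print }' "$path")
    if [ "$have" = "$2" ]; then
        echo "OK   $1 = $have"
    else
        echo "DIFF $1: have '$have', want '$2'"
    fi
}

check_sysctl net.core.somaxconn 65535
check_sysctl net.ipv4.tcp_rmem "4096 87380 67108864"
```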

2.2 User‑level limits

# /etc/security/limits.conf
nginx soft nofile 655350
nginx hard nofile 655350
nginx soft nproc 655350
nginx hard nproc 655350

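Services started by systemd do not read /etc/security/limits.conf, so set the same limits in the Nginx unit (for example in a drop‑in created with systemctl edit nginx):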

[Service]
LimitNOFILE=655350
LimitNPROC=655350
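
To confirm that the limit reached the running process, the effective value can be read from /proc/<pid>/limits. A small sketch; soft_nofile is a hypothetical helper name, and the PID file location in the usage comment is an assumption that varies by distribution:

```shell
#!/bin/sh
# soft_nofile: read the contents of /proc/<pid>/limits on stdin and
# print the soft "Max open files" value (fourth whitespace-separated field).
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Usage against a live master process (PID file path is an assumption):
#   soft_nofile < "/proc/$(cat /run/nginx.pid)/limits"

# Demonstration on a sample line in the kernel's format:
printf 'Max open files            1024                 524288               files\n' | soft_nofile
```

If the printed value is still the distribution default (often 1024), the systemd override has not been applied or the service was not restarted.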

Phase 3: Cache and compression

3.1 Static file cache

location ~* \.(jpg|jpeg|png|gif|ico|css|js|pdf|txt)$ {
    expires 1y;                                     # Emits Expires and Cache-Control: max-age
    add_header Cache-Control "public, immutable";
    gzip_static on;                                 # Serve pre-built .gz files (needs ngx_http_gzip_static_module)
    access_log off;
    sendfile on;
    sendfile_max_chunk 1m;
}
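
gzip_static on only helps when a compressed sibling file already exists next to each asset. One way to produce those files at deploy time is sketched below; the function name precompress and the choice of extensions are assumptions for illustration:

```shell
#!/bin/sh
# precompress DIR: write a .gz sibling next to every text asset under DIR
# so that gzip_static can serve it without compressing on each request.
precompress() {
    find "$1" -type f \( -name '*.css' -o -name '*.js' -o -name '*.txt' \) |
    while IFS= read -r f; do
        gzip -9 -c "$f" > "$f.gz"
        # Keep both files on the same modification time, as the Nginx
        # documentation recommends for gzip_static.
        touch -r "$f" "$f.gz"
    done
}

# Usage (document root is an example path):
#   precompress /usr/share/nginx/html
```

Already-compressed formats such as JPEG or PNG are left out on purpose, since gzip gains little on them.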

3.2 Dynamic compression

# Gzip
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_comp_level 6;
gzip_types text/plain text/css text/xml text/javascript application/json application/javascript application/xml+rss application/atom+xml;

# Brotli (requires compiled module)
brotli on;
brotli_comp_level 6;
brotli_types text/plain text/css application/json application/javascript;

In the author's tests, enabling Brotli reduced transferred data by ~25 % and improved page‑load speed by ~35 %; actual gains depend on the content mix and on client support.

Phase 4: Advanced techniques

4.1 Upstream connection pool

upstream backend {
    least_conn;                     # Least‑connections load‑balancing
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
    keepalive 300;
    keepalive_requests 1000;
    keepalive_timeout 60s;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering on;
        proxy_buffer_size 128k;
        proxy_buffers 8 128k;
        proxy_busy_buffers_size 256k;
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }
}

4.2 SSL/TLS performance tuning

ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers off;
ssl_session_cache shared:SSL:50m;
ssl_session_timeout 1d;
ssl_session_tickets off;
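ssl_stapling on;              # Needs a resolver directive to reach the OCSP responder
ssl_stapling_verify on;
ssl_buffer_size 4k;           # Smaller TLS records lower time to first byte
ssl_engine qat;               # Hardware offload; a main‑context directive that requires the QAT OpenSSL engine to be installed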

4.3 Memory pool optimization

connection_pool_size 512;
request_pool_size 8k;
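large_client_header_buffers 8 16k;                 # Supersedes the 8 8k value from section 1.2; declare it once per context
proxy_temp_file_write_size 256k;
proxy_temp_path /var/cache/nginx/proxy_temp 1 2;   # Two‑level directory hierarchy (levels= and keys_zone= belong to proxy_cache_path)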

Phase 5: Monitoring and testing

5.1 Key metrics

location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

Example Bash script that pushes metrics to StatsD:

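#!/bin/bash
# nginx_monitor.sh: push stub_status values to a local StatsD agent
curl -s http://localhost/nginx_status | awk '
/^Active connections/ {print "active_connections " $3}
# The three counters are on the line after the "server accepts handled requests" header
/^ *[0-9]+ +[0-9]+ +[0-9]+/ {print "accepts " $1; print "handled " $2; print "requests " $3}
/^Reading/ {print "reading " $2; print "writing " $4; print "waiting " $6}
' | while read -r metric value; do
    # -w1 keeps nc from lingering on the UDP socket after sending
    echo "nginx.${metric}:${value}|g" | nc -u -w1 localhost 8125
done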
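
The accepts, handled, and requests values are cumulative counters, so a QPS figure comes from the difference between two samples divided by the sampling interval. A minimal sketch of that arithmetic; the function name qps and the sample numbers are illustrative:

```shell
#!/bin/sh
# qps PREV CURR SECONDS: average requests per second between two
# samples of the stub_status "requests" counter (integer division).
qps() {
    echo $(( ($2 - $1) / $3 ))
}

# Two samples taken 10 seconds apart:
qps 31070465 31670465 10    # prints 60000
```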

5.2 Benchmark commands

# wrk (high‑concurrency test)
wrk -t32 -c1000 -d60s --latency http://your-domain.com/

# ApacheBench for comparison
ab -n 100000 -c 1000 http://your-domain.com/

# wrk with custom Lua script
wrk -t32 -c1000 -d60s -s post.lua http://your-domain.com/api
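
The post.lua file passed to wrk above is not shown in the article. A minimal version might look like the sketch below, which uses wrk's documented wrk.method, wrk.body, and wrk.headers fields; the JSON payload is a placeholder and write_post_lua is a helper name invented for this example:

```shell
#!/bin/sh
# write_post_lua PATH: create a minimal wrk script that turns every
# request into a JSON POST.
write_post_lua() {
    cat > "$1" <<'EOF'
-- Placeholder payload: replace it with a body your API accepts.
wrk.method = "POST"
wrk.body   = '{"ping": 1}'
wrk.headers["Content-Type"] = "application/json"
EOF
}

# Usage:
#   write_post_lua post.lua
#   wrk -t32 -c1000 -d60s -s post.lua http://your-domain.com/api
```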

Extreme optimization: breaking the million‑QPS barrier

6.1 Kernel bypass with DPDK

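# Build Nginx with DPDK support. Stock Nginx's configure script has no DPDK
# option, so this flag assumes a DPDK-enabled port (F-Stack is one such project)
./configure --with-dpdk=/path/to/dpdk
# Bind network IRQs to specific CPU cores. Values are hexadecimal CPU bitmasks;
# IRQ numbers 24 and 25 are examples (look up the real ones in /proc/interrupts)
echo 2 > /proc/irq/24/smp_affinity   # mask 0x2 = CPU 1
echo 4 > /proc/irq/25/smp_affinity   # mask 0x4 = CPU 2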
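
The values written to smp_affinity are hexadecimal bitmasks in which bit N selects CPU N, which is why 2 and 4 above map to CPUs 1 and 2. A small helper for deriving the mask of a single CPU is sketched here; cpu_mask is a made-up name, and the simple form is only meant for CPU indexes 0 to 31:

```shell
#!/bin/sh
# cpu_mask CPU: print the hexadecimal smp_affinity bitmask that selects
# one zero-based CPU index (intended for CPUs 0-31).
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

cpu_mask 1    # prints 2
cpu_mask 2    # prints 4
cpu_mask 5    # prints 20
```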

6.2 JIT compilation via OpenResty (LuaJIT)

location /api {
    content_by_lua_block {
        ngx.header.content_type = "application/json"
        ngx.say('{"status": "ok"}')
    }
}

6.3 Zero‑copy and asynchronous I/O

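sendfile on;              # Zero‑copy transfer from page cache to socket
aio threads;              # Offload blocking reads to a thread pool (build with --with-threads)
aio_write on;             # Also use aio when writing temporary files
directio 4m;              # Bypass the page cache for files of 4 MB and larger
directio_alignment 512;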

Real‑world case: e‑commerce flash‑sale system

Goal: 1.5 M QPS, <50 ms latency, 99.99 % availability. The configuration below handled a peak of 1.68 M QPS with an average response time of 32 ms.

upstream seckill_backend {
    hash $remote_addr consistent;
    server 10.0.1.10:8080 weight=3 max_conns=3000;
    server 10.0.1.11:8080 weight=3 max_conns=3000;
    server 10.0.1.12:8080 weight=4 max_conns=4000;
    keepalive 1000;
}

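limit_req_zone $binary_remote_addr zone=seckill:100m rate=100r/s;
limit_conn_zone $binary_remote_addr zone=conn_seckill:100m;

# Cache zone used by proxy_cache below; the path and sizes are assumed values
proxy_cache_path /var/cache/nginx/seckill levels=1:2 keys_zone=seckill_cache:100m max_size=1g;

server {
    location /seckill {
        limit_req zone=seckill burst=200 nodelay;
        limit_conn conn_seckill 10;
        proxy_cache seckill_cache;
        proxy_cache_valid 200 302 5s;
        proxy_cache_valid 404 1m;
        proxy_http_version 1.1;            # Required for upstream keepalive
        proxy_set_header Connection "";    # Do not forward "Connection: close"
        proxy_connect_timeout 1s;
        proxy_send_timeout 2s;
        proxy_read_timeout 2s;
        proxy_pass http://seckill_backend;
    }
}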
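
The 100m size of the seckill zone determines how many client addresses can be tracked at once. The Nginx documentation for limit_req_zone gives roughly 8,000 states per megabyte on 64-bit platforms (128 bytes per state), so the capacity can be estimated as below; zone_states is an illustrative name and the figure is approximate:

```shell
#!/bin/sh
# zone_states MEGABYTES: approximate number of client states a
# limit_req zone of that size holds on a 64-bit platform
# (about 8,000 states per megabyte).
zone_states() {
    echo $(( $1 * 8000 ))
}

zone_states 100    # prints 800000
```

When the zone fills up, Nginx evicts the least recently used state, and if it still cannot allocate a new one the request is rejected with an error, so the zone should be sized for the expected number of distinct clients during a sale.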

Optimization checklist

Set worker_processes to CPU core count (use auto)

Enable epoll event model

Adjust worker_connections to match expected concurrent connections

Fine‑tune keepalive timeout and request limits

Configure appropriate buffer sizes for client body, headers, and proxy buffers

Apply kernel TCP tweaks (somaxconn, netdev_max_backlog, etc.)

Raise file‑descriptor limits in /etc/security/limits.conf and systemd service

Use upstream keepalive and least_conn load‑balancing

Enable gzip and brotli compression

Optimize SSL/TLS (session cache, hardware offload, modern ciphers)

Expose stub_status and push metrics to a monitoring system

Run regular load‑test cycles (wrk, ab, custom scripts)

Common pitfalls and solutions

Too many worker_processes

Running more workers than CPU cores causes high context‑switch overhead. Use worker_processes auto so Nginx starts one worker per available core.

Ignoring upstream keepalive

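Without keepalive, each request creates a new backend connection, increasing latency. Configure keepalive, keepalive_requests, and appropriate timeouts in the upstream block, and set proxy_http_version 1.1 with an empty Connection header so that connections are actually reused.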

SSL handshake overhead

HTTPS can be slower than HTTP if session caching or hardware acceleration is disabled. Enable ssl_session_cache, ssl_engine qat, and tune cipher suites.

Log I/O bottleneck

Verbose access logs under high traffic can saturate disk I/O. Switch to buffered logging (the buffer= and flush= parameters of access_log) or disable unnecessary logs.

Future trends

HTTP/3 & QUIC support

listen 443 quic reuseport;
listen 443 ssl http2;
add_header Alt-Svc 'h3=":443"; ma=86400';

Edge‑computing integration and AI‑driven auto‑tuning are expected to further boost Nginx performance.

Summary

Systematic Nginx optimization can raise throughput from roughly 100 k QPS to over a million. The essential steps are layered configuration tuning, continuous monitoring, rigorous benchmark validation, and adapting settings to specific business scenarios.
