Zero‑Downtime Nginx Load Balancing: Build a 99.99% HA Architecture
This guide walks through designing and implementing a highly available Nginx load‑balancing solution—covering applicable scenarios, prerequisites, environment matrix, step‑by‑step configuration of Nginx, Keepalived, SSL termination, health checks, monitoring, performance tuning, security hardening, troubleshooting, and a concise list of best‑practice recommendations.
Applicable Scenarios & Prerequisites
Applicable scenarios:
Web/API services requiring SLA > 99.9% availability
Horizontal scaling of backend services (3+ instances)
Zero‑downtime updates and automatic failover
Multi‑datacenter / multi‑availability‑zone deployments
Prerequisites:
2+ Nginx nodes (active‑passive or multi‑master)
Keepalived 1.4+ or cloud SLB/ELB
At least three backend instances with health‑check support
RHEL 8+ / Ubuntu 22.04+ with root access
Environment & Version Matrix
| Component | Version Requirement | Key Feature Dependency | Minimum Resources |
| --- | --- | --- | --- |
| Nginx | 1.20+ | upstream/stream modules; active health checks via third-party module | 2C4G (supports 10K QPS) |
| Keepalived | 1.4+ | VRRP protocol, script-based health checks | 512M RAM |
| OS | RHEL 8+ / Ubuntu 22.04+ | ip_vs (IPVS) kernel module | - |
| Kernel | 4.18+ | nf_conntrack optimization | - |
| HAProxy (optional) | 2.6+ | health checks, session persistence | 2C4G |
Quick Checklist
Step 1: Plan HA topology (active/passive, dual‑master, multi‑layer LB)
Step 2: Install and configure Nginx basic load balancing
Step 3: Configure upstream health checks and failover
Step 4: Deploy Keepalived for VIP floating
Step 5: Set session persistence and load‑balancing algorithm
Step 6: Implement SSL/TLS termination and certificate management
Step 7: Set up monitoring, alerts, and log collection
Step 8: Conduct failure‑switch drills and rollback scripts
Implementation Steps
Step 1: Plan High‑Availability Architecture
Goal: Design an HA topology that meets the business SLA.
Architecture Comparison
| Solution | Availability | Cost | Complexity | Applicable Scenario |
| --- | --- | --- | --- | --- |
| Single Nginx | 95% | Low | Low | Dev/Test environments |
| Active-Passive + VIP | 99.9% | Medium | Medium | Small-to-medium production |
| Dual-Master + DNS | 99.95% | Medium | Medium | Multi-datacenter deployments |
| Cloud LB + Nginx | 99.99% | High | Low | Recommended for cloud environments |
| Multi-layer LB | 99.99% | High | High | Large clusters (10K+ QPS) |
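The availability percentages above translate directly into downtime budgets; a quick script makes the comparison concrete:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Allowed downtime in minutes per year for a given availability level."""
    minutes_per_year = 365 * 24 * 60  # 525600
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (95.0, 99.9, 99.95, 99.99):
    print(f"{pct}% availability -> {downtime_per_year(pct):.1f} min/year downtime budget")
```

Moving from active-passive (99.9%) to the 99.99% targets in this guide shrinks the annual budget from roughly 8.8 hours to under an hour, which is why the later steps invest so heavily in automated failover.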
Recommended Architecture (Active‑Passive + VIP)
                 ┌───────────────────┐
                 │ VIP 192.168.1.100 │
                 └─────────┬─────────┘
                           │ (Keepalived VRRP)
             ┌─────────────┴─────────────┐
             │                           │
      ┌──────▼──────┐             ┌──────▼──────┐
      │ Nginx Master│             │ Nginx Backup│
      │ 192.168.1.10│             │ 192.168.1.11│
      │  (MASTER)   │             │  (BACKUP)   │
      └──────┬──────┘             └──────┬──────┘
             │                           │
             └─────────────┬─────────────┘
                           │
      ┌────────────┐ ┌────────────┐ ┌────────────┐
      │ Backend-1  │ │ Backend-2  │ │ Backend-3  │
      │   :8080    │ │   :8080    │ │   :8080    │
      └────────────┘ └────────────┘ └────────────┘

Step 2: Install and Configure Nginx
Goal: Deploy a standardized Nginx service.
RHEL/CentOS Installation
# Configure the official Nginx repo (quote EOF so $releasever/$basearch are not expanded by the shell)
cat <<'EOF' > /etc/yum.repos.d/nginx.repo
[nginx-stable]
name=nginx stable repo
baseurl=http://nginx.org/packages/rhel/$releasever/$basearch/
gpgcheck=1
enabled=1
gpgkey=https://nginx.org/keys/nginx_signing.key
EOF
# Install Nginx
yum install -y nginx
# Enable and start
systemctl enable --now nginx
systemctl status nginx

Ubuntu/Debian Installation
# Add the official Nginx APT repository
apt update
apt install -y curl gnupg2 ca-certificates lsb-release
curl -fsSL https://nginx.org/keys/nginx_signing.key | gpg --dearmor > /usr/share/keyrings/nginx-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] http://nginx.org/packages/ubuntu $(lsb_release -cs) nginx" > /etc/apt/sources.list.d/nginx.list
# Install
apt update
apt install -y nginx
# Start
systemctl enable --now nginx

Verification:
nginx -v # nginx version: nginx/1.24.0
nginx -t # configuration syntax is ok
curl -I http://localhost # HTTP/1.1 200 OK

Step 3: Configure Upstream & Health Checks
Goal: Enable load balancing and automatic removal of unhealthy backends.
Basic Upstream Configuration
# /etc/nginx/conf.d/upstream.conf
upstream backend_pool {
    # Load-balancing algorithm (default: round-robin)
    # least_conn;           # least connections
    # ip_hash;              # session persistence
    # hash $request_uri;    # URL hash

    # Backend servers
    server 192.168.1.21:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.22:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.23:8080 weight=3 max_fails=3 fail_timeout=10s backup;  # standby

    # Connection pool to backends
    keepalive 128;              # keep up to 128 idle connections
    keepalive_requests 1000;    # max 1000 requests per connection
    keepalive_timeout 60s;      # idle timeout
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://backend_pool;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }

    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}

Parameter Explanation:
- max_fails=3: mark the server as down after 3 consecutive failures
- fail_timeout=10s: keep the server out of rotation for 10 seconds, then retry
- backup: receives traffic only when all primary servers are down
- weight: distributes traffic in proportion to the assigned weight
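Nginx's default algorithm for the weighted servers above is smooth weighted round-robin. This Python sketch mimics the selection logic (illustrative, not the actual Nginx source, and ignoring the backup flag) to show how weights 5/5/3 spread requests:

```python
from collections import Counter

class Peer:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight   # configured weight
        self.current = 0       # running selection score

def pick(peers):
    """One round of Nginx-style smooth weighted round-robin."""
    total = sum(p.weight for p in peers)
    for p in peers:
        p.current += p.weight          # every peer gains its weight
    best = max(peers, key=lambda p: p.current)
    best.current -= total              # the winner pays the total back
    return best.name

peers = [Peer("192.168.1.21", 5), Peer("192.168.1.22", 5), Peer("192.168.1.23", 3)]
picks = [pick(peers) for _ in range(13)]  # 13 = sum of weights
print(Counter(picks))  # each peer is chosen exactly its weight's worth: 5, 5, 3
```

Over any window of sum(weights) requests, each server receives exactly its weight's share, and the "smooth" variant interleaves them rather than sending bursts to one backend.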
Reload Configuration:
nginx -t && nginx -s reload
# Verify upstream status (requires stub_status module)
curl http://localhost/nginx_status

Active Health Check (Open-Source Module)
# Download Nginx source and health‑check module
cd /tmp
wget http://nginx.org/download/nginx-1.24.0.tar.gz
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git
# Compile and install
tar xf nginx-1.24.0.tar.gz
cd nginx-1.24.0
patch -p1 < /tmp/nginx_upstream_check_module/check_1.20.1+.patch
./configure \
--prefix=/etc/nginx \
--add-module=/tmp/nginx_upstream_check_module \
--with-http_ssl_module \
--with-http_v2_module \
--with-stream
make && make install

Health-Check Configuration:
upstream backend_pool {
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;

    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    location /upstream_status {
        check_status;
        access_log off;
    }
}

Step 4: Deploy Keepalived for VIP Floating
Goal: Automatic Nginx master-backup failover via VRRP.
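The failover this step configures reduces to VRRP priority arithmetic: each node advertises a priority, a failing vrrp_script adds its (negative) weight, and the highest priority holds the VIP. A simplified Python sketch of that election (real VRRP adds timers, preemption, and tie-breaking by IP address):

```python
def effective_priority(base: int, check_ok: bool, check_weight: int = -20) -> int:
    """Priority a node advertises: base, plus the script weight when its check fails."""
    return base if check_ok else base + check_weight

def vip_holder(nodes: dict) -> str:
    """nodes maps name -> (base_priority, check_ok). Highest effective priority wins."""
    return max(nodes, key=lambda n: effective_priority(*nodes[n]))

# Healthy cluster: master (priority 100) beats backup (priority 90)
print(vip_holder({"master": (100, True), "backup": (90, True)}))   # master

# Nginx check fails on the master: 100 - 20 = 80 < 90, so the VIP floats
print(vip_holder({"master": (100, False), "backup": (90, True)}))  # backup
```

This is why the script weight (-20) must exceed the master/backup priority gap (100 - 90 = 10): a smaller penalty would never trigger failover.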
Install Keepalived
# RHEL/CentOS
yum install -y keepalived
# Ubuntu/Debian
apt install -y keepalived
# Enable service
systemctl enable --now keepalived

Master Node Configuration
# /etc/keepalived/keepalived.conf (Master: 192.168.1.10)
global_defs {
    router_id NGINX_MASTER
    vrrp_skip_check_adv_addr
    # vrrp_strict    # leave disabled: strict RFC mode is incompatible with PASS authentication
    vrrp_garp_interval 0
    vrrp_gna_interval 0
}

vrrp_script check_nginx {
    script "/etc/keepalived/check_nginx.sh"
    interval 2
    weight -20
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass SecurePass2024
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        check_nginx
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault  "/etc/keepalived/notify.sh FAULT"
}

Backup Node Configuration
# /etc/keepalived/keepalived.conf (Backup: 192.168.1.11)
# Same as master, except:
#   router_id NGINX_BACKUP
#   state BACKUP
#   priority 90

Health-Check Script
#!/bin/bash
# /etc/keepalived/check_nginx.sh
pgrep nginx > /dev/null 2>&1 || exit 1
nc -z localhost 80 > /dev/null 2>&1 || exit 1
curl -sf http://localhost/health > /dev/null 2>&1 || exit 1
exit 0

chmod +x /etc/keepalived/check_nginx.sh

State-Change Notification Script
#!/bin/bash
# /etc/keepalived/notify.sh
TYPE=$1
DATE=$(date '+%Y-%m-%d %H:%M:%S')
case $TYPE in
    MASTER) echo "$DATE - Transition to MASTER" >> /var/log/keepalived-state.log ;;
    BACKUP) echo "$DATE - Transition to BACKUP" >> /var/log/keepalived-state.log ;;
    FAULT)  echo "$DATE - Fault detected"       >> /var/log/keepalived-state.log ;;
esac

chmod +x /etc/keepalived/notify.sh

Start Keepalived and verify the VIP:
systemctl restart keepalived
ip addr show eth0 | grep 192.168.1.100 # should show VIP on master
curl -I http://192.168.1.100   # HTTP/1.1 200 OK

Step 5: Session Persistence & Load-Balancing Algorithms
Goal: Choose the appropriate algorithm for the business case.
IP Hash (session persistence)
upstream backend_pool {
    ip_hash;
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}

Consistent URL / Cookie Hash
upstream backend_pool {
    hash $request_uri consistent;          # URL-based
    # hash $cookie_jsessionid consistent;  # cookie-based
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}

Least Connections
upstream backend_pool {
    least_conn;
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}

Weighted Round-Robin (default)
upstream backend_pool {
    server 192.168.1.21:8080 weight=5;  # 50%
    server 192.168.1.22:8080 weight=3;  # 30%
    server 192.168.1.23:8080 weight=2;  # 20%
}

Step 6: SSL/TLS Offloading & Certificate Management
Goal: Terminate HTTPS at Nginx and forward plain HTTP to backends.
Obtain Let’s Encrypt Certificate
# Install Certbot
yum install -y certbot python3-certbot-nginx # RHEL/CentOS
apt install -y certbot python3-certbot-nginx # Ubuntu/Debian
# Request certificates (auto‑configure Nginx)
certbot --nginx -d api.example.com -d www.example.com
# Verify files
ls -l /etc/letsencrypt/live/api.example.com/
# fullchain.pem privkey.pem chain.pem cert.pem
# Test renewal
certbot renew --dry-run

Nginx HTTPS Configuration
# /etc/nginx/conf.d/ssl.conf
server {
    listen 80;
    server_name api.example.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate     /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256';
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_session_tickets off;

    ssl_stapling on;
    ssl_stapling_verify on;
    resolver 8.8.8.8 8.8.4.4 valid=300s;

    add_header Strict-Transport-Security "max-age=63072000" always;
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;

    location / {
        proxy_pass http://backend_pool;
        # (other proxy settings as in Step 3)
    }
}

Validate SSL:
openssl s_client -connect api.example.com:443 -servername api.example.com
# Check certificate dates
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates

Step 7: Monitoring, Alerts & Log Collection
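A natural first monitoring item is the certificate itself. This sketch computes days to expiry from the notAfter field printed by the openssl check above (the 30-day threshold is this guide's recommendation, not an openssl default):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Parse openssl's notAfter date format, e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    return (expires - now).days

# Fixed "now" so the example is reproducible
now = datetime(2030, 5, 10, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun  1 12:00:00 2030 GMT", now)
print(remaining, "days left;", "RENEW NOW" if remaining < 30 else "ok")
```

In production you would feed it `openssl x509 -noout -enddate` output and wire the result into the alerting set up below.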
Prometheus + Node Exporter
# Install nginx‑prometheus‑exporter
wget https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v0.11.0/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
tar xf nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
cp nginx-prometheus-exporter /usr/local/bin/
# Enable stub_status
cat <<EOF > /etc/nginx/conf.d/status.conf
server {
listen 8080;
location /stub_status {
stub_status;
access_log off;
allow 127.0.0.1;
deny all;
}
}
EOF
nginx -s reload
# Start exporter
nohup nginx-prometheus-exporter -nginx.scrape-uri=http://localhost:8080/stub_status &
# Verify metric
curl http://localhost:9113/metrics | grep nginx_

Key PromQL Queries
# Request rate
rate(nginx_http_requests_total[1m])

# P99 backend latency
histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m]))

# 5xx error rate (%)
rate(nginx_http_requests_total{status=~"5.."}[1m]) / rate(nginx_http_requests_total[1m]) * 100

# Upstream active connections
nginx_upstream_server_connections{state="active"}

# Note: with stub_status alone the exporter only exposes connection/request totals;
# the status-label, latency-histogram, and upstream queries above require
# nginx-module-vts, NGINX Plus, or a log-based exporter.

JSON Log Format
# /etc/nginx/nginx.conf (http block)
log_format json_combined escape=json
    '{'
        '"time":"$time_iso8601",'
        '"remote_addr":"$remote_addr",'
        '"request":"$request",'
        '"status":$status,'
        '"body_bytes_sent":$body_bytes_sent,'
        '"request_time":$request_time,'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr"'
    '}';

access_log /var/log/nginx/access.log json_combined;

Log Analysis Examples
# Top 10 request URIs
jq -r '.request' /var/log/nginx/access.log | awk '{print $2}' | sort | uniq -c | sort -rn | head -10

# Average response time
jq -r '.request_time' /var/log/nginx/access.log | awk '{sum+=$1; cnt++} END {print sum/cnt}'

# 5xx errors
jq -r 'select(.status >= 500) | .request' /var/log/nginx/access.log

Monitoring & Alerting
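When jq is not available, the same log questions can be answered in a few lines of Python (the two sample log lines here are fabricated for the example):

```python
import json

sample_log = """\
{"time":"2024-01-01T00:00:00","remote_addr":"1.2.3.4","request":"GET /api/users HTTP/1.1","status":200,"body_bytes_sent":512,"request_time":0.12,"upstream_response_time":"0.10","upstream_addr":"192.168.1.21:8080"}
{"time":"2024-01-01T00:00:01","remote_addr":"1.2.3.5","request":"GET /api/orders HTTP/1.1","status":502,"body_bytes_sent":150,"request_time":0.30,"upstream_response_time":"0.28","upstream_addr":"192.168.1.22:8080"}"""

entries = [json.loads(line) for line in sample_log.splitlines()]

# Average response time across all requests
avg_rt = sum(e["request_time"] for e in entries) / len(entries)

# Requests that returned a 5xx status
errors_5xx = [e["request"] for e in entries if e["status"] >= 500]

print(f"avg request_time: {avg_rt:.2f}s")
print("5xx requests:", errors_5xx)
```

Pointing `entries` at the real access.log (one JSON object per line, as produced by the json_combined format above) gives the same numbers as the jq pipelines.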
Grafana Dashboards
NGINX Prometheus Exporter (ID: 12708)
Core Panels
Requests/sec grouped by status code
Upstream response time (P50/P95/P99)
Active / waiting connections
Upstream server health (up/down)
Alert Rules (prometheus‑alerts.yaml)
# prometheus-alerts.yaml
- alert: NginxDown
  expr: up{job="nginx"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Nginx instance {{ $labels.instance }} unreachable"

- alert: Nginx5xxHigh
  expr: rate(nginx_http_requests_total{status=~"5.."}[1m]) / rate(nginx_http_requests_total[1m]) > 0.05
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Nginx 5xx error rate >5%"

- alert: UpstreamDown
  expr: nginx_upstream_server_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Backend server {{ $labels.server }} unhealthy"

- alert: NginxHighLatency
  expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Nginx P99 latency >1s"

Performance & Capacity Planning
Benchmarking
# wrk test (2C4G Nginx)
wrk -t4 -c1000 -d60s --latency http://192.168.1.100/
# Expected: 15000+ req/s, P99 latency <50ms, 10 MB/s transfer
# SSL test (throughput drop ~20‑30%)
wrk -t4 -c1000 -d60s --latency https://api.example.com/

System Tuning
# /etc/sysctl.d/99-nginx-tuning.conf
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
fs.file-max = 2097152

# Apply the settings
sysctl -p /etc/sysctl.d/99-nginx-tuning.conf

# Nginx worker file-descriptor limit (also adjust /etc/security/limits.conf)
ulimit -n 100000

Worker Configuration
# /etc/nginx/nginx.conf (main context)
worker_processes auto;
worker_rlimit_nofile 100000;

events {
    use epoll;
    worker_connections 10000;
    multi_accept on;
}

Capacity Estimation
# Concurrent connections = worker_processes × worker_connections
# QPS ≈ worker_connections / average_response_time (s)
# Example (4C8G, avg response 0.1 s):
#   Concurrent connections: 4 × 10000 = 40000
#   Theoretical QPS: 10000 / 0.1 = 100000
#   Recommended safe QPS: ~30000 (run at ~30% of theoretical capacity)

Security & Compliance
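The capacity arithmetic above can be scripted for quick what-if checks (the numbers mirror the example; the 30% safety factor is this guide's recommendation):

```python
def capacity(workers: int, worker_connections: int, avg_response_s: float, safety: float = 0.3):
    """Rough Nginx capacity model: concurrency, theoretical QPS, and a derated safe QPS."""
    concurrent = workers * worker_connections
    theoretical_qps = worker_connections / avg_response_s
    safe_qps = theoretical_qps * safety
    return concurrent, theoretical_qps, safe_qps

concurrent, qps, safe = capacity(workers=4, worker_connections=10000, avg_response_s=0.1)
print(concurrent, qps, safe)  # 40000 100000.0 30000.0
```

This is a back-of-envelope model only; real limits depend on payload sizes, TLS overhead, and kernel tuning, so validate with wrk as shown above.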
DDoS Protection
http {
    limit_req_zone  $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
}

server {
    location /api/ {
        limit_req  zone=api_limit burst=20 nodelay;
        limit_conn conn_limit 10;  # max 10 connections per IP
    }
}

Access Control
# IP whitelist for admin area
location /admin/ {
    allow 192.168.1.0/24;
    deny  all;
}

# Basic authentication for private endpoints
location /private/ {
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/nginx/.htpasswd;
}

Common Issues & Troubleshooting
| Symptom | Diagnostic Command | Possible Root Cause | Quick Fix | Permanent Fix |
| --- | --- | --- | --- | --- |
| VIP does not respond to ping | `ip addr \| grep vip` | Keepalived not running | Restart keepalived | Check VRRP config and firewall |
| All traffic goes to a single backend | `curl -I vip \| grep X-Upstream` | ip_hash configuration | Switch to least_conn | Use Redis session sharing for stateful apps |
| 502 Bad Gateway | `tail -f /var/log/nginx/error.log` | Backend service down or network issue | Verify backend availability (`ss -tulnp`) | Fix backend / firewall rules |
| SSL handshake failure | `openssl s_client -connect host:443` | Expired certificate or protocol mismatch | Renew certificate | Configure automatic renewal (certbot timer) |
| Upstream timeout | `grep "upstream timed out" error.log` | Slow backend processing | Increase proxy_read_timeout | Optimize backend or make it asynchronous |
| Keepalived split-brain | Both nodes hold the VIP simultaneously | Network partition or multicast failure | Disable preempt mode | Use unicast VRRP and add monitoring alerts |
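Several rows of the table above can be pre-triaged straight from the error log. The sketch below matches real Nginx error-log messages; the cause/fix mapping itself is illustrative:

```python
import re

# Real Nginx error-log substrings -> likely cause and fix from the table above
PATTERNS = [
    (r"connect\(\) failed \(111: Connection refused\)", "backend down -> check the service and ss -tulnp"),
    (r"upstream timed out",                             "slow backend -> raise proxy_read_timeout or optimize"),
    (r"no live upstreams",                              "all backends failed -> check health checks / max_fails"),
]

def triage(line: str) -> str:
    """Return the first matching diagnosis for an error-log line."""
    for pattern, cause in PATTERNS:
        if re.search(pattern, line):
            return cause
    return "unknown -> inspect manually"

log = ("2024/01/01 00:00:00 [error] 123#0: *45 upstream timed out "
       "(110: Connection timed out) while reading response header from upstream")
print(triage(log))
```

Running it over `grep error /var/log/nginx/error.log` during an incident gives a first-pass classification before a human digs in.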
Best Practices (10 Items)
1. Multi-layer health checks: Keepalived → Nginx → backend self-check.
2. Configure connection pools: keepalive ≥ backend_instances × 32.
3. Three-stage timeouts: connect 5s, send 10s, read 10s, to avoid slow requests blocking workers.
4. JSON log format: eases ELK/Loki ingestion; include request_time and upstream_response_time.
5. SSL performance tweaks: enable http2, ssl_session_cache, and OCSP stapling.
6. Layered rate limiting: global + per-API + business-logic limits.
7. Canary releases: use split_clients or weighted upstreams to route a fraction of traffic.
8. Monitoring triad: QPS, 5xx rate, and P99 latency; set alert thresholds from historical P95.
9. Automatic certificate renewal: certbot renew via systemd timer; alert 30 days before expiry.
10. Regular failover drills: monthly tests of Nginx failover, backend outage, and certificate-expiry scenarios.
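The "monitoring triad" recommendation above suggests deriving alert thresholds from historical P95 rather than guessing. A small sketch of that calculation (the latency samples and the 1.5× margin are made-up examples):

```python
def percentile(samples, p):
    """Nearest-rank-style percentile; precise enough for threshold setting."""
    ordered = sorted(samples)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# Fabricated historical P99 latency samples, in seconds
history = [0.08, 0.09, 0.10, 0.11, 0.12, 0.12, 0.13, 0.15, 0.18, 0.40]

p95 = percentile(history, 95)
threshold = p95 * 1.5  # alert only when well above the historical worst case
print(f"P95={p95}s -> alert threshold {threshold:.2f}s")
```

Recomputing the threshold periodically from a sliding window keeps alerts aligned with actual traffic behavior instead of a stale hand-picked number.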