
Zero‑Downtime Nginx Load Balancing: Build a 99.99% HA Architecture

This guide walks through designing and implementing a highly available Nginx load‑balancing solution—covering applicable scenarios, prerequisites, environment matrix, step‑by‑step configuration of Nginx, Keepalived, SSL termination, health checks, monitoring, performance tuning, security hardening, troubleshooting, and a concise list of best‑practice recommendations.


Applicable Scenarios & Prerequisites

Applicable scenarios:

Web/API services requiring SLA > 99.9% availability

Horizontal scaling of backend services (3+ instances)

Zero‑downtime updates and automatic failover

Multi‑datacenter / multi‑availability‑zone deployments

Prerequisites:

2+ Nginx nodes (active‑passive or multi‑master)

Keepalived 1.4+ or cloud SLB/ELB

At least three backend instances with health‑check support

RHEL 8+ / Ubuntu 22.04+ with root access
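A quick preflight check can catch an unsupported base image before any installation work begins. This is a sketch that only validates the kernel version against the 4.18+ requirement listed below; extend it with Nginx and Keepalived checks as needed.

```shell
#!/bin/sh
# Preflight sketch: check the running kernel against the 4.18+ requirement.
# Assumes a Linux host; the version-parsing handles "5.14.0-362.el9" style strings.
req_major=4
req_minor=18
kver=$(uname -r | cut -d- -f1)    # e.g. 5.14.0
major=${kver%%.*}
rest=${kver#*.}
minor=${rest%%.*}
if [ "$major" -gt "$req_major" ] || { [ "$major" -eq "$req_major" ] && [ "$minor" -ge "$req_minor" ]; }; then
    echo "kernel $kver: OK"
else
    echo "kernel $kver: too old (need ${req_major}.${req_minor}+)"
fi
```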

Environment & Version Matrix

| Component | Version Requirement | Key Feature Dependency | Minimum Resources |
|---|---|---|---|
| Nginx | 1.20+ | upstream health‑check, stream module | 2C4G (supports 10K QPS) |
| Keepalived | 1.4+ | VRRP protocol, script health‑check | 512M RAM |
| OS | RHEL 8+ / Ubuntu 22.04+ | ip_vs (IPVS) kernel module | – |
| Kernel | 4.18+ | nf_conntrack optimization | – |
| HAProxy (optional) | 2.6+ | health check, session persistence | 2C4G |

Quick Checklist

Step 1: Plan HA topology (active/passive, dual‑master, multi‑layer LB)

Step 2: Install and configure Nginx basic load balancing

Step 3: Configure upstream health checks and failover

Step 4: Deploy Keepalived for VIP floating

Step 5: Set session persistence and load‑balancing algorithm

Step 6: Implement SSL/TLS termination and certificate management

Step 7: Set up monitoring, alerts, and log collection

Step 8: Conduct failure‑switch drills and rollback scripts

Implementation Steps

Step 1: Plan High‑Availability Architecture

Goal: Design an HA topology that meets the business SLA.

Architecture Comparison

| Solution | Availability | Cost | Complexity | Applicable Scenario |
|---|---|---|---|---|
| Single Nginx | 95% | Low | Low | Dev/Test environments |
| Active‑Passive + VIP | 99.9% | Medium | Medium | Small‑to‑medium production |
| Dual‑Master + DNS | 99.95% | Medium | Medium | Multi‑datacenter deployments |
| Cloud LB + Nginx | 99.99% | High | Low | Recommended for cloud environments |
| Multi‑layer LB | 99.99% | High | High | Large clusters (10K+ QPS) |

Recommended Architecture (Active‑Passive + VIP)

                 ┌────────────────────┐
                 │ VIP 192.168.1.100  │
                 └─────────┬──────────┘
                           │  (Keepalived VRRP)
             ┌─────────────┴─────────────┐
             │                           │
    ┌────────▼────────┐         ┌────────▼────────┐
    │  Nginx Master   │         │  Nginx Backup   │
    │  192.168.1.10   │         │  192.168.1.11   │
    │    (MASTER)     │         │    (BACKUP)     │
    └────────┬────────┘         └────────┬────────┘
             │                           │
             └──────────────┬────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
 ┌──────▼───────┐   ┌───────▼──────┐   ┌────────▼─────┐
 │  Backend‑1   │   │  Backend‑2   │   │  Backend‑3   │
 │   :8080      │   │   :8080      │   │   :8080      │
 └──────────────┘   └──────────────┘   └──────────────┘

Step 2: Install and Configure Nginx

Goal: Deploy a standardized Nginx service.

RHEL/CentOS Installation

# Install the official Nginx repo (quote EOF so $releasever/$basearch
# reach the file unexpanded instead of being emptied by the shell)
cat <<'EOF' > /etc/yum.repos.d/nginx.repo
[nginx-stable]
name=nginx stable repo
baseurl=https://nginx.org/packages/rhel/$releasever/$basearch/
gpgcheck=1
enabled=1
gpgkey=https://nginx.org/keys/nginx_signing.key
EOF

# Install Nginx
yum install -y nginx

# Enable and start
systemctl enable --now nginx
systemctl status nginx

Ubuntu/Debian Installation

# Add official PPA
apt update
apt install -y curl gnupg2 ca-certificates lsb-release
curl -fsSL https://nginx.org/keys/nginx_signing.key | gpg --dearmor > /usr/share/keyrings/nginx-archive-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] http://nginx.org/packages/ubuntu $(lsb_release -cs) nginx" > /etc/apt/sources.list.d/nginx.list

# Install
apt update
apt install -y nginx

# Start
systemctl enable --now nginx

Verification:

nginx -v               # nginx version: nginx/1.24.0
nginx -t               # configuration syntax is ok
curl -I http://localhost   # HTTP/1.1 200 OK

Step 3: Configure Upstream & Health Checks

Goal: Enable load balancing and automatic removal of unhealthy backends.

Basic Upstream Configuration

# /etc/nginx/conf.d/upstream.conf
upstream backend_pool {
    # load‑balancing algorithm (default round‑robin)
    # least_conn;          # least connections
    # ip_hash;            # session persistence
    # hash $request_uri;   # URL hash

    # Backend servers
    server 192.168.1.21:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.22:8080 weight=5 max_fails=3 fail_timeout=10s;
    server 192.168.1.23:8080 weight=3 max_fails=3 fail_timeout=10s backup; # standby

    # Connection pool
    keepalive 128;               # keep 128 idle connections
    keepalive_requests 1000;     # max 1000 requests per connection
    keepalive_timeout 60s;       # idle timeout
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://backend_pool;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }

    location /health {
        access_log off;
        add_header Content-Type text/plain;
        return 200 "healthy\n";
    }
}

Parameter explanation:

max_fails=3: mark the server down after 3 consecutive failures

fail_timeout=10s: retry the server after 10 seconds

backup: used only when all primary servers fail

weight: traffic distributed in proportion to weight

Reload Configuration:

nginx -t && nginx -s reload
# Verify upstream status (requires stub_status module)
curl http://localhost/nginx_status
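To see which backend served each request (useful later when checking traffic distribution), the chosen upstream address can be echoed in a response header. A sketch using the built‑in `$upstream_addr` variable; the `X-Upstream` header name is a local convention, not an Nginx built‑in:

```nginx
# Inside the location / block from above (sketch):
location / {
    proxy_pass http://backend_pool;
    # Expose the chosen backend for debugging; remove in production
    # or restrict to internal clients.
    add_header X-Upstream $upstream_addr always;
}
```

Then `curl -sI http://localhost/ | grep X-Upstream` shows which backend handled the request.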

Active Health Check (Open‑Source Module)

# Download Nginx source and health‑check module
cd /tmp
wget http://nginx.org/download/nginx-1.24.0.tar.gz
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git

# Compile and install
tar xf nginx-1.24.0.tar.gz
cd nginx-1.24.0
patch -p1 < /tmp/nginx_upstream_check_module/check_1.20.1+.patch

./configure \
    --prefix=/etc/nginx \
    --add-module=/tmp/nginx_upstream_check_module \
    --with-http_ssl_module \
    --with-http_v2_module \
    --with-stream
make && make install

Health‑Check Configuration:

upstream backend_pool {
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    location /upstream_status {
        check_status;
        access_log off;
    }
}

Step 4: Deploy Keepalived for VIP Floating

Goal: Automatic Nginx master‑backup failover via VRRP.

Install Keepalived

# RHEL/CentOS
yum install -y keepalived

# Ubuntu/Debian
apt install -y keepalived

# Enable service
systemctl enable --now keepalived

Master Node Configuration

# /etc/keepalived/keepalived.conf (Master: 192.168.1.10)
global_defs {
    router_id NGINX_MASTER
    vrrp_skip_check_adv_addr
    # vrrp_strict    # omit: strict mode rejects VRRPv2 authentication and would drop the VIP
    vrrp_garp_interval 0
    vrrp_gna_interval 0
}

vrrp_script check_nginx {
    script "/etc/keepalived/check_nginx.sh"
    interval 2
    weight -20
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass SecurePass2024
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script { check_nginx }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault  "/etc/keepalived/notify.sh FAULT"
}

Backup Node Configuration

# /etc/keepalived/keepalived.conf (Backup: 192.168.1.11)
# Same as master, but:
# router_id NGINX_BACKUP
# state BACKUP
# priority 90

Health‑Check Script

#!/bin/bash
# /etc/keepalived/check_nginx.sh
pgrep nginx > /dev/null 2>&1 || exit 1
nc -z localhost 80 > /dev/null 2>&1 || exit 1
curl -sf http://localhost/health > /dev/null 2>&1 || exit 1
exit 0

Make the script executable:

chmod +x /etc/keepalived/check_nginx.sh

State‑Change Notification Script

#!/bin/bash
# /etc/keepalived/notify.sh
TYPE=$1
DATE=$(date '+%Y-%m-%d %H:%M:%S')
case $TYPE in
    MASTER) echo "$DATE - Transition to MASTER" >> /var/log/keepalived-state.log ;;
    BACKUP) echo "$DATE - Transition to BACKUP" >> /var/log/keepalived-state.log ;;
    FAULT)  echo "$DATE - Fault detected" >> /var/log/keepalived-state.log ;;
esac

Make the script executable:

chmod +x /etc/keepalived/notify.sh

Start Keepalived and verify VIP:

systemctl restart keepalived
ip addr show eth0 | grep 192.168.1.100   # should show VIP on master
curl -I http://192.168.1.100          # HTTP/1.1 200 OK

Step 5: Session Persistence & Load‑Balancing Algorithms

Goal: Choose the appropriate algorithm for the business case.

IP Hash (session persistence)

upstream backend_pool {
    ip_hash;
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}

Consistent URL / Cookie Hash

upstream backend_pool {
    hash $request_uri consistent;   # URL‑based
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
    # hash $cookie_jsessionid consistent;  # cookie‑based
}

Least Connections

upstream backend_pool {
    least_conn;
    server 192.168.1.21:8080;
    server 192.168.1.22:8080;
    server 192.168.1.23:8080;
}

Weighted Round‑Robin (default)

upstream backend_pool {
    server 192.168.1.21:8080 weight=5;   # 50%
    server 192.168.1.22:8080 weight=3;   # 30%
    server 192.168.1.23:8080 weight=2;   # 20%
}

Step 6: SSL/TLS Offloading & Certificate Management

Goal: Terminate HTTPS at Nginx and forward plain HTTP to backends.

Obtain Let’s Encrypt Certificate

# Install Certbot
yum install -y certbot python3-certbot-nginx   # RHEL/CentOS
apt install -y certbot python3-certbot-nginx   # Ubuntu/Debian

# Request certificates (auto‑configure Nginx)
certbot --nginx -d api.example.com -d www.example.com

# Verify files
ls -l /etc/letsencrypt/live/api.example.com/
# fullchain.pem  privkey.pem  chain.pem  cert.pem

# Test renewal
certbot renew --dry-run
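Renewal only rewrites the files on disk; Nginx keeps serving the old certificate until it reloads. One option is certbot's --deploy-hook (`certbot renew --deploy-hook "nginx -s reload"`); another is a systemd drop‑in on the distro's certbot unit. A sketch; the unit is typically certbot.service on Ubuntu/Debian and certbot-renew.service on RHEL, so adjust the directory name:

```ini
# /etc/systemd/system/certbot.service.d/reload-nginx.conf (sketch;
# rename the directory to match your distro's certbot unit)
[Service]
ExecStartPost=/usr/sbin/nginx -s reload
```

Run `systemctl daemon-reload` after adding the drop‑in.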

Nginx HTTPS Configuration

# /etc/nginx/conf.d/ssl.conf
server {
    listen 80;
    server_name api.example.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256';
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_session_tickets off;
    ssl_stapling on;
    ssl_stapling_verify on;
    resolver 8.8.8.8 8.8.4.4 valid=300s;
    add_header Strict-Transport-Security "max-age=63072000" always;
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;

    location / {
        proxy_pass http://backend_pool;
        # (other proxy settings as in Step 3)
    }
}

Validate SSL:

openssl s_client -connect api.example.com:443 -servername api.example.com
# Check certificate dates
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates

Step 7: Monitoring, Alerts & Log Collection

Prometheus + Node Exporter

# Install nginx‑prometheus‑exporter
wget https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v0.11.0/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
tar xf nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
cp nginx-prometheus-exporter /usr/local/bin/

# Enable stub_status
cat <<EOF > /etc/nginx/conf.d/status.conf
server {
    listen 8080;
    location /stub_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
EOF
nginx -s reload

# Start exporter
nohup nginx-prometheus-exporter -nginx.scrape-uri=http://localhost:8080/stub_status &

# Verify metric
curl http://localhost:9113/metrics | grep nginx_
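Running the exporter under nohup only lasts until the next reboot; a minimal systemd unit is more durable. A sketch, with the binary path and scrape URI matching the commands above:

```ini
# /etc/systemd/system/nginx-prometheus-exporter.service (sketch)
[Unit]
Description=Nginx Prometheus Exporter
After=network.target nginx.service

[Service]
ExecStart=/usr/local/bin/nginx-prometheus-exporter \
    -nginx.scrape-uri=http://localhost:8080/stub_status
Restart=on-failure
User=nobody

[Install]
WantedBy=multi-user.target
```

Then `systemctl daemon-reload && systemctl enable --now nginx-prometheus-exporter`.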

Key PromQL Queries

Note: the stub_status‑based exporter exposes only basic connection and request counters; the status‑code and latency queries below assume NGINX Plus metrics or a log‑derived exporter.

# Request rate
rate(nginx_http_requests_total[1m])

# P99 backend latency
histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m]))

# 5xx error rate
rate(nginx_http_requests_total{status=~"5.."}[1m]) / rate(nginx_http_requests_total[1m]) * 100

# Upstream active connections
nginx_upstream_server_connections{state="active"}

JSON Log Format

# /etc/nginx/nginx.conf (http block)
log_format json_combined escape=json '{'
    '"time":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"request":"$request",'
    '"status":$status,'
    '"body_bytes_sent":$body_bytes_sent,'
    '"request_time":$request_time,'
    '"upstream_response_time":"$upstream_response_time",'
    '"upstream_addr":"$upstream_addr"'
'}';
access_log /var/log/nginx/access.log json_combined;

Log Analysis Examples

# Top 10 request URIs
cat /var/log/nginx/access.log | jq -r '.request' | awk '{print $2}' | sort | uniq -c | sort -rn | head -10

# Average response time
cat /var/log/nginx/access.log | jq -r '.request_time' | awk '{sum+=$1; cnt++} END {print sum/cnt}'

# 5xx errors
cat /var/log/nginx/access.log | jq -r 'select(.status>=500) | .request'
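The pipelines above assume jq is installed; the same average can be computed with awk alone. A sketch against inline sample records (hypothetical values) instead of the live log:

```shell
#!/bin/sh
# Average request_time from JSON access-log lines (sketch; uses inline
# sample records rather than /var/log/nginx/access.log, and plain awk
# instead of jq).
printf '%s\n' \
  '{"request_time":0.120,"status":200}' \
  '{"request_time":0.080,"status":200}' \
  '{"request_time":0.400,"status":502}' |
awk -F'"request_time":' '{split($2, a, ","); sum += a[1]; n++}
    END {printf "avg_request_time=%.3f\n", sum / n}'
# Prints: avg_request_time=0.200
```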

Monitoring & Alerting

Grafana Dashboards

Nginx Prometheus Exporter (ID: 12708)

Core Panels

Requests/sec grouped by status code

Upstream response time (P50/P95/P99)

Active / waiting connections

Upstream server health (up/down)

Alert Rules (prometheus‑alerts.yaml)

# NginxDown
- alert: NginxDown
  expr: up{job="nginx"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Nginx instance {{ $labels.instance }} unreachable"

# Nginx5xxHigh
- alert: Nginx5xxHigh
  expr: rate(nginx_http_requests_total{status=~"5.."}[1m]) / rate(nginx_http_requests_total[1m]) > 0.05
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Nginx 5xx error rate >5%"

# UpstreamDown
- alert: UpstreamDown
  expr: nginx_upstream_server_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Backend server {{ $labels.server }} unhealthy"

# NginxHighLatency
- alert: NginxHighLatency
  expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Nginx P99 latency >1s"

Performance & Capacity Planning

Benchmarking

# wrk test (2C4G Nginx)
wrk -t4 -c1000 -d60s --latency http://192.168.1.100/
# Expected: 15000+ req/s, P99 latency <50ms, 10 MB/s transfer

# SSL test (throughput drop ~20‑30%)
wrk -t4 -c1000 -d60s --latency https://api.example.com/

System Tuning

# /etc/sysctl.d/99-nginx-tuning.conf
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
fs.file-max = 2097152

sysctl -p /etc/sysctl.d/99-nginx-tuning.conf

# Nginx worker limits
ulimit -n 100000
# (also adjust /etc/security/limits.conf)

Worker Configuration

# /etc/nginx/nginx.conf (worker section)
worker_processes auto;
worker_rlimit_nofile 100000;

events {
    use epoll;
    worker_connections 10000;
    multi_accept on;
}

Capacity Estimation

# Concurrent connections = worker_processes × worker_connections
# QPS ≈ worker_connections / average_response_time(s)
# Example (4C8G, avg response 0.1s):
#   Concurrent: 4 × 10000 = 40000
#   Theoretical QPS: 10000 / 0.1 = 100000
#   Recommended safe QPS: ~30000 (plan at ~30% of theoretical to leave headroom)
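The arithmetic above can be wrapped into a reusable sketch (integer math; average response time in milliseconds; the 30% planning factor matches the comment above):

```shell
#!/bin/sh
# Capacity estimate sketch: plug in worker settings and average response time.
workers=4        # worker_processes
conns=10000      # worker_connections
avg_rt_ms=100    # average response time, milliseconds

concurrent=$((workers * conns))
theoretical_qps=$((conns * 1000 / avg_rt_ms))
safe_qps=$((theoretical_qps * 30 / 100))   # plan at ~30% of theoretical

echo "concurrent=$concurrent theoretical_qps=$theoretical_qps safe_qps=$safe_qps"
# Prints: concurrent=40000 theoretical_qps=100000 safe_qps=30000
```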

Security & Compliance

DDoS Protection

http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
}

server {
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        limit_conn conn_limit 10;   # max 10 connections per IP
    }
}
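By default, requests rejected by these limits return 503, which is indistinguishable from a backend outage on a dashboard. Returning 429 separates throttling from real failures; a sketch using the standard limit_req_status/limit_conn_status directives:

```nginx
server {
    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        limit_conn conn_limit 10;
        # Distinguish throttling (429) from real 5xx errors in monitoring
        limit_req_status 429;
        limit_conn_status 429;
    }
}
```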

Access Control

# IP whitelist for admin area
location /admin/ {
    allow 192.168.1.0/24;
    deny all;
}

# Basic authentication for private endpoints
location /private/ {
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/nginx/.htpasswd;
}

Common Issues & Troubleshooting

| Symptom | Diagnostic Command | Possible Root Cause | Quick Fix | Permanent Fix |
|---|---|---|---|---|
| VIP cannot be pinged | `ip addr \| grep 192.168.1.100` | Keepalived not running | Restart keepalived | Check VRRP config and firewall |
| All traffic goes to a single backend | `curl -I <vip> \| grep X-Upstream` | ip_hash configuration | Switch to least_conn | Use Redis session sharing for stateful apps |
| 502 Bad Gateway | `tail -f /var/log/nginx/error.log` | Backend service down or network issue | Verify backend availability (`ss -tulnp`) | Fix backend / firewall rules |
| SSL handshake failure | `openssl s_client -connect host:443` | Expired certificate or protocol mismatch | Renew certificate | Configure automatic renewal (certbot timer) |
| Upstream timeout | `grep "upstream timed out" error.log` | Slow backend processing | Increase proxy_read_timeout | Optimize backend or make it asynchronous |
| Keepalived split‑brain (both nodes hold the VIP) | `ip addr` on both nodes | Network partition or multicast failure | Disable preempt mode | Use unicast VRRP and add monitoring alerts |

Best Practices (10 Items)

1. Multi‑layer health checks: Keepalived → Nginx → backend self‑check.

2. Configure connection pools: keepalive ≥ backend_instances × 32.

3. Three‑stage timeouts: connect 5s, send 10s, read 10s to avoid slow‑request blocking.

4. JSON log format: eases ELK/Loki ingestion; include request_time and upstream_response_time.

5. SSL performance tweaks: enable http2, ssl_session_cache, and OCSP stapling.

6. Layered rate limiting: global + API‑level + business‑logic limits.

7. Canary releases: use split_clients or weighted upstreams to route a fraction of traffic.

8. Monitoring triad: QPS, 5xx rate, P99 latency; set alert thresholds from historical P95.

9. Automatic certificate renewal: certbot renew + systemd timer; alert 30 days before expiry.

10. Regular failover drills: monthly tests covering Nginx switchover, backend outage, and certificate expiry.
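Practice 7 mentions split_clients; a minimal canary sketch routing roughly 5% of clients to a separate pool (the canary_pool name, backend address, and percentage are illustrative):

```nginx
# http block (sketch): route ~5% of clients to the canary pool,
# keyed on client address so each client lands consistently.
split_clients "${remote_addr}" $backend_tier {
    5%      canary_pool;
    *       backend_pool;
}

upstream canary_pool {
    server 192.168.1.24:8080;
}

server {
    listen 80;
    location / {
        # proxy_pass with a variable resolves the named upstream at runtime
        proxy_pass http://$backend_tier;
    }
}
```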

Tags: Monitoring, High Availability, SSL, Keepalived
Written by Ops Community, a leading IT operations community where professionals share and grow together.