Master High‑Availability Nginx Clusters: Load Balancing & Failover Guide

This comprehensive guide walks you through designing, configuring, and optimizing a production‑grade Nginx cluster with 99.99% availability, covering architecture principles, load‑balancing algorithms, Keepalived failover, monitoring, performance tuning, failure‑injection drills, and advanced automation techniques.

Building a High‑Availability Nginx Cluster: Complete Guide

Why High‑Availability?

Picture a critical alert at 3 a.m.: the primary server is down, and every minute of outage costs the business money. A high‑availability architecture exists to keep a single failure like that from turning into downtime.

1. High‑Availability Architecture Design

1.1 What is true high‑availability?

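Availability: ≥99.9% (less than 8.76 h of downtime per year); the 99.99% this guide aims for allows only about 52.6 min per year

Recovery time objective (RTO): < 5 min

Recovery point objective (RPO): 0 (no data loss)

Performance after failover: ≥95% of normal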
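
The availability figures above translate directly into a yearly downtime budget. A minimal sketch of that arithmetic follows; the function name is illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        hours = downtime_budget_hours(target)
        print(f"{target}% allows {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the failover time in section 3 matters so much for the 99.99% target.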

1.2 Typical Architecture

Internet Users
               |
          [DNS Round‑Robin]
               |
        VIP1:80   VIP2:80
          |          |
   [Keepalived]  [Keepalived]
          |          |
   [Nginx‑Master] [Nginx‑Backup]
          |          |
   ━━━━━━━━━━━━━━━━━
          |   |   |
        [Web1] [Web2] [Web3]

This layout removes the single point of failure at the proxy tier, but real‑world deployments often need more elaborate designs.

2. Nginx Load‑Balancing Optimization

2.1 Algorithm Selection

Choose the algorithm that matches the workload:

Static resources – round‑robin (default configuration)

API services – least_conn

Session persistence – ip_hash

High‑performance computing – least_time header (available only in NGINX Plus)
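
To make the least_conn choice more concrete, here is a minimal sketch of weighted least-connections selection. It is a simplified model, not Nginx's implementation: the Server class and pick_least_conn function are hypothetical names, and ties simply go to the first server in the list, whereas Nginx breaks ties with weighted round-robin.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int = 1
    active: int = 0  # connections currently in flight


def pick_least_conn(servers: list[Server]) -> Server:
    """Pick the server with the fewest active connections relative to its weight."""
    if not servers:
        raise ValueError("no servers configured")
    # A server with weight 5 may carry five times the connections of a
    # weight-1 server before it is considered equally loaded.
    return min(servers, key=lambda s: s.active / s.weight)


pool = [
    Server("web1", weight=5, active=10),  # load ratio 2.0
    Server("web2", weight=3, active=3),   # load ratio 1.0
    Server("web3", weight=2, active=4),   # load ratio 2.0
]
chosen = pick_least_conn(pool)
chosen.active += 1  # the new request is now in flight on the chosen server
```

Because the decision uses live connection counts, slow requests on one backend steer new traffic elsewhere, which is why the list above suggests it for API services with uneven response times.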

2.2 Practical Configuration

# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 65535;

events {
    use epoll;
    worker_connections 20480;
    multi_accept on;
}

http {
    # Performance basics
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    # Connection tuning
    keepalive_timeout 65;
    keepalive_requests 100;
    client_body_timeout 10;
    client_header_timeout 10;
    send_timeout 10;

    # Buffer tuning
    client_body_buffer_size 128k;
    client_max_body_size 10m;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
    output_buffers 1 32k;
    postpone_output 1460;

    # File cache
    open_file_cache max=1000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    # Gzip compression
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml text/javascript application/json application/javascript application/xml+rss application/rss+xml application/atom+xml image/svg+xml text/x-js text/x-cross-domain-policy application/x-font-ttf application/x-font-opentype application/vnd.ms-fontobject image/x-icon;
    gzip_disable "msie6";

    # Proxy cache
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:100m max_size=10g inactive=60m use_temp_path=off;

    upstream backend_cluster {
        least_conn;
        server 192.168.1.10:8080 max_fails=3 fail_timeout=30s weight=5;
        server 192.168.1.11:8080 max_fails=3 fail_timeout=30s weight=3;
        server 192.168.1.12:8080 max_fails=3 fail_timeout=30s weight=2;
        server 192.168.1.20:8080 backup;
        keepalive 32;
    }

    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    server {
        listen 80 default_server reuseport;
        server_name _;

        # Static file cache
        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
            expires 1y;
            add_header Cache-Control "public, immutable";
            access_log off;
        }

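        # Liveness endpoint polled by the Keepalived health check (section 3.4)
        location = /health {
            access_log off;
            return 200 "ok\n";
        }

        # stub_status for local metrics collection (section 4.3)
        location = /nginx_status {
            stub_status;
            allow 127.0.0.1;
            deny all;
        }

        # API cache and proxy
        location /api/ {
            proxy_pass http://backend_cluster;
            # Both lines are required for the upstream "keepalive 32" pool to be used
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_cache my_cache;
            proxy_cache_key "$scheme$request_method$host$request_uri";
            proxy_cache_valid 200 302 10m;
            proxy_cache_valid 404 1m;
            proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
            proxy_cache_background_update on;
            proxy_cache_lock on;
            add_header X-Cache-Status $upstream_cache_status;
        }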
    }
}
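
The max_fails=3 fail_timeout=30s parameters in the upstream block drive Nginx's passive health checking. The sketch below is a simplified model of that behavior, assuming the documented semantics (a server is skipped for fail_timeout once max_fails failures occur within a fail_timeout window). The PassiveHealth class is hypothetical and leaves out details of the real implementation.

```python
class PassiveHealth:
    """Simplified model of max_fails / fail_timeout for one upstream server."""

    def __init__(self, max_fails: int = 3, fail_timeout: float = 30.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.window_start = 0.0  # time of the first failure in the current window
        self.down_until = 0.0    # server is skipped until this time

    def available(self, now: float) -> bool:
        return now >= self.down_until

    def record_failure(self, now: float) -> None:
        # Start a new counting window if none is open or the old one has expired.
        if self.fails == 0 or now - self.window_start > self.fail_timeout:
            self.fails = 0
            self.window_start = now
        self.fails += 1
        if self.fails >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fails = 0

    def record_success(self, now: float) -> None:
        self.fails = 0


health = PassiveHealth(max_fails=3, fail_timeout=30.0)
for t in (0.0, 1.0, 2.0):      # three failures in quick succession
    health.record_failure(t)
print(health.available(10.0))  # still inside the 30 s penalty window
print(health.available(32.0))  # penalty expired, server is tried again
```

Passive checks only react to real client traffic, so a failed backend costs a few requests before it is taken out of rotation.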

2.3 Dynamic Upstream Management

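# using nginx‑upsync‑module with Consul (third‑party module, must be compiled into Nginx)
upstream backend_dynamic {
    # upsync_type is a parameter of the upsync directive, not a directive of its own
    upsync 127.0.0.1:8500/v1/kv/upstreams/backend_cluster upsync_timeout=6m upsync_interval=500ms upsync_type=consul;
    # Static server list used at startup, before the first sync from Consul
    include /etc/nginx/conf.d/servers.conf;
}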

3. Keepalived Automatic Failover

3.1 How Keepalived Works

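Keepalived uses the VRRP protocol to manage a virtual IP (VIP) that is shared between nodes. The master holds the VIP and sends periodic advertisements. When the master fails, the backup takes over the VIP: within about a second if the master shuts down cleanly, or after roughly three missed advertisements (a few seconds with advert_int 1) if it disappears without notice.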
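
How long a backup waits before claiming the VIP can be estimated from the VRRPv2 timers. The sketch below assumes VRRP version 2 (RFC 3768), where the master-down interval is three advertisement intervals plus a priority-based skew. The function name is illustrative, and the example values (advert_int 1, backup priority 90) are the ones used in sections 3.2 and 3.3.

```python
def master_down_interval(advert_int: float, backup_priority: int) -> float:
    """Seconds a VRRPv2 backup waits without advertisements before taking over."""
    if not 1 <= backup_priority <= 254:
        raise ValueError("backup priority must be between 1 and 254")
    skew_time = (256 - backup_priority) / 256  # higher priority means shorter skew
    return 3 * advert_int + skew_time


if __name__ == "__main__":
    wait = master_down_interval(advert_int=1, backup_priority=90)
    print(f"backup takes over after about {wait:.2f} s of silence")
```

This is the silent-failure case (power loss, kernel panic). A clean Keepalived shutdown announces itself, so the backup only waits for the skew time.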

3.2 Master Configuration

# /etc/keepalived/keepalived.conf (Master)
global_defs {
    router_id NGINX_MASTER
    script_user root
    enable_script_security
}

vrrp_script check_nginx {
    script "/etc/keepalived/check_nginx.sh"
    interval 2
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass nginx_ha_2024
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    track_script {
        check_nginx
    }
    notify_master "/etc/keepalived/notify.sh master"
    notify_backup "/etc/keepalived/notify.sh backup"
    notify_fault "/etc/keepalived/notify.sh fault"
}

3.3 Backup Configuration

# /etc/keepalived/keepalived.conf (Backup)
global_defs {
    router_id NGINX_BACKUP
    script_user root
    enable_script_security
}

vrrp_script check_nginx {
    script "/etc/keepalived/check_nginx.sh"
    interval 2
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass nginx_ha_2024
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    track_script {
        check_nginx
    }
    notify_master "/etc/keepalived/notify.sh master"
    notify_backup "/etc/keepalived/notify.sh backup"
    notify_fault "/etc/keepalived/notify.sh fault"
}
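
The interplay between priority 100, priority 90 and weight -20 decides who holds the VIP. The sketch below models only that arithmetic for a tracked script with a negative weight. The function names are hypothetical, and preemption settings and priority ties are ignored.

```python
def effective_priority(base: int, script_weight: int, script_ok: bool) -> int:
    """Priority a node advertises when it tracks one script with a negative weight."""
    if script_weight >= 0:
        raise ValueError("this sketch only covers negative weights")
    priority = base if script_ok else base + script_weight
    return max(1, min(254, priority))


def vip_holder(priorities: dict[str, int]) -> str:
    """The node advertising the highest priority ends up holding the VIP."""
    return max(priorities, key=priorities.get)


healthy = {
    "master": effective_priority(100, -20, script_ok=True),
    "backup": effective_priority(90, -20, script_ok=True),
}
nginx_down_on_master = {
    "master": effective_priority(100, -20, script_ok=False),  # 100 - 20 = 80
    "backup": effective_priority(90, -20, script_ok=True),
}
print(vip_holder(healthy), vip_holder(nginx_down_on_master))
```

The weight has to be larger than the gap between the two priorities (here 20 > 10); otherwise a failed check on the master would never hand the VIP over.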

3.4 Health‑Check Scripts

#!/bin/bash
# /etc/keepalived/check_nginx.sh
# Exit 0 when Nginx is healthy; any non-zero exit counts as a failed check.

# Check the Nginx process. pgrep -x matches the exact process name, so this
# script's own name (check_nginx.sh) is not mistaken for a running nginx.
if ! pgrep -x nginx >/dev/null; then
    # Try one restart before reporting failure
    systemctl start nginx
    sleep 2
    pgrep -x nginx >/dev/null || exit 1
fi
# Check port
nc -z localhost 80 || exit 1
# Check HTTP response (assumes a /health location that returns 200;
# time-limited so a hung Nginx cannot stall the check)
http_code=$(curl -s -o /dev/null --max-time 2 -w "%{http_code}" http://localhost/health)
[ "$http_code" = "200" ] || exit 1
exit 0

#!/bin/bash
# /etc/keepalived/notify.sh
# Called by Keepalived with the new state (master|backup|fault) as $1.
TYPE=$1
NAME=$(hostname)
# First IPv4 address on eth0
IP=$(ip -4 addr show eth0 | awk '/inet /{print $2}' | cut -d/ -f1 | head -n1)
DATE=$(date '+%Y-%m-%d %H:%M:%S')
MAIL_TO="[email protected]"  # set to your alert recipient
case "$TYPE" in
    master)
        echo "$DATE: $NAME($IP) became MASTER" | mail -s "Nginx HA: $NAME is MASTER" "$MAIL_TO"
        # Site-specific helper, not shown here
        /usr/local/bin/update_dns.sh add
        ;;
    backup)
        echo "$DATE: $NAME($IP) became BACKUP" | mail -s "Nginx HA: $NAME is BACKUP" "$MAIL_TO"
        ;;
    fault)
        echo "$DATE: $NAME($IP) became FAULT" | mail -s "Nginx HA: $NAME is FAULT" "$MAIL_TO"
        # Site-specific helper, not shown here
        /usr/local/bin/send_alert.sh critical
        ;;
esac
mkdir -p /var/log/keepalived
echo "$DATE $NAME $IP $TYPE" >> /var/log/keepalived/state_change.log

4. Monitoring & Alerting

4.1 Prometheus + Grafana

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['192.168.1.100:9113', '192.168.1.101:9113']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.100:9100', '192.168.1.101:9100']
rule_files:
  - 'nginx_rules.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

4.2 Alert Rules

# nginx_rules.yml
groups:
  - name: nginx_alerts
    interval: 30s
    rules:
      - alert: NginxDown
        expr: up{job="nginx"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nginx is down on {{ $labels.instance }}"
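      # The status label and the duration metric used below are not exposed by
      # the stub_status-based exporter on :9113; they assume a log-based or
      # VTS-style exporter that provides them.
      - alert: HighErrorRate
        # Share of requests answered with 5xx over the last 5 minutes
        expr: |
          sum by (instance) (rate(nginx_http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(nginx_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "5xx error rate above 5% on {{ $labels.instance }}"
      - alert: HighResponseTime
        expr: nginx_http_request_duration_seconds{quantile="0.99"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 response time above 1 s on {{ $labels.instance }}"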

4.3 Custom Metrics Collection

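#!/usr/bin/env python3
# collect_metrics.py
import time

import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# push_to_gateway needs an explicit registry that holds the metrics to push
registry = CollectorRegistry()
active_connections = Gauge('nginx_active_connections', 'Active connections', registry=registry)
requests_per_second = Gauge('nginx_requests_per_second', 'Requests per second', registry=registry)

_last = None  # (timestamp, total_requests) from the previous poll

def collect_metrics():
    global _last
    response = requests.get('http://localhost/nginx_status', timeout=5)
    response.raise_for_status()
    lines = response.text.strip().split('\n')
    # Line 1: "Active connections: N"
    active_connections.set(int(lines[0].split(':')[1].strip()))
    # Line 3: accepts, handled, requests; the third value is a running total
    total = int(lines[2].split()[2])
    now = time.monotonic()
    if _last is not None and total >= _last[1]:
        # stub_status reports only a total, so derive the rate from the delta
        requests_per_second.set((total - _last[1]) / (now - _last[0]))
    _last = (now, total)
    push_to_gateway('localhost:9091', job='nginx_custom', registry=registry)

if __name__ == '__main__':
    while True:
        try:
            collect_metrics()
        except OSError as exc:  # covers requests and Pushgateway connection errors
            print(f"metrics collection failed: {exc}")
        time.sleep(15)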

5. System & Nginx Performance Tuning

Kernel parameters (net.ipv4.*, net.core.*, fs.*) are tuned for high concurrency, and Nginx directives (worker_processes, keepalive_timeout, gzip, proxy_cache, etc.) are optimized to maximize throughput and reduce latency.
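
One tuning check that is easy to get wrong is whether the connection and file-descriptor limits from section 2.2 fit together. The sketch below works through that arithmetic. The function names are hypothetical, and the worker count of 4 is an assumption, because worker_processes auto resolves to the number of CPU cores on the host.

```python
def max_proxied_clients(worker_processes: int, worker_connections: int) -> int:
    """Upper bound on concurrent clients when every request is proxied upstream."""
    if worker_processes < 1 or worker_connections < 1:
        raise ValueError("both values must be positive")
    # worker_connections counts all connections of a worker, so a proxied
    # request uses two of them: one to the client and one to the backend.
    return worker_processes * worker_connections // 2


def fd_limit_ok(worker_connections: int, worker_rlimit_nofile: int) -> bool:
    """Each connection needs a file descriptor, so the limit must cover them."""
    return worker_rlimit_nofile >= worker_connections


if __name__ == "__main__":
    workers = 4  # assumed core count
    print(max_proxied_clients(workers, 20480))
    print(fd_limit_ok(20480, 65535))
```

Log files and cached files also consume descriptors, so leaving headroom above worker_connections, as the 65535 value does, is a sensible default.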

6. Failure‑Injection Drills

Scripts simulate Nginx crashes, network partitions, and backend server failures to verify automatic VIP migration, load‑balancer resilience, and alerting behavior.
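
A drill is only useful if the outage it causes is measured. The sketch below polls a URL during a drill and reports the longest stretch of failed probes. The function names, the half-second polling interval and the /health path are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 1.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def longest_outage(samples: list[tuple[float, bool]]) -> float:
    """Longest failed stretch in seconds, from first failure to next success."""
    longest, start = 0.0, None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
    if start is not None:  # still failing when the drill ended
        longest = max(longest, samples[-1][0] - start)
    return longest


def run_drill(url: str, duration: float = 60.0, interval: float = 0.5) -> float:
    """Poll the URL for the given duration and return the longest outage seen."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return longest_outage(samples)
```

A typical use would be to start run_drill("http://192.168.1.100/health") from a third machine, stop Nginx or power off the master, and compare the reported outage with the RTO from section 1.1.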

7. Production Best Practices

All nodes synchronized via NTP

SELinux and firewall rules hardened

System parameter optimizations applied

Prometheus/Grafana monitoring and alerting configured

Log rotation and backup policies in place

Automation scripts for syntax check, reload, backup, and rollback

8. FAQ

VIP drift causes access issues

Refresh the gateway's ARP entry for the VIP from the node that now holds it (e.g., arping -I eth0 -c 3 -s 192.168.1.100 192.168.1.1), or set vrrp_garp_master_delay 10 and vrrp_garp_interval 0.001 in Keepalived's global_defs.

Load imbalance

Switch to least_conn, adjust server weights, and enable slow_start for newly added nodes (slow_start is available only in NGINX Plus).

Session persistence failure

Use consistent hashing (hash $cookie_sessionid consistent;) or sticky cookies (sticky cookie srv_id expires=1h path=/;, an NGINX Plus feature).
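
Consistent hashing is worth the extra configuration because removing a backend only remaps the sessions that lived on it. The sketch below illustrates the idea with a generic hash ring. It is not Nginx's ketama implementation: the HashRing class, the SHA-256 hash and the 100 virtual nodes per server are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring at many pseudo-random points.
        ring = [(self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)]
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [node for _, node in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        """Return the node that owns the first ring point after the key's hash."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owners[idx]


three = HashRing(["web1", "web2", "web3"])
two = HashRing(["web1", "web2"])  # web3 removed
sessions = [f"session-{i}" for i in range(1000)]
moved = sum(1 for s in sessions if three.lookup(s) != two.lookup(s))
print(f"{moved} of {len(sessions)} sessions changed backend")
```

Only sessions that were on web3 move. With plain modulo hashing, most sessions would land on a different backend after the same change.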

9. Load Testing with wrk

Example commands:

# Basic test
./wrk -t12 -c400 -d30s http://192.168.1.100/
# Lua script for POST requests
./wrk -t12 -c400 -d30s -s post.lua http://192.168.1.100/api/test

Sample result shows ~25,600 requests/sec with average latency 15.6 ms, confirming the cluster can handle high traffic.
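
A quick plausibility check for such results is Little's law: throughput multiplied by average latency should roughly equal the number of requests in flight. The sketch below applies it to the sample figures above; the function name is illustrative.

```python
def implied_concurrency(requests_per_sec: float, avg_latency_ms: float) -> float:
    """Little's law: requests in flight = throughput x average time per request."""
    return requests_per_sec * avg_latency_ms / 1000


if __name__ == "__main__":
    in_flight = implied_concurrency(25_600, 15.6)
    print(f"about {in_flight:.0f} requests in flight")
```

The result is close to the 400 connections set with -c400, which suggests the run was limited by the client's connection count, not by the cluster. Raising -c is a reasonable next step to find the real ceiling.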

10. Advanced Techniques

Dynamic routing with Lua

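# Requires lua-nginx-module and lua-resty-redis; the named locations
# (@default and one per backend value stored in Redis) must be defined.
location /dynamic {
    content_by_lua_block {
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(1000)  -- 1 s for connect, send and read
        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.log(ngx.ERR, "redis connect failed: ", err)
            return ngx.exec("@default")
        end
        local backend = red:get("route:" .. ngx.var.uri)
        red:set_keepalive(10000, 100)  -- return the connection to the pool
        -- A missing key comes back as ngx.null, which is truthy in Lua
        if backend and backend ~= ngx.null then
            return ngx.exec("@" .. backend)
        end
        return ngx.exec("@default")
    }
}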

Intelligent rate limiting

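# The rate= parameter cannot take a variable (and $limit_rate is a built-in
# variable), so define one zone per tier. A request whose key is empty is
# not counted against that zone.
map $http_x_user_level $key_default { default $binary_remote_addr; "vip" ""; "svip" ""; }
map $http_x_user_level $key_vip     { default ""; "vip"  $binary_remote_addr; }
map $http_x_user_level $key_svip    { default ""; "svip" $binary_remote_addr; }
limit_req_zone $key_default zone=limit_default:10m rate=10r/s;
limit_req_zone $key_vip     zone=limit_vip:10m     rate=50r/s;
limit_req_zone $key_svip    zone=limit_svip:10m    rate=100r/s;
# In the location, apply all three: limit_req zone=limit_default; limit_req zone=limit_vip; limit_req zone=limit_svip;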

Blue‑Green Deployment Script

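#!/bin/bash
# blue_green_deploy.sh
# Assumes backend_cluster is defined only in UPSTREAM_CONF and that
# nginx.conf includes that file inside the http block.
set -euo pipefail
BLUE_SERVERS="192.168.1.10:8080 192.168.1.11:8080"
GREEN_SERVERS="192.168.1.20:8080 192.168.1.21:8080"
UPSTREAM_CONF=/etc/nginx/conf.d/upstream.conf
CURRENT_ENV=$(cat /etc/nginx/current_env)
if [ "$CURRENT_ENV" = "blue" ]; then
    NEW_ENV="green"
    NEW_SERVERS=$GREEN_SERVERS
else
    NEW_ENV="blue"
    NEW_SERVERS=$BLUE_SERVERS
fi
# Keep the old config so a failed syntax check can be rolled back
cp -f "$UPSTREAM_CONF" "$UPSTREAM_CONF.bak" 2>/dev/null || true
{
    echo "upstream backend_cluster {"
    echo "    least_conn;"
    for server in $NEW_SERVERS; do
        echo "    server $server;"
    done
    echo "}"
} > "$UPSTREAM_CONF"
# Validate before reloading; restore the previous file on failure
if ! nginx -t; then
    [ -f "$UPSTREAM_CONF.bak" ] && mv -f "$UPSTREAM_CONF.bak" "$UPSTREAM_CONF"
    echo "Config test failed; staying on $CURRENT_ENV" >&2
    exit 1
fi
nginx -s reload
echo "$NEW_ENV" > /etc/nginx/current_env
echo "Switched to $NEW_ENV environment"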

Following this guide enables a production‑grade Nginx service that targets 99.99% availability, with automated failover, observability, and continuous improvement.
