
Avoid 3 Hidden Nginx+Keepalived HA Pitfalls That 90% of Ops Encounter

This article reveals three hard‑to‑detect pitfalls in Nginx + Keepalived high‑availability setups—split‑brain caused by network partitions, inadequate health‑check scripts, and unsafe configuration‑sync timing—provides real‑world incident examples, and offers complete, battle‑tested solutions with ready‑to‑use scripts.

MaGe Linux Operations

Nginx+Keepalived High‑Availability Architecture: 3 Hidden Pitfalls and How to Avoid Them

Hard-won lessons: three fatal traps distilled from production incidents. Reading this could save you years of troubleshooting.

Preface: A 3 AM Production Outage

At 3 AM the monitoring alarm screamed “service unavailable! Users cannot access!” Our Nginx+Keepalived HA cluster failed, both master and backup nodes went down, and the whole business system collapsed. After an all‑night investigation I discovered three hidden traps that are almost impossible to reproduce in a test environment but cause massive loss in production.

Pitfall 1: Split‑Brain – The Invisible Killer Caused by Network Partition

Problem Description

Many operators only guard against heartbeat loss and ignore the more subtle double‑master situation caused by network segmentation.

Real‑World Example

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.100
    }
}

This configuration looks fine on a single‑NIC host, but in a multi‑NIC or complex network it can lead to a fatal split‑brain.

What Happened

Master node thought the backup was dead and kept the VIP.

Backup node also thought the master was dead and grabbed the VIP.

Two machines now owned the same VIP, causing session inconsistency and data loss.
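A double-master condition can be confirmed by asking each node whether it currently holds the VIP. The sketch below assumes the output of `ip -o -4 addr show` has already been collected from each node into a file (for example over SSH); the function names `holds_vip` and `count_holders` are illustrative and not part of the original setup.

```shell
#!/bin/bash
# holds_vip VIP: read `ip -o -4 addr show` output on stdin and succeed
# if VIP is assigned to any interface.
holds_vip() {
    local vip=$1
    # Field 4 of the one-line format is "ADDRESS/PREFIX"; compare the address part exactly.
    awk -v vip="$vip" '{ split($4, a, "/"); if (a[1] == vip) found = 1 } END { exit !found }'
}

# count_holders VIP FILE...: count how many files (one per node) show the VIP.
count_holders() {
    local vip=$1 n=0 f
    shift
    for f in "$@"; do
        if holds_vip "$vip" < "$f"; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Calling `count_holders 192.168.1.100 node_a.txt node_b.txt` prints the number of nodes holding the address; a result of 2 indicates split-brain, and 0 means no node is serving the VIP.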

Perfect Solution

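# Prevent split‑brain – full configuration
# vrrp_script blocks come first: some keepalived versions ignore a track_script
# entry whose vrrp_script is defined later in the file
vrrp_script chk_nginx {
    script "/etc/keepalived/scripts/check_nginx.sh"
    interval 2
    fall 3
    rise 2
    # No weight (default 0): a failed check puts the instance into FAULT state,
    # which releases the VIP. A weight of -2 would only lower 100 to 98, still
    # above the backup's 90, and with nopreempt the backup would not take over.
}

vrrp_script chk_network {
    script "/etc/keepalived/scripts/check_network.sh"
    interval 5
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    state BACKUP  # both nodes set to BACKUP (required for nopreempt)
    interface eth0
    virtual_router_id 51
    priority 100  # 100 on the preferred node, 90 on the other
    advert_int 1
    nopreempt      # a recovered node does not take the VIP back
    authentication {
        auth_type PASS
        auth_pass your_complex_password_here  # only the first 8 characters are used
    }
    track_script {
        chk_nginx
        chk_network
    }
    notify_master "/etc/keepalived/scripts/check_split_brain.sh"
    virtual_ipaddress {
        192.168.1.100
    }
}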
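How a negative `weight` on a tracked script affects the election comes down to simple arithmetic: keepalived adds the weight of each failing script to the base priority, and the peer only outranks this node if the result falls below the peer's priority. The following is a minimal sketch of that arithmetic, using the 100/90 priorities from the configuration above and a hypothetical weight of -2 per check. The function names are illustrative, and the sketch models the priority comparison only, not keepalived itself.

```shell
#!/bin/bash
# effective_priority BASE WEIGHT...: base priority plus the weights of all
# currently failing checks (negative weights lower the priority).
effective_priority() {
    local prio=$1 w
    shift
    for w in "$@"; do
        prio=$((prio + w))
    done
    echo "$prio"
}

# would_fail_over MASTER_EFFECTIVE BACKUP_PRIORITY: succeed when the backup
# now advertises the higher priority.
would_fail_over() {
    [ "$1" -lt "$2" ]
}

master=$(effective_priority 100 -2 -2)   # both checks failing, weight -2 each
echo "master effective priority: $master"
if would_fail_over "$master" 90; then
    echo "backup (90) would win the election"
else
    echo "master still outranks backup (90): no failover"
fi
```

With two checks at -2 the node only drops to 96, so the backup at 90 never wins; the combined weights must push the priority below 90. Note also that with `nopreempt` a backup does not take over from a master that is still advertising, so a priority drop alone may not move the VIP.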

Split‑brain detection script (check_split_brain.sh):

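#!/bin/bash
# Split‑brain detection script, run by notify_master when this node becomes MASTER
REMOTE_IP="192.168.1.11"   # peer address; use the other node's IP on the peer
VIP="192.168.1.100"
IFACE="eth0"
# Give the previous master a moment to release the VIP before checking
sleep 3
# Check if the peer is reachable and still holds the VIP
if ping -c 1 -W 1 "$REMOTE_IP" >/dev/null 2>&1; then
    # BatchMode avoids hanging on a password prompt; it needs key-based SSH and
    # the peer's host key already present in known_hosts
    if ssh -o ConnectTimeout=2 -o BatchMode=yes "$REMOTE_IP" "ip -o -4 addr show | grep -qwF '$VIP'"; then
        logger "CRITICAL: Split brain detected! Releasing VIP..."
        # The VIP is configured without a prefix length, so keepalived adds it as /32.
        # Last resort only: keepalived itself still believes it is MASTER afterwards.
        ip addr del "$VIP/32" dev "$IFACE"
        curl -s --max-time 5 -X POST "your_alert_webhook" -d "Split brain detected on $(hostname)"
        exit 1
    fi
fi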

Pitfall 2: Health‑Check Defect – Zombie‑Process Trap

Problem Description

Many operators write health checks that only verify the process exists, ignoring whether the service can actually answer requests.

Typical Wrong Script

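#!/bin/bash
# Bad example – a process-only check
ps -ef | grep nginx | grep -v grep
if [ $? -ne 0 ]; then
    exit 1
fi

This script passes as long as any process with "nginx" in its command line exists, even when the workers are hung and cannot serve requests. It can even match itself: a script saved as check_nginx.sh appears in the ps output, so the check never fails.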

Real Incident

In production, an nginx worker became unresponsive (a "zombie" in the loose sense) because of a memory leak; the master process kept running, but the node could not handle traffic. The health‑check script still reported success, keepalived did not fail over, and all user requests failed.

Perfect Health‑Check Script

#!/bin/bash
CHECK_URL="http://127.0.0.1/health"
MAX_LOAD=10       # 1-minute load average threshold
MAX_MEM_PCT=90    # used-memory threshold in percent

http_code() {
    curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 "$CHECK_URL"
}

# 1. Process exists (pgrep -f matches the process title, not this script's file name)
NGINX_PID=$(pgrep -f "nginx: master" | head -n 1)
if [ -z "$NGINX_PID" ]; then
    logger "Nginx master process not found"
    exit 1
fi
# 2. Port listening (ss replaces the deprecated netstat; column 4 is the local address)
if ! ss -tln | awk '{print $4}' | grep -q ':80$'; then
    logger "Nginx port 80 not listening"
    exit 1
fi
# 3. Config syntax: warn only. A half-edited file on disk does not mean the
#    running service is broken, so it should not trigger a failover by itself.
if ! nginx -t -q 2>/dev/null; then
    logger "WARNING: Nginx configuration on disk fails nginx -t"
fi
# 4. Real HTTP check, with one restart attempt before failing over.
#    This path takes longer than a 2-second check interval.
HTTP_CODE=$(http_code)
if [ "$HTTP_CODE" != "200" ]; then
    logger "Nginx health check failed, HTTP code: $HTTP_CODE"
    systemctl restart nginx
    sleep 2
    HTTP_CODE=$(http_code)
    if [ "$HTTP_CODE" != "200" ]; then
        logger "Nginx restart failed, triggering failover"
        exit 1
    fi
fi
# 5. System load (1-minute average from /proc/loadavg)
read -r LOAD _ < /proc/loadavg
if awk -v l="$LOAD" -v m="$MAX_LOAD" 'BEGIN { exit !(l > m) }'; then
    logger "System load too high: $LOAD"
    exit 1
fi
# 6. Memory usage
MEM_USAGE=$(free | awk '/^Mem:/ {printf "%.2f", $3 / $2 * 100}')
if awk -v u="$MEM_USAGE" -v m="$MAX_MEM_PCT" 'BEGIN { exit !(u > m) }'; then
    logger "Memory usage too high: $MEM_USAGE%"
    exit 1
fi
# Success is not logged: at a 2-second interval it would flood syslog
exit 0
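Because keepalived runs the check every `interval` seconds, it helps to know how long the script actually takes before deploying it, especially on the restart path. The wrapper below is a small sketch for manual testing; `run_with_budget` is a hypothetical helper name, and it relies on the `timeout` command from GNU coreutils.

```shell
#!/bin/bash
# run_with_budget SECONDS CMD [ARGS...]: run a check under a hard time limit,
# print its exit status and elapsed whole seconds, and return that status.
# An exit status of 124 means `timeout` killed the command for overrunning.
run_with_budget() {
    local budget=$1 start rc
    shift
    start=$(date +%s)
    timeout "$budget" "$@"
    rc=$?
    echo "exit=$rc elapsed=$(( $(date +%s) - start ))s budget=${budget}s"
    return "$rc"
}
```

For example, `run_with_budget 2 /etc/keepalived/scripts/check_nginx.sh` shows whether the check finishes inside a 2-second interval, and which exit code keepalived would see.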

Corresponding nginx health endpoint:

# Simple health endpoint
location /health {
    access_log off;
    default_type text/plain;   # sets Content-Type for the return body
    return 200 "healthy\n";
}

# Detailed health endpoint (Lua). Requires OpenResty or lua-nginx-module with
# cjson, plus the stub_status module, which provides the $connections_* variables.
location /health/detailed {
    access_log off;
    default_type application/json;
    content_by_lua_block {
        local json = require "cjson"
        local health_data = {
            status = "healthy",
            timestamp = ngx.time(),
            connections = {
                active = ngx.var.connections_active,
                reading = ngx.var.connections_reading,
                writing = ngx.var.connections_writing,
                waiting = ngx.var.connections_waiting,
            }
        }
        ngx.say(json.encode(health_data))
    }
}

Pitfall 3: Configuration Sync Timing – Domino Effect During Service Restart

Problem Description

If the restart order of master and backup nodes is not coordinated, the VIP may switch to a node that still runs the old configuration, causing 500 errors.

Incident Replay

Update master configuration and restart nginx.

Update backup configuration and restart nginx.

When the master restarts, keepalived moves the VIP to the backup, but the backup still has the old config, so requests hit a non‑existent upstream and fail.
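One way to catch this situation before any restart is to compare configuration checksums on both nodes. The sketch below assumes `sha256sum` is available and that the peer's hash is obtained separately, for example over SSH; `config_hash` and `configs_match` are illustrative names. It compares a single file only, so files pulled in through `include` would need the same treatment.

```shell
#!/bin/bash
# config_hash FILE: print the SHA-256 digest of a config file.
config_hash() {
    sha256sum "$1" | awk '{print $1}'
}

# configs_match LOCAL_FILE PEER_HASH: succeed when the local file hashes to PEER_HASH.
configs_match() {
    [ "$(config_hash "$1")" = "$2" ]
}

# Example use (peer address taken from the article's setup):
#   peer_hash=$(ssh root@192.168.1.11 "sha256sum /etc/nginx/nginx.conf" | awk '{print $1}')
#   configs_match /etc/nginx/nginx.conf "$peer_hash" || echo "Configs differ: do not restart yet"
```

If the hashes differ, a failover at that moment would land on a node with a different configuration, which is the failure described above.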

Perfect Update Procedure

#!/bin/bash
MASTER_IP="192.168.1.10"
BACKUP_IP="192.168.1.11"
CONFIG_FILE="/etc/nginx/nginx.conf"
VIP="192.168.1.100"

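is_master() {
    # -w and -F match the VIP as a whole literal address, not as a substring or regex
    ip -o addr show | grep -qwF "$VIP"
}

sync_config() {
    local target_ip=$1
    echo "Syncing config to $target_ip..."
    # Validate locally first so a broken file never leaves this node
    nginx -t -q || { echo "Local configuration is invalid"; return 1; }
    # Keep a backup on the peer so a failed test can be rolled back
    ssh root@"$target_ip" "cp -p $CONFIG_FILE $CONFIG_FILE.bak" || return 1
    scp "$CONFIG_FILE" root@"$target_ip":"$CONFIG_FILE" || return 1
    if ! ssh root@"$target_ip" "nginx -t -q"; then
        echo "Configuration syntax error on $target_ip, rolling back"
        ssh root@"$target_ip" "mv $CONFIG_FILE.bak $CONFIG_FILE"
        return 1
    fi
    return 0
}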

safe_restart_nginx() {
    if is_master; then
        echo "Current node is MASTER, performing graceful restart..."
        # Stopping keepalived makes it announce that it is leaving and drop the VIP,
        # so the peer takes over. Unlike rewriting "priority 100" with sed, this
        # works on either node (100 or 90) and is not blocked by nopreempt.
        echo "Stopping keepalived to hand over the VIP..."
        systemctl stop keepalived
        local switched=1
        for i in {1..10}; do
            if ! is_master; then
                echo "VIP switched successfully"
                switched=0
                break
            fi
            echo "Waiting for VIP switch... ($i/10)"
            sleep 2
        done
        if [ $switched -ne 0 ]; then
            echo "VIP was not released, aborting restart"
            systemctl start keepalived
            return 1
        fi
        systemctl restart nginx
        # -f makes curl fail on HTTP errors, so a 5xx is not treated as healthy
        if [ $? -eq 0 ] && curl -sf --max-time 5 http://127.0.0.1/health >/dev/null; then
            echo "Nginx restarted successfully"
            # With nopreempt this node rejoins as BACKUP; the VIP does not flip back
            systemctl start keepalived
        else
            echo "Nginx restart failed! keepalived stays stopped so this node cannot take the VIP"
            return 1
        fi
    else
        echo "Current node is BACKUP, restarting nginx directly..."
        systemctl restart nginx
        if [ $? -ne 0 ]; then
            echo "Nginx restart failed on backup!"
            return 1
        fi
    fi
    return 0
}

main() {
    echo "Starting safe nginx configuration update..."
    is_master
    local current_master=$?
    if [ $current_master -eq 0 ]; then
        echo "Running on MASTER node"
        other_node=$BACKUP_IP
    else
        echo "Running on BACKUP node"
        other_node=$MASTER_IP
    fi
    echo "Step 1: Syncing configuration to peer node..."
    sync_config $other_node || { echo "Configuration sync failed!"; exit 1; }
    echo "Step 2: Restarting nginx on peer node..."
    ssh root@$other_node "systemctl restart nginx" || { echo "Failed to restart nginx on peer node!"; exit 1; }
    # -f makes curl fail on HTTP errors (4xx/5xx), which plain -s would report as success
    ssh root@$other_node "curl -sf --max-time 5 http://127.0.0.1/health" >/dev/null || { echo "Peer node health check failed!"; exit 1; }
    echo "Step 3: Restarting nginx on current node..."
    safe_restart_nginx || { echo "Failed to restart nginx on current node!"; exit 1; }
    echo "Configuration update completed successfully!"
    echo "Final verification..."
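    # -f makes curl fail on HTTP errors, so a 5xx from the VIP is not reported as healthy
    if curl -sf --max-time 5 "http://$VIP/health" >/dev/null; then
        echo "✅ All services are healthy!"
    else
        echo "❌ Service verification failed!"
        exit 1
    fi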
}

main

Conclusion: From Pitfalls to Mastery

After years of battling these issues I’ve learned that details decide success; prevention beats cure.

Key Takeaways

Split‑brain protection: multi‑layer detection, intelligent failover, real‑time monitoring.

Health checks: real service verification, system‑resource monitoring, automatic remediation.

Config sync: safe sequencing, graceful switch, automatic rollback.

Best‑Practice Recommendations

Monitoring & Alerts: monitor not only service status but also VIP switch events.

Documentation: record every incident to build a knowledge base.

Regular Drills: conduct at least monthly failover rehearsals.

Automation: encode operational experience into reusable scripts.
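Monitoring VIP switch events can start with keepalived's generic `notify` hook. According to the keepalived.conf man page, it calls the script with the type (GROUP or INSTANCE), the name, the target state, and the priority. The script below is a minimal sketch: the path in the comment and the function name `format_transition` are assumptions, and it only writes to syslog.

```shell
#!/bin/bash
# Hypothetical notify script, e.g. saved as /etc/keepalived/scripts/notify_state.sh
# and referenced inside the vrrp_instance block with:
#   notify "/etc/keepalived/scripts/notify_state.sh"
# keepalived invokes it as: <script> TYPE NAME STATE PRIORITY

format_transition() {
    local type=$1 name=$2 state=$3 prio=${4:-unknown}
    echo "keepalived ${type} ${name} on ${HOSTNAME:-$(uname -n)} entered ${state} (priority ${prio})"
}

# Log only when called with arguments, so the file can also be sourced for testing.
if [ "$#" -ge 3 ]; then
    logger -t keepalived-notify "$(format_transition "$@")"
fi
```

An alerting rule on the `keepalived-notify` syslog tag then catches every MASTER, BACKUP, or FAULT transition, including the ones that happen during drills.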

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
