
Avoid 3 Fatal Nginx+Keepalived HA Pitfalls That 90% of Ops Engineers Miss

This article reveals three hidden traps in Nginx‑Keepalived high‑availability setups—network‑partition split‑brain, inadequate health‑check scripts, and unsafe configuration‑sync timing—explains real incidents caused by each, and provides concrete configuration changes, Bash scripts, and automation tips to prevent service outages.

Raymond Ops

A 3 a.m. production incident, a complete service outage caused by a misconfigured Nginx + Keepalived HA cluster, exposed three deep-seated pitfalls that are hard to reproduce in test environments.

Pitfall 1: Split‑brain caused by network partition

Problem description

Many operators only guard against heartbeat loss, overlooking the case where a network partition makes both nodes think they are master, resulting in two VIPs on the network.

Real‑world example

# Apparent normal configuration
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.100
    }
}

In multi‑NIC or complex network environments this config can lead to both nodes holding the VIP simultaneously, causing session inconsistency.
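You can verify by hand which node currently holds the VIP. A small helper along these lines (the address and `ip addr` sample are examples) makes the check scriptable while avoiding substring matches, where e.g. 192.168.1.10 would otherwise match 192.168.1.100:

```shell
#!/bin/bash
# has_vip: succeed if the given `ip addr` output lists the VIP exactly.
# Anchoring on "inet <VIP>/" prevents a shorter address from matching
# inside a longer one.
has_vip() {
    printf '%s\n' "$1" | grep -Eq "inet $2/"
}

sample="    inet 192.168.1.100/24 scope global eth0"
has_vip "$sample" "192.168.1.100" && echo "VIP present"
has_vip "$sample" "192.168.1.10"  || echo "no partial match"
```

Run the same check on both nodes (locally and over ssh against `ip addr show`); if both report the VIP present, you are looking at split-brain.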

Robust solution

# Prevent split‑brain with full configuration
vrrp_instance VI_1 {
    state BACKUP  # both nodes start as BACKUP
    interface eth0
    virtual_router_id 51
    priority 100   # master = 100, backup = 90
    advert_int 1
    nopreempt      # disable pre‑emptive takeover
    authentication {
        auth_type PASS
        auth_pass your_complex_password_here
    }
    track_script {
        chk_nginx
        chk_network
    }
    notify_master "/etc/keepalived/scripts/check_split_brain.sh"
    virtual_ipaddress {
        192.168.1.100
    }
}

vrrp_script chk_nginx {
    script "/etc/keepalived/scripts/check_nginx.sh"
    interval 2
    weight -20   # must exceed the master/backup priority gap (10), or failover never triggers
    fall 3
    rise 2
}

vrrp_script chk_network {
    script "/etc/keepalived/scripts/check_network.sh"
    interval 5
    weight -20   # likewise larger in magnitude than the priority gap
    fall 2
    rise 1
}
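Keepalived adds a failing script's (negative) weight to the node's priority, and failover happens only when the master's effective priority drops below the backup's, so the weight's magnitude must exceed the master/backup priority gap. A quick sanity check of the arithmetic, using example priorities of 100/90 and a -20 weight:

```shell
#!/bin/bash
# Effective priority when a track script with the given weight fails.
master_prio=100
backup_prio=90
weight=-20          # note: -2 would leave 98 >= 90, so no failover

effective=$((master_prio + weight))
if [ "$effective" -lt "$backup_prio" ]; then
    echo "failover: effective $effective < backup $backup_prio"
else
    echo "no failover: effective $effective >= backup $backup_prio"
fi
```

This prints "failover: effective 80 < backup 90"; with a weight of -2 it would print the "no failover" branch, which is exactly the silent misconfiguration to avoid.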

The accompanying split‑brain detection script checks whether the peer still holds the VIP and, if so, releases it and sends an alert:

#!/bin/bash
REMOTE_IP="192.168.1.11"
VIP="192.168.1.100"
# Check peer reachability
if ping -c 1 -W 1 "$REMOTE_IP" >/dev/null 2>&1; then
    # Peer reachable: check whether it also holds the VIP (match the whole address)
    if ssh -o ConnectTimeout=2 -o StrictHostKeyChecking=no "$REMOTE_IP" \
        "ip addr show | grep -w '$VIP'" >/dev/null 2>&1; then
        logger "CRITICAL: Split brain detected! Releasing VIP..."
        ip addr del "$VIP/24" dev eth0
        curl -X POST "your_alert_webhook" -d "Split brain detected on $(hostname)"
        exit 1
    fi
fi

Pitfall 2: Health‑check scripts that miss zombie processes

Problem description

Most health checks only verify that a process exists, not whether the service is actually serving traffic. Nginx workers can become defunct (zombie) processes while the master stays alive, so a process check keeps passing even though no requests are being answered.

Typical faulty script

#!/bin/bash
# Bad example: only checks that a process exists
ps -ef | grep nginx | grep -v grep
if [ $? -ne 0 ]; then
    exit 1
fi

In production this led to a situation where Nginx workers became zombies, the health check reported success, Keepalived never triggered failover, and all user requests failed.
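A quick way to confirm the zombie scenario is the process state column: defunct processes carry a Z in their STAT field. A minimal sketch that counts zombies from `ps`-style output:

```shell
#!/bin/bash
# count_zombies: read `ps -eo stat,cmd` style lines on stdin and count
# processes whose state column contains Z (defunct/zombie).
count_zombies() {
    awk '$1 ~ /Z/ { n++ } END { print n+0 }'
}

# Live check against the real process table (guarded in case ps is absent):
command -v ps >/dev/null && ps -eo stat,cmd | count_zombies
```

Piping `ps -eo stat,cmd | grep nginx` through the same filter narrows the count to Nginx alone.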

A thorough health-check solution

#!/bin/bash
# 1. Verify master process exists
NGINX_PID=$(ps -ef | grep "nginx: master" | grep -v grep | awk '{print $2}')
VIP="192.168.1.100"
CHECK_URL="http://127.0.0.1/health"

if [ -z "$NGINX_PID" ]; then
    logger "Nginx master process not found"
    exit 1
fi

# 2. Verify port 80 is listening (ss replaces the deprecated netstat on modern distros)
ss -tlnp | grep ":80 " | grep nginx >/dev/null 2>&1
if [ $? -ne 0 ]; then
    logger "Nginx port 80 not listening"
    exit 1
fi

# 3. Verify configuration syntax
nginx -t >/dev/null 2>&1
if [ $? -ne 0 ]; then
    logger "Nginx configuration syntax error"
    exit 1
fi

# 4. Real HTTP request check
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 $CHECK_URL)
if [ "$HTTP_CODE" != "200" ]; then
    logger "Nginx health check failed, HTTP code: $HTTP_CODE"
    systemctl restart nginx
    sleep 2
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 $CHECK_URL)
    if [ "$HTTP_CODE" != "200" ]; then
        logger "Nginx restart failed, triggering failover"
        exit 1
    fi
fi

# 5. System load check
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$LOAD > 10" | bc -l) )); then
    logger "System load too high: $LOAD"
    exit 1
fi

# 6. Memory usage check
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
    logger "Memory usage too high: $MEM_USAGE%"
    exit 1
fi

logger "Nginx health check passed"
exit 0
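One caveat in the script above: uptime output varies by locale and platform, which can break the awk parsing. On Linux, /proc/loadavg is a stable source for the same number, and awk can do the floating-point comparison without a bc dependency:

```shell
#!/bin/bash
# Read the 1-minute load average directly from /proc/loadavg (Linux only)
read -r load1 _ < /proc/loadavg

# Floating-point threshold check in awk instead of bc
if awk -v l="$load1" -v max=10 'BEGIN { exit !(l <= max) }'; then
    echo "load OK: $load1"
else
    echo "load too high: $load1"
fi
```

The same `awk -v` pattern works for the memory-usage threshold, removing bc from the script's dependencies entirely.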

A matching Nginx location provides the health endpoint:

# Simple health endpoint
location /health {
    access_log off;
    default_type text/plain;
    return 200 "healthy\n";
}

# Detailed JSON health endpoint (requires lua-nginx-module)
location /health/detailed {
    access_log off;
    default_type application/json;
    content_by_lua_block {
        local json = require "cjson"
        local health_data = {
            status = "healthy",
            timestamp = ngx.time(),
            connections = {
                active = ngx.var.connections_active,
                reading = ngx.var.connections_reading,
                writing = ngx.var.connections_writing,
                waiting = ngx.var.connections_waiting,
            }
        }
        ngx.say(json.encode(health_data))
    }
}

Pitfall 3: Unsafe configuration‑sync timing causing service domino effect

Problem description

When updating Nginx configuration on a two-node Keepalived cluster, restarting the master first causes Keepalived to move the VIP to the backup, which may still be running the old configuration, so clients are served 500 errors by the very node that was supposed to provide safety.

Safe update workflow

#!/bin/bash
# Safe Nginx configuration update script
MASTER_IP="192.168.1.10"
BACKUP_IP="192.168.1.11"
CONFIG_FILE="/etc/nginx/nginx.conf"
VIP="192.168.1.100"

is_master() {
    # Match the whole VIP so e.g. 192.168.1.10 does not match 192.168.1.100
    ip addr show | grep -qw "$VIP"
}

sync_config() {
    local target_ip=$1
    echo "Syncing config to $target_ip..."
    scp $CONFIG_FILE root@$target_ip:$CONFIG_FILE
    ssh root@$target_ip "nginx -t"
    if [ $? -ne 0 ]; then
        echo "Configuration syntax error on $target_ip"
        return 1
    fi
    return 0
}

safe_restart_nginx() {
    is_master
    if [ $? -eq 0 ]; then
        echo "Current node is MASTER, performing graceful restart..."
        sed -i 's/priority 100/priority 50/' /etc/keepalived/keepalived.conf
        systemctl reload keepalived
        sleep 5
        for i in {1..10}; do
            is_master
            if [ $? -ne 0 ]; then
                echo "VIP switched successfully"
                break
            fi
            echo "Waiting for VIP switch... ($i/10)"
            sleep 2
        done
        systemctl restart nginx
        if [ $? -eq 0 ] && curl -s http://127.0.0.1/health >/dev/null; then
            echo "Nginx restarted successfully"
            sed -i 's/priority 50/priority 100/' /etc/keepalived/keepalived.conf
            systemctl reload keepalived
        else
            echo "Nginx restart failed!"
            return 1
        fi
    else
        echo "Current node is BACKUP, restarting nginx directly..."
        systemctl restart nginx
        if [ $? -ne 0 ]; then
            echo "Nginx restart failed on backup!"
            return 1
        fi
    fi
    return 0
}

main() {
    is_master
    if [ $? -eq 0 ]; then
        echo "Running on MASTER node"
        other_node=$BACKUP_IP
    else
        echo "Running on BACKUP node"
        other_node=$MASTER_IP
    fi
    echo "Step 1: Syncing configuration to peer node..."
    sync_config $other_node || { echo "Configuration sync failed!"; exit 1; }
    echo "Step 2: Restarting nginx on peer node..."
    ssh root@$other_node "systemctl restart nginx" || { echo "Failed to restart nginx on peer node!"; exit 1; }
    ssh root@$other_node "curl -s http://127.0.0.1/health" >/dev/null || { echo "Peer node health check failed!"; exit 1; }
    echo "Step 3: Restarting nginx on current node..."
    safe_restart_nginx || { echo "Failed to restart nginx on current node!"; exit 1; }
    echo "Configuration update completed successfully!"
    echo "Final verification..."
    curl -s http://$VIP/health >/dev/null && echo "✅ All services are healthy!" || { echo "❌ Service verification failed!"; exit 1; }
}

main
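One gap in the workflow above: scp alone gives no integrity guarantee, so a truncated transfer can pass `nginx -t` on the old file or fail confusingly. Comparing checksums before reloading closes that gap. In production the second sum would come over ssh from the peer; this sketch demonstrates the comparison with two local files:

```shell
#!/bin/bash
# Demo: compare checksums of two copies of a config file.
# In the real workflow, replace the second sha256sum with something like:
#   ssh root@$BACKUP_IP "sha256sum /etc/nginx/nginx.conf"
src=$(mktemp)
copy=$(mktemp)
echo "worker_processes auto;" > "$src"
cp "$src" "$copy"

local_sum=$(sha256sum "$src" | awk '{print $1}')
remote_sum=$(sha256sum "$copy" | awk '{print $1}')

if [ "$local_sum" = "$remote_sum" ]; then
    echo "checksums match, safe to reload"
else
    echo "checksum mismatch, aborting"
fi
rm -f "$src" "$copy"
```

Dropping such a check into sync_config right after the scp makes the sync step self-verifying.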

Automation hook example (GitLab CI)

# GitLab CI job to deploy configuration safely
deploy_nginx_config:
  stage: deploy
  script:
    - echo "Deploying nginx configuration..."
    - ansible-playbook -i inventory/production nginx_update.yml
  only:
    - master
  when: manual

Conclusion

Split-brain protection: multi-layer health checks, disabled preemptive takeover (nopreempt), and a dedicated detection script.

Robust health checks: verify process, port, config syntax, real HTTP response, system load, and memory usage; auto-restart on failure before failing over.

Safe config sync: synchronize files first, restart the backup, confirm the VIP has moved, then restart the former master with priority adjustments.

Best practices: monitor both service state and VIP movement, keep detailed incident logs, run regular failover drills, and codify procedures into automated scripts.
