Avoid 3 Fatal Nginx+Keepalived HA Pitfalls That 90% of Ops Engineers Miss
This article reveals three hidden traps in Nginx‑Keepalived high‑availability setups—network‑partition split‑brain, inadequate health‑check scripts, and unsafe configuration‑sync timing—explains real incidents caused by each, and provides concrete configuration changes, Bash scripts, and automation tips to prevent service outages.
During a 3 a.m. production incident the author watched a misconfigured Nginx + Keepalived HA cluster cause a complete service outage, exposing three subtle pitfalls that are hard to reproduce in test environments.
Pitfall 1: Split‑brain caused by network partition
Problem description
Many operators only guard against heartbeat loss, overlooking the case where a network partition leaves both nodes convinced they are master, so the same VIP ends up active on two machines at once.
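A quick way to confirm a suspected split-brain is to ask the network itself: from a third host on the same segment, count how many machines answer ARP for the VIP. A minimal sketch, reusing the VIP from the configurations below; the MAC extraction assumes the replying MAC appears colon-separated in arping's output, which holds for both common arping implementations:
#!/bin/bash
# Run from a third host on the same L2 segment, not from either cluster node.
VIP="192.168.1.100"
IFACE="eth0"
# Collect the distinct MAC addresses that reply to ARP for the VIP.
MACS=$(arping -c 3 -I "$IFACE" "$VIP" 2>/dev/null | grep -oE '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' | sort -u)
COUNT=$(echo "$MACS" | grep -c .)
if [ "$COUNT" -gt 1 ]; then
    echo "SPLIT BRAIN: $COUNT hosts answer for $VIP"
    echo "$MACS"
else
    echo "OK: $VIP is held by a single MAC"
fi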
Real‑world example
# Apparent normal configuration
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.100
}
}
In multi-NIC or complex network environments this configuration can leave both nodes holding the VIP simultaneously, causing session inconsistency.
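In those multi-NIC environments it is also worth verifying, per interface, that VRRP advertisements actually reach the backup; if the adverts travel over only one link, losing that link partitions the cluster even while the nodes can still reach each other elsewhere. A quick check, with interface names assumed from the config above:
# On the BACKUP node: VRRP advertisements are multicast to 224.0.0.18 as IP protocol 112.
# Silence here while the master is up means this link cannot carry the heartbeat.
tcpdump -i eth0 -nn 'ip proto 112'
# Repeat for every interface that is supposed to see the adverts, e.g. eth1.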
Robust solution
# Prevent split‑brain with full configuration
vrrp_instance VI_1 {
state BACKUP # both nodes start as BACKUP
interface eth0
virtual_router_id 51
priority 100 # master = 100, backup = 90
advert_int 1
nopreempt # disable pre‑emptive takeover
authentication {
auth_type PASS
auth_pass your_complex_password_here # note: keepalived truncates auth_pass to 8 characters
}
track_script {
chk_nginx
chk_network
}
notify_master "/etc/keepalived/scripts/check_split_brain.sh"
virtual_ipaddress {
192.168.1.100
}
}
vrrp_script chk_nginx {
script "/etc/keepalived/scripts/check_nginx.sh"
interval 2
weight -2
fall 3
rise 2
}
vrrp_script chk_network {
script "/etc/keepalived/scripts/check_network.sh"
interval 5
weight -2
fall 2
rise 1
}
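The track_script block above references check_network.sh, which the article never shows (check_nginx.sh is presumably the health-check script from Pitfall 2 below). A minimal illustrative sketch, assuming the policy that a node unable to reach its default gateway should shed the VIP:
#!/bin/bash
# /etc/keepalived/scripts/check_network.sh (illustrative sketch, not the author's original)
# Exit 0 if the default gateway answers one ping, non-zero otherwise;
# keepalived then applies the configured negative weight to this node's priority.
GATEWAY=$(ip route | awk '/^default/ {print $3; exit}')
ping -c 1 -W 1 "$GATEWAY" >/dev/null 2>&1
The accompanying split-brain detection script checks whether the peer still holds the VIP and, if so, releases it and sends an alert: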
#!/bin/bash
REMOTE_IP="192.168.1.11"
VIP="192.168.1.100"
# Check peer reachability
ping -c 1 -W 1 $REMOTE_IP >/dev/null 2>&1
if [ $? -eq 0 ]; then
# Peer reachable, verify VIP presence
ssh -o ConnectTimeout=2 -o StrictHostKeyChecking=no $REMOTE_IP "ip addr show | grep $VIP" >/dev/null 2>&1
if [ $? -eq 0 ]; then
logger "CRITICAL: Split brain detected! Releasing VIP..."
ip addr del $VIP/24 dev eth0
curl -X POST "your_alert_webhook" -d "Split brain detected on $(hostname)"
exit 1
fi
fi
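Note that the detection script depends on passwordless root SSH between the two nodes; if that trust is broken, split-brain goes undetected exactly when it matters. A quick pre-flight check, using the peer address from the script above (swap in the other node's IP when testing from the peer):
# BatchMode makes ssh fail fast instead of hanging on a password prompt.
ssh -o BatchMode=yes -o ConnectTimeout=2 root@192.168.1.11 true && echo "SSH trust OK"
Pitfall 2: Health-check scripts that miss zombie processes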
Problem description
Most health checks only verify that a process exists, not that the service actually answers traffic. The Nginx master process can stay alive while its workers sit defunct, so the process check keeps passing even though no request is ever served.
Typical faulty script
#!/bin/bash
# Bad example: only checks that a process named nginx exists
ps -ef | grep nginx | grep -v grep
if [ $? -ne 0 ]; then
    exit 1
fi
In production this led to a situation where the Nginx workers became zombies, the health check kept reporting success, Keepalived never triggered failover, and every user request failed.
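The naive check passes because it tests process existence, not process state: a defunct worker still appears in ps output. A small sketch that inspects the state field instead, assuming a procps-style ps:
#!/bin/bash
# Count nginx processes stuck in Z (zombie/defunct) state.
# ps -o stat= prints the state column without a header; a leading 'Z' means defunct.
ZOMBIES=$(ps -C nginx -o stat= | grep -c '^Z')
if [ "$ZOMBIES" -gt 0 ]; then
    logger "nginx has $ZOMBIES defunct worker(s)"
    exit 1
fi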
Perfect health‑check solution
#!/bin/bash
# 1. Verify master process exists
NGINX_PID=$(ps -ef | grep "nginx: master" | grep -v grep | awk '{print $2}')
VIP="192.168.1.100"
CHECK_URL="http://127.0.0.1/health"
if [ -z "$NGINX_PID" ]; then
logger "Nginx master process not found"
exit 1
fi
# 2. Verify port 80 is listening
netstat -tlnp | grep ":80 " | grep nginx >/dev/null 2>&1
if [ $? -ne 0 ]; then
logger "Nginx port 80 not listening"
exit 1
fi
# 3. Verify configuration syntax
nginx -t >/dev/null 2>&1
if [ $? -ne 0 ]; then
logger "Nginx configuration syntax error"
exit 1
fi
# 4. Real HTTP request check
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 $CHECK_URL)
if [ "$HTTP_CODE" != "200" ]; then
logger "Nginx health check failed, HTTP code: $HTTP_CODE"
systemctl restart nginx
sleep 2
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 $CHECK_URL)
if [ "$HTTP_CODE" != "200" ]; then
logger "Nginx restart failed, triggering failover"
exit 1
fi
fi
# 5. System load check
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$LOAD > 10" | bc -l) )); then
logger "System load too high: $LOAD"
exit 1
fi
# 6. Memory usage check
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
logger "Memory usage too high: $MEM_USAGE%"
exit 1
fi
logger "Nginx health check passed"
exit 0
A matching Nginx location provides the health endpoint:
# Simple health endpoint
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
# Detailed JSON health endpoint (requires lua-nginx-module)
location /health/detailed {
access_log off;
content_by_lua_block {
local json = require "cjson"
local health_data = {
status = "healthy",
timestamp = ngx.time(),
connections = {
active = ngx.var.connections_active,
reading = ngx.var.connections_reading,
writing = ngx.var.connections_writing,
waiting = ngx.var.connections_waiting,
}
}
ngx.say(json.encode(health_data))
}
}
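Before wiring these endpoints into the health-check script, confirm they respond as expected from the node itself:
# Quick local verification of both endpoints:
curl -i http://127.0.0.1/health            # expect HTTP 200 with body "healthy"
curl -s http://127.0.0.1/health/detailed   # expect a small JSON document with connection counters
Pitfall 3: Unsafe configuration-sync timing causing service domino effect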
Problem description
When updating Nginx configuration on a two-node Keepalived cluster, restarting the master first makes its health check fail, so Keepalived moves the VIP to the backup, which may still be running the old configuration; clients then hit stale or inconsistent config and see 500 errors.
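For pure configuration changes a graceful reload is usually safer than a full restart: the master process stays up, so the keepalived health check keeps passing, and old workers drain in-flight requests before exiting. A minimal pattern:
# Validate the new configuration, then reload without killing the master process.
# Old workers finish their current requests; new workers start with the new config.
nginx -t && nginx -s reload
The workflow below covers the cases where a full restart is unavoidable.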
Safe update workflow
#!/bin/bash
# Safe Nginx configuration update script
MASTER_IP="192.168.1.10"
BACKUP_IP="192.168.1.11"
CONFIG_FILE="/etc/nginx/nginx.conf"
VIP="192.168.1.100"
is_master() {
ip addr show | grep $VIP >/dev/null 2>&1
return $?
}
sync_config() {
local target_ip=$1
echo "Syncing config to $target_ip..."
scp $CONFIG_FILE root@$target_ip:$CONFIG_FILE
ssh root@$target_ip "nginx -t"
if [ $? -ne 0 ]; then
echo "Configuration syntax error on $target_ip"
return 1
fi
return 0
}
safe_restart_nginx() {
is_master
if [ $? -eq 0 ]; then
echo "Current node is MASTER, performing graceful restart..."
sed -i 's/priority 100/priority 50/' /etc/keepalived/keepalived.conf
systemctl reload keepalived
sleep 5
for i in {1..10}; do
is_master
if [ $? -ne 0 ]; then
echo "VIP switched successfully"
break
fi
echo "Waiting for VIP switch... ($i/10)"
sleep 2
done
systemctl restart nginx
if [ $? -eq 0 ] && curl -s http://127.0.0.1/health >/dev/null; then
echo "Nginx restarted successfully"
sed -i 's/priority 50/priority 100/' /etc/keepalived/keepalived.conf
systemctl reload keepalived
else
echo "Nginx restart failed!"
return 1
fi
else
echo "Current node is BACKUP, restarting nginx directly..."
systemctl restart nginx
if [ $? -ne 0 ]; then
echo "Nginx restart failed on backup!"
return 1
fi
fi
return 0
}
main() {
is_master
if [ $? -eq 0 ]; then
echo "Running on MASTER node"
other_node=$BACKUP_IP
else
echo "Running on BACKUP node"
other_node=$MASTER_IP
fi
echo "Step 1: Syncing configuration to peer node..."
sync_config $other_node || { echo "Configuration sync failed!"; exit 1; }
echo "Step 2: Restarting nginx on peer node..."
ssh root@$other_node "systemctl restart nginx" || { echo "Failed to restart nginx on peer node!"; exit 1; }
ssh root@$other_node "curl -s http://127.0.0.1/health" >/dev/null || { echo "Peer node health check failed!"; exit 1; }
echo "Step 3: Restarting nginx on current node..."
safe_restart_nginx || { echo "Failed to restart nginx on current node!"; exit 1; }
echo "Configuration update completed successfully!"
echo "Final verification..."
curl -s http://$VIP/health >/dev/null && echo "✅ All services are healthy!" || { echo "❌ Service verification failed!"; exit 1; }
}
main
Automation hook example (GitLab CI)
# GitLab CI job to deploy configuration safely
deploy_nginx_config:
stage: deploy
script:
- echo "Deploying nginx configuration..."
- ansible-playbook -i inventory/production nginx_update.yml
only:
- master
when: manual
Conclusion
Split-brain protection: multi-layer health checks, disabled preemptive takeover (nopreempt), and a dedicated detection script.
Robust health checks: verify process, port, config syntax, real HTTP response, system load, and memory usage; auto-restart on failure.
Safe config sync: synchronize files first, restart the backup, confirm the VIP has moved, then restart the former master with priority adjustments.
Best practices: monitor both service state and VIP movement, keep detailed incident logs, run regular failover drills, and codify procedures into automated scripts.