Avoid 3 Hidden Nginx+Keepalived HA Pitfalls That 90% of Ops Encounter
This article reveals three hard‑to‑detect pitfalls in Nginx + Keepalived high‑availability setups—split‑brain caused by network partitions, inadequate health‑check scripts, and unsafe configuration‑sync timing—provides real‑world incident examples, and offers complete, battle‑tested solutions with ready‑to‑use scripts.
Nginx+Keepalived High‑Availability Architecture: 3 Hidden Pitfalls and How to Avoid Them
Hard-won lessons: three fatal traps distilled from real production incidents. Reading this could save you years of troubleshooting.
Preface: A 3 AM Production Outage
At 3 AM the monitoring alarm screamed “service unavailable! Users cannot access!” Our Nginx+Keepalived HA cluster failed, both master and backup nodes went down, and the whole business system collapsed. After an all‑night investigation I discovered three hidden traps that are almost impossible to reproduce in a test environment but cause massive loss in production.
Pitfall 1: Split‑Brain – The Invisible Killer Caused by Network Partition
Problem Description
Many operators only guard against heartbeat loss and ignore the more subtle double‑master situation caused by network segmentation.
Real‑World Example
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.100
}
}
This configuration looks fine on a single‑NIC host, but in a multi‑NIC or complex network it can lead to a fatal split‑brain.
What Happened
Master node thought the backup was dead and kept the VIP.
Backup node also thought the master was dead and grabbed the VIP.
Two machines now owned the same VIP, causing session inconsistency and data loss.
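One way to confirm the double-master state described above is to compare the address listings of both nodes. The sketch below assumes the output of `ip -o -4 addr show` has already been captured from each node (for example over SSH). `holds_vip` and `both_hold_vip` are hypothetical helper names, not part of keepalived.

```shell
#!/bin/bash
# Sketch only: holds_vip and both_hold_vip are hypothetical helpers.

# Return 0 if the captured `ip -o -4 addr show` output lists the VIP.
# $1 = captured command output, $2 = VIP without prefix length
holds_vip() {
    local addr_output="$1" vip="$2"
    # Match " inet <vip>/" as a fixed string so that 192.168.1.100
    # is not confused with 192.168.1.10 or 192.168.1.1
    printf '%s\n' "$addr_output" | grep -qF " inet ${vip}/"
}

# Return 0 only if both captured listings contain the VIP (split-brain).
# $1 = output from node A, $2 = output from node B, $3 = VIP
both_hold_vip() {
    holds_vip "$1" "$3" && holds_vip "$2" "$3"
}
```

A caller could run `both_hold_vip "$local_out" "$remote_out" 192.168.1.100` and raise an alert when it returns 0. Note that this only works while the nodes can still reach each other; a full partition has to be detected from a third vantage point.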
Perfect Solution
# Prevent split‑brain – full configuration
vrrp_instance VI_1 {
state BACKUP # both nodes set to BACKUP
interface eth0
virtual_router_id 51
priority 100 # master 100, backup 90
advert_int 1
nopreempt # disable pre‑emptive takeover
authentication {
auth_type PASS
auth_pass your_complex_password_here
}
track_script {
chk_nginx
chk_network
}
notify_master "/etc/keepalived/scripts/check_split_brain.sh"
virtual_ipaddress {
192.168.1.100
}
}
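vrrp_script chk_nginx {
script "/etc/keepalived/scripts/check_nginx.sh"
interval 2
# No "weight" here: with the default weight of 0, a failed check moves the
# instance to FAULT and releases the VIP. A weight of -2 would only lower
# priority 100 to 98, still above the backup's 90, and nopreempt would
# block a priority-based takeover anyway.
fall 3
rise 2
}
vrrp_script chk_network {
script "/etc/keepalived/scripts/check_network.sh"
interval 5
fall 2
rise 1
}
Split‑brain detection script (check_split_brain.sh):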
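#!/bin/bash
# Split‑brain detection script, run by keepalived through notify_master
REMOTE_IP="192.168.1.11"
VIP="192.168.1.100"
IFACE="eth0"
# Only meaningful while the peer is reachable; a full partition cannot be detected from here
if ping -c 1 -W 1 "$REMOTE_IP" >/dev/null 2>&1; then
    # BatchMode keeps ssh from hanging on a password prompt
    if ssh -o BatchMode=yes -o ConnectTimeout=2 -o StrictHostKeyChecking=no "$REMOTE_IP" \
        "ip -o addr show | grep -qF ' $VIP/'" >/dev/null 2>&1; then
        logger "CRITICAL: Split brain detected! Releasing VIP..."
        # The prefix must match the one keepalived used; a VIP configured without a mask is added as /32
        ip addr del "$VIP/32" dev "$IFACE"
        curl -X POST "your_alert_webhook" -d "Split brain detected on $(hostname)"
        exit 1
    fi
fi
Pitfall 2: Health‑Check Defect – Zombie‑Process Trap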
Problem Description
Many health checks only verify that the nginx process exists and never test whether the service can actually answer requests.
Typical Wrong Script
# Bad example – most people write this
#!/bin/bash
ps -ef | grep nginx | grep -v grep
if [ $? -ne 0 ]; then
exit 1
fi
This script passes even if the nginx worker processes are zombies and cannot serve requests.
Real Incident
In production, an nginx worker became a zombie due to a memory leak. The master process kept running, but the node could no longer handle traffic. The health‑check script still reported success, so keepalived did not fail over and all user requests failed.
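To make the failure mode concrete, the sketch below counts processes that `ps` reports in the Z (defunct) state. A plain `ps | grep nginx` check still sees those entries and passes. `count_defunct` is a hypothetical helper that reads `ps -eo stat=,comm=` style lines on standard input; it is an illustration, not a replacement for a real HTTP probe, because a hung but still running worker does not show up as Z.

```shell
#!/bin/bash
# Sketch only: count_defunct is a hypothetical helper.
# Reads lines of "<STAT> <COMMAND> ..." on stdin (as printed by
# `ps -eo stat=,comm=`) and prints how many processes named $1
# are in the Z (zombie/defunct) state.
count_defunct() {
    local name="$1"
    awk -v name="$name" '$1 ~ /^Z/ && $2 == name { n++ } END { print n + 0 }'
}

# Typical use on a live system (not executed here):
#   ps -eo stat=,comm= | count_defunct nginx
```

A non-zero count is a useful warning signal, but the decisive test remains an HTTP request against the service itself.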
Perfect Health‑Check Script
#!/bin/bash
# pgrep -f matches the process title that nginx sets for its master process
NGINX_PID=$(pgrep -f "nginx: master")
CHECK_URL="http://127.0.0.1/health"
# 1. Process exists
if [ -z "$NGINX_PID" ]; then
logger "Nginx master process not found"
exit 1
fi
# 2. Port listening
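# ss replaces netstat, which many current distributions no longer install by default;
# -p needs root privileges to show the owning process
if ! ss -ltnp | grep ':80 ' | grep -q nginx; then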
logger "Nginx port 80 not listening"
exit 1
fi
# 3. Config syntax
nginx -t >/dev/null 2>&1
if [ $? -ne 0 ]; then
logger "Nginx configuration syntax error"
exit 1
fi
# 4. Real HTTP check
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 $CHECK_URL)
if [ "$HTTP_CODE" != "200" ]; then
logger "Nginx health check failed, HTTP code: $HTTP_CODE"
systemctl restart nginx
sleep 2
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 5 $CHECK_URL)
if [ "$HTTP_CODE" != "200" ]; then
logger "Nginx restart failed, triggering failover"
exit 1
fi
fi
# 5. System load
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$LOAD > 10" | bc -l) )); then
logger "System load too high: $LOAD"
exit 1
fi
# 6. Memory usage
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
logger "Memory usage too high: $MEM_USAGE%"
exit 1
fi
logger "Nginx health check passed"
exit 0
Corresponding nginx health endpoint:
# Simple health endpoint
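location /health {
access_log off;
# default_type sets the Content-Type of the body sent by "return";
# add_header Content-Type would emit a second, duplicate header
default_type text/plain;
return 200 "healthy\n";
}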
# Detailed health endpoint (Lua)
location /health/detailed {
access_log off;
content_by_lua_block {
local json = require "cjson"
local health_data = {
status = "healthy",
timestamp = ngx.time(),
connections = {
active = ngx.var.connections_active,
reading = ngx.var.connections_reading,
writing = ngx.var.connections_writing,
waiting = ngx.var.connections_waiting,
}
}
ngx.say(json.encode(health_data))
}
}
Pitfall 3: Configuration Sync Timing – Domino Effect During Service Restart
Problem Description
If the restart order of master and backup nodes is not coordinated, the VIP may switch to a node that still runs the old configuration, causing 500 errors.
Incident Replay
Update master configuration and restart nginx.
Update backup configuration and restart nginx.
When the master restarts, keepalived moves the VIP to the backup, but the backup still has the old config, so requests hit a non‑existent upstream and fail.
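The replay above comes down to the two nodes running different configurations at the moment the VIP moves. A cheap guard is to compare checksums before any restart. The sketch below assumes the peer's file has already been copied to a local path (for example with `scp`); `same_config` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: same_config is a hypothetical helper.
# Returns 0 if both files have the same SHA-256 digest,
# 1 if they differ, 2 if either file cannot be read.
same_config() {
    local a b
    a=$(sha256sum < "$1") || return 2
    b=$(sha256sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Typical use (paths are assumptions, not executed here):
#   scp root@192.168.1.11:/etc/nginx/nginx.conf /tmp/peer-nginx.conf
#   same_config /etc/nginx/nginx.conf /tmp/peer-nginx.conf || echo "configs differ"
```

Reading the files through standard input keeps the file name out of the `sha256sum` output, so only the digests are compared.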
Perfect Update Procedure
#!/bin/bash
MASTER_IP="192.168.1.10"
BACKUP_IP="192.168.1.11"
CONFIG_FILE="/etc/nginx/nginx.conf"
VIP="192.168.1.100"
is_master() {
ip addr show | grep $VIP >/dev/null 2>&1
return $?
}
sync_config() {
local target_ip=$1
echo "Syncing config to $target_ip..."
scp $CONFIG_FILE root@$target_ip:$CONFIG_FILE
ssh root@$target_ip "nginx -t"
if [ $? -ne 0 ]; then
echo "Configuration syntax error on $target_ip"
return 1
fi
return 0
}
safe_restart_nginx() {
is_master
local is_current_master=$?
if [ $is_current_master -eq 0 ]; then
echo "Current node is MASTER, performing graceful restart..."
echo "Decreasing VRRP priority..."
sed -i 's/priority 100/priority 50/' /etc/keepalived/keepalived.conf
systemctl reload keepalived
sleep 5
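local vip_moved=1
for i in {1..10}; do
if ! is_master; then
echo "VIP switched successfully"
vip_moved=0
break
fi
echo "Waiting for VIP switch... ($i/10)"
sleep 2
done
# Do not restart nginx while this node still serves the VIP
if [ $vip_moved -ne 0 ]; then
echo "VIP did not move to the peer, aborting restart"
sed -i 's/priority 50/priority 100/' /etc/keepalived/keepalived.conf
systemctl reload keepalived
return 1
fi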
systemctl restart nginx
if [ $? -eq 0 ] && curl -s http://127.0.0.1/health >/dev/null; then
echo "Nginx restarted successfully"
sed -i 's/priority 50/priority 100/' /etc/keepalived/keepalived.conf
systemctl reload keepalived
else
echo "Nginx restart failed!"
return 1
fi
else
echo "Current node is BACKUP, restarting nginx directly..."
systemctl restart nginx
if [ $? -ne 0 ]; then
echo "Nginx restart failed on backup!"
return 1
fi
fi
return 0
}
main() {
echo "Starting safe nginx configuration update..."
is_master
local current_master=$?
if [ $current_master -eq 0 ]; then
echo "Running on MASTER node"
other_node=$BACKUP_IP
else
echo "Running on BACKUP node"
other_node=$MASTER_IP
fi
echo "Step 1: Syncing configuration to peer node..."
sync_config $other_node || { echo "Configuration sync failed!"; exit 1; }
echo "Step 2: Restarting nginx on peer node..."
ssh root@$other_node "systemctl restart nginx" || { echo "Failed to restart nginx on peer node!"; exit 1; }
ssh root@$other_node "curl -s http://127.0.0.1/health" >/dev/null || { echo "Peer node health check failed!"; exit 1; }
echo "Step 3: Restarting nginx on current node..."
safe_restart_nginx || { echo "Failed to restart nginx on current node!"; exit 1; }
echo "Configuration update completed successfully!"
echo "Final verification..."
curl -s http://$VIP/health >/dev/null && echo "✅ All services are healthy!" || { echo "❌ Service verification failed!"; exit 1; }
}
main
Conclusion: From Pitfalls to Mastery
After years of battling these issues I’ve learned that details decide success; prevention beats cure.
Key Takeaways
Split‑brain protection: multi‑layer detection, intelligent failover, real‑time monitoring.
Health checks: real service verification, system‑resource monitoring, automatic remediation.
Config sync: safe sequencing, graceful switch, automatic rollback.
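The "automatic rollback" item can be sketched as a small wrapper that installs a new file, runs a validation command, and restores the previous file if validation fails. `apply_with_rollback` is a hypothetical helper, and the validation command is whatever the caller passes in (for nginx that would typically be `nginx -t`).

```shell
#!/bin/bash
# Sketch only: apply_with_rollback is a hypothetical helper.
# $1 = new config file, $2 = live config path, remaining args = validation command.
# Returns 0 on success, 1 if validation failed and the old file was restored,
# 2 if the files could not be copied.
apply_with_rollback() {
    local new="$1" live="$2"
    shift 2
    local backup="${live}.bak.$$"
    cp -p "$live" "$backup" || return 2
    cp "$new" "$live" || { rm -f "$backup"; return 2; }
    if "$@"; then
        rm -f "$backup"
        return 0
    fi
    # Validation failed: put the previous configuration back
    mv "$backup" "$live"
    return 1
}

# Typical use (not executed here):
#   apply_with_rollback ./nginx.conf.new /etc/nginx/nginx.conf nginx -t
```

Validating after the copy matters because `nginx -t` checks the live path, including any files it includes.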
Best‑Practice Recommendations
Monitoring & Alerts: monitor not only service status but also VIP switch events.
Documentation: record every incident to build a knowledge base.
Regular Drills: conduct at least monthly failover rehearsals.
Automation: encode operational experience into reusable scripts.
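For the first recommendation, VIP switch events can be captured with a keepalived notify script. keepalived passes the script four arguments: the type (GROUP or INSTANCE), the name, the new state, and the priority. The sketch below turns those into one syslog line per transition. `transition_severity` and `format_transition` are hypothetical helper names, and the syslog tag is an assumption.

```shell
#!/bin/bash
# Sketch of a keepalived notify script. keepalived calls it with:
#   $1 = GROUP or INSTANCE, $2 = name, $3 = new state, $4 = priority
# transition_severity and format_transition are hypothetical helpers.

# Map the new state to a syslog severity.
transition_severity() {
    case "$1" in
        FAULT)         echo crit ;;
        MASTER|BACKUP) echo warning ;;
        *)             echo notice ;;
    esac
}

# Render one log line for a state change.
format_transition() {
    printf '%s %s -> %s (priority %s)\n' "$1" "$2" "$3" "${4:-unknown}"
}

# Only log when called with the arguments keepalived supplies.
if [ "$#" -ge 3 ]; then
    logger -t keepalived-notify -p "daemon.$(transition_severity "$3")" \
        "$(format_transition "$@")"
fi
```

It could be wired in with a `notify` line inside the `vrrp_instance` block, pointing at wherever the script is saved; an alerting rule on the `keepalived-notify` tag then covers every MASTER, BACKUP, and FAULT transition.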
MaGe Linux Operations
