How to Achieve 99.99% Uptime with Keepalived Dual‑Node HA
This guide explains how to design a high‑availability architecture using Keepalived's VRRP‑based active‑passive failover, covering technical features, applicable scenarios, environment requirements, step‑by‑step installation and configuration for services like Nginx, MySQL and Redis, plus best practices, troubleshooting, monitoring and backup strategies.
1. Overview
In modern internet architecture, high availability (HA) is a crucial metric. A single point of failure (SPOF) can cause a complete service outage and severe economic loss; Gartner has estimated that an hour of downtime can cost up to $300,000.
Keepalived is an HA solution based on the Virtual Router Redundancy Protocol (VRRP). It uses a master-backup model to fail over services automatically within about 3 seconds, supporting 99.99% availability (less than 53 minutes of downtime per year).
This article deeply explains Keepalived's working principle, configuration methods, and best‑practice implementations for services such as Nginx, MySQL, and Redis, helping you build a truly HA enterprise‑grade architecture.
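As a quick sanity check on the availability target above, the yearly downtime budget for a given number of nines can be computed directly. A minimal sketch, using the common 525,600-minute (365-day) year:

```shell
#!/bin/bash
# Downtime budget per year for a given availability target.
downtime_minutes() {
    local availability="$1"   # e.g. 0.9999 for four nines
    awk -v a="$availability" 'BEGIN { printf "%.2f", 525600 * (1 - a) }'
}

downtime_minutes 0.9999   # four nines -> 52.56 minutes/year
echo
downtime_minutes 0.999    # three nines -> 525.60 minutes/year
echo
```

This is where the "less than 53 minutes per year" figure comes from.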
1.2 Technical Features
VRRP Protocol: standardized virtual router redundancy (RFC 3768), mature and widely deployed in enterprise networks.
Second-Level Switching: failover completes within 1-3 seconds, virtually transparent to the application.
VIP Drift: the service IP (VIP) moves transparently; clients need no configuration changes.
Health Checks: built-in HTTP, TCP, and script checks monitor service status in real time.
Preempt Mode: optional automatic VIP reclamation when the master recovers.
Multi-Instance Support: one server can run multiple VRRP instances to manage different VIPs.
Simple Configuration: compared with Heartbeat or Pacemaker, Keepalived's configuration is concise and easy to maintain.
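The priority and preempt mechanics behind these features can be illustrated with a toy election. This is a sketch of the rule, not Keepalived code: the node advertising the highest priority becomes master (real VRRP also breaks ties by highest primary IP, omitted here).

```shell
#!/bin/bash
# Toy VRRP election: given "name:priority" pairs, print the winner.
elect_master() {
    local winner="" best=-1 node name prio
    for node in "$@"; do
        name="${node%%:*}"
        prio="${node##*:}"
        if [ "$prio" -gt "$best" ]; then
            best="$prio"
            winner="$name"
        fi
    done
    echo "$winner"
}

elect_master web-master:100 web-backup:90   # -> web-master
```

With preempt disabled (nopreempt), a recovered node skips this re-election and stays backup even though its priority is higher.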
1.3 Applicable Scenarios
Load balancer HA (Nginx, HAProxy)
Database HA (MySQL master‑slave with automatic failover)
Redis Sentinel integration
Web server clusters (Apache, Tomcat)
API gateway HA (Kong, APISIX)
File server HA (NFS, Samba)
1.4 Environment Requirements
Operating System: CentOS 7+ / Ubuntu 18.04+ (supports major Linux distributions)
Keepalived: 2.0.20+ (use the latest stable version)
Kernel: 3.10+ (must support VRRP)
Network: same subnet (master and backup must be on the same L2 network)
Hardware: 2-core / 4 GB+ RAM (production recommends higher)
Service Software: depends on scenario (Nginx, MySQL, Redis, etc.)
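A small pre-flight script can verify the requirements above before installation. A sketch; the version-comparison trick uses GNU `sort -V`, and the 3.10 threshold matches the table:

```shell
#!/bin/bash
# Pre-flight check: compare dotted version strings.
# version_ge A B  ->  success (exit 0) if A >= B
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

KERNEL=$(uname -r | cut -d- -f1)   # e.g. 5.15.0
if version_ge "$KERNEL" "3.10"; then
    echo "kernel $KERNEL OK (>= 3.10)"
else
    echo "kernel $KERNEL too old"
fi
```

The same helper can be reused to check the installed Keepalived version against 2.0.20.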
2. Detailed Steps
2.1 Preparation
2.1.1 System Check
# Check OS version
cat /etc/os-release
uname -r
# Check network configuration
ip addr show
ifconfig -a
# Check hostname and hosts file
hostname
cat /etc/hosts
# Check firewall status
systemctl status firewalld
iptables -L -n
# Check time synchronization (critical)
timedatectl status
ntpq -p
# Check NIC multicast support (required for VRRP)
ip maddr show
2.1.2 Network Planning
# Example topology
# Master node: web-master (192.168.1.10, eth0)
# Backup node: web-backup (192.168.1.11, eth0)
# Virtual IP (VIP): 192.168.1.100
# Append to /etc/hosts on both servers
cat >> /etc/hosts <<EOF
192.168.1.10 web-master
192.168.1.11 web-backup
192.168.1.100 web-vip
EOF
2.1.3 Install Keepalived
# CentOS/RHEL
sudo yum install -y keepalived
# Ubuntu/Debian
sudo apt update
sudo apt install -y keepalived
# Build from source (latest version)
sudo yum install -y gcc openssl-devel libnl3-devel
wget https://www.keepalived.org/software/keepalived-2.2.8.tar.gz
tar -xzf keepalived-2.2.8.tar.gz
cd keepalived-2.2.8
./configure --prefix=/usr/local/keepalived
make && sudo make install
sudo mkdir -p /etc/keepalived
keepalived -v
2.1.4 Firewall Configuration
# Open VRRP protocol (112) in firewalld
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
sudo firewall-cmd --reload
# iptables example
sudo iptables -A INPUT -p vrrp -j ACCEPT
sudo service iptables save
# Allow multicast address (VRRP uses 224.0.0.18)
sudo iptables -A INPUT -d 224.0.0.18/32 -j ACCEPT
2.2 Core Configuration
2.2.1 Keepalived Basic Config (Master)
# /etc/keepalived/keepalived.conf
global_defs {
    router_id web-master
    vrrp_mcast_group4 224.0.0.18
    script_user root
    enable_script_security
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234567890
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
2.2.2 Keepalived Basic Config (Backup)
# /etc/keepalived/keepalived.conf (backup node)
global_defs {
    router_id web-backup
    vrrp_mcast_group4 224.0.0.18
    script_user root
    enable_script_security
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234567890
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
2.2.3 Notification Script
#!/bin/bash
# /etc/keepalived/notify.sh
CONTACT="[email protected]"
SUBJECT="Keepalived State Change Notification"
STATE=$1
MESSAGE="$(hostname) switched to $STATE"
# Log the transition
echo "$(date) - $MESSAGE" >> /var/log/keepalived_notify.log
# Optional email notification
# echo "$MESSAGE" | mail -s "$SUBJECT" $CONTACT
2.2.4 Nginx HA Configuration
Scenario: Two Nginx servers act as load balancers, with Keepalived providing HA.
# Install Nginx on both nodes
sudo yum install -y nginx
# Upstream config (shared)
cat > /etc/nginx/conf.d/upstream.conf <<'EOF'
upstream backend {
server 192.168.1.20:80 weight=1;
server 192.168.1.21:80 weight=1;
server 192.168.1.22:80 weight=1;
}
server {
listen 80;
server_name www.example.com;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
EOF
sudo systemctl start nginx
sudo systemctl enable nginx
2.2.5 Nginx Health Check Script
#!/bin/bash
# /etc/keepalived/check_nginx.sh
if ! pgrep -x nginx > /dev/null; then
    echo "Nginx process missing, trying to start..."
    systemctl start nginx
    sleep 2
    if ! pgrep -x nginx > /dev/null; then
        echo "Nginx restart failed"
        exit 1
    fi
fi
if ! ss -tlnp | grep -q ':80 '; then
    echo "Nginx port 80 not listening"
    exit 1
fi
HTTP_CODE=$(curl -o /dev/null -s -w "%{http_code}" http://127.0.0.1:80)
if [ "$HTTP_CODE" != "200" ]; then
    echo "Nginx health check failed, HTTP code: $HTTP_CODE"
    exit 1
fi
exit 0
To make Keepalived act on this script, reference it from keepalived.conf via a vrrp_script block (with interval, weight, fall, rise) and a track_script entry in the vrrp_instance, following the MySQL pattern in section 2.2.7; without that wiring the script is never executed.
2.2.6 MySQL HA Configuration (Master-Slave)
Scenario: MySQL master‑slave with automatic VIP failover.
# Master MySQL config (mysqld.cnf)
[mysqld]
server-id=1
log-bin=mysql-bin
binlog-format=ROW
expire_logs_days=7
# Create replication user
CREATE USER 'repl'@'192.168.1.%' IDENTIFIED BY 'ReplicationPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;
# Slave MySQL config (mysqld.cnf)
[mysqld]
server-id=2
relay-log=relay-bin
read_only=1
# Set up replication
CHANGE MASTER TO MASTER_HOST='192.168.1.10', MASTER_USER='repl', MASTER_PASSWORD='ReplicationPassword123!', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154;
START SLAVE;
SHOW SLAVE STATUS\G
2.2.7 Keepalived Config for MySQL
# /etc/keepalived/keepalived.conf (MySQL master node)
global_defs {
    router_id mysql-master
    script_user root
    enable_script_security
}

vrrp_script check_mysql {
    script "/etc/keepalived/check_mysql.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    # nopreempt only takes effect with initial state BACKUP;
    # both nodes start as BACKUP and the higher priority wins the first election.
    state BACKUP
    interface eth0
    virtual_router_id 52
    priority 100
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass mysql1234
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    track_script {
        check_mysql
    }
    notify_master "/etc/keepalived/mysql_master.sh"
    notify_backup "/etc/keepalived/mysql_backup.sh"
}
2.2.8 MySQL Health Check Script
#!/bin/bash
# /etc/keepalived/check_mysql.sh
MYSQL_USER="monitor"
MYSQL_PASS="MonitorPass123!"
# Check process
if ! pgrep -x mysqld > /dev/null; then echo "MySQL process missing"; exit 1; fi
# Check connectivity
mysql -h 127.0.0.1 -P 3306 -u "$MYSQL_USER" -p"$MYSQL_PASS" -e "SELECT 1" > /dev/null 2>&1 || { echo "MySQL cannot connect"; exit 1; }
# If slave, verify replication status (match the exact field names,
# so Slave_IO_Running_State is not picked up by mistake)
if [ "$(hostname)" == "mysql-slave" ]; then
    SLAVE_STATUS=$(mysql -h 127.0.0.1 -P 3306 -u "$MYSQL_USER" -p"$MYSQL_PASS" -e "SHOW SLAVE STATUS\G" 2>/dev/null)
    SLAVE_IO=$(echo "$SLAVE_STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
    SLAVE_SQL=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
    if [ "$SLAVE_IO" != "Yes" ] || [ "$SLAVE_SQL" != "Yes" ]; then
        echo "Replication abnormal"
        exit 1
    fi
fi
exit 0
2.3 Start and Verify
2.3.1 Start Keepalived Service
# Verify config syntax
sudo keepalived -t -f /etc/keepalived/keepalived.conf
# Start service
sudo systemctl start keepalived
sudo systemctl enable keepalived
# Check status
sudo systemctl status keepalived
# View logs
sudo tail -f /var/log/messages | grep Keepalived
2.3.2 Verify VIP
# On master, VIP should be present
ip addr show eth0
# On backup, VIP should not be present
ip addr show eth0
# Ping from client
ping -c 3 192.168.1.100
# Access service
curl http://192.168.1.100
2.3.3 Failover Test
# Stop Keepalived on master
sudo systemctl stop keepalived
# After 3 seconds, check VIP on backup
ip addr show eth0
# View logs for failover
tail -20 /var/log/messages | grep Keepalived
# Stop Nginx on master to trigger health‑check based failover
sudo systemctl stop nginx
# Wait 6 seconds (fall 3 × interval 2s) and verify VIP drift
ip addr show eth0
# Simulate network failure
sudo ifdown eth0 # or block VRRP with iptables
# Backup should take VIP within 3 seconds
# Restore master Keepalived
sudo systemctl start keepalived
# If preempt is enabled, VIP returns to master; otherwise it stays on backup
3. Example Configurations
3.1 Full Configuration Example
3.1.1 Dual‑Master Mode (Mutual Backup)
Scenario: Two servers act as each other's backup, each managing a different VIP for load balancing and HA.
# Node1 /etc/keepalived/keepalived.conf
global_defs {
    router_id node1
    script_user root
    enable_script_security
}

vrrp_script check_service {
    script "/etc/keepalived/check_service.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip1_pass
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip1
    }
    track_script {
        check_service
    }
}

vrrp_instance VI_2 {
    state BACKUP
    interface eth0
    virtual_router_id 52
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip2_pass
    }
    virtual_ipaddress {
        192.168.1.101/24 dev eth0 label eth0:vip2
    }
    track_script {
        check_service
    }
}

# Node2 /etc/keepalived/keepalived.conf (mirror of Node1 with roles swapped)
global_defs {
    router_id node2
    script_user root
    enable_script_security
}

vrrp_script check_service {
    script "/etc/keepalived/check_service.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip1_pass
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip1
    }
    track_script {
        check_service
    }
}

vrrp_instance VI_2 {
    state MASTER
    interface eth0
    virtual_router_id 52
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip2_pass
    }
    virtual_ipaddress {
        192.168.1.101/24 dev eth0 label eth0:vip2
    }
    track_script {
        check_service
    }
}
3.1.2 LVS + Keepalived Configuration
Scenario: Use LVS for layer‑4 load balancing, Keepalived for HA and health checks.
# /etc/keepalived/keepalived.conf
global_defs {
    router_id lvs-master
    script_user root
    enable_script_security
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass lvs12345
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
}

virtual_server 192.168.1.100 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 192.168.1.20 80 {
        weight 1
        HTTP_GET {
            url {
                path /health
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.1.21 80 {
        weight 1
        HTTP_GET {
            url {
                path /health
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.1.22 80 {
        weight 1
        HTTP_GET {
            url {
                path /health
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}
3.2 Real-World Case Studies
Case 1: E‑commerce Site Nginx + Keepalived HA
Scenario: Two Nginx load balancers serve 10 M daily page views. Backend consists of 10 Tomcat servers. Requirements include 99.99% uptime, failover < 3 s, gray‑release support, real‑time alerts.
Architecture diagram (simplified): Internet → VIP 192.168.1.100 → Nginx‑Master / Nginx‑Backup → Tomcat pool.
Key steps: identical Nginx config on both nodes, Keepalived master‑backup with VIP, health‑check scripts, DingTalk alert integration, performance tuning (worker_processes, keepalive, connection limits).
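The DingTalk alert step above can be sketched as a small helper that builds the webhook payload. The webhook URL and token are placeholders, and the JSON shape follows DingTalk's text-message robot format (verify against DingTalk's current webhook documentation before relying on it):

```shell
#!/bin/bash
# Sketch: build a DingTalk text-message payload for a state change.
build_alert_payload() {
    local state="$1"
    printf '{"msgtype":"text","text":{"content":"Keepalived on %s switched to %s"}}' \
        "$(hostname)" "$state"
}

# Sending is commented out because it needs a real access token:
# WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=<token>"
# curl -s -H 'Content-Type: application/json' \
#     -d "$(build_alert_payload MASTER)" "$WEBHOOK_URL"
build_alert_payload MASTER
```

Call it from the notify scripts (section 2.2.3) so every MASTER/BACKUP/FAULT transition produces an alert.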
Case 2: Redis Sentinel + Keepalived HA
Scenario: Social platform uses Redis for cache and session storage. Sentinel handles master‑slave failover; Keepalived provides a unified VIP for client access.
Key steps: configure Sentinel, Keepalived script to verify current master, unicast mode for cloud environments, firewall rules for VRRP, monitoring via Prometheus exporter.
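The Keepalived-side check in this case boils down to: ask a Sentinel who the current master is, and only hold the VIP if it is this node. A sketch of the decision logic; in production the input comes from `redis-cli -h <sentinel> -p 26379 SENTINEL get-master-addr-by-name <master-name>`, which prints the master IP on one line and the port on the next (the sample output below is simulated):

```shell
#!/bin/bash
# Decide whether this node should hold the VIP, given Sentinel's
# two-line "IP\nport" answer and this node's own IP.
is_local_master() {
    local sentinel_output="$1" local_ip="$2" master_ip
    master_ip=$(echo "$sentinel_output" | head -n1)
    [ "$master_ip" = "$local_ip" ]
}

SENTINEL_OUTPUT="192.168.1.10
6379"
if is_local_master "$SENTINEL_OUTPUT" "192.168.1.10"; then
    echo "this node is the Redis master, keep the VIP"
fi
```

Returning non-zero from such a check (via vrrp_script) lowers this node's priority so the VIP drifts to the node Sentinel actually promoted.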
4. Best Practices and Precautions
4.1 Performance Optimization
Adjust VRRP advert_int: 1 s for LAN, 3-5 s for WAN links.
Tune health-check parameters: choose interval, weight, and fall/rise values that avoid false positives.
Use non-preempt mode for stateful services (e.g., databases) to prevent the VIP bouncing back and forth.
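These parameters determine how long a failure goes undetected. For VRRP itself, RFC 3768 defines the master-down interval as roughly 3 × advert_int plus a priority-based skew of (256 − priority)/256 seconds; a quick calculation:

```shell
#!/bin/bash
# Estimated VRRP master-down interval (RFC 3768):
#   3 * advert_int + (256 - priority) / 256  seconds
master_down_interval() {
    local advert_int="$1" priority="$2"
    awk -v a="$advert_int" -v p="$priority" \
        'BEGIN { printf "%.2f", 3 * a + (256 - p) / 256 }'
}

master_down_interval 1 90   # backup with priority 90 -> 3.65 seconds
echo
```

This is why advert_int 1 yields the "within 1-3 seconds" (plus skew) failover the article quotes; script-based failover adds fall × interval on top.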
4.2 Security Hardening
Replace default passwords with strong, randomly generated values.
Restrict VRRP multicast via firewall rules.
Run health‑check scripts under a dedicated non‑root user.
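When replacing the default passwords, note that VRRP PASS authentication uses only the first 8 characters of auth_pass (Keepalived truncates longer values). A sketch that generates exactly 8 strong random characters from /dev/urandom:

```shell
#!/bin/bash
# Generate an 8-character random auth_pass; anything beyond 8
# characters is ignored by VRRP PASS authentication.
gen_auth_pass() {
    tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 8
}

PASS=$(gen_auth_pass)
echo "auth_pass $PASS"
```

Use the same generated value on master and backup, or the nodes will ignore each other's advertisements.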
4.3 High‑Availability Design
Deploy multiple NICs for dual‑link redundancy.
Integrate Prometheus + Grafana for real‑time metrics.
5. Troubleshooting and Monitoring
5.1 Log Inspection
# View Keepalived logs
tail -f /var/log/messages | grep Keepalived
# Systemd journal
journalctl -u keepalived -f
# Check VRRP state transitions
grep "MASTER STATE" /var/log/messages
grep "BACKUP STATE" /var/log/messages
# Health‑check logs
tail -f /var/log/keepalived_check.log
5.2 Common Issues
VIP cannot bind: verify the interface name, check for IP conflicts, review firewall rules.
Split-brain: check network connectivity, ensure a unique virtual_router_id, verify the firewall allows VRRP.
Failover not triggered: run the health-check scripts manually, adjust weight/fall values.
Frequent VIP flapping: increase the health-check interval, improve service stability.
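Split-brain can be confirmed from a management host by counting how many nodes currently hold the VIP. The counting logic is shown here as a function over captured `ip addr show` output (in practice you would gather one dump per node, e.g. over ssh); the sample dumps below are simulated:

```shell
#!/bin/bash
# Count how many nodes hold the VIP, given one `ip addr show` dump
# per argument. More than one holder means split-brain.
count_vip_holders() {
    local vip="$1"; shift
    local count=0 dump
    for dump in "$@"; do
        if echo "$dump" | grep -qF "inet $vip/"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

NODE1="inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0
inet 192.168.1.100/24 scope global secondary eth0:vip"
NODE2="inet 192.168.1.11/24 brd 192.168.1.255 scope global eth0"
count_vip_holders 192.168.1.100 "$NODE1" "$NODE2"   # -> 1 (healthy)
```

A count of 0 means no master; a count above 1 means split-brain, which matches the Prometheus alert rules in section 5.4.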
5.3 Monitoring Scripts
#!/bin/bash
# /root/scripts/keepalived_monitor.sh
HOSTNAME=$(hostname)
LOG_FILE="/var/log/keepalived_monitor.log"
# Process check
if ! pgrep -x keepalived > /dev/null; then echo "$(date) - Keepalived not running" >> $LOG_FILE; systemctl start keepalived; fi
# VIP status
VIP="192.168.1.100"
VIP_STATUS=$(ip addr show eth0 | grep "$VIP" | wc -l)
ROLE=$([ "$VIP_STATUS" -eq 1 ] && echo "MASTER" || echo "BACKUP")
echo "$(date) - $HOSTNAME role: $ROLE" >> $LOG_FILE
# Export Prometheus metrics (optional)
cat > /var/lib/node_exporter/keepalived.prom <<EOF
# HELP keepalived_vip_status VIP status (1=MASTER, 0=BACKUP)
# TYPE keepalived_vip_status gauge
keepalived_vip_status{hostname="$HOSTNAME",vip="$VIP"} $VIP_STATUS
# HELP keepalived_process_status Keepalived process status (1=running, 0=stopped)
# TYPE keepalived_process_status gauge
keepalived_process_status{hostname="$HOSTNAME"} $(pgrep -x keepalived > /dev/null && echo 1 || echo 0)
EOF
Schedule the script via cron (every minute) to keep the metrics up to date.
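A minimal crontab entry for that per-minute schedule (the path matches the script above):

```
# crontab -e: run the Keepalived monitor every minute
* * * * * /root/scripts/keepalived_monitor.sh >/dev/null 2>&1
```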
5.4 Prometheus Alert Rules (example)
groups:
  - name: keepalived_alerts
    interval: 30s
    rules:
      - alert: KeepalivedDown
        expr: keepalived_process_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Keepalived process stopped"
          description: "{{ $labels.hostname }} Keepalived has been down for over 1 minute"
      - alert: KeepalivedSplitBrain
        expr: sum(keepalived_vip_status) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keepalived split-brain detected"
          description: "Multiple nodes hold the VIP simultaneously"
      - alert: KeepalivedNoMaster
        expr: sum(keepalived_vip_status) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keepalived has no master"
          description: "No node currently holds the VIP; service unavailable"
6. Summary
6.1 Key Technical Points
The VRRP protocol enables master-backup failover with VIP migration within seconds.
VIP drift provides transparent client access without configuration changes.
Health checks (script, HTTP, TCP) trigger automatic failover.
Choose preempt or non‑preempt mode based on service statefulness.
Prevent split‑brain by proper authentication, unique virtual_router_id, and network monitoring.
Integrate Prometheus/Grafana for real‑time monitoring and alerting.
6.2 Further Learning
Pacemaker + Corosync for multi‑node clustering.
Cloud‑native HA with Kubernetes StatefulSets and Services.
Distributed system design – CAP theorem, Raft/Paxos, consistency models.
6.3 References
Keepalived official documentation.
VRRP RFC 3768.
Linux‑HA project.
Nginx official docs.
MySQL High‑Availability Guide.
Appendix
A. Command Cheat Sheet
# Keepalived service management
systemctl start keepalived # start
systemctl stop keepalived # stop
systemctl restart keepalived # restart
systemctl status keepalived # status
systemctl enable keepalived # enable at boot
# Config syntax check
keepalived -t -f /etc/keepalived/keepalived.conf
# Run in foreground for debugging
keepalived -n -l -D    # --dont-fork, --log-console, --log-detail
# VIP management
ip addr show eth0
ip addr add 192.168.1.100/24 dev eth0
ip addr del 192.168.1.100/24 dev eth0
# Log viewing
tail -f /var/log/messages | grep Keepalived
journalctl -u keepalived -f
# Network diagnostics
tcpdump -i eth0 vrrp
iptables -A INPUT -p vrrp -j ACCEPT
firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept'
B. Configuration Parameter Details
global_defs.router_id: unique identifier, usually the hostname.
global_defs.vrrp_mcast_group4: VRRP multicast address (default 224.0.0.18).
global_defs.script_user: user that runs health-check scripts (non-root recommended).
global_defs.enable_script_security: enables script security checks.
vrrp_instance.state: initial state (MASTER or BACKUP).
vrrp_instance.interface: network interface name.
vrrp_instance.virtual_router_id: identifier (1-255), unique per VRRP domain.
vrrp_instance.priority: priority (1-254); higher wins.
vrrp_instance.advert_int: VRRP advertisement interval in seconds.
vrrp_instance.nopreempt: non-preempt mode; avoids the VIP returning immediately after the master recovers (requires initial state BACKUP on both nodes).
vrrp_script.script: path to the health-check script.
vrrp_script.interval: check interval in seconds.
vrrp_script.weight: priority adjustment on failure (typically negative).
vrrp_script.fall: consecutive failures before marking down.
vrrp_script.rise: consecutive successes before marking up.
C. Glossary
High Availability (HA): a system designed to minimize downtime.
VRRP (Virtual Router Redundancy Protocol): standard protocol for router redundancy (RFC 3768).
VIP (Virtual IP): an IP address that floats between master and backup.
Failover: automatic switch to the backup when the master fails.
Split Brain: multiple nodes believe they are master simultaneously.
Preempt Mode: the master automatically reclaims the VIP after recovery.
Health Check: periodic verification of service health.
Single Point of Failure (SPOF): a component whose failure brings down the whole system.
Four Nines (99.99% availability): less than 52.56 minutes of downtime per year.