How to Achieve 99.99% Uptime with Keepalived Dual‑Node HA
This guide explains how to design a high‑availability architecture using Keepalived's VRRP‑based active‑passive failover, covering technical features, applicable scenarios, environment requirements, step‑by‑step installation and configuration for services like Nginx, MySQL and Redis, plus best practices, troubleshooting, monitoring and backup strategies.
1. Overview
In modern internet architecture, high availability (HA) is a crucial metric. A single point of failure (SPOF) can cause a complete service outage and severe economic loss; Gartner has estimated that an hour of downtime can cost up to $300,000.
Keepalived is an HA solution based on the Virtual Router Redundancy Protocol (VRRP). It uses a master-backup model to fail over services automatically within about 3 seconds, supporting 99.99% availability (less than 53 minutes of downtime per year).
This article deeply explains Keepalived's working principle, configuration methods, and best‑practice implementations for services such as Nginx, MySQL, and Redis, helping you build a truly HA enterprise‑grade architecture.
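As a quick sanity check on the availability target above, the yearly downtime budget for a given number of nines can be computed directly. A minimal sketch, using the common 525,600-minute (365-day) year:

```shell
#!/bin/bash
# Downtime budget per year for a given availability target.
downtime_minutes() {
    local availability="$1"   # e.g. 0.9999 for four nines
    awk -v a="$availability" 'BEGIN { printf "%.2f", 525600 * (1 - a) }'
}

downtime_minutes 0.9999   # four nines -> 52.56 minutes/year
echo
downtime_minutes 0.999    # three nines -> 525.60 minutes/year
echo
```

This is where the "less than 53 minutes per year" figure comes from.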
1.2 Technical Features
VRRP Protocol: standardized virtual router redundancy (RFC 3768), mature and widely deployed in enterprise networks.
Second-Level Switching: failover completes within 1-3 seconds, virtually transparent to the application.
VIP Drift: the service IP (VIP) moves transparently; clients need no configuration changes.
Health Checks: built-in HTTP, TCP, and script checks monitor service status in real time.
Preempt Mode: optional automatic VIP reclamation when the master recovers.
Multi-Instance Support: one server can run multiple VRRP instances to manage different VIPs.
Simple Configuration: compared with Heartbeat or Pacemaker, Keepalived's configuration is concise and easy to maintain.
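The priority and preempt mechanics behind these features can be illustrated with a toy election. This is a sketch of the rule, not Keepalived code: the node advertising the highest priority becomes master (real VRRP also breaks ties by highest primary IP, omitted here).

```shell
#!/bin/bash
# Toy VRRP election: given "name:priority" pairs, print the winner.
elect_master() {
    local winner="" best=-1 node name prio
    for node in "$@"; do
        name="${node%%:*}"
        prio="${node##*:}"
        if [ "$prio" -gt "$best" ]; then
            best="$prio"
            winner="$name"
        fi
    done
    echo "$winner"
}

elect_master web-master:100 web-backup:90   # -> web-master
```

With preempt disabled (nopreempt), a recovered node skips this re-election and stays backup even though its priority is higher.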
1.3 Applicable Scenarios
Load balancer HA (Nginx, HAProxy)
Database HA (MySQL master‑slave with automatic failover)
Redis Sentinel integration
Web server clusters (Apache, Tomcat)
API gateway HA (Kong, APISIX)
File server HA (NFS, Samba)
1.4 Environment Requirements
Operating System: CentOS 7+ / Ubuntu 18.04+ (supports major Linux distributions)
Keepalived: 2.0.20+ (use the latest stable version)
Kernel: 3.10+ (must support VRRP)
Network: same subnet (master and backup must be on the same L2 network)
Hardware: 2-core / 4 GB+ RAM (production recommends higher)
Service Software: depends on scenario (Nginx, MySQL, Redis, etc.)
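A small pre-flight script can verify the requirements above before installation. A sketch; the version-comparison trick uses GNU `sort -V`, and the 3.10 threshold matches the table:

```shell
#!/bin/bash
# Pre-flight check: compare dotted version strings.
# version_ge A B  ->  success (exit 0) if A >= B
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

KERNEL=$(uname -r | cut -d- -f1)   # e.g. 5.15.0
if version_ge "$KERNEL" "3.10"; then
    echo "kernel $KERNEL OK (>= 3.10)"
else
    echo "kernel $KERNEL too old"
fi
```

The same helper can be reused to check the installed Keepalived version against 2.0.20.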
2. Detailed Steps
2.1 Preparation
2.1.1 System Check
# Check OS version
cat /etc/os-release
uname -r
# Check network configuration
ip addr show
ifconfig -a
# Check hostname and hosts file
hostname
cat /etc/hosts
# Check firewall status
systemctl status firewalld
iptables -L -n
# Check time synchronization (critical)
timedatectl status
ntpq -p
# Check NIC multicast support (required for VRRP)
ip maddr show
2.1.2 Network Planning
# Example topology
# Master node: web-master (192.168.1.10, eth0)
# Backup node: web-backup (192.168.1.11, eth0)
# Virtual IP (VIP): 192.168.1.100
# Append to /etc/hosts on both servers
cat >> /etc/hosts <<EOF
192.168.1.10 web-master
192.168.1.11 web-backup
192.168.1.100 web-vip
EOF
2.1.3 Install Keepalived
# CentOS/RHEL
sudo yum install -y keepalived
# Ubuntu/Debian
sudo apt update
sudo apt install -y keepalived
# Build from source (latest version)
sudo yum install -y gcc openssl-devel libnl3-devel
wget https://www.keepalived.org/software/keepalived-2.2.8.tar.gz
tar -xzf keepalived-2.2.8.tar.gz
cd keepalived-2.2.8
./configure --prefix=/usr/local/keepalived
make && sudo make install
sudo mkdir -p /etc/keepalived
keepalived -v
2.1.4 Firewall Configuration
# Open VRRP protocol (112) in firewalld
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
sudo firewall-cmd --reload
# iptables example
sudo iptables -A INPUT -p vrrp -j ACCEPT
sudo service iptables save
# Allow multicast address (VRRP uses 224.0.0.18)
sudo iptables -A INPUT -d 224.0.0.18/32 -j ACCEPT
2.2 Core Configuration
2.2.1 Keepalived Basic Config (Master)
# /etc/keepalived/keepalived.conf
global_defs {
    router_id web-master
    vrrp_mcast_group4 224.0.0.18
    script_user root
    enable_script_security
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234567890
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
2.2.2 Keepalived Basic Config (Backup)
# /etc/keepalived/keepalived.conf (backup node)
global_defs {
    router_id web-backup
    vrrp_mcast_group4 224.0.0.18
    script_user root
    enable_script_security
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234567890
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
2.2.3 Notification Script
#!/bin/bash
# /etc/keepalived/notify.sh
CONTACT="[email protected]"
SUBJECT="Keepalived State Change Notification"
STATE=$1
MESSAGE="$(hostname) switched to $STATE"
# Log the transition
echo "$(date) - $MESSAGE" >> /var/log/keepalived_notify.log
# Optional email notification
# echo "$MESSAGE" | mail -s "$SUBJECT" $CONTACT
2.2.4 Nginx HA Configuration
Scenario: Two Nginx servers act as load balancers, with Keepalived providing HA.
# Install Nginx on both nodes
sudo yum install -y nginx
# Upstream config (shared)
cat > /etc/nginx/conf.d/upstream.conf <<'EOF'
upstream backend {
server 192.168.1.20:80 weight=1;
server 192.168.1.21:80 weight=1;
server 192.168.1.22:80 weight=1;
}
server {
listen 80;
server_name www.example.com;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
EOF
sudo systemctl start nginx
sudo systemctl enable nginx
2.2.5 Nginx Health Check Script
#!/bin/bash
# /etc/keepalived/check_nginx.sh
if ! pgrep -x nginx > /dev/null; then
    echo "Nginx process missing, trying to start..."
    systemctl start nginx
    sleep 2
    if ! pgrep -x nginx > /dev/null; then
        echo "Nginx restart failed"
        exit 1
    fi
fi
if ! ss -tlnp | grep -q ':80 '; then
    echo "Nginx port 80 not listening"
    exit 1
fi
HTTP_CODE=$(curl -o /dev/null -s -w "%{http_code}" http://127.0.0.1:80)
if [ "$HTTP_CODE" != "200" ]; then
    echo "Nginx health check failed, HTTP code: $HTTP_CODE"
    exit 1
fi
exit 0
To make Keepalived act on this script, reference it from keepalived.conf via a vrrp_script block (with interval, weight, fall, rise) and a track_script entry in the vrrp_instance, following the MySQL pattern in section 2.2.7; without that wiring the script is never executed.
2.2.6 MySQL HA Configuration (Master-Slave)
Scenario: MySQL master‑slave with automatic VIP failover.
# Master MySQL config (mysqld.cnf)
[mysqld]
server-id=1
log-bin=mysql-bin
binlog-format=ROW
expire_logs_days=7
# Create replication user
CREATE USER 'repl'@'192.168.1.%' IDENTIFIED BY 'ReplicationPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;
# Slave MySQL config (mysqld.cnf)
[mysqld]
server-id=2
relay-log=relay-bin
read_only=1
# Set up replication
CHANGE MASTER TO MASTER_HOST='192.168.1.10', MASTER_USER='repl', MASTER_PASSWORD='ReplicationPassword123!', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154;
START SLAVE;
SHOW SLAVE STATUS\G
2.2.7 Keepalived Config for MySQL
# /etc/keepalived/keepalived.conf (MySQL master node)
global_defs {
    router_id mysql-master
    script_user root
    enable_script_security
}

vrrp_script check_mysql {
    script "/etc/keepalived/check_mysql.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    # nopreempt only takes effect with initial state BACKUP;
    # both nodes start as BACKUP and the higher priority wins the first election.
    state BACKUP
    interface eth0
    virtual_router_id 52
    priority 100
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass mysql1234
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
    track_script {
        check_mysql
    }
    notify_master "/etc/keepalived/mysql_master.sh"
    notify_backup "/etc/keepalived/mysql_backup.sh"
}
2.2.8 MySQL Health Check Script
#!/bin/bash
# /etc/keepalived/check_mysql.sh
MYSQL_USER="monitor"
MYSQL_PASS="MonitorPass123!"
# Check process
if ! pgrep -x mysqld > /dev/null; then echo "MySQL process missing"; exit 1; fi
# Check connectivity
mysql -h 127.0.0.1 -P 3306 -u "$MYSQL_USER" -p"$MYSQL_PASS" -e "SELECT 1" > /dev/null 2>&1 || { echo "MySQL cannot connect"; exit 1; }
# If slave, verify replication status (match the exact field names,
# so Slave_IO_Running_State is not picked up by mistake)
if [ "$(hostname)" == "mysql-slave" ]; then
    SLAVE_STATUS=$(mysql -h 127.0.0.1 -P 3306 -u "$MYSQL_USER" -p"$MYSQL_PASS" -e "SHOW SLAVE STATUS\G" 2>/dev/null)
    SLAVE_IO=$(echo "$SLAVE_STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
    SLAVE_SQL=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
    if [ "$SLAVE_IO" != "Yes" ] || [ "$SLAVE_SQL" != "Yes" ]; then
        echo "Replication abnormal"
        exit 1
    fi
fi
exit 0
2.3 Start and Verify
2.3.1 Start Keepalived Service
# Verify config syntax
sudo keepalived -t -f /etc/keepalived/keepalived.conf
# Start service
sudo systemctl start keepalived
sudo systemctl enable keepalived
# Check status
sudo systemctl status keepalived
# View logs
sudo tail -f /var/log/messages | grep Keepalived
2.3.2 Verify VIP
# On master, VIP should be present
ip addr show eth0
# On backup, VIP should not be present
ip addr show eth0
# Ping from client
ping -c 3 192.168.1.100
# Access service
curl http://192.168.1.100
2.3.3 Failover Test
# Stop Keepalived on master
sudo systemctl stop keepalived
# After 3 seconds, check VIP on backup
ip addr show eth0
# View logs for failover
tail -20 /var/log/messages | grep Keepalived
# Stop Nginx on master to trigger health‑check based failover
sudo systemctl stop nginx
# Wait 6 seconds (fall 3 × interval 2s) and verify VIP drift
ip addr show eth0
# Simulate network failure
sudo ifdown eth0 # or block VRRP with iptables
# Backup should take VIP within 3 seconds
# Restore master Keepalived
sudo systemctl start keepalived
# If preempt is enabled, VIP returns to master; otherwise it stays on backup
3. Example Configurations
3.1 Full Configuration Example
3.1.1 Dual‑Master Mode (Mutual Backup)
Scenario: Two servers act as each other's backup, each managing a different VIP for load balancing and HA.
# Node1 /etc/keepalived/keepalived.conf
global_defs {
    router_id node1
    script_user root
    enable_script_security
}

vrrp_script check_service {
    script "/etc/keepalived/check_service.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip1_pass
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip1
    }
    track_script {
        check_service
    }
}

vrrp_instance VI_2 {
    state BACKUP
    interface eth0
    virtual_router_id 52
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip2_pass
    }
    virtual_ipaddress {
        192.168.1.101/24 dev eth0 label eth0:vip2
    }
    track_script {
        check_service
    }
}

# Node2 /etc/keepalived/keepalived.conf (mirror of Node1 with roles swapped)
global_defs {
    router_id node2
    script_user root
    enable_script_security
}

vrrp_script check_service {
    script "/etc/keepalived/check_service.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip1_pass
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip1
    }
    track_script {
        check_service
    }
}

vrrp_instance VI_2 {
    state MASTER
    interface eth0
    virtual_router_id 52
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass vip2_pass
    }
    virtual_ipaddress {
        192.168.1.101/24 dev eth0 label eth0:vip2
    }
    track_script {
        check_service
    }
}
3.1.2 LVS + Keepalived Configuration
Scenario: Use LVS for layer‑4 load balancing, Keepalived for HA and health checks.
# /etc/keepalived/keepalived.conf
global_defs {
    router_id lvs-master
    script_user root
    enable_script_security
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass lvs12345
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0 label eth0:vip
    }
}

virtual_server 192.168.1.100 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 192.168.1.20 80 {
        weight 1
        HTTP_GET {
            url {
                path /health
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.1.21 80 {
        weight 1
        HTTP_GET {
            url {
                path /health
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.1.22 80 {
        weight 1
        HTTP_GET {
            url {
                path /health
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}
3.2 Real-World Case Studies
Case 1: E‑commerce Site Nginx + Keepalived HA
Scenario: Two Nginx load balancers serve 10 M daily page views. Backend consists of 10 Tomcat servers. Requirements include 99.99% uptime, failover < 3 s, gray‑release support, real‑time alerts.
Architecture diagram (simplified): Internet → VIP 192.168.1.100 → Nginx‑Master / Nginx‑Backup → Tomcat pool.
Key steps: identical Nginx config on both nodes, Keepalived master‑backup with VIP, health‑check scripts, DingTalk alert integration, performance tuning (worker_processes, keepalive, connection limits).
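The DingTalk alert step above can be sketched as a small helper that builds the webhook payload. The webhook URL and token are placeholders, and the JSON shape follows DingTalk's text-message robot format (verify against DingTalk's current webhook documentation before relying on it):

```shell
#!/bin/bash
# Sketch: build a DingTalk text-message payload for a state change.
build_alert_payload() {
    local state="$1"
    printf '{"msgtype":"text","text":{"content":"Keepalived on %s switched to %s"}}' \
        "$(hostname)" "$state"
}

# Sending is commented out because it needs a real access token:
# WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=<token>"
# curl -s -H 'Content-Type: application/json' \
#     -d "$(build_alert_payload MASTER)" "$WEBHOOK_URL"
build_alert_payload MASTER
```

Call it from the notify scripts (section 2.2.3) so every MASTER/BACKUP/FAULT transition produces an alert.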
Case 2: Redis Sentinel + Keepalived HA
Scenario: Social platform uses Redis for cache and session storage. Sentinel handles master‑slave failover; Keepalived provides a unified VIP for client access.
Key steps: configure Sentinel, Keepalived script to verify current master, unicast mode for cloud environments, firewall rules for VRRP, monitoring via Prometheus exporter.
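The Keepalived-side check in this case boils down to: ask a Sentinel who the current master is, and only hold the VIP if it is this node. A sketch of the decision logic; in production the input comes from `redis-cli -h <sentinel> -p 26379 SENTINEL get-master-addr-by-name <master-name>`, which prints the master IP on one line and the port on the next (the sample output below is simulated):

```shell
#!/bin/bash
# Decide whether this node should hold the VIP, given Sentinel's
# two-line "IP\nport" answer and this node's own IP.
is_local_master() {
    local sentinel_output="$1" local_ip="$2" master_ip
    master_ip=$(echo "$sentinel_output" | head -n1)
    [ "$master_ip" = "$local_ip" ]
}

SENTINEL_OUTPUT="192.168.1.10
6379"
if is_local_master "$SENTINEL_OUTPUT" "192.168.1.10"; then
    echo "this node is the Redis master, keep the VIP"
fi
```

Returning non-zero from such a check (via vrrp_script) lowers this node's priority so the VIP drifts to the node Sentinel actually promoted.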
4. Best Practices and Precautions
4.1 Performance Optimization
Adjust VRRP advert_int: 1 s for LAN, 3-5 s for WAN links.
Tune health-check parameters: choose interval, weight, and fall/rise values that avoid false positives.
Use non-preempt mode for stateful services (e.g., databases) to prevent the VIP bouncing back and forth.
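These parameters determine how long a failure goes undetected. For VRRP itself, RFC 3768 defines the master-down interval as roughly 3 × advert_int plus a priority-based skew of (256 − priority)/256 seconds; a quick calculation:

```shell
#!/bin/bash
# Estimated VRRP master-down interval (RFC 3768):
#   3 * advert_int + (256 - priority) / 256  seconds
master_down_interval() {
    local advert_int="$1" priority="$2"
    awk -v a="$advert_int" -v p="$priority" \
        'BEGIN { printf "%.2f", 3 * a + (256 - p) / 256 }'
}

master_down_interval 1 90   # backup with priority 90 -> 3.65 seconds
echo
```

This is why advert_int 1 yields the "within 1-3 seconds" (plus skew) failover the article quotes; script-based failover adds fall × interval on top.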
4.2 Security Hardening
Replace default passwords with strong, randomly generated values.
Restrict VRRP multicast via firewall rules.
Run health‑check scripts under a dedicated non‑root user.
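When replacing the default passwords, note that VRRP PASS authentication uses only the first 8 characters of auth_pass (Keepalived truncates longer values). A sketch that generates exactly 8 strong random characters from /dev/urandom:

```shell
#!/bin/bash
# Generate an 8-character random auth_pass; anything beyond 8
# characters is ignored by VRRP PASS authentication.
gen_auth_pass() {
    tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 8
}

PASS=$(gen_auth_pass)
echo "auth_pass $PASS"
```

Use the same generated value on master and backup, or the nodes will ignore each other's advertisements.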
4.3 High‑Availability Design
Deploy multiple NICs for dual‑link redundancy.
Integrate Prometheus + Grafana for real‑time metrics.
5. Troubleshooting and Monitoring
5.1 Log Inspection
# View Keepalived logs
tail -f /var/log/messages | grep Keepalived
# Systemd journal
journalctl -u keepalived -f
# Check VRRP state transitions
grep "MASTER STATE" /var/log/messages
grep "BACKUP STATE" /var/log/messages
# Health‑check logs
tail -f /var/log/keepalived_check.log
5.2 Common Issues
VIP cannot bind: verify the interface name, check for IP conflicts, review firewall rules.
Split-brain: check network connectivity, ensure a unique virtual_router_id, verify the firewall allows VRRP.
Failover not triggered: run the health-check scripts manually, adjust weight/fall values.
Frequent VIP flapping: increase the health-check interval, improve service stability.
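Split-brain can be confirmed from a management host by counting how many nodes currently hold the VIP. The counting logic is shown here as a function over captured `ip addr show` output (in practice you would gather one dump per node, e.g. over ssh); the sample dumps below are simulated:

```shell
#!/bin/bash
# Count how many nodes hold the VIP, given one `ip addr show` dump
# per argument. More than one holder means split-brain.
count_vip_holders() {
    local vip="$1"; shift
    local count=0 dump
    for dump in "$@"; do
        if echo "$dump" | grep -qF "inet $vip/"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

NODE1="inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0
inet 192.168.1.100/24 scope global secondary eth0:vip"
NODE2="inet 192.168.1.11/24 brd 192.168.1.255 scope global eth0"
count_vip_holders 192.168.1.100 "$NODE1" "$NODE2"   # -> 1 (healthy)
```

A count of 0 means no master; a count above 1 means split-brain, which matches the Prometheus alert rules in section 5.4.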
5.3 Monitoring Scripts
#!/bin/bash
# /root/scripts/keepalived_monitor.sh
HOSTNAME=$(hostname)
LOG_FILE="/var/log/keepalived_monitor.log"
# Process check
if ! pgrep -x keepalived > /dev/null; then echo "$(date) - Keepalived not running" >> $LOG_FILE; systemctl start keepalived; fi
# VIP status
VIP="192.168.1.100"
VIP_STATUS=$(ip addr show eth0 | grep "$VIP" | wc -l)
ROLE=$([ "$VIP_STATUS" -eq 1 ] && echo "MASTER" || echo "BACKUP")
echo "$(date) - $HOSTNAME role: $ROLE" >> $LOG_FILE
# Export Prometheus metrics (optional)
cat > /var/lib/node_exporter/keepalived.prom <<EOF
# HELP keepalived_vip_status VIP status (1=MASTER, 0=BACKUP)
# TYPE keepalived_vip_status gauge
keepalived_vip_status{hostname="$HOSTNAME",vip="$VIP"} $VIP_STATUS
# HELP keepalived_process_status Keepalived process status (1=running, 0=stopped)
# TYPE keepalived_process_status gauge
keepalived_process_status{hostname="$HOSTNAME"} $(pgrep -x keepalived > /dev/null && echo 1 || echo 0)
EOF
Schedule the script via cron (every minute) to keep the metrics up to date.
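A minimal crontab entry for that per-minute schedule (the path matches the script above):

```
# crontab -e: run the Keepalived monitor every minute
* * * * * /root/scripts/keepalived_monitor.sh >/dev/null 2>&1
```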
5.4 Prometheus Alert Rules (example)
groups:
  - name: keepalived_alerts
    interval: 30s
    rules:
      - alert: KeepalivedDown
        expr: keepalived_process_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Keepalived process stopped"
          description: "{{ $labels.hostname }} Keepalived has been down for over 1 minute"
      - alert: KeepalivedSplitBrain
        expr: sum(keepalived_vip_status) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keepalived split-brain detected"
          description: "Multiple nodes hold the VIP simultaneously"
      - alert: KeepalivedNoMaster
        expr: sum(keepalived_vip_status) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keepalived has no master"
          description: "No node currently holds the VIP; service unavailable"
6. Summary
6.1 Key Technical Points
The VRRP protocol enables master-backup failover with VIP migration within seconds.
VIP drift provides transparent client access without configuration changes.
Health checks (script, HTTP, TCP) trigger automatic failover.
Choose preempt or non‑preempt mode based on service statefulness.
Prevent split‑brain by proper authentication, unique virtual_router_id, and network monitoring.
Integrate Prometheus/Grafana for real‑time monitoring and alerting.
6.2 Further Learning
Pacemaker + Corosync for multi‑node clustering.
Cloud‑native HA with Kubernetes StatefulSets and Services.
Distributed system design – CAP theorem, Raft/Paxos, consistency models.
6.3 References
Keepalived official documentation.
VRRP RFC 3768.
Linux‑HA project.
Nginx official docs.
MySQL High‑Availability Guide.
Appendix
A. Command Cheat Sheet
# Keepalived service management
systemctl start keepalived # start
systemctl stop keepalived # stop
systemctl restart keepalived # restart
systemctl status keepalived # status
systemctl enable keepalived # enable at boot
# Config syntax check
keepalived -t -f /etc/keepalived/keepalived.conf
# Run in foreground for debugging
keepalived -n -l -D    # --dont-fork, --log-console, --log-detail
# VIP management
ip addr show eth0
ip addr add 192.168.1.100/24 dev eth0
ip addr del 192.168.1.100/24 dev eth0
# Log viewing
tail -f /var/log/messages | grep Keepalived
journalctl -u keepalived -f
# Network diagnostics
tcpdump -i eth0 vrrp
iptables -A INPUT -p vrrp -j ACCEPT
firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept'
B. Configuration Parameter Details
global_defs.router_id: unique identifier, usually the hostname.
global_defs.vrrp_mcast_group4: VRRP multicast address (default 224.0.0.18).
global_defs.script_user: user that runs health-check scripts (non-root recommended).
global_defs.enable_script_security: enables script security checks.
vrrp_instance.state: initial state (MASTER or BACKUP).
vrrp_instance.interface: network interface name.
vrrp_instance.virtual_router_id: identifier (1-255), unique per VRRP domain.
vrrp_instance.priority: priority (1-254); higher wins.
vrrp_instance.advert_int: VRRP advertisement interval in seconds.
vrrp_instance.nopreempt: non-preempt mode; avoids the VIP returning immediately after the master recovers (requires initial state BACKUP on both nodes).
vrrp_script.script: path to the health-check script.
vrrp_script.interval: check interval in seconds.
vrrp_script.weight: priority adjustment on failure (typically negative).
vrrp_script.fall: consecutive failures before marking down.
vrrp_script.rise: consecutive successes before marking up.
C. Glossary
High Availability (HA): a system designed to minimize downtime.
VRRP (Virtual Router Redundancy Protocol): standard protocol for router redundancy (RFC 3768).
VIP (Virtual IP): an IP address that floats between master and backup.
Failover: automatic switch to the backup when the master fails.
Split Brain: multiple nodes believe they are master simultaneously.
Preempt Mode: the master automatically reclaims the VIP after recovery.
Health Check: periodic verification of service health.
Single Point of Failure (SPOF): a component whose failure brings down the whole system.
Four Nines (99.99% availability): less than 52.56 minutes of downtime per year.