Zero-Downtime Failover Explained: The Core Secrets of Keepalived

This comprehensive guide explains Keepalived's VRRP‑based high‑availability architecture, core functions, deployment scenarios, configuration details, troubleshooting steps, performance tuning, and best‑practice recommendations for building reliable Linux load‑balancing and failover solutions.

Ops Community

1. Keepalived software introduction

1.1 What is Keepalived

Keepalived is a high-availability solution based on VRRP (Virtual Router Redundancy Protocol), mainly used for load balancing and HA clusters on Linux. It was originally developed for the LVS (Linux Virtual Server) project but has evolved into an independent HA solution.

1.2 Core functions of Keepalived

High Availability :

Master‑backup switching via VRRP

Virtual IP (VIP) failover

Multiple health‑check mechanisms

Automatic fault detection and recovery

Load Balancing :

Integration with LVS

Supports various scheduling algorithms

Backend server health checks

Dynamic load‑balancing configuration adjustments

Service Monitoring :

Supports TCP, HTTP, SSL and custom script checks

Dynamic weight adjustment

Automatic fault isolation

1.3 Application scenarios

Web service HA :

Nginx/Apache load balancer HA

Web application server failover

Database connection HA

Network device HA :

Router failover

Firewall HA deployment

Gateway redundancy

Service cluster :

MySQL master-slave failover

Redis cluster HA

Application service cluster management

2. Keepalived architecture and principles

2.1 Components

Keepalived mainly consists of:

Keepalived Daemon : main process for coordination and management

VRRP Stack : implements the VRRP protocol

Checkers : health‑check modules monitoring backend services

IPVS Wrapper : LVS integration providing load‑balancing

Configuration file structure (a common layout; only keepalived.conf is required) :

/etc/keepalived/
├── keepalived.conf     # main config file
├── notify_scripts/     # notification scripts
└── check_scripts/      # health‑check scripts

2.2 Working principle

VRRP mechanism :

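Virtual router group: several physical nodes share one virtual router ID (VRID) and appear to clients as a single virtual router

Priority mechanism: each node has a priority; the node with the highest priority becomes Master

Heartbeat detection: the Master periodically sends VRRP advertisement packets to the group

Failover: when the Master's advertisements stop arriving, the highest-priority Backup takes over automatically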
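
The election rule above can be sketched in a few lines of shell. This is only an illustration of the idea, not Keepalived's own logic: the function name elect_master and the name:priority argument format are assumptions made for this example, and tie-breaking (which VRRP resolves by comparing the nodes' primary IP addresses) is left out.

```shell
#!/bin/sh
# Illustrative only: pick the node with the highest priority from
# arguments of the hypothetical form "name:priority".
elect_master() {
    best_name=""
    best_prio=-1
    for node in "$@"; do
        name=${node%%:*}
        prio=${node##*:}
        if [ "$prio" -gt "$best_prio" ]; then
            best_name=$name
            best_prio=$prio
        fi
    done
    echo "$best_name"
}

# Normal operation: node-a has the highest priority and becomes Master.
elect_master node-a:100 node-b:90 node-c:80
# node-a fails and stops advertising: node-b takes over.
elect_master node-b:90 node-c:80
```

The second call models a failover: once node-a drops out of the group, the same rule selects the next-highest priority.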

State transition process :

Initialize → Backup ⇄ Master
Backup or Master → Fault: a tracked interface or health check fails
Fault → Backup: the fault clears

IP address management :

Real IP : physical NIC IP of each device

Virtual IP : IP shared by the VRRP group

IP binding : Master binds VIP, Backup stands by

2.3 Process architecture

Parent process :

Manages child processes

Parses configuration files

Handles signals

Allocates resources

VRRP child process :

Handles VRRP protocol

Manages virtual IP

Performs state transitions

Sends heartbeat packets

Checker child process :

Executes health checks

Monitors backend services

Updates LVS configuration

Adjusts weights

3. VRRP protocol details

3.1 Overview

VRRP (Virtual Router Redundancy Protocol) is an election protocol that dynamically assigns the responsibility of a virtual router to one of the VRRP routers on a LAN.

Protocol features :

Standardized in RFC 3768 (VRRPv2); VRRPv3 is specified in RFC 5798

Priority‑based master selection

Fast fault detection and switch

Preemptive mode support

3.2 Packet structure

VRRP Header Format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Type  | Virtual RtrID |   Priority    | Count IP Addrs|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Auth Type   |   Adver Int   |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         IP Address (1)                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         IP Address (n)                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Authentication Data (1)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Authentication Data (2)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Field description :

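Version : protocol version (2 for VRRPv2, the format shown here)

Type : packet type (1 = advertisement, the only type defined)

Virtual RtrID : virtual router ID (1-255)

Priority : 1-254 for ordinary routers; 255 is reserved for the router that owns the virtual IP address, and 0 signals that the current Master is giving up the role

Count IP Addrs : number of IP addresses carried in the advertisement

Auth Type : authentication type

Adver Int : advertisement interval in seconds

Checksum : checksum over the whole VRRP message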
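
To make the field layout concrete, the fixed 8-byte part of the header can be decoded by hand. The sketch below is illustrative: the function name decode_vrrp_header is made up for this example, the input is assumed to be the first eight header bytes written as two-digit hex values, and the sample checksum bytes are placeholders rather than a computed checksum.

```shell
#!/bin/sh
# Illustrative decoder for the fixed 8-byte VRRPv2 header.
# Arguments: the first 8 header bytes as two-digit hex values.
decode_vrrp_header() {
    b0=$((0x$1))
    echo "version=$((b0 >> 4))"     # high 4 bits of byte 0
    echo "type=$((b0 & 15))"        # low 4 bits of byte 0
    echo "vrid=$((0x$2))"
    echo "priority=$((0x$3))"
    echo "ip_count=$((0x$4))"
    echo "auth_type=$((0x$5))"
    echo "adver_int=$((0x$6))"
    echo "checksum=0x$7$8"
}

# Example bytes: VRRPv2 advertisement, VRID 51, priority 100, one address,
# auth type 1, 1 s interval. The checksum bytes are placeholders.
decode_vrrp_header 21 33 64 01 01 01 00 00
```

The first byte packs two fields: 0x21 splits into version 2 (upper nibble) and type 1 (lower nibble), which is why a capture of a VRRPv2 advert typically starts with 21.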

3.3 State machine

Three states :

Initialize : initial state at system start, then moves to Backup

Backup : listens for Master advertisements; becomes Master if none received

Master : answers VIP requests and sends VRRP advertisements; reverts to Backup if a higher‑priority advert is received

State transition conditions :

Initialize → Backup: startup event
Backup → Master: timeout without Master advert
Master → Backup: higher‑priority advert received
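
The three transitions listed above fit in a small lookup table. The following is a sketch for illustration only; the function name next_state and the event names are invented for this example and do not correspond to Keepalived identifiers.

```shell
#!/bin/sh
# Illustrative transition table for the three VRRP states.
# Event names are made up for this sketch.
next_state() {
    case "$1:$2" in
        Initialize:startup)            echo Backup ;;
        Backup:master_down_timeout)    echo Master ;;
        Master:higher_priority_advert) echo Backup ;;
        *)                             echo "$1" ;;   # no transition
    esac
}

# Walk one node through startup, takeover, and being preempted.
state=Initialize
for event in startup master_down_timeout higher_priority_advert; do
    state=$(next_state "$state" "$event")
    echo "$event -> $state"
done
```

Any state/event pair not in the table leaves the state unchanged, which mirrors a Backup that keeps receiving adverts from a healthy Master.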

3.4 Timers

Advertisement Timer : interval at which Master sends adverts (default 1 s, configurable 1‑255 s)

Master Down Timer : how long a Backup waits without hearing a Master advert before taking over, computed as (3 × Adver_Int) + Skew_Time, where Skew_Time = (256 − Priority)/256 seconds, so a higher-priority Backup times out slightly sooner

Preemption Timer : delay before a higher-priority Backup preempts the current Master, preventing frequent switches (preempt_delay in Keepalived)
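
As a worked example of the Master Down Timer formula, the sketch below evaluates it for a given advertisement interval and Backup priority. The function name master_down_interval is an assumption for this example; awk is used only for the floating-point arithmetic.

```shell
#!/bin/sh
# Master Down Interval = (3 * Adver_Int) + Skew_Time
# Skew_Time            = (256 - Priority) / 256
# All values are in seconds (VRRPv2).
master_down_interval() {
    awk -v adv="$1" -v prio="$2" \
        'BEGIN { printf "%.3f\n", 3 * adv + (256 - prio) / 256 }'
}

master_down_interval 1 100   # Backup with priority 100
master_down_interval 1 50    # Backup with priority 50
```

With the default 1 s interval, a Backup at priority 100 waits about 3.609 s and one at priority 50 about 3.805 s, so the higher-priority Backup reacts first when the Master disappears.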

4. Installing and configuring Keepalived

4.1 Prepare system environment

System requirements :

Linux kernel ≥ 2.6

IP protocol 112 (VRRP) allowed between the nodes

Network interface with multicast support (VRRP adverts go to 224.0.0.18 by default)

Dependency installation :

# CentOS/RHEL
yum install -y gcc gcc-c++ openssl-devel libnl3-devel

# Ubuntu/Debian
apt-get install -y gcc g++ libssl-dev libnl-3-dev libnl-genl-3-dev

4.2 Compile and install

Download source :

wget https://www.keepalived.org/software/keepalived-2.2.8.tar.gz
tar -zxf keepalived-2.2.8.tar.gz
cd keepalived-2.2.8

Configure build :

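# VRRP and LVS support are built by default, so they need no explicit flags.
# --enable-snmp additionally requires the Net-SNMP development package
# (net-snmp-devel on CentOS/RHEL, libsnmp-dev on Ubuntu/Debian).
./configure --prefix=/usr/local/keepalived \
            --sysconfdir=/etc \
            --enable-snmp \
            --enable-sha1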

Compile and install :

make && make install

4.3 Package installation

CentOS/RHEL :

yum install -y keepalived

Ubuntu/Debian :

apt-get install -y keepalived

4.4 Basic configuration example

Master node configuration (/etc/keepalived/keepalived.conf)

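global_defs {
    notification_email {
        # Placeholder recipient; replace with a real address
        admin@example.com
    }
    # Placeholder sender; replace with a real address
    notification_email_from keepalived@example.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    # Unique name identifying this node
    router_id LVS_MASTER
    vrrp_skip_check_adv_addr
    vrrp_garp_interval 0
    vrrp_gna_interval 0
}

vrrp_script chk_nginx {
    script "/etc/keepalived/check_nginx.sh"
    # Run the check every 3 seconds
    interval 3
    # A failed check lowers this node's priority by 20; the drop must be
    # larger than the priority gap to the Backup, or no failover happens
    weight -20
    # Two consecutive failures mark the check as failed, one success recovers
    fall 2
    rise 1
    timeout 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    # Must be identical on every node of this VRRP group
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    # Without track_script the vrrp_script above is defined but never used
    track_script {
        chk_nginx
    }
    # Called on every state change with: <GROUP|INSTANCE> <name> <state>
    notify "/etc/keepalived/notify.sh"
}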
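
Only the Master side is shown above. A Backup node normally uses the same VRID, password, and VIP and differs in state and priority. One way to keep both nodes consistent is to print the instance block from a single template, as in the sketch below. The function name render_vrrp_instance and the Backup priority of 90 are assumptions for this example; the gap to the Master's priority should stay smaller than the absolute value of the health-check weight, otherwise a failed check on the Master will not trigger a failover.

```shell
#!/bin/sh
# Illustrative template: print a vrrp_instance block for one node.
# Usage: render_vrrp_instance STATE PRIORITY
render_vrrp_instance() {
    cat <<EOF
vrrp_instance VI_1 {
    state $1
    interface eth0
    virtual_router_id 51
    priority $2
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
}
EOF
}

# Backup node: same VRID, password and VIP as the Master, lower priority.
# The value 90 is an assumption for this example.
render_vrrp_instance BACKUP 90
```

The output can be reviewed and then placed in the Backup node's keepalived.conf alongside its own global_defs section (with a different router_id).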

Notification script example (/etc/keepalived/notify.sh)

#!/bin/bash
# Keepalived calls this script as: notify.sh <GROUP|INSTANCE> <name> <state>
TYPE=$1
NAME=$2
STATE=$3
LOG=/var/log/keepalived-notify.log

case "$STATE" in
    MASTER)
        echo "$(date) - $NAME switched to MASTER" >> "$LOG"
        systemctl start nginx
        # Bound the request so a slow webhook cannot leave the script hanging
        curl -s --max-time 5 -X POST -H 'Content-type: application/json' \
            --data '{"text":"Keepalived: node switched to MASTER"}' \
            https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
        ;;
    BACKUP)
        echo "$(date) - $NAME switched to BACKUP" >> "$LOG"
        systemctl stop nginx
        ;;
    FAULT)
        echo "$(date) - $NAME entered FAULT" >> "$LOG"
        # admin@example.com is a placeholder recipient
        echo "Keepalived fault alarm" | mail -s "Keepalived Fault" admin@example.com
        ;;
esac

5. Troubleshooting and optimization

5.1 Common issues

VIP cannot bind :

# Check network interface
ip addr show eth0

# Check firewall
iptables -L | grep -i vrrp
firewall-cmd --list-all

# Check kernel modules (only relevant when LVS/IPVS is used)
lsmod | grep ip_vs
modprobe ip_vs

# Check logs
journalctl -u keepalived -f
tail -f /var/log/messages | grep --line-buffered -i keepalived

VRRP communication problems :

# Packet capture
tcpdump -i eth0 -nn proto 112
tcpdump -i eth0 -nn host 224.0.0.18

# Check multicast configuration
cat /proc/net/igmp

# Test unicast connectivity
ping 192.168.1.102
telnet 192.168.1.102 22
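
A common finding in such captures is split brain: two nodes advertising for the same VRID because neither sees the other's adverts. The sketch below scans tcpdump-style lines for that pattern. It is an illustration under assumptions: the function name find_duplicate_masters is made up, the sample lines imitate tcpdump's VRRP output, and the addresses are examples.

```shell
#!/bin/sh
# Illustrative check: read tcpdump lines on stdin and print every VRID
# that is being advertised by more than one source address.
find_duplicate_masters() {
    awk '/VRRP/ {
        src = ""; vrid = ""
        for (i = 1; i <= NF; i++) {
            if ($i == ">")    src = $(i - 1)
            if ($i == "vrid") { vrid = $(i + 1); sub(/,$/, "", vrid) }
        }
        if (src != "" && vrid != "" && !((vrid, src) in seen)) {
            seen[vrid, src] = 1
            count[vrid]++
        }
    }
    END {
        for (v in count)
            if (count[v] > 1) print "vrid " v ": " count[v] " senders"
    }'
}

# Sample input in the style of tcpdump output (addresses are examples).
find_duplicate_masters <<'EOF'
12:00:01.000000 IP 192.168.1.101 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
12:00:01.200000 IP 192.168.1.102 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 90, authtype simple, intvl 1s, length 20
12:00:02.000000 IP 192.168.1.101 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
EOF
```

Against live traffic the same function could be fed with something like tcpdump -i eth0 -nn -l -c 20 proto 112. Two senders can legitimately appear for a moment during a failover, so only a persistent duplicate points to blocked VRRP traffic between the nodes.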

5.2 Performance tuning

System‑level optimizations :

# Enable IP forwarding (needed when the node forwards traffic, e.g. as a router or LVS director)
echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf
# Disable redirects
echo 'net.ipv4.conf.all.send_redirects = 0' >> /etc/sysctl.conf
echo 'net.ipv4.conf.default.send_redirects = 0' >> /etc/sysctl.conf

# Network buffer tuning
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf

# Apply all of the settings above
sysctl -p

# Increase file descriptor limit (affects the current shell session only)
ulimit -n 65535

Keepalived configuration tuning :

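vrrp_script chk_service {
    script "/path/to/check_script.sh"
    # Check every 2 s; two consecutive failures mark the service as down
    interval 2
    # A failed check lowers this node's priority by 10
    weight -10
    fall 2
    rise 1
    timeout 1
}

vrrp_instance VI_1 {
    # preempt_delay only takes effect when the initial state is BACKUP
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    # Wait 30 s before preempting a lower-priority Master
    preempt_delay 30
    dont_track_primary
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    # Attach the health check to this instance
    track_script {
        chk_service
    }
}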
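
Whether a failed check actually moves the VIP depends on simple arithmetic between the weight and the peer's priority. The sketch below works through the values from the configuration above. The function name effective_priority and the peer priority of 95 are assumptions for this example, and only the negative-weight case is modeled.

```shell
#!/bin/sh
# Illustrative arithmetic: a failed vrrp_script with a negative weight
# lowers the node's effective priority by that amount.
# Only negative weights are modeled in this sketch.
effective_priority() {
    # $1 = configured priority, $2 = script weight, $3 = "ok" or "fail"
    if [ "$3" = "fail" ] && [ "$2" -lt 0 ]; then
        echo $(($1 + $2))
    else
        echo "$1"
    fi
}

# Priority 100 with weight -10 and a failed check gives 90.
current=$(effective_priority 100 -10 fail)
peer=95   # assumed peer priority for this example
if [ "$current" -lt "$peer" ]; then
    echo "failover: $current < $peer"
else
    echo "no failover: $current >= $peer"
fi
```

If the peer's priority were 85 instead, the failed node would still win the election at 90, which is why the weight has to be chosen together with the priorities of all nodes in the group.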

5.3 Monitoring and operations

Monitoring script example :

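#!/bin/bash
# /usr/local/bin/keepalived_monitor.sh
STATUS=$(systemctl is-active keepalived)
# -o picks the oldest match, i.e. the parent keepalived process
PID=$(pgrep -o -x keepalived)
# -w and -F match the VIP as a literal whole word
VIP_STATUS=$(ip addr show eth0 | grep -cwF "192.168.1.100")
# Last state change logged by Keepalived (the log is /var/log/syslog on Debian/Ubuntu)
VRRP_STATE=$(grep -E "Keepalived.*Entering (MASTER|BACKUP|FAULT) STATE" /var/log/messages \
    | tail -1 | grep -oE "MASTER|BACKUP|FAULT")

echo "Keepalived service status: $STATUS"
echo "Keepalived PID: $PID"
echo "VIP bind status: $VIP_STATUS"
echo "VRRP current state: $VRRP_STATE"

# Report to the monitoring endpoint; give up after 5 s if it does not answer
curl -s --max-time 5 -X POST -H 'Content-Type: application/json' \
    -d "{\"service\":\"keepalived\",\"status\":\"$STATUS\",\"vip\":\"$VIP_STATUS\",\"state\":\"$VRRP_STATE\"}" \
    http://monitoring.example.com/api/metrics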

6. Summary

6.1 Core value of Keepalived

Keepalived, as a mature HA solution, offers:

Standard VRRP compatibility

Lightweight design with low resource consumption

Rich feature set for diverse scenarios

Highly customizable configuration

Application advantages include easy deployment, active community, proven production use, and tight integration with Linux.

6.2 Best‑practice summary

Configuration recommendations :

Choose the appropriate mode based on business needs

Set priorities and check intervals reasonably

Implement comprehensive health checks and notification mechanisms

Perform thorough testing and validation

Operations suggestions :

Establish complete monitoring

Conduct regular failover drills

Maintain configuration consistency

Keep software up‑to‑date

Security advice :

Use authentication to protect VRRP traffic

Restrict VRRP packet access

Regularly audit and update configurations

Enable log auditing
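
Restricting VRRP packet access usually means accepting IP protocol 112 only from known peers. The sketch below prints candidate iptables rules for review instead of applying them. The function name vrrp_firewall_rules is an assumption for this example, and the rules are a starting point to adapt to the existing firewall policy, not a complete ruleset.

```shell
#!/bin/sh
# Illustrative generator: print iptables rules that accept VRRP
# (IP protocol 112) only from the listed peers and drop the rest.
# Nothing is applied here; review the output before running it.
vrrp_firewall_rules() {
    for peer in "$@"; do
        echo "iptables -A INPUT -p 112 -s $peer -j ACCEPT"
    done
    echo "iptables -A INPUT -p 112 -j DROP"
}

# 192.168.1.102 is the peer address used in the troubleshooting examples.
vrrp_firewall_rules 192.168.1.102
```

Printing the rules first keeps the change auditable; once checked, the same lines can be run as root on each node with that node's own peer list.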

6.3 Future trends

Technical developments include cloud-native adaptation, containerized deployment, microservice integration, and integration with operations automation. Application extensions cover multi-cloud, hybrid cloud, edge computing, and IoT high-availability scenarios.

Keepalived will continue to play a vital role in modern IT infrastructure; mastering its principles and configuration enables building stable, reliable high‑availability service architectures.
