Operations 11 min read

How I Built an Automated Redis Sentinel System to Handle Failover

An operations engineer narrates how he monitors a four‑node Redis cluster, detects master failure with continuous PINGs, promotes a slave to master, reconfigures replicas, and automates the entire process with a sentinel program and a sentinel cluster for high availability.

macrozheng

I am an operations engineer tasked with monitoring a Redis cluster of one master and three slaves; when the master fails I must promote a slave to master and reconfigure the remaining replicas.

First I connect to each node using redis-cli -h 10.232.0.x -p 6379 and send a PING every second; a node that stops answering, or answers with an error, is treated as failed.
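The bookkeeping behind that one-second PING loop can be sketched as a small pure function. This is a minimal sketch, not the author's program: the node_health record, the function names and the idea of waiting for several consecutive failures before declaring a node down are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node health record (not a Redis structure). */
typedef struct {
    int64_t last_pong_ms;  /* time of the last successful PING reply */
    int     failed_pings;  /* consecutive PINGs that errored or timed out */
} node_health;

/* Record the outcome of one PING sent at time now_ms. */
void record_ping(node_health *h, bool ok, int64_t now_ms) {
    if (ok) {
        h->last_pong_ms = now_ms;
        h->failed_pings = 0;   /* any good reply resets the failure streak */
    } else {
        h->failed_pings++;
    }
}

/* Assumption: a node counts as down after max_failures failed PINGs in a row. */
bool node_is_down(const node_health *h, int max_failures) {
    return h->failed_pings >= max_failures;
}
```

Requiring a short streak of failures rather than a single one is a design choice that trades a few seconds of detection time for fewer false alarms on a briefly slow network.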

When a PING to the master returns an error, I promote a chosen slave (e.g., 10.232.0.3:6379) by issuing SLAVEOF NO ONE and verify the role with INFO.

After the new master is confirmed, I re‑attach the other two slaves to it with SLAVEOF 10.232.0.3 6379. Once the old master is reachable again, I send it the same command so that it rejoins as a slave of the new master.
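The manual steps above boil down to one rule per node: the promoted node gets SLAVEOF NO ONE and every other node gets SLAVEOF pointing at it. A minimal sketch of that rule follows; the node_addr type and failover_command function are hypothetical names, and any address other than 10.232.0.3 used with it is illustrative.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical address record for one Redis node. */
typedef struct {
    const char *ip;
    int         port;
} node_addr;

/* Write into buf the command that `node` must receive so that `new_master`
 * ends up as the only master. Returns the snprintf result. */
int failover_command(const node_addr *node, const node_addr *new_master,
                     char *buf, size_t buflen) {
    /* The promoted node stops replicating. */
    if (strcmp(node->ip, new_master->ip) == 0 && node->port == new_master->port)
        return snprintf(buf, buflen, "SLAVEOF NO ONE");
    /* Every other node, including the old master, follows the new master. */
    return snprintf(buf, buflen, "SLAVEOF %s %d",
                    new_master->ip, new_master->port);
}
```

Calling it for each of the four nodes yields the full command plan, which a monitoring program could then send in order: the promoted node first, the remaining slaves next, the old master last.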

I wrapped this whole procedure into a program called the “sentinel” that continuously monitors the four nodes, reports problems and executes the fail‑over steps automatically.

To simplify monitoring I only query the current master for the list of its slaves using INFO, which provides each slave's IP, port, state, replication offset and lag. The sentinel then connects to each slave, whose own INFO output supplies its unique run ID.
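For illustration, here is a minimal sketch of reading one slave entry from the master's INFO output. It assumes the line format emitted by recent Redis versions (slave0:ip=...,port=...,state=...,offset=...,lag=...); the slave_info type and parse_slave_line function are hypothetical names, not part of Redis.

```c
#include <stdio.h>

/* Hypothetical record for one "slaveN:" line of INFO output. */
typedef struct {
    int       index;      /* N in "slaveN" */
    char      ip[46];     /* large enough for an IPv6 literal */
    int       port;
    char      state[16];  /* e.g. "online" */
    long long offset;     /* replication offset reported by the master */
    int       lag;        /* seconds since the slave's last acknowledgement */
} slave_info;

/* Parse one line; returns 1 if all six fields were read, 0 otherwise. */
int parse_slave_line(const char *line, slave_info *out) {
    int n = sscanf(line,
                   "slave%d:ip=%45[^,],port=%d,state=%15[^,],offset=%lld,lag=%d",
                   &out->index, out->ip, &out->port, out->state,
                   &out->offset, &out->lag);
    return n == 6;
}
```

Lines that do not describe a slave, such as role:master, simply fail to match, so the parser can be run over every line of the reply.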

When multiple slaves are healthy I select the best candidate in two passes. First I filter out nodes flagged DISCONNECTED or DOWN, nodes that have not responded within the last five seconds, and nodes whose priority is 0 (never promote). Then I sort the remaining ones by priority (lower value first), by replication offset (larger first) and finally by run ID (lexicographically smallest first), as outlined in the following abridged C functions:

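sentinelRedisInstance *sentinelSelectSlave() {
    // remove unsuitable nodes
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        if (slave->flags & (DOWN|DISCONNECTED)) continue;       // known to be unreachable
        if (mstime() - slave->last_avail_time > 5000) continue; // silent for more than 5 s
        if (slave->slave_priority == 0) continue;               // priority 0: never promote
        // other checks …
    }
    // sort remaining nodes
    qsort(..., compareSlavesForPromotion);
    // return the best candidate
    return instance[0];
}

int compareSlavesForPromotion(const void *a, const void *b) {
    // qsort passes pointers to the array elements, which are themselves pointers
    sentinelRedisInstance **sa = (sentinelRedisInstance **)a,
                          **sb = (sentinelRedisInstance **)b;
    // lower priority value sorts first
    if ((*sa)->slave_priority != (*sb)->slave_priority)
        return (*sa)->slave_priority - (*sb)->slave_priority;
    // larger replication offset (more data received) sorts first
    if ((*sa)->slave_repl_offset > (*sb)->slave_repl_offset) return -1;
    if ((*sa)->slave_repl_offset < (*sb)->slave_repl_offset) return 1;
    // tie-break: lexicographically smaller run ID sorts first
    return strcasecmp((*sa)->runid, (*sb)->runid);
}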
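Because the excerpt above leaves out its surrounding types and iteration code, the same filter-then-sort idea is restated here as a self-contained sketch. The slave struct, the FLAG_ constants and select_slave are simplified stand-ins invented for this example, not the Redis source; strcasecmp comes from the POSIX strings.h header.

```c
#include <stdlib.h>
#include <strings.h>  /* strcasecmp (POSIX) */

#define FLAG_DOWN         (1 << 0)
#define FLAG_DISCONNECTED (1 << 1)

/* Simplified, hypothetical view of one slave. */
typedef struct {
    const char *runid;
    int         flags;
    int         priority;       /* 0 means "never promote" */
    long long   repl_offset;
    long long   last_avail_ms;  /* time of the last valid reply */
} slave;

/* Order: lower priority, then larger offset, then smaller run ID. */
static int cmp_slaves(const void *a, const void *b) {
    const slave *sa = *(const slave *const *)a;
    const slave *sb = *(const slave *const *)b;
    if (sa->priority != sb->priority)
        return sa->priority < sb->priority ? -1 : 1;
    if (sa->repl_offset != sb->repl_offset)
        return sa->repl_offset > sb->repl_offset ? -1 : 1;
    return strcasecmp(sa->runid, sb->runid);
}

/* Returns the best promotion candidate, or NULL if none qualifies. */
const slave *select_slave(const slave *slaves, size_t n, long long now_ms) {
    const slave **ok = malloc(n * sizeof *ok);
    const slave *best = NULL;
    size_t count = 0;
    if (ok == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        const slave *s = &slaves[i];
        if (s->flags & (FLAG_DOWN | FLAG_DISCONNECTED)) continue;
        if (now_ms - s->last_avail_ms > 5000) continue;  /* silent for > 5 s */
        if (s->priority == 0) continue;
        ok[count++] = s;
    }
    if (count > 0) {
        qsort(ok, count, sizeof *ok, cmp_slaves);
        best = ok[0];
    }
    free(ok);
    return best;
}
```

Sorting an array of pointers rather than the structs themselves mirrors the excerpt, where the comparator also receives pointers to pointers.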

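To avoid a single point of failure I deploy three sentinel nodes as a sentinel cluster. Monitoring continues as long as one sentinel is alive, but an automatic fail‑over needs a majority of the sentinels (two of the three) to be reachable, so that they can agree on the failure and authorize one of them to act.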

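Sentinels first perform a “subjective” down detection: each one marks the master as down on its own when its PINGs go unanswered. When the number of sentinels that agree reaches the configured quorum (e.g., two out of three), the master is considered “objectively” down and the promotion process starts.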
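The step from subjective to objective down is a simple count against the quorum. The following is a minimal sketch of that check; the function name and the boolean vote array are assumptions for illustration, since real sentinels exchange this information over the network.

```c
#include <stdbool.h>
#include <stddef.h>

/* votes[i] is true when sentinel i currently sees the master as
 * subjectively down. Hypothetical helper, not Redis code. */
bool objectively_down(const bool *votes, size_t n, size_t quorum) {
    size_t agree = 0;
    for (size_t i = 0; i < n; i++)
        if (votes[i]) agree++;
    /* Objective down requires at least `quorum` agreeing sentinels. */
    return agree >= quorum;
}
```

With three sentinels and a quorum of two, a single sentinel that loses its own network link cannot trigger a promotion by itself.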

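The leader that carries out the promotion is elected with a vote modeled on the leader election of the Raft algorithm: each sentinel grants at most one vote per epoch and a candidate needs a majority, ensuring only one sentinel performs the change.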
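The property that matters in such an election is that a sentinel votes at most once per epoch, so two candidates can never both collect a majority in the same epoch. Below is a deliberately simplified model of that rule, not the Sentinel implementation; the voter type, request_vote, wins_election and the candidate names used with them are all hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical voting state kept by one sentinel. */
typedef struct {
    long long voted_epoch;    /* last epoch in which this sentinel voted */
    char      voted_for[41];  /* candidate it chose in that epoch */
} voter;

/* Returns true if this sentinel's vote for `epoch` goes to `candidate`. */
bool request_vote(voter *v, long long epoch, const char *candidate) {
    if (epoch > v->voted_epoch) {
        /* First request seen for a newer epoch: grant and remember it. */
        v->voted_epoch = epoch;
        strncpy(v->voted_for, candidate, sizeof v->voted_for - 1);
        v->voted_for[sizeof v->voted_for - 1] = '\0';
        return true;
    }
    /* Same epoch: only repeat the vote already given. Older epoch: refuse. */
    return epoch == v->voted_epoch && strcmp(v->voted_for, candidate) == 0;
}

/* A candidate becomes leader only with a strict majority of all sentinels. */
bool wins_election(int votes, int total_sentinels) {
    return votes >= total_sentinels / 2 + 1;
}
```

A candidate that fails to reach a majority would retry in a later epoch, where every sentinel is again free to vote.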

With this automated sentinel system I have handled many unexpected master failures, even during late‑night incidents, without manual intervention.

Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
