
How I Built an Automated Redis Sentinel System to Handle Failover

An operations engineer narrates how he monitors a four‑node Redis cluster, detects master failure with continuous PINGs, promotes a slave to master, reconfigures replicas, and automates the entire process with a sentinel program and a sentinel cluster for high availability.

macrozheng

I am an operations engineer tasked with monitoring a Redis cluster of one master and three slaves; when the master fails I must promote a slave to master and reconfigure the remaining replicas.

First I connect to each node using

redis-cli -h 10.232.0.x -p 6379

and send

PING

once per second to detect failures.
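This polling loop can be sketched as a small shell script. It is a minimal illustration, not the exact tooling I used: the node addresses follow the examples in this article, and redis-cli is assumed to be on the PATH.

```shell
#!/bin/sh
# Probe each node and report whether its PING came back as PONG.
# Node addresses are the illustrative ones used throughout this article.
NODES="10.232.0.1 10.232.0.2 10.232.0.3 10.232.0.4"

# A node is healthy only when the reply is exactly PONG.
is_healthy() {
    [ "$1" = "PONG" ]
}

check_all() {
    for host in $NODES; do
        reply=$(redis-cli -h "$host" -p 6379 PING 2>/dev/null)
        if is_healthy "$reply"; then
            echo "$host OK"
        else
            echo "$host DOWN"
        fi
    done
}

# Main loop, one probe per second:
# while true; do check_all; sleep 1; done
```

Anything other than a clean PONG (an error reply, a timeout, an empty string) counts as a failed probe.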

When a PING to the master returns an error, I promote a chosen slave (e.g., 10.232.0.3:6379) by issuing

SLAVEOF NO ONE

and verify the new role with

INFO replication

which should now report role:master.

After the new master is confirmed, I re‑attach the other two slaves to it with

SLAVEOF 10.232.0.3 6379

and finally, once it recovers, convert the old master into a slave of the new master.
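The whole manual procedure can be summarized as a dry-run script that prints the redis-cli commands in order, rather than executing them. The addresses are the illustrative ones above; which node is promoted would in practice come from the selection logic described later.

```shell
#!/bin/sh
# Dry-run of the failover: print the redis-cli commands that promote
# one slave and repoint the others. Addresses are illustrative.
NEW_MASTER_IP=10.232.0.3
NEW_MASTER_PORT=6379
OTHER_SLAVES="10.232.0.2 10.232.0.4"
OLD_MASTER=10.232.0.1

failover_plan() {
    # Step 1: promote the chosen slave.
    echo "redis-cli -h $NEW_MASTER_IP -p $NEW_MASTER_PORT SLAVEOF NO ONE"
    # Step 2: repoint the remaining slaves at the new master.
    for host in $OTHER_SLAVES; do
        echo "redis-cli -h $host -p 6379 SLAVEOF $NEW_MASTER_IP $NEW_MASTER_PORT"
    done
    # Step 3: once the old master comes back, demote it as well.
    echo "redis-cli -h $OLD_MASTER -p 6379 SLAVEOF $NEW_MASTER_IP $NEW_MASTER_PORT"
}

failover_plan
```

Printing the plan first makes the procedure reviewable before anything destructive runs.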

I wrapped this whole procedure into a program called the “sentinel” that continuously monitors the four nodes, reports problems and executes the fail‑over steps automatically.

To simplify monitoring I only query the current master for the list of its slaves using

INFO

, which provides the slave IPs, ports, replication offsets and unique IDs.
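Extracting those fields is a one-liner over the Replication section of INFO. The sketch below parses a captured sample so it is self-contained; in practice the input would come from redis-cli -h <master> INFO replication, and the sample values are illustrative.

```shell
#!/bin/sh
# Pull "ip port offset" for each slave out of INFO replication output.
# The sample below stands in for live redis-cli output.
sample_info() {
cat <<'EOF'
# Replication
role:master
connected_slaves:3
slave0:ip=10.232.0.2,port=6379,state=online,offset=123456,lag=0
slave1:ip=10.232.0.3,port=6379,state=online,offset=123460,lag=0
slave2:ip=10.232.0.4,port=6379,state=online,offset=123001,lag=1
EOF
}

list_slaves() {
    # Match slaveN: lines and print "ip port offset".
    sed -n 's/^slave[0-9]*:ip=\([^,]*\),port=\([^,]*\),state=[^,]*,offset=\([^,]*\),.*/\1 \2 \3/p'
}

sample_info | list_slaves
```

Note that real redis-cli output uses CRLF line endings, which may need stripping before parsing.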

When multiple slaves are healthy, I select the best candidate by filtering out disconnected or down nodes and any that have not responded within five seconds, then sorting the remainder by priority (lowest first), replication offset (highest first), and finally by the lexicographically smallest run ID, as implemented in the following C function from the Redis source:

<code>/* Simplified from sentinel.c: pick the slave to promote. */
sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance **instance; /* candidate array, allocation elided */
    dictIterator *di;
    dictEntry *de;
    int instances = 0;

    /* Remove unsuitable nodes. */
    di = dictGetIterator(master->slaves);
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        if (slave->flags & (SRI_S_DOWN | SRI_O_DOWN)) continue; /* down */
        if (slave->link->disconnected) continue;        /* unreachable */
        if (mstime() - slave->link->last_avail_time > 5000) continue; /* stale */
        if (slave->slave_priority == 0) continue; /* priority 0: never promote */
        /* other checks elided ... */
        instance[instances++] = slave;
    }
    dictReleaseIterator(di);

    /* Sort the remaining nodes and return the best candidate. */
    qsort(instance, instances, sizeof(sentinelRedisInstance *),
          compareSlavesForPromotion);
    return instances ? instance[0] : NULL;
}

int compareSlavesForPromotion(const void *a, const void *b) {
    sentinelRedisInstance **sa = (sentinelRedisInstance **) a;
    sentinelRedisInstance **sb = (sentinelRedisInstance **) b;

    /* 1. Lower slave_priority wins. */
    if ((*sa)->slave_priority != (*sb)->slave_priority)
        return (*sa)->slave_priority - (*sb)->slave_priority;
    /* 2. Higher replication offset (more data replicated) wins. */
    if ((*sa)->slave_repl_offset > (*sb)->slave_repl_offset) return -1;
    if ((*sa)->slave_repl_offset < (*sb)->slave_repl_offset) return 1;
    /* 3. Tie-break on the lexicographically smaller run ID. */
    return strcasecmp((*sa)->runid, (*sb)->runid);
}
</code>
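The same three-level ordering can be sanity-checked outside of C with sort(1): priority ascending, offset descending, run ID ascending (case-insensitive). The candidate rows below are made-up illustrative data.

```shell
#!/bin/sh
# Rank candidates the way compareSlavesForPromotion does.
# Columns: priority offset runid (illustrative values).
rank_candidates() {
    sort -k1,1n -k2,2nr -k3,3f
}

candidates() {
cat <<'EOF'
100 123456 bbb111
100 123460 aaa222
90 123001 ccc333
EOF
}

# The best candidate is the first line after sorting.
best=$(candidates | rank_candidates | head -n 1)
echo "$best"
```

Here the priority-90 node wins even though its offset is the lowest, which is exactly why a deliberately set slave_priority overrides replication progress.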

To avoid a single point of failure I deploy three sentinel nodes as a sentinel cluster; monitoring continues as long as any sentinel is alive, and failover can still proceed as long as a majority of them can agree.

Sentinels first perform a “subjective” down detection; when a majority (e.g., two out of three) agree, the master is considered “objectively” down and the promotion process starts.
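The subjective-to-objective transition boils down to a vote count against the configured quorum. This is a minimal sketch assuming a quorum of 2 out of 3, matching the example above.

```shell
#!/bin/sh
# Each sentinel reports its own "subjective" verdict; the master is
# "objectively" down only when down-votes reach the quorum (2 of 3 here).
QUORUM=2

objectively_down() {
    votes=0
    for verdict in "$@"; do
        [ "$verdict" = "down" ] && votes=$((votes + 1))
    done
    [ "$votes" -ge "$QUORUM" ]
}

if objectively_down down down up; then
    echo "master is objectively down, start failover"
fi
```

A single sentinel with a flaky network path cannot trigger a failover on its own; it takes independent agreement from a quorum of observers.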

The sentinel that carries out the promotion is elected as leader with a Raft-like vote, ensuring only one sentinel performs the reconfiguration.

With this automated sentinel system I have handled many unexpected master failures, even during late‑night incidents, without manual intervention.

Monitoring, Automation, Operations, High Availability, Redis, Sentinel, Failover
Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
