How I Built an Automated Redis Sentinel System to Handle Failover
An operations engineer narrates how he monitors a four‑node Redis cluster, detects master failure with continuous PINGs, promotes a slave to master, reconfigures replicas, and automates the entire process with a sentinel program and a sentinel cluster for high availability.
I am an operations engineer tasked with monitoring a Redis cluster of one master and three slaves; when the master fails I must promote a slave to master and reconfigure the remaining replicas.
First I connect to each node with redis-cli -h 10.232.0.x -p 6379 and send a PING every second to detect failures.
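A minimal sketch of that polling loop, in Python with the PING stubbed out so it can run anywhere: in production the ping callback would wrap redis-cli -h &lt;host&gt; PING or a client library call, and the node list and failure threshold here are my own assumptions, not exact values from the setup.

```python
import time

def monitor(nodes, ping, down_after=3, interval=1.0, rounds=None):
    """Poll every node once per `interval` seconds.

    A node is reported down after `down_after` consecutive failed PINGs.
    `ping(host)` returns True when the node answers PONG. `rounds` bounds
    the loop for testing; leave it as None to monitor forever.
    """
    failures = {n: 0 for n in nodes}
    down = set()
    done = 0
    while rounds is None or done < rounds:
        for n in nodes:
            if ping(n):
                failures[n] = 0
                down.discard(n)  # node recovered
            else:
                failures[n] += 1
                if failures[n] >= down_after:
                    down.add(n)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return down

# Example with one unreachable node:
dead = {"10.232.0.1"}
unhealthy = monitor(["10.232.0.1", "10.232.0.2"],
                    ping=lambda h: h not in dead,
                    down_after=2, interval=0, rounds=3)
```

Requiring several consecutive failures before declaring a node down avoids flapping on a single dropped packet.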
When a PING to the master returns an error, I promote a chosen slave (e.g., 10.232.0.3:6379) by issuing SLAVEOF NO ONE, then verify its role with INFO.
After the new master is confirmed, I re-attach the other two slaves to it with SLAVEOF 10.232.0.3 6379, and finally convert the old master into a slave of the new master once it comes back online.
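The whole manual sequence can be written down as an ordered command list; this sketch only builds the list (the IPs come from the example above, and actually sending each command to its node is left to a real Redis client):

```python
def failover_commands(new_master, other_slaves, old_master, port=6379):
    """Build the (target_node, command) sequence for a manual failover:
    promote the chosen slave, verify it, re-point the remaining slaves,
    and finally demote the old master once it returns."""
    cmds = [
        (new_master, "SLAVEOF NO ONE"),       # step 1: promote
        (new_master, "INFO"),                 # step 2: confirm role:master
    ]
    for slave in other_slaves:                # step 3: re-attach replicas
        cmds.append((slave, f"SLAVEOF {new_master} {port}"))
    # step 4: the recovered old master becomes a slave of the new one
    cmds.append((old_master, f"SLAVEOF {new_master} {port}"))
    return cmds

plan = failover_commands("10.232.0.3",
                         ["10.232.0.2", "10.232.0.4"],
                         "10.232.0.1")
```

Keeping the plan as data makes it easy to log exactly what the automation did during an incident.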
I wrapped this whole procedure into a program called the “sentinel” that continuously monitors the four nodes, reports problems and executes the fail‑over steps automatically.
To simplify monitoring I query only the current master for the list of its slaves using INFO, which provides the slave IPs, ports, replication offsets and unique IDs.
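Extracting the slave list from that reply is a small parsing job; the sketch below follows the slaveN line format of Redis's INFO replication section (the sample values in the test are made up):

```python
def parse_slaves(info_text):
    """Parse the slaveN lines of an INFO replication payload into dicts,
    e.g. 'slave0:ip=10.232.0.2,port=6379,state=online,offset=100,lag=0'."""
    slaves = []
    for line in info_text.splitlines():
        key, sep, fields = line.partition(":")
        # Accept only keys of the form slave<number>, skipping fields
        # like slave_repl_offset that merely start with "slave".
        if not sep or not key.startswith("slave") or not key[5:].isdigit():
            continue
        slaves.append(dict(f.split("=", 1) for f in fields.split(",")))
    return slaves
```

Each returned dict then carries the ip, port, and offset fields the sentinel needs for candidate selection.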
When multiple slaves are healthy I select the best candidate in two phases: first filter out DOWN or DISCONNECTED nodes, nodes that have not responded within five seconds, and nodes whose slave priority is 0 (meaning "never promote"); then sort the survivors by priority, by replication offset, and finally by the smallest run ID, as implemented in the following C function:
<code>sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance *instance[dictSize(master->slaves)];
    int instances = 0;
    dictIterator *di = dictGetIterator(master->slaves);
    dictEntry *de;

    /* Phase 1: filter out nodes unsuitable for promotion. */
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        if (slave->flags & (DOWN|DISCONNECTED)) continue;
        if (mstime() - slave->last_avail_time > 5000) continue;
        if (slave->slave_priority == 0) continue; /* 0 = never promote */
        /* other checks … */
        instance[instances++] = slave;  /* keep this candidate */
    }
    dictReleaseIterator(di);
    if (instances == 0) return NULL;    /* no promotable slave */

    /* Phase 2: sort the remaining nodes; the best candidate sorts first. */
    qsort(instance, instances, sizeof(sentinelRedisInstance *),
          compareSlavesForPromotion);
    return instance[0];
}

int compareSlavesForPromotion(const void *a, const void *b) {
    sentinelRedisInstance **sa = (sentinelRedisInstance **)a;
    sentinelRedisInstance **sb = (sentinelRedisInstance **)b;
    /* Lower priority value wins. */
    if ((*sa)->slave_priority != (*sb)->slave_priority)
        return (*sa)->slave_priority - (*sb)->slave_priority;
    /* Larger replication offset (more up-to-date data) wins. */
    if ((*sa)->slave_repl_offset > (*sb)->slave_repl_offset) return -1;
    if ((*sa)->slave_repl_offset < (*sb)->slave_repl_offset) return 1;
    /* Tie-break on the lexicographically smaller run ID. */
    return strcasecmp((*sa)->runid, (*sb)->runid);
}
</code>
To avoid a single point of failure I deploy three sentinel nodes as a sentinel cluster; monitoring continues as long as one sentinel is alive, while automatic failover still works whenever a majority of sentinels survives.
Each sentinel first marks the master "subjectively" down based on its own failed PINGs; when a quorum (e.g., two out of three) agrees, the master is considered "objectively" down and the promotion process starts.
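The escalation from subjective to objective down is just a quorum count over the sentinels' individual verdicts; a tiny sketch (the quorum of 2 matches the three-sentinel example):

```python
def is_objectively_down(sdown_votes, quorum=2):
    """`sdown_votes` maps each sentinel's name to its subjective verdict
    (True = this sentinel thinks the master is down). The master is
    objectively down once at least `quorum` sentinels agree."""
    return sum(sdown_votes.values()) >= quorum
```

One flaky sentinel with a bad network path to the master therefore cannot trigger a failover on its own.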
The leader that carries out the promotion is elected with a Raft-style vote, ensuring that only one sentinel performs the reconfiguration.
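A much-simplified sketch of that vote: each sentinel grants at most one vote per epoch, and a candidate needs a strict majority to become leader. This glosses over timeouts, epoch bumps and retries, and the data layout is my own invention, not Sentinel's actual protocol state.

```python
def elect_leader(candidate, epoch, sentinels):
    """`sentinels` maps each sentinel's name to (voted_epoch, voted_for).
    A sentinel grants its vote only if it has not yet voted in `epoch`;
    the candidate wins with a strict majority of all sentinels."""
    votes = 0
    for name, (voted_epoch, voted_for) in sentinels.items():
        if voted_epoch < epoch:
            sentinels[name] = (epoch, candidate)  # grant and record the vote
            votes += 1
        elif voted_for == candidate:
            votes += 1                            # already voted for us
    return votes > len(sentinels) // 2
```

The one-vote-per-epoch rule is what guarantees at most one leader per epoch: two candidates cannot both collect majorities from the same voters.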
With this automated sentinel system I have handled many unexpected master failures, even during late‑night incidents, without manual intervention.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.