Understanding Redis Split‑Brain: Causes, Data Loss, and Prevention Strategies
This article explains Redis split‑brain behavior, describing its definition, causes such as network failures and Sentinel elections, the resulting data loss during master‑slave switches, and practical prevention measures including quorum configuration, timeout tuning, network monitoring, proxy layers, and the min‑slaves‑to‑write and min‑slaves‑max‑lag settings.
Hello, I am Tianlu. This article shares a common interview question from large tech companies: explain Redis master‑slave split‑brain.
We can answer it from four dimensions:
What is split‑brain behavior?
Why does it occur in a master‑slave cluster?
How does it lead to data loss?
How can we avoid and handle it?
1. What is split‑brain?
Split‑brain refers to a situation in a distributed system where a network partition causes nodes to lose contact with one another, producing two or more independent "brains" that each believe they are the primary, which leads to conflicting writes and data inconsistency.
In Redis master‑slave architecture, if the master and its slaves lose communication due to network issues, Sentinel may elect a new master while the old master recovers and continues accepting writes, resulting in two masters – this is Redis split‑brain.
2. Why does split‑brain happen in a master‑slave cluster?
Typical reasons include:
Network failure: An unstable or broken network interrupts communication between the master, the Sentinels, and the slaves, causing Sentinel to mistakenly conclude that the master is down.
Sentinel election mechanism: When Sentinel cannot reach the master, it starts an election and promotes a slave to master. If the original master later recovers while still accepting writes, two masters coexist.
False positives: Sentinel's failure detection can be overly sensitive to transient network anomalies, marking a healthy master as down and triggering an unnecessary failover that leaves an extra master behind.
3. Why does split‑brain cause data loss?
After a master‑slave switch, the promoted slave becomes the new master. Sentinel demotes the old master to a replica of the new master; when it reconnects, it performs a full resynchronization, discarding its own dataset and loading the new master's RDB snapshot. Any writes the old master accepted after the partition began are therefore lost.
Illustration:
If the old master experiences a false failure, Sentinel elects a new master. While the election is in progress, the old master may recover and continue accepting writes.
Once the new master is elected, the old master becomes a slave and must sync from the new master, causing any data written to the old master during the transition to be lost.
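The loss window described above can be sketched with a toy in‑memory simulation. This is not real Redis: plain Python dicts stand in for the two nodes, and `full_resync` mimics the demoted master flushing its data and copying the new master's snapshot.

```python
# Toy simulation of split-brain data loss: dicts stand in for Redis nodes.

def full_resync(old_master: dict, new_master: dict) -> dict:
    """Demotion step: the old master discards its own dataset and copies
    the new master's (analogous to loading the new master's RDB file)."""
    old_master.clear()
    old_master.update(new_master)
    return old_master

# Before the partition, master and slave hold the same data.
old_master = {"k1": "v1"}
replica = {"k1": "v1"}

# Partition: Sentinel promotes the replica, but clients that can still
# reach the old master keep writing to it.
new_master = replica
old_master["k2"] = "written-during-partition"  # this write is doomed

# Partition heals: the old master is demoted and fully resyncs.
full_resync(old_master, new_master)

print("k2" in old_master)  # False: the in-flight write was lost
```

The simulation shows why the damage is silent: from the writing client's point of view the write to the old master succeeded, yet after the resync it is simply gone.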
4. How to avoid or handle split‑brain?
Recommended mitigation methods:
Use quorum configuration: Run an odd number of Sentinels and set a sensible quorum so that a single Sentinel's misjudgment cannot trigger a failover on its own.
Adjust timeout parameters: Tune Sentinel's down-after-milliseconds and failover-timeout to match actual network conditions, so brief hiccups are not treated as master failures.
Network isolation and monitoring: Keep the network stable and monitor latency so issues are caught before Sentinel reacts to them.
Introduce a proxy layer: Use a proxy (e.g., Codis) between clients and Redis so that writes are always routed to the current master, rather than letting clients hold direct connections to a stale one.
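As a sketch of the first two points, a three‑Sentinel deployment with a quorum of 2 might carry the following sentinel.conf entries (the master name mymaster and its address are placeholders; the timeout values shown are Redis's defaults, to be tuned per environment):

```conf
# sentinel.conf (one copy per Sentinel; run an odd number, e.g. 3)

# Quorum of 2: at least two Sentinels must agree the master is down
# before an objective-down state and a failover can be declared.
sentinel monitor mymaster 127.0.0.1 6379 2

# Require 30s of unreachability before flagging the master as
# subjectively down, reducing false positives on a flaky network.
sentinel down-after-milliseconds mymaster 30000

# Allow up to 3 minutes for a failover to finish before retrying.
sentinel failover-timeout mymaster 180000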
A highly recommended approach is to configure min-slaves-to-write and min-slaves-max-lag (renamed min-replicas-to-write and min-replicas-max-lag in Redis 5.0, with the old names kept as aliases):
min-slaves-to-write: Requires a minimum number of connected, sufficiently synchronized slaves before the master accepts write operations, so an isolated master stops taking writes it cannot replicate.
min-slaves-max-lag: Defines the maximum acceptable replication lag in seconds, measured from each slave's last acknowledgement. Slaves lagging beyond this limit do not count toward min-slaves-to-write, keeping badly lagging replicas from masking the problem.
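The two settings work as a pair on the master. A minimal redis.conf fragment (using the Redis 5+ min-replicas-* spelling; the values 1 and 10 are illustrative, not a recommendation for every deployment):

```conf
# redis.conf on the master

# Refuse writes unless at least 1 replica is connected...
min-replicas-to-write 1

# ...and that replica's last acknowledgement arrived no more
# than 10 seconds ago.
min-replicas-max-lag 10
```

With these settings, an old master cut off by a network partition quickly loses its eligible replicas and starts rejecting writes with an error, which caps how much data the eventual full resync can discard.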
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.