Why Redis High Availability Fails: Split‑Brain and Replication Storm Explained
The article examines the two most dangerous production failures in Redis high‑availability—split‑brain and replication storm—explaining their causes, real‑world impact, and practical engineering safeguards such as write‑protection parameters, network isolation, backlog sizing, and cascading replication.
What Is Split‑Brain?
Split‑brain occurs when, at the same moment, two nodes act as masters and both accept write requests, leading to irreconcilable data divergence.
ClientA → Master A (write success)
ClientB → New Master B (write success)

The two masters cannot merge their data, and the result is:
Data permanently inconsistent
Cannot auto‑recover
Requires manual intervention
Usually accompanied by serious data loss
Split‑brain is not unique to Redis; it is a universal nightmare for distributed systems.
Typical Split‑Brain Scenario
The most common trigger is a network partition.
                  ┌───────────────┐
Client → Master A │ Network break  │ Sentinel
                  └───────────────┘
Client → Slave B (promoted to new master)

Process:
Master A stays alive but Sentinel cannot reach it.
Sentinel marks A as ODOWN.
Sentinel elects B as the new master.
Clients can still access A.
Both A and B accept writes → split‑brain forms.
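The divergence this produces can be illustrated with a toy simulation (plain Python dictionaries, not real Redis): during the partition, two nodes both believe they are master and accept conflicting writes for the same key, and nothing in the replication protocol can merge them afterwards.

```python
# Toy illustration of split-brain divergence (not real Redis):
# during a partition, two "masters" accept writes for the same key.

master_a = {}  # original master, still reachable by some clients
master_b = {}  # replica promoted by Sentinel during the partition

# Client A still routes to the old master; client B to the new one.
master_a["user:1:balance"] = 100   # write accepted by A
master_b["user:1:balance"] = 250   # conflicting write accepted by B

# When the partition heals, the values have diverged and Redis has
# no merge function -- one of the writes must be discarded manually.
diverged = master_a["user:1:balance"] != master_b["user:1:balance"]
print(diverged)  # True: permanent inconsistency without manual repair
```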
Why Sentinel Cannot Fully Prevent Split‑Brain
Sentinel guarantees consistency only within the network range it can see.
Sentinel cannot control:
Client access paths
Network‑partition topology
External load‑balancer behavior
Sentinel solves “automatic failover”, not “global consistency”.
Engineering Safeguards Against Split‑Brain
Redis offers write‑protection parameters:
min-slaves-to-write 1
min-slaves-max-lag 5

When a master has no healthy slaves (or they lag too far behind), it rejects write requests. Since Redis 5 the same directives are also available under the names min-replicas-to-write and min-replicas-max-lag.
In a split‑brain situation this forces the original master to become read‑only, allowing only the newly elected master to continue writes and preventing dual writes.
Original master loses slaves → becomes read‑only.
New master B can keep writing.
Dual‑write scenario avoided.
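The gate these parameters enforce can be sketched as follows (an illustrative model with hypothetical helper names; the real check runs inside redis-server): a write is accepted only while enough replicas have acknowledged recently.

```python
# Sketch of the min-replicas write gate (illustrative only; the real
# logic lives inside redis-server). Function names are hypothetical.

MIN_REPLICAS_TO_WRITE = 1   # mirrors min-slaves-to-write 1
MAX_REPLICA_LAG_SECS = 5    # mirrors min-slaves-max-lag 5

def healthy_replicas(replica_lags):
    """Count replicas whose last ACK is within the allowed lag."""
    return sum(1 for lag in replica_lags if lag <= MAX_REPLICA_LAG_SECS)

def accept_write(replica_lags):
    """The master accepts writes only with enough healthy replicas."""
    return healthy_replicas(replica_lags) >= MIN_REPLICAS_TO_WRITE

print(accept_write([1, 2]))  # True: both replicas are fresh
print(accept_write([]))      # False: partitioned master, no replicas
print(accept_write([60]))    # False: the only replica lags too far
```

This is why a partitioned master turns read-only: the moment it loses sight of its replicas, the gate closes.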
Protection Levels
No protection – extremely prone to split‑brain.
Sentinel – automatic failover but no split‑brain protection.
+ Write‑protection parameters – blocks dual writes.
+ Network isolation design – industrial‑grade safety.
What Is Replication Storm?
A replication storm occurs when multiple replicas trigger a full resynchronization (FULLRESYNC) at the same time, instantly overwhelming the master:
CPU spikes
IO saturated
Network congestion
Redis OOM or freeze
Replica1 ─┐
Replica2 ─┼→ simultaneous FULLRESYNC → master crash
Replica3 ─┘

Triggers for Replication Storm
Backlog size too small.
Frequent network jitter.
Sentinel frequent failovers.
Mass replica restarts at once.
Excessive write pressure on the master.
Engineering Solutions for Replication Storm
1. Increase Backlog Size
repl-backlog-size 128mb    # recommended minimum

For heavy‑write workloads:

repl-backlog-size 256mb

2. Enable Cascading Replication
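A common sizing heuristic (a rule of thumb, not an official formula): the backlog must buffer all write traffic generated during the longest replica disconnection you want to survive without a FULLRESYNC, plus headroom.

```python
# Rough repl-backlog-size estimate: the backlog must hold all write
# traffic for the longest replica disconnection you want to ride out.
# The 2x headroom factor is an assumption, not a Redis default.

def backlog_bytes(write_mb_per_sec, max_disconnect_secs, headroom=2.0):
    return int(write_mb_per_sec * max_disconnect_secs * headroom
               * 1024 * 1024)

# e.g. 1 MB/s of writes, tolerate a 60 s network blip:
size = backlog_bytes(1, 60)
print(size // (1024 * 1024))  # 120 -> configure at least 128mb
```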
Master → Replica A → Replica B → Replica C

Limiting the master to a single replication chain dramatically reduces storm probability.
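A cascading chain is configured on the replicas themselves with the `replicaof` directive (hostnames and ports below are placeholders): B syncs from A instead of from the master, so the master serves only one full-sync stream.

```conf
# On Replica A (syncs directly from the master; addresses are placeholders):
replicaof master.example.internal 6379

# On Replica B (syncs from Replica A, not the master):
replicaof replica-a.example.internal 6379

# On Replica C (syncs from Replica B):
replicaof replica-b.example.internal 6379
```

The trade-off is extra replication lag at the end of the chain, which is usually acceptable for read replicas.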
3. Stagger Replica Restarts
# Bad: restart all nodes at once
systemctl restart redis*

# Good: restart nodes one by one (rolling restart)

4. Stabilize Sentinel
Avoid overly small down-after-milliseconds.
Maintain a stable network.
Do not colocate Sentinel with Redis on fragile nodes.
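In sentinel.conf this corresponds to the `down-after-milliseconds` directive (the master name `mymaster` and the values below are illustrative): too small a value turns every network blip into a failover, and every failover into a wave of FULLRESYNCs.

```conf
# Illustrative sentinel.conf fragment; "mymaster" and values are examples.
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000   # 30 s, not 1-2 s
sentinel failover-timeout mymaster 180000
```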
Comparison: Split‑Brain vs Replication Storm
Risk: permanent data inconsistency vs full system crash.
Automatic recovery: none vs usually possible.
Main cause: network partition vs mis‑configuration + bulk actions.
Defense focus: write protection + architecture vs backlog + cascading replication.
True Essence of Redis High Availability
HA means that even when incidents occur, data remains safe, the system stays controllable, and recovery is predictable.
Engineering Checklist
Configure min-slaves-to-write (mandatory).
Set repl-backlog-size to at least 128 MB.
Ensure high Sentinel stability.
Use cascading replication for large clusters.
Prohibit concurrent restarts; adopt rolling restarts.
Final Takeaway
Sentinel provides automatic failover, write protection prevents dual‑master writes, and adequate backlog with cascading replication stops replication‑storm crashes.
Ray's Galactic Tech