
Why Redis High Availability Fails: Split‑Brain and Replication Storm Explained

The article examines the two most dangerous production failures in Redis high‑availability—split‑brain and replication storm—explaining their causes, real‑world impact, and practical engineering safeguards such as write‑protection parameters, network isolation, backlog sizing, and cascading replication.

Ray's Galactic Tech

What Is Split‑Brain?

Split‑brain occurs when, at the same moment, two nodes act as masters and both accept write requests, leading to irreconcilable data divergence.

ClientA → Master A (write success)
ClientB → New Master B (write success)

The two masters cannot merge their data, resulting in permanent inconsistency, no automatic recovery, and the need for manual intervention, often causing severe data loss.

Data permanently inconsistent

Cannot auto‑recover

Requires manual intervention

Usually accompanied by serious data loss
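The divergence described above can be sketched in a few lines. This is a toy model, not real Redis: two nodes that both believe they are master keep accepting writes during a partition, and their datasets drift apart with no way to merge them.

```python
# Minimal split-brain sketch (toy model, not real Redis): during a
# partition, clients on each side reach a different node that still
# believes it is master, so both accept writes and the data diverges.

class Node:
    def __init__(self, name):
        self.name = name
        self.role = "master"
        self.data = {}

    def write(self, key, value):
        if self.role != "master":
            raise RuntimeError(f"{self.name} is read-only")
        self.data[key] = value
        return "OK"

master_a = Node("A")  # original master, cut off from Sentinel
master_b = Node("B")  # replica promoted by Sentinel

master_a.write("balance:42", 100)  # client on A's side of the partition
master_b.write("balance:42", 250)  # client on B's side of the partition

# The two datasets can no longer be reconciled automatically.
diverged = master_a.data != master_b.data
print(diverged)  # True
```

Neither value of `balance:42` is "the" correct one, which is exactly why recovery requires manual intervention.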

Split‑brain is not unique to Redis; it is a universal nightmare for distributed systems.

Typical Split‑Brain Scenario

The most common trigger is a network partition.

                    ┌───────────────┐
Client → Master A   │ network break │   Sentinel
                    └───────────────┘
Client → Slave B (promoted to new master)

Process:

1. Master A stays alive, but Sentinel can no longer reach it.

2. Sentinel marks A as ODOWN (objectively down).

3. Sentinel elects B as the new master.

4. Some clients can still reach A across the partition.

5. Both A and B accept writes → split-brain forms.
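The quorum step in this process can be sketched as follows. This is a simplification of Sentinel's subjective/objective down logic: each Sentinel votes on whether it can reach the master, and once the votes reach the configured quorum, failover begins, even if clients elsewhere can still reach the old master.

```python
# Simplified sketch of Sentinel's ODOWN decision (not Sentinel source code):
# a master is objectively down when at least `quorum` Sentinels agree that
# they cannot reach it (each individual opinion is a "subjective down").

def is_odown(sdown_votes, quorum):
    """sdown_votes: one bool per Sentinel; True = 'I cannot reach the master'."""
    return sum(sdown_votes) >= quorum

# All three Sentinels sit on the far side of the partition and lose master A:
print(is_odown([True, True, True], quorum=2))   # True -> failover promotes B

# With only one Sentinel complaining, no failover is triggered:
print(is_odown([True, False, False], quorum=2)) # False
```

Note what the quorum cannot see: clients on A's side of the partition. Sentinel's vote is correct within its own view of the network, yet split-brain still forms.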

Why Sentinel Cannot Fully Prevent Split‑Brain

Sentinel guarantees consistency only within the network range it can see.

Sentinel cannot control:

Client access paths

Network‑partition topology

External load‑balancer behavior

Sentinel solves “automatic failover”, not “global consistency”.

Engineering Safeguards Against Split‑Brain

Redis offers write‑protection parameters:

min-slaves-to-write 1   # reject writes with fewer than 1 connected healthy replica
min-slaves-max-lag 5    # a replica is healthy only if it acked within the last 5 s

Since Redis 5.0 these directives are also available under the names min-replicas-to-write and min-replicas-max-lag.

When a master has fewer healthy replicas than the threshold, it rejects write requests.

In a split‑brain situation this forces the original master to become read‑only, allowing only the newly elected master to continue writes and preventing dual writes.

Original master loses slaves → becomes read‑only.

New master B can keep writing.

Dual‑write scenario avoided.
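The check behind these parameters can be sketched like this. The function name and structure are simplified for illustration (this is not the Redis source): a write is accepted only if enough replicas have acknowledged recently.

```python
# Sketch of the master-side gate behind min-slaves-to-write /
# min-slaves-max-lag (simplified illustration, not Redis internals):
# count replicas whose last ACK is fresh enough, and refuse writes
# when that count falls below the configured minimum.

def can_accept_write(replica_ack_ages, min_replicas=1, max_lag=5):
    """replica_ack_ages: seconds since each connected replica's last ACK."""
    healthy = sum(1 for age in replica_ack_ages if age <= max_lag)
    return healthy >= min_replicas

# Before the partition: one replica acked 2 s ago -> writes allowed.
print(can_accept_write([2]))   # True

# After the partition: the isolated master sees no fresh ACKs -> read-only.
print(can_accept_write([30]))  # False
print(can_accept_write([]))    # False
```

This is why the safeguard works: the partition that isolates the old master from Sentinel also isolates it from its replicas, so the same event that triggers failover also forces the old master read-only.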

Protection Levels

No protection – extremely prone to split‑brain.

Sentinel – automatic failover but no split‑brain protection.

+ Write‑protection parameters – blocks dual writes.

+ Network isolation design – industrial‑grade safety.

What Is Replication Storm?

A replication storm occurs when multiple replicas trigger a full resynchronization (FULLRESYNC) at the same time. Each FULLRESYNC forces the master to fork and transfer its entire dataset, instantly overwhelming it:

CPU spikes

IO saturated

Network congestion

Redis OOM or freeze

Replica1 ─┐
Replica2 ─┼→ simultaneous FULLRESYNC → master crash
Replica3 ─┘
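The decision that turns jitter into a storm can be sketched as follows (a simplified model of the PSYNC handshake, not Redis internals): if the replica's last known offset is still covered by the master's replication backlog, the master sends a cheap partial resync; otherwise it must FULLRESYNC.

```python
# Simplified model of the PSYNC decision: a replica that reconnects asks
# to continue from its last offset. If the missed bytes still fit in the
# master's backlog, the master replies CONTINUE (partial resync); if the
# replica has fallen off the end of the backlog, a FULLRESYNC is forced.

def psync_decision(replica_offset, master_offset, backlog_bytes):
    missed = master_offset - replica_offset
    return "CONTINUE" if missed <= backlog_bytes else "FULLRESYNC"

BACKLOG = 128 * 1024 * 1024  # 128 MB backlog

# Brief jitter: replica missed 4 MB of the stream -> cheap partial resync.
print(psync_decision(0, 4 * 1024 * 1024, BACKLOG))    # CONTINUE

# Long outage under heavy writes: 512 MB missed -> FULLRESYNC; if every
# replica is in this state at once, the storm hits the master together.
print(psync_decision(0, 512 * 1024 * 1024, BACKLOG))  # FULLRESYNC
```

This is why every trigger below reduces to the same mechanism: something pushes replicas past the end of the backlog at the same moment.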

Triggers for Replication Storm

Backlog size too small.

Frequent network jitter.

Frequent Sentinel failovers.

Mass replica restarts at once.

Excessive write pressure on the master.

Engineering Solutions for Replication Storm

1. Increase Backlog Size

repl-backlog-size 128mb   # recommended minimum

For heavy‑write workloads:

repl-backlog-size 256mb
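A useful way to size the backlog is arithmetic rather than a fixed number. The heuristic below is an assumption of this article, not an official Redis formula: the backlog must hold every byte written while a replica is disconnected, plus headroom, or that replica falls off the end and triggers a FULLRESYNC.

```python
# Rule-of-thumb backlog sizing (assumed heuristic, not an official formula):
# backlog >= write throughput x longest tolerated disconnect x headroom.

def min_backlog_bytes(write_mb_per_sec, max_disconnect_sec, headroom=2.0):
    return int(write_mb_per_sec * max_disconnect_sec * headroom * 1024 * 1024)

# Example: 4 MB/s of write traffic, tolerate 30 s disconnects, 2x headroom:
needed = min_backlog_bytes(4, 30)
print(needed // (1024 * 1024), "MB")  # 240 MB -> round up to 256mb
```

If the computed value exceeds the recommended minimum, use the computed value; the 128 MB floor only protects against the most common short disconnects.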

2. Enable Cascading Replication

Master → Replica A → Replica B → Replica C

With a chain, only Replica A syncs directly from the master; B and C sync from upstream replicas. Even if B and C both need a FULLRESYNC, that load lands on other replicas, not on the master, which dramatically reduces storm probability.
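A cascading chain is configured on each replica with the `replicaof` directive (called `slaveof` before Redis 5.0). The hostnames below are placeholders for illustration:

```conf
# On Replica A (syncs directly from the master):
replicaof redis-master 6379

# On Replica B (syncs from Replica A, not the master):
replicaof replica-a 6379

# On Replica C (syncs from Replica B):
replicaof replica-b 6379
```

The same topology can be applied at runtime with `redis-cli replicaof <host> <port>` on each node.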

3. Stagger Replica Restarts

# Bad: systemctl restart redis*   # restart all nodes at once
# Good: restart nodes one by one (rolling restart)

4. Stabilize Sentinel

Avoid overly small down-after-milliseconds.

Maintain a stable network.

Do not colocate Sentinel with Redis on fragile nodes.
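A sentinel.conf sketch illustrating these points (the master name, address, and exact timeouts are placeholders to adapt to your environment):

```conf
# Monitor the master; require 2 Sentinels to agree before ODOWN:
sentinel monitor mymaster 10.0.0.1 6379 2

# Not too aggressive: brief network jitter should not trigger failover.
sentinel down-after-milliseconds mymaster 10000

# Give an in-progress failover time to complete before retrying:
sentinel failover-timeout mymaster 60000
```

An overly small down-after-milliseconds turns every jitter event into a failover, and every failover can reset replication state, feeding the storm directly.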

Comparison: Split‑Brain vs Replication Storm

Risk: permanent data inconsistency vs full system crash.

Automatic recovery: none vs usually possible.

Main cause: network partition vs mis‑configuration + bulk actions.

Defense focus: write protection + architecture vs backlog + cascading replication.

True Essence of Redis High Availability

HA means that even when incidents occur, data remains safe, the system stays controllable, and recovery is predictable.

Engineering Checklist

Configure min-slaves-to-write (mandatory).

Set backlog ≥128 MB.

Ensure high Sentinel stability.

Use cascading replication for large clusters.

Prohibit concurrent restarts; adopt rolling restarts.

Final Takeaway

Sentinel provides automatic failover, write protection prevents dual‑master writes, and adequate backlog with cascading replication stops replication‑storm crashes.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: high availability, Redis, Sentinel, Replication Storm, Split-Brain, Write Protection
Written by

Ray's Galactic Tech

Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!
