
How to Diagnose and Prevent Split-Brain Data Loss in Redis Master-Slave Clusters

This article explains why data loss can occur in Redis master‑slave clusters due to split‑brain scenarios, outlines step‑by‑step troubleshooting methods—including checking replication lag, analyzing client logs, and monitoring resource usage—and recommends configuration settings such as min‑slaves‑to‑write and min‑slaves‑max‑lag to prevent the issue.

Ziru Technology

Overview

When using a Redis master‑slave cluster, we encountered a problem: the recommended setup (one master, one slave, and three Sentinel instances) sometimes lost data on the client side, affecting recommendation reliability. Logs showed no errors and writes appeared successful, yet the business layer could not retrieve the data.

The root cause was a split‑brain situation, where two master nodes simultaneously accept write requests, causing different clients to write to different masters and leading to data loss.

Analysis Framework

Step 1: Confirm whether replication delay caused the issue

Data loss often occurs because the master’s data has not yet been synchronized to the slave. If the master fails before synchronization completes, the unsynced data is lost when the slave is promoted.

We can compare the replication offsets reported by INFO replication: master_repl_offset on the master and slave_repl_offset on the slave. If the slave's offset is smaller than the original master's at the time of failure, the loss is due to incomplete replication.
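As a sketch, the offsets can be read out of the text that INFO replication returns and compared; the sample strings below are hard-coded stand-ins for real output from the two nodes:

```python
# Minimal sketch: parse "INFO replication" text and compare the master
# and slave offsets to estimate how many bytes of writes are unreplicated.
# The sample strings are illustrative, not captured from the incident.

def parse_info(text: str) -> dict:
    """Parse 'key:value' lines of an INFO section into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

master_info = """# Replication
role:master
master_repl_offset:105874
"""

slave_info = """# Replication
role:slave
slave_repl_offset:103200
"""

master_offset = int(parse_info(master_info)["master_repl_offset"])
slave_offset = int(parse_info(slave_info)["slave_repl_offset"])

# Bytes of write traffic the slave has not yet applied; if the master
# fails now, roughly this much data is lost when the slave is promoted.
lag_bytes = master_offset - slave_offset
print(lag_bytes)  # prints 2674
```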

Step 2: Examine client operation logs and monitoring

With Sentinel handling failover, a switch occurs only after a quorum of sentinels detects a heartbeat timeout. After the switch, clients should communicate with the new master. However, some client logs showed continued communication with the original master during the switch, indicating the original master had not truly failed.
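For context, the quorum is configured per monitored master in sentinel.conf; the name, address, and timeout values below are illustrative placeholders, not values from this incident:

```
# sentinel.conf (illustrative values)
sentinel monitor mymaster 127.0.0.1 6379 2      # 2 sentinels must agree the master is down
sentinel down-after-milliseconds mymaster 5000  # heartbeat timeout before a sentinel votes "down"
sentinel failover-timeout mymaster 60000
```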

We suspect the original master experienced a “false fault” – it could not respond to Sentinel heartbeats due to resource exhaustion, yet it continued processing client writes.

Monitoring revealed a spike in CPU usage on the master’s host caused by an Elasticsearch node, which saturated the CPU and prevented the master from responding to heartbeats.

Why Split‑Brain Leads to Data Loss

During a failover, the former master receives a SLAVEOF (REPLICAOF in Redis 5+) command and performs a full synchronization with the new master: it discards its own data set and loads an RDB file from the new master. Any writes the former master accepted during the split‑brain window are therefore discarded, resulting in data loss.

How to Respond

Redis provides two configuration parameters to limit master writes: min‑slaves‑to‑write and min‑slaves‑max‑lag.

min‑slaves‑to‑write: Minimum number of slaves that must be connected and receiving data for the master to accept writes.

min‑slaves‑max‑lag: Maximum allowed replication lag, in seconds, measured from the last replication ACK the master received from each slave; slaves beyond this lag no longer count toward min‑slaves‑to‑write.

By setting appropriate thresholds (e.g., N slaves and T seconds), the master will stop accepting client writes if these conditions are not met, effectively preventing split‑brain‑induced data loss.
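As a sketch, the corresponding redis.conf lines for "at least one healthy slave, no more than 10 seconds of lag" would look like this (the values are illustrative, and should be tuned to your replication topology):

```
# redis.conf on the master (illustrative values)
min-slaves-to-write 1    # require at least 1 connected slave
min-slaves-max-lag 10    # the slave must have ACKed within the last 10 s

# Redis 5.0+ spells the same settings with "replicas":
# min-replicas-to-write 1
# min-replicas-max-lag 10
```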

Even if the original master experiences a false fault, during that period it can neither respond to Sentinel heartbeats nor exchange replication ACKs with its slave. The slave's lag therefore exceeds min‑slaves‑max‑lag, the min‑slaves‑to‑write condition fails, and the master stops accepting writes, so no new data lands on the soon‑to‑be‑demoted master.
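The write‑gating decision can be sketched in Python; the function name and the simulated lag values are illustrative, not Redis internals:

```python
# Simplified model of the min-slaves write gate: a write is accepted only
# if enough slaves have ACKed recently. Names here are illustrative.

def master_accepts_writes(slave_lags_s, min_slaves_to_write, min_slaves_max_lag):
    """Return True if at least min_slaves_to_write slaves have a
    replication ACK lag of at most min_slaves_max_lag seconds."""
    good = sum(1 for lag in slave_lags_s if lag <= min_slaves_max_lag)
    return good >= min_slaves_to_write

# Healthy cluster: one slave, ACKed 2 s ago -> writes allowed.
print(master_accepts_writes([2], 1, 10))   # prints True

# False fault: no ACKs exchanged for 30 s, so the only slave counts
# as lagging -> writes rejected, nothing new can be lost on demotion.
print(master_accepts_writes([30], 1, 10))  # prints False
```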

Common Causes of False Faults

Other programs on the same server temporarily consume excessive resources (e.g., CPU), limiting the master’s ability to respond to heartbeats.

The master itself becomes blocked (e.g., processing a big key or swapping memory), preventing heartbeat responses until the blockage is cleared.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Redis, Configuration, master‑slave, Replication, Data loss, Split-Brain
Written by Ziru Technology (Ziru Official Tech Account)