How to Diagnose and Prevent Split-Brain Data Loss in Redis Master-Slave Clusters
This article explains why data loss can occur in Redis master‑slave clusters due to split‑brain scenarios, outlines step‑by‑step troubleshooting methods—including checking replication lag, analyzing client logs, and monitoring resource usage—and recommends configuration settings such as min‑slaves‑to‑write and min‑slaves‑max‑lag to prevent the issue.
Overview
When using a Redis master‑slave cluster we encountered a problem: the recommended setup (one master, one slave, three sentinel instances) sometimes lost data on the client side, affecting recommendation reliability. Logs showed no errors and writes appeared successful, but the business layer could not retrieve the data.
The root cause was a split‑brain situation, where two master nodes simultaneously accept write requests, causing different clients to write to different masters and leading to data loss.
Analysis Framework
Step 1: Confirm whether replication delay caused the issue
Data loss often occurs because the master’s data has not yet been synchronized to the slave. If the master fails before synchronization completes, the unsynced data is lost when the slave is promoted.
Compare the replication offsets reported by INFO replication: master_repl_offset on the master and slave_repl_offset on the slave. If the slave's offset is smaller than the original master's offset at the moment of failure, the loss is due to incomplete replication.
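The comparison above can be sketched in a few lines. This is a minimal, illustrative helper, assuming you have captured the text of INFO replication from each node; the field names master_repl_offset and slave_repl_offset are Redis's, while the parsing functions and sample strings are ours.

```python
# Sketch: decide whether the data loss is explained by incomplete replication,
# given "INFO replication" text captured from the old master and the slave.

def repl_offset(info_text: str, field: str) -> int:
    """Extract an integer offset field from INFO replication output."""
    for line in info_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split(":", 1)[1])
    raise KeyError(field)

def lost_to_replication_lag(master_info: str, slave_info: str) -> bool:
    """True if the slave was behind the master when it was promoted,
    i.e. the unsynced bytes died with the old master."""
    master_off = repl_offset(master_info, "master_repl_offset")
    slave_off = repl_offset(slave_info, "slave_repl_offset")
    return slave_off < master_off

# Hypothetical captured snippets:
master_info = "role:master\r\nmaster_repl_offset:1745203"
slave_info = "role:slave\r\nslave_repl_offset:1744980"
print(lost_to_replication_lag(master_info, slave_info))  # True: 223 bytes unsynced
```

If the offsets match, replication lag is ruled out and the investigation moves to the client logs, as in Step 2.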
Step 2: Examine client operation logs and monitoring
With Sentinel managing failover, a switch occurs only after a quorum of sentinels detects a heartbeat timeout; after the switch, clients communicate with the new master. However, some client logs showed continued communication with the original master during the switch, indicating that the original master had not truly failed.
We suspect the original master experienced a “false fault” – it could not respond to Sentinel heartbeats due to resource exhaustion, yet it continued processing client writes.
Monitoring revealed a spike in CPU usage on the master’s host caused by an Elasticsearch node, which saturated the CPU and prevented the master from responding to heartbeats.
Why Split‑Brain Leads to Data Loss
During a failover, the former master receives a SLAVEOF command and performs a full synchronization with the new master, loading an RDB file. Any writes that occurred on the former master during this period are discarded, resulting in data loss.
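The sequence above can be made concrete with a toy model. This is not Redis internals, just a sketch of the failover window: the old master keeps accepting writes during its false fault, then the SLAVEOF demotion forces a full resync that replaces its dataset with the new master's RDB snapshot.

```python
# Toy model of the split-brain window (illustrative, not Redis internals).

old_master = {"k1": "v1"}          # state replicated before the fault
new_master = dict(old_master)      # the promoted slave starts from synced data

# Split-brain window: some clients still write to the old master...
old_master["k2"] = "written-during-split-brain"
# ...while others write to the newly promoted master.
new_master["k3"] = "written-after-failover"

# Sentinel demotes the old master via SLAVEOF; the full sync loads the new
# master's RDB, discarding everything accepted during the window.
old_master = dict(new_master)

print("k2" in old_master)  # False: the split-brain write is gone
```

This is why the problem surfaces as silent loss: the clients that wrote "k2" saw a successful reply, yet the key no longer exists anywhere after the resync.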
How to Respond
Redis provides two configuration parameters that limit writes on the master: min-slaves-to-write and min-slaves-max-lag.
min-slaves-to-write: the minimum number of slaves that must be connected and receiving data for the master to accept writes.
min-slaves-max-lag: the maximum acceptable replication lag, in seconds, measured from the last ACK the master received from each slave.
By setting appropriate thresholds (e.g., N slaves and T seconds), the master will stop accepting client writes if these conditions are not met, effectively preventing split‑brain‑induced data loss.
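Both directives go in redis.conf on the master. The thresholds below (one slave, ten seconds) are illustrative, not prescriptive; tune them to your replication topology. Note that since Redis 5 the same settings are also available under the names min-replicas-to-write and min-replicas-max-lag.

```
# redis.conf on the master -- illustrative thresholds, tune for your setup.
# Refuse writes unless at least 1 slave is connected...
min-slaves-to-write 1
# ...and that slave's last ACK is no older than 10 seconds.
min-slaves-max-lag 10
```

With these set, a client writing during a split-brain window receives an error instead of a silently doomed success, so the application can retry or fail loudly.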
Even if the original master suffers only a false fault, its slave has been redirected to the new master, so its connected-slave count and ACK lag can no longer satisfy the min-slaves-to-write and min-slaves-max-lag conditions. The old master therefore rejects writes, keeping clients from writing data that would later be discarded.
Common Causes of False Faults
Other programs on the same server temporarily consume excessive resources (e.g., CPU), limiting the master’s ability to respond to heartbeats.
The master itself becomes blocked (e.g., processing a big key or swapping memory), preventing heartbeat responses until the blockage is cleared.