Understanding Redis Master‑Slave Replication Storm and Mitigation Strategies
The article explains Redis master‑slave asynchronous replication, describes how repeated full‑sync requests from slaves can cause a replication storm that overloads CPU, memory and network, and offers practical solutions such as limiting data size, adjusting buffer limits, and redesigning deployment topology.
1. Overview of Master‑Slave Replication
Redis replicates asynchronously: the master acknowledges writes without waiting for slaves to confirm them, which keeps latency low and throughput high. When a slave requests a full synchronization, the master creates an RDB snapshot and sends it to that slave.
2. Replication Storm
A replication storm occurs when a slave fails to receive or load the RDB snapshot and requests another full sync, and several slaves enter this retry loop at the same time, so the master faces a continuous stream of full-sync requests.
3. Problem Symptoms
3.1 CPU
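The master forks a child process to generate the RDB snapshot. With a large dataset the fork itself takes longer (the main thread is blocked while page tables are copied) and snapshot generation is CPU-intensive, so the master sees CPU spikes and responds more slowly to normal requests.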
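One way to watch the fork cost is the latest_fork_usec field that Redis reports in the Stats section of INFO. The sketch below parses an INFO reply that has already been fetched as text and flags a slow fork. The sample text and the 100 ms threshold are assumptions chosen for illustration, not values from the article.

```python
def parse_info(text):
    """Parse the 'key:value' lines of a Redis INFO reply into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fork_is_slow(info_text, threshold_ms=100.0):
    """Return True if the most recent fork took longer than threshold_ms."""
    fork_ms = int(parse_info(info_text)["latest_fork_usec"]) / 1000
    return fork_ms > threshold_ms


# Illustrative sample, not captured from a real server.
SAMPLE_INFO = """# Stats
total_connections_received:120
latest_fork_usec:250000
"""

print(fork_is_slow(SAMPLE_INFO))  # True (250 ms exceeds the 100 ms threshold)
```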
3.2 Disk
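Redis has supported diskless replication since version 2.8.18. When it is enabled (repl-diskless-sync), the child process streams the RDB snapshot directly to slaves over the network instead of writing it to disk first, so disk I/O is not a bottleneck. With disk-backed replication the master still writes the RDB file to disk before sending it, and disk I/O can matter.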
3.3 Memory and Network
During a storm the master generates RDB snapshots and streams them to many slaves while buffering the writes that arrive in the meantime. This consumes significant memory and network bandwidth, increases latency, and can break connections; slaves whose sync failed then retry, forming a vicious cycle.
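A rough back-of-the-envelope estimate shows why the network saturates. The sketch below assumes every slave needs its own full copy of the snapshot and that all copies share one outbound link; the sizes and link speed are hypothetical numbers, not measurements.

```python
GIB = 1024 ** 3


def full_sync_seconds(rdb_bytes, slaves, link_bytes_per_sec):
    """Seconds needed to push one RDB copy to each slave over a shared link."""
    return rdb_bytes * slaves / link_bytes_per_sec


# Hypothetical case: 10 GiB snapshot, 5 slaves, link carrying about 1 GiB/s.
print(full_sync_seconds(10 * GIB, 5, 1 * GIB))  # 50.0
```

If that duration exceeds the slaves' timeout, the transfers fail and are requested again, which is the retry loop described above.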
4. Scenarios That Trigger Storms
A single master instance suffers a network interruption or restarts.
Multiple master instances on the same machine suffer a network interruption or restart.
Many slave nodes restart simultaneously.
The replication buffer (client-output-buffer-limit) is too small, so it overflows when the master keeps receiving writes while a slave is still receiving or loading the snapshot; the master then closes the connection and the slave has to start over.
Long network interruptions (cross-region links, DNS issues) cause timeouts, and the slave falls so far behind that its offset is no longer covered by the replication backlog, which forces a full sync instead of a partial one.
The dataset is so large that RDB generation and transfer take too long, causing slaves to time out and repeatedly request a full sync.
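The buffer and backlog scenarios above come down to simple arithmetic: bytes written during the critical window versus the configured size. The following sketch is a simplification under stated assumptions: it considers only the hard limit of client-output-buffer-limit (ignoring the soft limit and its time window) and assumes a constant write rate. All numbers are hypothetical.

```python
MB = 1024 ** 2


def output_buffer_overflows(write_bytes_per_sec, sync_seconds, hard_limit_bytes):
    """True if writes accumulated during a full sync exceed the hard limit."""
    return write_bytes_per_sec * sync_seconds > hard_limit_bytes


def backlog_covers_outage(write_bytes_per_sec, outage_seconds, backlog_bytes):
    """True if the backlog still holds everything written during the outage."""
    return write_bytes_per_sec * outage_seconds <= backlog_bytes


# 5 MB/s of writes during a 60 s full sync against a 256 MB hard limit:
print(output_buffer_overflows(5 * MB, 60, 256 * MB))  # True, 300 MB > 256 MB

# The same write rate during a 10 s outage against a 1 MB backlog:
print(backlog_covers_outage(5 * MB, 10, 1 * MB))  # False, so a full sync follows
```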
5. Mitigation Strategies
5.1 Reduce Storage Upper Limit
Avoid storing excessively large datasets in a single Redis instance to keep RDB generation and transmission times reasonable.
5.2 Adjust Replication Buffers
Increase the master's client-output-buffer-limit for slave connections and the repl-timeout value so slaves have enough time and buffer space to receive and load the RDB snapshot without being disconnected.
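As a starting point for sizing, the sketch below turns an observed write rate and sync duration into redis.conf lines. The directive names are real Redis settings, but the sizing rule is an assumption made for illustration: a 2x headroom factor, a soft limit at half the hard limit, and a 60-second soft window. It also emits repl-backlog-size, since a larger backlog lets a briefly disconnected slave resume with a partial sync. The replica class keyword is used by Redis 5 and later; older versions call the class slave.

```python
import math

MB = 1024 ** 2


def suggest_settings(write_bytes_per_sec, sync_seconds, outage_seconds, headroom=2.0):
    """Return redis.conf lines sized from observed load (illustrative rule only)."""
    # Bytes the master must buffer for one slave while a full sync is running.
    hard_mb = math.ceil(write_bytes_per_sec * sync_seconds * headroom / MB)
    # Bytes the backlog must retain to allow a partial resync after an outage.
    backlog_mb = math.ceil(write_bytes_per_sec * outage_seconds * headroom / MB)
    timeout = math.ceil(sync_seconds * headroom)
    return [
        f"client-output-buffer-limit replica {hard_mb}mb {hard_mb // 2}mb 60",
        f"repl-backlog-size {backlog_mb}mb",
        f"repl-timeout {timeout}",
    ]


# Hypothetical load: 5 MB/s of writes, 60 s full sync, 30 s tolerated outage.
for line in suggest_settings(5 * MB, 60, 30):
    print(line)
# client-output-buffer-limit replica 600mb 300mb 60
# repl-backlog-size 300mb
# repl-timeout 120
```

Larger buffers trade memory on the master for fewer forced disconnects, so the values should be checked against the memory actually available on the host.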
5.3 Change Deployment Pattern
Do not deploy multiple master nodes on the same host, so that a single host failure does not trigger full-sync requests from the slaves of every master at once.
5.4 Architecture Adjustments
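Reduce the number of slave nodes attached directly to the master, or introduce a hierarchical (cascading) architecture in which some slaves replicate from other slaves. Since Redis 4.0, sub-slaves receive exactly the same replication stream that the master sends to its direct slaves, so the hierarchy takes load off the primary master without changing the data each slave receives.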
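To illustrate the hierarchical layout, the sketch below assigns each slave a parent so that no node, including the master, feeds more than a fixed number of children. The node names and the fan-out of 2 are hypothetical; in practice each slave would then be pointed at its parent with REPLICAOF (SLAVEOF before Redis 5).

```python
from collections import deque


def plan_cascade(master, slaves, fanout):
    """Map each slave to a parent so no node has more than `fanout` children."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    parents = {}
    children = {master: 0}
    open_slots = deque([master])  # nodes that can still accept a child
    for slave in slaves:
        parent = open_slots[0]
        parents[slave] = parent
        children[parent] += 1
        if children[parent] == fanout:
            open_slots.popleft()  # parent is full, move on to the next node
        children[slave] = 0
        open_slots.append(slave)  # this slave can serve sub-slaves later
    return parents


plan = plan_cascade("master", ["s1", "s2", "s3", "s4", "s5", "s6"], fanout=2)
for slave, parent in plan.items():
    print(f"{slave} replicates from {parent}")
```

With six slaves and a fan-out of 2, only s1 and s2 attach to the master; the other four replicate from s1 and s2, so a simultaneous resync costs the master two full-sync streams instead of six. The trade-off is that an intermediate slave becomes a dependency for everything below it.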
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.