Understanding Redis Master‑Slave Replication Storm and Mitigation Strategies

The article explains Redis master‑slave asynchronous replication, describes how repeated full‑sync requests from slaves can cause a replication storm that overloads CPU, memory and network, and offers practical solutions such as limiting data size, adjusting buffer limits, and redesigning deployment topology.

Aikesheng Open Source Community

1. Overview of Master‑Slave Replication

Redis replication is asynchronous: the master acknowledges client writes without waiting for slaves to confirm them, which keeps latency low and throughput high. When a slave requests a full synchronization, the master creates an RDB snapshot and sends it to that slave.

2. Replication Storm

A replication storm starts when a slave fails to complete a full sync (for example, because it cannot load the RDB snapshot) and immediately requests another one. When several slaves enter this loop at the same time, the master faces a continuous stream of full-sync requests.
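The retry loop can be illustrated with a deliberately simple toy model. Everything in it is an assumption made for illustration: the function name, the idea that sync time grows linearly with the number of concurrent full syncs, and the fixed timeout. It is not how Redis schedules syncs, only a sketch of why the loop does not resolve on its own.

```python
def simulate_storm(slaves, base_sync_s, timeout_s, max_rounds=5):
    """Return how many slaves are still unsynced after each retry round.

    Toy assumption: the master shares its capacity equally, so one round of
    full syncs takes base_sync_s * (number of slaves syncing at once).
    """
    pending = slaves
    history = []
    for _ in range(max_rounds):
        sync_s = base_sync_s * pending
        if sync_s <= timeout_s:
            pending = 0          # every slave finishes before the timeout
        # otherwise every slave times out and retries in the next round
        history.append(pending)
        if pending == 0:
            break
    return history


print(simulate_storm(slaves=2, base_sync_s=20, timeout_s=60))  # [0]
print(simulate_storm(slaves=4, base_sync_s=20, timeout_s=60))  # [4, 4, 4, 4, 4]
```

With two slaves the sync completes in the first round. With four, every round exceeds the timeout, so the same four slaves keep retrying until something external changes (fewer slaves, a longer timeout, or less data).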

3. Problem Symptoms

3.1 CPU

The master forks a child process to generate the RDB snapshot. The fork call blocks the master's main thread, and the larger the dataset, the longer it takes. Together with snapshot generation, this causes CPU spikes and slows responses to normal client requests.
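A rough way to see why fork cost grows with data volume is to estimate the page table that the operating system has to copy during fork. The sketch below assumes 4 KiB pages and 8-byte page-table entries (typical for Linux on x86-64 without huge pages); the function name is illustrative.

```python
PAGE_SIZE = 4 * 1024   # bytes per memory page (assumed: 4 KiB pages)
PTE_SIZE = 8           # bytes per page-table entry (assumed: x86-64)
GIB = 1024 ** 3
MIB = 1024 ** 2


def page_table_bytes(dataset_bytes: int) -> int:
    """Approximate size of the page table that fork() has to copy."""
    pages = -(-dataset_bytes // PAGE_SIZE)  # ceiling division
    return pages * PTE_SIZE


for size_gib in (1, 8, 24):
    table = page_table_bytes(size_gib * GIB)
    print(f"{size_gib:>2} GiB dataset -> ~{table // MIB} MiB page table")
```

Under these assumptions a 24 GiB instance has to copy roughly 48 MiB of page-table data on every fork, and during a storm that fork happens again for each new round of full syncs.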

3.2 Disk

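Redis has supported diskless replication since version 2.8.18. When it is enabled (repl-diskless-sync yes), the child process streams the RDB snapshot directly to the slaves' sockets instead of writing it to disk first, so disk I/O on the master is not a bottleneck.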

3.3 Memory and Network

During a storm, the master repeatedly creates an in-memory RDB snapshot and streams it to many slaves at once. This consumes significant memory and network bandwidth, increases latency, and can break connections. Slaves whose sync fails then retry, forming a vicious cycle.
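The network side can be approximated with simple arithmetic. The sketch below assumes that the RDB payload is about as large as the dataset (in practice it is often smaller), that all slaves share one outbound link, and that the figures used are purely illustrative. The 60-second value is the default repl-timeout.

```python
REPL_TIMEOUT_S = 60  # Redis default for repl-timeout, in seconds


def full_sync_seconds(dataset_gb: float, slaves: int, link_gbit_per_s: float) -> float:
    """Time to stream one RDB copy to every slave over a shared link."""
    total_gbit = dataset_gb * 8 * slaves   # GB -> Gbit, one copy per slave
    return total_gbit / link_gbit_per_s


for link in (10, 1):
    seconds = full_sync_seconds(dataset_gb=10, slaves=5, link_gbit_per_s=link)
    verdict = "within" if seconds <= REPL_TIMEOUT_S else "exceeds"
    print(f"{link:>2} Gbit/s link: {seconds:.0f} s ({verdict} repl-timeout)")
```

In this example, five slaves pulling a 10 GB dataset need about 40 seconds on a 10 Gbit/s link but about 400 seconds on a 1 Gbit/s link, which is well past the default timeout.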

4. Scenarios That Trigger Storms

A single master instance experiences a network interruption or restarts.

Multiple master instances on the same machine experience a network interruption or restart.

Many slave nodes restart simultaneously.

The replication output buffer (client-output-buffer-limit) is too small, so it overflows when the master keeps receiving writes while a slave is still synchronizing. The master then closes the connection and the slave has to start over.

Long network interruptions (cross-region links, DNS issues) cause timeouts, and the slave falls further behind than the replication backlog can cover, so a partial resync is no longer possible.

An excessively large dataset makes RDB generation take so long that slaves time out and repeatedly request a full sync.
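The buffer and backlog scenarios above come down to comparing how much data the master writes during a window against a configured size. The following sketch uses hypothetical helper names and ignores the soft limit for simplicity. The constants are the commonly documented defaults (256 MB hard limit for replica output buffers, 1 MB repl-backlog-size); check them against your own version and configuration.

```python
DEFAULT_HARD_LIMIT_MB = 256  # client-output-buffer-limit hard limit for replicas
DEFAULT_BACKLOG_MB = 1       # repl-backlog-size


def buffer_overflows(write_mb_per_s, sync_seconds, hard_limit_mb=DEFAULT_HARD_LIMIT_MB):
    """True if writes accumulated during a full sync exceed the output buffer."""
    return write_mb_per_s * sync_seconds > hard_limit_mb


def needs_full_sync(write_mb_per_s, outage_seconds, backlog_mb=DEFAULT_BACKLOG_MB):
    """True if writes during an outage exceed what the backlog can replay."""
    return write_mb_per_s * outage_seconds > backlog_mb


# 5 MB/s of writes during a 120 s sync accumulates 600 MB: the buffer overflows.
print(buffer_overflows(write_mb_per_s=5, sync_seconds=120))
# 5 MB/s of writes during a 10 s outage is 50 MB: the 1 MB backlog is not enough.
print(needs_full_sync(write_mb_per_s=5, outage_seconds=10))
```

Both checks return True for these illustrative numbers, meaning the slave would be disconnected mid-sync in the first case and forced into a full sync in the second.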

5. Mitigation Strategies

5.1 Limit Data Size per Instance

Avoid storing excessively large datasets in a single Redis instance to keep RDB generation and transmission times reasonable.

5.2 Adjust Replication Buffers

Increase the master’s client-output-buffer-limit and repl-timeout values so slaves have enough time and buffer space to receive and load the RDB snapshot without being disconnected.
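One possible way to derive starting values is to size the buffer from the expected write rate and sync duration, then render the corresponding redis.conf lines. This is a sketch under stated assumptions: the function name, the 2x headroom, the soft limit set to half the hard limit, and the 60-second soft window are all illustrative choices, not recommendations from the original text. Only the directive names and their argument order come from Redis itself (versions before 5.0 use the class name slave instead of replica).

```python
import math


def suggest_settings(write_mb_per_s, sync_seconds, headroom=2.0):
    """Render redis.conf lines sized for the expected sync window."""
    hard_mb = math.ceil(write_mb_per_s * sync_seconds * headroom)
    soft_mb = math.ceil(hard_mb / 2)            # assumed: soft limit = half of hard
    timeout_s = math.ceil(sync_seconds * headroom)
    return [
        # Format: client-output-buffer-limit <class> <hard> <soft> <soft-seconds>
        f"client-output-buffer-limit replica {hard_mb}mb {soft_mb}mb 60",
        f"repl-timeout {timeout_s}",
    ]


for line in suggest_settings(write_mb_per_s=5, sync_seconds=120):
    print(line)
```

For 5 MB/s of writes and a 120-second sync, this yields a 1200 MB hard limit and a 240-second timeout. Keep in mind that larger output buffers consume master memory for every slave that is syncing, so the values should be weighed against available RAM.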

5.3 Change Deployment Pattern

Do not deploy multiple master nodes on the same host. Otherwise, a single host failure or restart makes the slaves of every master on that host request a full sync at the same time.
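A deployment inventory can be checked for this pattern with a few lines of code. The inventory format below, a list of (host, port, role) tuples, and the host names are hypothetical; adapt them to however your environment records instance placement.

```python
from collections import defaultdict


def hosts_with_multiple_masters(instances):
    """Return {host: [ports]} for hosts that run more than one master."""
    masters = defaultdict(list)
    for host, port, role in instances:
        if role == "master":
            masters[host].append(port)
    return {host: ports for host, ports in masters.items() if len(ports) > 1}


# Hypothetical inventory for illustration only.
inventory = [
    ("host-a", 6379, "master"),
    ("host-a", 6380, "master"),
    ("host-b", 6379, "master"),
    ("host-b", 6380, "slave"),
]
print(hosts_with_multiple_masters(inventory))  # {'host-a': [6379, 6380]}
```

Any host reported here is a candidate for rebalancing, since losing it would affect the slaves of several masters at once.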

5.4 Architecture Adjustments

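Reduce the number of slave nodes, or introduce a hierarchical slave architecture in which some slaves replicate from other slaves (sub-slaves) rather than from the master. Since Redis 4.0, sub-slaves receive exactly the same replication stream that the master produces, so the hierarchy keeps data consistent while reducing the number of slaves the master has to serve directly.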
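The effect of a hierarchy on the master can be sketched by counting direct children in each layout. The topology representation, a mapping from each node to the node it replicates from, and the node names are assumptions made for this example.

```python
def direct_replicas(topology, node):
    """List the nodes that replicate directly from `node`."""
    return [child for child, parent in topology.items() if parent == node]


# Flat layout: all six slaves replicate from the master.
flat = {f"s{i}": "master" for i in range(1, 7)}

# Tiered layout: two slaves replicate from the master, four from those slaves.
tiered = {
    "s1": "master", "s2": "master",
    "s3": "s1", "s4": "s1",
    "s5": "s2", "s6": "s2",
}

print(len(direct_replicas(flat, "master")))    # 6
print(len(direct_replicas(tiered, "master")))  # 2
```

In the tiered layout the master serves at most two full syncs after a failure instead of six. The trade-off is that the intermediate slaves now carry that load, and their sub-slaves depend on them staying in sync.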
