Redis Cluster Replication Failure Diagnosis and Resolution After Node Restart

The article analyzes a Redis 5.0 six-node cluster in which, after a node reboot, the restarted replica failed partial resynchronization, fell into a loop of full resyncs, and kept losing its connection to the master. It proposes increasing repl-timeout as an effective fix.

Aikesheng Open Source Community

Feb 16, 2023

Background

A six-node Redis cluster runs two instances per node, on ports 7000 and 7001, for twelve instances in total (6 shards, each with a master and a replica). On the night of January 20, a hardware failure caused one node to restart. After the reboot, its instance rejoined as a replica but could not synchronize with the new master, which triggered alerts.

Diagnosis

Log inspection on the affected node shows the following sequence:

22996:S 20 Jan 2023 07:27:15.091 * Connecting to MASTER x.x.x.46:7001
22996:S 20 Jan 2023 07:27:15.091 * MASTER <-> REPLICA sync started
22996:S 20 Jan 2023 07:27:15.106 * Non blocking connect for SYNC fired the event.
22996:S 20 Jan 2023 07:27:15.106 * Master replied to PING, replication can continue...
22996:S 20 Jan 2023 07:27:15.106 * Trying a partial resynchronization (request 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180528579709).
22996:S 20 Jan 2023 07:35:29.263 * Full resync from master: 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180734380451
22996:S 20 Jan 2023 07:35:29.263 * Discarding previously cached master state.
22996:S 20 Jan 2023 07:44:47.717 * MASTER <-> REPLICA sync: receiving 22930214160 bytes from master

The replica first attempts a partial resync, fails, then performs a full resync.

22996:S 20 Jan 2023 07:48:07.305 * MASTER <-> REPLICA sync: Flushing old data
22996:S 20 Jan 2023 07:53:24.576 * MASTER <-> REPLICA sync: Loading DB in memory
22996:S 20 Jan 2023 07:59:59.491 * MASTER <-> REPLICA sync: Finished with success

Flushing the old data and loading the new RDB takes just under 12 minutes (07:48:07 to 07:59:59), and immediately afterwards the connection to the master is lost:

22996:S 20 Jan 2023 07:59:59.521 # Connection with master lost.
22996:S 20 Jan 2023 07:59:59.521 * Caching the disconnected master state.

The replica then retries the connection, entering a loop of full sync → data flush & load → connection loss, restarting the process each time.
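To see where the time goes in one pass of this loop, the replica log timestamps can be turned into per-phase durations. The following is a minimal sketch in Python; the helper names `parse_timestamp` and `phase_durations` are illustrative and not part of Redis or of the original investigation. The log lines are copied from the excerpts above.

```python
from datetime import datetime

# Replica log lines copied from the excerpts above (one per phase boundary).
LOG_LINES = [
    "22996:S 20 Jan 2023 07:27:15.091 * MASTER <-> REPLICA sync started",
    "22996:S 20 Jan 2023 07:35:29.263 * Full resync from master: 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180734380451",
    "22996:S 20 Jan 2023 07:44:47.717 * MASTER <-> REPLICA sync: receiving 22930214160 bytes from master",
    "22996:S 20 Jan 2023 07:48:07.305 * MASTER <-> REPLICA sync: Flushing old data",
    "22996:S 20 Jan 2023 07:53:24.576 * MASTER <-> REPLICA sync: Loading DB in memory",
    "22996:S 20 Jan 2023 07:59:59.491 * MASTER <-> REPLICA sync: Finished with success",
]


def parse_timestamp(line: str) -> datetime:
    """Extract the 'DD Mon YYYY HH:MM:SS.mmm' timestamp that follows the pid:role field."""
    fields = line.split()
    return datetime.strptime(" ".join(fields[1:5]), "%d %b %Y %H:%M:%S.%f")


def phase_durations(lines):
    """Return (message, seconds until the next logged event) for each consecutive pair."""
    stamps = [parse_timestamp(line) for line in lines]
    result = []
    for i in range(len(lines) - 1):
        message = lines[i].split(" * ", 1)[1]
        seconds = (stamps[i + 1] - stamps[i]).total_seconds()
        result.append((message, seconds))
    return result


durations = phase_durations(LOG_LINES)
for message, seconds in durations:
    print(f"{seconds:8.1f} s  {message[:50]}")

# Flush + load: the last two phases, from "Flushing old data" to "Finished with success".
blocked = sum(seconds for _, seconds in durations[-2:])
print(f"flush + load: {blocked:.0f} s")
```

The last two phases (flush and load) add up to roughly 712 seconds, which is the figure that matters when comparing against replication timeouts.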

Two questions arise:

Why does the partial resync fail?

Why does the full resync eventually lose the connection to the master?

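Analysis of the master logs shows a BGSAVE roughly every 9 minutes, each using about 2.6 GB of copy-on-write memory, which points to a heavy write rate. The replication backlog (repl-backlog-size) is only 100 MB in this deployment (the stock Redis default is even smaller, 1 MB). The replica was down for about 15 minutes, long enough for the backlog to be overwritten past the offset the replica requested, so the partial resync failed.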

38241:C 20 Jan 2023 07:35:25.836 * DB saved on disk
38241:C 20 Jan 2023 07:35:26.552 * RDB: 2663 MB of memory used by copy-on-write
40362:M 20 Jan 2023 07:35:27.950 * Background saving terminated with success
40362:M 20 Jan 2023 07:35:27.950 * Starting BGSAVE for SYNC with target:disk
40362:M 20 Jan 2023 07:35:29.263 * Background saving started by pid 11680
11680:C 20 Jan 2023 07:44:44.585 * DB saved on disk
11680:C 20 Jan 2023 07:44:45.811 * RDB: 2681 MB of memory used by copy-on-write
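The backlog argument can be checked roughly against the two offsets in the replica log. The sketch below is a simplified model, not Redis source code: a partial resync is only possible while the requested offset still lies inside the backlog window. The function name `can_partial_resync` is hypothetical, "100 MB" is assumed to mean 100 MiB, and the backlog is assumed to be full. Note also that the master offset was logged about eight minutes after the request, so the gap at the moment the master evaluated the request may have been smaller than the figure computed here.

```python
MIB = 1024 * 1024


def can_partial_resync(requested_offset: int, backlog_first_offset: int, backlog_len: int) -> bool:
    """True when the requested offset still lies inside the backlog window."""
    return backlog_first_offset <= requested_offset <= backlog_first_offset + backlog_len


# Offsets taken from the replica log above.
requested_offset = 7180528579709  # "Trying a partial resynchronization (request ...)"
master_offset = 7180734380451     # "Full resync from master: ..."

# Assumptions: the backlog is full, and "100 MB" means 100 MiB.
backlog_len = 100 * MIB
backlog_first_offset = master_offset - backlog_len + 1

gap = master_offset - requested_offset
print(f"gap: {gap} bytes ({gap / MIB:.1f} MiB), backlog: {backlog_len // MIB} MiB")
print("partial resync possible:",
      can_partial_resync(requested_offset, backlog_first_offset, backlog_len))
```

Under these assumptions the gap is about 196 MiB, roughly twice the backlog, which is consistent with the master falling back to a full resync.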

Further master logs reveal the timeline:

07:48:03 – Master successfully sends full RDB to replica.

07:48:07 – Replica starts flushing its old data. The flush blocks the instance, and once it has been unresponsive for longer than cluster-node-timeout (10 s) the cluster detects the timeout and logs a FAIL message.

07:50:17 – Master times out the replica connection and disconnects.

07:53:24 – Replica finishes data flush, starts loading new RDB; cluster re‑recognizes the pair but master_link_status remains down.

07:59:59 – Replica finishes loading the RDB and tries to resume replication, but the master had already closed the connection more than 9 minutes earlier (at 07:50:17), so the sync restarts.

The large data set (one instance holds more than 40 GB) and the heavy write load, visible in the copy-on-write memory used by each BGSAVE, shorten the time span the replication backlog can cover, while the long flush-and-load phase of the full sync triggers the master-replica connection timeout.
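The timeline can be reduced to a comparison between how long the replica stays silent and the timeouts involved. The sketch below uses a deliberately simplified model (a peer gives up once silence exceeds its timeout); Redis's actual bookkeeping is more detailed, and the helper names are illustrative. The timestamps and the 10 s and 60 s values come from the article.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def link_survives(silent_seconds: float, timeout_seconds: float) -> bool:
    """Simplified model: the peer gives up once silence exceeds its timeout."""
    return silent_seconds <= timeout_seconds


# From "master finished sending the RDB" to "replica finished loading it".
replication_silence = seconds_between("07:48:03", "07:59:59")
# Duration of the blocking flush on the replica.
flush_block = seconds_between("07:48:07", "07:53:24")

print(f"replication silence: {replication_silence:.0f} s")
print("survives repl-timeout=60:  ", link_survives(replication_silence, 60))
print("survives repl-timeout=1200:", link_survives(replication_silence, 1200))
print("survives cluster-node-timeout=10:", link_survives(flush_block, 10))
```

With these numbers, a 60-second timeout is exceeded many times over, whereas a value such as 1200 seconds leaves headroom above the roughly 12-minute silent period.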

Solution

Increase repl-backlog-size. The current 100 MB is insufficient for this workload, but the backlog is held in memory, so raising it too high may exhaust OS memory and cause OOM.

Increase repl-timeout from the default 60 seconds to a value larger than the full‑sync duration (e.g., 1200 seconds). This was the chosen fix.

Reduce the memory used by each Redis instance to 10 GB or less, which requires cooperation from the application developers.

After raising repl-timeout to 1200 seconds, the replication issue was resolved.
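The article does not say how the setting was applied. As one possible approach, the sketch below changes it at runtime through a redis-py style client (an assumption; any client exposing `config_set`, `config_get`, and `config_rewrite` would do). The function name `apply_replication_settings` and its parameters are hypothetical.

```python
def apply_replication_settings(client, repl_timeout_s=1200, repl_backlog_bytes=None, persist=False):
    """Apply replication settings via CONFIG SET and return the values read back.

    `client` is assumed to behave like redis.Redis from redis-py.
    """
    client.config_set("repl-timeout", repl_timeout_s)
    if repl_backlog_bytes is not None:
        client.config_set("repl-backlog-size", repl_backlog_bytes)
    if persist:
        # CONFIG REWRITE writes the running configuration back to redis.conf.
        client.config_rewrite()
    return client.config_get("repl-*")


# Usage sketch (requires the redis package and a reachable instance):
#   import redis
#   settings = apply_replication_settings(redis.Redis(host="127.0.0.1", port=7000))
#   print(settings)
```

In a cluster, CONFIG SET affects only the instance it is sent to, so the change would need to be applied to every master and replica, and persisted so that it survives the next restart.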
