Backend Development 19 min read

Designing Multi‑Data‑Center Redis Cache with Strong Consistency and Failover

This article walks through the evolution of a Redis‑based cache layer for multi‑data‑center deployments, addressing consistency, safety, performance, disk‑space, data loops, timestamp versioning, master‑slave failover, and global numeric aggregation, and culminates in a ready‑to‑use middleware solution.

dbaplus Community

Mar 22, 2020

Designing Multi‑Data‑Center Redis Cache with Strong Consistency and Failover

Introduction

The classic four‑layer backend architecture consists of an access layer (security, rate limiting), a logic layer (business logic), a cache layer (e.g., Redis, Memcached) for performance, and a storage layer (MySQL, LevelDB). When an entire data‑center goes down, traffic must be shifted to another site, but each layer presents different challenges, especially for the cache layer where a sudden drop in hit‑rate can overload the storage tier.

The article focuses on two core challenges in multi‑data‑center cache deployments: data consistency and data safety.

Origin Version

The initial design writes all modifications (writes and deletes) to disk and runs a separate process that reads the disk and pushes the data to other data‑centers.

Two problems arise:

The synchronization process may restart and lose its position.

Data may become incomplete (e.g., Redis AOF rewrite deletes the old file, making previous offsets invalid).

Version 1

To solve the above issues:

Reliable DB: Store the current sync offset in a database so a restarted process can resume from the saved offset.

Write‑ahead log: Record every write in an immutable, timestamped log (similar to MySQL binlog) that cannot be modified or deleted before being replicated.

After running in production, three new issues appear:

Performance degradation: Updating the MySQL offset for every synced record slows the sync process.

Insufficient disk space: Multiple services write to the same machine, and cross‑data‑center sync adds extra load.

Data loop: Data synced from A to B may be sent back from B to A, wasting bandwidth.

Version 2

Addressing the two problems of Version 1:

Periodic DB offset writes: Sync process writes its offset to the DB at intervals instead of after every record, improving performance but risking duplicate processing after a restart.

Idempotent logs: Convert each write into an idempotent operation (e.g., inc cnt becomes set cnt 101) so replaying logs does not corrupt data. This works for strings, hashes, sets, and sorted sets; lists are excluded.

Message queue: Use Kafka or RocketMQ to replicate data across multiple machines, providing redundancy and capacity.

Log IDs (idcid): Tag each log entry with its source data‑center ID; a sync process can discard entries that originated from itself, preventing loops.

Version 3

Even with the above fixes, data inconsistency persists when timestamps differ across data‑centers. The solution is to use timestamps as version numbers:

Higher timestamps win; a write with a newer timestamp overwrites older data.

If timestamps are equal, a pre‑configured priority (e.g., data‑center A over B) resolves the conflict.

This approach still fails when a data‑center’s clock is slower, causing its writes to have smaller timestamps than already‑replicated data. The article proposes rejecting such writes or alerting operators, but acknowledges that perfectly synchronized clocks are unrealistic.

Version 4

Introduce a distributed logical clock that only increments. Each key’s version is a monotonic logical timestamp, never decreasing, ensuring eventual consistency regardless of physical clock drift.

Version 5

When a master fails over to a replica, some writes may not have been replicated, causing inconsistency across data‑centers. The fix adds a monotonically increasing sequence number (seq) to every write and to the binlog. During sync, only entries with seq less than or equal to the replica’s last known seq are propagated.

Version 6

To correctly aggregate counters across data‑centers, store per‑data‑center values in a hash (e.g., HSET cnt A 5, HSET cnt B 3) and sum them when reading, avoiding overwrite conflicts.

RedisPlus Middleware

All the described features—reliable offset storage, idempotent logs, message‑queue integration, logical clocks, sequence numbers, and per‑data‑center aggregation—have been implemented in a middleware called RedisPlus , which is ready for production use.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Redis Cache consistency failover multi-datacenter Logical Clock

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Introduction

Origin Version

Version 1

Version 2

Version 3

Version 4

Version 5

Version 6

RedisPlus Middleware

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

Version 1

Version 2

Version 3

Version 4

Version 5

Version 6