
From Single‑Node to Scalable Redis Cluster: A Step‑by‑Step Architecture Guide

This article walks through Redis's evolution from a simple single‑instance cache to a highly available, high‑performance cluster, explaining persistence mechanisms (RDB, AOF, hybrid), master‑slave replication, Sentinel automatic failover, and sharding strategies with concrete examples and trade‑offs.

Architect

Starting with a Single‑Node Redis

Assume an application needs a cache to speed up MySQL queries; the simplest solution is to deploy a single Redis instance. The application writes data to Redis, reads it back, and benefits from in‑memory speed. When traffic is low, this model suffices.

As traffic and data volume grow, the weakness of this design shows: the single node is a single point of failure. If Redis crashes, all traffic falls back to MySQL, causing a massive load spike.

Adding Data Persistence

To avoid data loss on restart, the memory state must be persisted to disk. The naïve approach writes every command to both memory and disk, but disk I/O is far slower than memory writes, degrading performance.

Redis solves this by separating the write path into two steps:

Write the command to the log file, which at first only reaches the OS page cache (fast, in‑memory).

Flush the page cache to disk with fsync (slow).

Three AOF (Append‑Only File) flushing policies illustrate the trade‑offs:

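appendfsync always: every write triggers an fsync; safest but slowest.

appendfsync no: Redis never calls fsync itself and relies on the OS to flush; fastest, but everything still in the page cache is lost if the machine crashes.

appendfsync everysec: a background thread calls fsync once per second; the balanced default, which risks about one second of writes.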
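The two-step write path and the three policies can be sketched with a toy append-only log. This is a minimal illustration, not Redis code: `ToyAppendLog` and `count_fsyncs` are hypothetical names, and the `everysec` branch checks the clock inline on the write path, whereas Redis, as described above, does this in a background thread.

```python
import os
import tempfile
import time


class ToyAppendLog:
    """Toy append-only log illustrating the three fsync policies."""

    def __init__(self, path, policy="everysec"):
        if policy not in ("always", "everysec", "no"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.fsync_calls = 0
        self.last_fsync = time.monotonic()

    def append(self, command):
        # Step 1: write() hands the bytes to the OS page cache (fast).
        os.write(self.fd, command.encode("utf-8") + b"\n")
        # Step 2: fsync() forces the page cache to disk (slow).
        # When it happens depends on the policy.
        if self.policy == "always":
            self._fsync()
        elif self.policy == "everysec":
            if time.monotonic() - self.last_fsync >= 1.0:
                self._fsync()
        # policy == "no": never fsync here; the OS flushes when it chooses.

    def _fsync(self):
        os.fsync(self.fd)
        self.fsync_calls += 1
        self.last_fsync = time.monotonic()

    def close(self):
        os.close(self.fd)


def count_fsyncs(policy, writes=100):
    """Append `writes` commands under a policy and report the fsync count."""
    with tempfile.TemporaryDirectory() as tmp:
        log = ToyAppendLog(os.path.join(tmp, "toy.aof"), policy)
        for i in range(writes):
            log.append(f"SET key{i} value{i}")
        log.close()
        return log.fsync_calls
```

Calling `count_fsyncs("always")` performs one fsync per write, while `count_fsyncs("no")` performs none; `everysec` sits in between, bounded by elapsed time rather than by the number of writes.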

Because every write is appended, the AOF file keeps growing, so Redis provides an AOF rewrite that compacts the log by keeping only what is needed to rebuild the latest value of each key.
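The effect of a rewrite can be sketched by replaying a command log and emitting one command per surviving key. This is an illustration under simplifying assumptions: `rewrite_log` is a hypothetical helper, only `SET` and `DEL` are modeled, and it compacts by replaying the old log, whereas Redis builds the new file from its current in-memory dataset.

```python
def rewrite_log(commands):
    """Compact a command log into one SET per key that still exists."""
    state = {}
    for cmd in commands:
        parts = cmd.split(" ", 2)
        op = parts[0].upper()
        if op == "SET" and len(parts) == 3:
            state[parts[1]] = parts[2]      # later values overwrite earlier ones
        elif op == "DEL" and len(parts) == 2:
            state.pop(parts[1], None)       # deleted keys vanish from the output
        else:
            raise ValueError(f"unsupported command: {cmd!r}")
    return [f"SET {key} {value}" for key, value in state.items()]


log = ["SET a 1", "SET b 2", "SET a 3", "DEL b", "SET c 4"]
compacted = rewrite_log(log)
```

Here five logged commands shrink to two, `SET a 3` and `SET c 4`, and replaying either log yields the same final state.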

Alternatively, Redis can take periodic snapshots (RDB). A snapshot captures the entire memory state at a moment, writes it once to disk, and thus incurs minimal I/O during normal operation. The downside is that data between snapshots can be lost.
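A small sketch makes the snapshot trade-off concrete. It assumes a plain dict as the "memory state" and uses JSON instead of the real binary RDB format; `save_snapshot`, `load_snapshot`, and `demo_snapshot_loss` are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def save_snapshot(state, path):
    """Write the whole state once, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers never observe a half-written file


def load_snapshot(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def demo_snapshot_loss():
    """Show that writes made after the last snapshot do not survive a crash."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dump.json")
        state = {"a": "1"}
        save_snapshot(state, path)
        state["b"] = "2"            # written after the snapshot
        del state                   # simulate a crash: memory is gone
        return load_snapshot(path)  # recovery sees only the snapshot
```

`demo_snapshot_loss()` recovers only key `a`; key `b` falls into the window between snapshots, which is exactly the data-loss risk described above.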

Choosing between RDB and AOF depends on data‑loss tolerance:

Cache‑only workloads (loss‑tolerant) → RDB.

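Workloads that can lose at most about one second of writes → AOF with appendfsync everysec. Only appendfsync always approaches full durability, at a significant throughput cost.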

Hybrid Persistence (Redis 4.0+)

Redis 4.0 introduced hybrid persistence: during an AOF rewrite, Redis first writes a binary RDB snapshot into the AOF file, then appends subsequent commands. This combines RDB’s compact size with AOF’s near‑real‑time durability, reducing recovery time.
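The layout can be sketched as a log whose first entry is a snapshot and whose remaining entries are ordinary commands. This is a toy model, not the on-disk format: the `SNAPSHOT` line, the JSON encoding, and the helper names are all assumptions made for illustration.

```python
import json

SNAPSHOT_PREFIX = "SNAPSHOT "


def hybrid_rewrite(state):
    """Start a fresh hybrid log: one snapshot entry, no commands yet."""
    return [SNAPSHOT_PREFIX + json.dumps(state, sort_keys=True)]


def apply_command(state, cmd):
    parts = cmd.split(" ", 2)
    op = parts[0].upper()
    if op == "SET" and len(parts) == 3:
        state[parts[1]] = parts[2]
    elif op == "DEL" and len(parts) == 2:
        state.pop(parts[1], None)
    else:
        raise ValueError(f"unsupported command: {cmd!r}")


def recover(lines):
    """Load the snapshot preamble in one step, then replay the command tail."""
    state = {}
    for line in lines:
        if line.startswith(SNAPSHOT_PREFIX):
            state = json.loads(line[len(SNAPSHOT_PREFIX):])
        else:
            apply_command(state, line)
    return state


log = hybrid_rewrite({"a": "1", "b": "2"})
log.append("SET a 3")   # commands that arrive after the rewrite
log.append("DEL b")
recovered = recover(log)
```

Recovery loads the bulk of the data from the compact snapshot and replays only the short tail, which is why restart is faster than replaying a command-only log of the same history.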

Hybrid persistence is an optimization of AOF rewrite and requires AOF to be enabled.

Master‑Slave Replication for High Availability

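Deploying a master with one or more slaves keeps copies of the data up to date through asynchronous replication. If the master fails, a slave can be promoted manually, which shortens downtime, and the slaves can also serve reads. Because replication is asynchronous, a promoted slave may be missing the last few writes the master had acknowledged.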
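The behavior of asynchronous replication can be sketched with two in-memory classes. `ToyPrimary` and `ToyReplica` are hypothetical; the sketch models only the ordering of events (acknowledge locally, ship to replicas later) and omits networking, full resynchronization, and everything else a real replication link does.

```python
from collections import deque


class ToyReplica:
    def __init__(self):
        self.data = {}
        self.offset = 0  # number of writes applied so far

    def apply(self, key, value):
        self.data[key] = value
        self.offset += 1


class ToyPrimary:
    def __init__(self, replicas):
        self.data = {}
        self.offset = 0
        self.replicas = list(replicas)
        # One outgoing queue per replica, holding writes not yet shipped.
        self.pending = [deque() for _ in self.replicas]

    def set(self, key, value):
        # The write is applied and acknowledged locally first...
        self.data[key] = value
        self.offset += 1
        # ...and only queued for the replicas, not sent synchronously.
        for queue in self.pending:
            queue.append((key, value))

    def propagate(self):
        """Ship queued writes; in a real system this happens continuously."""
        for replica, queue in zip(self.replicas, self.pending):
            while queue:
                replica.apply(*queue.popleft())
```

Between `set()` and `propagate()` the replica lags behind the primary. If the primary dies in that window and the replica is promoted, the queued writes are lost, which is the cost of not waiting for replicas on every write.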

Manual promotion still incurs human reaction time, prompting the need for automation.

Sentinel: Automatic Failover

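Sentinel processes periodically ping the master. When a Sentinel gets no valid reply within the configured timeout, it marks the master as subjectively down; once a configurable quorum of Sentinels agree, the master is considered objectively down. The Sentinels then elect a leader (using a Raft‑like vote), and the leader promotes a slave to master.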

The election algorithm works as follows:

Each Sentinel requests votes from the others.

Each Sentinel votes for the first requester and votes only once.

The candidate that gathers >50% of votes becomes the leader and triggers the failover.
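The three voting rules above can be sketched as a small function. `elect_leader` is a hypothetical helper that models a single election round; it assumes vote requests are processed in arrival order and that a candidate's vote for itself is passed in like any other request.

```python
def elect_leader(sentinels, requests):
    """Run one election round.

    sentinels: names of all Sentinels in the deployment.
    requests:  (candidate, voter) pairs in the order they arrive.
    Returns the winning candidate, or None if nobody reaches a majority.
    """
    members = set(sentinels)
    voted_for = {}
    for candidate, voter in requests:
        # Each Sentinel grants its single vote to the first requester.
        if voter in members and voter not in voted_for:
            voted_for[voter] = candidate

    tally = {}
    for candidate in voted_for.values():
        tally[candidate] = tally.get(candidate, 0) + 1

    majority = len(members) // 2 + 1  # strictly more than 50%
    for candidate, votes in tally.items():
        if votes >= majority:
            return candidate
    return None


sentinels = ["s1", "s2", "s3"]
# s1 asks first, so s2 has already voted when its own candidacy starts.
won = elect_leader(sentinels, [
    ("s1", "s1"), ("s1", "s2"), ("s2", "s2"), ("s2", "s3"), ("s1", "s3"),
])
# Every Sentinel votes for itself before hearing from anyone else.
split = elect_leader(sentinels, [("s1", "s1"), ("s2", "s2"), ("s3", "s3")])
```

In the first round `s1` collects two of three votes and wins; in the second the vote is split, nobody reaches a majority, and the round has to be retried.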

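Because only one candidate can collect a majority in a given round, at most one Sentinel drives the failover. This guards against two Sentinels promoting different slaves at the same time and makes the automatic switch reliable.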

Sharding for Horizontal Scalability

When a single master cannot handle write throughput, multiple Redis instances are partitioned by key. Two common approaches:

Client‑side sharding: the application computes a hash of the key and routes the request to the appropriate node. This requires the routing logic to be embedded in the client code.

Proxy‑based sharding: a proxy layer (e.g., Twemproxy, Codis) holds the routing table, allowing clients to interact with a single endpoint while the proxy forwards commands to the correct shard.
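Whether it lives in the client or in a proxy, the routing step boils down to a deterministic function from key to shard. The sketch below assumes simple modulo hashing with CRC32 and made-up shard addresses; `shard_for` and `moved_fraction` are hypothetical names, and production proxies may use other schemes such as consistent hashing.

```python
import zlib


def shard_for(key, shards):
    """Pick a shard deterministically from the key's CRC32 hash."""
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]


def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard list changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical shard addresses, for illustration only.
three = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
four = three + ["redis-d:6379"]
keys = [f"user:{i}" for i in range(1000)]

used = {shard_for(k, three) for k in keys}
fraction = moved_fraction(keys, three, four)
```

The same key always lands on the same shard, but with modulo routing a large share of keys change shard when one is added (about three quarters for a uniform hash going from three to four shards), so scaling out requires moving data.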

Both methods rely on Sentinel for failover of individual shards.

Official Redis Cluster

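Redis 3.0 introduced the official Redis Cluster, in which nodes exchange health information over a gossip protocol, eliminating the need for external Sentinels. The key space is divided into 16384 hash slots spread across the masters. The cluster handles failover automatically, and slots can be migrated between nodes while it stays online, although resharding is started by an operator rather than happening on its own.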

Clients use a cluster‑aware client library that maps each key to a slot, learns which node serves each slot, and follows the cluster's redirections, making the cluster largely transparent to the application.
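The key-to-slot mapping a cluster-aware client performs can be sketched in a few lines. Redis Cluster hashes keys with CRC16 (the XMODEM variant) modulo 16384, and Python's `binascii.crc_hqx` with an initial value of 0 computes that CRC. The function name `key_slot` is an illustrative choice; the hash-tag handling follows the rule that only the text between the first `{` and the next `}` is hashed when that text is non-empty.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Map a key to its Redis Cluster hash slot."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Use the hash tag only if it is closed and non-empty.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


followers = key_slot(b"{user1000}.followers")
following = key_slot(b"{user1000}.following")
```

Both keys share the hash tag `user1000`, so they map to the same slot and can be used together in multi-key operations. Discovering which node owns a slot and reacting to redirections is left to the client library and is not modeled here.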

For legacy codebases that cannot upgrade the SDK, many companies build a custom proxy in front of the official cluster, allowing a seamless switch without code changes.

Putting It All Together

By progressing from a single node → persistence (RDB/AOF/hybrid) → master‑slave replication → Sentinel automatic failover → sharding (client or proxy) → official Redis Cluster, a Redis deployment can evolve to meet increasing data‑loss sensitivity, recovery‑time, availability, read‑scale, and write‑scale requirements.

The final architecture combines high‑performance in‑memory caching with robust durability and fault tolerance, ready for long‑term production use.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
