Deep Dive into Redis Cluster Architecture and Principles

This article provides a comprehensive analysis of Redis Cluster, covering node and slot assignment, command execution, resharding, redirection, fault‑tolerance, gossip communication, scaling strategies, configuration limits, and practical code examples for building and operating a high‑availability sharded Redis deployment.

Full-Stack Internet Architecture

Why Use Redis Cluster

Redis Cluster solves the performance and scalability problems caused by large data volumes by distributing data across multiple nodes. This is horizontal (scale‑out) expansion, as opposed to vertically scaling (scale‑up) a single instance.

Vertical vs Horizontal Scaling

Vertical scaling upgrades a single instance’s hardware, while horizontal scaling adds more instances, each responsible for a subset of data.

What Is a Redis Cluster

A Redis Cluster is a distributed database that partitions data into 16,384 hash slots; each node manages a range of slots. Nodes communicate via the Gossip protocol to share slot‑to‑node mappings.

Data Partitioning

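Keys are hashed with CRC16, and the 16‑bit result is taken modulo 16,384 to obtain a slot number. Optional hash tags force related keys into the same slot: when a key contains a non-empty substring between its first { and the next }, only that substring is hashed.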
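As a minimal sketch of the slot calculation described above, the following Python computes the slot for a key. It assumes the CRC16 variant is CRC‑16/XMODEM (polynomial 0x1021, initial value 0), as described in the Redis Cluster specification. The function names crc16_xmodem and key_slot are illustrative and not part of any Redis client library.

```python
TOTAL_SLOTS = 16384


def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to a hash slot, honoring a hash tag if one is present."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % TOTAL_SLOTS


if __name__ == "__main__":
    # Both keys share the tag "user1000", so they land in the same slot.
    print(key_slot("{user1000}.following"), key_slot("{user1000}.followers"))
```

Keys that share a hash tag always resolve to the same slot, which is what makes multi-key operations on them possible in a cluster.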

Slot‑to‑Node Mapping

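When a cluster is created with redis-cli --cluster create, the 16,384 slots are distributed evenly across the master nodes (e.g., with three masters: slots 0‑5460, 5461‑10922, 10923‑16383). Slots can also be assigned manually. CLUSTER ADDSLOTS takes a list of individual slot numbers, so a contiguous range is easier to assign with CLUSTER ADDSLOTSRANGE (available since Redis 7.0), which takes a start and an end slot:

redis-cli -h 172.16.19.1 -p 6379 cluster addslotsrange 0 5460
redis-cli -h 172.16.19.2 -p 6379 cluster addslotsrange 5461 10922
redis-cli -h 172.16.19.3 -p 6379 cluster addslotsrange 10923 16383

On older versions, every slot number must be passed to CLUSTER ADDSLOTS individually, for example through shell brace expansion: cluster addslots {0..5460}.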
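The three-node ranges in the example above can be reproduced with a short calculation. This is a sketch of one way to split the slot space evenly, not necessarily the exact algorithm any Redis tool uses; split_slots is a hypothetical helper name.

```python
TOTAL_SLOTS = 16384


def split_slots(node_count: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) slot ranges, one per master node."""
    if not 1 <= node_count <= TOTAL_SLOTS:
        raise ValueError("node_count must be between 1 and 16384")
    ranges = []
    first = 0
    for i in range(1, node_count + 1):
        # Place the i-th boundary at the rounded share of the slot space.
        last = round(i * TOTAL_SLOTS / node_count) - 1
        ranges.append((first, last))
        first = last + 1
    return ranges


if __name__ == "__main__":
    for first, last in split_slots(3):
        print(f"{first}-{last}")
```

For three nodes this yields 0‑5460, 5461‑10922, and 10923‑16383, matching the ranges used in the commands above.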

Replication and Failover

Each master node can have one or more slaves that replicate its data via the standard Redis replication mechanism. If a master fails, one of its slaves is promoted automatically. By default (cluster-require-full-coverage yes), the cluster stops accepting queries as soon as any hash slot is left uncovered; setting it to no lets the cluster keep serving the slots that are still covered.

Failure Detection

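Nodes exchange status through gossip heartbeats. A node that gets no reply from a peer within cluster-node-timeout flags that peer as PFAIL (possible failure). When a majority of master nodes report the peer as PFAIL or FAIL within a limited time window, the cluster marks it as FAIL, and its slaves can start a failover.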
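The promotion from PFAIL to FAIL can be sketched as a quorum check over recent failure reports. This is a simplified model, not the Redis source: the class and method names are hypothetical, times are plain seconds passed in by the caller, and the report validity window is assumed to be twice the node timeout.

```python
from dataclasses import dataclass, field

# Assumption: failure reports older than 2 * node timeout are ignored.
FAIL_REPORT_VALIDITY_MULT = 2


@dataclass
class PeerFailureReports:
    """Failure reports that masters have gossiped about one suspect peer."""
    reports: dict[str, float] = field(default_factory=dict)  # master id -> report time

    def add_report(self, master_id: str, now: float) -> None:
        # A newer report from the same master replaces the older one.
        self.reports[master_id] = now

    def is_fail(self, total_masters: int, node_timeout: float, now: float) -> bool:
        """True when a majority of masters has a still-valid report."""
        window = node_timeout * FAIL_REPORT_VALIDITY_MULT
        fresh = sum(1 for t in self.reports.values() if now - t <= window)
        return fresh >= total_masters // 2 + 1


if __name__ == "__main__":
    peer = PeerFailureReports()
    peer.add_report("master-a", now=0.0)
    peer.add_report("master-b", now=10.0)
    print(peer.is_fail(total_masters=3, node_timeout=15.0, now=12.0))
```

Because reports expire, a peer that was only briefly unreachable from a few masters does not accumulate enough stale reports to be marked FAIL later.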

Failover Process

Select a slave of the failed master as the new master.

Reassign the failed master’s slots to the new master.

Broadcast a PONG message to inform the cluster of the new master.

The new master begins handling requests for those slots.

Leader Election

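During failover, a slave of the failed master increments its current epoch and requests votes from the masters with a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message. Each master votes at most once per epoch, and the slave is promoted once it collects votes from a majority of masters (more than N/2, where N is the total number of masters).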
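The voting rule can be illustrated with a small model. This is a sketch under simplifying assumptions: the Master class and run_election function are hypothetical, network messages are replaced by direct method calls, and only the one-vote-per-epoch and majority rules are modeled.

```python
from dataclasses import dataclass


@dataclass
class Master:
    name: str
    last_vote_epoch: int = 0

    def request_vote(self, candidate_epoch: int) -> bool:
        """Grant at most one vote per epoch."""
        if candidate_epoch <= self.last_vote_epoch:
            return False
        self.last_vote_epoch = candidate_epoch
        return True


def run_election(reachable: list[Master], total_masters: int, epoch: int) -> bool:
    """A candidate wins with votes from a majority of ALL masters,
    counting the failed one in the total."""
    votes = sum(1 for m in reachable if m.request_vote(epoch))
    return votes >= total_masters // 2 + 1


if __name__ == "__main__":
    # Three masters in total; one has failed, so two can still vote.
    voters = [Master("m1"), Master("m2")]
    print(run_election(voters, total_masters=3, epoch=5))  # first candidate
    print(run_election(voters, total_masters=3, epoch=5))  # rival, same epoch
```

The second call fails because both masters have already voted in epoch 5, which is how the epoch prevents two slaves from winning the same election.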

Client Slot Resolution and Redirection

Clients compute the slot locally, then use the cached slot‑to‑node map received from any node. If a request reaches a node that does not own the slot, the node replies with either a MOVED or ASK error.

MOVED error example:

GET mykey
(error) MOVED 16330 172.17.18.2:6379

The client updates its cache and retries the command on the indicated node.

ASK error example (partial migration):

GET mykey
(error) ASK 16330 172.17.18.2:6379

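The client must send an ASKING command to the target node before issuing the actual operation. The redirect applies to that one request only, and the cache is not updated, because the slot still belongs to the source node until the migration completes.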
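The difference between the two redirects can be sketched from the client's side. This is a minimal illustration, not the behavior of any particular client library: parse_redirect and plan_retry are hypothetical names, and the slot cache is a plain dictionary from slot number to (host, port).

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host>:<port>' or 'ASK <slot> <host>:<port>'."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


def plan_retry(error: str, slot_cache: dict[int, tuple[str, int]]):
    """Return (target address, commands to send before the retry)."""
    redirect = parse_redirect(error)
    if redirect is None:
        raise ValueError(f"not a redirect: {error!r}")
    target = (redirect.host, redirect.port)
    if redirect.kind == "MOVED":
        # Ownership changed permanently: remember the new owner.
        slot_cache[redirect.slot] = target
        return target, []
    # ASK: one-time redirect during migration, cache left untouched.
    return target, ["ASKING"]


if __name__ == "__main__":
    cache: dict[int, tuple[str, int]] = {}
    print(plan_retry("MOVED 16330 172.17.18.2:6379", cache), cache)
    print(plan_retry("ASK 16330 172.17.18.2:6379", {}))
```

Real clients typically refresh the whole slot map after a MOVED rather than patching a single entry, but the single-entry update is enough to show the contrast with ASK.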

Cluster Size Limits

Officially, Redis Cluster is designed to scale to about 1,000 nodes. The primary limitation is the bandwidth consumed by gossip heartbeat traffic. Each node sends an estimated ~12 KB per second of PING/PONG messages, each of which includes the 2 KB slot bitmap (16,384 bits), and this traffic grows with node count.

Communication Overhead

Every second, each node sends a PING to a peer chosen from a small random sample and monitors PONG responses; it also pings any peer it has not exchanged a heartbeat with for more than half of cluster-node-timeout. The cluster-node-timeout setting (default 15 s) controls how long a node waits for a reply before flagging a peer as PFAIL. Raising it (e.g., to 20–30 s) reduces heartbeat traffic but delays failure detection.
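The effect of the timeout on heartbeat volume can be estimated with simple arithmetic. The sketch below assumes only the rule that every peer is pinged at least once per half timeout; it ignores the additional once-per-second random pings and the PONG replies, so it is a rough lower bound. The function names are illustrative.

```python
def min_pings_per_node(node_count: int, node_timeout_s: float) -> float:
    """Pings per second one node must send so that every other node
    is pinged at least once every node_timeout / 2 seconds."""
    return (node_count - 1) / (node_timeout_s / 2)


def min_pings_cluster_wide(node_count: int, node_timeout_s: float) -> float:
    """The same lower bound summed over all nodes."""
    return node_count * min_pings_per_node(node_count, node_timeout_s)


if __name__ == "__main__":
    for timeout in (15, 30):
        per_node = min_pings_per_node(1000, timeout)
        total = min_pings_cluster_wide(1000, timeout)
        print(f"timeout={timeout}s per_node={per_node:.1f}/s cluster={total:.0f}/s")
```

Doubling the timeout halves this lower bound, which is the trade-off described above: less heartbeat traffic in exchange for slower failure detection.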

Conclusion

Redis Cluster provides sharding, replication, and automatic failover, making it suitable for million‑scale workloads, but its scalability is bounded by inter‑node communication overhead. Proper configuration of timeouts and understanding of slot mapping are essential for reliable operation.
