Deep Dive into Redis Cluster: Architecture, Sharding, Replication, and Failover
This article provides a comprehensive analysis of Redis Cluster, covering node and slot assignment, command execution, resharding, redirection, failover, gossip messaging, and communication overhead, while explaining why clustering is needed, how it works, and how to deploy and manage it effectively.
Why Use Redis Cluster
When a single Redis instance cannot handle large data volumes or high traffic, clustering solves storage bottlenecks, enables horizontal scaling, and provides automatic failover.
What Is a Redis Cluster
A Redis Cluster is a distributed database that shards data into 16,384 hash slots. Each slot is owned by exactly one master node, and that master's slaves hold copies of it. Nodes exchange state via the Gossip protocol, so every node eventually learns the full slot‑to‑node mapping.
Cluster Installation
To create a working cluster, connect independent nodes using the CLUSTER MEET <ip> <port> command. This handshake adds the target node to the cluster.
CLUSTER MEET 192.168.1.10 6379
Implementation Principles
Data Sharding
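Each key is hashed with CRC16, and the 16‑bit result is taken modulo 16,384 to determine its slot. If the key contains a hash tag (a non‑empty substring between the first { and the next }), only that substring is hashed, so keys that share a tag land in the same slot.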
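The slot calculation can be reproduced in a few lines. The following is a minimal sketch, assuming the CRC16 variant is CRC16-CCITT (XMODEM: polynomial 0x1021, initial value 0); the function names are hypothetical and not part of any Redis client library.

```python
SLOT_COUNT = 16384


def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hashed_part(key: bytes) -> bytes:
    """Return the hash tag if the key has a non-empty one, else the whole key."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # at least one byte between the braces
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Hypothetical helper: slot number (0..16383) for a key."""
    return crc16_xmodem(hashed_part(key.encode())) % SLOT_COUNT


if __name__ == "__main__":
    print(key_slot("foo"))  # 12182
    # Both keys hash only "user1000", so they share a slot.
    print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

On a live cluster, the result can be checked against the CLUSTER KEYSLOT command.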
Slot‑to‑Node Mapping
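When a cluster is created with redis-cli --cluster create, the 16,384 slots are distributed evenly across the master nodes automatically. Administrators can also assign slots manually with CLUSTER ADDSLOTS, which takes individual slot numbers, or with CLUSTER ADDSLOTSRANGE (Redis 7.0 and later), which takes start and end slots.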
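To make "evenly" concrete, here is a small sketch of one way to split the slot space into contiguous ranges. The function split_slots is hypothetical, and the rounding rule is an assumption chosen for illustration; it is not claimed to be the exact algorithm redis-cli uses, although for three masters it yields the same ranges as the manual example in this section.

```python
SLOT_COUNT = 16384


def split_slots(master_count: int) -> list[tuple[int, int]]:
    """Split slots 0..16383 into contiguous (first, last) ranges, one per master."""
    if not 1 <= master_count <= SLOT_COUNT:
        raise ValueError("master_count must be between 1 and 16384")

    def boundary(i: int) -> int:
        # i * SLOT_COUNT / master_count, rounded to the nearest integer,
        # computed with integers only to avoid floating-point drift.
        return (2 * i * SLOT_COUNT + master_count) // (2 * master_count)

    return [(boundary(i), boundary(i + 1) - 1) for i in range(master_count)]


if __name__ == "__main__":
    for first, last in split_slots(3):
        print(first, last)
```

For three masters this prints 0 5460, 5461 10922, and 10923 16383.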
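# CLUSTER ADDSLOTSRANGE needs Redis 7.0+; CLUSTER ADDSLOTS accepts individual slot numbers, not ranges
redis-cli -h 172.16.19.1 -p 6379 cluster addslotsrange 0 5460
redis-cli -h 172.16.19.2 -p 6379 cluster addslotsrange 5461 10922
redis-cli -h 172.16.19.3 -p 6379 cluster addslotsrange 10923 16383
Replication and Failover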
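Each master node can have one or more slaves that replicate its data. If a master fails, one of its slaves is promoted to master. By default (cluster-require-full-coverage yes), the cluster stops accepting queries as soon as any slot is left uncovered; setting the option to no keeps the cluster partially available, so slots that are still covered continue to be served.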
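The effect of cluster-require-full-coverage can be modelled with a short sketch. It assumes the stock Redis semantics, where yes (the default) makes the whole cluster refuse queries when any slot has no owner, and no keeps the covered slots available. The function can_serve is hypothetical and only models the decision; it does not talk to Redis.

```python
SLOT_COUNT = 16384


def can_serve(slot: int, covered_slots: set[int],
              require_full_coverage: bool = True) -> bool:
    """Hypothetical model: would a query for `slot` be served?"""
    if require_full_coverage and len(covered_slots) < SLOT_COUNT:
        # At least one slot has no owner, so the whole cluster is down.
        return False
    return slot in covered_slots


if __name__ == "__main__":
    # The master owning slots 0..5460 is gone and has no slave to promote.
    covered = set(range(5461, SLOT_COUNT))
    print(can_serve(6000, covered))                               # False
    print(can_serve(6000, covered, require_full_coverage=False))  # True
    print(can_serve(100, covered, require_full_coverage=False))   # False
```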
Failure Detection
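Nodes use the Gossip protocol to exchange what they know about each other's status. A node that gets no reply from a peer within cluster-node-timeout flags that peer locally as PFAIL (possible failure). Once a majority of masters report the peer as PFAIL or FAIL, it is marked FAIL, and if it was a master, a failover begins.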
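The majority rule can be sketched as a small bookkeeping class. This is an illustration, not Redis' implementation: the class FailureReports is hypothetical, and the choice to let a report expire after twice the node timeout is an assumption modelled on Redis' behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class FailureReports:
    """Hypothetical tracker of PFAIL reports about one suspect node."""
    master_count: int      # number of masters in the cluster
    node_timeout_ms: int   # value of cluster-node-timeout
    reports: dict[str, int] = field(default_factory=dict)  # reporter -> time (ms)

    def add_report(self, reporter: str, now_ms: int) -> None:
        """Record (or refresh) a master's report that the suspect is unreachable."""
        self.reports[reporter] = now_ms

    def is_fail(self, now_ms: int) -> bool:
        """True when a majority of masters hold a still-valid report."""
        validity_ms = 2 * self.node_timeout_ms  # assumed validity window
        fresh = [t for t in self.reports.values() if now_ms - t <= validity_ms]
        return len(fresh) >= self.master_count // 2 + 1


if __name__ == "__main__":
    tracker = FailureReports(master_count=5, node_timeout_ms=15000)
    tracker.add_report("master-a", now_ms=0)
    tracker.add_report("master-b", now_ms=1000)
    print(tracker.is_fail(now_ms=2000))   # False: 2 of 5 is not a majority
    tracker.add_report("master-c", now_ms=2000)
    print(tracker.is_fail(now_ms=2000))   # True: 3 of 5
```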
Failover Process
A slave of the failed master is selected as the new master.
The new master claims the slots previously owned by the failed node.
It broadcasts a PONG message to inform the rest of the cluster.
Clients start sending commands to the new master.
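The slot takeover in the steps above amounts to rewriting the owner of every slot the failed master held. A minimal sketch, assuming the slot table is a plain dictionary from slot number to node name (the function promote_replica and the node names are hypothetical):

```python
def promote_replica(slot_owner: dict[int, str],
                    failed_master: str, new_master: str) -> int:
    """Reassign every slot of the failed master; return how many slots moved."""
    moved = 0
    for slot, owner in slot_owner.items():
        if owner == failed_master:
            slot_owner[slot] = new_master  # values change, keys do not
            moved += 1
    return moved


if __name__ == "__main__":
    table = {slot: ("A" if slot <= 5460 else "B") for slot in range(16384)}
    print(promote_replica(table, failed_master="A", new_master="A-slave-1"))
    print(table[0], table[5461])
```

In a real cluster, each node applies the same change to its own table when it receives the new master's PONG.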
Leader Election
The election follows a Raft‑like protocol: the slave increments the cluster's current epoch and requests votes by broadcasting a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message. Each master votes at most once per epoch, and the slave that collects votes from a majority of masters is promoted to master.
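The vote counting can be sketched as follows. This is a simplified model under stated assumptions: each master remembers the last epoch in which it voted and grants at most one vote per epoch, and the candidate needs a majority of all masters, reachable or not. The function run_election and its parameters are hypothetical.

```python
def run_election(current_epoch: int,
                 reachable_masters: dict[str, int],
                 master_count: int) -> tuple[bool, int]:
    """Ask reachable masters for votes; return (won, epoch used for the request).

    reachable_masters maps a master's name to the last epoch it voted in.
    """
    request_epoch = current_epoch + 1  # the candidate bumps the epoch first
    votes = 0
    for name, last_vote_epoch in reachable_masters.items():
        if last_vote_epoch < request_epoch:  # has not voted in this epoch yet
            reachable_masters[name] = request_epoch
            votes += 1
    return votes >= master_count // 2 + 1, request_epoch


if __name__ == "__main__":
    # Three masters in total; one has failed, two are reachable.
    masters = {"m1": 3, "m2": 3}
    print(run_election(3, masters, master_count=3))  # (True, 4)
    # A second slave asking in the same epoch finds the votes already spent.
    print(run_election(3, masters, master_count=3))  # (False, 4)
```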
Client Slot Location
Clients compute the slot locally (CRC16 + modulo) and cache the slot‑to‑node map received from any node. When a request hits the wrong node, the server returns a redirection error.
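A client-side cache of this kind can be as simple as an array indexed by slot number. The sketch below assumes the slot map has already been fetched (for example from a CLUSTER SLOTS reply) and flattened into (first, last, address) tuples; that tuple shape and the class SlotCache are hypothetical.

```python
from typing import Optional

SLOT_COUNT = 16384


class SlotCache:
    """Hypothetical client-side slot-to-node cache."""

    def __init__(self) -> None:
        self._owner: list[Optional[str]] = [None] * SLOT_COUNT

    def load(self, ranges: list[tuple[int, int, str]]) -> None:
        """Fill the cache from (first_slot, last_slot, 'host:port') tuples."""
        for first, last, address in ranges:
            for slot in range(first, last + 1):
                self._owner[slot] = address

    def node_for(self, slot: int) -> Optional[str]:
        """Address believed to own the slot, or None if unknown."""
        return self._owner[slot]


if __name__ == "__main__":
    cache = SlotCache()
    cache.load([
        (0, 5460, "172.16.19.1:6379"),
        (5461, 10922, "172.16.19.2:6379"),
        (10923, 16383, "172.16.19.3:6379"),
    ])
    print(cache.node_for(12182))  # 172.16.19.3:6379
```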
MOVED Error
If the target slot belongs to another node, the server replies with a MOVED error that carries the slot number and the address of the owning node, prompting the client to update its cache and retry the command on the correct node.
GET mykey
(error) MOVED 16330 172.17.18.2:6379
ASK Error
During a live slot migration, the source node may return ASK, indicating that the client should send this one request to the target node, preceded by an ASKING command. Because the slot has not finished moving, the client does not update its slot cache.
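The difference between the two redirections shows up clearly in client code. The sketch below parses the error text and decides what to send next; the function names and the dictionary-based cache are hypothetical, and no network I/O is performed.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str      # "MOVED" or "ASK"
    slot: int
    address: str   # "host:port"


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host:port>' or 'ASK <slot> <host:port>'."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    return Redirect(parts[0], int(parts[1]), parts[2])


def handle_redirect(error: str, slot_cache: dict[int, str],
                    command: str) -> tuple[str, list[str]]:
    """Return (address to contact, commands to send there)."""
    redirect = parse_redirect(error)
    if redirect is None:
        raise ValueError(f"not a redirection error: {error!r}")
    if redirect.kind == "MOVED":
        # Permanent: remember the new owner, then retry the command there.
        slot_cache[redirect.slot] = redirect.address
        return redirect.address, [command]
    # ASK is a one-off: leave the cache alone and prefix the retry with ASKING.
    return redirect.address, ["ASKING", command]


if __name__ == "__main__":
    cache: dict[int, str] = {}
    print(handle_redirect("MOVED 16330 172.17.18.2:6379", cache, "GET mykey"))
    print(handle_redirect("ASK 16330 172.17.18.2:6379", {}, "GET mykey"))
```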
GET mykey
(error) ASK 16330 172.17.18.2:6379
Cluster Size Limits
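Officially, Redis Cluster is designed to scale up to 1,000 nodes. The main limitation is the communication overhead of the gossip protocol: every PING/PONG carries the sender's 16,384‑bit slot bitmap (2 KB) plus gossip entries about other nodes (104 bytes each), so in a 1,000‑node cluster, where a message describes roughly a tenth of the nodes, each message is about 12 KB.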
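The 12 KB figure can be checked with a little arithmetic. The sketch below sizes one gossip entry from the field widths of the clusterMsgDataGossip struct shown in this article and adds the slot bitmap. The rule of one entry per ten nodes (with a floor of three) is an assumption modelled on Redis' behaviour, and the remaining header fields are ignored.

```python
import struct

# Field widths of one clusterMsgDataGossip entry, packed without padding:
# nodename[40], ping_sent, pong_received, ip[46], port, cport, flags, notused1
GOSSIP_ENTRY_FORMAT = "!40sII46sHHHI"
GOSSIP_ENTRY_BYTES = struct.calcsize(GOSSIP_ENTRY_FORMAT)  # 104
SLOT_BITMAP_BYTES = 16384 // 8                             # 2048


def gossip_entries(node_count: int) -> int:
    """Assumed rule: describe about a tenth of the cluster, at least 3 nodes."""
    return max(3, node_count // 10)


def ping_payload_bytes(node_count: int) -> int:
    """Approximate PING/PONG size: slot bitmap plus gossip entries."""
    return SLOT_BITMAP_BYTES + gossip_entries(node_count) * GOSSIP_ENTRY_BYTES


if __name__ == "__main__":
    print(GOSSIP_ENTRY_BYTES)        # 104
    print(ping_payload_bytes(1000))  # 12448, i.e. about 12 KB
```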
Gossip Message Structure
typedef struct {
    char nodename[CLUSTER_NAMELEN];  // 40 bytes
    uint32_t ping_sent;              // 4 bytes
    uint32_t pong_received;          // 4 bytes
    char ip[NET_IP_STR_LEN];         // 46 bytes
    uint16_t port;                   // 2 bytes
    uint16_t cport;                  // 2 bytes
    uint16_t flags;                  // 2 bytes
    uint32_t notused1;               // 4 bytes
} clusterMsgDataGossip;
Instance Communication Frequency
Once per second, each instance samples 5 random peers and sends a PING to the one from which it has gone longest without a PONG. In addition, it checks every 100 ms for peers whose last PONG is older than cluster-node-timeout/2 and pings them immediately. Increasing cluster-node-timeout reduces this traffic but delays fault detection.
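The selection logic can be sketched in a few lines. This is a simplified model, assuming a node samples a handful of random peers, pings the one it has heard from least recently, and additionally pings every peer whose last PONG is older than half the node timeout. The function pick_ping_targets and its parameters are hypothetical.

```python
import random


def pick_ping_targets(last_pong_ms: dict[str, int], now_ms: int,
                      node_timeout_ms: int, rng: random.Random,
                      sample_size: int = 5) -> set[str]:
    """Return the peers that should receive a PING in this round."""
    targets: set[str] = set()
    peers = list(last_pong_ms)
    if peers:
        # Random sample, then the peer with the oldest PONG among the sample.
        sample = rng.sample(peers, min(sample_size, len(peers)))
        targets.add(min(sample, key=lambda peer: last_pong_ms[peer]))
    for peer, pong_time in last_pong_ms.items():
        # Peers silent for more than half the timeout are pinged right away.
        if now_ms - pong_time > node_timeout_ms / 2:
            targets.add(peer)
    return targets


if __name__ == "__main__":
    pongs = {"a": 9000, "b": 9500, "c": 1000}
    print(pick_ping_targets(pongs, now_ms=10000, node_timeout_ms=15000,
                            rng=random.Random(0)))  # {'c'}
```

With a larger timeout, fewer peers cross the half-timeout threshold, which is why raising cluster-node-timeout lowers traffic at the cost of slower detection.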
Overall, the article walks through the full lifecycle of a Redis Cluster—from motivation and architecture to deployment, slot management, replication, failover, and performance considerations.
Sohu Tech Products