
How to Build and Understand a Redis Cluster: Setup, Mechanics, and Failover

This guide walks through installing a Redis cluster with three masters and three slaves using local ports, explains slot allocation, key hashing, gossip communication, failover, node addition, resharding, and best practices for high availability, while providing practical commands and configuration examples.

Ops Development Stories

Cluster Environment Setup

Redis Cluster requires at least three master nodes. In this example we create three masters and three slaves using local ports (7000‑7005). Running every node on a single host is for experimentation only and should not be used in production, since losing that host takes down the whole cluster.

Define the ports for the nodes: 7000-7005 and copy redis.conf to a separate file for each port.

Configuration files:

IP: 127.0.0.1
Port: 7000‑7005
Config: 7000/redis-7000.conf, 7001/redis-7001.conf, …, 7005/redis-7005.conf

Edit each redis.conf to enable clustering and set the required options (e.g., requirepass, masterauth if a password is needed).

daemonize yes
# port must match the configuration above
port 7000
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 5000
appendonly yes
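
Because the six files differ only in the port number, they can be generated instead of edited by hand. The following is a minimal Python sketch, assuming the directory layout listed earlier (7000/redis-7000.conf and so on). The helper names render_config and write_configs are illustrative and not part of Redis, and the generated file contains only the cluster settings shown above, so any other options from your copied redis.conf would need to be appended.

```python
from pathlib import Path

# Same settings as the snippet above; only the port varies per node.
TEMPLATE = """\
daemonize yes
port {port}
cluster-enabled yes
cluster-config-file nodes-{port}.conf
cluster-node-timeout 5000
appendonly yes
"""


def render_config(port: int) -> str:
    """Return the cluster settings with the given port filled in."""
    return TEMPLATE.format(port=port)


def write_configs(base_dir: str, ports=range(7000, 7006)) -> list:
    """Write <base_dir>/<port>/redis-<port>.conf for each port (hypothetical helper)."""
    written = []
    for port in ports:
        node_dir = Path(base_dir) / str(port)
        node_dir.mkdir(parents=True, exist_ok=True)
        conf_path = node_dir / f"redis-{port}.conf"
        conf_path.write_text(render_config(port))
        written.append(conf_path)
    return written
```

Calling write_configs(".") from the working directory would create the six per-port directories and files.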

Start all nodes:

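# start all services 7000‑7005 (run from the directory that contains the per-port folders)
for port in 7000 7001 7002 7003 7004 7005; do
  (cd "$port" && redis-server "./redis-$port.conf")
done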

Initialize the cluster:

redis-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 \
127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 \
--cluster-replicas 1

Query cluster status:

redis-cli -c -h 127.0.0.1 -p 7000
cluster info

Other creation methods are documented in the Redis manual. The Redis source tree also ships a helper script for test clusters under utils/create-cluster.

Cluster Principles

Slot Assignment Mechanism

Redis Cluster divides the key space into 16,384 slots. Each node is responsible for a subset of slots. Clients receive the slot map from the cluster and cache it locally, allowing direct routing of commands to the correct node.
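
To make the cached slot map concrete, here is a minimal Python sketch of the client-side lookup. The slot ranges and addresses are assumptions: they follow the even split a three-master cluster usually ends up with and reuse the local example ports. A real client would build this table from the reply to CLUSTER SLOTS rather than hard-coding it.

```python
import bisect

# Assumed layout for illustration only: (first_slot, last_slot, node address).
SLOT_RANGES = [
    (0, 5460, "127.0.0.1:7000"),
    (5461, 10922, "127.0.0.1:7001"),
    (10923, 16383, "127.0.0.1:7002"),
]
# Sorted list of range starts, used for binary search.
RANGE_STARTS = [first for first, _last, _node in SLOT_RANGES]


def node_for_slot(slot: int) -> str:
    """Return the address of the node that owns the given slot in the cached map."""
    if not 0 <= slot <= 16383:
        raise ValueError("slot must be between 0 and 16383")
    # Find the last range whose first slot is <= slot.
    index = bisect.bisect_right(RANGE_STARTS, slot) - 1
    _first, _last, node = SLOT_RANGES[index]
    return node
```

With a table like this, the client can pick the target node locally and only falls back to redirection when its cache is stale.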

Slot Location Algorithm

The key is hashed with CRC16, and the result is masked with 0x3FFF to obtain the slot number. The implementation resides in src/cluster.c (function keyHashSlot).

crc16(key,keylen) & 0x3FFF
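
The same calculation can be reproduced outside Redis. The sketch below is a plain Python version, assuming the CRC16 variant is the XMODEM one (polynomial 0x1021, initial value 0) described in the Redis Cluster specification. The function names are illustrative, and hash tags are deliberately ignored here, so the whole key is hashed.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 with polynomial 0x1021 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to one of the 16,384 slots (hash tags not handled)."""
    return crc16_xmodem(key) & 0x3FFF
```

Masking with 0x3FFF keeps the low 14 bits, which is the same as taking the CRC modulo 16,384. For instance, key_slot(b"foo") evaluates to 12182.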

To find the slot of a key:

# query the slot of a key
127.0.0.1:7000> cluster keyslot mykey
(integer) 14687
# list all slot ranges
127.0.0.1:7000> cluster slots
…

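Redis nodes do not proxy requests. When a key is accessed on the wrong node, the node replies with a MOVED or ASK redirection, and a cluster-aware client (for example, redis-cli started with -c) reissues the command against the node named in the reply.

Redirection (MOVED and ASK)

If a node receives a command for a key whose slot it does not own, it replies with a MOVED error containing the slot number and the address of the owning node. The client follows the redirect and updates its slot cache. ASK is the variant used while a slot is being migrated: the client sends ASKING and retries that one command on the target node, without updating its slot cache.

In plain terms: if the key belongs to another node, the client is told which node to resend the request to. For example, from redis-cli -c: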
set abc sdl
set sbc sdl
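
A client has to recognize the redirection reply before it can follow it. The sketch below parses the error text in Python, assuming the reply has the form "MOVED <slot> <host>:<port>" or "ASK <slot> <host>:<port>" as given in the Redis Cluster specification. The Redirect type and parse_redirect function are illustrative names, not part of any client library.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a redirection error; return None for any other error text."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    # Split on the last colon so the host part may itself contain colons.
    host, _sep, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))
```

On MOVED a client would typically update its slot cache and retry. On ASK it would retry once against the named node after sending ASKING, leaving the cache unchanged.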

Cluster Communication Mechanism

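Nodes exchange cluster metadata over a gossip protocol, using messages such as PING, PONG, MEET, and FAIL. In general, a distributed system can maintain its metadata in one of two ways: centralized (e.g., in ZooKeeper) or via gossip. Redis Cluster uses gossip.

Centralized

Metadata lives in a single store, so updates are visible immediately, but every metadata read and write goes through that store, which can become a bottleneck.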

Gossip

Nodes periodically send PING messages containing their state and metadata. MEET adds a new node to the cluster. FAIL notifies others that a node is down.

The gossip approach distributes load but introduces a small delay in metadata propagation.

Gossip Port

Each node uses port + 10000 for gossip communication (e.g., node 7001 uses 17001).

Cluster Election Principle

When a master fails, its slaves attempt a failover. The process involves broadcasting FAILOVER_AUTH_REQUEST, collecting acknowledgments from a majority of masters, and promoting a slave to master.

Slave detects master FAIL.

Slave increments its currentEpoch and broadcasts FAILOVER_AUTH_REQUEST.

Masters that have not voted yet respond with FAILOVER_AUTH_ACK.

Slave collects ACKs; if it receives a majority, it becomes the new master.

New master broadcasts a PONG to inform the cluster.

The election requires at least three masters. A slave needs votes from more than half of all masters, so with only two masters the failure of one leaves a single voter, which is not a majority.
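
The majority arithmetic behind this rule can be written down in a few lines. This is a simplified Python sketch with illustrative function names. It only counts votes and ignores epochs, timeouts, and replica ranking.

```python
def votes_needed(total_masters: int) -> int:
    """Smallest number of votes that is more than half of all masters."""
    return total_masters // 2 + 1


def can_elect(total_masters: int, failed_masters: int) -> bool:
    """True if the masters still reachable are enough to form a majority."""
    reachable = total_masters - failed_masters
    return reachable >= votes_needed(total_masters)
```

With three masters and one failure, two voters remain and two votes are needed, so the election can succeed. With two masters and one failure, one voter remains but two votes are needed.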

Split‑Brain and Data Loss

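If a network partition leaves two masters accepting writes for the same slots, the writes accepted on the losing side are discarded when the partition heals. Setting min-replicas-to-write 1 mitigates the risk, but the master then rejects writes whenever it has no healthy replica, which reduces availability.

# stop accepting writes when fewer than this many replicas are connected and in sync (see min-replicas-max-lag)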
min-replicas-to-write 1

Full Coverage

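By default (cluster-require-full-coverage yes), the whole cluster stops accepting commands as soon as any slot is uncovered. When the option is set to no, the cluster keeps serving the slots that are still covered, even if a master responsible for other slots goes down without a replica.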

Batch Operations

Commands like MSET and MGET only work if all keys map to the same slot. Prefix keys with a hash tag (e.g., {user1}) to force them into the same slot.

Example: mset {user1}:1:name zhangsan {user1}:1:age 18
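
The hash tag rule can be sketched in Python as follows. It assumes the behavior described in the Redis Cluster specification: only the text between the first "{" and the first "}" after it is hashed, and only if that text is non-empty. The function names are illustrative, and binascii.crc_hqx with an initial value of 0 is used here as the CRC16 implementation.

```python
import binascii


def hashed_part(key: bytes) -> bytes:
    """Return the part of the key that is actually hashed."""
    start = key.find(b"{")
    if start == -1:
        return key  # no opening brace: hash the whole key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key  # no closing brace, or empty tag "{}": hash the whole key
    return key[start + 1:end]


def key_slot(key: bytes) -> int:
    """Slot number for a key, taking hash tags into account."""
    return binascii.crc_hqx(hashed_part(key), 0) & 0x3FFF
```

Both keys in the MSET example hash only the text user1, so key_slot returns the same slot for them and the command can be served by a single node.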

Sentinel vs. Cluster Leader Election

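The two mechanisms are similar: both use epoch-numbered majority votes in the style of Raft. The difference is who is elected. In Sentinel, once a master is marked down, the Sentinel processes elect a leader Sentinel, and that leader carries out the failover. In Cluster, the masters vote directly for the slave that will be promoted.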

Cluster Fault Tolerance

Failure Detection

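Nodes periodically send PING messages. If a node does not reply within cluster-node-timeout, the sender marks it PFAIL (possible failure), which is only that sender's local view. When a majority of masters report the node as PFAIL, it is marked FAIL and a FAIL message is broadcast so that every node treats it as down.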
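
The two-stage decision can be illustrated with a deliberately simplified Python sketch. The function names and the millisecond parameters are assumptions made for illustration. The real implementation also ages out old failure reports, which is not modelled here.

```python
def is_pfail(ms_since_last_pong: int, node_timeout_ms: int = 5000) -> bool:
    """Local view: suspect a node once it has been silent longer than the timeout."""
    return ms_since_last_pong > node_timeout_ms


def should_mark_fail(pfail_reports: int, total_masters: int) -> bool:
    """Cluster view: promote PFAIL to FAIL once a majority of masters agree."""
    return pfail_reports >= total_masters // 2 + 1
```

The default of 5000 matches the cluster-node-timeout value used in the configuration earlier in this guide.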

Failover Process

A slave of the failed master is selected.

The selected slave runs SLAVEOF NO ONE to become a master.

The new master takes over the slots of the failed node.

The new master broadcasts a PONG to inform the cluster.

Clients start sending commands to the new master.

Adding Nodes and Resharding

To expand the cluster, start new nodes and add them with redis-cli --cluster add-node. Then use redis-cli --cluster reshard to move slots.

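# start new nodes
redis-server redis-7006.conf
redis-server redis-7007.conf
# add node 7006 as a master (new node first, then any existing node)
redis-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7001
# add node 7007; it joins as an empty master until it is told to replicate
redis-cli --cluster add-node 127.0.0.1:7007 127.0.0.1:7001
# reshard slots to the new master
redis-cli --cluster reshard 127.0.0.1:7001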

After adding a slave, set its master with CLUSTER REPLICATE:

# on the slave (7007): the argument is the node ID of the master it should replicate
cluster replicate 2109c2832177e8514174c6ef8fefd681076e28df

Removing Nodes

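Before removing a master, migrate its slots to other masters with redis-cli --cluster reshard, then remove the now-empty node with redis-cli --cluster del-node. A slave holds no slots and can be removed directly.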

# delete node 7007 (example)
redis-cli --cluster del-node 127.0.0.1:7007 8d935918d877a63283e1f3a1b220cdc8cb73c414

