Understanding Redis Cluster: Architecture, Data Distribution, and Fault Tolerance
This article explains why Redis Cluster, a scalable, fault-tolerant distributed Redis solution, is needed and covers its architecture, virtual slot partitioning, data distribution methods, limitations, smart client optimization, and automatic failover mechanisms, highlighting key operational considerations for high-performance deployments.
1. Redis Cluster Overview
Redis Cluster is the official Redis clustering feature.
Why implement Redis Cluster?
Redis executes commands on a single thread, so one instance can use only about one CPU core; heavy workloads that need more CPU must be spread across multiple instances.
Growing user numbers and concurrency demand higher QPS, which a single master‑replica setup may not satisfy.
When a single server’s memory cannot hold all data, data must be sharded across multiple servers.
Network traffic may exceed a server’s NIC capacity, requiring distribution.
Other needs, such as offline computation that requires an intermediate buffering layer.
Redis Cluster Drawbacks
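Performance may degrade when the number of nodes is large; for example, a client that does not know where keys live often reaches the wrong node and needs an extra, redirected round trip.
Solution: use a smart client that maps keys to slots and slots to nodes internally, sends each command directly to the owning node, and updates its mapping automatically when slot ownership changes.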
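The smart-client idea can be sketched as a local slot table: compute the key's slot on the client, look up the owner, and patch the table when a node answers with a MOVED redirection. This is a minimal sketch, not any real client library's API: the class `SmartClientRouter`, its methods, and the node addresses are hypothetical, and hash tags are ignored. It relies on Python's standard `binascii.crc_hqx`, which computes the same CRC16 (XMODEM) variant Redis uses for key hashing.

```python
from binascii import crc_hqx

SLOT_COUNT = 16384


def key_slot(key: str) -> int:
    # CRC16 (XMODEM variant) of the key, modulo 16384.
    # Hash tags ({...}) are ignored in this sketch.
    return crc_hqx(key.encode(), 0) % SLOT_COUNT


class SmartClientRouter:
    """Hypothetical smart-client router that caches the slot -> node table."""

    def __init__(self, slot_ranges):
        # slot_ranges: iterable of (first_slot, last_slot, "host:port").
        self.slot_owner = [None] * SLOT_COUNT
        for first, last, node in slot_ranges:
            for slot in range(first, last + 1):
                self.slot_owner[slot] = node

    def node_for(self, key: str) -> str:
        # Resolved locally: no round trip is spent discovering the owner.
        return self.slot_owner[key_slot(key)]

    def on_moved(self, error: str) -> str:
        # A MOVED error reads "MOVED <slot> <host:port>"; patch that entry
        # so later commands for the slot go straight to the new owner.
        _, slot, node = error.split()
        self.slot_owner[int(slot)] = node
        return node


# Hypothetical two-node layout.
router = SmartClientRouter([(0, 8191, "10.0.0.1:6379"),
                            (8192, 16383, "10.0.0.2:6379")])
print(router.node_for("foo"))  # "foo" is in slot 12182 -> 10.0.0.2:6379
router.on_moved("MOVED 12182 10.0.0.3:6379")
print(router.node_for("foo"))  # -> 10.0.0.3:6379
```

Patching a single entry keeps the sketch short; production clients usually also re-fetch the full table (for example with the CLUSTER SLOTS command) after a MOVED, since one slot moving often means many have.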
Cluster Limitations
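Batch key operations are limited: multi-key commands such as MGET and MSET work only when all keys map to the same slot.
Transactions and Lua scripts are limited: all keys they touch must reside on the same node (in practice, the same slot), so they cannot span nodes.
A key is the smallest unit of partitioning; a single big key (for example, a very large hash or list) cannot be split across nodes.
Only one database (db0) is available in cluster mode.
Replication supports only one level; tree-shaped replication (replicas of replicas) is not supported.
Client performance may decrease because commands cannot span nodes: multi-key commands such as MGET and SINTER fail when their keys sit in different slots, and KEYS, SCAN, and FLUSHALL act only on the node that receives them.
Client maintenance becomes more complex, and connection pool usage grows because the client keeps connections to many nodes.
Even with these limits, Redis Cluster meets the capacity and performance scaling needs of many workloads.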
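Redis Cluster's documented way to satisfy the same-slot rule is hash tags: if a key contains a non-empty `{...}` section, only the text between the first `{` and the first following `}` is hashed, so keys sharing a tag land in the same slot. The sketch below applies that rule and checks whether a set of keys could be used together in one multi-key command; when they could not, Redis replies with a CROSSSLOT error. The function names `key_hash_slot` and `same_slot` are hypothetical.

```python
from binascii import crc_hqx


def key_hash_slot(key: bytes) -> int:
    """Slot for a key, honoring Redis Cluster hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag;
        # otherwise the whole key is hashed.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # CRC16 (XMODEM variant) modulo 16384.
    return crc_hqx(key, 0) % 16384


def same_slot(*keys: bytes) -> bool:
    # MGET/MSET, transactions, and scripts need every key in one slot.
    return len({key_hash_slot(k) for k in keys}) == 1


print(same_slot(b"{user:1000}:name", b"{user:1000}:email"))  # True
```

Tags trade balance for co-location: putting too many keys behind one tag concentrates them on a single node.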
Data Distribution
Why distribute data?
Full data cannot fit on a single Redis node; data is partitioned into subsets according to sharding rules.
Sequential Distribution
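Sequential (range) partitioning splits keys into contiguous ranges, for example user IDs 1 to 100 on one node and 101 to 200 on the next. It is commonly used in relational database design.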
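Range lookup can be sketched with a sorted list of upper bounds and a binary search. This is a minimal illustration under assumed data: the three nodes, their ID ranges (1 to 100, 101 to 200, 201 to 300), and the function `node_for_id` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical layout: nodes[i] stores IDs up to and including upper_bounds[i].
# IDs are assumed to start at 1.
upper_bounds = [100, 200, 300]
nodes = ["node-1", "node-2", "node-3"]


def node_for_id(user_id: int) -> str:
    # First range whose upper bound is >= user_id.
    i = bisect_left(upper_bounds, user_id)
    if i == len(nodes):
        raise KeyError(f"id {user_id} is outside every range")
    return nodes[i]


print(node_for_id(42))   # node-1
print(node_for_id(150))  # node-2
```

Because neighboring IDs stay on the same node, range scans are cheap, but a busy range (such as the newest IDs) can overload a single node.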
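Hash Distribution
Hash partitioning applies a hash function to each key and uses the result to choose where the key lives, which spreads keys evenly but does not keep related keys in order.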
Virtual Slot Partitioning
Virtual slot partitioning is the method used by Redis Cluster.
Virtual slot partitioning predefines a fixed set of virtual slots, maps each key to a slot, and assigns slots to nodes, so each slot holds a subset of the data; the number of slots is usually much larger than the number of nodes.
Redis Cluster’s virtual slot range is 0 to 16383.
Each key is hashed with CRC16 and the result modulo 16384 determines its slot.
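The slot formula can be written out directly. Redis specifies the CRC16 variant as XMODEM (polynomial 0x1021, initial value 0, no bit reflection), whose standard check value is CRC16("123456789") = 0x31C3. The sketch below implements it bit by bit; `crc16` and `key_slot` are illustrative names, and hash tags are left out for simplicity.

```python
SLOT_COUNT = 16384


def crc16(data: bytes) -> int:
    """CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8               # feed the next byte into the high bits
        for _ in range(8):
            if crc & 0x8000:           # top bit set: shift and apply polynomial
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    # Slot = CRC16(key) mod 16384 (hash tags ignored in this sketch).
    return crc16(key.encode()) % SLOT_COUNT


print(hex(crc16(b"123456789")))  # 0x31c3
print(key_slot("foo"))           # 12182
```

Redis itself uses a lookup table and `& 16383` instead of `% 16384`; both give the same slot because 16384 is a power of two.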
Steps:
Distribute the 16384 slots evenly among the nodes.
Hash each key using CRC16.
Take the hash result modulo 16384.
Send the command to a Redis node.
The node checks whether the slot falls within the range it manages.
If it does, the node stores the data and returns the result.
If not, the node replies with a MOVED redirection naming the node responsible for that slot, and the client resends the command there.
All nodes share messages so each node knows which node manages which slot range.
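Because keys map to slots rather than directly to nodes, adding or removing a node only requires moving whole slots (and the keys in them) between nodes, so data can be rebalanced without loss and without rehashing every key.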
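Both the even initial split and the limited movement on rebalance can be sketched in a few lines. This is a hypothetical planner, not how Redis or `redis-cli --cluster` chooses slots: `allocate_evenly` gives each node a contiguous range, and `add_node` moves slots to a new node one at a time from whichever node currently holds the most, until every node holds about 16384/N slots. In a real cluster the keys inside each moving slot are migrated as well.

```python
SLOT_COUNT = 16384


def allocate_evenly(nodes):
    """Return {node: [slots]} with contiguous ranges differing by at most 1."""
    base, extra = divmod(SLOT_COUNT, len(nodes))
    assignment, start = {}, 0
    for i, node in enumerate(nodes):
        size = base + (1 if i < extra else 0)
        assignment[node] = list(range(start, start + size))
        start += size
    return assignment


def add_node(assignment, new_node):
    """Move just enough slots to new_node; return the slots that moved."""
    target = SLOT_COUNT // (len(assignment) + 1)
    assignment[new_node] = []
    moved = []
    while len(assignment[new_node]) < target:
        # Donor: the existing node that currently owns the most slots.
        donor = max((n for n in assignment if n != new_node),
                    key=lambda n: len(assignment[n]))
        slot = assignment[donor].pop()
        assignment[new_node].append(slot)
        moved.append(slot)
    return moved


layout = allocate_evenly(["node-a", "node-b", "node-c"])
moved = add_node(layout, "node-d")
print(len(moved), "of", SLOT_COUNT, "slots change owner")  # 4096 of 16384
```

Only a quarter of the slots change owner when going from three to four nodes; by contrast, placing keys by hash modulo the node count would remap about three quarters of all keys for the same change.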
Characteristics of virtual slot partitioning:
The server manages the mapping of nodes, slots, and data (e.g., Redis Cluster).
Keys are scattered across nodes while staying evenly distributed.
2. Redis Cluster Architecture
1) Nodes
Redis Cluster is a distributed architecture with multiple nodes, each handling read/write operations.
Nodes communicate with each other.
2) Meet operation
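The CLUSTER MEET command introduces one node to another so they begin communicating; through gossip, each node then learns about the nodes its peers already know.
All Redis nodes are interconnected through the cluster bus, which uses a binary protocol optimized for speed and bandwidth.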
Clients connect directly to any available node; no proxy layer is required.
3) Slot allocation
The 16384 slots are evenly assigned to the master nodes; each master serves reads and writes only for the slots it owns.
Because nodes communicate, each knows the slot ranges managed by other nodes.
When a client sends a command to any node, the node hashes the key with CRC16 and takes the result modulo 16384 to find the slot. If the slot belongs to that node, it executes the command and returns the result; otherwise it redirects the client to the correct node.
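The node-side decision can be sketched as follows: compute the slot, serve it locally if owned, and otherwise answer with a redirection instead of proxying the request. `ClusterNodeSketch`, its slot-map format, and the addresses are hypothetical; the reply strings follow the shape of Redis's real `MOVED <slot> <host:port>` and `CLUSTERDOWN Hash slot not served` errors, and hash tags are ignored.

```python
from binascii import crc_hqx


def key_slot(key: bytes) -> int:
    # CRC16 (XMODEM) of the key modulo 16384; hash tags ignored here.
    return crc_hqx(key, 0) % 16384


class ClusterNodeSketch:
    """Hypothetical node that knows every node's slot ranges via gossip."""

    def __init__(self, my_addr, slot_map):
        self.my_addr = my_addr
        self.slot_map = slot_map  # list of (first_slot, last_slot, "host:port")
        self.data = {}

    def owner_of(self, slot):
        for first, last, addr in self.slot_map:
            if first <= slot <= last:
                return addr
        return None

    def handle_get(self, key: bytes):
        slot = key_slot(key)
        owner = self.owner_of(slot)
        if owner is None:
            return "CLUSTERDOWN Hash slot not served"
        if owner == self.my_addr:
            return self.data.get(key)
        # Not our slot: tell the client where to go; do not forward.
        return f"MOVED {slot} {owner}"


slot_map = [(0, 8191, "127.0.0.1:7000"), (8192, 16383, "127.0.0.1:7001")]
node = ClusterNodeSketch("127.0.0.1:7000", slot_map)
print(node.handle_get(b"foo"))  # MOVED 12182 127.0.0.1:7001
```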
4) Replication
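Each master can have slaves (replicas). The cluster handles master-slave replication, high availability, and failover automatically, and supports multiple masters that share the hash slots for distributed storage. Read/write separation is possible, but a slave serves reads only after the client sends the READONLY command on that connection.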
3. Failover
Automatic failover consists of fault detection and failure recovery. A node can be marked subjectively down (PFAIL) or objectively down (FAIL).
If a majority of masters consider a node subjectively down, it becomes objectively down.
The slaves of the failed master then start the recovery process: they hold an election, and the winner is promoted to master so the cluster stays available.
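Both decisions in this paragraph are majority tests over the masters. The sketch below shows them as plain functions; the names are hypothetical, and it leaves out what Redis also tracks, such as how long each failure report stays valid and the delay a replica waits before asking for votes.

```python
def majority(num_masters: int) -> int:
    # Smallest count that is more than half of all masters.
    return num_masters // 2 + 1


def is_objectively_down(pfail_reports: set, num_masters: int) -> bool:
    """PFAIL -> FAIL once enough masters flag the node as subjectively down.

    pfail_reports holds the IDs of masters (including the observer)
    that currently consider the node subjectively down.
    """
    return len(pfail_reports) >= majority(num_masters)


def replica_wins_election(votes: int, num_masters: int) -> bool:
    # A slave of the failed master is promoted only with majority votes.
    return votes >= majority(num_masters)


# Three masters A, B, C; C stops answering.
print(is_objectively_down({"A"}, 3))       # False: only A suspects C
print(is_objectively_down({"A", "B"}, 3))  # True: 2 of 3 is a majority
print(replica_wins_election(2, 3))         # True
```

The failed master still counts toward the total, which is why a cluster needs at least three masters for the surviving ones to form a majority.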
Node failure detection: election
Ping/Pong mode
Redis Cluster uses ping/pong messages for fault detection.
These messages convey node‑slot mapping, master‑slave status, and fault information.
Fault detection distinguishes subjective and objective down states.
All masters participate in voting: if a majority of masters cannot communicate with a given master within the node timeout (cluster-node-timeout), that master is considered failed.
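On each node, the ping/pong side of detection amounts to timestamp bookkeeping against the node timeout, whose Redis default is 15000 ms. The sketch below is simplified and hypothetical (`PingPongDetector` is not a Redis API): it flags a peer once no PONG has arrived for longer than the timeout, whereas Redis measures from an unanswered PING. A fake clock keeps the example deterministic.

```python
import time


class PingPongDetector:
    """Hypothetical per-node view of the ping/pong failure detector."""

    def __init__(self, node_timeout_ms=15000, clock=time.monotonic):
        self.node_timeout = node_timeout_ms / 1000.0  # seconds
        self.clock = clock
        self.last_pong = {}

    def record_pong(self, peer: str) -> None:
        self.last_pong[peer] = self.clock()

    def subjectively_down(self) -> set:
        # PFAIL candidates from this node's point of view only.
        now = self.clock()
        return {peer for peer, seen in self.last_pong.items()
                if now - seen > self.node_timeout}


fake_now = [0.0]
detector = PingPongDetector(clock=lambda: fake_now[0])
detector.record_pong("node-b")
detector.record_pong("node-c")
fake_now[0] = 10.0
detector.record_pong("node-c")       # node-c keeps answering
fake_now[0] = 16.0
print(detector.subjectively_down())  # {'node-b'}
```

A single node's view only yields subjective down; the majority of masters still has to agree before the master is treated as failed.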
When does the whole cluster become unavailable (cluster_state: fail)?
If any master fails and has no slave to take over, its slots are left unserved (incomplete slot mapping), and by default (cluster-require-full-coverage yes) the cluster enters the fail state.
If more than half of the masters fail, the cluster also enters fail state.
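The two conditions can be combined into a single state check. This is a simplified sketch: it assumes that a failed master with a slave is always failed over successfully, and the dictionary layout, the `master` helper, and the `require_full_coverage` parameter (mirroring the cluster-require-full-coverage setting) are hypothetical.

```python
SLOT_COUNT = 16384


def cluster_state(masters, require_full_coverage=True):
    """Return "ok" or "fail" for a list of master descriptions.

    Each master is a dict: {"slots": set, "up": bool, "has_replica": bool}.
    """
    failed = [m for m in masters if not m["up"]]
    # Condition 2: more than half of the masters have failed.
    if len(failed) > len(masters) // 2:
        return "fail"
    # Condition 1: a failed master without a slave leaves slots unserved.
    covered = set()
    for m in masters:
        if m["up"] or m["has_replica"]:
            covered |= m["slots"]
    if require_full_coverage and len(covered) < SLOT_COUNT:
        return "fail"
    return "ok"


def master(first, last, up=True, has_replica=False):
    # Hypothetical helper: a master owning slots first..last inclusive.
    return {"slots": set(range(first, last + 1)), "up": up,
            "has_replica": has_replica}


healthy = [master(0, 5460), master(5461, 10922), master(10923, 16383)]
print(cluster_state(healthy))    # ok
no_backup = [master(0, 5460), master(5461, 10922),
             master(10923, 16383, up=False)]
print(cluster_state(no_backup))  # fail: slots 10923-16383 are unserved
```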
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
MaGe Linux Operations
