Why Consistent Hashing Fails, and Why Redis, HBase, TiDB, and Ceph Dropped It
The article examines the fundamental limitations of consistent hashing (its inability to preserve data locality, support range queries, or respect physical topology) and explains why major storage systems such as Redis Cluster, TiDB, Ceph, and HBase have adopted alternative sharding strategies like hash slots, range partitioning, and CRUSH.
Consistent Hashing Basics
Consistent hashing maps a key to a point on a virtual ring (typically 2^32 positions). Nodes are also placed on the ring; a key is stored on the first node encountered clockwise. Contrast this with naive modulo sharding:

int nodeIndex = hash(key) % N; // simple modulo: changing N remaps almost every key

On the ring, scaling moves only roughly 1/N of the total keys, where N is the number of nodes: when a new node is added, only the keys that fall between the new node and its predecessor need to move.
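A minimal ring sketch in Java, assuming a TreeMap-backed ring and a placeholder hash; the node names, virtual-node count, and hash choice are illustrative, not any particular system's implementation:

// Minimal consistent-hash ring sketch
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node, int virtualNodes) {
        // Virtual nodes smooth the distribution; 100-200 per physical node is a common choice.
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    String route(String key) {
        // First node clockwise from the key's position; wrap to the ring start if none follows.
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private int hash(String s) {
        // Placeholder hash; production rings use MD5 or MurmurHash for better spread.
        return s.hashCode() & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("node-A", 150);
        r.addNode("node-B", 150);
        System.out.println(r.route("order:1234")); // the same key always routes to the same node
    }
}

Adding a third node inserts only its virtual positions, so only the keys that now fall before those positions move; everything else stays put.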
Critical Drawbacks for Large‑Scale Storage
Topology blindness: The algorithm has no knowledge of physical placement (data‑center, rack, host). Replicas may end up in the same failure domain, violating durability requirements.
Range‑query failure: Because keys are hashed, their natural order is destroyed. A range scan such as SELECT * FROM orders WHERE id BETWEEN 1000 AND 2000 must be broadcast to every node, causing high latency and network overhead.
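A toy illustration of the second point, assuming four nodes and routing by a generic string hash (the node count and key format are made up for the example):

// Hashing destroys key adjacency, so range scans must fan out
public class RangeScatter {
    public static void main(String[] args) {
        int nodes = 4; // hypothetical cluster size
        for (long id = 1000; id <= 1005; id++) {
            int node = Math.floorMod(("order:" + id).hashCode(), nodes);
            System.out.println("id=" + id + " -> node " + node);
        }
        // Adjacent ids scatter across nodes, so WHERE id BETWEEN 1000 AND 2000
        // must be broadcast to all nodes and merged by a coordinator.
    }
}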
Why Redis Cluster Uses Fixed Hash Slots
Redis maps a key to one of 16,384 slots using CRC16(key) % 16384. Slots are manually assigned to nodes, giving operators fine‑grained control over data distribution and migration.

int slot = CRC16(key) % 16384;

This design provides:

Uniform data distribution across nodes.
Deterministic migration: moving a slot moves a predictable subset of keys.
Gossip metadata is only 2 KB (16384 bits), keeping heartbeat traffic low.
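For concreteness, a sketch of the slot computation; Redis does use the CRC16‑XMODEM variant shown here, but the class is illustrative and {hash tag} handling is omitted:

// Redis Cluster slot computation sketch (CRC16-XMODEM: polynomial 0x1021, initial value 0)
import java.nio.charset.StandardCharsets;

public final class RedisSlot {
    private static final int SLOTS = 16384;

    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) & 0xFFFF
                                            : (crc << 1) & 0xFFFF;
            }
        }
        return crc;
    }

    static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOTS;
    }

    public static void main(String[] args) {
        System.out.println(slot("user:1000")); // deterministic slot in [0, 16383]
    }
}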
TiDB’s Abandonment of Consistent Hashing
TiDB is a NewSQL database that requires:
Strong ACID transactions (Raft‑based replication).
Efficient range queries.
Topology‑aware replica placement.
It therefore uses Range‑Based Sharding, whose unit is the Region. A Region is a contiguous key interval (default ~96 MB) stored on three TiKV replicas. The Placement Driver (PD) schedules Regions, enforces anti‑affinity rules, and automatically rebalances load.
// Example Region split
Region 1: [0, 1,000,000)
Region 2: [1,000,000, 2,000,000)

Range scans touch only the relevant Regions, preserving locality (see the routing sketch after this list).
PD can enforce “different racks, different data centers” for each replica.
Only ~5 % of data moves during scale‑out (vs ~30 % for pure consistent hashing).
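A minimal sketch of the resulting routing model, assuming the two illustrative Regions above (the class name and boundaries are made up; PD's real metadata maps key ranges to TiKV stores):

// Range-based routing: regions keyed by start key, looked up with floorEntry
import java.util.TreeMap;

public class RegionRouter {
    // startKey -> region name; each region covers [startKey, next startKey)
    private final TreeMap<Long, String> regions = new TreeMap<>();

    public RegionRouter() {
        regions.put(0L, "Region 1");          // [0, 1,000,000)
        regions.put(1_000_000L, "Region 2");  // [1,000,000, 2,000,000)
    }

    String route(long key) {
        return regions.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RegionRouter r = new RegionRouter();
        // A scan over [1000, 2000) stays inside one region: locality is preserved.
        System.out.println(r.route(1000) + " .. " + r.route(1999)); // Region 1 .. Region 1
    }
}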
Ceph’s CRUSH Algorithm
Ceph replaces consistent hashing with CRUSH (Controlled Replication Under Scalable Hashing). CRUSH builds a hierarchical topology tree (root → data‑center → rack → host → OSD) and evaluates placement rules that respect failure domains.
Bucket‑Tree: Represents the physical layout.
Rules Engine: Expresses policies such as “place one replica per rack”.
Deterministic pseudo‑random walk (Straw2): Guarantees the same input always maps to the same set of OSDs while weighting selections by device capacity.
Example placement for object invoice‑7788 with three replicas and rack‑level anti‑affinity:
// Walk‑through (simplified)
root → select DC A → select Rack 1 → OSD 2 // replica 1
root → reject Rack 1 → select Rack 2 → OSD 4 // replica 2
root → select DC B → Rack 3 → OSD 5 // replica 3

Full topology awareness prevents single‑point failures.
Automatic rebalancing moves only the placement groups affected by node addition or removal.
Weight‑based distribution aligns data volume with device capacity.
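To make the deterministic pseudo-random walk concrete, here is a toy straw2-style selection, not Ceph's actual code: each OSD draws ln(u)/weight from a hash of (object, OSD), and the highest draw wins, so heavier devices win proportionally more often (the mixer, names, and weights are all illustrative):

// Toy straw2-style weighted, deterministic selection
import java.util.Map;

public class Straw2Sketch {
    // 64-bit mixer (MurmurHash3 finalizer); any good mixer works for this sketch.
    static long mix(long x) {
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33; return x;
    }

    static String pick(String objectId, Map<String, Double> osdWeights) {
        String best = null;
        double bestDraw = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : osdWeights.entrySet()) {
            long h = mix(objectId.hashCode() * 31L + e.getKey().hashCode());
            double u = ((h >>> 11) + 1) / (double) (1L << 53); // uniform in (0, 1]
            double draw = Math.log(u) / e.getValue();          // straw2 idea: ln(u) / weight
            if (draw > bestDraw) { bestDraw = draw; best = e.getKey(); }
        }
        return best; // least-negative draw wins; result is stable for the same inputs
    }

    public static void main(String[] args) {
        Map<String, Double> weights = Map.of("osd.0", 1.0, "osd.1", 2.0, "osd.2", 1.0);
        System.out.println(pick("invoice-7788", weights)); // deterministic for the same inputs
    }
}

Because each draw depends only on (object, OSD, weight), adding or removing an OSD changes the winner only for the objects that OSD would now win, which is why rebalancing stays proportional.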
HBase’s Range Partitioning
HBase stores rows sorted by rowkey. Data is split into Regions (key ranges). When a Region exceeds a size threshold (default 10 GB), it automatically splits, preserving locality for scans.
// Scan example
Scan scan = new Scan().withStartRow(Bytes.toBytes("user1000"))
                      .withStopRow(Bytes.toBytes("user2000"));
ResultScanner rs = table.getScanner(scan);

Adjacent rows reside in the same Region, so a range scan involves only a few RegionServers.
HMaster monitors Region load and migrates hot Regions to balance the cluster.
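For reference, a hedged sketch of the split decision itself, mirroring only the constant-size idea (HBase's shipped policies include ConstantSizeRegionSplitPolicy and SteppingSplitPolicy; hbase.hregion.max.filesize defaults to 10 GB):

// Simplified split check; real policies also consider store-file layout and region count
public class SplitCheck {
    static final long MAX_REGION_BYTES = 10L << 30; // 10 GB default threshold

    static boolean shouldSplit(long regionBytes) {
        return regionBytes > MAX_REGION_BYTES;
    }

    public static void main(String[] args) {
        // A 12 GB region splits at its midpoint rowkey into two daughter regions.
        System.out.println(shouldSplit(12L << 30)); // true
    }
}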
Comparative Metrics (Typical Values)
Data moved on scale‑out: Consistent hashing ~30 %; Redis slots ~5 %; TiDB range sharding ~5 %; CRUSH moves only data that maps to new OSDs.
Range‑query latency: Consistent hashing requires full‑cluster scans; Redis slots can be node‑local if slots are aligned; TiDB and HBase achieve single‑node or few‑node latency; CRUSH‑based storage can also serve range queries efficiently when data is stored in objects that preserve order.
Topology awareness: None for pure consistent hashing; manual for Redis slots; PD for TiDB; CRUSH rules for Ceph; HBase relies on HDFS rack awareness plus Region placement.
Conclusion
Consistent hashing excels for pure key‑value caches where only point lookups matter. For durable storage systems that need range queries, strong consistency, and topology‑aware replica placement, it becomes a liability. Redis Cluster, TiDB, Ceph, and HBase each adopt a sharding mechanism that preserves data locality, limits migration scope, and incorporates physical topology, achieving a better balance of scalability, performance, and reliability.
Tech Freedom Circle
Crazy Maker Circle (Tech Freedom Architecture Circle): a community of technology enthusiasts, experts, and performance aficionados. Many senior engineers, architects, and hobbyists here have already achieved technical freedom, and another wave is working hard toward it.
