How Consistent Hashing Solves Cache Scaling and Reduces Data Skew
This article explains the principles of consistent hashing, compares it with simple modulo hashing, and shows how the hash ring, virtual nodes, and key mapping improve load balancing, minimize data loss during node changes, and prevent cache avalanche in distributed caching systems.
Hello everyone, I'm Su San. Recently I saw a discussion about consistent hashing in a tech group, so I decided to introduce its principles using a classic distributed cache scenario that often appears in interviews.
The Scenario
Assume we have three cache servers named node0, node1, and node2, and 30 million key objects that need to be evenly cached across the three machines.
The Modulo Hashing Problem
The straightforward solution is the modulo algorithm hash(key) % N, where N is the number of servers. This maps each key directly to one of the three nodes, which works but has a major limitation when the number of servers changes.

If a node fails or a new node is added, the result of hash(key) % N changes for most keys, causing them to be remapped to different servers. This leads to massive cache invalidation and a potential cache avalanche, which is unacceptable in production.
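To see the scale of the problem, here is a minimal sketch (not from the article; the `bucket` helper and key names are illustrative) that measures how many keys get remapped when a fourth server is added under plain modulo hashing:

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Use a stable hash (Python's built-in hash() is salted per process),
    # then take it modulo the number of servers.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"key-{i}" for i in range(10_000)]
# Compare each key's bucket with 3 servers vs. 4 servers.
moved = sum(bucket(k, 3) != bucket(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys remap when N goes from 3 to 4")
```

For uniformly distributed hashes, roughly three quarters of the keys land on a different server after the change, which is exactly the mass cache invalidation described above.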
Consistent Hashing
Consistent hashing also uses a modulo operation, but instead of modulo N it takes modulo 2^32, effectively mapping both servers and keys onto a circular hash ring.
“IPv4 addresses consist of four 8‑bit groups, so using 2^32 ensures a unique mapping for each IP address.”
The 2^32 possible values are visualized as a circle (the hash ring), with positions ranging from 0 to 2^32−1.
Mapping Servers to the Hash Ring
Each server’s IP address is hashed, and the result modulo 2^32 determines its position on the ring. Thus node0, node1, and node2 are placed at distinct points on the ring.
Mapping Keys to the Hash Ring
Each key is also hashed and placed on the same ring using hash(key) % 2^32. The key is stored on the first server encountered when moving clockwise from the key’s position.
“Starting from the key’s position, the first server encountered clockwise is the server that will cache the object.”
Because both server and key hashes are stable, a given key will consistently map to the same server as long as the server set does not change.
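The clockwise lookup can be implemented with a sorted list and a binary search. Below is a minimal sketch (the `HashRing` class and `h32` helper are my own illustrative names, not from the article), using MD5 as the hash function:

```python
import bisect
import hashlib

def h32(s: str) -> int:
    # Map a string to a point on the 0 .. 2^32-1 ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes):
        # Store (position, node) pairs sorted by ring position.
        self._points = sorted((h32(n), n) for n in nodes)
        self._hashes = [p for p, _ in self._points]

    def locate(self, key: str) -> str:
        # First node at or clockwise from the key's position;
        # wrap around to the start of the ring if we fall off the end.
        i = bisect.bisect_left(self._hashes, h32(key)) % len(self._hashes)
        return self._points[i][1]

ring = HashRing(["node0", "node1", "node2"])
print(ring.locate("key-1"))
```

Because the hash of both the key and the servers is deterministic, repeated lookups for the same key always return the same node, as the article notes.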
key-1 -> node-1
key-3 -> node-2
key-4 -> node-2
key-5 -> node-2
key-2 -> node-0

Advantages of Consistent Hashing
When adding a new node (e.g., node-4), only the keys that fall between the new node and its predecessor on the ring need to be remapped, affecting a small portion of data. Similarly, if a node fails, only the keys that were mapped to that node are reassigned to the next clockwise node, minimizing disruption.
Thus, changes in the number of servers affect only a small subset of keys, keeping the cache service largely available.
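This minimal-disruption property can be checked directly. In the sketch below (again my own illustrative code, with hypothetical node names), a fourth node is added and we verify that every key either stays where it was or moves to the new node — nothing else is shuffled:

```python
import hashlib

RING = 2 ** 32

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

def locate(key: str, nodes) -> str:
    # Linear scan of the sorted ring: fine for a sketch, use binary
    # search in real code.
    points = sorted((h32(n), n) for n in nodes)
    pos = h32(key)
    for p, n in points:
        if pos <= p:
            return n
    return points[0][1]  # wrap around past the last node

keys = [f"key-{i}" for i in range(10_000)]
before = {k: locate(k, ["node0", "node1", "node2"]) for k in keys}
after = {k: locate(k, ["node0", "node1", "node2", "node4"]) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys move after adding node4")
```

Contrast this with modulo hashing, where adding a node remaps the majority of keys: here only the arc claimed by the new node is affected.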
Data Skew Problem
If nodes are unevenly distributed on the ring, some nodes may receive a disproportionate amount of keys, leading to data skew and resource imbalance.
Introducing virtual nodes solves this by hashing multiple points for each physical server, spreading the load more evenly.
Virtual Nodes
Each server generates several virtual nodes, e.g., node-1#1, node-1#2, and node-1#3, each hashed with hash(IP#i) % 2^32. These virtual positions are placed on the ring, making the distribution of keys more uniform.
hash(10.24.23.227#1) % 2^32
hash(10.24.23.227#2) % 2^32
hash(10.24.23.227#3) % 2^32

“The more virtual nodes you add, the more evenly the hash ring is populated, reducing skew.”
When a key is looked up, the process becomes key → virtual node → real node.
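The key → virtual node → real node lookup is a small extension of the plain ring: each ring point remembers which physical node it belongs to. A sketch (the `VirtualRing` class and the replica count are illustrative choices, not from the article):

```python
import bisect
import hashlib
from collections import Counter

RING = 2 ** 32

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

class VirtualRing:
    def __init__(self, nodes, replicas=150):
        # Each physical node contributes `replicas` virtual points,
        # named "node#0", "node#1", ... as in the article's IP#i scheme.
        self._points = sorted(
            (h32(f"{n}#{i}"), n) for n in nodes for i in range(replicas)
        )
        self._hashes = [p for p, _ in self._points]

    def locate(self, key: str) -> str:
        # Find the first virtual point clockwise from the key, then
        # resolve it to the real node it stands for.
        i = bisect.bisect_left(self._hashes, h32(key)) % len(self._hashes)
        return self._points[i][1]

ring = VirtualRing(["node0", "node1", "node2"])
counts = Counter(ring.locate(f"key-{i}") for i in range(30_000))
print(counts)  # roughly balanced across the three nodes
```

With enough replicas per node, the per-node key counts converge toward an even split, which is exactly the skew reduction the article describes; production libraries commonly use on the order of 100–200 virtual nodes per server.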
Application Scenarios
Consistent hashing is a widely used load-balancing technique in distributed systems. It can be implemented on the client side or inside middleware such as the cache servers memcached and redis. Memcached uses it for client-side routing, while Redis Cluster uses a similar slot-based approach. Other applications include:
- RPC frameworks such as Dubbo, for service provider selection
- Sharding in distributed relational databases
- LVS load-balancer scheduling
- …and many other cases
Summary
Consistent hashing provides a robust way to distribute keys across a dynamic set of nodes, minimizing cache invalidation during scaling or failures. However, when the number of nodes becomes very large or updates are frequent, lookup performance may degrade, and a reliable routing service is required to avoid a single point of failure.
Overall, the technique’s benefits outweigh its drawbacks, making it a valuable tool for solving real‑world scaling problems.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.