Backend Development · 12 min read

How Consistent Hashing Solves Cache Scaling and Reduces Data Skew

This article explains the principles of consistent hashing, compares it with simple modulo hashing, and shows how the hash ring, virtual nodes, and key mapping improve load balancing, minimize data loss during node changes, and prevent cache avalanche in distributed caching systems.

Su San Talks Tech

Hello everyone, I'm Su San. Recently I saw a discussion about consistent hashing in a tech group, so I decided to introduce its principles using a classic distributed cache scenario that often appears in interviews.

Setting Up the Scenario

Assume we have three cache servers named node0, node1, and node2, and 30 million key objects that need to be evenly cached across the three machines.

The Problem with Modulo Hashing

The straightforward solution is the modulo algorithm hash(key) % N, where N is the number of servers. This maps each key to one of the three nodes directly, which works but has a major limitation when the number of servers changes.

If a node fails or a new node is added, N changes, so hash(key) % N yields a different result for most keys and they are remapped to different servers. This leads to massive cache invalidation and potential cache avalanche, which is unacceptable in production.
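To make the cost concrete, here is a small Python sketch (not from the article; the md5-based hash function and key names are illustrative assumptions) that counts how many keys change servers when a fourth node is added under modulo hashing:

```python
# Hypothetical demo: measure how many keys move when a server is
# added under simple modulo hashing, hash(key) % N.
import hashlib

def node_for(key: str, n: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n

keys = [f"key-{i}" for i in range(100_000)]
before = {k: node_for(k, 3) for k in keys}  # 3 servers
after = {k: node_for(k, 4) for k in keys}   # scale out to 4 servers

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 75%
```

In theory only keys where hash % 3 happens to equal hash % 4 stay put, which is 3 residues out of every 12, so about three quarters of the cache is invalidated by adding a single server.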

Consistent Hashing

Consistent hashing also uses a modulo operation, but instead of modulo N it takes modulo 2^32, effectively mapping both servers and keys onto a circular hash ring.

“IPv4 addresses consist of four 8‑bit groups, so using 2^32 ensures a unique mapping for each IP address.”

The 2^32 possible values are visualized as a circle (the hash ring), with positions ranging from 0 to 2^32 − 1.

Mapping Servers to the Hash Ring

Each server’s IP address is hashed, and the result modulo 2^32 determines its position on the ring. Thus node0, node1, and node2 are placed at distinct points on the ring.

Mapping Keys to the Hash Ring

Each key is also hashed and placed on the same ring using hash(key) % 2^32. The key is stored on the first server encountered when moving clockwise from the key’s position.

“Starting from the key’s position, the first server encountered clockwise is the server that will cache the object.”

Because both server and key hashes are stable, a given key will consistently map to the same server as long as the server set does not change.

key-1 -> node1
key-3 -> node2
key-4 -> node2
key-5 -> node2
key-2 -> node0
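The clockwise lookup described above can be sketched in a few lines of Python (a minimal illustration without virtual nodes; the md5-based ring hash and node names are assumptions, not from the article):

```python
# Minimal consistent-hash ring: sorted positions + binary search.
import bisect
import hashlib

RING_SIZE = 2**32

def ring_hash(value: str) -> int:
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

class HashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs let us binary-search clockwise.
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        pos = ring_hash(key)
        # First node at or after the key's position, wrapping past 2^32 to 0.
        idx = bisect.bisect_left(self._ring, (pos, ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]

ring = HashRing(["node0", "node1", "node2"])
print(ring.node_for("key-1"))  # stable across runs as long as nodes don't change
```

Because the ring is kept sorted, each lookup is a single O(log n) binary search rather than a scan of the whole ring.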

Advantages of Consistent Hashing

When adding a new node (e.g., node-4), only the keys that fall between the new node and its predecessor on the ring need to be remapped, affecting a small portion of data. Similarly, if a node fails, only the keys that were mapped to that node are reassigned to the next clockwise node, minimizing disruption.

Thus, changes in the number of servers affect only a small subset of keys, keeping the cache service largely available.
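This locality property can be checked directly: in the hypothetical sketch below (node names, keys, and the md5-based hash are illustrative assumptions), every key that moves after adding a node must have moved onto the new node — no key is shuffled between the old nodes.

```python
# Hypothetical check: adding a node to a consistent-hash ring only
# remaps the keys that now land on the new node.
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

def lookup(positions, key):
    # positions: sorted (hash, node) pairs; clockwise search with wraparound.
    idx = bisect.bisect_left(positions, (h(key), ""))
    return positions[idx % len(positions)][1]

keys = [f"key-{i}" for i in range(100_000)]
three = sorted((h(n), n) for n in ["node0", "node1", "node2"])
four = sorted((h(n), n) for n in ["node0", "node1", "node2", "node3"])

moved = [k for k in keys if lookup(three, k) != lookup(four, k)]
# Every moved key should now belong to the newly added node3.
print(all(lookup(four, k) == "node3" for k in moved))
```

Contrast this with the modulo scheme, where a key can hop between any pair of old servers when N changes.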

Data Skew Problem

If nodes are unevenly distributed on the ring, some nodes may receive a disproportionate amount of keys, leading to data skew and resource imbalance.

Introducing virtual nodes solves this by hashing multiple points for each physical server, spreading the load more evenly.

Virtual Nodes

Each server generates several virtual nodes, e.g., node-1#1, node-1#2, and node-1#3, each hashed with hash(IP#i) % 2^32. These virtual positions are placed on the ring, making the distribution of keys more uniform.

hash(10.24.23.227#1) % 2^32
hash(10.24.23.227#2) % 2^32
hash(10.24.23.227#3) % 2^32
“The more virtual nodes you add, the more evenly the hash ring is populated, reducing skew.”

When a key is looked up, the process becomes key → virtual node → real node.
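The two-step lookup can be folded into the ring itself by storing the real server alongside each virtual position. The following sketch (the IPs, replica count, and md5-based hash are illustrative assumptions) shows how replicas smooth out the load:

```python
# Virtual nodes: each physical server appears on the ring many times
# under names like "10.24.23.227#1"; each ring entry records the real IP.
import bisect
import hashlib
from collections import Counter

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

REPLICAS = 100  # more virtual nodes -> smoother key distribution

servers = ["10.24.23.227", "10.24.23.228", "10.24.23.229"]
ring = sorted((h(f"{ip}#{i}"), ip) for ip in servers for i in range(REPLICAS))

def node_for(key: str) -> str:
    # key -> virtual node position -> real node (the stored IP).
    idx = bisect.bisect_left(ring, (h(key), ""))
    return ring[idx % len(ring)][1]

load = Counter(node_for(f"key-{i}") for i in range(30_000))
print(load)  # each server should receive roughly a third of the keys
```

With only one position per server the arcs can be wildly uneven; with 100 replicas each server’s share converges toward an even split, which is the skew reduction the quote above describes.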

Application Scenarios

Consistent hashing is a preferred load-balancing algorithm in distributed systems. It can be implemented on the client side or within middleware such as the cache servers memcached and redis: memcached clients use it for routing, while Redis Cluster uses a similar slot-based approach. Other common applications include:

RPC frameworks such as Dubbo, for service provider selection
Sharding in distributed relational databases
LVS load-balancer scheduling
…and many other cases

Summary

Consistent hashing provides a robust way to distribute keys across a dynamic set of nodes, minimizing cache invalidation during scaling or failures. However, when the number of nodes becomes very large or updates are frequent, lookup performance may degrade, and a reliable routing service is required to avoid a single point of failure.

Overall, the technique’s benefits outweigh its drawbacks, making it a valuable tool for solving real‑world scaling problems.

Load Balancing · consistent hashing · distributed caching · virtual nodes
Written by Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
