Do Consistent Hashes Really Keep Your Cluster Balanced? Real-World Tests and Fixes

This article experimentally validates consistent hashing by adding and removing nodes in a 5‑node cluster, reveals key distribution imbalance and avalanche risks, and proposes solutions such as node scaling and virtual node virtualization, complete with Python code examples and detailed analysis.

ITPUB
ITPUB
ITPUB
Do Consistent Hashes Really Keep Your Cluster Balanced? Real-World Tests and Fixes

Node Expansion Test

We start with a 4‑node cluster (192.168.1.1‑192.168.1.4) and add a fifth node (192.168.1.5). The following Python snippet prints the number of keys that need to be migrated and the specific key‑node pairs:

print '上线一个节点,需要迁移节点个数:{}'.format(len(list(set(addNodeMap).difference(set(oldMap))))
print '需要迁移的key-node信息:',list(set(addNodeMap).difference(set(oldMap)))

The result shows that three keys are migrated to the new node:

上线一个节点,需要迁移节点个数:3
需要迁移的key-node信息: ['testKey15_192.168.1.5', 'testKey36_192.168.1.5', 'testKey23_192.168.1.5']

Node Shrink (Removal) Test

With all five nodes running, we remove node 192.168.1.5 and observe the migration:

print '需要迁移节点个数:{}' .format(len(list(set(removeNodeMap).difference(set(addNodeMap))))
print '等待迁移的key信息:{}' .format(list(set(removeNodeMap).difference(set(addNodeMap))) )

The output indicates that the same three keys move back to node 192.168.1.2:

下线节点['192.168.1.5'],需要迁移节点个数:3
等待迁移的key信息:['testKey23_192.168.1.2', 'testKey36_192.168.1.2', 'testKey15_192.168.1.2']

Conclusion from Basic Tests

Both adding and removing a node only affect the three keys located between the two nodes on the hash ring, confirming the theoretical expectation that consistent hashing limits the impact of node changes to a small subset of keys.

Further Investigation – Systematic Node Removal

We then iteratively remove each node while keeping the cluster size at five, using the following loop:

nodeList = [
    ['192.168.1.1'],
    ['192.168.1.2'],
    ['192.168.1.3'],
    ['192.168.1.4'],
    ['192.168.1.5']
]
for node in nodeList:
    h.removeNodes(node)
    removeNodeMap = h.allocateKey(40)
    print '下线节点{},需要迁移节点个数:{}'.format(node, len(list(set(removeNodeMap).difference(set(addNodeMap)))))
    print '等待迁移的key信息:{}'.format(list(set(removeNodeMap).difference(set(addNodeMap))))
    h.addNodes(node)

The results show varying numbers of migrated keys, e.g., removing 192.168.1.2 moves 14 keys, while removing 192.168.1.5 moves only 3. This reveals two practical problems.

Problem 1 – Uneven Key Distribution

When different nodes are removed, the number of affected keys differs dramatically, leading to load imbalance. Nodes handling more keys become hotspots, increasing read/write latency under load.

Two straightforward remedies are:

Solution 1: Scale out physical nodes (or keep the current setup if the load is low).

Solution 2: Introduce virtual nodes to increase the logical node count without adding hardware.

Virtualization is illustrated below:

Problem 2 – Avalanche Effect

When a node goes down, all its keys are reassigned to the next node clockwise on the ring. If that node is already heavily loaded, it may also fail, causing a cascade of failures that can bring the entire cluster down.

Virtual nodes mitigate this by spreading keys across many logical points, reducing the chance that a single physical node receives a sudden surge.

Enhanced Consistent Hash Implementation with Virtual Nodes

The following Python class adds support for a configurable replica factor (virtual nodes) and provides methods to add/remove nodes and allocate keys:

#!/usr/bin/env python
# -*- coding:utf8 -*-
import hashlib
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

class ConsistentHash(object):
    def __init__(self, nodes=None, replicaFactor=10):
        self.nodes = nodes
        self.replicaFactor = replicaFactor
        self.ring = {}
        self.sortedKeys = []
        self.addNodes(nodes)

    def addNodes(self, nodes):
        if nodes:
            for node in nodes:
                for number in range(self.replicaFactor):
                    replicaNode = "%s%s" % (node, number)
                    replicaNodeHashResult = hashlib.sha1(replicaNode).hexdigest()
                    intReplicaNodeHashResult = int(replicaNodeHashResult, 16)
                    modIntReplicaNodeHashResult = intReplicaNodeHashResult % (2 ** 32)
                    self.ring[modIntReplicaNodeHashResult] = node
                    self.sortedKeys.append(modIntReplicaNodeHashResult)
            self.sortedKeys.sort()

    def removeNodes(self, nodes):
        if nodes:
            for node in nodes:
                for number in range(self.replicaFactor):
                    replicaNode = "%s%s" % (node, number)
                    replicaNodeHashResult = hashlib.sha1(replicaNode).hexdigest()
                    intReplicaNodeHashResult = int(replicaNodeHashResult, 16)
                    modIntReplicaNodeHashResult = intReplicaNodeHashResult % (2 ** 32)
                    self.ring.pop(modIntReplicaNodeHashResult)
                    self.sortedKeys.remove(modIntReplicaNodeHashResult)

    def getNode(self, modKeyHashResult):
        for _modIntNodeHashResult in self.sortedKeys:
            if modKeyHashResult < _modIntNodeHashResult:
                return self.ring[_modIntNodeHashResult]
        return self.ring[self.sortedKeys[0]]

    def allocateKey(self, number):
        keyNodeMap = []
        for i in range(number):
            keyName = 'testKey' + str(i)
            keyHashResult = hashlib.sha1(keyName).hexdigest()
            intKeyHashResult = int(keyHashResult, 16)
            modKeyHashResult = intKeyHashResult % (2 ** 32)
            _node = self.getNode(modKeyHashResult)
            keyNodeMap.append(keyName + '_' + _node)
        return keyNodeMap

Running the same removal tests with this enhanced hash shows that keys are now redistributed across multiple physical nodes, alleviating both load imbalance and avalanche risks.

下线节点['192.168.1.1'],需要迁移节点个数:15
等待迁移的key信息:['testKey27_192.168.1.5', 'testKey15_192.168.1.5', ...]
下线节点['192.168.1.2'],需要迁移节点个数:6
等待迁移的key信息:['testKey2_192.168.1.1', 'testKey10_192.168.1.1', ...]
...

Further Thoughts

Even with virtual nodes, perfect balance is unattainable; the goal is relative balance. The optimal number of virtual nodes per physical node depends on workload characteristics and must be determined empirically.

In real‑world scenarios, key request rates vary, so a one‑size‑fits‑all replica factor may not suffice. Possible extensions include assigning different replica factors per node based on hotspot detection or maintaining a global key‑node map to migrate hot keys proactively.

Overall, the experiments confirm the theoretical properties of consistent hashing while highlighting practical considerations such as uneven distribution and cascade failures, and demonstrate that virtual node techniques can effectively mitigate these issues.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

consistent hashingvirtual nodeskey distributionNode Scaling
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.