Do Consistent Hashes Really Keep Your Cluster Balanced? Real-World Tests and Fixes
This article experimentally validates consistent hashing by adding and removing nodes in a 5‑node cluster, reveals key distribution imbalance and avalanche risks, and proposes solutions such as node scaling and virtual node virtualization, complete with Python code examples and detailed analysis.
Node Expansion Test
We start with a 4‑node cluster (192.168.1.1‑192.168.1.4) and add a fifth node (192.168.1.5). The following Python snippet prints the number of keys that need to be migrated and the specific key‑node pairs:
print '上线一个节点,需要迁移节点个数:{}'.format(len(list(set(addNodeMap).difference(set(oldMap))))
print '需要迁移的key-node信息:',list(set(addNodeMap).difference(set(oldMap)))The result shows that three keys are migrated to the new node:
上线一个节点,需要迁移节点个数:3
需要迁移的key-node信息: ['testKey15_192.168.1.5', 'testKey36_192.168.1.5', 'testKey23_192.168.1.5']Node Shrink (Removal) Test
With all five nodes running, we remove node 192.168.1.5 and observe the migration:
print '需要迁移节点个数:{}' .format(len(list(set(removeNodeMap).difference(set(addNodeMap))))
print '等待迁移的key信息:{}' .format(list(set(removeNodeMap).difference(set(addNodeMap))) )The output indicates that the same three keys move back to node 192.168.1.2:
下线节点['192.168.1.5'],需要迁移节点个数:3
等待迁移的key信息:['testKey23_192.168.1.2', 'testKey36_192.168.1.2', 'testKey15_192.168.1.2']Conclusion from Basic Tests
Both adding and removing a node only affect the three keys located between the two nodes on the hash ring, confirming the theoretical expectation that consistent hashing limits the impact of node changes to a small subset of keys.
Further Investigation – Systematic Node Removal
We then iteratively remove each node while keeping the cluster size at five, using the following loop:
nodeList = [
['192.168.1.1'],
['192.168.1.2'],
['192.168.1.3'],
['192.168.1.4'],
['192.168.1.5']
]
for node in nodeList:
h.removeNodes(node)
removeNodeMap = h.allocateKey(40)
print '下线节点{},需要迁移节点个数:{}'.format(node, len(list(set(removeNodeMap).difference(set(addNodeMap)))))
print '等待迁移的key信息:{}'.format(list(set(removeNodeMap).difference(set(addNodeMap))))
h.addNodes(node)The results show varying numbers of migrated keys, e.g., removing 192.168.1.2 moves 14 keys, while removing 192.168.1.5 moves only 3. This reveals two practical problems.
Problem 1 – Uneven Key Distribution
When different nodes are removed, the number of affected keys differs dramatically, leading to load imbalance. Nodes handling more keys become hotspots, increasing read/write latency under load.
Two straightforward remedies are:
Solution 1: Scale out physical nodes (or keep the current setup if the load is low).
Solution 2: Introduce virtual nodes to increase the logical node count without adding hardware.
Virtualization is illustrated below:
Problem 2 – Avalanche Effect
When a node goes down, all its keys are reassigned to the next node clockwise on the ring. If that node is already heavily loaded, it may also fail, causing a cascade of failures that can bring the entire cluster down.
Virtual nodes mitigate this by spreading keys across many logical points, reducing the chance that a single physical node receives a sudden surge.
Enhanced Consistent Hash Implementation with Virtual Nodes
The following Python class adds support for a configurable replica factor (virtual nodes) and provides methods to add/remove nodes and allocate keys:
#!/usr/bin/env python
# -*- coding:utf8 -*-
import hashlib
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class ConsistentHash(object):
def __init__(self, nodes=None, replicaFactor=10):
self.nodes = nodes
self.replicaFactor = replicaFactor
self.ring = {}
self.sortedKeys = []
self.addNodes(nodes)
def addNodes(self, nodes):
if nodes:
for node in nodes:
for number in range(self.replicaFactor):
replicaNode = "%s%s" % (node, number)
replicaNodeHashResult = hashlib.sha1(replicaNode).hexdigest()
intReplicaNodeHashResult = int(replicaNodeHashResult, 16)
modIntReplicaNodeHashResult = intReplicaNodeHashResult % (2 ** 32)
self.ring[modIntReplicaNodeHashResult] = node
self.sortedKeys.append(modIntReplicaNodeHashResult)
self.sortedKeys.sort()
def removeNodes(self, nodes):
if nodes:
for node in nodes:
for number in range(self.replicaFactor):
replicaNode = "%s%s" % (node, number)
replicaNodeHashResult = hashlib.sha1(replicaNode).hexdigest()
intReplicaNodeHashResult = int(replicaNodeHashResult, 16)
modIntReplicaNodeHashResult = intReplicaNodeHashResult % (2 ** 32)
self.ring.pop(modIntReplicaNodeHashResult)
self.sortedKeys.remove(modIntReplicaNodeHashResult)
def getNode(self, modKeyHashResult):
for _modIntNodeHashResult in self.sortedKeys:
if modKeyHashResult < _modIntNodeHashResult:
return self.ring[_modIntNodeHashResult]
return self.ring[self.sortedKeys[0]]
def allocateKey(self, number):
keyNodeMap = []
for i in range(number):
keyName = 'testKey' + str(i)
keyHashResult = hashlib.sha1(keyName).hexdigest()
intKeyHashResult = int(keyHashResult, 16)
modKeyHashResult = intKeyHashResult % (2 ** 32)
_node = self.getNode(modKeyHashResult)
keyNodeMap.append(keyName + '_' + _node)
return keyNodeMapRunning the same removal tests with this enhanced hash shows that keys are now redistributed across multiple physical nodes, alleviating both load imbalance and avalanche risks.
下线节点['192.168.1.1'],需要迁移节点个数:15
等待迁移的key信息:['testKey27_192.168.1.5', 'testKey15_192.168.1.5', ...]
下线节点['192.168.1.2'],需要迁移节点个数:6
等待迁移的key信息:['testKey2_192.168.1.1', 'testKey10_192.168.1.1', ...]
...Further Thoughts
Even with virtual nodes, perfect balance is unattainable; the goal is relative balance. The optimal number of virtual nodes per physical node depends on workload characteristics and must be determined empirically.
In real‑world scenarios, key request rates vary, so a one‑size‑fits‑all replica factor may not suffice. Possible extensions include assigning different replica factors per node based on hotspot detection or maintaining a global key‑node map to migrate hot keys proactively.
Overall, the experiments confirm the theoretical properties of consistent hashing while highlighting practical considerations such as uneven distribution and cascade failures, and demonstrate that virtual node techniques can effectively mitigate these issues.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
