Why Consistent Hashing Cuts Data Movement When Scaling Distributed Systems
This article explains the principles of consistent hashing, compares it with simple modulo and virtual bucket methods, demonstrates how to build a hash ring, map nodes and keys, and shows through Python code how node addition or removal only affects a small portion of the data.
Introduction
Consistent hashing is introduced as a solution to the inflexibility of simple modulo hashing and the higher data movement of virtual bucket sharding. It aims to minimize the amount of data that needs to be remapped when nodes are added or removed.
1. Building an Empty Hash Ring
The hash ring spans the integer range 0 to 2^32-1. Nodes are placed on this ring by hashing their IP address or hostname with SHA‑1, converting the result to an integer, and taking the modulo 2^32. This creates a uniform distribution of node positions on the ring.
2. Constructing Node Topology on the Ring
Each node’s hashed value determines its location on the ring. For example, three nodes A, B, and C are hashed and placed at specific points, dividing the ring into three intervals that act as logical shards.
3. Mapping Keys to the Ring
Keys are processed in the same way as nodes: SHA‑1 hash → integer → modulo 2^32. The key is then assigned to the first node encountered when moving clockwise around the ring. This ensures each key belongs to exactly one node.
4. Simulating Node Failure
When a node (e.g., C) goes offline, only the keys that were mapped to the segment between the failed node and the next clockwise node need to be reassigned. All other keys remain untouched, demonstrating the low impact of node removal.
5. Simulating Node Addition
Adding a new node (e.g., D) inserts a new point on the ring. Only the keys that fall between the new node and its predecessor need to move to the new node; the rest of the system stays unchanged.
6. Model Analysis
Compared with modulo hashing and virtual bucket sharding, consistent hashing dramatically reduces the amount of data that must be remapped during scaling operations, improving fault tolerance and scalability.
7. Code Implementation
The following Python script implements the concepts described above. It defines a ConsistentHash class with methods to add and remove nodes, locate the appropriate node for a given key, and allocate a batch of keys. The script demonstrates initialization with four nodes, allocation of 40 keys, addition of a fifth node, and removal of one node, printing the allocation results after each operation.
#!/usr/bin/env python
# -*- coding:utf8 -*-
# author:lzj
import hashlib
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
class ConsistentHash(object):
def __init__(self, nodes=None):
'''
a. initialize node list
b. ring maps hashed node values to node identifiers
c. sortedKeys keeps the hash values in ascending order to simulate the ring
d. add initial nodes to the ring
'''
self.nodes = nodes
self.ring = {}
self.sortedKeys = []
self.addNodes(nodes)
# add a list of nodes to the ring
def addNodes(self, nodes):
if nodes:
for node in nodes:
nodeHashResult = hashlib.sha1(node).hexdigest()
intNodeHashResult = int(nodeHashResult, 16)
modIntNodeHashResult = intNodeHashResult % (2 ** 32)
self.ring[modIntNodeHashResult] = node
self.sortedKeys.append(modIntNodeHashResult)
self.sortedKeys.sort()
def removeNodes(self, nodes):
'''
remove nodes from the ring and update mappings
'''
if nodes:
for node in nodes:
nodeHashResult = hashlib.sha1(node).hexdigest()
intNodeHashResult = int(nodeHashResult, 16)
modIntNodeHashResult = intNodeHashResult % (2 ** 32)
self.ring.pop(modIntNodeHashResult)
self.sortedKeys.remove(modIntNodeHashResult)
def getNode(self, modKeyHashResult):
'''
iterate over sortedKeys; the first node with a hash greater than the key's hash is selected
'''
position = 0
for _modIntNodeHashResult in self.sortedKeys:
position += 1
if modKeyHashResult < _modIntNodeHashResult:
return self.ring[_modIntNodeHashResult]
else:
continue
# if no larger hash is found, wrap around to the first node
if position == len(self.sortedKeys):
return self.ring[self.sortedKeys[0]]
def allocateKey(self, number):
keyNodeMap = []
# simulate a number of keys and map each to a node
for i in range(number):
keyName = 'testKey' + str(i)
keyHashResult = hashlib.sha1(keyName).hexdigest()
intKeyHashResult = int(keyHashResult, 16)
modKeyHashResult = intKeyHashResult % (2 ** 32)
_node = self.getNode(modKeyHashResult)
print('%s is allocateKey to %s' % (keyName, _node))
keyNodeMap.append(keyName + '_' + _node)
return keyNodeMap
# simulate a cluster with four nodes
iniServers = [
'192.168.1.1',
'192.168.1.2',
'192.168.1.3',
'192.168.1.4',
]
print('Initialize hash ring and node topology')
h = ConsistentHash(iniServers)
print(h.ring)
print(h.sortedKeys)
print('Allocate 40 keys to the ring')
oldMap = h.allocateKey(40)
print('Add a new node and re‑allocate keys')
newNode = ['192.168.1.5']
h.addNodes(newNode)
print(h.ring)
print(h.sortedKeys)
addNodeMap = h.allocateKey(40)
print('Remove a node and re‑allocate keys')
removeNode = ['192.168.1.1']
h.removeNodes(removeNode)
print(h.ring)
print(h.sortedKeys)
removeNodeMap = h.allocateKey(40)Conclusion
The provided code faithfully reproduces the core behavior of consistent hashing, demonstrating its advantages in reducing data movement during scaling. The next article will explore the remaining pain points and limitations of consistent hashing.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
