Fundamentals 17 min read

Why Consistent Hashing Cuts Data Movement When Scaling Distributed Systems

This article explains the principles of consistent hashing, compares it with simple modulo and virtual bucket methods, demonstrates how to build a hash ring, map nodes and keys, and shows through Python code how node addition or removal only affects a small portion of the data.

ITPUB

Aug 17, 2020

Why Consistent Hashing Cuts Data Movement When Scaling Distributed Systems

Introduction

Consistent hashing is introduced as a solution to the inflexibility of simple modulo hashing and the higher data movement of virtual bucket sharding. It aims to minimize the amount of data that needs to be remapped when nodes are added or removed.

1. Building an Empty Hash Ring

The hash ring spans the integer range 0 to 2^32-1. Nodes are placed on this ring by hashing their IP address or hostname with SHA‑1, converting the result to an integer, and taking the modulo 2^32. This creates a uniform distribution of node positions on the ring.

2. Constructing Node Topology on the Ring

Each node’s hashed value determines its location on the ring. For example, three nodes A, B, and C are hashed and placed at specific points, dividing the ring into three intervals that act as logical shards.

3. Mapping Keys to the Ring

Keys are processed in the same way as nodes: SHA‑1 hash → integer → modulo 2^32. The key is then assigned to the first node encountered when moving clockwise around the ring. This ensures each key belongs to exactly one node.

4. Simulating Node Failure

When a node (e.g., C) goes offline, only the keys that were mapped to the segment between the failed node and the next clockwise node need to be reassigned. All other keys remain untouched, demonstrating the low impact of node removal.

5. Simulating Node Addition

Adding a new node (e.g., D) inserts a new point on the ring. Only the keys that fall between the new node and its predecessor need to move to the new node; the rest of the system stays unchanged.

6. Model Analysis

Compared with modulo hashing and virtual bucket sharding, consistent hashing dramatically reduces the amount of data that must be remapped during scaling operations, improving fault tolerance and scalability.

7. Code Implementation

The following Python script implements the concepts described above. It defines a ConsistentHash class with methods to add and remove nodes, locate the appropriate node for a given key, and allocate a batch of keys. The script demonstrates initialization with four nodes, allocation of 40 keys, addition of a fifth node, and removal of one node, printing the allocation results after each operation.

#!/usr/bin/env python
# -*- coding:utf8 -*-
# author:lzj
import hashlib
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

class ConsistentHash(object):
    def __init__(self, nodes=None):
        '''
        a. initialize node list
        b. ring maps hashed node values to node identifiers
        c. sortedKeys keeps the hash values in ascending order to simulate the ring
        d. add initial nodes to the ring
        '''
        self.nodes = nodes
        self.ring = {}
        self.sortedKeys = []
        self.addNodes(nodes)

    # add a list of nodes to the ring
    def addNodes(self, nodes):
        if nodes:
            for node in nodes:
                nodeHashResult = hashlib.sha1(node).hexdigest()
                intNodeHashResult = int(nodeHashResult, 16)
                modIntNodeHashResult = intNodeHashResult % (2 ** 32)
                self.ring[modIntNodeHashResult] = node
                self.sortedKeys.append(modIntNodeHashResult)
                self.sortedKeys.sort()

    def removeNodes(self, nodes):
        '''
        remove nodes from the ring and update mappings
        '''
        if nodes:
            for node in nodes:
                nodeHashResult = hashlib.sha1(node).hexdigest()
                intNodeHashResult = int(nodeHashResult, 16)
                modIntNodeHashResult = intNodeHashResult % (2 ** 32)
                self.ring.pop(modIntNodeHashResult)
                self.sortedKeys.remove(modIntNodeHashResult)

    def getNode(self, modKeyHashResult):
        '''
        iterate over sortedKeys; the first node with a hash greater than the key's hash is selected
        '''
        position = 0
        for _modIntNodeHashResult in self.sortedKeys:
            position += 1
            if modKeyHashResult < _modIntNodeHashResult:
                return self.ring[_modIntNodeHashResult]
            else:
                continue
        # if no larger hash is found, wrap around to the first node
        if position == len(self.sortedKeys):
            return self.ring[self.sortedKeys[0]]

    def allocateKey(self, number):
        keyNodeMap = []
        # simulate a number of keys and map each to a node
        for i in range(number):
            keyName = 'testKey' + str(i)
            keyHashResult = hashlib.sha1(keyName).hexdigest()
            intKeyHashResult = int(keyHashResult, 16)
            modKeyHashResult = intKeyHashResult % (2 ** 32)
            _node = self.getNode(modKeyHashResult)
            print('%s is allocateKey to %s' % (keyName, _node))
            keyNodeMap.append(keyName + '_' + _node)
        return keyNodeMap

# simulate a cluster with four nodes
iniServers = [
    '192.168.1.1',
    '192.168.1.2',
    '192.168.1.3',
    '192.168.1.4',
]
print('Initialize hash ring and node topology')
h = ConsistentHash(iniServers)
print(h.ring)
print(h.sortedKeys)

print('Allocate 40 keys to the ring')
oldMap = h.allocateKey(40)

print('Add a new node and re‑allocate keys')
newNode = ['192.168.1.5']
h.addNodes(newNode)
print(h.ring)
print(h.sortedKeys)
addNodeMap = h.allocateKey(40)

print('Remove a node and re‑allocate keys')
removeNode = ['192.168.1.1']
h.removeNodes(removeNode)
print(h.ring)
print(h.sortedKeys)
removeNodeMap = h.allocateKey(40)

Conclusion

The provided code faithfully reproduces the core behavior of consistent hashing, demonstrating its advantages in reducing data movement during scaling. The next article will explore the remaining pain points and limitations of consistent hashing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python Scalability load balancing consistent hashing hash ring

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.