
Understanding Distributed Hash Tables (DHT) and Their Improvements

The article explains how Distributed Hash Tables replace simple modulo hashing with a ring‑based scheme, demonstrates severe data skew in basic implementations, and shows that adding multiple virtual nodes plus a load‑boundary factor dramatically balances storage and request distribution across cluster nodes.

vivo Internet Technology

With the rapid growth of the Internet, the amount of user‑generated data has exploded, creating massive storage challenges for enterprises. Most mainstream distributed big‑data file systems slice and scatter data across all nodes in a cluster. This article introduces how a Distributed Hash Table (DHT) achieves distributed, scattered storage.

Technical Background: In the early Internet, data was stored on a single server. As user numbers and data volume grew exponentially, a single machine could no longer meet storage needs, prompting the need for multiple servers to cooperate.

Traditional Hash: Conventional hash placement (e.g., node = hash(X) mod S) distributes data according to the number of nodes S. When the cluster scales up or down, S changes, remapping most keys and forcing massive data migration with poor performance.
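A quick sketch (not the article's code) makes the migration cost concrete: counting how many keys change owners when a modulo-hashed cluster grows from 4 to 5 servers.

```java
public class ModuloRehash {
    // Counts keys whose node assignment changes when the cluster
    // grows from `before` to `after` servers under node = hash(X) mod S.
    static int movedKeys(int keys, int before, int after) {
        int moved = 0;
        for (int x = 0; x < keys; x++) {
            if (x % before != x % after) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = movedKeys(100_000, 4, 5);
        // 80,000 of 100,000 keys land on a different node after adding one server.
        System.out.println("moved: " + moved);
    }
}
```

Four out of every five keys must migrate when one server is added, which is exactly the problem the ring-based scheme below avoids.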

A Simple DHT: A DHT maps nodes onto a ring of size 2^32 (the size of the IPv4 address space). Nodes are hashed onto the ring, and each node is responsible for the interval between its predecessor and itself. The article uses the FNV hash algorithm to compute hash values. Data objects are hashed onto the same ring and stored on the first node encountered clockwise.

The initialization code (shown as images) creates a map for node metadata and a list of virtual nodes (vNodes). The addPhysicalNode method adds physical nodes, computes their hash values, and inserts them into the ring.
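Since the article's snippets appear only as images, here is a minimal sketch of the same idea. The `TreeMap` ring and the method name `addPhysicalNode` follow the text; `locate` and the exact FNV variant (32-bit FNV-1a) are assumptions.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    // Sorted positions on the 2^32 ring -> node id.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // 32-bit FNV-1a, widened to an unsigned long so it reads as a ring position.
    static long fnv1a(String key) {
        int h = (int) 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 16777619;
        }
        return Integer.toUnsignedLong(h);
    }

    public void addPhysicalNode(String nodeId) {
        ring.put(fnv1a(nodeId), nodeId);
    }

    // First node clockwise from the data's position; wrap to the ring start if none.
    public String locate(String dataKey) {
        SortedMap<Long, String> tail = ring.tailMap(fnv1a(dataKey));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        for (int n = 1; n <= 4; n++) ring.addPhysicalNode("node-" + n);
        System.out.println("item-42 -> " + ring.locate("item-42"));
    }
}
```

A sorted map keeps the clockwise lookup at O(log N), and adding or removing one node only moves the keys in that node's interval rather than reshuffling everything.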

Experiments:

• With four nodes and 100 inserted items, only one node receives data, demonstrating severe data skew.

• After inserting 1,000,000 items, all nodes receive data but the distribution remains uneven, with Node 3 handling 99% of requests.

The skew arises because node intervals on the ring are uneven; data that hashes into a large interval all map to the same node.
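The unevenness is easy to see by measuring the gaps directly. This sketch (assumed node ids, same 32-bit FNV-1a as above) hashes four node ids onto the ring and reports the size of each node's interval; with only four points, the gaps vary wildly.

```java
import java.util.TreeSet;

public class RingGaps {
    // 32-bit FNV-1a, widened to an unsigned long ring position.
    static long fnv1a(String key) {
        int h = (int) 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 16777619;
        }
        return Integer.toUnsignedLong(h);
    }

    // Clockwise interval sizes owned by each node; the gaps always sum to 2^32.
    static long[] intervals(String... nodeIds) {
        TreeSet<Long> positions = new TreeSet<>();
        for (String id : nodeIds) positions.add(fnv1a(id));
        long[] gaps = new long[positions.size()];
        long prev = positions.last() - (1L << 32); // wrap-around predecessor
        int i = 0;
        for (long p : positions) {
            gaps[i++] = p - prev;
            prev = p;
        }
        return gaps;
    }

    public static void main(String[] args) {
        for (long gap : intervals("node-1", "node-2", "node-3", "node-4")) {
            System.out.printf("interval: %.1f%% of the ring%n", 100.0 * gap / (1L << 32));
        }
    }
}
```

Whichever node owns the largest interval absorbs a proportionally large share of the data, which is the skew the experiments observed.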

Improvements:

1. Virtual Nodes: Each physical node is represented by multiple logical replicas (virtual nodes). Adding virtual nodes (e.g., 3 per physical node) spreads the hash space more evenly. Experiments with 10, 20, and 100 virtual nodes show progressively better balance, and with 1,000,000 virtual nodes the distribution becomes almost uniform.
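The virtual-node improvement only changes how nodes are inserted into the ring. A sketch under the same assumptions as before (the replica-key scheme `"node-1#0"`, `"node-1#1"`, ... is one common convention, not necessarily the article's):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class VirtualNodeRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    static long fnv1a(String key) {
        int h = (int) 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 16777619;
        }
        return Integer.toUnsignedLong(h);
    }

    // Each physical node is inserted under several replica keys,
    // every one of which maps back to the physical node.
    public void addPhysicalNode(String nodeId, int vNodeCount) {
        for (int i = 0; i < vNodeCount; i++) {
            ring.put(fnv1a(nodeId + "#" + i), nodeId);
        }
    }

    public String locate(String dataKey) {
        SortedMap<Long, String> tail = ring.tailMap(fnv1a(dataKey));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing();
        for (int n = 1; n <= 4; n++) ring.addPhysicalNode("node-" + n, 100);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            counts.merge(ring.locate("item-" + i), 1, Integer::sum);
        }
        // With 100 vnodes per node, counts sit far closer to 25,000 apiece
        // than with a single ring position each.
        System.out.println(counts);
    }
}
```

Many small intervals per physical node average out the gap sizes, which is why balance improves steadily as the vnode count grows.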

2. Load Boundary Factor: Even with many virtual nodes, some intervals may still be overloaded. A load‑boundary factor sets a maximum weight per node (e.g., average items per node + 1). When a node reaches this threshold, new items are assigned to the next node, further smoothing the load. Code snippets (shown as images) illustrate how to enable this switch and the resulting balanced distribution.
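One way the spill-to-next-node rule can be sketched (a hypothetical design consistent with the description, not the article's image code): walk clockwise from the item's position and skip any node already at its cap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BoundedRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    final Map<String, Integer> load = new HashMap<>();
    private final int maxWeight;

    BoundedRing(int maxWeight) {
        this.maxWeight = maxWeight;
    }

    static long fnv1a(String key) {
        int h = (int) 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 16777619;
        }
        return Integer.toUnsignedLong(h);
    }

    void addPhysicalNode(String nodeId, int vNodeCount) {
        load.put(nodeId, 0);
        for (int i = 0; i < vNodeCount; i++) {
            ring.put(fnv1a(nodeId + "#" + i), nodeId);
        }
    }

    // Clockwise walk from the item's position, skipping nodes already at maxWeight.
    String assign(String dataKey) {
        long pos = fnv1a(dataKey);
        List<String> clockwise = new ArrayList<>(ring.tailMap(pos).values());
        clockwise.addAll(ring.headMap(pos).values()); // wrap around the ring
        for (String node : clockwise) {
            if (load.get(node) < maxWeight) {
                load.merge(node, 1, Integer::sum);
                return node;
            }
        }
        throw new IllegalStateException("every node is at capacity");
    }

    public static void main(String[] args) {
        int items = 1_000, nodes = 4;
        BoundedRing ring = new BoundedRing(items / nodes + 1); // average per node + 1
        for (int n = 1; n <= nodes; n++) ring.addPhysicalNode("node-" + n, 10);
        for (int i = 0; i < items; i++) ring.assign("item-" + i);
        System.out.println(ring.load); // by construction no node exceeds 251
    }
}
```

The cap turns worst-case skew into a hard guarantee: however uneven the intervals, no node can hold more than the average plus one.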

Further Considerations: The article raises additional questions such as improving DHT read/write performance, ensuring high reliability, handling node failures, data backup, and avoiding replica concentration. While the discussion is introductory, it highlights DHT as a viable load‑balancing technique that leverages hash properties to distribute data and requests across a cluster, enhancing fault tolerance.

Tags: Big Data, Load Balancing, data skew, virtual nodes, DHT, Distributed Hash Table
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
