Big Data 11 min read

Mastering Data Sharding: Hash, Range, and Consistent Hash Techniques

This article explains core data sharding concepts and models—including hash‑based, range‑based, and consistent hashing—detailing their mappings, routing strategies, scalability considerations, and practical implementation examples for handling massive datasets in distributed systems.

21CTO
21CTO
21CTO
Mastering Data Sharding: Hash, Range, and Consistent Hash Techniques

In the big‑data era, enterprises often handle terabytes to petabytes of data, which cannot be stored or processed on a single machine. Distributed storage therefore relies on sharding (splitting data) and routing mechanisms to allocate data across multiple nodes. Sharding differs from replication, which provides high availability by copying data.

Data Sharding: Enables horizontal scaling by adding machines to increase capacity.

Data Replication: Provides high availability by storing multiple copies, but introduces consistency challenges during updates.

2. Sharding Models

A common model uses a two‑level mapping: the first level maps keys to partitions (many‑to‑one), and the second maps partitions to machines (many‑to‑one). This abstract model can represent both hash‑based and range‑based sharding.

3. Hash Sharding

3.1 Round Robin / Modulo

The hash value of a key is taken modulo the number of servers K: H(Key) = hash(key) mod K The result indicates the physical machine index. When servers are added or removed, K changes and most data must be remapped, causing heavy migration.

3.2 Virtual Buckets / Memory Table

Instead of direct modulo, a metadata table maps hash values to virtual buckets, and a second mapping assigns buckets to physical machines. This two‑layer mapping improves flexibility and reduces data movement during scaling.

3.3 Consistent Hashing

Each node receives a token placed on a hash ring. A key is stored on the first node whose token is equal to or follows the key’s hash value. This limits data movement to neighboring nodes when nodes join or leave.

Routing strategies include:

O(1) sequential lookup: each node knows its predecessor and successor; may traverse the entire ring (O(N) time).

O(N) routing table: each node keeps a table of size N, enabling O(log N) lookup.

Node join/leave updates predecessor/successor links and migrates affected data. Virtual nodes (multiple tokens per physical node) balance load and accommodate heterogeneous machines.

class Consistent {
    int[] circle; // ordered hash values of physical nodes
    int virtualNodes;
    HashMap<int, String> virtualMap; // virtual node to host
    HashMap<String, bool> members;
    ReentrantReadWriteLock syncMutex;
    public void add(string elt) { /*...*/ }
    public void remote(string elt) { /*...*/ }
    public string get(string key) { /*...*/ }
}

4. Range Sharding

Range sharding sorts all keys and divides the ordered key space into contiguous intervals (sub‑tables). Each sub‑table stores all records whose keys fall within its range, avoiding the scattering effect of hash distribution.

Systems such as Google Bigtable implement range sharding with a B+‑tree hierarchy, where leaf nodes are sub‑tables. Large or small sub‑tables may be split or merged to maintain balanced distribution.

5. References

Consistent Hash algorithm background: cnblogs.com/haippy/archive/2011/12/10/2282943.html

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big Datadata shardingHashingconsistent hashingrange partitioning
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.