
Mastering Database Sharding: When and How to Split Your Data

This article explains the concepts, types, advantages, and challenges of database sharding, including vertical and horizontal partitioning. It covers when to consider splitting; practical strategies for handling transactions, joins, pagination, and global IDs; and popular middleware for implementing sharding in production.

Java Interview Crash Guide

Dec 10, 2021


1. Data Partitioning

Relational databases can become system bottlenecks because a single machine’s storage, connections, and processing capacity are limited. When a table reaches tens of millions of rows or hundreds of gigabytes, even adding replicas or optimizing indexes may not prevent severe performance degradation, so sharding is required to reduce load and query time.


The core concepts of a distributed database are data sharding and, once the data is sharded, locating and aggregating it. Sharding distributes data across multiple databases, which reduces the amount stored on each node and lets performance scale by adding hosts.

Sharding can be classified into two types: vertical (or column) sharding and horizontal (or row) sharding.

1. Vertical (column) sharding

Vertical sharding includes vertical database splitting and vertical table splitting.

Vertical database splitting stores low‑coupling tables in separate databases, similar to breaking a large system into multiple smaller systems or microservices, each with its own database.

Vertical table splitting moves rarely used or large columns into an extension table. Smaller rows mean more rows fit in each page, which improves the memory cache hit rate and avoids rows that spill across pages and cost extra I/O.

Advantages of vertical sharding:

Reduces business coupling and clarifies responsibilities.

Facilitates management, monitoring, and scaling similar to microservice governance.

Improves I/O, connection count, and hardware utilization under high concurrency.

Drawbacks:

Some tables cannot be joined directly, requiring API aggregation and increasing development complexity.

Distributed transaction handling becomes more complex.

Large tables may still need horizontal sharding.

2. Horizontal (row) sharding

When vertical sharding is insufficient or a table’s row count is massive, horizontal sharding is needed.

Horizontal sharding can be “in‑database table splitting” or “database‑and‑table splitting”, distributing rows of a single logical table across multiple databases or tables based on a sharding key.

In‑database table splitting reduces the size of a single table but does not alleviate pressure on the underlying physical machine; full database‑and‑table sharding is preferred.

Advantages of horizontal sharding:

Eliminates single‑database performance bottlenecks, improving stability and load capacity.

Requires minimal changes to the application layer.

Drawbacks:

Cross‑shard transaction consistency is hard to guarantee.

Cross‑database joins perform poorly.

Data expansion and maintenance become more complex.

Typical sharding rules include:

1. Range‑based sharding

Data is split by time intervals or ID ranges (e.g., userId 1‑9999 in the first shard, 10000‑19999 in the second). This keeps shard size controllable and simplifies horizontal scaling.

Advantages: predictable shard size, easy expansion, fast range queries.

Disadvantages: hot‑spot risk when recent data concentrates in a few shards.
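The range rule above can be sketched as a lookup over sorted boundaries. This is a minimal illustration, not a production router: the boundary values and the names RANGE_UPPER_BOUNDS and range_shard are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds of each shard's userId range.
# Shard 0 holds ids below 10000, shard 1 holds 10000-19999, and so on.
RANGE_UPPER_BOUNDS = [10_000, 20_000, 30_000]


def range_shard(user_id: int, upper_bounds=RANGE_UPPER_BOUNDS) -> int:
    """Return the index of the shard whose range contains user_id."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # bisect_right counts how many bounds are <= user_id, which is the shard index.
    index = bisect_right(upper_bounds, user_id)
    if index == len(upper_bounds):
        raise ValueError(f"no shard configured for user_id {user_id}")
    return index
```

Expanding capacity in this sketch means appending a new bound to the list; ids that were already routable keep the same shard index.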

2. Mod‑based sharding

Rows are assigned to shards using a hash modulo of a key (e.g., customer number % 4). This yields relatively uniform data distribution.

Advantages: balanced load, reduced hot‑spot risk.

Disadvantages: data migration is required when adding shards; cross‑shard queries become complex.
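A minimal sketch of the modulo rule, assuming four shards as in the example above. The function names are hypothetical, and zlib.crc32 is just one possible stable hash for string keys; it is used here because Python's built-in hash() for strings is salted per process and would route the same key differently after a restart.

```python
import zlib

SHARD_COUNT = 4  # matches the "% 4" example in the text


def mod_shard(customer_no: int, shard_count: int = SHARD_COUNT) -> int:
    """Route a numeric key by plain modulo."""
    return customer_no % shard_count


def hash_mod_shard(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Route a string key: hash it to a stable integer, then take the modulo."""
    return zlib.crc32(key.encode("utf-8")) % shard_count
```

With sequential numeric keys, each shard in this sketch receives the same number of rows, which is the balanced distribution the rule is chosen for.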

2. Issues Introduced by Sharding

Sharding alleviates single‑machine bottlenecks but brings new challenges.

1. Transaction consistency

Distributed transactions

Cross‑shard updates require XA or two‑phase commit, which adds latency and increases the chance of conflicts or deadlocks as the number of nodes grows.
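The two-phase flow can be illustrated with a toy in-memory coordinator. This is a sketch of the voting logic only: the Participant class and two_phase_commit function are hypothetical, and a real XA implementation also has to handle timeouts, durable logs, and coordinator failure, none of which appear here.

```python
class Participant:
    """A toy shard that stages a write during prepare and applies it on commit."""

    def __init__(self, name: str, can_prepare: bool = True):
        self.name = name
        self.can_prepare = can_prepare
        self.data = {}
        self._staged = None

    def prepare(self, key, value) -> bool:
        if not self.can_prepare:
            return False  # vote "no"
        self._staged = (key, value)
        return True  # vote "yes"

    def commit(self) -> None:
        if self._staged is not None:
            key, value = self._staged
            self.data[key] = value
            self._staged = None

    def rollback(self) -> None:
        self._staged = None


def two_phase_commit(writes) -> bool:
    """writes: list of (participant, key, value). Returns True if committed."""
    prepared = []
    # Phase 1: ask every participant to prepare and collect the votes.
    for participant, key, value in writes:
        if participant.prepare(key, value):
            prepared.append(participant)
        else:
            # A single "no" vote aborts the transaction on every shard.
            for p in prepared:
                p.rollback()
            return False
    # Phase 2: everyone voted yes, so tell each participant to commit.
    for p in prepared:
        p.commit()
    return True
```

Every write waits for the slowest participant's vote before anything is applied, which is where the added latency mentioned above comes from.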

Eventual consistency

For systems tolerant of delayed consistency, compensation mechanisms (reconciliation, log comparison, periodic sync) can be used instead of strict atomic rollback.

2. Cross‑shard joins

After sharding, joins may span multiple nodes, degrading performance. Solutions include:

Global tables duplicated in each shard.

Field redundancy (denormalization).

Two‑step data assembly: query IDs first, then fetch related data.

ER‑based sharding: keep related tables in the same shard.
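As an illustration of the two-step assembly approach from the list above, the sketch below joins orders to users in application code. The in-memory ORDERS_DB and USERS_DB structures and all function names are assumptions standing in for queries against two separate databases.

```python
# Hypothetical stand-ins for two tables that live in different databases.
ORDERS_DB = [
    {"order_id": 1, "uid": 101, "amount": 30},
    {"order_id": 2, "uid": 102, "amount": 45},
    {"order_id": 3, "uid": 101, "amount": 12},
]
USERS_DB = {
    101: {"uid": 101, "nickname": "ann"},
    102: {"uid": 102, "nickname": "bob"},
}


def fetch_orders(min_amount):
    """Step 1: query the driving table on its own database."""
    return [o for o in ORDERS_DB if o["amount"] >= min_amount]


def fetch_users(uids):
    """Step 2: one batched lookup for the related rows (like WHERE uid IN (...))."""
    return {uid: USERS_DB[uid] for uid in uids if uid in USERS_DB}


def orders_with_nickname(min_amount):
    orders = fetch_orders(min_amount)
    uids = {o["uid"] for o in orders}  # de-duplicate ids before the second query
    users = fetch_users(uids)
    # Assemble the joined rows in application code.
    return [
        {**o, "nickname": users.get(o["uid"], {}).get("nickname")}
        for o in orders
    ]
```

Batching the second lookup keeps the number of round trips at two regardless of how many rows the first step returns.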

3. Cross‑shard pagination, sorting, and aggregation

When sorting or paginating across shards, each shard must sort locally, then results are merged, which can be CPU‑ and memory‑intensive, especially for deep pages.
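A minimal sketch of the merge step, assuming each shard returns rows already sorted ascending. The function name and the list-of-lists input are illustrative assumptions. It also shows why deep pages are expensive: every shard has to supply its first offset + limit rows, because the coordinator cannot know in advance which shard holds the requested page.

```python
import heapq
from itertools import islice


def paginate_across_shards(shards, offset: int, limit: int):
    """shards: list of lists, each sorted ascending (the shard's local ORDER BY)."""
    needed = offset + limit
    # Equivalent to running "ORDER BY ... LIMIT offset + limit" on every shard.
    partials = [rows[:needed] for rows in shards]
    # k-way merge of the sorted partial results.
    merged = heapq.merge(*partials)
    # Only now can the global offset be skipped.
    return list(islice(merged, offset, offset + limit))
```

For a page at offset 10,000 with 8 shards, this approach pulls up to roughly 80,000 rows to the coordinator in order to return one page.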

Aggregations (MAX, MIN, SUM, COUNT) require executing the function on each shard and then merging the results.
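The per-shard-then-merge pattern can be sketched as follows; the function names and dictionary layout are assumptions for the example. One detail worth noting, although the text does not list AVG: an average has to be rebuilt from the merged SUM and COUNT, since averaging the per-shard averages gives a wrong answer when shards hold different row counts.

```python
def local_aggregate(values):
    """What each shard computes on its own rows."""
    values = list(values)
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values, default=None),
        "max": max(values, default=None),
    }


def merge_aggregates(partials):
    """Combine per-shard results into the global result."""
    non_empty = [p for p in partials if p["count"] > 0]
    total_count = sum(p["count"] for p in non_empty)
    total_sum = sum(p["sum"] for p in non_empty)
    return {
        "count": total_count,                                   # COUNT: add the counts
        "sum": total_sum,                                       # SUM: add the sums
        "min": min((p["min"] for p in non_empty), default=None),  # MIN of the minima
        "max": max((p["max"] for p in non_empty), default=None),  # MAX of the maxima
        "avg": total_sum / total_count if total_count else None,
    }
```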

4. Global primary‑key generation

Auto-increment IDs are unsuitable across shards because each shard generates values independently, so the same ID would appear on several shards. Common strategies:

UUID

Universally unique identifiers are easy to generate but large and index‑unfriendly.

Sequence table

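-- Ticket table: the auto-increment id of the newest row is the next global ID.
CREATE TABLE `sequence` (
    `id` bigint(20) unsigned NOT NULL auto_increment,
    `stub` char(1) NOT NULL default '',
    PRIMARY KEY (`id`),
    UNIQUE KEY `stub` (`stub`)  -- at most one row per stub value
) ENGINE=MyISAM;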

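Each stub row yields a global ID via REPLACE and LAST_INSERT_ID(). REPLACE deletes the existing row with the same stub and inserts a new one, which receives the next auto-increment value; LAST_INSERT_ID() then returns that value for the current connection.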

REPLACE INTO sequence (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

Snowflake algorithm

Twitter’s 64-bit ID combines a 41-bit millisecond timestamp, a 5-bit datacenter ID, a 5-bit worker ID, and a 12-bit per-millisecond sequence, providing high throughput and roughly time-ordered IDs.
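A minimal Snowflake-style generator might look like the sketch below. It assumes the commonly cited layout (41-bit timestamp, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence) and Twitter's custom epoch; the class name and the injectable clock parameter are assumptions added for illustration, and this version simply raises if the clock moves backwards rather than trying to recover.

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's custom epoch, in milliseconds


class SnowflakeGenerator:
    def __init__(self, datacenter_id: int, worker_id: int, clock=None):
        if not 0 <= datacenter_id < 32:
            raise ValueError("datacenter_id must fit in 5 bits")
        if not 0 <= worker_id < 32:
            raise ValueError("worker_id must fit in 5 bits")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        # clock is injectable for testing; by default use wall-clock milliseconds.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the 12-bit sequence.
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 ids already issued this millisecond; wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << 22)      # timestamp in the high bits
                | (self.datacenter_id << 17)  # 5-bit datacenter id
                | (self.worker_id << 12)      # 5-bit worker id
                | self._sequence              # 12-bit sequence
            )
```

Because the timestamp occupies the high bits, IDs from one generator sort in creation order, which keeps B-tree inserts close to append-only.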

5. Data migration and scaling

When traffic grows, data must be migrated to new shards according to the chosen rule. Range‑based sharding allows adding nodes without moving existing data; mod‑based sharding requires rehashing and data movement.
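To make the cost of rehashing concrete, the sketch below counts how many keys change shard when the shard count changes under plain modulo routing. The function name and the use of sequential integer keys are assumptions for the example.

```python
def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_count != k % new_count)
    return moved / len(keys)
```

For sequential keys, going from 4 to 5 shards relocates 80% of the rows, while doubling from 4 to 8 relocates 50%. That difference is one reason doubling is a common expansion step for modulo-sharded systems.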

3. When to Consider Sharding

Do not shard prematurely; first try hardware upgrades, read/write splitting, and index optimization. Sharding becomes necessary when:

Table size reaches a bottleneck (e.g., >10 million rows or >100 GB).

Backup or DDL operations cause unacceptable downtime.

Lock contention degrades performance.

Business growth demands vertical separation of rarely used columns.

Rapid data growth threatens system stability.

Higher availability is required, and splitting the data limits the impact of any single database failing.

4. Case Study: User Center

The user table (User(uid, login_name, passwd, sex, age, nickname)) illustrates typical sharding decisions. Front‑end services need low‑latency lookups by login_name or uid, while back‑office analytics require batch queries on age, gender, etc.

Horizontal sharding methods

Range‑based sharding splits users by uid ranges, simplifying scaling but risking uneven load. Mod‑based sharding distributes users evenly but makes expansion harder.

Handling non‑uid queries

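Approaches include a mapping table (login_name → uid), typically held in cache and consulted before routing by uid, or a “sharding gene”: bits derived from the non-uid field are embedded in the uid when the user is created, so a query on either field routes directly to the same shard.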
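The gene idea can be sketched as follows, assuming 8 shards (a power of two, so the gene is the low 3 bits of the uid) and crc32 as the hash of login_name. The constants and function names are hypothetical; the point is only that the uid's low bits and the login_name's hash agree, so both route to the same shard.

```python
import zlib

GENE_BITS = 3
SHARD_COUNT = 1 << GENE_BITS  # 8 shards, assumed to be a power of two


def gene_of(login_name: str) -> int:
    """Derive the gene: the low GENE_BITS bits of a stable hash of login_name."""
    return zlib.crc32(login_name.encode("utf-8")) & (SHARD_COUNT - 1)


def make_uid(sequence_no: int, login_name: str) -> int:
    """Build a uid whose low bits carry the gene of login_name."""
    return (sequence_no << GENE_BITS) | gene_of(login_name)


def shard_by_uid(uid: int) -> int:
    return uid % SHARD_COUNT


def shard_by_login_name(login_name: str) -> int:
    # No mapping lookup needed: the gene alone identifies the shard.
    return gene_of(login_name) % SHARD_COUNT
```

A login request for "alice" and a later lookup by her uid both land on the same shard, without a mapping table in between.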

5. Sharding Middleware

sharding-jdbc (Dangdang)

TSharding (Mogujie)

Atlas (360)

Cobar (Alibaba)

MyCAT (based on Cobar)

Oceanus (58.com)

Vitess (Google)
