Databases 26 min read

Mastering Data Sharding: When and How to Split Your Database for Scale

This article explains the concepts, benefits, and challenges of database sharding—including vertical and horizontal partitioning, common splitting strategies, handling distributed transactions, cross‑node joins, pagination, global primary keys, and migration—while offering practical guidelines on when to adopt sharding and which tools can assist.

Programmer DD

Dec 12, 2019

Mastering Data Sharding: When and How to Split Your Database for Scale

1. Data Splitting

Relational databases can become bottlenecks when a single table reaches massive size (e.g., 10 million rows or 100 GB). Sharding (data splitting) reduces the load on each node, shortens query time, and improves overall performance.

Sharding involves distributing data across multiple databases and locating/aggregating the data after the split.

There are two main types of splitting: vertical (column‑wise) and horizontal (row‑wise).

1.1 Vertical (Column‑wise) Splitting

Vertical splitting includes vertical database splitting and vertical table splitting.

Vertical database splitting stores low‑coupling tables in separate databases, similar to micro‑service governance.

Vertical table splitting moves rarely used or large columns to an extension table, reducing row size, improving memory cache hit rate, and decreasing disk I/O.

Solves business coupling, clarifies responsibilities.

Aligns with micro‑service governance, enabling independent monitoring and scaling.

Improves I/O and connection limits under high concurrency.

Drawbacks:

Some tables cannot be joined; aggregation must be done via APIs, increasing development complexity.

Distributed transaction handling becomes more complex.

Large tables may still need horizontal splitting.

1.2 Horizontal (Row‑wise) Splitting

When vertical splitting is insufficient, horizontal splitting distributes rows of a single table across multiple databases or tables.

In‑node table splitting reduces the size of a single table but does not alleviate pressure on a single physical machine; therefore, splitting across databases is preferred.

No single‑node data volume bottleneck; improves stability and load capacity.

Application changes are minimal; no need to refactor business modules.

Drawbacks:

Cross‑shard transaction consistency is hard to guarantee.

Cross‑database joins perform poorly.

Data expansion and maintenance become more complex.

Typical Sharding Rules

1. By Value Range

Split data by time intervals or ID ranges (e.g., month, day, or userId 1‑9999 → DB1, 10000‑20000 → DB2). This also supports hot‑cold data separation.

Table size remains controllable.

Horizontal scaling is natural; adding nodes requires no data migration.

Range queries can quickly locate the target shard.

Drawback: Hotspot data may concentrate on recent partitions.

2. By Modulo (Hash)

Use hash modulo to assign rows to shards (e.g., Customer.cusno % 4 → 4 databases). Queries containing the sharding key can directly locate the shard.

Data distribution is uniform, reducing hotspot risk.

Drawbacks:

Re‑hashing is required when adding nodes.

Cross‑shard queries become complex and may need to query all shards.

2. Problems Introduced by Sharding

2.1 Transaction Consistency

Updates spanning multiple databases cause distributed transaction issues. Two‑phase commit (XA) or similar protocols are typically used, but they increase latency and conflict probability.

2.2 Cross‑Node Join Issues

Before sharding, many list/detail queries could be satisfied with SQL joins. After sharding, data may reside on different nodes, making joins expensive; avoiding joins is recommended.

Solutions

Global tables (data dictionaries) duplicated in each database.

Field redundancy (denormalization) to avoid joins.

Data assembly: two‑step queries—first fetch IDs, then fetch related data.

ER‑based sharding: keep related tables in the same shard.

2.3 Cross‑Node Pagination, Sorting, and Aggregation

When paging across shards, each node must sort and return its portion, then the results are merged and re‑sorted, which can be CPU‑ and memory‑intensive for large page numbers.

Aggregate functions (MAX, MIN, SUM, COUNT) must be executed on each shard and then combined.

2.4 Global Primary Key Collision

Auto‑increment IDs are not globally unique across shards. Common solutions include:

UUID (simple but large and index‑unfriendly).

Dedicated sequence table with a unique stub column.

Multiple ID‑generation servers with stepped auto‑increment values.

Batch‑fetching ID blocks to reduce DB load.

Snowflake algorithm (64‑bit IDs with timestamp, datacenter, worker, and sequence bits).

Leaf (Meituan‑Dianping) distributed ID service.

Sequence Table Example

CREATE TABLE `sequence` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `stub` char(1) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;

Insert and retrieve a new ID:

REPLACE INTO sequence (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

2.5 Data Migration and Scaling

When sharding is introduced, historical data must be migrated according to the chosen rule. Capacity planning should aim for < 10 million rows per shard.

Range‑based sharding allows easy node addition without data migration, while modulo‑based sharding requires re‑hashing.

3. When to Consider Sharding

Avoid sharding if the table size is manageable; first try hardware upgrades, read/write separation, and index optimization.

When a single table reaches a bottleneck that impacts operations (backup, DDL, lock contention).

When rapid business growth leads to massive data growth and high concurrency.

When vertical splitting of specific columns can reduce write pressure.

When separating business domains improves availability ("don’t put all eggs in one basket").

4. Case Study – User Center

Core table: User(uid, login_name, passwd, sex, age, nickname). High‑frequency updates on last_login_time cause write pressure. Solution: vertically split into user_base (static fields) and user_time (login timestamps). Large personal_info column is moved to user_ext.

4.1 Horizontal Splitting Methods

By value range (e.g., uid 0‑10 M → DB1, 10 M‑20 M → DB2) – easy to scale but may cause load imbalance.

By modulo (e.g., uid % 2) – uniform distribution but scaling requires re‑hashing.

4.2 Non‑UID Queries

Build a mapping table login_name → uid (or cache) to locate the correct shard for non‑primary‑key lookups.

4.3 Front‑End / Back‑End Separation

User‑facing services keep low‑latency access to real‑time data, while operational analytics use separate services/databases (or ES/Hive) to avoid affecting user experience.

5. Sharding Middleware

sharding-jdbc (Dangdang)

TSharding (Mogujie)

Atlas (Qihoo 360)

Cobar (Alibaba)

MyCAT (based on Cobar)

Oceanus (58.com)

Vitess (Google)

6. References

Database Distributed Architecture – Sharding Overview

Sharding Principles and Strategies

Horizontal Sharding Key Steps and Issues

Sharding from Principles to Challenges

Leaf – Meituan Distributed ID System

Architect’s Road (WeChat public account)

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

database sharding distributed transactions Horizontal Partitioning Vertical Partitioning global primary key

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.