Databases 28 min read

Mastering Data Sharding: When and How to Split Your Database

This article explains the concepts, types, advantages, and drawbacks of vertical and horizontal database sharding, discusses the challenges such as distributed transactions, cross‑node joins, pagination, global primary‑key generation, migration, and provides practical guidelines and case studies for deciding when and how to apply sharding.

Java Interview Crash Guide

Dec 17, 2020

Mastering Data Sharding: When and How to Split Your Database

1. Data Partitioning

Relational databases often become system bottlenecks because a single machine has limited storage capacity, connection count, and processing power. When a single table reaches 10 million rows or 100 GB, performance degrades severely even after adding read replicas or optimizing indexes. At this point, data partitioning (sharding) is needed to reduce the load on each database and shorten query times.

Distributed database core concepts are data partitioning (sharding) and locating/aggregating data after partitioning. Partitioning spreads data across multiple databases, reducing the amount of data per instance and improving performance by adding more hosts.

Data partitioning can be divided into two types: vertical (column‑wise) and horizontal (row‑wise).

1) Vertical (column‑wise) Partitioning

Vertical partitioning commonly includes vertical database splitting and vertical table splitting.

Vertical database splitting stores low‑coupling tables in different databases, similar to breaking a large system into multiple smaller systems based on business domains. Each microservice can use its own database, resembling microservice governance.

Vertical table splitting separates rarely used or large columns into an extension table. Splitting a wide table (e.g., >100 columns) into smaller tables improves development, maintenance, and reduces page splits. Shorter rows increase memory cache hit rates and reduce disk I/O.

Solves business‑level coupling, making business clearer.

Similar to microservice governance, enabling hierarchical management, monitoring, and scaling of different business data.

In high‑concurrency scenarios, vertical partitioning can improve I/O, connection count, and alleviate single‑machine hardware bottlenecks.

Drawbacks:

Some tables cannot be joined; aggregation must be done via APIs, increasing development complexity.

Distributed transaction handling becomes more complex.

Large tables may still exist and require horizontal partitioning.

2) Horizontal (row‑wise) Partitioning

When vertical partitioning cannot further reduce granularity or when a single table still contains massive rows, horizontal partitioning is required to address read/write and storage bottlenecks.

Horizontal partitioning includes intra‑database table splitting and inter‑database table splitting. The same logical table is divided into multiple physical tables or databases based on a sharding key, reducing the size of each table.

Intra‑database table splitting only reduces the size of a single table but does not distribute data across machines, so the performance gain is limited. Full database‑level sharding is preferred.

Advantages of horizontal partitioning:

No single‑database size or high‑concurrency bottleneck; system stability and load capacity improve.

Application changes are minimal; business modules need not be split.

Drawbacks:

Cross‑shard transaction consistency is hard to guarantee.

Cross‑database join performance degrades.

Data expansion and maintenance become more complex.

Typical sharding rules:

1) Range‑based Sharding

Data is divided by time intervals or ID ranges (e.g., dates to different databases, userId 1‑9999 to DB1, 10000‑19999 to DB2). This also supports hot‑cold data separation.

Advantages:

Table size is controllable.

Horizontal scaling is natural; adding a node does not require data migration.

Range queries can quickly locate the shard.

Disadvantage:

Hotspot risk: recent time‑range shards may receive far more traffic than older shards.

2) Modulo‑based Sharding

Typically uses hash modulo (e.g., Customer table sharded by cusno % 4). Queries with the sharding key can directly locate the target database.

Advantages:

Data is evenly distributed, reducing hotspot and concurrency bottlenecks.

Disadvantages:

When expanding the cluster, existing data must be migrated; consistent hashing can mitigate this.

Cross‑shard queries become complex if the query condition does not contain the sharding key.

2. Problems Introduced by Sharding

Sharding alleviates single‑machine bottlenecks but brings several technical challenges.

1) Transaction Consistency Issues

Distributed Transactions

When updates span multiple databases, cross‑database transactions arise. Common solutions include the XA protocol and two‑phase commit, which ensure atomicity but increase latency and risk of deadlocks as the number of nodes grows.

Eventual Consistency

For systems tolerant to latency, eventual consistency can be achieved via compensation transactions, data reconciliation, or periodic synchronization with a source of truth.

2) Cross‑Node Join Problems

Before sharding, many list/detail pages could be satisfied with SQL joins. After sharding, data may reside on different nodes, making joins expensive or impossible. Mitigation strategies:

Global tables : Duplicate low‑change reference tables in each database.

Field redundancy : Store frequently accessed fields redundantly (e.g., store userName in the order table).

Data assembly : Perform two‑step queries—first fetch IDs, then retrieve related data and assemble in the application.

ER sharding : Keep tables with strong relationships in the same shard (e.g., order and order_detail by orderId).

3) Cross‑Node Pagination, Sorting, and Functions

When querying across multiple shards, pagination (LIMIT) and ORDER BY become complex. If the sort key is the sharding key, locating the shard is easy; otherwise, each shard must sort locally, return results, and the coordinator must merge and re‑sort them, which is CPU‑ and memory‑intensive for large page numbers.

Aggregate functions (MAX, MIN, SUM, COUNT) must be executed on each shard, then the partial results are combined and aggregated again.

4) Global Primary‑Key Duplication

Auto‑increment IDs are no longer globally unique across shards. Common strategies:

1) UUID

Standard UUID (36 characters) is easy to generate locally but consumes storage and indexes poorly, especially in InnoDB where randomness causes page splits.

2) Sequence Table

CREATE TABLE `sequence` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `stub` char(1) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;

Insert a row with a constant stub and retrieve LAST_INSERT_ID() in the same connection to obtain a globally unique ID.

REPLACE INTO `sequence` (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

This approach avoids large tables but creates a single point of failure; using master‑slave can improve availability but still limits throughput.

3) Distributed Incremental ID (Snowflake)

Twitter's Snowflake generates 64‑bit IDs composed of:

1 unused bit

41 bits for millisecond timestamp (≈69 years)

5 bits datacenter ID + 5 bits worker ID (max 1024 nodes)

12 bits sequence within the same millisecond (4096 IDs per node per ms)

Advantages: time‑ordered IDs, no external service, high QPS (~4 million/s). Drawback: clock rollback can cause duplicate IDs.

Industry‑grade solution: Meituan‑Dianping's Leaf system, which combines database‑based sequences with high availability and clock‑drift handling.

5) Data Migration and Scaling

When sharding is introduced, historical data must be migrated to the new shards. Capacity planning should aim for each shard to hold ≤10 million rows. Range‑based sharding simplifies scaling (just add a node). Modulo‑based sharding requires data re‑hashing during expansion.

3. When to Consider Sharding

Do not shard unless necessary; first try hardware upgrades, read/write splitting, index optimization.

When table size causes operational issues such as long backup times, lengthy DDL locks, or frequent lock waits.

When business growth demands vertical splitting of certain columns (e.g., moving rarely used large text fields to separate tables).

When rapid data growth makes a single table a performance bottleneck; plan sharding rules and capacity ahead.

For reliability: vertical splitting isolates unrelated business data; horizontal splitting ensures a failure affects only a subset of users.

4. Case Study – User Center

1) Business Scenario

The user center provides registration, login, and profile management. Core table:

User(uid, login_name, passwd, sex, age, nickname)

uid is the primary key.

2) Horizontal Sharding Methods

Range‑based sharding by uid (e.g., uid 0‑10 M in DB1, 10 M‑20 M in DB2) is easy to expand but may cause load imbalance because newer users are more active.

Modulo‑based sharding (uid % N) distributes load evenly but makes scaling harder because re‑hashing is required.

3) Queries Not Based on uid

For login_name queries, the system can:

Maintain a mapping table or cache from login_name to uid, then route by uid.

Use a "gene" function: embed a hash of login_name into the low bits of the generated uid, allowing direct routing without extra storage.

Separate front‑end and back‑end services: front‑end (user‑facing) uses the mapping approach; back‑end (operations) can use a dedicated service or database, possibly backed by Elasticsearch or Hive for complex analytics.

5. Sharding Middleware Support

sharding‑jdbc (Dangdang)

TSharding (Mogujie)

Atlas (Qihoo 360)

Cobar (Alibaba)

MyCAT (based on Cobar)

Oceanus (58.com)

Vitess (Google)

6. References

Database Distributed Architecture Primer – Sharding and Its Applicability to Core Banking Systems

Concepts of Sharding

Key Steps and Challenges of Horizontal Sharding

Principles, Solutions, Strategies, and Difficulties of Sharding

Leaf – Meituan‑Dianping Distributed ID Generation System

Architect’s Path Public Account

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Sharding distributed transactions Vertical Sharding Horizontal Sharding database partitioning global IDs

Written by

Java Interview Crash Guide

Dedicated to sharing Java interview Q&A; follow and reply "java" to receive a free premium Java interview guide.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.