Databases 13 min read

When to Shard Databases and How to Do It Effectively

This article explains why database sharding is needed when performance bottlenecks appear, outlines the criteria and timing for adopting sharding, and provides detailed step‑by‑step guidance on both database‑level and table‑level partitioning, including common pitfalls and solutions.

ITPUB

Jun 4, 2023

When to Shard Databases and How to Do It Effectively

Why Use Sharding?

Sharding is applied when a database reaches a performance bottleneck that cannot be solved by simple hardware upgrades.

High concurrency leads to connection exhaustion and request blocking.

SQL queries become slow, especially full‑table scans on tables with billions of rows.

Rapid data growth strains storage capacity.

Typical bottlenecks involve CPU, disk, memory, or network. Upgrading hardware often yields low ROI, so software‑level solutions are preferred.

When to Shard?

Sharding should be considered only after conventional software optimizations (SQL tuning, indexing, read/write splitting, clustering) have been exhausted.

What criteria determine the need for sharding?

At what data volume does a single table require sharding?

How fast does data grow before sharding becomes essential?

Alibaba’s development handbook recommends sharding when a single table exceeds 5 million rows or 2 GB of data. If projected growth over the next three years will not reach these thresholds, avoid pre‑emptive sharding.

Sharding Approaches

Database Sharding

When multiple services initially share a single database, the database can become a contention point as traffic grows. The system can be split into separate databases per service.

Table Sharding

Table sharding is used when a single table’s data grows rapidly (e.g., an order table receiving hundreds of thousands of rows daily).

Horizontal vs. vertical splitting.

Single‑database vs. multi‑database splitting.

Vertical Splitting

Vertical splitting separates a logical entity into multiple tables with different schemas (e.g., base attributes, specification attributes, extended attributes).

Horizontal Splitting

Horizontal splitting divides rows while keeping the schema identical. Example: user1 stores rows with odd IDs, user2 stores rows with even IDs.

Time‑based horizontal strategies include:

Daily table – stores only the current day’s data.

Monthly table – a scheduled job migrates the previous day’s data to a monthly table.

History table – data older than 30 days is moved to a history table.

Characteristics:

Vertical split – based on tables/fields, different schemas.

Horizontal split – based on rows, same schema, partitioned by a column value.

Single‑Database vs. Multi‑Database Horizontal Splitting

In a single database, a large table can be divided into user1, user2, etc., to reduce query load.

For larger scale, sub‑tables are distributed across multiple databases with routing rules determining the target shard.

Complexities Introduced by Sharding

Cross‑Database Joins

Field redundancy – duplicate join keys in a primary table.

Data abstraction – use ETL to aggregate data into a global table.

Global tables – maintain a copy of reference data in every shard.

Application‑level assembly – fetch base data and combine in code.

Co‑locating related data – shard by a common attribute (e.g., user ID) so related rows reside in the same database.

Distributed Transactions

When multiple databases are involved, local transactions are insufficient. Common approaches are:

MQ‑based eventual consistency.

XA two‑phase commit.

Flexible (elastic) transactions.

Open‑source implementations include Seata (recommended) and TX‑LCN.

Sorting, Pagination, and Function Execution

Keywords such as ORDER BY and LIMIT must be applied on each shard, then results are merged. ShardingSphere provides robust support for these scenarios.

Distributed ID Generation

Unique identifiers across shards can be generated by:

UUID.

Global auto‑increment table.

Redis‑based ID generator.

Snowflake algorithm.

Baidu uid‑generator (Snowflake variant).

Meituan Leaf (Snowflake variant).

Didi Tinyid.

Multi‑DataSource Access

Middleware that abstracts multiple databases includes: ShardingSphere – Apache‑level, mature, with extensive documentation. MyCat – community‑driven, less actively maintained.

Conclusion

Do not adopt sharding prematurely. First evaluate whether conventional optimizations—SQL tuning, indexing, read/write splitting, and hardware upgrades—can resolve the issue. Sharding can break performance ceilings but introduces significant complexity; it should be applied only when the system truly requires it.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization database sharding Horizontal Partitioning ID Generation

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.