When to Shard Databases and How to Do It Effectively
This article explains why database sharding is needed when performance bottlenecks appear, outlines the criteria and timing for adopting sharding, and provides detailed step‑by‑step guidance on both database‑level and table‑level partitioning, including common pitfalls and solutions.
Why Use Sharding?
Sharding is applied when a database reaches a performance bottleneck that cannot be solved by simple hardware upgrades.
High concurrency leads to connection exhaustion and request blocking.
SQL queries become slow, especially full‑table scans on tables with billions of rows.
Rapid data growth strains storage capacity.
Typical bottlenecks involve CPU, disk, memory, or network. Upgrading hardware often yields low ROI, so software‑level solutions are preferred.
When to Shard?
Sharding should be considered only after conventional software optimizations (SQL tuning, indexing, read/write splitting, clustering) have been exhausted.
What criteria determine the need for sharding?
At what data volume does a single table require sharding?
How fast does data grow before sharding becomes essential?
Alibaba’s development handbook recommends sharding when a single table exceeds 5 million rows or 2 GB of data. If projected growth over the next three years will not reach these thresholds, avoid pre‑emptive sharding.
Sharding Approaches
Database Sharding
When multiple services initially share a single database, the database can become a contention point as traffic grows. The system can be split into separate databases per service.
Table Sharding
Table sharding is used when a single table’s data grows rapidly (e.g., an order table receiving hundreds of thousands of rows daily).
Horizontal vs. vertical splitting.
Single‑database vs. multi‑database splitting.
Vertical Splitting
Vertical splitting separates a logical entity into multiple tables with different schemas (e.g., base attributes, specification attributes, extended attributes).
Horizontal Splitting
Horizontal splitting divides rows while keeping the schema identical. Example: user1 stores rows with odd IDs, user2 stores rows with even IDs.
Time‑based horizontal strategies include:
Daily table – stores only the current day’s data.
Monthly table – a scheduled job migrates the previous day’s data to a monthly table.
History table – data older than 30 days is moved to a history table.
Characteristics:
Vertical split – based on tables/fields, different schemas.
Horizontal split – based on rows, same schema, partitioned by a column value.
Single‑Database vs. Multi‑Database Horizontal Splitting
In a single database, a large table can be divided into user1, user2, etc., to reduce query load.
For larger scale, sub‑tables are distributed across multiple databases with routing rules determining the target shard.
Complexities Introduced by Sharding
Cross‑Database Joins
Field redundancy – duplicate join keys in a primary table.
Data abstraction – use ETL to aggregate data into a global table.
Global tables – maintain a copy of reference data in every shard.
Application‑level assembly – fetch base data and combine in code.
Co‑locating related data – shard by a common attribute (e.g., user ID) so related rows reside in the same database.
Distributed Transactions
When multiple databases are involved, local transactions are insufficient. Common approaches are:
MQ‑based eventual consistency.
XA two‑phase commit.
Flexible (elastic) transactions.
Open‑source implementations include Seata (recommended) and TX‑LCN.
Sorting, Pagination, and Function Execution
Keywords such as ORDER BY and LIMIT must be applied on each shard, then results are merged. ShardingSphere provides robust support for these scenarios.
Distributed ID Generation
Unique identifiers across shards can be generated by:
UUID.
Global auto‑increment table.
Redis‑based ID generator.
Snowflake algorithm.
Baidu uid‑generator (Snowflake variant).
Meituan Leaf (Snowflake variant).
Didi Tinyid.
Multi‑DataSource Access
Middleware that abstracts multiple databases includes: ShardingSphere – Apache‑level, mature, with extensive documentation. MyCat – community‑driven, less actively maintained.
Conclusion
Do not adopt sharding prematurely. First evaluate whether conventional optimizations—SQL tuning, indexing, read/write splitting, and hardware upgrades—can resolve the issue. Sharding can break performance ceilings but introduces significant complexity; it should be applied only when the system truly requires it.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
