Mastering Database Sharding: When, Why, and How to Split Your Data
This guide explains the motivations, methods, and trade-offs of database sharding: vertical and horizontal partitioning, sharding rules, cross-node joins, distributed transactions, global primary key generation, and practical migration strategies. The goal is to help you scale relational databases effectively.
What Is Data Sharding?
When a relational database becomes a performance bottleneck because of limited storage, connections, or processing power, splitting the data across multiple databases or tables reduces the load on each one and shortens query time. Sharding (data partitioning) distributes the data so that each database or table holds only a subset of the whole.
Vertical (Column) Sharding
Vertical sharding separates tables or columns based on business coupling. It includes:
Vertical database splitting : Low‑coupled tables are placed in different databases, similar to breaking a monolith into micro‑services.
Vertical table splitting : Rarely used or large columns are moved to an extension table, which reduces row size so that more rows fit in each InnoDB page. Columns that are updated far more often than the rest of the row can be split out the same way.
Advantages:
Clear business boundaries and reduced coupling.
Improved I/O and connection capacity for high‑concurrency scenarios.
Better monitoring and maintenance per business domain.
Disadvantages:
Tables in different databases can no longer be joined in SQL; the data must be combined through service APIs or in application code.
Distributed transaction handling is complex.
Large tables may still need horizontal sharding.
Horizontal (Row) Sharding
Horizontal sharding splits a single table’s rows across multiple databases or tables based on a logical rule, reducing the size of each table.
Range sharding : Data is divided by value ranges (e.g., date or ID intervals).
Modulo sharding : Data is distributed by applying a hash/modulo operation to a key (e.g., userId % N).
Advantages:
Eliminates single‑node bottlenecks, improving stability and load capacity.
Minimal changes to the application layer.
Disadvantages:
Cross‑shard transactions are hard to guarantee.
Join queries across shards perform poorly.
Data maintenance and expansion become more complex.
Sharding Rules
Range Sharding
Rows are assigned to shards based on a numeric or temporal range, such as timestamps or ID intervals. Adding a new node only requires defining a new range; existing data does not need to be moved.
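The range rule above can be sketched as a small router. This is a minimal illustration, not a production design: the boundaries, the shard names (db1, db2, db3), and the helper names route_by_range and add_range are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical layout: each shard owns the IDs from its start value up to the
# next start value; the last shard is open-ended.
RANGE_STARTS = [0, 10_000_000, 20_000_000]
RANGE_SHARDS = ["db1", "db2", "db3"]


def route_by_range(uid: int) -> str:
    """Return the shard whose interval contains uid."""
    if uid < RANGE_STARTS[0]:
        raise ValueError(f"uid {uid} is below the first range")
    # bisect_right finds the first boundary greater than uid;
    # the owning shard is the one just before it.
    return RANGE_SHARDS[bisect_right(RANGE_STARTS, uid) - 1]


def add_range(start: int, shard: str) -> None:
    """Open a new range at the top.

    Assumes no rows at or above `start` exist yet, so nothing has to move.
    """
    if start <= RANGE_STARTS[-1]:
        raise ValueError("new range must start above the current last boundary")
    RANGE_STARTS.append(start)
    RANGE_SHARDS.append(shard)
```

Calling add_range(30_000_000, "db4") sends new IDs from 30 million upward to db4 while every existing ID keeps its current shard, which is the property that makes range expansion cheap.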
Modulo Sharding
Rows are assigned using a modulo of a key (e.g., userId % 4). This yields an even distribution of rows and requests, but adding or removing nodes requires re‑hashing and data migration.
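To make the re-hashing cost concrete, the sketch below routes keys with a plain modulo and measures how many keys change shard when the shard count changes. The function names are hypothetical, and keys are assumed to be non-negative integers.

```python
def route_by_modulo(user_id: int, shard_count: int) -> int:
    """Return the shard index for a non-negative integer key."""
    return user_id % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1
        for k in keys
        if route_by_modulo(k, old_count) != route_by_modulo(k, new_count)
    )
    return moved / len(keys)
```

For the keys 0 to 19, growing from 4 to 5 shards moves 80% of them, while doubling from 4 to 8 moves exactly half. This is one reason modulo-sharded clusters are often expanded by doubling.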
Challenges After Sharding
Distributed Transactions
When an update spans multiple shards, a distributed transaction protocol such as XA (which is based on two-phase commit) is required. This adds latency and holds locks longer, which raises the risk of deadlocks. The more nodes involved, the greater the coordination overhead.
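The control flow of two-phase commit can be shown with a toy coordinator. This is a sketch only: Participant is a hypothetical in-memory stand-in for a shard, and it omits the durable logging, timeouts, and crash recovery that a real XA implementation needs.

```python
class Participant:
    """Hypothetical in-memory stand-in for one shard in a two-phase commit."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote. A real shard would persist the pending work
        # and keep its locks until phase 2.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Commit on every participant only if all of them vote yes in phase 1."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            # A single "no" vote aborts the whole transaction.
            for q in prepared:
                q.rollback()
            return False
    # Phase 2: everyone voted yes, so tell each participant to commit.
    for p in participants:
        p.commit()
    return True
```

Every participant is contacted twice and keeps its resources locked between the two phases, which is where the extra latency comes from as the number of shards grows.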
Cross‑Node Joins
SQL joins that previously operated on a single database now need to query multiple shards, aggregate results, and re‑join in the application layer, which can be costly.
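An application-layer join might look like the following sketch. It assumes user rows are modulo-sharded by uid, that each shard can be modeled as a dict keyed by uid, and that order rows have already been fetched; app_level_join and the field names are illustrative only.

```python
def app_level_join(orders, users_by_shard, shard_count):
    """Join order rows to user rows that live on modulo-sharded user tables."""
    # Group the needed uids by shard so that each shard is queried once.
    wanted = {}
    for order in orders:
        wanted.setdefault(order["uid"] % shard_count, set()).add(order["uid"])

    # One lookup per shard; the dict stands in for
    # "SELECT ... FROM user WHERE uid IN (...)" against that shard.
    users = {}
    for shard, uids in wanted.items():
        for uid in uids:
            row = users_by_shard[shard].get(uid)
            if row is not None:
                users[uid] = row

    # Re-join in memory (inner join: orders without a user row are dropped).
    return [
        {**order, "nickname": users[order["uid"]]["nickname"]}
        for order in orders
        if order["uid"] in users
    ]
```

The cost is one round trip per shard touched plus the memory needed to hold both sides of the join, so this pattern works best when the driving result set is small.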
Pagination, Sorting, and Aggregation
Limit/offset pagination and ORDER BY on non-sharding columns require each shard to sort locally and return its first offset + limit rows, which are then merged and re-sorted globally. The CPU and memory cost therefore grows with the page number.
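The merge step can be sketched as follows. Each shard is modeled as an in-memory list of rows, and paged_query is a hypothetical helper name; in practice the per-shard sort and truncation would be done by the database through ORDER BY and LIMIT.

```python
import heapq
from itertools import islice


def paged_query(shards, sort_key, offset: int, limit: int):
    """Assemble a globally sorted page from per-shard sorted prefixes."""
    need = offset + limit
    # Each shard must return its first offset + limit rows, not just `limit`,
    # because the requested page could come entirely from one shard.
    partials = [sorted(rows, key=sort_key)[:need] for rows in shards]
    # Merge the already sorted partial results, then skip `offset` rows.
    merged = heapq.merge(*partials, key=sort_key)
    return list(islice(merged, offset, need))
```

With S shards, the coordinator receives up to S × (offset + limit) rows to produce a page of only limit rows, which is why deep pages become expensive.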
Global Primary Key Generation
Auto‑increment IDs are no longer globally unique. Common solutions include:
UUID : Simple, locally generated 128‑bit identifiers, but they are large and can fragment indexes.
Sequence table (MyISAM): A dedicated table with an auto-increment column; applications issue REPLACE INTO sequence (stub) VALUES ('a'); SELECT LAST_INSERT_ID(); to obtain a unique ID.
Snowflake (Twitter): 64-bit IDs composed of a timestamp, datacenter ID, worker ID, and a per-millisecond counter. With the usual 12-bit counter, each node can issue up to 4,096 IDs per millisecond, or roughly 4 million per second.
Leaf (Meituan) : A production‑grade service that combines timestamp, machine ID, and sequence to generate collision‑free IDs with high availability.
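A Snowflake-style generator can be sketched in a few lines. The bit widths below (41-bit timestamp, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence) follow the commonly documented Snowflake layout; the class name and the injectable clock parameter are assumptions added for the example, and the clock-rollback handling is deliberately simplistic.

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's custom epoch; any fixed past instant works
WORKER_BITS, DATACENTER_BITS, SEQUENCE_BITS = 5, 5, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095


class SnowflakeGenerator:
    def __init__(self, datacenter_id: int, worker_id: int, clock=None):
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id out of range")
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        # Hypothetical hook: lets callers inject a millisecond clock.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Counter exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (self.datacenter_id << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Because the timestamp occupies the high bits, IDs from one generator increase over time, which keeps B-tree index inserts mostly append-only, unlike random UUIDs.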
Data Migration and Capacity Planning
When sharding is introduced, historical data must be read from the original tables and written to the appropriate shards according to the chosen rule. Capacity planning should aim for no more than ~10 million rows per shard to keep DDL and backup operations manageable.
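As a rough sketch of both steps, the helpers below estimate a shard count from the ~10 million rows guideline and bucket legacy rows by a modulo rule. The function names are hypothetical, rows are modeled as dicts with a uid field, and a real migration would read in batches and verify the copied data rather than hold everything in memory.

```python
import math


def shards_needed(expected_rows: int, max_rows_per_shard: int = 10_000_000) -> int:
    """Smallest shard count that keeps every shard at or under the row budget."""
    return max(1, math.ceil(expected_rows / max_rows_per_shard))


def migrate(rows, shard_count: int):
    """Bucket legacy rows by uid % shard_count (the assumed sharding rule)."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[row["uid"] % shard_count].append(row)
    return shards
```

Sizing for the expected row count rather than the current one reduces how often the cluster has to be re-sharded; for 1 billion rows the guideline gives 100 shards.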
When to Consider Sharding
Avoid premature sharding; only split when tables approach the performance limits of a single node.
When table size or write volume causes backup, DDL, or lock‑wait problems.
When specific columns become hot (e.g., frequent updates to login timestamps) and can be moved to a separate table.
When rapid data growth threatens to exceed single‑node capacity.
When isolating business domains improves availability and prevents a single failure from affecting all services.
Case Study: User Center
A typical user table contains uid, login_name, passwd, sex, age, nickname, last_login_time, personal_info, …. As the user base grows from 100 k to 1 billion, the last_login_time column becomes a write hotspot. The solution is to vertically split this column into a user_time table and move rarely used large text fields (e.g., personal_info) into a user_ext table.
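The vertical split described above can be tried out with an in-memory SQLite database. This is only a sketch: SQLite stands in for MySQL, the column types are guesses, and the helper names are hypothetical. The table names user_time and user_ext come from the case study.

```python
import sqlite3


def create_split_schema(conn: sqlite3.Connection) -> None:
    """Create the base table plus the two tables split out of it."""
    conn.executescript(
        """
        CREATE TABLE "user" (uid INTEGER PRIMARY KEY, login_name TEXT,
                             passwd TEXT, sex TEXT, age INTEGER, nickname TEXT);
        CREATE TABLE user_time (uid INTEGER PRIMARY KEY, last_login_time TEXT);
        CREATE TABLE user_ext (uid INTEGER PRIMARY KEY, personal_info TEXT);
        """
    )


def record_login(conn: sqlite3.Connection, uid: int, ts: str) -> None:
    # The hot write touches only the narrow user_time row, not the wide user row.
    conn.execute(
        "INSERT OR REPLACE INTO user_time (uid, last_login_time) VALUES (?, ?)",
        (uid, ts),
    )


def load_profile(conn: sqlite3.Connection, uid: int):
    """Reassemble the full view with joins, only when a caller needs it."""
    return conn.execute(
        'SELECT u.uid, u.nickname, t.last_login_time, e.personal_info '
        'FROM "user" u '
        "LEFT JOIN user_time t ON t.uid = u.uid "
        "LEFT JOIN user_ext e ON e.uid = u.uid "
        "WHERE u.uid = ?",
        (uid,),
    ).fetchone()
```

Most reads can stay on the narrow user table, and the joins are paid only on the less frequent paths that need login time or personal_info.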
For horizontal scaling, the uid can be used for range or modulo sharding. Range sharding (e.g., uid 0‑10 M → db1, 10 M‑20 M → db2) simplifies expansion but may lead to uneven load. Modulo sharding (uid % 2) balances load but makes scaling harder because re‑hashing is required.
Non‑uid queries (e.g., login_name) need a mapping table or cache that stores login_name → uid. This mapping can be sharded itself or stored in a fast key‑value store.
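The two-step lookup can be sketched like this. LoginNameIndex is a hypothetical in-memory stand-in for the mapping table or key-value store, and the user shards are modeled as dicts keyed by uid with modulo routing assumed.

```python
class LoginNameIndex:
    """Hypothetical in-memory stand-in for a login_name -> uid mapping store."""

    def __init__(self):
        self._uid_by_login = {}

    def register(self, login_name: str, uid: int) -> None:
        if login_name in self._uid_by_login:
            raise ValueError(f"login_name {login_name!r} is already taken")
        self._uid_by_login[login_name] = uid

    def uid_for(self, login_name: str):
        return self._uid_by_login.get(login_name)


def find_user_by_login(login_name: str, index: LoginNameIndex, shards):
    """Resolve login_name to uid first, then route to the owning shard."""
    uid = index.uid_for(login_name)
    if uid is None:
        return None
    # Assumed rule: users are modulo-sharded by uid.
    return shards[uid % len(shards)].get(uid)
```

The mapping has to be written when the user row is created, so a login_name query costs one extra lookup instead of a scan across every shard.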
Sharding Middleware Options
sharding‑jdbc (Dangdang)
TSharding (Mogujie)
Atlas (Qihoo 360)
Cobar (Alibaba)
MyCAT (based on Cobar)
Oceanus (58.com)
Vitess (Google)
These open‑source projects provide routing, SQL rewriting, and connection pooling to simplify the implementation of both vertical and horizontal sharding.
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
