Database Sharding: Data Partitioning Strategies, Challenges, and Practical Solutions
The article explains why relational databases become performance bottlenecks at large scales, introduces vertical and horizontal sharding techniques, discusses their advantages and drawbacks, and provides detailed guidance on handling distributed transactions, joins, pagination, global primary keys, and ID generation strategies.
Relational databases often become system bottlenecks when a single table reaches massive sizes (e.g., 10 million rows or 100 GB), leading to degraded query performance even after adding replicas or optimizing indexes. To alleviate this, data sharding (partitioning) is employed, which distributes data across multiple databases, reducing per‑node load and improving query speed.
The core concept of distributed databases is sharding, which includes both vertical (column‑wise) and horizontal (row‑wise) partitioning.
1. Vertical (Column‑wise) Sharding
Vertical sharding typically involves two forms:
Vertical database splitting: Low‑coupling tables are placed in separate databases, similar to micro‑service governance where each service owns its own DB.
Vertical table splitting: Frequently accessed or large‑size columns are moved to an extension table, reducing row size and improving memory cache hit rates.
Advantages: clearer business boundaries, better fault isolation, and relief for I/O and connection limits. Disadvantages: joins across databases require application‑level handling, distributed transactions add complexity, and oversized tables may still need horizontal sharding.
2. Horizontal (Row‑wise) Sharding
When vertical sharding is insufficient, horizontal sharding splits rows across multiple databases or tables, reducing single‑table size.
Two main approaches:
In‑database table splitting: Divides a large table into smaller tables within the same DB; does not reduce machine‑level load.
Database‑and‑table splitting: Distributes tables across different DB instances, alleviating CPU, memory, and I/O bottlenecks.
Sharding rules commonly used:
Range‑based sharding: e.g., split by user ID ranges (0‑10M, 10M‑20M, …).
Modulo‑based sharding: e.g., user_id % N determines the target shard.
Pros: eliminates single‑node bottlenecks and improves stability and load capacity. Cons: cross‑shard transactional consistency, degraded join performance, hot‑spot data, and painful resharding when a modulo‑based rule must change shard counts.
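The two routing rules above can be sketched in a few lines. This is an illustrative sketch, not from the article; the 10M range size and the shard count of 4 are assumptions for demonstration.

```python
RANGE_SIZE = 10_000_000   # each shard holds 10M users (assumed)
NUM_SHARDS = 4            # assumed shard count for the modulo rule

def range_shard(user_id: int) -> int:
    """Range-based: uid 0-10M -> shard 0, 10M-20M -> shard 1, ..."""
    return user_id // RANGE_SIZE

def modulo_shard(user_id: int) -> int:
    """Modulo-based: uid % N picks the shard."""
    return user_id % NUM_SHARDS

print(range_shard(15_000_000))   # -> 1
print(modulo_shard(15_000_000))  # -> 0
```

Range routing makes scaling out trivial (new ranges map to new shards) but concentrates recent, hot users on the newest shard; modulo routing spreads load evenly but forces a data migration whenever `NUM_SHARDS` changes.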
3. Problems After Sharding
Distributed transaction consistency: Requires XA or two‑phase commit; coordination overhead can increase latency and deadlock risk.
Eventual consistency: For high‑performance, low‑consistency needs, compensation transactions or asynchronous sync can be used.
Cross‑shard joins: Avoided by using global tables, field redundancy, data assembly in two‑step queries, or placing related tables on the same shard.
Pagination, sorting, and aggregate functions: Must be performed per‑shard then merged, which can be CPU‑ and memory‑intensive for large page numbers.
Global primary key duplication: Solutions include UUIDs, a dedicated sequence table, multi‑node ID generators (e.g., Flickr‑style), or Snowflake‑style 64‑bit IDs.
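A Snowflake‑style ID packs a timestamp, a worker ID, and a per‑millisecond sequence into one 64‑bit integer. The following is a simplified sketch using the common 41/10/12 bit layout; the epoch constant and class name are illustrative assumptions, not a production implementation.

```python
import time

EPOCH_MS = 1_288_834_974_657  # arbitrary custom epoch, for illustration

class SnowflakeLite:
    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024   # 10 bits for the worker ID
        self.worker_id = worker_id
        self.sequence = 0              # 12-bit counter within one millisecond
        self.last_ms = -1

    def next_id(self) -> int:
        now = int(time.time() * 1000)
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # wraps at 4096
            if self.sequence == 0:
                while now <= self.last_ms:               # spin to the next ms
                    now = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now
        # layout: 41 bits timestamp | 10 bits worker | 12 bits sequence
        return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeLite(worker_id=1)
print(gen.next_id())  # a roughly time-ordered 64-bit integer
```

Because the timestamp occupies the high bits, IDs from one generator are strictly increasing, which keeps B‑tree inserts append‑mostly, unlike random UUIDs.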
Example of a sequence table for global IDs:
CREATE TABLE `sequence` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`stub` char(1) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;
Insert and retrieve a new ID:
REPLACE INTO sequence (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
4. When to Consider Sharding
Avoid premature sharding; first try hardware upgrades, read/write splitting, and index optimization.
Sharding becomes necessary when single‑table size or operation latency reaches critical thresholds (e.g., >10 M rows, >100 GB).
Consider operational impacts such as backup I/O, long DDL locks, and lock contention.
Vertical splitting may be needed for fields with high update frequency or large size.
Horizontal sharding is required when data growth outpaces vertical solutions.
Separate business domains to improve availability; failures in one shard affect only a subset of users.
5. Case Study: User Center
Core user table:
User(uid, login_name, passwd, sex, age, nickname)
Sharding strategies:
Range‑based: uid 0‑10M → db1, 10M‑20M → db2, …
Modulo‑based: uid % 8 determines one of eight databases.
For non‑uid queries (e.g., login_name), a mapping table or cache from login_name to uid is recommended, or a "gene" function that encodes login_name into the shard‑determining bits.
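The "gene" idea can be sketched as follows: at registration time, derive a few bits from login_name and embed them into the low bits of the generated uid, so that both keys route to the same shard without a lookup. This is an illustrative sketch; the MD5 hash, the 8‑shard count, and the helper names are assumptions, not a prescribed scheme.

```python
import hashlib

NUM_SHARDS = 8   # eight databases, as in the modulo example above
GENE_BITS = 3    # log2(8): the low-order bits that carry the "gene"

def gene(login_name: str) -> int:
    """Derive the shard-determining bits from login_name (illustrative hash)."""
    digest = hashlib.md5(login_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") & (NUM_SHARDS - 1)

def make_uid(base_id: int, login_name: str) -> int:
    """Embed the gene into the uid's low bits at ID-generation time."""
    return (base_id << GENE_BITS) | gene(login_name)

def shard_of_uid(uid: int) -> int:
    return uid & (NUM_SHARDS - 1)

def shard_of_login(login_name: str) -> int:
    return gene(login_name)

uid = make_uid(12345, "alice")
print(shard_of_uid(uid) == shard_of_login("alice"))  # True: both keys co-locate
```

The trade‑off: login_name can never change without migrating the row, and the shard count is effectively baked into every uid.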
6. Sharding Middleware
sharding‑jdbc (Dangdang)
TSharding (Mogujie)
Atlas (Qihoo 360)
Cobar (Alibaba)
MyCAT (based on Cobar)
Oceanus (58.com)
Vitess (Google)
These tools provide transparent routing, SQL rewriting, and connection pooling to simplify implementation of the above strategies.
References: multiple technical blogs and articles on sharding principles, implementation steps, and ID generation systems such as Meituan's Leaf.