Mastering Data Sharding: When, How, and What to Watch Out For
This comprehensive guide explains why and when to split relational databases, details vertical and horizontal sharding techniques, discusses associated challenges such as distributed transactions, cross‑node joins, pagination, global primary keys, and offers practical solutions and middleware options.
1. Why Data Sharding?
Relational databases become bottlenecks when a single table reaches tens of millions of rows or hundreds of gigabytes, even after adding read replicas or optimizing indexes. Sharding (data partitioning) reduces the load on each node, shortens query time, and improves overall performance.
2. Types of Sharding
2.1 Vertical (Logical) Sharding
Vertical sharding splits data by business domains. It includes:
Vertical database splitting: low‑coupling tables are placed in separate databases, similar to micro‑service governance.
Vertical table splitting: columns with low access frequency or large size are moved to an extension table, reducing row size, improving InnoDB page utilization, and increasing cache hit rates.
Advantages
Clear business boundaries and reduced coupling.
Facilitates micro‑service governance, independent monitoring and scaling.
Improves I/O, connection limits, and hardware utilization under high concurrency.
Disadvantages
Cross‑database joins must be handled via aggregation services, increasing development complexity.
Distributed transaction handling becomes more complex.
Large tables may still need horizontal sharding.
2.2 Horizontal (Physical) Sharding
When vertical splitting is insufficient, horizontal sharding distributes rows of a single logical table across multiple databases or tables.
Two main patterns:
Range‑based sharding : split by time or ID ranges (e.g., uid 0‑10M in db1, 10M‑20M in db2). Benefits include easy scaling by adding new nodes without data migration, but load may be uneven if newer data is hotter.
Modulo‑based sharding : split by hash of a key (e.g., uid % N). Provides uniform data distribution and balanced load, but adding nodes requires re‑hashing and data migration.
3. Problems Introduced by Sharding
3.1 Transaction Consistency
Cross‑shard updates require distributed transactions (XA, two‑phase commit), which increase latency and risk of deadlocks as node count grows.
For systems tolerant to eventual consistency, compensation mechanisms (reconciliation jobs, log‑based sync) can be used instead of strict ACID.
3.2 Cross‑Node Joins
Joins across shards are expensive. Solutions include:
Duplicating global reference tables in each shard.
Denormalizing frequently accessed fields (e.g., storing userName in the order table).
Two‑step data assembly: first fetch IDs, then query each shard for details and merge in application code.
ER‑based sharding: keep related tables in the same shard based on foreign‑key relationships.
3.3 Pagination, Sorting, and Aggregation
When sorting on non‑shard keys, each shard must sort locally, then a global merge is required, which is CPU‑ and memory‑intensive for deep pages. Aggregation functions (MAX, MIN, SUM, COUNT) must be computed per shard and then combined.
3.4 Global Primary Key Generation
Auto‑increment IDs are not globally unique across shards. Common strategies:
UUID – simple but large and index‑unfriendly.
Dedicated sequence table (MyISAM) with a unique stub column to generate IDs per business.
Multiple DB servers each with its own sequence, using different auto‑increment offsets and steps.
Batch‑fetching ID blocks to reduce DB load.
Snowflake algorithm (Twitter) – 64‑bit IDs composed of timestamp, datacenter/worker IDs, and a per‑millisecond counter; fast and collision‑free but depends on monotonic clocks.
Leaf (Meituan‑Dianping) combines database‑backed sequences with Snowflake‑style bits for high availability.
3.5 Data Migration & Capacity Planning
When sharding is introduced, historical data must be read, re‑sharded, and written to target shards. Capacity planning should aim for ≤10M rows per shard to keep backup, DDL, and lock times manageable.
3.6 When to Consider Sharding
Avoid premature sharding; first try hardware upgrades, read‑write splitting, and index tuning.
Trigger sharding when single‑table size or operational tasks (backup, DDL) cause unacceptable latency.
Vertical sharding is useful when certain columns are rarely accessed or very large.
Horizontal sharding is needed when row count or QPS exceeds single‑node limits.
4. Practical Case: User Center
Core table: User(uid, login_name, passwd, sex, age, nickname). High‑frequency queries use uid; occasional lookups use login_name. Strategies:
Horizontal sharding by uid range or modulo.
Maintain a login_name → uid mapping table (or cache) to resolve non‑uid lookups.
Use a “gene” function to embed a shard identifier into generated IDs, eliminating the need for a separate mapping table.
Separate front‑end (user‑facing) and back‑end (analytics) workloads: front‑end uses real‑time shards; back‑end can read from async replicas or a data warehouse (ES, Hive).
5. Sharding Middleware Options
Open‑source solutions that abstract routing, SQL rewriting, and connection pooling include:
sharding‑jdbc (Dangdang)
TSharding (Mogujie)
Atlas (360)
Cobar (Alibaba)
MyCAT (Cobar‑based)
Oceanus (58.com)
Vitess (Google)
6. References
Various Chinese technical articles and the Leaf ID generation project are cited for deeper reading.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
