Databases 24 min read

Mastering Data Sharding: When, How, and What to Watch Out For

This comprehensive guide explains why and when to split relational databases, details vertical and horizontal sharding techniques, discusses associated challenges such as distributed transactions, cross‑node joins, pagination, global primary keys, and offers practical solutions and middleware options.

Senior Brother's Insights

Dec 2, 2020

Mastering Data Sharding: When, How, and What to Watch Out For

1. Why Data Sharding?

Relational databases become bottlenecks when a single table reaches tens of millions of rows or hundreds of gigabytes, even after adding read replicas or optimizing indexes. Sharding (data partitioning) reduces the load on each node, shortens query time, and improves overall performance.

2. Types of Sharding

2.1 Vertical (Logical) Sharding

Vertical sharding splits data by business domains. It includes:

Vertical database splitting: low‑coupling tables are placed in separate databases, similar to micro‑service governance.

Vertical table splitting: columns with low access frequency or large size are moved to an extension table, reducing row size, improving InnoDB page utilization, and increasing cache hit rates.

Advantages

Clear business boundaries and reduced coupling.

Facilitates micro‑service governance, independent monitoring and scaling.

Improves I/O, connection limits, and hardware utilization under high concurrency.

Disadvantages

Cross‑database joins must be handled via aggregation services, increasing development complexity.

Distributed transaction handling becomes more complex.

Large tables may still need horizontal sharding.

2.2 Horizontal (Physical) Sharding

When vertical splitting is insufficient, horizontal sharding distributes rows of a single logical table across multiple databases or tables.

Two main patterns:

Range‑based sharding : split by time or ID ranges (e.g., uid 0‑10M in db1, 10M‑20M in db2). Benefits include easy scaling by adding new nodes without data migration, but load may be uneven if newer data is hotter.

Modulo‑based sharding : split by hash of a key (e.g., uid % N). Provides uniform data distribution and balanced load, but adding nodes requires re‑hashing and data migration.

3. Problems Introduced by Sharding

3.1 Transaction Consistency

Cross‑shard updates require distributed transactions (XA, two‑phase commit), which increase latency and risk of deadlocks as node count grows.

For systems tolerant to eventual consistency, compensation mechanisms (reconciliation jobs, log‑based sync) can be used instead of strict ACID.

3.2 Cross‑Node Joins

Joins across shards are expensive. Solutions include:

Duplicating global reference tables in each shard.

Denormalizing frequently accessed fields (e.g., storing userName in the order table).

Two‑step data assembly: first fetch IDs, then query each shard for details and merge in application code.

ER‑based sharding: keep related tables in the same shard based on foreign‑key relationships.

3.3 Pagination, Sorting, and Aggregation

When sorting on non‑shard keys, each shard must sort locally, then a global merge is required, which is CPU‑ and memory‑intensive for deep pages. Aggregation functions (MAX, MIN, SUM, COUNT) must be computed per shard and then combined.

3.4 Global Primary Key Generation

Auto‑increment IDs are not globally unique across shards. Common strategies:

UUID – simple but large and index‑unfriendly.

Dedicated sequence table (MyISAM) with a unique stub column to generate IDs per business.

Multiple DB servers each with its own sequence, using different auto‑increment offsets and steps.

Batch‑fetching ID blocks to reduce DB load.

Snowflake algorithm (Twitter) – 64‑bit IDs composed of timestamp, datacenter/worker IDs, and a per‑millisecond counter; fast and collision‑free but depends on monotonic clocks.

Leaf (Meituan‑Dianping) combines database‑backed sequences with Snowflake‑style bits for high availability.

3.5 Data Migration & Capacity Planning

When sharding is introduced, historical data must be read, re‑sharded, and written to target shards. Capacity planning should aim for ≤10M rows per shard to keep backup, DDL, and lock times manageable.

3.6 When to Consider Sharding

Avoid premature sharding; first try hardware upgrades, read‑write splitting, and index tuning.

Trigger sharding when single‑table size or operational tasks (backup, DDL) cause unacceptable latency.

Vertical sharding is useful when certain columns are rarely accessed or very large.

Horizontal sharding is needed when row count or QPS exceeds single‑node limits.

4. Practical Case: User Center

Core table: User(uid, login_name, passwd, sex, age, nickname). High‑frequency queries use uid; occasional lookups use login_name. Strategies:

Horizontal sharding by uid range or modulo.

Maintain a login_name → uid mapping table (or cache) to resolve non‑uid lookups.

Use a “gene” function to embed a shard identifier into generated IDs, eliminating the need for a separate mapping table.

Separate front‑end (user‑facing) and back‑end (analytics) workloads: front‑end uses real‑time shards; back‑end can read from async replicas or a data warehouse (ES, Hive).

5. Sharding Middleware Options

Open‑source solutions that abstract routing, SQL rewriting, and connection pooling include:

sharding‑jdbc (Dangdang)

TSharding (Mogujie)

Atlas (360)

Cobar (Alibaba)

MyCAT (Cobar‑based)

Oceanus (58.com)

Vitess (Google)

6. References

Various Chinese technical articles and the Leaf ID generation project are cited for deeper reading.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Sharding distributed transactions Global ID Generation Vertical Sharding Horizontal Sharding database partitioning

Written by

Senior Brother's Insights

A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.