Databases 25 min read

Mastering Database Sharding: When, Why, and How to Partition Your Data

This comprehensive guide explains the motivations behind database sharding, compares vertical and horizontal partitioning methods, discusses the challenges of distributed transactions, global primary keys, cross‑node queries, pagination, and data migration, and provides practical solutions and middleware options for implementing scalable, high‑availability database architectures.

Senior Brother's Insights

Nov 29, 2019

Mastering Database Sharding: When, Why, and How to Partition Your Data

Data Partitioning

When a single relational table grows to tens of millions of rows or hundreds of gigabytes, query latency and resource contention increase dramatically. Sharding splits data across multiple databases or tables, reducing the load on each node.

Vertical (Logical) Partitioning

Vertical Database Splitting stores loosely coupled tables in separate databases, similar to micro‑service governance.

Vertical Table Splitting moves rarely used or large columns to an extension table, shrinking row size, improving cache hit rates, and avoiding page splits.

Reduces business coupling and clarifies domain boundaries.

Allows independent scaling, monitoring, and resource allocation per domain.

Mitigates I/O and connection bottlenecks in high‑concurrency scenarios.

Drawbacks:

Joins across databases must be performed in application logic.

Distributed transactions become more complex.

Very large tables may still require horizontal splitting.

Horizontal (Physical) Partitioning

When vertical splitting is insufficient, rows are distributed across multiple databases or tables based on a sharding key.

Range‑Based Sharding

Rows are assigned to shards by time intervals or ID ranges (e.g., uid 1‑9,999,999 → DB1, uid 10,000,000‑19,999,999 → DB2). This keeps each shard size manageable and enables straightforward scaling.

Shard size stays controllable.

Adding a new node does not require moving existing data.

Range queries on the sharding key are fast because the target shard is known.

Modulo‑Based Sharding

Rows are assigned using a hash modulo, e.g., uid % 4 determines the target database. This yields a more uniform data distribution.

Even load distribution reduces hotspot risk.

Drawbacks:

Adding shards requires re‑hashing and data migration.

Queries that do not contain the sharding key must scan all shards, increasing latency.

Issues Caused by Sharding

Transaction Consistency

Updates that span multiple shards need distributed transaction protocols such as XA or two‑phase commit. These guarantee atomicity but add latency, increase lock contention, and become a scaling bottleneck as the number of nodes grows.

For systems tolerant of eventual consistency, compensation transactions or periodic reconciliation can replace strict ACID guarantees.

Cross‑Node Join Problems

After sharding, related rows may reside on different nodes, making traditional SQL joins expensive or impossible. Common mitigation strategies include:

Global Tables : Duplicate small, rarely changing reference tables in every shard.

Field Redundancy : Denormalize frequently accessed fields (e.g., store userName in the order table) to avoid joins.

Data Assembly : Perform a two‑step query—first fetch IDs, then retrieve related rows from the appropriate shards and merge in the application layer.

ER‑Based Sharding : Keep tables that share a primary‑key relationship (e.g., order and order_detail) in the same shard so joins stay local.

Cross‑Node Pagination, Sorting, and Aggregation

When the sort key is not the sharding key, each shard must sort locally, return a subset, and the coordinator merges and re‑sorts the results. Large page numbers can consume significant CPU and memory. Aggregate functions (MAX, MIN, SUM, COUNT) must be executed on every shard and then combined.

Global Primary‑Key Collision

Auto‑increment IDs are not unique across shards. Common solutions:

UUID 550e8400-e29b-41d4-a716-446655440000 Simple, locally generated, but large (36 bytes) and index‑inefficient.

Database‑Backed Sequence Table

CREATE TABLE `sequence` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `stub` CHAR(1) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;

Obtain a global ID with:

REPLACE INTO sequence (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

Requires a single point of coordination and uses MyISAM table‑level locking.

Snowflake‑Style Distributed ID 64‑bit IDs composed of timestamp, datacenter ID, worker ID, and a per‑node sequence. Provides time‑ordered, collision‑free IDs without a central coordinator, but relies on synchronized clocks.

Production‑grade implementations such as Meituan‑Dianping’s Leaf combine a database‑backed sequence with Snowflake bits for high‑throughput, fault‑tolerant ID generation.

When to Consider Sharding

Avoid sharding unless a single node cannot handle the data volume or query load.

Tables exceeding ~10 million rows or 100 GB typically suffer from backup, DDL, and lock contention.

Business‑driven vertical splits (e.g., separating hot columns like last_login_time from static user data).

Need for higher availability and risk isolation across multiple machines.

Case Study: User Center

The core User table contains uid, login_name, passwd, sex, age, nickname. As the user base grows, last_login_time becomes a hot column. Vertical splitting creates:

user_time(uid, last_login_time)
user_ext(uid, personal_info, ...)

Horizontal Sharding Strategies

Range Sharding : Distribute users by UID range (e.g., 0‑10 M → DB1, 10‑20 M → DB2). Simple to scale but may create load imbalance if newer users are more active.

Modulo Sharding : Use uid % N to assign shards, achieving uniform distribution at the cost of more complex re‑sharding.

Non‑UID Queries

For lookups by login_name, maintain a mapping table (login_name → uid) or a cache. The application first resolves the UID, then routes the request to the appropriate shard.

Front‑End vs. Back‑End Separation

User‑facing services require low‑latency single‑row queries and benefit from the UID‑based routing. Analytical workloads (large scans, multi‑condition filters) should run on a separate backend service/database or on a search engine (e.g., Elasticsearch) to avoid impacting the front‑end.

Sharding Middleware Support

sharding-jdbc (Dangdang)

TSharding (Mogujie)

Atlas (Qihoo 360)

Cobar (Alibaba)

MyCAT (based on Cobar)

Oceanus (58.com)

Vitess (Google)

References

Database Distributed Architecture Overview – Sharding and Core Banking Applicability

Sharding Concepts

Key Steps and Pitfalls of Horizontal Sharding

Principles, Strategies, and Challenges of Sharding

Leaf – Meituan‑Dianping Distributed ID Generation System

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

horizontal scaling

Written by

Senior Brother's Insights

A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.