Databases 23 min read

Data Sharding, Partitioning, and Distributed ID Strategies for Relational Databases

This article explains why relational databases become bottlenecks at large scale, introduces vertical and horizontal sharding techniques, discusses the challenges of distributed transactions, cross‑shard queries, pagination, and global primary‑key generation, and provides practical guidelines and middleware options for implementing sharding in production systems.

Architecture Digest

Jan 5, 2020

Data Sharding, Partitioning, and Distributed ID Strategies for Relational Databases

1. Data Partitioning

Relational databases become bottlenecks when a single node reaches limits of storage, connections, and processing power. When a table exceeds ~10 million rows or 100 GB, performance degrades even with read replicas or index tuning.

Data sharding (splitting) reduces load by distributing data across multiple databases, shortening query time.

Core of distributed databases is data sharding and subsequent data location and aggregation.

Sharding splits data into smaller pieces stored on different DB instances, alleviating single‑node performance constraints.

Two main sharding types: vertical (schema) and horizontal (row) sharding.

1.1 Vertical Sharding

Includes vertical database splitting and vertical table splitting.

Vertical database splitting groups low‑coupling tables into separate databases, similar to micro‑service governance.

Vertical table splitting moves rarely used or large columns to an extension table, reducing row size and improving I/O.

Decouples business modules, improves IO and connection limits.

Facilitates micro‑service style management.

Helps high‑concurrency scenarios.

Drawbacks

Cross‑database joins require aggregation at the application layer.

Distributed transaction handling becomes complex.

Large tables may still need horizontal sharding.

1.2 Horizontal Sharding

Used when vertical splitting is insufficient or data volume is huge. Two patterns: table‑level sharding within a database and database‑level sharding.

Advantages

Avoids single‑node bottlenecks, improves stability and load capacity.

Minimal changes to application code.

Drawbacks

Cross‑shard transaction consistency is hard.

Cross‑database joins perform poorly.

Data expansion and maintenance become costly.

2. Issues Introduced by Sharding

2.1 Transaction Consistency

Updates spanning multiple databases require distributed transactions (XA, two‑phase commit), which increase latency and risk of conflicts.

Final consistency can be achieved with compensation logic instead of immediate rollback.

2.2 Cross‑Shard Joins

Solutions include global tables, field redundancy, two‑step data assembly, and ER‑based sharding that keeps related tables on the same shard.

2.3 Cross‑Shard Pagination, Sorting, Aggregation

Requires sorting each shard then merging results; large page numbers consume significant CPU and memory.

Aggregations (MAX, SUM, COUNT) must be performed per shard then combined.

2.4 Global Primary Key Generation

Strategies: UUID, a dedicated sequence table, multi‑node auto‑increment, and Snowflake algorithm.

CREATE TABLE `sequence` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;

Snowflake produces 64‑bit IDs with timestamp, datacenter, worker, and sequence bits, offering high throughput but depends on clock stability.

3. When to Consider Sharding

Avoid sharding until data volume or performance truly requires it.

When single‑table size hampers backup, DDL, or lock contention.

When specific columns grow rapidly and benefit from vertical split.

When rapid data growth threatens system stability.

For availability, isolate unrelated business data across shards.

4. Case Study: User Center

Core table: User(uid, login_name, passwd, sex, age, nickname).

Front‑end queries are mostly by uid; back‑office queries involve age, gender, login time, requiring different access patterns.

Horizontal sharding by uid range or modulo distributes load; non‑uid queries need mapping tables or gene‑based routing.

4.1 Mapping Relationship

Store login_name → uid in a small index table or cache to locate the correct shard.

4.2 Gene‑Based Routing

Derive a shard identifier from a hash of the non‑uid attribute (e.g., login_name) and route directly to the appropriate database.

5. Sharding Middleware

sharding-jdbc (Dangdang)

TSharding (Mogujie)

Atlas (Qihoo 360)

Cobar (Alibaba)

MyCAT (based on Cobar)

Oceanus (58.com)

Vitess (Google)

— End of summary.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

MySQL horizontal scaling Vertical Partitioning Distributed ID database partitioning

Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.