Databases 22 min read

Mastering Database Sharding: When and How to Split Your Data

This article explains why and how relational databases should be sharded, covering vertical and horizontal partitioning techniques, common pitfalls such as distributed transactions and join queries, and practical guidelines for deciding when to split data and which tools to use.

ITPUB

May 28, 2018

Mastering Database Sharding: When and How to Split Your Data

1. Data Sharding Overview

Relational databases often become performance bottlenecks when a single table reaches tens of millions of rows or hundreds of gigabytes. Adding read replicas or optimizing indexes may not suffice, so sharding—splitting data across multiple databases—is used to reduce load and improve query speed.

2. Vertical (Logical) Sharding

Vertical sharding separates data by business domains. It includes:

Vertical database splitting : Low‑coupling tables are placed in separate databases, similar to micro‑service architecture where each service owns its own database.

Vertical table splitting : Frequently accessed or large columns are moved to an extension table, reducing row size, improving memory cache hit rate, and avoiding page‑splits in MySQL.

Advantages: clearer business boundaries, better IO and connection distribution, and improved concurrency. Drawbacks: some tables cannot be joined directly, increasing development complexity; distributed transaction handling becomes harder; large tables may still need horizontal sharding.

3. Horizontal (Physical) Sharding

When vertical splitting is insufficient, horizontal sharding distributes rows of a single logical table across multiple databases or tables.

Typical rules:

Range‑based sharding : Split by date or ID ranges (e.g., userId 1‑9999 → DB1, 10000‑19999 → DB2). Benefits: controllable table size, easy scaling, and fast range queries. Drawbacks: hot‑spot data can overload recent partitions.

Modulo‑based sharding : Apply a hash/mod operation on a column (e.g., customerId % 4) to assign rows to databases. Benefits: even data distribution, reduced hot‑spots. Drawbacks: re‑hashing required for scaling and cross‑shard queries become complex.

4. Challenges Introduced by Sharding

Distributed transaction consistency : Updating data across shards requires XA or two‑phase commit, increasing latency and risk of deadlocks. Some systems relax strict consistency in favor of eventual consistency using compensation mechanisms.

Cross‑shard joins : Joins become expensive or impossible; solutions include global tables, redundant fields, data assembly in application code, or placing related tables on the same shard (ER‑sharding).

Pagination, sorting, and aggregation : Queries must be executed on each shard, results merged and re‑sorted, which can consume significant CPU and memory, especially for deep pages.

Global primary‑key generation : Auto‑increment IDs are unsuitable. Common strategies:

UUID – simple but large and index‑unfriendly.

Sequence table – a dedicated table that generates unique IDs; requires careful locking.

Dual‑server sequence with offset and step – distributes ID generation load.

Snowflake algorithm – 64‑bit IDs composed of timestamp, datacenter, worker, and sequence bits; high throughput but clock‑drift sensitive.

Example SQL for sequence table:

REPLACE INTO sequence (stub) VALUES ('a');

SELECT LAST_INSERT_ID();

5. When to Consider Sharding

Do not shard preemptively. Evaluate based on:

Data volume approaching table limits (e.g., >10 million rows or >100 GB).

Operational impact: backup time, long DDL locks, lock contention.

Rapid growth of specific columns that cause frequent updates.

Business‑driven needs for vertical separation or higher availability.

6. Case Study: User Center

Core table: User(uid, login_name, passwd, sex, age, nickname). Sharding strategies:

Range sharding on uid (e.g., 0‑10M → DB1, 10M‑20M → DB2).

Modulo sharding on uid for even distribution.

Non‑uid queries (e.g., login_name) require a mapping table or a hash‑based “gene” function to locate the correct shard.

7. Open‑Source Sharding Middleware

Several mature solutions exist:

sharding‑jdbc

TSharding

Atlas

Cobar

MyCAT

Oceanus

Vitess

These tools provide sharding rules, routing, and transaction support, reducing the need to build custom infrastructure.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Sharding Middleware distributed transactions Horizontal Partitioning Vertical Partitioning ID Generation

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.