Mastering Database Sharding: Core Concepts, Middleware, and Common Challenges

Sharding splits a single database across multiple servers to improve performance. This article covers vertical and horizontal partitioning, middleware options, and key challenges such as distributed transactions, cross-node joins, ID generation, pagination, routing transparency, and the choice between an existing framework and a custom solution.

Java Backend Technology

Aug 30, 2020

Basic Idea of Sharding

The core idea of sharding is to split a single database into multiple parts placed on different servers, relieving the performance bottleneck of a monolithic database. Vertical partitioning separates tables by functional module and suits a database with many tables. Horizontal partitioning distributes the rows of a single large table across servers according to a rule such as hashing the ID, and suits a few tables with very large row counts. In practice, both strategies are often combined, and the choice depends on data volume, table count, and growth patterns.

Common Sharding Middleware

Simple and Easy‑to‑Use Components

sharding‑jdbc (by Dangdang)

TSharding (by Mogujie)

Heavy‑Weight Middleware

sharding‑proxy

TDDL Smart Client (Taobao)

Atlas (Qihoo 360)

Alibaba Cobar (B2B division)

MyCAT (based on Cobar)

Oceanus (58.com)

OneProxy (developed by a former Alipay chief architect)

Vitess (Google)

Problems to Solve in Sharding

1. Transaction Issues

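Two main approaches exist. The first relies on distributed transactions managed by the database: it is simple to implement but carries a high performance cost. The second is controlled by the application: a transaction that spans shards is split into several sub-transactions, each confined to a single shard, and the application coordinates them. This performs better but requires custom transaction-control logic. That is a particular concern with Spring transaction management, where a standard transaction manager is typically bound to a single data source.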

2. Cross‑Node Join Problems

Once data is sharded, joins that span shards are hard to avoid. A common solution is to run the query in two steps: first retrieve the IDs of the related rows from one shard, then use those IDs to query the other shards and merge the results in the application.
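The two-step query can be sketched in memory as follows. This is only an illustration: the maps stand in for tables on different shards, and the class name, method name, and the `userId % shardCount` placement rule are assumptions made for the example, not part of any particular middleware.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical two-step "join": orders live on one shard, users are spread over several. */
final class TwoStepJoin {

    /**
     * @param orderToUserId rows of an orders table on one shard: orderId -> userId
     * @param userShards    one map per user shard: userId -> userName
     * @return lines of the form "orderId:userName", in the iteration order of the orders
     */
    static List<String> joinOrdersWithUsers(Map<Long, Long> orderToUserId,
                                            List<Map<Long, String>> userShards) {
        // Step 1: collect the distinct user IDs referenced by the orders.
        Set<Long> userIds = new LinkedHashSet<>(orderToUserId.values());

        // Step 2: look each ID up on the shard that owns it (assumed rule: userId % shardCount).
        Map<Long, String> namesById = new HashMap<>();
        for (Long userId : userIds) {
            int shard = (int) Math.floorMod(userId, (long) userShards.size());
            String name = userShards.get(shard).get(userId);
            if (name != null) {
                namesById.put(userId, name);
            }
        }

        // Merge in the application: attach the user name to each order (inner-join semantics).
        List<String> joined = new ArrayList<>();
        for (Map.Entry<Long, Long> order : orderToUserId.entrySet()) {
            String name = namesById.get(order.getValue());
            if (name != null) {
                joined.add(order.getKey() + ":" + name);
            }
        }
        return joined;
    }
}
```

Against real databases, step 2 would normally group the IDs by shard and issue one batched query per shard rather than one lookup per ID.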

3. Cross‑Node Aggregations (count, order by, group by, etc.)

Aggregations that require the full data set must be computed on each shard and merged in the application. Parallel execution speeds up processing, but large result sets can strain application memory.
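As a rough sketch of the merge step, the snippet below combines per-shard results of a grouped count and of an average. The class and method names are hypothetical. Note that an average cannot be obtained by averaging the per-shard averages, so each shard is assumed to return its sum and its row count.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical application-side merge of per-shard aggregation results. */
final class ShardAggregator {

    /** Merges per-shard "GROUP BY key, COUNT(*)" results by summing the counts per key. */
    static Map<String, Long> mergeCounts(List<Map<String, Long>> perShardCounts) {
        Map<String, Long> total = new TreeMap<>(); // sorted keys mimic an ORDER BY on the group key
        for (Map<String, Long> shardResult : perShardCounts) {
            shardResult.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }

    /** Combines per-shard SUM and COUNT values into a global average. */
    static double mergeAverage(long[] sums, long[] counts) {
        long sum = 0;
        long count = 0;
        for (int i = 0; i < sums.length; i++) {
            sum += sums[i];
            count += counts[i];
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

Running the per-shard queries in parallel changes only how the inputs are gathered. The merge itself stays the same.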

4. Data Migration, Capacity Planning, and Scaling

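Some solutions reduce migration cost through the choice of modulus. One scheme from Taobao takes the remainder modulo multiples of 2, which is forward-compatible: an ID whose remainder modulo 4 is 1 also has remainder 1 modulo 2. This avoids row-level migration, but it still needs table-level moves and limits both the scale of each expansion and the number of tables, so scaling remains non-trivial.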
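To make the migration cost concrete, the sketch below counts how many IDs change shard when the shard count changes under plain modulo routing. It assumes IDs are dense integers starting at 0, and the class and method names are hypothetical. When the count is doubled from n to 2n, a row on shard s either stays on s or moves to s + n, so about half of the rows move. For a change such as 4 to 5, most rows move.

```java
/** Hypothetical helper that measures data movement under modulo routing. */
final class ShardDoubling {

    /** Modulo routing: the shard index is the non-negative remainder of id / shardCount. */
    static int shardFor(long id, int shardCount) {
        return (int) Math.floorMod(id, (long) shardCount);
    }

    /** Counts the ids in [0, idCount) whose shard differs between the old and new shard counts. */
    static long movedRows(long idCount, int oldCount, int newCount) {
        long moved = 0;
        for (long id = 0; id < idCount; id++) {
            if (shardFor(id, oldCount) != shardFor(id, newCount)) {
                moved++;
            }
        }
        return moved;
    }
}
```

For 1,000 IDs, `movedRows(1000, 2, 4)` is 500 and `movedRows(1000, 4, 5)` is 800.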

5. Distributed Transactions

Two‑phase commit ensures atomicity across shards but adds latency and becomes a scalability bottleneck as shard count grows. Many frameworks adopt Best‑Efforts 1PC or compensation transactions to trade strict consistency for higher throughput.
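A compensation transaction can be sketched as a list of per-shard steps, each paired with an action that undoes it. The `Step` interface and the `run` method below are hypothetical names for illustration. A real implementation would also have to persist its progress and retry failed compensations, which this sketch leaves out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical compensation-based coordinator for per-shard sub-transactions. */
final class CompensatingTransaction {

    /** One sub-transaction on a single shard, plus the action that undoes it. */
    interface Step {
        void execute() throws Exception;

        void compensate();
    }

    /**
     * Runs the steps in order. If one fails, the steps that already completed are
     * compensated in reverse order.
     *
     * @return true if every step succeeded, false if the work was rolled back by compensation
     */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step); // most recently completed step ends up on top
            } catch (Exception e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }
}
```

Unlike two-phase commit, other readers can observe the intermediate state between a step and its compensation. That visibility is the consistency given up in exchange for throughput.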

6. Global ID Generation

When databases are sharded, auto‑increment keys are no longer globally unique. Options include UUIDs (large storage and index overhead) or a dedicated SEQUENCE table:

CREATE TABLE `SEQUENCE` (
  `table_name` varchar(18) NOT NULL,
  `nextid` bigint(20) NOT NULL,
  PRIMARY KEY (`table_name`)
) ENGINE=InnoDB;

Applications fetch and increment the nextid per table, but the SEQUENCE table can become a bottleneck and a single point of failure. Alternatives such as Twitter’s Snowflake algorithm generate 64‑bit IDs using timestamp, datacenter ID, machine ID, and a per‑millisecond sequence.
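A Snowflake-style generator can be sketched as below. The bit widths (41-bit timestamp, 5-bit datacenter ID, 5-bit machine ID, 12-bit sequence) follow the commonly described layout. The epoch constant, the class name, and the injectable clock are assumptions made for this example and are not taken from Twitter's code.

```java
import java.util.function.LongSupplier;

/** Hypothetical Snowflake-style ID generator: timestamp | datacenter | worker | sequence. */
final class SnowflakeIdGenerator {
    private static final long EPOCH_MS = 1577836800000L; // 2020-01-01T00:00:00Z, arbitrary choice
    private static final int DATACENTER_BITS = 5;
    private static final int WORKER_BITS = 5;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long datacenterId;
    private final long workerId;
    private final LongSupplier clock; // millisecond clock, injectable for testing
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeIdGenerator(long datacenterId, long workerId, LongSupplier clock) {
        if (datacenterId < 0 || datacenterId >= (1L << DATACENTER_BITS)) {
            throw new IllegalArgumentException("datacenterId out of range");
        }
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId out of range");
        }
        this.datacenterId = datacenterId;
        this.workerId = workerId;
        this.clock = clock;
    }

    synchronized long nextId() {
        long now = clock.getAsLong();
        if (now < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: spin until the clock advances.
                while ((now = clock.getAsLong()) <= lastTimestamp) {
                    Thread.onSpinWait();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH_MS) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (datacenterId << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

In production the clock would be `System::currentTimeMillis`. Each process needs a unique (datacenter, worker) pair, which must be assigned outside the generator.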

7. Cross‑Shard Sorting and Pagination

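When the sort key is not the sharding key, each shard must sort its own subset and return the results, and the application then merges and re-sorts them. The cost grows with page depth, because every shard has to return all rows up to the end of the requested page. Practical pagination strategies include limiting the number of visible pages, increasing the page size for batch jobs, or offloading aggregation to a big-data platform.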
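The merge step can be sketched as a k-way merge over the per-shard results. The sketch assumes each shard has already returned its first `offset + limit` rows in ascending order. The class name and the use of plain `Long` values as rows are simplifications for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical application-side merge for cross-shard ORDER BY ... LIMIT offset, limit. */
final class ShardPager {

    /**
     * @param sortedPerShard one ascending list per shard, each holding that shard's
     *                       first (offset + limit) rows
     * @return the rows at global positions [offset, offset + limit)
     */
    static List<Long> page(List<List<Long>> sortedPerShard, int offset, int limit) {
        // Heap entries are {value, shardIndex, positionInShard}, ordered by value.
        PriorityQueue<long[]> heap =
                new PriorityQueue<>(Comparator.comparingLong((long[] entry) -> entry[0]));
        for (int s = 0; s < sortedPerShard.size(); s++) {
            if (!sortedPerShard.get(s).isEmpty()) {
                heap.add(new long[] {sortedPerShard.get(s).get(0), s, 0});
            }
        }

        List<Long> result = new ArrayList<>();
        int skipped = 0;
        while (!heap.isEmpty() && result.size() < limit) {
            long[] head = heap.poll();
            if (skipped < offset) {
                skipped++; // rows that belong to earlier pages
            } else {
                result.add(head[0]);
            }
            // Refill the heap with the next row from the shard the head came from.
            List<Long> shard = sortedPerShard.get((int) head[1]);
            int next = (int) head[2] + 1;
            if (next < shard.size()) {
                heap.add(new long[] {shard.get(next), head[1], next});
            }
        }
        return result;
    }
}
```

Because every shard contributes `offset + limit` rows, the work for page 1,000 is far larger than for page 1.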

8. Sharding Strategies

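Two methods are common: range-based (e.g., user IDs 1-9999 go to shard 1) and modulo-based (user ID % n determines the shard). Range sharding can grow by adding a shard for each new ID range without moving existing rows, but newly created IDs concentrate on the latest shard. Modulo sharding spreads load evenly, but changing n remaps existing rows: doubling the shard count moves about half of them, and an arbitrary change in n moves most of them.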
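Both routing rules fit in a few lines. In this sketch the class and method names are hypothetical, and shard indexes are zero-based, so the document's "shard 1" corresponds to index 0.

```java
/** Hypothetical shard router showing range-based and modulo-based rules. */
final class ShardRouter {

    /** Range-based: ids 1..rangeSize map to index 0, the next rangeSize ids to index 1, and so on. */
    static int byRange(long userId, long rangeSize) {
        if (userId < 1 || rangeSize < 1) {
            throw new IllegalArgumentException("userId and rangeSize must be positive");
        }
        return (int) ((userId - 1) / rangeSize);
    }

    /** Modulo-based: the shard index is userId % shardCount, kept non-negative by floorMod. */
    static int byModulo(long userId, int shardCount) {
        if (shardCount < 1) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        return (int) Math.floorMod(userId, (long) shardCount);
    }
}
```

With a range size of 9,999, user 9,999 lands on index 0 and user 10,000 on index 1. With four modulo shards, user 10 lands on index 2.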

9. Determining Shard Count

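Shard count depends on how many records a single shard can hold comfortably. Commonly cited rules of thumb are about 50 million rows for MySQL and about 100 million for Oracle, though the real limit depends on row size, indexes, and hardware. More shards reduce per-shard load but increase cross-shard query complexity and hardware costs. Initial recommendations often suggest 4-8 shards.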
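As a back-of-the-envelope sketch, the shard count can be estimated by dividing the expected row count by the per-shard limit and rounding up. Rounding the result up to a power of two is an extra assumption made here, so that later doublings stay compatible with modulo routing. The class and method names are hypothetical.

```java
/** Hypothetical estimator for an initial shard count. */
final class ShardCountEstimator {

    /**
     * @param expectedRows     rows expected over the planning horizon
     * @param maxRowsPerShard  rows one shard is expected to hold comfortably
     * @return the smallest power of two that keeps every shard within the limit
     */
    static int estimate(long expectedRows, long maxRowsPerShard) {
        if (expectedRows < 0 || maxRowsPerShard < 1) {
            throw new IllegalArgumentException("invalid row counts");
        }
        // Ceiling division, with a minimum of one shard.
        long needed = Math.max(1, (expectedRows + maxRowsPerShard - 1) / maxRowsPerShard);
        // Round up to the next power of two (assumption: capacity will grow by doubling).
        long powerOfTwo = Long.highestOneBit(needed);
        if (powerOfTwo < needed) {
            powerOfTwo <<= 1;
        }
        return Math.toIntExact(powerOfTwo);
    }
}
```

For example, 300 million expected rows at 50 million rows per shard need 6 shards, which rounds up to 8.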

10. Routing Transparency

Sharding changes the physical layout of the database, so the data access layer (DAL) should handle routing and keep application code unchanged. Queries that carry the sharding key are routed automatically to a single shard. Queries that span shards are sent to each relevant shard, and the DAL aggregates the results.
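The idea can be sketched with a small DAL class whose callers never see a shard. Everything here is hypothetical: the class name, the in-memory maps that stand in for shard databases, and the modulo placement rule.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical DAL: callers see one logical user table, shard selection stays inside. */
final class ShardedUserDao {
    private final List<Map<Long, String>> shards = new ArrayList<>(); // userId -> name per shard

    ShardedUserDao(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    private Map<Long, String> shardFor(long userId) {
        return shards.get((int) Math.floorMod(userId, (long) shards.size()));
    }

    void save(long userId, String name) {
        shardFor(userId).put(userId, name);
    }

    /** The query carries the sharding key, so it is routed to exactly one shard. */
    Optional<String> findById(long userId) {
        return Optional.ofNullable(shardFor(userId).get(userId));
    }

    /** No sharding key: fan out to every shard and merge the results in the DAL. */
    List<Long> findIdsByName(String name) {
        List<Long> ids = new ArrayList<>();
        for (Map<Long, String> shard : shards) {
            for (Map.Entry<Long, String> row : shard.entrySet()) {
                if (row.getValue().equals(name)) {
                    ids.add(row.getKey());
                }
            }
        }
        Collections.sort(ids);
        return ids;
    }
}
```

Application code calls `findById` and `findIdsByName` the same way it would against a single database. Only the cost differs.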

11. Framework vs. Custom Development

Numerous sharding solutions exist (MySQL Proxy, Amoeba, Hibernate Shards, sharding‑jdbc, TSharding, Cobar Client, etc.). Selection should consider maturity, community support, performance, and project‑specific constraints; a cautious approach is advised when adopting heavyweight frameworks.
