When and How to Shard Databases: A Complete Guide to Partitioning, Algorithms, and Tools

This article explains why and when to apply database sharding, distinguishes between sharding databases, tables, or both, describes horizontal and vertical splitting, guides key selection and sharding algorithms, covers global ID generation, lists open‑source tools, and discusses the operational challenges introduced by sharding.

dbaplus Community

Nov 18, 2022

1. What is sharding

Sharding (database partitioning) comprises three related techniques: sharding databases only, sharding tables only, and sharding both databases and tables. They address high concurrency, large data volume, or a combination of both.

2. When to shard databases

Database sharding is appropriate when read/write QPS exceeds the capacity of a single instance, causing connection‑pool exhaustion. Adding more database instances increases the number of available connections and improves throughput. Typical scenarios include micro‑service decomposition (each service gets its own database) and moving historical orders to an archive database.

3. When to shard tables

Table sharding is used when a single table grows so large that query performance degrades even with modest concurrency. A commonly cited rule of thumb is roughly 5 million rows or 2 GB, although the real limit depends on row width, indexes and hardware. Splitting the table reduces the row count per physical table and speeds up queries.

4. When to shard both databases and tables

If a system suffers simultaneously from high concurrency and massive data volume, both database and table sharding are required. This situation is common in large‑scale e‑commerce platforms.
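One way to combine both levels is to compute a single slot number from the sharding key and derive the database index and the table index from it. The sketch below assumes 4 databases with 8 tables each and uses hypothetical names such as order_db_1 and order_0002. None of these values come from the article; they only illustrate the arithmetic.

```python
DB_COUNT = 4        # assumed number of physical databases
TABLES_PER_DB = 8   # assumed number of tables inside each database


def locate(buyer_id):
    """Return the (database, table) pair that stores this buyer's orders."""
    slot = buyer_id % (DB_COUNT * TABLES_PER_DB)  # global slot, 0..31
    db_index = slot // TABLES_PER_DB              # which database holds the slot
    table_index = slot % TABLES_PER_DB            # which table inside that database
    return f"order_db_{db_index}", f"order_{table_index:04d}"


print(locate(10))
```

With these assumed sizes, buyer 10 lands in slot 10, which is table 2 of database 1. Changing either count remaps existing keys, so the sizes are usually fixed up front.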

5. Horizontal vs. vertical splitting

Horizontal (row) splitting distributes rows of a logical table across many physical tables, reducing the number of rows per table. Vertical (column) splitting moves groups of columns into separate tables, reducing the column count per table. Vertical splitting also includes separating business domains into distinct databases.
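The difference between the two directions can be shown on a handful of in-memory rows. This is only an illustrative sketch: the ORDERS data, the column names and both helper functions are hypothetical, and a real split would of course happen in the database schema rather than in application lists.

```python
ORDERS = [
    {"order_id": 1, "buyer_id": 7, "amount": 30, "remark": "gift wrap"},
    {"order_id": 2, "buyer_id": 8, "amount": 12, "remark": ""},
    {"order_id": 3, "buyer_id": 7, "amount": 99, "remark": "leave at door"},
    {"order_id": 4, "buyer_id": 9, "amount": 45, "remark": ""},
]


def split_horizontally(rows, table_count):
    """Every table keeps all columns; each row goes to exactly one table."""
    tables = [[] for _ in range(table_count)]
    for row in rows:
        tables[row["order_id"] % table_count].append(row)
    return tables


def split_vertically(rows, key, hot_columns):
    """Every table keeps all rows; columns are divided and both parts keep the key."""
    hot = [{c: r[c] for c in [key, *hot_columns]} for r in rows]
    cold = [{c: v for c, v in r.items() if c == key or c not in hot_columns}
            for r in rows]
    return hot, cold
```

After a horizontal split each physical table has fewer rows; after a vertical split each table has fewer columns, and the two parts are joined back on the shared key when needed.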

6. Choosing sharding keys

The sharding key determines how data is routed. Common keys include buyer ID, seller ID, timestamp and region. Buyer ID usually avoids hotspot tables because a single buyer generates limited traffic, whereas a large seller can create a hotspot.

Buyer ID – distributes orders evenly, avoids hot tables.

Seller ID – can cause data skew if a seller has many orders.

Routing can be implemented by taking the key modulo a fixed number (e.g., 1024) or by hashing the key and then taking the modulo. The result maps to a specific physical table such as order_0002.
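The routing rule described above can be sketched in a few lines. The function names are hypothetical, and CRC32 is used here only as one example of a hash that stays stable across processes (Python's built-in hash() for strings does not); the article does not prescribe a particular hash function.

```python
import zlib

TABLE_COUNT = 1024  # the fixed modulus used as the example in the text


def table_for_numeric_key(key):
    """Direct modulo: suitable when the key is already an integer."""
    return f"order_{key % TABLE_COUNT:04d}"


def table_for_string_key(key):
    """Hash first, then modulo: suitable for string keys."""
    hashed = zlib.crc32(key.encode("utf-8"))  # unsigned 32-bit integer
    return f"order_{hashed % TABLE_COUNT:04d}"


print(table_for_numeric_key(2))
```

Because 1026 % 1024 is also 2, keys 2 and 1026 share the same physical table, which is the intended behavior of modulo routing.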

7. Sharding algorithms

Common algorithms include:

Direct modulo – integer % N.

Hash modulo – hash(key) % N, useful for string keys.

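Consistent hashing – maps both keys and nodes onto a ring of 2^32 hash positions; each key belongs to the first node found clockwise from its position, so adding or removing a node moves only the keys adjacent to that node.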
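Consistent hashing is the least obvious of the three, so a small sketch may help. It assumes SHA-256 truncated to 32 bits as the ring hash and 100 virtual nodes per physical node; both are illustrative choices rather than anything the article specifies, and the class and node names are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32


def ring_position(value):
    """Map a string to a position on the ring, in the range 0 .. 2^32 - 1."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class ConsistentHashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node (assumed value)
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            if pos in self._owner:
                continue  # rare collision: the earlier owner keeps the position
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        if not self._positions:
            raise LookupError("ring has no nodes")
        # First node position clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]
```

When a fourth node joins a three-node ring built this way, the only keys that change owner are the ones the new node takes over. With plain hash(key) % N, changing N would remap most keys instead.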

8. Global ID generation

Sharding breaks the uniqueness of auto‑increment IDs, so a globally unique identifier is required. Options:

UUID – globally unique but long, string‑based and slower to query.

Centralized auto‑increment table – guarantees uniqueness but creates a single‑point bottleneck.

Multiple tables with step ranges – each instance gets a distinct numeric range (e.g., 1000‑1999, 2000‑2999) to avoid collisions.

Snowflake algorithm – 64-bit ID composed of a 1-bit sign, a 41-bit millisecond timestamp, a 10-bit worker ID (5-bit data-center ID + 5-bit node ID) and a 12-bit sequence. Each node can generate up to 4,096 IDs per millisecond and the layout supports 1,024 nodes, for a cluster-wide ceiling of 4,194,304 IDs per millisecond. The IDs are roughly time-ordered and can be generated at high throughput.
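The Snowflake bit layout translates almost directly into shift-and-or operations. The sketch below is a minimal, single-process illustration under stated assumptions: EPOCH_MS is an arbitrary custom epoch chosen for this example, the 10-bit worker ID is treated as one field rather than split into data-center and node parts, and the class name and injectable clock parameter are hypothetical. A production generator also needs a way to assign unique worker IDs and a policy for clock rollback.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # assumed custom epoch (ms); any fixed past instant works
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1   # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._clock = clock            # returns the current time in milliseconds
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 values used in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS)
            return timestamp_part | (self.worker_id << SEQUENCE_BITS) | self._sequence
```

Since the timestamp occupies the high bits, IDs from one worker increase over time, and the worker ID field keeps IDs from different workers distinct even within the same millisecond.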

9. Sharding tools (Java)

Popular open‑source frameworks for implementing sharding, read/write splitting and dynamic data sources:

ShardingSphere (grew out of Sharding-JDBC) – its ShardingSphere-JDBC module is a lightweight JDBC-level sharding library. Site: https://shardingsphere.apache.org

TDDL – Alibaba middleware offering sharding, read/write splitting and dynamic data sources. Repository: https://github.com/alibaba/tb_tddl

Mycat – distributed relational middleware supporting the MySQL protocol and multiple back-ends. Repository: https://github.com/MyCATApache/Mycat2

10. Problems introduced by sharding

After sharding, queries should include the sharding key so that they can be routed to a single shard; a query without it has to be sent to every physical table and the partial results merged. A local transaction cannot span databases, so cross-database writes need a distributed-transaction mechanism or an eventual-consistency design, which makes consistency harder to guarantee. Pagination, sorting and other operations that rely on a single ordered result set require additional merge logic and become expensive for deep pages. These operational costs should be weighed against the performance gains.
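Pagination shows why the merge logic is costly. To return one globally sorted page, every shard has to supply its first offset + limit rows, because in the worst case all rows before and inside the page live on a single shard. The following sketch models shards as in-memory lists; the function name and parameters are hypothetical, and sharding middleware typically performs a comparable merge internally.

```python
import heapq
from itertools import islice


def paginate_across_shards(shards, offset, limit, sort_key):
    """Return rows [offset, offset + limit) of the globally sorted result."""
    # Per shard: the equivalent of ORDER BY sort_key LIMIT offset + limit.
    per_shard = [sorted(rows, key=sort_key)[: offset + limit] for rows in shards]
    # Merge the already-sorted partial results, then skip to the requested page.
    merged = heapq.merge(*per_shard, key=sort_key)
    return list(islice(merged, offset, offset + limit))


shards = [[5, 1, 9], [2, 8], [7, 3, 4, 6]]
print(paginate_across_shards(shards, offset=3, limit=3, sort_key=lambda x: x))
```

The amount of data fetched grows with the offset and with the number of shards, which is why deep pages are often replaced by cursor-style queries that filter on the last seen sort value.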

11. Summary

Sharding solves high-concurrency and large-volume problems by partitioning databases and/or tables. Selecting the appropriate split type (horizontal vs. vertical), sharding key, and algorithm is critical to avoid data skew. Globally unique ID strategies such as Snowflake are required to maintain uniqueness across shards. Open-source frameworks (ShardingSphere, TDDL, Mycat) provide ready-made implementations, but sharding introduces operational overhead: queries need the sharding key, local transactions no longer span databases, and pagination and sorting become more complicated. For that reason it should be considered only after other optimization techniques have been exhausted.

