Databases 28 min read

Mastering MySQL Sharding: Strategies, Pitfalls, and Best Practices

This comprehensive guide explains why MySQL sharding (分库分表) is essential for handling massive data growth, outlines various horizontal and vertical partitioning strategies, discusses common challenges such as hot‑spot, cross‑database queries, pagination, and transaction handling, and compares popular middleware solutions.

Su San Talks Tech

Feb 21, 2024

Mastering MySQL Sharding: Strategies, Pitfalls, and Best Practices

Keywords: sharding, high performance, MySQL database

Article Overview

Background Introduction

With the rapid development of Internet technology, data volume grows explosively. In high‑traffic scenarios, the database often becomes the performance bottleneck, causing slow queries and long response times that degrade user experience. To address this, sharding (分库分表) technology was created, distributing data across multiple databases or tables to improve processing capacity and stability.

Scenario Analysis

For example, a transaction system core database may include the following logical databases:

Product Database : Stores detailed information of tradable products or assets (e.g., stock code, name, current price).

Order Database : Stores user‑submitted orders, including order ID, status, creation time, etc.

User Database : Stores basic user information, encrypted passwords, contact details, roles, and permissions.

Transaction Database : Records all transaction details such as transaction ID, type, amount, time, and counterpart parties.

Configuration Database : Stores system configuration like trading rules, fee settings, and parameters.

Historical Data Database : Keeps historical records of trades, orders, and prices for analytics and reporting.

Account Database : Stores user account balances, types, and status.

Security and Audit Database : Logs security‑related events and audit records to ensure traceability.

From the analysis above, the corresponding tables can be roughly categorized as:

Configuration tables: product specifications, data dictionaries, system parameters, fee items, etc.

Transaction tables: order data, transaction logs, etc.

Log tables: application logs, user operation logs, exception logs, access logs, etc.

User tables: user registration, login, etc.

Thought

Typically, log tables, transaction tables, and user tables may experience data surges and performance issues, while configuration tables usually remain small.

Sharding and Partitioning

What is sharding?

Sharding is a database architecture optimization technique that applies the divide‑and‑conquer principle. By splitting data into multiple databases or tables, system performance and stability are improved. Common sharding strategies include horizontal sharding, vertical sharding, horizontal partitioning, and vertical partitioning.

Using an order database db_order and order table tb_order as examples:

Horizontal Database Sharding : Based on a rule (e.g., order‑ID range), db_order is split into db_order_1, db_order_2, db_order_3, etc. Each shard has the same schema but stores different rows.

Horizontal Table Sharding : Based on order creation time, tb_order is split into tb_order_2022, tb_order_2023, tb_order_2024, etc., each storing a time‑range of orders.

Vertical Database Sharding : Data is divided by business domain. Order‑related tables stay in db_order, user‑related tables move to db_user, product‑related tables to db_product, etc.

Vertical Table Sharding : If tb_order has many columns, it can be split into tb_order_base (basic info), tb_order_attrs (attribute info), tb_order_charges (fee info), etc.

Summary

In short, sharding follows the business domain principle for database division, while this article focuses on table‑level partitioning.

Common Sharding Issues

When should a table be split?

Reference Rules

According to the Alibaba Java Development Manual, the following guidelines are recommended:

Engineering Experience

In practice, when a table reaches the ten‑million‑row scale, sharding is usually required. This aligns with InnoDB's page size of 16 KB.

mysql> show global status like '%Innodb_page_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_page_size| 16384 |
+------------------+-------+
1 row in set (0.00 sec)

Index optimization is the first step for performance tuning. The underlying B+‑tree structure is illustrated below:

Image source: internet, rights reserved.

Assuming a bigint primary key (8 bytes) plus a 6‑byte pointer, a root node can hold about 1 170 entries (16 KB / (8+6)). The second level can hold roughly 1 170 × 1 170 ≈ 1.37 million entries, and the third level (leaf) can store about 16 rows per page, resulting in a total capacity of about 21.9 million rows—consistent with the ten‑million‑row rule of thumb.

Therefore, when a MySQL table exceeds ten million rows, splitting is advisable.

Choosing a Sharding Key

Typical candidates for an order table include:

Order ID : Unique, ensures even data distribution, avoids hotspots.

User ID : Unique; works well if each user's order volume is roughly equal, but can cause imbalance if some users generate many orders.

Order Creation Time : Suitable for time‑range queries, but may create hotspot spikes during promotional periods.

Primary Key Strategies

When sharding, auto‑increment keys need special handling to avoid collisions across shards. Common solutions:

UUID : Globally unique 128‑bit identifier; however, large size and lack of order can affect performance.

Snowflake Algorithm : 64‑bit distributed ID generated from timestamp, datacenter ID, worker ID, and sequence number.

Distributed Auto‑Increment Generators : Use Snowflake, Alibaba Druid, etc., to generate globally unique IDs.

Auto‑Increment with Shard‑Specific Offsets : MySQL 8.0 variables AUTO_INCREMENT_INCREMENT and AUTO_INCREMENT_OFFSET adjust step and start values per shard.

Choosing a Table Partitioning Strategy

For an order table, common strategies are:

Range Partitioning : Split by order ID ranges (e.g., 1‑10 M → tb_order_01, 10‑20 M → tb_order_02, …). Good for sequential IDs and range queries.

Hash Partitioning : Apply a hash function to the order ID and store the result modulo the number of tables. Provides uniform load distribution but complicates expansion.

Mapping Table : Maintain a lookup table that maps each order ID to a physical table, offering flexible data placement at the cost of an extra lookup.

Consistent Hashing : Map IDs onto a hash ring; adding or removing nodes only moves a small portion of data. Suitable for high‑availability scenarios.

Non‑Shard‑Key Queries

When the query field is not the shard key, common solutions include:

Global Index : A cross‑shard index that stores the non‑shard field together with the shard key.

Data Redundancy : Duplicate the non‑shard field in each shard for fast lookup.

Application‑Level Aggregation : Query each shard separately and merge results in the application.

Elasticsearch : Sync non‑shard fields to ES for full‑text and complex queries.

Database Middleware : Use middleware (e.g., ShardingSphere, MyCAT) that can route and merge results automatically.

Cross‑Database Join Issues

Strategies to handle joins across shards:

Field redundancy (broadcast fields).

Data synchronization (ETL to local tables).

Global (broadcast) tables.

Binding tables (store related tables together).

Distributed middleware that rewrites and merges join results.

Application‑level data aggregation.

Sorting and Pagination After Sharding

Sorting can be addressed by:

Introducing a global sort field.

Fetching data from each shard and merging in the application.

Using middleware that supports distributed sorting.

Pre‑computing and caching sorted results.

Pagination can be handled by:

Replacing LIMIT OFFSET with cursor‑based or range‑based pagination.

Aggregating data from shards in the application.

Leveraging middleware pagination support.

Limiting deep pagination depth.

Pre‑loading and caching frequently accessed pages.

Sharding Expansion

When data grows, expansion involves:

Assessing data growth trends.

Choosing an appropriate shard key that distributes data evenly.

Planning the expansion timeline and target scale.

Migrating data to new shards using tools or scripts while ensuring consistency.

Balancing load across old and new shards, possibly with load balancers or middleware.

Continuous monitoring and tuning (indexes, query optimization).

Sharding Middleware Comparison

Two widely used open‑source solutions are ShardingSphere and MyCAT.

ShardingSphere : Provides data sharding, read/write splitting, and multi‑datasource integration via Sharding‑JDBC, Sharding‑Proxy, and upcoming Sharding‑Sidecar. Advantages: rich features, active community, multi‑DB support. Drawbacks: extra adaptation for non‑Java apps, potential performance overhead for complex SQL.

MyCAT : MySQL‑protocol‑compatible middleware offering SQL parsing, routing, rewriting, and result merging. Supports read/write separation, auto‑failover, load balancing, multi‑tenant mode, and global sequences. Advantages: seamless MySQL integration, flexible routing, good monitoring tools. Drawbacks: limited to MySQL, some constraints on complex SQL.

Distributed Transaction Challenges

After sharding, ensuring transaction consistency across multiple databases requires distributed transaction solutions. Refer to related articles on distributed transaction theory, selection of solutions, and Seata as an all‑in‑one approach.

Conclusion

Sharding Strategies

Horizontal sharding by key.

Vertical sharding by column.

Read/write separation.

Common Problems

Data consistency across shards.

Cross‑shard queries.

Distributed transaction handling.

Choosing the right middleware.

For scenarios where sharding is not suitable or to simplify distributed database management, TiDB can be considered as an alternative.

Reference link: https://mp.weixin.qq.com/s/SuJ-XCaVegVunOIf69-9AQ

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance mysql database scaling partitioning

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.