Mastering Database Sharding: When and How to Split Databases and Tables
Learn why large-scale internet applications need database sharding, understand the evolution from single databases to split databases and tables, and follow practical steps for assessing, planning, and implementing horizontal and vertical sharding, including sizing calculations, key selection, and query challenges.
Background
Database sharding (分库分表, literally "split databases, split tables") is a standard way to scale large internet applications. When a single instance cannot keep up with growing data volume and QPS, vertical scaling (moving to a bigger machine) becomes expensive and eventually hits a ceiling, so teams scale horizontally by splitting data across multiple databases.
What Is Sharding?
Sharding splits one large logical database into several smaller databases (分库, database splitting) and optionally splits large tables within each database into multiple smaller tables (分表, table splitting). The shards may reside on different physical servers or on separate instances on the same server.
Evolution of Sharding
Initially a single database handles all traffic. As QPS rises, read replicas are added to offload reads. When write pressure on the master grows, two bottlenecks appear:
Oversized tables that degrade query performance.
Overall IOPS saturation that cannot be relieved by read replicas.
Table‑level partitioning (by time, hot/cold data) can reduce index depth, but when IOPS dominates the only viable solution is to shard the database.
Table partitioning divides a large table into smaller partitions that the database stores separately while still exposing one logical table. Partitioning improves access efficiency when queries filter on the partition key. Queries that omit the key cannot be pruned and must touch every partition, which can widen locking to the whole table.
Preparation Phase
Assessing the Need for Sharding
Choose between sharding only tables, only databases, or both based on current characteristics:
Large single‑table size with modest QPS → split tables.
High IOPS, high QPS, insufficient connections → split databases.
Very large overall data volume and connection limits → split both.
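The three rules above can be read as a small decision function. The sketch below is only an illustration: the function name and the two boolean inputs are hypothetical, and deciding when a table counts as "too large" or an instance as "saturated" is left to the reader's own measurements.

```python
def sharding_recommendation(table_too_large: bool,
                            io_or_connections_saturated: bool) -> str:
    """Map the two symptoms from the list above to a sharding type.

    Both flags are hypothetical inputs; the thresholds behind them
    are not specified by the article.
    """
    if table_too_large and io_or_connections_saturated:
        return "split databases and tables"
    if io_or_connections_saturated:
        return "split databases"
    if table_too_large:
        return "split tables"
    return "no sharding needed"


print(sharding_recommendation(table_too_large=True,
                              io_or_connections_saturated=False))
```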
Determining the Sharding Plan
After the sharding type is fixed, decide the number of tables and databases required for the projected data growth and peak concurrency.
Number of Tables – Example Calculation
Assume an order system expects 100 000 orders per day and will run for 5 years.
Estimate total rows: 100,000 × 365 × 5 = 182,500,000 rows.
Assume a single table can comfortably hold 5,000,000 rows.
Required tables = 182,500,000 / 5,000,000 = 36.5, so at least 37 tables are needed. Choosing 64, the next power of two, leaves headroom for growth.
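The same estimate can be reproduced in a few lines of Python. This is a minimal sketch: the function name is hypothetical, and padding the result to the next power of two is an assumption about why 64 was chosen, not something the sizing itself requires.

```python
import math


def tables_needed(rows_per_day: int, years: int, rows_per_table: int) -> tuple[int, int]:
    """Return (minimum table count, count padded to the next power of two)."""
    total_rows = rows_per_day * 365 * years
    minimum = math.ceil(total_rows / rows_per_table)
    # Assumption: pad to the next power of two to leave headroom.
    padded = 1 << (minimum - 1).bit_length()
    return minimum, padded


print(tables_needed(rows_per_day=100_000, years=5, rows_per_table=5_000_000))
# prints (37, 64)
```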
Number of Databases – Concurrency Estimate
Peak QPS during holidays may reach 6,000. If the average query latency is 0.2 s, each connection can complete 1 / 0.2 = 5 queries per second. Required simultaneous connections = 6,000 / 5 = 1,200. If a single MySQL instance is capped at 500 connections, at least 1,200 / 500 = 2.4, i.e. 3, instances are needed; rounding up to 4 databases leaves headroom.
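A short sketch of the connection arithmetic, using the figures from this example. The function name is hypothetical, and the power-of-two padding is again only an assumption about how the final figure of 4 is reached.

```python
import math


def databases_needed(peak_qps: int, avg_latency_s: float,
                     max_connections_per_instance: int) -> tuple[int, int, int]:
    """Return (connections required, minimum instances, padded instance count)."""
    # One connection completes 1 / latency queries per second.
    queries_per_connection = 1 / avg_latency_s
    connections = math.ceil(peak_qps / queries_per_connection)
    minimum = math.ceil(connections / max_connections_per_instance)
    # Assumption: pad to the next power of two for headroom.
    padded = 1 << (minimum - 1).bit_length()
    return connections, minimum, padded


print(databases_needed(peak_qps=6_000, avg_latency_s=0.2,
                       max_connections_per_instance=500))
# prints (1200, 3, 4)
```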
Sharding Dimensions
Horizontal Sharding
Horizontal sharding distributes the rows of a large table across multiple nodes based on a rule such as range, hash, or round‑robin. This reduces the load on any single node and limits the impact of a node failure to the rows that node hosts.
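Two of the rules mentioned above, hash and range, can be sketched as routing functions. The names and boundary values are illustrative assumptions; `zlib.crc32` is used only because it is stable across processes, not because the article prescribes it.

```python
import bisect
import zlib


def hash_shard(key: str, shard_count: int) -> int:
    """Hash routing: spread keys evenly, at the cost of range locality."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def range_shard(value: int, upper_bounds: list[int]) -> int:
    """Range routing.

    upper_bounds holds the sorted, exclusive upper bound of each shard.
    Values at or beyond the last bound fall into one extra overflow shard,
    so n bounds describe n + 1 shards.
    """
    return bisect.bisect_right(upper_bounds, value)


bounds = [1_000_000, 2_000_000]            # hypothetical id boundaries
print(range_shard(999_999, bounds))        # shard 0
print(range_shard(1_000_000, bounds))      # shard 1
print(range_shard(5_000_000, bounds))      # shard 2 (overflow)
print(hash_shard("user-42", 4) in range(4))
```

Hash routing balances load but forces range scans to visit every shard; range routing keeps neighbouring values together but can concentrate writes on the newest shard.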
Vertical Sharding
Vertical sharding splits data by business domain or by column rather than by row. Tables that belong to different domains (e.g., user info, order info, product info) move to separate databases, and a wide table can be split so that frequently updated columns live in a dedicated table, which keeps those writes from locking the rest of the data.
Sharding Scheme
Choosing the Sharding Key
The sharding key determines how data is distributed. For consumer‑facing (C‑end) services, an auto‑incremented user_id taken modulo the shard count often yields a uniform distribution. For business‑facing (B‑end) systems such as ERP, recent data matters most, so creation_time can serve as the sharding key.
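As a sketch of how a user_id key might map to a physical location, the example below assumes 4 databases holding 16 tables each (64 tables in total). The layout and the `order_db_N` / `orders_NN` names are hypothetical.

```python
DB_COUNT = 4          # assumed number of databases
TABLES_PER_DB = 16    # assumed tables per database (64 in total)


def locate(user_id: int) -> tuple[str, str]:
    """Return the (database, table) pair that stores rows for user_id."""
    slot = user_id % (DB_COUNT * TABLES_PER_DB)
    db_index = slot // TABLES_PER_DB
    table_index = slot % TABLES_PER_DB
    return f"order_db_{db_index}", f"orders_{table_index:02d}"


print(locate(130))   # ('order_db_0', 'orders_02')
print(locate(63))    # ('order_db_3', 'orders_15')
```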
Querying Across Dimensions
Queries that do not contain the sharding key require scanning multiple shards. A common mitigation is to embed the sharding key into the primary identifier. Example order‑id format:
YYMMDD (6‑digit date) + VV (2‑digit version) + UUUU (last 4 digits of user_id) + SSSSSSSS (8‑digit sequence)
By extracting the user‑id suffix, the target shard can be derived without extra lookups.
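The order‑id layout can be sketched as a pair of helpers. Function names are hypothetical. One assumption matters here: rows are routed by the 4‑digit suffix modulo the shard count, because the suffix agrees with the full user_id modulo the shard count only when the shard count divides 10,000.

```python
from datetime import date


def build_order_id(order_date: date, version: int, user_id: int, sequence: int) -> str:
    """Compose YYMMDD + VV + UUUU + SSSSSSSS into a 20-character id."""
    if not 0 <= version < 100 or not 0 <= sequence < 100_000_000:
        raise ValueError("version or sequence does not fit its field")
    return f"{order_date:%y%m%d}{version:02d}{user_id % 10_000:04d}{sequence:08d}"


def user_suffix(order_id: str) -> int:
    """Extract the last four digits of user_id embedded in the order id."""
    return int(order_id[8:12])


def shard_for_order(order_id: str, shard_count: int) -> int:
    # Assumption: routing uses the 4-digit suffix, not the full user_id.
    return user_suffix(order_id) % shard_count


oid = build_order_id(date(2024, 3, 15), version=1, user_id=123456789, sequence=42)
print(oid)                       # 24031501678900000042
print(shard_for_order(oid, 64))  # 5
```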
When additional query dimensions are needed (e.g., merchant_id, year, courier number), typical strategies include:
Synchronizing required data to a read‑only replica sharded by the alternate key.
Feeding data into a data warehouse for analytical queries.
Maintaining a mapping table (e.g., courier_number ↔ order_id) to resolve the shard before the main query.
These techniques preserve query performance while keeping the sharding layout simple.
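The mapping‑table strategy can be sketched with an in‑memory dictionary standing in for the real table. The class name, the shard count of 64, and the sample courier number are all hypothetical; in practice the mapping would live in its own table or a cache.

```python
from typing import Optional

SHARD_COUNT = 64  # assumed shard count


def shard_of_order(order_id: str) -> int:
    # Characters 8..11 hold the user_id suffix in the order-id format above.
    return int(order_id[8:12]) % SHARD_COUNT


class CourierMapping:
    """In-memory stand-in for a courier_number -> order_id mapping table."""

    def __init__(self) -> None:
        self._order_id_by_courier: dict[str, str] = {}

    def record(self, courier_number: str, order_id: str) -> None:
        self._order_id_by_courier[courier_number] = order_id

    def resolve_shard(self, courier_number: str) -> Optional[int]:
        """Return the shard holding the order, or None if it is unknown."""
        order_id = self._order_id_by_courier.get(courier_number)
        return None if order_id is None else shard_of_order(order_id)


mapping = CourierMapping()
mapping.record("SF1001", "24031501678900000042")
print(mapping.resolve_shard("SF1001"))   # 5
print(mapping.resolve_shard("unknown"))  # None
```

The lookup adds one extra read per query, but it turns a scan over every shard into a single‑shard query.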
ITPUB
