Mastering Database Sharding: Solving Read Amplification and Partition Strategies
This article explains database sharding concepts, compares vertical and horizontal partitioning, details ID‑range and modulo sharding methods, discusses the read‑amplification issue caused by non‑shard queries, and presents solutions such as auxiliary index tables, Elasticsearch integration, and TiDB as alternatives.
Sharding Basics
When a single table grows beyond roughly 20 million rows, the B+ tree depth increases, leading to more disk I/O and slower queries. To handle large data, tables are split into multiple tables, either vertically (splitting columns) or horizontally (splitting rows).
Horizontal Partitioning (Sharding)
Horizontal sharding creates many small tables such as user_0, user_1 … user_N, each storing a subset of rows. Typical shard size is 500k–2M rows per table.
Sharding by ID Range
Assuming each shard holds 2M rows, IDs 1‑2M go to user_0, 2M+1‑4M to user_1, and so on. The routing logic calculates the shard by integer division of the ID.
Sharding by Modulo
Another common method uses ID % N to select a shard. For example, ID 31 with 5 shards maps to user_1 because 31%5=1. This approach is simple but makes scaling the number of shards difficult, requiring data migration.
Combining Range and Modulo
Range sharding provides stable shard allocation, while modulo distributes writes evenly. By applying modulo within a range—e.g., IDs 2M‑4M are further split into five sub‑shards—the write hotspot problem is reduced.
Read Amplification Problem
When queries use non‑shard columns (e.g., name), the system must query all shards concurrently, leading to many duplicate queries. As the number of shards grows, the read amplification cost increases.
select * from user where name = "小白";Solution: Auxiliary Index Table
Create a separate table that stores only the primary key and the indexed column (e.g., name). Queries first retrieve the IDs from this table, then fetch the rows from the original shards, reducing the number of shards accessed.
Alternative: Use Search Engine or Distributed DB
Integrating Elasticsearch allows real‑time queries with inverted indexes, while TiDB provides built‑in range sharding and secondary index support, eliminating many of the read‑amplification issues.
Summary
Large single tables degrade performance; horizontal sharding is needed.
Select a shard key (usually the primary key) and shard by ID range or modulo.
Non‑shard queries cause read amplification; auxiliary index tables or external search engines can mitigate it.
TiDB or Elasticsearch are viable alternatives for scalable querying.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
