Common Practices for Database Sharding and Pagination Optimization
This article explains typical sharding strategies—hash, range, and middle‑table—describes SDK, Proxy, and Sidecar middleware forms, and presents several pagination optimization techniques such as global query rewriting, average pagination, disabling cross‑page queries, secondary queries, and using intermediate tables to improve performance in distributed databases.
The article introduces common sharding methods used in distributed databases: hash sharding (e.g., using a numeric key and modulo), range sharding (splitting by intervals such as [0,1000) and [1000,2000) ), and the middle‑table approach that records the mapping between primary keys and target tables.
It then outlines three middleware deployment models: SDK form (e.g., adding ShardingSphere as a Java dependency), Proxy form (an independent service that receives all queries and forwards them to the appropriate shards), and Sidecar form (a theoretical companion service that runs alongside the application).
For interview preparation, the article suggests focusing on how a company solves pagination in sharding, potential performance problems caused by sorting or pagination, and monitoring metrics such as response time, memory, and CPU usage.
In the global‑query approach, a paginated SQL like SELECT * FROM order_tab ORDER BY id LIMIT 4 OFFSET 2 is rewritten to LIMIT 4+2 OFFSET 0 and sent to every shard; the results are then merged and globally sorted in memory to obtain the final page.
The performance drawbacks of this method include increased network traffic, memory consumption for sorting, and CPU overhead. The article recommends using merge sort to mitigate these issues.
Optimization scheme 1 – average pagination – queries each shard with adjusted limits (e.g., SELECT * FROM order_tab_0 ORDER BY id LIMIT 2 OFFSET 1 and SELECT * FROM order_tab_1 ORDER BY id LIMIT 2 OFFSET 1 ) and merges the results, providing an approximate answer suitable for scenarios where exact ordering is not critical.
Optimization scheme 2 – disabling cross‑page queries – restricts navigation to sequential pages and adds a condition based on the previous page’s maximum id, e.g., SELECT * FROM order_tab WHERE id > max_id ORDER BY id LIMIT 50 OFFSET 0 , eliminating large offsets.
Optimization scheme 3 – switching middleware – suggests using NoSQL solutions such as Elasticsearch or ClickHouse, or distributed relational databases, to offload pagination complexity to the storage layer.
Optimization scheme 4 – secondary query – performs a first query with distributed offsets, identifies the smallest id, then issues a second BETWEEN query (e.g., SELECT * FROM order_tab_0 WHERE id BETWEEN 4 AND 12 ) to compute the exact global offset before fetching the final page.
Optimization scheme 5 – intermediate table – stores sorting columns in a separate table; writes are duplicated to the main and intermediate tables (or synchronized via binlog tools like Canal) to keep them consistent, allowing fast look‑ups for ordered pagination.
The article concludes with a simplified step‑by‑step summary of these techniques and emphasizes choosing the appropriate method based on performance requirements and business constraints.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.