15 Must‑Know Interview Questions on Database Sharding and Partitioning
This article explains why and when to split databases and tables, how to choose a sharding key, and how the main sharding strategies (range, hash modulo, and consistent hashing) compare. It also covers cross-node joins, pagination, distributed IDs, middleware choices, and a step-by-step zero-downtime migration.
Why We Need Database Sharding and Table Partitioning
When business volume surges, a single MySQL instance can hit bottlenecks in disk capacity and in the number of concurrent connections it can serve, which surfaces as errors such as "Too many connections". Microservice architectures often split modules into separate databases to spread the read/write load.
Large tables also slow queries down: as the row count grows, the B+-tree index gains levels, and each extra level can cost another page read per lookup. InnoDB stores data in 16 KB pages. Under the usual estimate (rows of about 1 KB and a BIGINT primary key), a B+-tree of height 2 holds about 18,720 rows and one of height 3 holds around 21.9 million rows. Beyond that the tree needs a fourth level, and queries become noticeably slower.
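The two row counts above can be reproduced with a short calculation. The sketch below is only a back-of-the-envelope estimate: the 1 KB row size, the 8-byte key, and the 6-byte child pointer are assumptions that the article does not state, and page headers and fill factor are ignored.

```java
// Back-of-the-envelope B+-tree capacity estimate for InnoDB.
// Assumptions (not from the article): ~1 KB rows, BIGINT primary key,
// 6-byte child pointers, page headers and fill factor ignored.
final class BTreeCapacityEstimate {
    static final int PAGE_BYTES = 16 * 1024; // InnoDB default page size
    static final int ROW_BYTES = 1024;       // assumed average row size
    static final int KEY_BYTES = 8;          // BIGINT primary key
    static final int POINTER_BYTES = 6;      // assumed child pointer size

    /** Rows reachable through a tree of the given height (height >= 1). */
    static long rowsForHeight(int height) {
        long rowsPerLeaf = PAGE_BYTES / ROW_BYTES;               // 16
        long fanOut = PAGE_BYTES / (KEY_BYTES + POINTER_BYTES);  // 1170
        long leaves = 1;
        for (int level = 1; level < height; level++) {
            leaves *= fanOut; // each internal level multiplies the leaf count
        }
        return leaves * rowsPerLeaf;
    }

    public static void main(String[] args) {
        System.out.println(rowsForHeight(2)); // 18720
        System.out.println(rowsForHeight(3)); // 21902400
    }
}
```

With wider rows the leaf pages hold fewer records, so the height-3 ceiling arrives much earlier than 21.9 million rows.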
When to Consider Sharding
For MySQL InnoDB, a single table can technically hold on the order of a billion rows, but performance deteriorates long before that. Alibaba's Java Development Manual recommends sharding only when a table exceeds 5 million rows or 2 GB in size. The same guidance advises against sharding up front: if you estimate the table will not reach that size within three years, sharding is probably unnecessary.
Choosing a Sharding Key
The sharding key determines how data is split. It should reflect the main business entity, for example the customer number for a customer-information table, so that related records stay on the same shard and common queries do not have to be routed to every shard.
Querying with Non‑Sharding Keys
When a query must use a non‑sharding field (e.g., login by phone number while userId is the sharding key), common solutions include:
Full table scan across all shards (generally discouraged).
Synchronizing user data to Elasticsearch and querying there (recommended).
Deriving the sharding key from the non‑sharding field when possible (e.g., parsing customer ID from an order number).
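The third option can be sketched as follows. The order-number layout here (8-digit date, 10-digit zero-padded customer ID, 6-digit sequence) and the shard count are hypothetical, chosen only to show how a sharding key embedded in another field lets a query be routed to a single shard.

```java
// Sketch: route by a customer ID embedded in the order number.
// The 24-character layout and SHARD_COUNT are hypothetical.
final class OrderNumberRouting {
    static final int SHARD_COUNT = 4; // hypothetical number of shards

    /** Builds "yyyyMMdd" + 10-digit customer ID + 6-digit sequence. */
    static String buildOrderNumber(String yyyymmdd, long customerId, int sequence) {
        return String.format("%s%010d%06d", yyyymmdd, customerId, sequence);
    }

    /** Extracts the customer ID (characters 8..17) from an order number. */
    static long customerIdOf(String orderNumber) {
        if (orderNumber.length() != 24) {
            throw new IllegalArgumentException("unexpected order number: " + orderNumber);
        }
        return Long.parseLong(orderNumber.substring(8, 18));
    }

    /** Shard index for a customer, the sharding key in this sketch. */
    static int shardOf(long customerId) {
        return (int) Math.floorMod(customerId, (long) SHARD_COUNT);
    }

    /** Routes an order lookup without scanning every shard. */
    static int shardOfOrder(String orderNumber) {
        return shardOf(customerIdOf(orderNumber));
    }
}
```

A lookup by order number then touches the same shard as a lookup by customer ID, at the cost of fixing the order-number format early.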
Sharding Strategies
Range Partitioning
Data is divided based on numeric or temporal ranges, e.g., order_id 0‑3,000,000 in one table, 3,000,001‑6,000,000 in another. This approach eases scaling but can create hotspots if recent IDs concentrate in a single range.
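A minimal routing sketch for the ranges in the example, assuming a fixed width of 3,000,000 IDs per table and a t_order_N naming scheme (both illustrative):

```java
// Sketch: range-based table routing with a fixed range width.
// ROWS_PER_TABLE and the table naming are illustrative assumptions.
final class RangeRouter {
    static final long ROWS_PER_TABLE = 3_000_000L;

    /** IDs 0..3,000,000 go to t_order_0, 3,000,001..6,000,000 to t_order_1, ... */
    static String tableFor(long orderId) {
        if (orderId < 0) {
            throw new IllegalArgumentException("orderId must be non-negative");
        }
        long index = Math.max(orderId - 1, 0) / ROWS_PER_TABLE;
        return "t_order_" + index;
    }
}
```

Adding capacity only means creating the next table, since existing rows never move. The hotspot is also visible here: every newly generated ID routes to the highest-numbered table.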
Hash Modulo
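Hash the sharding key (e.g., user_id or order_id) and take the result modulo the total number of tables to distribute rows evenly.
Example: with four tables, id=1 maps to t_order_1 and id=3 maps to t_order_3. In Java this is Math.floorMod(orderId.hashCode(), tableNumber). The commonly seen Math.abs(orderId.hashCode()) % tableNumber can return a negative index, because Math.abs(Integer.MIN_VALUE) is still negative.
Pros: Even distribution, no obvious hotspots.
Cons: Adding tables later changes the modulus, so most rows map to a different table and must be migrated; consistent hashing can mitigate this.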
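A small sketch of modulo routing, assuming the t_order_N naming from the example. It uses Math.floorMod so the index is never negative. The countMoved helper is hypothetical and exists only to show how many keys change tables when the table count changes.

```java
// Sketch: hash-modulo routing plus a helper that measures re-sharding cost.
final class HashModRouter {
    /** Table name for a sharding key; the index is always in [0, tableCount). */
    static String tableFor(Object shardingKey, int tableCount) {
        int index = Math.floorMod(shardingKey.hashCode(), tableCount);
        return "t_order_" + index;
    }

    /** Counts keys in [0, n) whose table index differs between two table counts. */
    static long countMoved(long n, int oldCount, int newCount) {
        long moved = 0;
        for (long key = 0; key < n; key++) {
            int hash = Long.hashCode(key);
            if (Math.floorMod(hash, oldCount) != Math.floorMod(hash, newCount)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        System.out.println(tableFor(1L, 4));       // t_order_1
        System.out.println(tableFor(3L, 4));       // t_order_3
        System.out.println(countMoved(20, 4, 5));  // 16 of 20 keys change tables
    }
}
```

Going from four to five tables relocates 16 of every 20 consecutive IDs, which is why growing a modulo-sharded table set usually means a full migration.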
Consistent Hashing
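Consistent hashing places both tables (nodes) and keys on a hash ring and assigns each key to the next node clockwise. When expanding from 10 to 20 tables, only the keys that fall just before a newly added node are remapped, so far less data moves than with plain modulo and the migration effort drops accordingly.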
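The idea can be sketched with a sorted map acting as the ring. This is an illustration under assumptions, not the implementation of any particular middleware: the FNV-1a hash, the virtual-node count, and the node names are all choices made for the example, and a production ring would likely use a stronger hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Sketch: consistent-hash ring with virtual nodes (illustrative only).
final class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places several points for the node on the ring to smooth the distribution. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Returns the owner of the first ring point at or after the key's hash. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue(); // wrap around
    }

    /** 32-bit FNV-1a; chosen for brevity, not for hash quality. */
    static int hash(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }
}
```

When a node is added, a key either keeps its owner or moves to the new node; keys never shuffle between existing nodes, which is what keeps the migration small.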
Avoiding Hotspots and Data Skew
Combining range and hash strategies can balance load. For example, split orders by ID range into separate databases, then apply hash modulo within each database to distribute rows across tables.
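One way to picture the combination, assuming a hypothetical width of 10,000,000 IDs per database and four tables per database:

```java
// Sketch: range picks the database, hash modulo picks the table inside it.
// IDS_PER_DATABASE, TABLES_PER_DATABASE and the naming are hypothetical.
final class RangeThenHashRouter {
    static final long IDS_PER_DATABASE = 10_000_000L;
    static final int TABLES_PER_DATABASE = 4;

    static String route(long orderId) {
        if (orderId < 0) {
            throw new IllegalArgumentException("orderId must be non-negative");
        }
        long database = orderId / IDS_PER_DATABASE;                             // range step
        int table = Math.floorMod(Long.hashCode(orderId), TABLES_PER_DATABASE); // hash step
        return "db_" + database + ".t_order_" + table;
    }
}
```

New ID ranges still land in one database, but inside it the writes spread over all four tables, and adding a database never forces existing rows to be re-hashed.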
Distributed Transaction Solutions
After sharding, a single local transaction can no longer span databases. Common distributed transaction patterns include two-phase commit, three-phase commit, TCC, local message tables, best-effort notification (eventual consistency), and saga orchestration.
Cross‑Node Join Strategies
Field redundancy: duplicate join fields in the primary table.
Global tables: maintain a copy of frequently joined reference data in every shard.
Data synchronization via ETL tools.
Application‑level assembly: perform multiple queries and combine results in code.
Aggregations Across Shards
Aggregate functions such as COUNT, and clauses such as ORDER BY and GROUP BY, require gathering partial results from each shard and merging them in the application or middleware layer.
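A minimal sketch of the merge step, assuming each shard has already returned its partial result. The method names are illustrative. An average cannot be merged from per-shard averages, so the sketch merges per-shard sums and counts instead.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: merging per-shard partial aggregates in the application layer.
final class ShardAggregation {
    /** Merges per-shard "GROUP BY key, COUNT(*)" results into global counts. */
    static Map<String, Long> mergeGroupCounts(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new TreeMap<>(); // sorted by group key
        for (Map<String, Long> partial : partials) {
            partial.forEach((group, count) -> merged.merge(group, count, Long::sum));
        }
        return merged;
    }

    /** Global average from per-shard {sum, count} pairs. */
    static double average(List<long[]> sumAndCountPerShard) {
        long sum = 0;
        long count = 0;
        for (long[] pair : sumAndCountPerShard) {
            sum += pair[0];
            count += pair[1];
        }
        if (count == 0) {
            throw new IllegalArgumentException("no rows");
        }
        return (double) sum / count;
    }
}
```

For example, merging {PAID=3, NEW=1} from one shard with {PAID=2} from another yields {NEW=1, PAID=5}.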
Pagination After Sharding
Two approaches are common:
Global view method: Retrieve results from all shards, merge, then paginate (accurate but may transfer excess data).
Business compromise method: Allow only previous/next page navigation, passing the last seen timestamp to fetch the next slice from each shard.
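The second approach might look like the sketch below. Shards are modeled as in-memory lists standing in for real queries, and the cursor carries an ID alongside the timestamp as a tie-breaker. That tie-breaker is an addition for this example, since rows with equal timestamps would otherwise be skipped or repeated.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: "next page" navigation across shards using a (createdAt, id) cursor.
final class ShardPagination {
    static final class Order {
        final long id;
        final long createdAt;

        Order(long id, long createdAt) {
            this.id = id;
            this.createdAt = createdAt;
        }
    }

    static final Comparator<Order> BY_TIME_THEN_ID =
        Comparator.comparingLong((Order o) -> o.createdAt).thenComparingLong(o -> o.id);

    /** Stand-in for a per-shard query: rows after the cursor, ordered, limited. */
    static List<Order> queryShard(List<Order> shard, long lastCreatedAt, long lastId, int limit) {
        List<Order> rows = new ArrayList<>();
        shard.stream()
            .sorted(BY_TIME_THEN_ID)
            .filter(o -> o.createdAt > lastCreatedAt
                || (o.createdAt == lastCreatedAt && o.id > lastId))
            .limit(limit)
            .forEach(rows::add);
        return rows;
    }

    /** Fetches pageSize rows from every shard, merges them, keeps the first pageSize. */
    static List<Order> nextPage(List<List<Order>> shards, long lastCreatedAt, long lastId, int pageSize) {
        List<Order> candidates = new ArrayList<>();
        for (List<Order> shard : shards) {
            candidates.addAll(queryShard(shard, lastCreatedAt, lastId, pageSize));
        }
        candidates.sort(BY_TIME_THEN_ID);
        return new ArrayList<>(candidates.subList(0, Math.min(pageSize, candidates.size())));
    }
}
```

Each shard returns at most pageSize rows no matter how deep the user has paged, which is the saving over the global view method.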
Distributed ID Generation
Once data is spread over several databases, per-database auto-increment IDs are no longer globally unique, so use UUIDs or Snowflake IDs instead. A Snowflake ID is a 64-bit value composed of a sign bit, a 41-bit timestamp, a 10-bit machine identifier, and a 12-bit sequence number.
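The bit layout above can be illustrated with a minimal generator. The custom epoch and the decision to throw when the clock moves backwards are assumptions made for this sketch; real deployments handle clock drift and machine-ID assignment in their own ways.

```java
// Sketch: Snowflake-style ID = 1 sign bit | 41-bit timestamp | 10-bit machine | 12-bit sequence.
final class SnowflakeIdGenerator {
    private static final long EPOCH_MILLIS = 1_577_836_800_000L; // 2020-01-01T00:00:00Z, assumed epoch
    private static final int MACHINE_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_MACHINE_ID = (1L << MACHINE_BITS) - 1; // 1023
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;  // 4095

    private final long machineId;
    private long lastMillis = -1L;
    private long sequence = 0L;

    SnowflakeIdGenerator(long machineId) {
        if (machineId < 0 || machineId > MAX_MACHINE_ID) {
            throw new IllegalArgumentException("machineId out of range");
        }
        this.machineId = machineId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastMillis) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastMillis) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: wait for the next one.
                while (now <= lastMillis) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return ((now - EPOCH_MILLIS) << (MACHINE_BITS + SEQUENCE_BITS))
            | (machineId << SEQUENCE_BITS)
            | sequence;
    }

    /** Extracts the machine identifier from a generated ID. */
    static long machineIdOf(long id) {
        return (id >>> SEQUENCE_BITS) & MAX_MACHINE_ID;
    }
}
```

IDs from one generator are strictly increasing, and IDs from different machines are roughly time-ordered, which tends to suit B+-tree primary keys better than random UUIDs.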
Sharding Middleware Options
Popular choices include Sharding‑JDBC, Cobar, Mycat, Atlas, TDDL (Taobao), and Vitess. The author’s project uses Sharding‑JDBC.
Evaluating Number of Shards
As a rough rule of thumb, a single MySQL database holding more than about 50 million rows starts to strain. Typically 4 to 10 databases are recommended; the author's company splits data across ten databases.
Horizontal vs. Vertical Sharding
Horizontal database sharding: Split rows of a single logical database across multiple physical databases using hash or range.
Horizontal table sharding: Split rows of a single table across multiple tables.
Vertical database sharding: Separate tables by business domain into different databases.
Vertical table sharding: Split a table's columns into a main table and extension tables based on column activity.
Zero‑Downtime Migration Steps
Create a proxy layer with a switch to control whether new or old DAO is used; during gray rollout, continue using the old DAO.
After full release, enable dual writes: write to both old and new tables, recording the new table’s starting ID.
Migrate existing data from old to new tables via scripts.
Switch reads to the new table while keeping dual writes for a stabilization period.
Once stable, stop writes to the old table.
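The steps above can be sketched as a proxy DAO driven by a phase switch. All names here (OrderDao, ProxyDao, Phase) are hypothetical, and the DAOs are in-memory stand-ins. In practice the switch would typically live in a configuration center so it can be flipped without a redeploy.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: proxy DAO whose phase switch drives the migration steps.
final class MigrationSketch {
    enum Phase { OLD_ONLY, DUAL_WRITE_READ_OLD, DUAL_WRITE_READ_NEW, NEW_ONLY }

    interface OrderDao {
        void save(long id, String payload);
        String find(long id);
    }

    /** In-memory stand-in for a real table-backed DAO. */
    static final class InMemoryDao implements OrderDao {
        private final Map<Long, String> rows = new HashMap<>();
        public void save(long id, String payload) { rows.put(id, payload); }
        public String find(long id) { return rows.get(id); }
    }

    static final class ProxyDao implements OrderDao {
        private final OrderDao oldDao;
        private final OrderDao newDao;
        private volatile Phase phase = Phase.OLD_ONLY; // gray rollout starts here

        ProxyDao(OrderDao oldDao, OrderDao newDao) {
            this.oldDao = oldDao;
            this.newDao = newDao;
        }

        void setPhase(Phase phase) { this.phase = phase; }

        public void save(long id, String payload) {
            Phase current = phase;
            if (current != Phase.NEW_ONLY) oldDao.save(id, payload); // old table still written
            if (current != Phase.OLD_ONLY) newDao.save(id, payload); // dual write or new only
        }

        public String find(long id) {
            Phase current = phase;
            boolean readNew = current == Phase.DUAL_WRITE_READ_NEW || current == Phase.NEW_ONLY;
            return (readNew ? newDao : oldDao).find(id);
        }
    }
}
```

Because reads and writes switch independently, rolling back at any point before the last phase only means setting the phase back one step.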
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
