How to Design and Test Horizontal Sharding for Large-Scale Databases
This article outlines the background, analysis, design, testing strategy, migration plan, and production quality results for implementing horizontal sharding, table splitting, and Elasticsearch integration in a growing enterprise database system.
1. Background
As the company grows, user volume and business complexity increase, requiring each business team to migrate relevant tables from a unified database to their own databases for easier maintenance.
As the middle platform develops and data volume grows, the single-database architecture leads to slow responses and timeouts, prompting horizontal sharding and Elasticsearch integration.
2. Analysis and Design
2.1 Horizontal Sharding
Reason: Table data will keep growing, slowing queries, so horizontal splitting based on primary‑key ID is needed.
Problem: Multiple tables share a single auto‑increment ID.
Solution: Use tddl-sequence; each node obtains a distinct ID segment, guaranteeing no primary-key conflicts.
Risk considerations: Verify that tddl‑sequence fully prevents conflicts by studying conflict cases.
Investigation points:
Manual inserts causing primary‑key conflicts.
Modification of the sequence value leading to conflicts.
Multi‑threaded code causing conflicts.
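The segment-allocation idea behind tddl-sequence can be sketched as follows. This is not the middleware's actual implementation (which persists segments via a sequence table and an atomic compare-and-set UPDATE); it is a minimal illustration, with an AtomicLong standing in for the shared counter, of why nodes holding disjoint ID ranges cannot produce primary-key conflicts:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of segment-based ID allocation (the idea behind tddl-sequence).
// Each application node reserves a contiguous range from a shared counter and
// hands out IDs locally; ranges never overlap, so primary keys never conflict.
public class SegmentSequence {
    static final int SEGMENT_SIZE = 1000;
    // Stands in for the shared sequence store that the real middleware
    // updates with an atomic compare-and-set UPDATE.
    static final AtomicLong globalCounter = new AtomicLong(0);

    private long next; // next ID to hand out
    private long max;  // exclusive upper bound of the reserved segment

    public synchronized long nextId() {
        if (next >= max) { // segment exhausted: reserve a fresh one
            next = globalCounter.getAndAdd(SEGMENT_SIZE);
            max = next + SEGMENT_SIZE;
        }
        return next++;
    }
}
```

The investigation points above map onto this model: manual inserts or hand-edited sequence values can land inside a segment another node already reserved, which is exactly the conflict case worth verifying.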
2.2 Number of Shards and Sharding Fields
Tables x and y are split horizontally. Table x (10 million rows, growing by 20,000 per day) suggests 8 shards, usable for about 4.1 years. Table y (50 million rows, growing by 50,000 per day) suggests 32 shards, usable for about 5.5 years.
Calculation: one shard stores up to 5 million rows. Example for x: 8 × 5 million = 40 million total capacity; 40 million − 10 million existing rows = 30 million of headroom; 30 million ÷ 20,000 per day = 1,500 days ≈ 4.1 years.
After sharding, fields remain unchanged.
Risk: Define important fields such as primary keys as BIGINT in both DB and ES.
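The capacity arithmetic above generalizes to a one-line formula; here is a small sketch of it (class and method names are illustrative, not from the project):

```java
// Sketch of the shard-capacity estimate from section 2.2: with a cap of
// 5 million rows per shard, the remaining headroom divided by daily growth
// gives the number of days until the shards fill up.
public class ShardCapacity {
    static final long ROWS_PER_SHARD = 5_000_000L;

    // Days the sharded table can absorb growth before reaching capacity.
    public static long daysUntilFull(int shards, long existingRows, long dailyGrowth) {
        long capacity = shards * ROWS_PER_SHARD;
        return (capacity - existingRows) / dailyGrowth;
    }
}
```

For table x this yields (8 × 5M − 10M) ÷ 20,000 = 1,500 days, matching the 4.1-year estimate.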
2.3 Sharding Strategy
Since only one database exists after migration, sharding strategy is based on actual shard count.
Table x: 8 shards, strategy “primary‑key % 8”. Table y: 32 shards, strategy “primary‑key % 32”.
Risk: With non-business IDs (plain auto-increment primary keys) as shard keys, data skew is unlikely; shard keys carrying strong business semantics would need additional skew evaluation.
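The "primary-key % shard-count" strategy can be sketched as a routing function that maps an ID to a physical table suffix (names here are illustrative; real sharding middleware does this rewriting transparently):

```java
// Sketch of the "primary-key % N" routing from section 2.3: the physical
// table suffix is derived directly from the ID, so a lookup by primary key
// always hits exactly one shard.
public class ShardRouter {
    public static String physicalTable(String logicalTable, long id, int shardCount) {
        // floorMod guards against negative IDs producing a negative suffix.
        long shard = Math.floorMod(id, (long) shardCount);
        return logicalTable + "_" + shard;
    }
}
```

For example, ID 17 in table x (8 shards) routes to x_1, and ID 33 in table y (32 shards) routes to y_1.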
2.4 Elasticsearch Integration
Advantage: strong query capability over massive data sets, including multi-field and full-table searches.
Disadvantages: higher latency (roughly 10× that of the DB) and lower stability.
Plan: Use ES for y‑table queries that involve many fields; keep x‑table queries in DB, using ES only for full‑table scans.
Risk: No separate data store for ES, so no fallback; only low‑frequency, tolerant queries are routed to ES.
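The routing rule implied by the plan and risk note above can be sketched as a simple decision function (the class, enum, and parameter names are illustrative, not real switches from the system):

```java
// Sketch of the read-routing rule in section 2.4: only low-frequency,
// latency-tolerant queries (e.g. multi-field or full-table searches) go to
// ES; everything else stays on the primary-key-friendly DB path, because ES
// has no separate fallback store and is slower and less stable than the DB.
public class QueryRouter {
    enum Backend { DB, ES }

    public static Backend route(boolean multiFieldQuery, boolean latencyTolerant) {
        return (multiFieldQuery && latencyTolerant) ? Backend.ES : Backend.DB;
    }
}
```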
3. Sharding Test Plan
3.1 Overall Quality Control
Service layer: Ensure all service interfaces function correctly, possibly using traffic replay.
Data layer: Guarantee data consistency during migration and after enabling dual‑write.
Business layer: Validate end‑to‑end flows with both internal and external traffic.
Performance: Assess performance at each switch point.
3.2 Key Test Details
Data consistency: Compare row counts between old and new databases; verify dual‑write consistency.
Interface consistency: Use traffic replay, manual verification, and the QueryCheckResult switch to compare records.
Business correctness: After data and interface validation, deploy code with dual‑write, read, and check switches enabled, then run main business flows.
Performance testing: Conduct baseline pressure tests, capacity planning, and subsequent tests after cache compression, GC changes, and adding read replicas.
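The row-count comparison from the data-consistency check can be sketched as follows; in practice the counts would come from SELECT COUNT(*) against the old table and each new shard (class and method names are illustrative):

```java
import java.util.Map;

// Sketch of the data-consistency check in section 3.2: while dual-write is
// enabled, the row count of the old table must equal the sum of row counts
// across all new shards.
public class ConsistencyCheck {
    public static boolean countsMatch(long oldTableCount, Map<String, Long> shardCounts) {
        long total = shardCounts.values().stream().mapToLong(Long::longValue).sum();
        return total == oldTableCount;
    }
}
```

Row counts alone cannot catch field-level drift, which is why the plan pairs this with record-by-record comparison via traffic replay and the QueryCheckResult switch.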
4. Switch Migration Plan
Step 1: Configure switches for sequence, search limit, summerCache, write mode, and read source.
Step 2: Enable sequence and dual‑write.
Step 3: Stop data‑source sync, monitor counts via DBV to ensure stability.
Step 4: Enable ES read switch, routing reads to new DB and ES.
Step 5: After all consumers switch to the new DB, stop writes to the public DB.
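The ordering constraint implicit in these steps, namely that each switch may only be flipped after the one before it, can be sketched as a small state check. The flag names below are illustrative stand-ins for the real switches (sequence, dual-write, ES read, stopping public-DB writes):

```java
// Sketch of the staged switch migration in section 4. isValidState() rejects
// any combination where a later step is enabled before an earlier one,
// mirroring steps 2 through 5 of the rollout.
public class MigrationSwitches {
    boolean sequenceEnabled;       // step 2: new sequence active
    boolean dualWriteEnabled;      // step 2: writes go to old and new DB
    boolean esReadEnabled;         // step 4: reads routed to new DB and ES
    boolean publicDbWriteStopped;  // step 5: public DB no longer written

    public boolean isValidState() {
        if (dualWriteEnabled && !sequenceEnabled) return false;
        if (esReadEnabled && !dualWriteEnabled) return false;
        if (publicDbWriteStopped && !esReadEnabled) return false;
        return true;
    }
}
```

Encoding the ordering this way makes an out-of-order flip (e.g. enabling dual-write before the new sequence) detectable before it corrupts data.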
5. Overall Migration Architecture
Key stages illustrated with diagrams: switch primary‑key sequence, enable dual‑write, switch reads to new DB, fully migrate reads, and finally stop writes to the public DB.
6. Production Quality
Application stability remained high, with few alerts or errors.
Core interface latency was reduced by over 50% on average (e.g., interface a by 60%, b by 63%, c by 58%, d by 59%, e by 54%, f by 61%, g by 46%).
Capacity planning nearly eliminated dependence on the public DB, reducing risk.
Qunhe Technology Quality Tech