Sharding Massive MySQL Tables: End‑to‑End Architecture, Migration & Best Practices
This article presents a comprehensive, step‑by‑step case study of a large‑scale MySQL sharding project, covering business refactoring, storage architecture design, data migration, synchronization strategies, capacity planning, validation, rollout procedures, stability safeguards, and cross‑team collaboration.
Introduction
Storing and accessing massive data sets has become a bottleneck for MySQL, exhausting CPU, disk, and memory resources while demanding high stability and scalability. Sharding (分库分表) is a common solution, but most online articles only discuss isolated concepts. This guide walks through a complete sharding project from architecture design to production rollout.
Phase 1: Business Refactoring (Optional)
1. Query Statistics
Using a distributed tracing system, the team collected all services that query the two large tables (A and B, each ~80 million rows), recorded owners, and identified both online and offline consumers.
2. Query Splitting and Migration
A dedicated JAR named projectdb (initial version 1.0.0‑SNAPSHOT) was created to host the extracted queries. All original calls such as xxxMapper.xxxMethod() were replaced with projectdb.xxxMethod(). This approach simplifies later query‑splitting analysis and enables a future switch to RPC calls by merely upgrading the JAR version.
Facilitates subsequent query‑splitting analysis.
Allows a seamless transition to middle‑service RPC calls; only the JAR version needs to be upgraded.
3. Joint Query Analysis
The collected queries were classified to answer key questions: which queries cannot be split, which can be split via joins, which tables/fields can be merged or discarded, and which fields need redundancy. The analysis produced a concrete sharding‑key plan.
4. New Table Design
Based on the analysis, a new merged table was designed, reviewed by all business owners, and indexed according to the identified query patterns. Redundant or obsolete fields were removed.
5. First Upgrade
The new table was introduced by upgrading the JAR to 2.0.0‑SNAPSHOT, updating all services to use the new fields, and performing a batch rollout that separates non‑core and core services to limit risk.
6. Best Practices
Avoid renaming original column names unless absolutely necessary.
Redesign indexes based on the post‑sharding query analysis rather than copying the old ones.
Phase 2: Storage Architecture Design (Core)
1. Overall Architecture
Query statistics showed that >80 % of queries use three key columns (pk1, pk2, pk3). The architecture therefore combines a MySQL sharding layer, a search‑engine index (Alibaba Cloud OpenSearch/Elasticsearch), and data‑sync tools. The diagram below illustrates the full stack.
2. MySQL Sharding Storage
Sharding keys pk1, pk2, pk3 cover >80 % of queries. Two full‑copy shards (pk1 and pk3) are kept because their queries require low latency; pk2 is stored as a simple mapping table (pk1‑pk2) due to their one‑to‑one relationship.
3. Search Platform Index Storage
The remaining ~20 % of irregular queries (fuzzy search, non‑key filters) are served by the search platform, which stores only primary keys and indexed fields. Results are then fetched from MySQL to guarantee consistency.
4. Data Synchronization
Four synchronization flows were implemented:
Full‑copy from old tables to the new main table (initially via async sync, later switched to dual‑write 3.0.0‑SNAPSHOT for real‑time needs).
Full‑copy from the new main table to a secondary copy.
Sync from the new main table to the pk1‑pk2 mapping table.
Sync from the new main table to the search‑engine index.
Only flows without strict real‑time requirements use simple async sync; the critical path (old → new main) uses dual‑write to meet latency constraints.
5. Capacity Planning
Before provisioning MySQL storage and search‑engine indexes, the team estimated total data volume and QPS from monitoring. During full‑copy sync they observed a storage “hole” effect, causing actual usage to exceed estimates, so an extra 50 % capacity buffer was reserved.
6. Data Validation
Because the migration involved heterogeneous changes, a dedicated validation service was built to compare full‑copy and incremental data between old and new stores. The service uncovered numerous consistency issues early, preventing production incidents.
7. Best Practices
Beware of traffic amplification caused by sharding (secondary index lookups, IN‑batch queries).
Limit IN‑clause size and evaluate amplification during capacity planning.
When a sharding key must change (e.g., pk3), perform delete‑then‑add logic with transactional guarantees.
Ensure message ordering in the sync pipeline (Kafka partitions) to avoid data overwrite; if ordering cannot be guaranteed, perform a reverse‑lookup to the source.
Account for sync latency (seconds‑level) and database replica lag when designing read paths.
Phase 3: Refactoring and Release (Cautious)
1. Query Refactoring
All read queries were rewritten to target the new sharded tables, replace inner joins on pk1/pk2 with the appropriate shard, drop obsolete fields, and route non‑key queries to the search platform.
2. Write Refactoring
Writes were changed to a dual‑write mode (old + new tables) with a feature flag to toggle the new path. This allows quick rollback by disabling the flag while keeping the old‑to‑new sync active.
3. Service Refactoring
A new middle‑service was introduced to encapsulate all transformed queries. Existing JAR calls were replaced with RPC calls to this service, and the JAR version was upgraded to 3.0.0‑SNAPSHOT.
4. Batch Service Rollout
Services were upgraded in stages from non‑core to core. During the rollout, a mixed state (read new, write old) could appear; monitoring ensured no external service accessed the old tables.
5. Old Table Decommission
Decommission steps:
Confirm no non‑mid‑service accesses the old tables via monitoring and SQL audit.
Stop all data synchronization.
Drop the old tables.
6. Best Practices
Address write‑then‑read latency by either switching sync to dual‑write or adding a fallback read from the old table.
Replace auto‑increment primary keys with a distributed ID generator (MySQL sequence) to avoid global key collisions.
Allocate separate ID ranges for old and new tables, then swap them after migration to guarantee uniqueness.
Use a single code base with feature switches for seamless cut‑over.
Stability Assurance
Stability is reinforced throughout the project: thorough table‑design reviews, mandatory data‑validation for every sync, prepared rollback plans for each phase, batch rollouts starting with low‑risk services, comprehensive monitoring and alerting, and extensive unit and integration testing.
Cross‑Team Collaboration
Key collaboration practices include:
Document‑first approach for requirements, progress, and hand‑offs.
Close communication with all business owners for schema changes and field confirmations.
Clear responsibility assignment: each team designates a single point of contact for coordination.
Outlook
Future work may involve adopting newer database middleware that automates sharding, or migrating to distributed databases such as PolarDB, TiDB, or HBase, which could eliminate the need for manual sharding altogether.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
