From Monolith to Sharded MySQL: A Complete End‑to‑End Sharding Case Study
This article walks through a real‑world large‑scale MySQL sharding project, covering business refactoring, storage architecture design, data migration, incremental upgrades, best‑practice tips, and stability safeguards, while sharing concrete steps, pitfalls, and lessons learned from start to production rollout.
1. Introduction
Massive data growth makes a single MySQL instance a bottleneck; CPU, disk, and memory limits force a split‑database‑and‑table (sharding) solution. Two options exist: replace MySQL with a distributed store (e.g., HBase, PolarDB, TiDB) or keep MySQL and apply sharding. This article focuses on the latter, presenting a complete end‑to‑end process.
2. Phase 1 – Business Refactoring (Optional)
For well‑designed micro‑services, only storage changes may be needed; business refactoring is optional.
The project involved two huge tables (A and B) with ~80 million rows each, inherited from a monolithic system, linked to 50+ online services and 20+ offline jobs. Refactoring required merging the tables, removing redundant fields, and ensuring no query was missed.
2.1 Query Statistics
Using a distributed tracing system, the team listed all services that accessed the tables, including offline analytics, to avoid missing any data source.
2.2 Query Migration
A jar named projected‑1.0.0‑SNAPSHOT was created to host all collected queries. Existing mapper calls were replaced with projectdb.xxxMethod(), enabling later replacement with RPC calls.
Facilitates further query analysis.
Allows seamless switch to middle‑tier service calls by upgrading the jar version.
2.3 Joint Query Analysis
The team classified queries to decide which could be split, which required joins, which fields could be merged, redundant, or discarded, and identified sharding keys.
2.4 New Table Design
Based on the analysis, a new unified table schema was drafted, reviewed by all business owners, and indexed according to the new query patterns.
2.5 First Upgrade
Version 2.0.0‑SNAPSHOT of the jar updated all SQL to the new schema. Services were upgraded in batches (core vs. non‑core) to mitigate risk.
2.6 Best Practices
Keep original column names when possible.
Redesign indexes after merging tables; avoid blindly copying old indexes.
3. Phase 2 – Storage Architecture Design (Core)
The core of any sharding project is the storage architecture.
3.1 Overall Architecture
Analysis showed >80% of queries use three primary keys (pk1, pk2, pk3). The design introduced a database middleware, data‑sync tools, and a search platform (OpenSearch/Elasticsearch) for the remaining 20% of queries.
3.1.1 MySQL Sharding
Sharding keys pk1/pk2/pk3 cover most queries. Two full‑copy shards (pk1 and pk3) were maintained due to real‑time requirements; pk2 was stored only in a mapping table.
3.1.2 Search Index
The search index stores primary keys and necessary fields for the remaining queries, avoiding full data duplication unless fields are small.
Search‑engine‑to‑DB sync introduces latency; queries that need strong consistency should hit MySQL directly.
3.1.3 Data Synchronization
Both full‑copy and incremental sync were used. Initially a sync‑only approach was chosen, but real‑time latency forced a switch to dual‑write (snapshot 3.0.0‑SNAPSHOT) for critical paths.
The sync topology includes:
Full sync from old tables to new main tables.
Full sync from new main tables to secondary tables.
Sync from new main tables to mapping tables.
Sync from new main tables to the search engine.
3.2 Capacity Planning
Storage and performance were estimated from QPS and table sizes, but actual sync required ~50% more capacity due to storage gaps.
3.3 Data Validation
Because the migration involved heterogeneous schema changes, a dedicated validation service compared full and incremental data to ensure logical correctness and sync consistency.
3.4 Best Practices
Beware of traffic amplification caused by sharding (secondary index lookups, IN‑batch queries).
When a sharding key must change (e.g., pk3), use delete‑then‑insert with transactional guarantees.
Ensure message order in Kafka‑based sync; for one‑to‑many relationships, fallback to a reverse‑lookup to guarantee consistency.
Account for latency at each layer (sync platform, read‑replica lag, search index delay).
Allocate ~50% extra storage for sync‑induced gaps.
4. Phase 3 – Refactoring and Production Rollout
After architecture is ready, the team performed careful rollout:
Middle‑tier services switched to single‑read, dual‑write mode.
Old tables kept syncing to new tables.
All services upgraded to projectDB‑3.0.0‑SNAPSHOT and exposed RPC endpoints.
Monitoring ensured no non‑mid‑tier service accessed old tables.
Data sync stopped, old tables dropped.
4.1 Query Refactoring
Read queries were rewritten to use new table names and sharding keys; write queries were changed to dual‑write mode with feature flags for quick rollback.
4.2 Service Refactoring
A new middle‑tier service was created to host the refactored queries, allowing independent upgrades and rollbacks.
4.3 Staged Deployment
Non‑core services were upgraded first, followed by core services, to limit impact. Each batch required thorough API risk assessment and regression testing.
4.4 Old Table Decommission
Confirm no external service accesses old tables via monitoring and SQL audit.
Stop data sync.
Delete old tables.
4.5 Additional Best Practices
Immediate reads after writes may miss data due to sync lag; mitigate with dual‑write or fallback reads.
Replace auto‑increment IDs with distributed sequence IDs to avoid key collisions across shards.
Swap ID ranges between old and new tables before cut‑over to prevent conflicts.
5. Stability Assurance
Key stability measures include thorough table‑design reviews, data‑validation services, rapid rollback plans, batch rollouts, comprehensive monitoring/alerting, and extensive unit/integration testing.
6. Cross‑Team Project Management
Successful large‑scale sharding requires disciplined documentation, clear business communication, ownership assignment, and synchronized progress tracking across all involved teams.
7. Outlook
Future improvements may involve newer middleware that automates sharding or adopting distributed databases (PolarDB, TiDB, HBase) that eliminate the need for manual sharding.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
