Fast and Stable MySQL Data Center Migration: Choosing and Implementing the Optimal Strategy
This article details the background, migration plan selection, and step‑by‑step procedures—including pre‑building cascades, service pause, automated batch operations, cluster tiering, pre‑ and post‑checks, and gray‑scale validation—to achieve a fast, stable MySQL data‑center migration for a large‑scale production environment.
1 Background
As a leading domestic circular economy company, the rapid growth of the ZhaiZhai business has pushed its basic services to a “shell‑shedding” stage. Existing IDC resources are saturated and cannot meet future demand. Additionally, Tencent Cloud’s load‑balancer upgrade requires replacing TGW with CLB.
After six months of planning, a new IDC and CLB have been deployed, and public services such as MySQL, TiDB, and Redis need orderly migration. The MySQL migration faces many clusters, high impact, and strict operational requirements, prompting thorough research and a reasonable plan to minimize service perception.
2 Migration Plan Selection
2.1 Option 1: Expansion + Master‑Slave Switch
Backup and expand enough slave instances, then use MHA (Master High Availability) to actively switch, finally decommission old nodes to change the cluster topology.
2.2 Option 2: Cascading Switch
Build a cascading cluster via backup, synchronize data to the new cluster, then perform a cascade break and domain‑name switch to complete the topology change.
2.3 Comparison
Option 1: Small development effort; expansion and MHA switch are easy, but each cluster’s MHA switch takes >30 s, causing prolonged business impact and demanding high‑performance MHA for large‑scale switches.
Option 2: Simple principle and fast switch (<10 s per cluster), but requires extensive automation (auto‑scaling, cascading setup, pre/post checks, traffic switching), leading to higher development cost.
Considering business impact, the fast and stable cascading solution (Option 2) was chosen; it also allows smooth CLB upgrade during new‑cluster construction.
Cascading read‑traffic switch diagram
Cascading write‑traffic switch diagram
3 How to Achieve a Fast and Stable MySQL Data‑Center Migration
MySQL cluster migration carries high risk; improper operations can render clusters unavailable. ZhaiZhai operates hundreds of MySQL clusters, requiring a rapid yet reliable migration.
3.1 Pre‑Build Cascades
Backup and expand new clusters, then establish cascading relationships with old clusters. Because ZhaiZhai uses a single‑machine multi‑instance architecture, the new clusters must consider mixed deployment issues, instance resource limits, and load balancing.
Constraints:
Maximum 5 master instances per machine
Maximum 10 slave instances per machine
Maximum 15 total instances per machine
Memory and disk usage ≤ 85 %
Cascading construction process:
3.2 Service Pause
Complex upstream/downstream relationships make selective pause difficult; therefore, a short low‑traffic window is used to pause services and batch‑switch core clusters, minimizing downtime while ensuring robust batch operation and recovery capabilities.
3.3 Automated Batch Operations and Decoupling Critical Steps
All migration steps are automated and modular, forming a closed loop that enables rapid fault localization and rollback.
Automated cascading cluster setup
Automated pre/post checks
Batch read‑traffic switch
Batch write‑traffic switch
Automatic kill of old‑cluster connections and detection of new connections
Batch decommission of old clusters
3.4 Cluster Tiering
Clusters are divided into three tiers (P1, P2, P3) with roughly equal distribution.
P3 clusters can be switched at any daytime
P2 clusters are switched between 20:00‑22:00
P1 clusters require early‑morning maintenance windows
3.5 Pre‑ and Post‑Switch Checks
Both new and old clusters undergo thorough checks to ensure configuration consistency.
Pre‑checks include:
VIP‑RSHOST connectivity
buffer_pool_size
sql_mode
Number of slave instances
Cascade latency
…
Post‑checks include:
Read‑only status of new and old masters
Real‑time business connections
Domain name points to the new cluster
…
3.6 Gray‑Scale Switch Validation
After automation development, multiple rounds of batch switch tests are performed on a test cluster, followed by tiered online switches. P3 clusters serve as a low‑impact pilot to surface issues before larger batches.
Gray‑scale switch order:
Single‑cluster switch
Small batch (<10)
Large batch (>30)
Issues encountered:
Multiple domain names : Some clusters have several master/slave domain names, requiring compatibility handling during traffic switch.
CMDB inaccuracies : Stale metadata leads to incorrect instance or domain information, necessitating pre‑migration verification.
4 Conclusion
ZhaiZhai operates over 400 MySQL clusters; all switches were completed during the early‑morning maintenance window on September 27. P3 and P2 clusters were switched beforehand; the remaining 100+ P1 core clusters averaged 10 s per switch, finishing within half an hour with smooth operation and successful validation.
Author: Huang Jianbo, ZhaiZhai DBA, responsible for MySQL operations and database platform development.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
