Cloud Music Guizhou Data Center Migration: A Large-Scale Infrastructure Migration Case Study
In 2023, NetEase Cloud Music carried out its largest data center migration to date, moving over 20,000 applications and more than one million queries per second (QPS) to a new facility in Guizhou. The migration had to achieve zero downtime under strict latency and bandwidth limits. It did so with a batch-wise, cross-team strategy backed by automated upgrade platforms, standardized operations, and extensive risk-mitigation measures.
This article documents that project, the largest and most complex technical migration in the company's history.
Project Challenges
The migration faced multiple challenges: massive scale involving all services including middleware, storage, and third-party dependencies; high business complexity with diverse scenarios requiring different data consistency levels; accumulated technical debt; significant new risks; strict constraints requiring zero-downtime migration without P2+ incidents; and complex cross-team coordination.
Key Constraints
Limited machine procurement: the Guizhou and Hangzhou deployments cannot be fully equivalent
Cross-region bandwidth must stay within 200 Gbps, and the link may suffer network interruptions
Network latency of approximately 30 ms between Guizhou and Hangzhou
Business availability requirements: no P2 or above incidents
Minimize business code intrusion from migration solutions
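The latency constraint is easier to appreciate with a rough estimate. The following is a minimal sketch, assuming the approximately 30 ms figure is paid once per sequential cross-region call; the function name, the 50 ms local baseline, and the call counts are hypothetical and chosen only for illustration.

```python
# Approximate Guizhou-Hangzhou latency taken from the constraint list.
CROSS_REGION_LATENCY_MS = 30


def estimated_latency_ms(local_latency_ms: float, cross_region_calls: int) -> float:
    """Estimate end-to-end latency when some sequential calls cross regions.

    Assumption (not from the source): every cross-region call is sequential
    and adds the full cross-region latency on top of the local latency.
    """
    if cross_region_calls < 0:
        raise ValueError("cross_region_calls must be non-negative")
    return local_latency_ms + cross_region_calls * CROSS_REGION_LATENCY_MS


if __name__ == "__main__":
    # Hypothetical request that takes 50 ms when everything is local.
    for calls in (0, 1, 5):
        print(f"{calls} cross-region calls: {estimated_latency_ms(50, calls)} ms")
```

Under these assumptions, a handful of sequential cross-region calls can multiply the latency of a request, which is one reason a migration plan would try to keep call chains inside a single data center.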
Migration Strategy
The batch migration strategy follows four principles: decoupling by team and domain, keeping server-side traffic in a closed loop within one data center, prioritizing consumer-facing (C-end) services, and respecting resource constraints. Traffic is switched at several layers: client-side switching, DNS switching, gateway switching, and storage-layer switching. Each storage type (DB, Redis, Memcached) has its own detailed migration strategy.
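The batching principles can be illustrated with a small planning sketch. This is not the project's actual tooling: the `Domain` type, the `plan_batches` function, the domain names, and the machine counts are all hypothetical. It also assumes that C-end priority means consumer-facing domains migrate first, and it models the resource constraint as a per-batch machine budget.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """A team/domain migrated as one unit (hypothetical model)."""
    name: str
    machines_needed: int
    user_facing: bool  # True for consumer-facing (C-end) domains


def plan_batches(domains: list[Domain], machines_per_batch: int) -> list[list[str]]:
    """Greedily pack whole domains into batches under a machine budget.

    Consumer-facing domains are ordered first; within each group, larger
    domains come first. A domain is never split across batches, so its
    internal traffic stays in one data center.
    """
    ordered = sorted(domains, key=lambda d: (not d.user_facing, -d.machines_needed))
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for domain in ordered:
        if domain.machines_needed > machines_per_batch:
            raise ValueError(f"{domain.name} exceeds the per-batch machine budget")
        # Start a new batch when the next domain would overflow the budget.
        if current and used + domain.machines_needed > machines_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(domain.name)
        used += domain.machines_needed
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    demo = [
        Domain("search", 40, True),
        Domain("data-platform", 60, False),
        Domain("playback", 70, True),
        Domain("ads", 30, False),
    ]
    print(plan_batches(demo, machines_per_batch=100))
```

A greedy pass like this is only a starting point. A real plan would also have to account for dependencies between domains and for the cross-region bandwidth consumed while a batch is partially migrated.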
Stability Assurance
The project addressed stability risks through: information gathering, identification of new risks, cleanup of historical technical debt (including ZK dependency issues, the Kafka to Nydus migration, hardcoded configuration, and refactoring of service dependencies), standardized integration, monitoring enhancements, emergency plans, and business-side technical solutions.
Systems Built Along the Way
Key systems developed during the project include the SOP Platform for standardized operations and the Auto-Upgrade Platform for automated component upgrades. Both provide foundational support for future large-scale projects.
