Cloud Music Guizhou Data Center Migration: A Large-Scale Infrastructure Migration Case Study
In 2023 NetEase Cloud Music executed its largest-ever data center migration, moving over 20,000 applications and more than one million queries per second to a new Guizhou facility. The project had to achieve zero downtime within strict latency and bandwidth limits, and did so through a batch-wise, cross-team strategy backed by automated upgrade platforms, standardized operations, and extensive risk-mitigation measures.
This article documents NetEase Cloud Music's 2023 data center migration, which moved over 20,000 applications and more than 1 million QPS (queries per second) to a new data center in Guizhou. It was the largest and most complex technical migration in the company's history.
Project Challenges
The migration faced multiple challenges: massive scale involving all services including middleware, storage, and third-party dependencies; high business complexity with diverse scenarios requiring different data consistency levels; accumulated technical debt; significant new risks; strict constraints requiring zero-downtime migration without P2+ incidents; and complex cross-team coordination.
Key Constraints
Limited machine procurement: the Guizhou and Hangzhou deployments cannot be fully equivalent
Cross-region bandwidth capped at 200 Gbps, with network interruptions possible
Network latency of approximately 30 ms between Guizhou and Hangzhou
Business availability: no incidents of severity P2 or above
Minimal intrusion of the migration solution into business code
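The bandwidth and latency constraints lend themselves to a quick back-of-the-envelope check before a service is left split across the two regions. The sketch below is illustrative only and is not taken from the project: the `ServiceFlow` type, its fields, and the `check_budget` helper are hypothetical, and it assumes that each serial cross-region call costs roughly the 30 ms quoted above and that per-service traffic estimates are available.

```python
from dataclasses import dataclass

CROSS_REGION_LATENCY_MS = 30.0  # from the article: approx. Guizhou <-> Hangzhou latency
BANDWIDTH_CAP_GBPS = 200.0      # from the article: cross-region bandwidth cap


@dataclass
class ServiceFlow:
    """Hypothetical summary of one service's cross-region behaviour mid-migration."""
    name: str
    serial_cross_region_calls: int  # calls made one after another across regions, per request
    cross_region_gbps: float        # estimated cross-region traffic while the service is split


def added_latency_ms(flow: ServiceFlow) -> float:
    # Assumption: serial calls add up linearly; parallel calls are not modelled here.
    return flow.serial_cross_region_calls * CROSS_REGION_LATENCY_MS


def check_budget(flows, latency_budget_ms):
    """Report total cross-region traffic and the services exceeding a latency budget."""
    total_gbps = sum(f.cross_region_gbps for f in flows)
    over_budget = [f.name for f in flows if added_latency_ms(f) > latency_budget_ms]
    return {
        "total_gbps": total_gbps,
        "within_bandwidth_cap": total_gbps <= BANDWIDTH_CAP_GBPS,
        "over_latency_budget": over_budget,
    }


if __name__ == "__main__":
    flows = [
        ServiceFlow("service-a", serial_cross_region_calls=3, cross_region_gbps=50.0),
        ServiceFlow("service-b", serial_cross_region_calls=1, cross_region_gbps=120.0),
    ]
    print(check_budget(flows, latency_budget_ms=60.0))
```

With these made-up numbers, three serial cross-region calls add about 90 ms to a request, which is why chains of serial calls across regions are the first thing such a check would flag.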
Migration Strategy
The batch migration strategy follows four principles: decoupling by team and domain, keeping server-side traffic in a self-contained closed loop, migrating consumer-facing (C-end) services first, and staying within resource constraints. Traffic switching uses several mechanisms: client-side switching, DNS switching, gateway switching, and storage-layer switching, with a detailed migration strategy for each storage type (DB, Redis, Memcached).
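One way to picture how these principles could combine is a small batch planner. This is a minimal sketch under stated assumptions, not the team's actual tooling: the `App` type and `plan_batches` function are hypothetical, a "domain" is treated as the unit that must move together (one reading of the closed-loop principle), and the resource constraint is reduced to a single machine count per batch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class App:
    """Hypothetical application record used only for this illustration."""
    name: str
    domain: str
    consumer_facing: bool
    machines: int


def plan_batches(apps, machines_per_batch):
    """Pack whole domains into batches, consumer-facing domains first."""
    # Group by domain so that a domain's internal calls stay inside one batch.
    domains = defaultdict(list)
    for app in apps:
        domains[app.domain].append(app)

    def sort_key(item):
        domain, members = item
        is_c_end = any(a.consumer_facing for a in members)
        # False sorts before True, so consumer-facing domains come first.
        return (not is_c_end, domain)

    batches, current, used = [], [], 0
    for domain, members in sorted(domains.items(), key=sort_key):
        need = sum(a.machines for a in members)
        # Start a new batch when this domain would exceed the machine budget.
        # A domain larger than the budget still gets a batch of its own.
        if current and used + need > machines_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(domain)
        used += need
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    apps = [
        App("feed", "social", True, 40),
        App("pay", "billing", False, 30),
        App("play", "playback", True, 50),
        App("report", "analytics", False, 20),
    ]
    print(plan_batches(apps, machines_per_batch=100))
```

A real plan would also have to account for cross-domain dependencies and storage placement, which this greedy packing ignores.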
Stability Assurance
The project addressed stability risks through information gathering, identification of new risks, cleanup of historical technical debt (including ZK dependency issues, the Kafka-to-Nydus migration, hardcoded configuration, and service dependency refactoring), standardized integration, monitoring enhancements, emergency plans, and business-side technical solutions.
Reusable Systems
Key systems developed include the SOP Platform for standardized operations and the Auto-Upgrade Platform for automated component upgrades. Both provide foundational support for future large-scale projects.
NetEase Cloud Music Tech Team
Official account of NetEase Cloud Music Tech Team