
Cloud Music Guizhou Data Center Migration: A Large-Scale Infrastructure Migration Case Study

In 2023, NetEase Cloud Music carried out its largest-ever data-center migration, moving over 20,000 applications and more than one million queries per second to a new Guizhou facility. The team met zero-downtime, latency, and bandwidth constraints through a batch-wise, cross-team strategy built on automated upgrade platforms, standardized operations, and extensive risk-mitigation measures.

NetEase Cloud Music Tech Team

This article documents NetEase Cloud Music's 2023 large-scale data center migration project, which moved over 20,000 applications and more than one million QPS to a new Guizhou data center. The project is the largest and most complex technical migration in the company's history.

Project Challenges

The migration faced several challenges: massive scale, covering all services along with their middleware, storage, and third-party dependencies; high business complexity, with diverse scenarios that required different levels of data consistency; accumulated technical debt; significant new risks; strict constraints, namely a zero-downtime migration with no incidents of severity P2 or above; and complex cross-team coordination.

Key Constraints

Limited machine procurement: the Guizhou and Hangzhou deployments cannot be fully equivalent

Cross-region bandwidth must stay within 200 Gbps, and the link may experience network interruptions

Network latency between Guizhou and Hangzhou is approximately 30 ms

Business availability requirement: no incidents of severity P2 or above

Migration solutions should intrude on business code as little as possible
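To make the latency and bandwidth constraints concrete, the following back-of-the-envelope sketch estimates what cross-region calls would cost. Only the 30 ms and 200 Gbps figures come from the list above. The function names, the assumption that each sequential cross-region call adds roughly the full 30 ms, and the example payload size are illustrative assumptions, not values from the project.

```python
# Figures taken from the constraints listed above.
CROSS_REGION_LATENCY_MS = 30   # approximate Guizhou <-> Hangzhou latency
BANDWIDTH_CAP_GBPS = 200       # cross-region bandwidth limit


def added_latency_ms(sequential_cross_region_calls: int) -> int:
    """Extra latency if each sequential cross-region call costs ~30 ms (assumption)."""
    return sequential_cross_region_calls * CROSS_REGION_LATENCY_MS


def bandwidth_gbps(qps: int, avg_payload_bytes: int) -> float:
    """Bandwidth consumed if every request crossed regions with this payload size."""
    return qps * avg_payload_bytes * 8 / 1e9


def within_bandwidth_cap(qps: int, avg_payload_bytes: int) -> bool:
    """True when the estimated cross-region traffic fits under the cap."""
    return bandwidth_gbps(qps, avg_payload_bytes) <= BANDWIDTH_CAP_GBPS


if __name__ == "__main__":
    # A request chain with 5 sequential cross-region hops.
    print(added_latency_ms(5))
    # One million QPS with a hypothetical 10 KB average payload.
    print(bandwidth_gbps(1_000_000, 10_000))
    print(within_bandwidth_cap(1_000_000, 10_000))
```

Even a handful of sequential cross-region hops adds latency on the order of 100 ms or more under these assumptions, which is one way to see why keeping call chains inside a single data center matters for the batch design.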

Migration Strategy

The batch migration strategy follows four principles: decoupling by team and domain, keeping server-side traffic in a closed loop within a single data center, prioritizing consumer-facing (C-end) services, and respecting resource constraints. Traffic switching uses several approaches, including client-side, DNS, gateway, and storage-layer switching, and each storage type (DB, Redis, Memcached) has its own detailed migration strategy.
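One way to read the closed-loop principle is that services which call each other should move in the same batch, so that their calls never cross the Guizhou-Hangzhou link mid-migration. The sketch below groups services into such candidate batches by finding connected components of a call graph. It is an illustration under that assumption, not the team's actual planning tool; the function names, service names, and size limit are hypothetical.

```python
from collections import defaultdict


def closed_loop_groups(calls):
    """Group services so that every (caller, callee) pair lands in one group.

    `calls` is an iterable of (caller, callee) service-name pairs.
    Returns a list of sorted groups (connected components of the call graph).
    """
    # Treat the call graph as undirected: either direction couples two services.
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph[callee].add(caller)

    seen = set()
    groups = []
    for service in sorted(graph):
        if service in seen:
            continue
        # Depth-first walk collecting everything reachable from this service.
        stack, component = [service], []
        seen.add(service)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(component))
    return groups


def oversized(groups, max_batch_size):
    """Return groups too large to migrate as one batch under a resource limit."""
    return [group for group in groups if len(group) > max_batch_size]


if __name__ == "__main__":
    # Hypothetical call edges between services.
    calls = [("feed", "user"), ("user", "profile-db"), ("search", "index")]
    groups = closed_loop_groups(calls)
    print(groups)
    print(oversized(groups, max_batch_size=2))
```

Groups flagged as oversized would need to be split along team or domain boundaries, with the cut edges handled as explicit cross-region calls for the duration of the migration.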

Stability Assurance

The project addressed stability risks through information gathering, identification of new risks, handling of historical technical debt (including ZK dependency issues, the Kafka to Nydus migration, hardcoded configuration, and refactoring of service dependencies), standardized integration, monitoring enhancements, emergency plans, and business-side technical solutions.
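Hardcoded configuration is the kind of debt that can be surfaced mechanically before a migration. As a minimal sketch, assuming the hardcoding in question includes literal IPv4 addresses in configuration text, the scan below reports each occurrence with its line number. The function name and sample config keys are hypothetical, and the pattern is deliberately loose, so it can also match dotted version strings.

```python
import re

# Loose IPv4 matcher: four dot-separated groups of 1-3 digits.
# It does not validate octet ranges, so false positives are possible.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_ips(config_text: str):
    """Return (line_number, address) pairs for IPv4 literals in the text."""
    findings = []
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        for match in IPV4_PATTERN.finditer(line):
            findings.append((line_no, match.group()))
    return findings


if __name__ == "__main__":
    # Hypothetical configuration snippet.
    sample = "db.host=10.0.0.12\ncache.host=redis.internal\nmq.host=10.0.3.7:9092\n"
    for line_no, address in find_hardcoded_ips(sample):
        print(f"line {line_no}: {address}")
```

Each finding would then be reviewed and replaced with a service name or a centrally managed configuration entry, so the value can change when the service moves data centers.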

System Accumulation

Key systems developed during the project include the SOP Platform for standardized operations and the Auto-Upgrade Platform for automated component upgrades. Both provide foundational support for future large-scale projects.
