Cloud Music Guizhou Data Center Migration: A Large-Scale Infrastructure Migration Case Study

In 2023, NetEase Cloud Music executed its largest-ever data center migration, moving over 20,000 applications and more than one million queries per second to a new facility in Guizhou. The migration met its zero-downtime requirement under strict latency and bandwidth limits by using a batch-wise, cross-team strategy that combined automated upgrade platforms, standardized operations, and extensive risk mitigation.

NetEase Cloud Music Tech Team

Jul 11, 2024

This article documents NetEase Cloud Music's 2023 data center migration, which moved over 20,000 applications and more than one million QPS to a new data center in Guizhou. It was the largest and most complex technical migration in the company's history.

Project Challenges

The migration faced several challenges: massive scale, covering all services along with their middleware, storage, and third-party dependencies; high business complexity, with diverse scenarios that require different levels of data consistency; accumulated technical debt; new risks introduced by the migration itself; a strict requirement to migrate with zero downtime and no incidents of severity P2 or above; and complex cross-team coordination.

Key Constraints

Limited machine procurement: the Guizhou deployment cannot fully mirror the Hangzhou deployment

Cross-region bandwidth must stay within 200 Gbps, and the cross-region link may be interrupted

Network latency between Guizhou and Hangzhou is approximately 30 ms

Business availability: no incidents of severity P2 or above

Migration solutions should intrude on business code as little as possible
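A back-of-the-envelope sketch shows why the latency and bandwidth constraints matter for planning. Only the 30 ms and 200 Gbps figures come from the list above. The function names, the assumption that each sequential cross-region call costs roughly one 30 ms delay, and the example traffic figures are hypothetical and used purely for illustration.

```python
CROSS_REGION_LATENCY_MS = 30   # approximate Guizhou <-> Hangzhou latency (from the constraints)
BANDWIDTH_CAP_GBPS = 200       # cross-region bandwidth ceiling (from the constraints)


def added_latency_ms(sequential_cross_region_calls: int) -> int:
    """Extra latency for a request that makes N sequential cross-region calls.

    Assumes each call pays the full cross-region latency once.
    """
    return sequential_cross_region_calls * CROSS_REGION_LATENCY_MS


def within_bandwidth_cap(flows_gbps: dict[str, float]) -> bool:
    """True if the summed cross-region traffic of all flows stays within the cap."""
    return sum(flows_gbps.values()) <= BANDWIDTH_CAP_GBPS


# Hypothetical traffic figures, not taken from the source.
example_flows = {"db_replication": 40.0, "cache_sync": 25.0, "rpc": 90.0}

print(added_latency_ms(5))                  # 150
print(within_bandwidth_cap(example_flows))  # True
```

Under these assumptions, a request that crosses regions five times in sequence pays about 150 ms extra, which is why keeping call chains inside one data center matters during the transition.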

Migration Strategy

Services were migrated in batches, planned according to four principles: decoupling by team and domain, keeping server-side traffic self-contained (a closed loop within each batch), prioritizing consumer-facing (C-end) services, and respecting resource constraints. Traffic was switched at several layers: client-side switching, DNS switching, gateway switching, and storage-layer switching. Each storage type (DB, Redis, Memcached) had its own detailed migration strategy.
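One way to reason about the closed-loop traffic principle is to measure how many service-to-service calls would cross a batch boundary under a proposed plan. The sketch below is an illustration, not the team's actual tooling: the service names, the batch assignment, and both function names are hypothetical.

```python
def cross_batch_calls(calls, batch_of):
    """Return the call edges whose caller and callee are in different batches."""
    return [(src, dst) for src, dst in calls if batch_of[src] != batch_of[dst]]


def closure_ratio(calls, batch_of):
    """Fraction of call edges that stay inside one batch (1.0 means fully closed)."""
    if not calls:
        return 1.0
    return 1 - len(cross_batch_calls(calls, batch_of)) / len(calls)


# Hypothetical call graph: (caller, callee) pairs.
calls = [
    ("gateway", "playlist"),
    ("playlist", "song"),
    ("song", "comment"),
    ("comment", "user"),
]
# Hypothetical batch plan: service name -> batch number.
batch_of = {"gateway": 1, "playlist": 1, "song": 1, "comment": 2, "user": 2}

print(cross_batch_calls(calls, batch_of))  # [('song', 'comment')]
print(closure_ratio(calls, batch_of))      # 0.75
```

Every edge reported by `cross_batch_calls` is a call that would cross regions while the two batches sit in different data centers, so a plan with a higher closure ratio spends less of the latency and bandwidth budget.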

Stability Assurance

The project addressed stability risks through information gathering, identification of new risks, cleanup of historical technical debt (including ZK dependency issues, the migration from Kafka to Nydus, hardcoded configuration, and refactoring of service dependencies), standardized integration, monitoring enhancements, emergency plans, and technical solutions on the business side.
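Hardcoded configuration is the kind of debt that a simple scan can surface before a migration. The source does not describe how the team found such cases, so the following is only a minimal sketch: it flags IPv4 literals in configuration text, and the function name, the regular expression, and the sample configuration are all hypothetical.

```python
import re

# Matches dotted-quad IPv4 literals, optionally followed by a port.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d{1,5})?\b")


def find_hardcoded_addresses(text: str) -> list[tuple[int, str]]:
    """Return (line_number, match) pairs for IPv4 literals found in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_RE.finditer(line):
            hits.append((lineno, match.group()))
    return hits


# Hypothetical configuration content for illustration.
sample = """\
zk.connect=10.0.0.12:2181
cache.host=redis.internal.example
db.url=jdbc:mysql://10.0.3.7:3306/music
"""

print(find_hardcoded_addresses(sample))
# [(1, '10.0.0.12:2181'), (3, '10.0.3.7:3306')]
```

A pattern this simple also matches four-part version strings, so its output is a list of candidates for review rather than a definitive finding.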

Reusable Systems

The key systems developed during the project are the SOP Platform, for standardized operations, and the Auto-Upgrade Platform, for automated component upgrades. Both provide foundational support for future large-scale projects.
