How NetEase Cloud Music Automated Massive Service Upgrades with a Custom Platform
This article presents a comprehensive case study of NetEase Cloud Music's automatic upgrade platform, detailing the background challenges, technical architecture, sidecar versus component upgrades, workflow orchestration, operational safeguards, performance metrics, and future roadmap for large‑scale microservice migrations.
Background
Upgrading backend services at scale is difficult because of stability risks, high labor cost, and coordination overhead across many teams. Cloud Music’s rapid growth has led to over a thousand backend services, making manual upgrades costly and risky.
Stability risk : Compatibility issues in components can cause production instability.
Upgrade investment & cost : Each upgrade requires developer work, QA testing, and a multi‑week rollout.
Upgrade coordination cost : Development teams have little incentive to upgrade, and scheduling, troubleshooting, and multi‑team coordination add further effort.
During the migration of the Guizhou data center, a large number of applications needed to be upgraded. The team built an automatic upgrade platform to address stability, cost, and coordination challenges.
Technical Practice
Upgrade Classification
Upgrades are divided into two categories based on architecture:
Component upgrade – traditional JAR upgrade that often requires code changes.
Sidecar‑mode upgrade – decouples the component from the business application, allowing upgrades with little or no code change (e.g., JavaAgent, ServiceMesh).
Sidecar Mode Overview
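Sidecar mode deploys an auxiliary process alongside the main application to extend functionality without modifying the main codebase. In this article, JavaAgent‑style agents are also treated as a sidecar implementation, even though a Java agent is loaded into the application's JVM rather than run as a separate process, because they can likewise be upgraded without changing business code.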
Scalability : Add new capabilities by attaching sidecars.
Flexibility : Deploy, upgrade, and maintain sidecars independently.
Reusability : Share sidecar applications across multiple services.
Capability Panorama
The platform consists of four layers:
Underlying common capabilities built on Git, release platform, deployment platform, automated test platform, code analysis & search, and online monitoring.
Component upgrade capabilities supporting various file types.
Customizable task flow orchestration and upgrade rule configuration.
Use‑case support such as JDK upgrades, architecture migrations, risk mitigation, and compatibility testing.
Core Processes
The platform provides five generic capabilities:
Upgrade change : Git‑based branch creation, commit, merge‑request handling.
Test deployment : Automated creation and teardown of test environments.
Test verification : CI checks, automated test execution, deployment validation.
Online release : Gray release and standardized deployment workflow.
Result detection : Code and runtime analysis to verify successful upgrades.
Key design aspects include workflow orchestration, resource throttling & release, idempotent execution with exponential back‑off retries, and observability for both normal and error states.
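As an illustration of idempotent execution combined with exponential back‑off retries, the following is a minimal sketch and not the platform's actual implementation. The names `IdempotentRunner`, `TransientError`, and `backoff_delays`, along with the delay parameters, are assumptions made for this example.

```python
import time
from typing import Callable, Dict, List


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(base: float, factor: float, retries: int, cap: float) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor^2, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


class IdempotentRunner:
    """Runs a step at most once per key; re-running a finished key is a no-op."""

    def __init__(self, sleep: Callable[[float], None] = time.sleep):
        self._done: Dict[str, object] = {}  # step key -> stored result
        self._sleep = sleep                 # injectable so tests need not wait

    def run(self, key: str, action: Callable[[], object],
            max_retries: int = 3, base: float = 1.0,
            factor: float = 2.0, cap: float = 30.0) -> object:
        if key in self._done:
            # Step already completed earlier: return the stored result.
            return self._done[key]
        delays = backoff_delays(base, factor, max_retries, cap)
        for attempt in range(max_retries + 1):
            try:
                result = action()
            except TransientError:
                if attempt == max_retries:
                    raise  # retries exhausted, surface the error
                self._sleep(delays[attempt])
            else:
                self._done[key] = result
                return result
```

Keying each step (for example `"app-1:create-branch"`) is what makes a restart safe in this sketch: a step that already succeeded is skipped instead of being executed a second time.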
Component Upgrade Workflow
The default upgrade flow consists of:
Create a new Git branch to isolate changes.
Apply the upgrade plugin (based on OpenRewrite) to modify code, configuration, or dependencies.
Deploy the branch to a test environment and run automated tests.
Validate CI results and release the branch for review.
Merge the branch into master after approval.
Continuously detect component dependencies offline, in both source code and the runtime.
Release resources (delete branches, free test clusters) once verification passes.
Pre‑upgrade dry‑runs are performed on a subset of applications to surface compatibility and stability issues early.
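The default flow above can be pictured as an ordered list of steps that stops at the first failure. This is a rough sketch only: the step names mirror the list, but the step bodies are placeholders, and `UpgradeContext`, `run_flow`, and the branch naming are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class UpgradeContext:
    app: str
    log: List[str] = field(default_factory=list)  # names of completed steps
    error: str = ""                               # message of the failing step, if any


StepFn = Callable[[UpgradeContext], None]

# Step names follow the default flow described in the text.
STEP_NAMES = [
    "create_branch",
    "apply_upgrade_plugin",
    "deploy_and_test",
    "validate_ci",
    "merge_to_master",
    "detect_dependencies",
    "release_resources",
]


def placeholder(name: str) -> StepFn:
    """Stand-in for real work (Git calls, deployments, ...): only records the step."""
    def step(ctx: UpgradeContext) -> None:
        ctx.log.append(name)
    return step


DEFAULT_FLOW: List[Tuple[str, StepFn]] = [(n, placeholder(n)) for n in STEP_NAMES]


def run_flow(ctx: UpgradeContext,
             flow: List[Tuple[str, StepFn]]) -> Tuple[bool, Optional[str]]:
    """Run steps in order; return (True, None) or (False, name_of_failed_step)."""
    for name, step in flow:
        try:
            step(ctx)
        except Exception as exc:
            ctx.error = str(exc)
            return False, name
    return True, None
```

Because the flow is plain data, a task could swap or drop individual steps without touching the runner. That is one possible way to support customized flows, though the article does not say how the platform does it.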
Task Orchestration & Non‑Functional Design
The platform supports custom task pipelines, idempotent execution, resource throttling, retry strategies per exception type, MQ‑based notifications, visualized process and error information, and extensible hooks at each pipeline stage.
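To make three of these ideas concrete (resource throttling, retry strategies per exception type, and hooks around a stage), here is a small sketch under stated assumptions. The exception classes, the `RETRY_LIMITS` table, and the `Stage` class are invented for illustration and do not come from the platform.

```python
import threading
from typing import Callable, Dict, List, Type


class EnvNotReady(Exception):
    """Hypothetical: test environment is still starting (worth retrying)."""


class CompileError(Exception):
    """Hypothetical: upgraded code does not compile (retrying will not help)."""


# Assumed policy: how many retries each exception type is allowed.
RETRY_LIMITS: Dict[Type[Exception], int] = {EnvNotReady: 3, CompileError: 0}


def retries_for(exc: Exception) -> int:
    for exc_type, limit in RETRY_LIMITS.items():
        if isinstance(exc, exc_type):
            return limit
    return 0  # unknown errors are not retried


Hook = Callable[[str, str], None]  # (stage name, application name)


class Stage:
    def __init__(self, name: str, action: Callable[[str], str],
                 slots: threading.BoundedSemaphore):
        self.name = name
        self.action = action
        self.slots = slots            # shared limit on concurrent executions
        self.before: List[Hook] = []  # extension points before the stage runs
        self.after: List[Hook] = []   # extension points after it succeeds

    def run(self, app: str) -> str:
        for hook in self.before:
            hook(self.name, app)
        with self.slots:  # blocks while all slots are in use (throttling)
            attempt = 0
            while True:
                try:
                    result = self.action(app)
                    break
                except Exception as exc:
                    if attempt >= retries_for(exc):
                        raise
                    attempt += 1
        for hook in self.after:
            hook(self.name, app)
        return result
```

A semaphore shared by all stages of the same kind caps how many test deployments run at once, while the policy table keeps the retry decision separate from the stage logic.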
Task Management & Functional Design
Features include:
Application selection by team, service, or dependent JAR.
Precise source and target JAR version configuration.
Configurable upgrade rules and plugin versions per task.
Independent task flow customization.
Controls for retry, skip, abort, and restart, with full resource cleanup.
Statistics on task scope, duration, success rates, and resource usage.
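A task definition of this kind might look roughly like the sketch below, which selects applications by dependent JAR, source version, and optionally team. The `App` and `UpgradeTask` types and the artifact coordinates are hypothetical examples, not the platform's data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class App:
    name: str
    team: str
    deps: Dict[str, str]  # artifact coordinate -> version currently in use


@dataclass
class UpgradeTask:
    artifact: str               # JAR to upgrade, e.g. "com.example:rpc-client"
    source_versions: List[str]  # only apps on exactly these versions are in scope
    target_version: str         # version to upgrade to


def select_apps(apps: List[App], task: UpgradeTask,
                team: Optional[str] = None) -> List[App]:
    """Return apps that depend on the artifact at a listed source version."""
    return [
        app for app in apps
        if app.deps.get(task.artifact) in task.source_versions
        and (team is None or app.team == team)
    ]


apps = [
    App("music-feed", "feed", {"com.example:rpc-client": "1.2.0"}),
    App("music-search", "search", {"com.example:rpc-client": "1.4.0"}),
    App("music-pay", "pay", {"com.example:cache": "3.0.0"}),
]
task = UpgradeTask("com.example:rpc-client", ["1.2.0", "1.3.0"], "1.4.0")
in_scope = [app.name for app in select_apps(apps, task)]
```

With the sample data above, only `music-feed` is in scope: `music-search` is already on the target version and `music-pay` does not use the artifact.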
Operational Data
In six months, the platform supported three major migration events, achieving:
~50% one‑shot upgrade success rate for large‑scale migrations.
Approximately 500 person‑days saved (≈83% efficiency gain) when upgrading 1,000 applications.
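The article does not state the manual baseline, but it can be estimated from the two figures above. The sketch assumes that the 83% figure means person‑days saved divided by the manual baseline; that reading is an inference, not something the source confirms.

```python
def implied_effort(saved_days: float, gain: float, apps: int):
    """Back out baseline and remaining effort, assuming gain = saved / baseline."""
    baseline = saved_days / gain       # estimated manual effort for all apps
    remaining = baseline - saved_days  # effort still spent with the platform
    return baseline, remaining, baseline / apps, remaining / apps


baseline, remaining, per_app_before, per_app_after = implied_effort(500, 0.83, 1000)
print(round(baseline), round(remaining))                  # 602 102
print(round(per_app_before, 1), round(per_app_after, 1))  # 0.6 0.1
```

Under that assumption, the manual baseline is about 600 person‑days for 1,000 applications (roughly 0.6 person‑days each), and about 100 person‑days (roughly 0.1 each) remain with the platform.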
Root causes for failures included outdated component versions, insufficient test environment configuration, non‑standard dependency usage, and new component incompatibilities.
Future Outlook
Planned enhancements:
Increase one‑shot upgrade success rate.
Add full sidecar upgrade support.
Integrate component publishing, version management, risk mitigation, and automatic upgrade into a closed loop.
The platform continues to evolve to meet the growing scale and complexity of Cloud Music’s microservice ecosystem.
