
Serverless Adoption at NetEase Cloud Music: Architecture, Migration, and Benefits

NetEase Cloud Music transitioned from public FaaS to a private Knative‑based serverless platform that delivers elastic audio‑video processing, multi‑language support, and event‑driven scaling across hybrid private and public clouds, boosting resource utilization and cutting costs, though cold starts and the need for container expertise remain challenges.

NetEase Cloud Music Tech Team

In the era of cloud hosts, enterprises often face resource anxiety due to sudden spikes in workload and the need for rapid scaling. Serverless, with its pay‑as‑you‑go model that can scale down to zero, has become an attractive solution for many companies, including NetEase Cloud Music.

Initially, NetEase Cloud Music used public FaaS offerings from cloud providers. In 2020 the team began exploring Google’s open‑source Knative project, and by May 2021 decided to build a private‑cloud Serverless platform based on Knative to meet the elastic processing needs of its audio‑video pipelines while improving resource utilization and reducing costs.

The music catalog team processes hundreds of thousands of songs daily, requiring near‑real‑time transcoding, chorus detection, and feature extraction. Traditional resource provisioning suffered from low elasticity, persistent capacity anxiety, cumbersome manual scaling, low CPU utilization (below 20%), and stability issues.

After evaluating several open‑source options (Knative, OpenFaaS, Fission, Nuclio) and considering factors such as business fit, community support, and ease of use, the team selected Knative Serving for dynamic scaling and Knative Eventing for an event‑driven architecture.
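Knative Serving's core primitive is the Service resource, which declaratively couples a container image with autoscaling behavior. A minimal sketch of such a manifest is shown below; the service name, image, and annotation values are illustrative assumptions, not details from NetEase's platform:

```yaml
# Hypothetical Knative Service for an audio-transcoding worker.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: audio-transcode
spec:
  template:
    metadata:
      annotations:
        # Scale on requests-per-second rather than concurrency.
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target: "100"
        # Scale to zero when idle; cap replicas during spikes.
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "50"
    spec:
      containers:
        - image: registry.example.com/music/audio-transcode:v1
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```

Because scaling policy lives in per-revision annotations, each workload team can tune its own elasticity without touching shared platform configuration.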

Key implementation details include:

- Supporting arbitrary Dockerfiles so that teams can build language‑specific container images (Java, Go, Node, Python, etc.).
- Embedding a Kafka‑based message engine within Knative Eventing to route events to the Serverless platform.
- Replacing the default round‑robin load‑balancing algorithm with a least‑request strategy to mitigate pod overload and improve concurrency handling.
- Providing custom monitoring dashboards that expose metrics such as pod load, queue latency, CPU/memory usage, and auto‑scaling frequency.
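The intuition behind the least‑request strategy mentioned above is simple: instead of rotating through pods blindly, route each request to the pod with the fewest requests currently in flight, so a slow pod naturally receives less traffic. A minimal sketch of the selection logic (pod names are hypothetical; this is not NetEase's implementation):

```python
import threading


class LeastRequestBalancer:
    """Route each request to the pod with the fewest in-flight requests."""

    def __init__(self, pods):
        # Track in-flight request counts per pod.
        self.in_flight = {pod: 0 for pod in pods}
        self.lock = threading.Lock()

    def acquire(self):
        # Pick the least-loaded pod and count this request against it.
        with self.lock:
            pod = min(self.in_flight, key=self.in_flight.get)
            self.in_flight[pod] += 1
            return pod

    def release(self, pod):
        # Called when the request completes.
        with self.lock:
            self.in_flight[pod] -= 1


balancer = LeastRequestBalancer(["pod-a", "pod-b", "pod-c"])
first = balancer.acquire()   # all pods idle: first idle pod is chosen
second = balancer.acquire()  # the busy pod is skipped
balancer.release(first)
```

Unlike round‑robin, this scheme adapts to heterogeneous request durations, which matters for audio‑video jobs whose processing times vary widely.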

Cold‑start latency was addressed by using multi‑stage Dockerfile builds and P2P image acceleration. The platform also integrates with NetEase’s Horizon deployment system, enabling template‑driven instance creation for both Kubernetes Deployments and Knative services.
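Multi‑stage builds shrink the final image by shipping only the compiled artifact, which directly reduces image pull time and therefore cold‑start latency. A sketch for a hypothetical Go worker follows; the module path, binary name, and base images are illustrative assumptions:

```dockerfile
# Build stage: compile with the full toolchain.
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/worker ./cmd/worker

# Runtime stage: ship only the static binary on a minimal base.
FROM gcr.io/distroless/static
COPY --from=build /out/worker /worker
ENTRYPOINT ["/worker"]
```

The runtime image here contains no compiler or package manager, so it is both faster to pull and smaller in attack surface than a single‑stage build.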

After more than a year of iteration, the Serverless platform supports multiple languages, both online and batch workloads, blue‑green and traffic‑based canary releases, automatic scaling based on QPS and task volume, full‑stack monitoring, various triggers (HTTP, internal Kafka, Nydus queues), and hybrid deployment across private clouds and public‑cloud providers (Alibaba Cloud ECI, AWS).
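For the Kafka trigger path, Knative Eventing's KafkaSource resource can deliver messages from a topic directly to a service. A minimal sketch is below; the broker address, topic, and sink name are hypothetical, not taken from the article:

```yaml
# Hypothetical KafkaSource routing song-upload events to a
# transcoding service.
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: song-upload-source
spec:
  bootstrapServers:
    - kafka.internal:9092
  topics:
    - song-uploads
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: audio-transcode
```

Wiring producers to consumers through such a source keeps the event plumbing declarative, so the same service can be fed by HTTP, Kafka, or other triggers without code changes.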

Operational results show that during peak periods the Serverless workloads account for roughly 20% of CPU capacity, with over 500 Serverless applications and more than 10,000 virtual cores in use during high‑traffic windows. Resource utilization on private clouds increased by over 50%, and the mixed deployment model allows seamless spill‑over to public‑cloud resources when private capacity is saturated.

While Serverless brings significant cost and efficiency gains, the team notes challenges such as cold‑start delays, potential request loss during pod termination, and the need for teams to possess container/Kubernetes expertise. For workloads that are primarily offline, compute‑intensive, and bursty, Serverless is recommended; for steady, latency‑sensitive services, fixed‑size deployments may still be preferable.

Overall, NetEase Cloud Music’s Serverless journey demonstrates how open‑source cloud‑native technologies can be leveraged to achieve “cost‑reduction and efficiency‑increase” goals while maintaining high availability and scalability.

Tags: Cloud Native, serverless, scalability, resource optimization, Knative, event-driven architecture