Inside NetEase Cloud Music’s MLOps: Scaling AI with VK, ECI, and Ceph
This article details NetEase Cloud Music’s four‑layer machine‑learning platform architecture, covering resource provisioning with Virtual Kubelet and Alibaba Cloud ECI, Ceph storage optimizations, TensorFlow migration, large‑scale graph neural network support, and end‑to‑end workflow tooling that together enable efficient, cost‑effective AI development and deployment.
Resource Layer: Platform Core Capability Assurance
The platform secures compute, storage, networking, and tenancy resources while optimizing costs through virtualization and dynamic resource pools. It can quickly acquire additional compute from other teams for bursty workloads. Two examples illustrate this:
Virtual Kubelet (VK) Resources
NetEase operates many Kubernetes clusters. The internal kubeMiner system unifies scheduling across clusters, allowing idle resources to be presented as virtual nodes within the main cluster. By routing CPU‑intensive tasks such as graph computation, large‑scale discrete problems, and distributed training to VK, parallelism and iteration speed improve dramatically.
Typical CPU jobs (tfjob, mpijob, PaddlePaddle) require 4‑8 cores and 12‑20 GB memory per replica. Prior to VK, limited compute forced low replica counts and long training times. VK enables multi‑replica, multi‑task parallelism by leveraging idle capacity in other clusters.
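The effect of adding VK nodes on replica counts can be illustrated with a little arithmetic. The sketch below is not kubeMiner's scheduler; the node names and idle capacities are made-up assumptions, and only the per-replica size (8 cores, 20 GB, the upper end of the range above) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str              # hypothetical node name
    idle_cpu: int          # idle CPU cores
    idle_mem_gb: int       # idle memory in GB
    virtual: bool = False  # True for a VK node backed by another cluster


def replicas_that_fit(nodes, cpu_per_replica, mem_gb_per_replica):
    """Count replicas that fit, limited per node by the scarcer of CPU and memory."""
    total = 0
    for node in nodes:
        by_cpu = node.idle_cpu // cpu_per_replica
        by_mem = node.idle_mem_gb // mem_gb_per_replica
        total += min(by_cpu, by_mem)
    return total


# Assumed capacities, for illustration only.
main_cluster = [Node("main-1", idle_cpu=16, idle_mem_gb=48)]
with_vk = main_cluster + [
    Node("vk-cluster-a", idle_cpu=64, idle_mem_gb=256, virtual=True),
    Node("vk-cluster-b", idle_cpu=32, idle_mem_gb=96, virtual=True),
]

before = replicas_that_fit(main_cluster, cpu_per_replica=8, mem_gb_per_replica=20)
after = replicas_that_fit(with_vk, cpu_per_replica=8, mem_gb_per_replica=20)
print(before, after)  # 2 14
```

With these assumed numbers the same job goes from 2 to 14 replicas once idle capacity from the other clusters is visible as virtual nodes.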
Alibaba Cloud Elastic Container Instance (ECI)
GPU resources are scarce across the company. To address sudden GPU demand, the platform integrates Alibaba Cloud ECI in a manner similar to VK. Users select an ECI resource in the UI, and the platform schedules GPU workloads elastically. This capability is already used by bursty services.
Base Layer: Foundational Capabilities for Users
This layer builds on the resource layer to provide big‑data, real‑time, and massive‑task scheduling capabilities via Spark, Hadoop, Flink, and Kubernetes + Docker. A major focus is Ceph, the distributed storage system used throughout the platform.
Ceph Optimizations
Ceph powers a shared filesystem for development and training tasks, but growing usage revealed several pain points:
Data safety – CephFS lacks a recycle‑bin, making accidental deletions irreversible.
Cost‑performance balance – Pure SSD drives are fast but expensive; pure HDDs are cheap but slow. A mixed SSD‑for‑logs, HDD‑for‑data layout is desired.
High‑throughput small‑file workloads – Compilation, logging, and sample downloads require both high throughput and efficient small‑file handling.
To address these, the team collaborated with the internal storage group and delivered three major improvements:
Improvement 1: CephFS Recycle Bin – Implemented a trashbin directory; delete operations are transformed into rename calls, preserving user habits while enabling periodic cleanup and restoration.
Improvement 2: Mixed‑Storage Performance Boost – Analyzed I/O patterns, identified two critical bottlenecks in the Ceph code path, and refactored them. The tuned version improves latency and IOPS by 7‑8× when resources are ample, and by more than 2× under throttling.
Improvement 3: Full‑Stack CephFS Performance Enhancements – Made large‑directory deletion asynchronous (seconds instead of hours), doubled large‑file write bandwidth while halving latency, sped up git status and make builds by moving metadata handling from user‑space FUSE to kernel space, and introduced a multi‑metadata‑node architecture that cuts average metadata latency by more than 50%.
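The delete‑as‑rename idea behind Improvement 1 can be sketched on an ordinary local filesystem. This is only an illustration of the technique, not the team's CephFS patch: the directory name .trashbin, the timestamp prefix, and all function names are assumptions introduced here.

```python
import os
import shutil
import tempfile
import time

TRASH_DIR = ".trashbin"  # hypothetical trash directory name


def soft_delete(root, rel_path):
    """Rename root/rel_path into the trash directory instead of unlinking it."""
    trash = os.path.join(root, TRASH_DIR)
    os.makedirs(trash, exist_ok=True)
    # Timestamp prefix avoids collisions and records when the delete happened.
    trashed_name = f"{time.time_ns()}__{rel_path.replace(os.sep, '__')}"
    os.rename(os.path.join(root, rel_path), os.path.join(trash, trashed_name))
    return trashed_name


def restore(root, trashed_name, rel_path):
    """Move a trashed entry back to rel_path (its parent directory must exist)."""
    os.rename(os.path.join(root, TRASH_DIR, trashed_name),
              os.path.join(root, rel_path))


def purge_older_than(root, max_age_seconds, now=None):
    """Periodic cleanup: permanently remove entries trashed too long ago."""
    trash = os.path.join(root, TRASH_DIR)
    if not os.path.isdir(trash):
        return 0
    now = time.time() if now is None else now
    removed = 0
    for entry in os.listdir(trash):
        deleted_at = int(entry.split("__", 1)[0]) / 1e9
        if now - deleted_at > max_age_seconds:
            path = os.path.join(trash, entry)
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            removed += 1
    return removed


# Demo in a throwaway directory.
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, "samples.txt"), "w") as f:
    f.write("data")
trashed = soft_delete(demo_root, "samples.txt")
restore(demo_root, trashed, "samples.txt")
```

Because a rename within one filesystem only touches metadata, the user‑visible delete stays fast, while the actual data removal is deferred to the cleanup pass.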
Application Framework Layer: Tools for Most ML Workloads
This layer supplies the frameworks needed for model development, including TensorFlow and large‑scale graph neural networks (GNNs).
TensorFlow Migration and A100 MIG
In 2021 the platform added A100 GPUs, which require CUDA 11, but the TensorFlow 1.x builds used internally did not support CUDA 11. The team upgraded the platform to fully support TensorFlow 2.6 and the NVIDIA‑provided TensorFlow 1.15, both of which run on CUDA 11. They also adopted NVIDIA MIG (Multi‑Instance GPU) to partition a single A100 into smaller instances (2‑10 GB, 3‑20 GB, 4‑40 GB), increasing overall throughput.
After migration, training speed improved by more than 40% on average (up to 170% for some jobs) and inference performance rose by more than 20%, while compatibility with legacy TF 1.x models was maintained.
Large‑Scale Graph Neural Networks
Using the PaddlePaddle Graph Learning (PGL) framework, the platform built a user‑embedding graph with billions of edges across users, songs, podcasts, etc. Challenges included massive data volume, huge model parameters, and cost‑benefit trade‑offs.
Solutions implemented:
GraphService provides graph‑database‑like storage and sampling for massive graphs.
Kubernetes MPI‑Operator enables ultra‑large‑scale graph storage and sampling.
Integration of K8s TF‑Operator and MPI‑Operator to handle distributed training, storage, and sampling.
Elastic scaling of compute and storage via VK resources and CephFS, dynamically expanding or shrinking resources after training.
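The sampling role that GraphService plays can be illustrated with a toy in‑memory neighbor sampler. This is a minimal sketch of fixed‑fanout multi‑hop sampling in general, not the GraphService or PGL API; the class name ToyGraphService, its methods, and the node IDs are all hypothetical.

```python
import random
from collections import defaultdict


class ToyGraphService:
    """Hypothetical stand-in: stores an undirected graph and samples neighbors."""

    def __init__(self, seed=0):
        self._adj = defaultdict(list)
        self._rng = random.Random(seed)

    def add_edge(self, src, dst):
        self._adj[src].append(dst)
        self._adj[dst].append(src)

    def sample_neighbors(self, node, fanout):
        """Return at most `fanout` neighbors of `node`, sampled without replacement."""
        neighbors = self._adj.get(node, [])
        if len(neighbors) <= fanout:
            return list(neighbors)
        return self._rng.sample(neighbors, fanout)

    def sample_subgraph(self, seeds, fanouts):
        """Multi-hop sampling: one fanout per hop, returns the set of nodes reached."""
        visited = set(seeds)
        frontier = list(seeds)
        for fanout in fanouts:
            next_frontier = []
            for node in frontier:
                for nb in self.sample_neighbors(node, fanout):
                    if nb not in visited:
                        visited.add(nb)
                        next_frontier.append(nb)
            frontier = next_frontier
        return visited


# Made-up heterogeneous graph of users, songs, and podcasts.
graph = ToyGraphService(seed=42)
for src, dst in [("user:1", "song:a"), ("user:1", "song:b"), ("user:1", "song:c"),
                 ("song:a", "user:2"), ("user:2", "podcast:x")]:
    graph.add_edge(src, dst)

nodes = graph.sample_subgraph(["user:1"], fanouts=[2, 1])
```

Capping the fanout per hop keeps the sampled subgraph bounded regardless of node degree, which is what makes training on a graph with billions of edges tractable; at that scale the adjacency lists would be sharded across many workers rather than held in one process.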
Function Layer: End‑to‑End ML Lifecycle Support
The platform orchestrates the full ML lifecycle: data sample services, feature operator development, model training & offline evaluation, and model service deployment & continuous updates.
Key capabilities:
Standardized sample services that integrate Spark, Flink, Hadoop, and expose a unified FeatureStore.
Feature operator DSL that compiles into feature_extractor packages for both offline and online use.
Model deployment workflow that handles initial resource provisioning, environment setup, and dynamic updates of models, configs, and dictionaries without full redeployment.
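Updating models and dictionaries without full redeployment usually comes down to swapping a reference inside a running service. The following is a minimal sketch of that pattern under assumed names (HotSwappableModel, update, predict); it does not describe the platform's actual serving code.

```python
import threading


class HotSwappableModel:
    """Hypothetical serving wrapper that swaps model state atomically."""

    def __init__(self, version, predict_fn, dictionary):
        self._lock = threading.Lock()
        # All state lives in one tuple so readers always see a consistent set.
        self._state = (version, predict_fn, dictionary)

    def update(self, version, predict_fn=None, dictionary=None):
        """Install a newer version; components left as None are carried over."""
        with self._lock:
            current_version, current_fn, current_dict = self._state
            if version <= current_version:
                return False  # ignore stale or duplicate updates
            self._state = (
                version,
                predict_fn if predict_fn is not None else current_fn,
                dictionary if dictionary is not None else current_dict,
            )
            return True

    def predict(self, token):
        # Single reference read: an in-flight request never mixes versions.
        version, predict_fn, dictionary = self._state
        return version, predict_fn(dictionary.get(token, 0))


# Toy model and dictionary, for illustration only.
service = HotSwappableModel(1, lambda x: x * 2, {"rock": 1})
first = service.predict("rock")                               # (1, 2)
updated = service.update(2, dictionary={"rock": 1, "jazz": 5})
second = service.predict("jazz")                              # (2, 10)
stale = service.update(1, dictionary={})                      # rejected
```

Only the dictionary changed in the second version, so the model function is carried over unchanged and the process never restarts.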
End‑to‑End Platform Benefits
By centering on models, the platform reduces developer effort, cutting end‑to‑end development time from weeks to days, and provides visualized workflow tracking through a unified metadata center that records sample usage, feature schemas, training hyper‑parameters, and resource consumption.
ModelZoo: Modular Model Service Marketplace
ModelZoo offers reusable model services, SDKs, and APIs for inference, fine‑tuning, and re‑training. It abstracts resources (GPU, VK, ECI), algorithms (CV, NLP, Faiss), delivery methods (SDK vs. API), and tasks (inference, fine‑tuning, re‑training). Recent work includes serverless deployment via K8s, integration of TF‑Serving and TorchServe, and performance tuning (MKL‑compiled images, session thread adjustments) that reduces latency by ~30% in high‑QPS scenarios.
Conclusion
The NetEase Cloud Music machine‑learning platform spans four layers—resource, base, application framework, and function—each addressing distinct technical challenges while collectively enabling scalable, cost‑effective AI development and deployment across a wide range of services.
