How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance
This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.
Challenges
Before governance, the team faced severe baseline operation problems: more than 80% of nights required emergency on‑call work, baseline output was frequently not ready by the start of the working day, and cumulative daily delays approached ten hours. Three main challenges were identified:
Massive task volume: billions of log entries and tens of thousands of tasks required rapid recovery after failures.
Resource scarcity: nightly resource usage exceeded 95% with no buffer or elastic capacity.
High business expectations: the MUSE data product relied on hundreds of business dashboards refreshed daily, making KPI accuracy and timeliness critical.
Goals and Metrics
The governance effort aimed to deliver four kinds of value: give management real‑time business metrics, ensure stable and timely consumer‑facing (C‑end) data, build the data development team's reputation, and improve engineers' day‑to‑day operational experience to strengthen organizational stability.
Two quantitative targets were set:
98% overall service availability, i.e. no more than roughly seven unavailable days per year (365 × 2% ≈ 7.3 days).
Baseline production time meeting core SLA requirements.
Action Plan
Overall Solution
The team broke the solution into three pillars: platform infrastructure, task operations, and organizational processes—summarized as “stable foundation, optimized tasks, defined standards.”
Stable Infrastructure
Key issues identified were unclear queue usage, resource monitoring that relied on individual experience rather than data, overloaded NameNode clusters, and weak resource control. Solutions implemented:
Cluster stability: collaborated with the Hangzhou research team to split high‑load NameNode clusters, migrate hundreds of tables, and strengthen monitoring of NVMe disks and critical nodes.
Resource digitization: built a reliable usage model that turns cluster resource status into numeric indicators for quick assessment (a minimal sketch follows this list).
Productization: added queue‑priority support and self‑service analysis and data backfill features to improve resource efficiency.
Queue usage guidelines: defined roles for high, medium, and low‑priority queues and enforced usage standards.
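The article does not describe the resource digitization model in detail. As a minimal sketch, assuming per‑queue usage samples can be pulled from the scheduler, the indicators might look like the following; the QueueSample type, queue names, and thresholds are illustrative, not the team's actual implementation.

```python
# Hypothetical sketch of "resource digitization": turn raw per-queue usage
# samples into simple status indicators. All names and thresholds are assumed.
from dataclasses import dataclass
from statistics import mean

@dataclass
class QueueSample:
    queue: str            # scheduler queue name (assumed YARN-style setup)
    used_vcores: float    # vCores in use at sample time
    total_vcores: float   # queue capacity at sample time

def utilization_report(samples: list[QueueSample]) -> dict[str, str]:
    """Summarize each queue's average utilization as a red/yellow/green status."""
    by_queue: dict[str, list[float]] = {}
    for s in samples:
        by_queue.setdefault(s.queue, []).append(s.used_vcores / s.total_vcores)

    report = {}
    for queue, ratios in by_queue.items():
        avg = mean(ratios)
        if avg >= 0.95:
            report[queue] = "red: no headroom, rebalance or escalate"
        elif avg >= 0.80:
            report[queue] = "yellow: watch closely, avoid new backfills"
        else:
            report[queue] = "green: capacity available"
    return report

# Example: a nightly snapshot of two priority queues
samples = [
    QueueSample("core_high", 930, 1000),
    QueueSample("core_high", 970, 1000),
    QueueSample("adhoc_low", 420, 1000),
]
print(utilization_report(samples))  # core_high -> red, adhoc_low -> green
```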
Optimized Tasks
To handle the massive, diverse workloads, the team:
Introduced streaming ETL to pre‑aggregate logs hour by hour, so the final batch step can complete in about one hour (see the first sketch after this list).
Migrated tasks from Hive and Spark 2 to Spark 3, refactored SQL, and adjusted queue assignments.
Established lineage between tables, tasks, and baselines to reduce scheduling delays and dependency errors.
Implemented anomaly monitoring for key metrics (e.g., DAU) using Holt‑Winters and XGBoost models, boosting recall by 74%, precision by 40%, and overall accuracy by 20% (a hedged Holt‑Winters example follows this list).
Spark upgrades, supported by the Hangzhou research team, transformed several hundred tasks, cutting resource consumption by 60%, improving performance by 52%, and reducing file counts by 69%.
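To make the hourly pre‑aggregation concrete, here is a minimal Spark Structured Streaming sketch. The Kafka topic, the event_time/user_id/play_count schema, and the staging paths are assumptions for illustration; the article only states that streaming ETL pre‑aggregates data hourly so the batch stage can finish within an hour.

```python
# Minimal sketch: hourly pre-aggregation of play logs with Structured Streaming.
# Requires the spark-sql-kafka package; topic, schema, and paths are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly_preagg").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "play_logs")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json(
              "json",
              "event_time TIMESTAMP, user_id STRING, play_count LONG").alias("e"))
          .select("e.*"))

# One-hour tumbling windows; tolerate up to 10 minutes of late data.
hourly = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "1 hour"), "user_id")
          .agg(F.sum("play_count").alias("plays")))

# Each completed hour lands in a staging table the nightly batch merges from.
query = (hourly.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/warehouse/staging/hourly_plays")
         .option("checkpointLocation", "/warehouse/checkpoints/hourly_plays")
         .trigger(processingTime="5 minutes")
         .start())
```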
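For the metric anomaly monitoring, the article names Holt‑Winters and XGBoost but does not describe the features, thresholds, or how the two are combined. The sketch below covers only the Holt‑Winters half: forecast the next point of a daily series and flag a value that falls outside a residual‑based band. The weekly seasonality, the 3‑sigma threshold, and the synthetic DAU series are assumptions.

```python
# Hedged sketch of Holt-Winters-based anomaly detection for a daily metric.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def is_anomalous(history: np.ndarray, today: float,
                 season: int = 7, z: float = 3.0) -> bool:
    """Flag today's value if it sits outside a z-sigma band around the forecast."""
    model = ExponentialSmoothing(history, trend="add", seasonal="add",
                                 seasonal_periods=season).fit()
    forecast = model.forecast(1)[0]
    residual_std = (history - model.fittedvalues).std()
    return abs(today - forecast) > z * residual_std

# Example: eight weeks of synthetic DAU with a weekly cycle, then a sharp drop.
rng = np.random.default_rng(0)
weekly_pattern = np.tile([100, 98, 97, 99, 103, 110, 112], 8).astype(float)
history = weekly_pattern + rng.normal(0, 1, weekly_pattern.size)

print(is_anomalous(history, today=80.0))   # True: far below the expected range
print(is_anomalous(history, today=101.0))  # False: within the normal band
```

In practice a tree model such as XGBoost could layer calendar and holiday features on top of this statistical baseline, which is consistent with the hybrid approach the article mentions without specifying.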
Defined Standards
Standardization focused on operation scope and SOPs:
Defined core product and report baselines, reducing on‑call tasks from tens of thousands to a few thousand.
Assigned clear task owners using a hybrid tool‑plus‑manual approach.
Established baseline mounting principles, role responsibilities, and constraints.
Created strict SOPs, reward‑penalty mechanisms, and a “data ops traffic police” team with Hangzhou research for rapid incident handling.
Set up an official ops alert group for instant notifications.
Coordinated with security, front‑end, and QA teams to ensure unified incident review and root‑cause analysis.
Results
Business impact: core baselines now complete in the early morning, the number of days with baseline alerts dropped by 60%, baseline breaches fell to zero, and the 98% availability target was met.
Technical achievements: the team won the 2022 NetEase Group Open‑Source Introduction Award for their machine‑learning‑driven metric anomaly detection. Resource digitization enabled precise elastic scaling, preventing task delays and baseline breaches, and improved overall resource utilization and cost efficiency.
Product side: with strong support from the Hangzhou research team, queue resource tilting (prioritizing capacity for high‑priority queues) and self‑service data backfill features were launched, further boosting resource efficiency.
Future Plans
The team will continue governance across four dimensions:
Product: introduce DataOps for automated code auditing throughout development, release, and approval.
System: enhance resource monitoring with label‑level CPU/Memory metrics and provide rationality assessments for task‑level resource configurations.
Business: expand content‑level monitoring coverage and accuracy, and close the lineage loop for online services.
Mechanism: collaborate with analysts and data product teams to define deprecation processes for reports and historical tasks.
Governance is a long‑term effort; success depends on meticulous execution and continuous refinement.