
Cloud Music Data Governance Practice

This article presents a comprehensive case study of NetEase Cloud Music's data governance practice, covering data background, governance philosophy, detailed solutions across metadata, storage, compute, and model design, practical implementations, measurable cost savings, and future planning for sustainable data management.

DataFunTalk

Introduction

The sharing session titled "Cloud Music Data Governance Practice" was delivered by Huang Kai, a senior data development engineer at NetEase Cloud Music, and edited by Sun Yinggang from Harbin Institute of Technology.

Data Background

Cloud Music has launched nine independent products over the past nine years, including domestic and overseas social products. All product data is supported by the data warehouse development team. The data environment includes over 2,000 scheduled tasks, more than 50,000 tables, 12 active projects on the Mammoth data development platform, and over 600 data users.

Key challenges identified:

Scale: massive data volume and task count.

Cost: more than 80 PB of storage costing over ¥190,000 per day, plus daily compute costs of ¥270,000.

Quality: resource utilization above 95%, creating operational risk.

Efficiency: legacy Hive and Spark 2 jobs, small file issues, and direct ODS table reads.

Environment: five on‑premise Hadoop clusters plus AWS and Alibaba Cloud.

Overall problems: large scale, heterogeneous environment, lack of standards, and resource waste.

Governance Approach

The governance direction is defined from four perspectives: data assurance, model design rationality, management standards, and monitoring. Specific issues include legacy Hive/Spark 2 jobs, chaotic model dependencies, uncontrolled database proliferation, and lack of monitoring for invalid files.

To address these, the team focuses on obtaining complete and accurate metadata (tables, tasks, lineage) as the foundation for effective governance.
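As a hedged illustration of how complete metadata enables this kind of governance (the data model and names below are hypothetical, not from the talk), table and lineage metadata can be scanned for governance candidates, such as tables nothing reads and tables with no recorded owner:

```python
# Hypothetical sketch: given table metadata and table->task lineage
# edges, flag tables that no task consumes (cleanup candidates) and
# tables without an owner. All names and fields are illustrative.

def governance_candidates(tables, lineage):
    """tables: {name: {"owner": str | None}}
    lineage: iterable of (source_table, consuming_task) edges."""
    consumed = {src for src, _task in lineage}
    unused = sorted(t for t in tables if t not in consumed)
    unowned = sorted(t for t, meta in tables.items() if not meta.get("owner"))
    return unused, unowned

tables = {
    "ods.user_log": {"owner": "alice"},
    "dwd.play_event": {"owner": "bob"},
    "tmp.ab_test_2019": {"owner": None},
}
lineage = [("ods.user_log", "task_build_dwd")]

unused, unowned = governance_candidates(tables, lineage)
print(unused)   # tables with no downstream consumer
print(unowned)  # tables with no owner on record
```

In practice the inputs would come from the platform's metadata store rather than inline dictionaries; the point is that accurate lineage turns "which tables can we retire?" into a simple query.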

Governance Solution

Metadata-driven modeling on the Mammoth platform produces a rich CDM layer, enabling visibility of assets, health metrics, and cost usage per team. The governance framework follows four pillars: monitoring, standardization, tooling, and execution.

The guiding principles are: governance based on evidence, clear responsibility, sustainable mechanisms, recoverable outcomes, and reusable methods.

Governance Practice

The practice emphasizes two core capabilities: clear ownership of data, tables, and tasks, and a sustainable mechanism for rolling out governance. Ownership actions include assigning ODS tasks/tables, handling the assets of departing personnel, and managing project account tables.

Mechanism: a dedicated governance process with public announcements (email or group), followed by execution and rollback if needed.

Key governance actions:

"Source" cleanliness – ensure clean source data.

"Complexity" 80/20 – apply the 80/20 rule to complex problems, achieving high ROI with minimal effort.

"Stability" guarantee – maintain online stability.

Specific implementations include:

HDFS layer – identification and cleanup of orphan files: roughly 450 million files removed, reclaiming more than 7 PB of storage.

Database layer – consolidation of redundant databases and recovery of approval permissions.

Table layer – lifecycle governance, large‑table handling, and A/B test table management, reducing the table count from 55,000 to 30,000.

Model layer – "three‑degree" metrics (health, value, progress) covering reuse rate, penetration, asset standardization, and coverage.

Compute layer – migration to Spark 3 with optimizations (AQE, Z‑order, ZSTD compression), achieving 30% resource savings on core jobs.
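The HDFS-layer step above, orphan-file identification, amounts to a set difference between paths present on the filesystem and paths covered by registered table locations. A minimal sketch under that assumption (paths and helper names are hypothetical; a production version would analyze an fsimage dump and query the metastore rather than hold path lists in memory):

```python
# Hypothetical sketch: treat an HDFS file as "orphaned" when no
# registered table location is a prefix of its path. Inputs are plain
# strings here; real inputs would come from fsimage analysis and the
# Hive metastore.

def find_orphans(hdfs_files, table_locations):
    """Return files not under any registered table location."""
    roots = tuple(loc.rstrip("/") + "/" for loc in table_locations)
    return sorted(f for f in hdfs_files if not f.startswith(roots))

hdfs_files = [
    "/warehouse/dwd.db/play_event/dt=2019-01-01/part-0",
    "/warehouse/dropped.db/old_table/part-0",
    "/tmp/spark-staging/xyz/part-0",
]
table_locations = ["/warehouse/dwd.db/play_event"]

print(find_orphans(hdfs_files, table_locations))
```

Working from an offline fsimage dump keeps the scan off the live NameNode, which matters at the hundreds-of-millions-of-files scale described above.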

Governance Results

Cost and efficiency gains:

Storage: annual savings of ¥25 million; more than 30 PB of storage decommissioned; daily growth reduced from 170 TB to 55 TB.

Compute: 30% resource reduction for core tasks and improved cluster stability.

Table cleanup: 25,000 tables removed.

Asset consolidation: visual dashboards for data assets, three‑degree metrics, cost and storage, and governance effectiveness.

Tooling: personalized governance reports, temporary table reports, reuse‑rate lists, and ownership monitoring tools.

Development standards: refined guidelines for database usage, temporary table creation, task naming, queue usage, deployment, and decommission processes.

Future Planning

Data governance is a long‑term, continuous effort aiming to shift from scattered to centralized, from passive to proactive and automatic, and from experience‑driven to intelligence‑driven. The roadmap includes pre‑governance (preventive measures, a data white paper), mid‑governance (monitoring tools, real‑time alerts), and post‑governance (guidelines, automated tools).

Thank you for reading.

Big Data · metadata · cost optimization · data warehouse · data governance · Spark · Hadoop
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
