Evolution and Practices of Modern Data Governance at NetEase DataFun
This article outlines how NetEase DataFun built a full‑stack big data platform and traces the four stages of its data governance, from designing a unified data middle‑platform to tackling cost, quality, and security problems. It closes with the principles of modern data governance, which tie together development, consumption, and continuous improvement.
Introduction: This article shares the development process of NetEase DataFun's data governance, explains modern data governance concepts, and proposes core features such as integrated development‑governance, closed‑loop governance, unified in‑and‑out‑of‑warehouse governance, and a data asset portal.
The talk is organized into four parts:
NetEase DataFun big data overview
Unified middle‑platform: design before development
Responsive governance: agile "movement" governance
Governance framework: modern data governance
Speaker: Yu Lihua, General Manager of NetEase DataFun Big Data product line.
01 NetEase DataFun Big Data Overview
NetEase's big data team started with distributed databases, file systems, and search engines. It adopted Hadoop for analysis in 2009, launched a big data platform in 2014, commercialized it in 2017, and built a data middle‑platform in 2018. The "Data Productivity" concept was introduced in 2020, and Data Governance 2.0 was released in 2022.
Today NetEase DataFun offers a four‑layer full‑stack big data product system:
Infrastructure layer: NDH distribution, compatible with CDH/CDP, providing storage and compute, with features like recycle bin, compute‑storage separation, and hybrid deployment.
Data development layer: Covers data design, development, testing, deployment, and operation, aiming for a DataOps‑style workflow.
Data middle‑platform layer: Provides metric systems, model design, data maps, etc., to help businesses build a data middle‑platform.
Data product layer: Offers BI tools, data portals, and low‑code/no‑code solutions to enable "everyone uses data" and realize data productivity.
02 Unified Middle‑Platform: Design Before Development
In 2018 rapid business growth led to siloed data warehouses, causing inconsistent metric definitions, lack of modeling standards, and data duplication.
Analysis showed over 50% of tasks read raw ODS data, 30% of ad‑hoc queries hit raw data, and many tables lacked proper layering, resulting in slow response and low query efficiency.
Solution: a three‑step process of metric definition, model definition, and data development. Clear metric standards come first to avoid inconsistent definitions, dimensional modeling follows, and data development comes last, with model quality measured by completeness, reuse, and compliance.
Results: cross‑layer reference rate dropped from 30% to below 10%, model reuse increased from 2.4% to 9.6%, and 34,000 models were decommissioned, significantly improving delivery speed and query performance.
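The talk does not give the exact formula for the cross‑layer reference rate, so the following is only a minimal sketch of one way such a metric could be computed. It assumes a four‑layer warehouse (ODS, DWD, DWS, ADS) and counts a task as cross‑layer when it reads from a layer more than one step below its target. The layer names, the `Task` type, and the sample tasks are all hypothetical.

```python
from dataclasses import dataclass

# Assumed layer order (not from the talk): ODS (raw) -> DWD -> DWS -> ADS.
LAYER_ORDER = {"ods": 0, "dwd": 1, "dws": 2, "ads": 3}


@dataclass(frozen=True)
class Task:
    name: str
    target_layer: str
    source_layers: tuple


def is_cross_layer(task: Task) -> bool:
    """True if any source sits more than one layer below the target."""
    target = LAYER_ORDER[task.target_layer]
    return any(target - LAYER_ORDER[src] > 1 for src in task.source_layers)


def cross_layer_rate(tasks: list) -> float:
    """Share of tasks that skip at least one layer when reading data."""
    if not tasks:
        return 0.0
    return sum(is_cross_layer(t) for t in tasks) / len(tasks)


# Hypothetical tasks for illustration only.
tasks = [
    Task("dwd_orders", "dwd", ("ods",)),           # adjacent layer
    Task("dws_user_daily", "dws", ("dwd",)),       # adjacent layer
    Task("ads_gmv_report", "ads", ("ods",)),       # skips DWD and DWS
    Task("ads_retention", "ads", ("dws", "ods")),  # one source skips layers
]
rate = cross_layer_rate(tasks)
print(f"cross-layer reference rate: {rate:.0%}")
```

Tracking a number like this per business line over time is what makes the "measurable model quality" goal checkable rather than aspirational.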
03 Responsive Governance: Agile "Movement" Governance
Cost issues: low ROI, resource misuse, and exponential cost growth.
Low output‑input ratio: many tables unused for over 30 days.
Inefficient resource usage: analysts write heavy SQL, developers complain about slow queries.
Cost index skyrockets due to uncontrolled machine scaling.
Solution: build a data asset center to account for query and storage costs, implement "onion"‑style data decommissioning, and forecast task/query costs for pre‑approval.
Result: 69 PB of data decommissioned, with table counts reduced by 47.6% for Cloud Music and 61.0% for Yanxuan, saving 38% of compute resources.
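The talk does not spell out how "onion"‑style decommissioning works. One plausible reading, assumed here, is that tables are peeled off layer by layer from the outside in: a table becomes a candidate only when it has been idle past a threshold and no remaining table depends on it. The sketch below follows that reading. The lineage map, the idle‑day figures (taken to mean days since the last consumer query), and the function name are hypothetical.

```python
def onion_rounds(downstream, idle_days, threshold_days=30):
    """Group idle tables into decommission rounds, outermost layer first.

    downstream: table -> tables that read from it (assumed lineage format)
    idle_days:  table -> days since the last consumer query (assumed metric)
    """
    remaining = set(idle_days)
    rounds = []
    while True:
        # A table can go only if it is idle and nothing still present reads it.
        layer = sorted(
            table for table in remaining
            if idle_days[table] > threshold_days
            and not remaining.intersection(downstream.get(table, ()))
        )
        if not layer:
            return rounds
        rounds.append(layer)
        remaining.difference_update(layer)


# Hypothetical lineage and access statistics.
downstream = {
    "ods_log": ["dwd_log"],
    "dwd_log": ["ads_old_report"],
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["ads_gmv"],
}
idle_days = {
    "ods_log": 90, "dwd_log": 60, "ads_old_report": 45,
    "ods_orders": 0, "dwd_orders": 1, "ads_gmv": 0,
}
rounds = onion_rounds(downstream, idle_days)
for number, tables in enumerate(rounds, start=1):
    print(f"round {number}: {tables}")
```

Removing in rounds keeps each step small and reversible, which fits the recycle bin and backup features described for the platform.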
Quality issues: an average of 10 data quality incidents per week, many of them discovered by the business side, causing severe losses.
Solution: full‑chain data quality tracking, intelligent baseline operations with SLA‑based baselines and early warning, and impact analysis to quickly resolve incidents.
Result: early detection and resolution of baseline breaches, preventing major accidents.
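To make the SLA‑based baseline idea concrete, here is a minimal sketch of an early warning check. It assumes a baseline is a chain of dependent tasks with historical average run times, estimates when the final task will finish, and raises a warning when that estimate comes within a safety margin of the SLA. The task names, durations, margin, and timestamps are hypothetical, and a real scheduler would use richer runtime predictions than a simple average.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter


def predict_finish(deps, avg_minutes, finished, now):
    """Estimate finish times; tasks already done keep their actual time."""
    eta = {}
    for task in TopologicalSorter(deps).static_order():
        if task in finished:
            eta[task] = finished[task]
            continue
        upstream_done = [eta[d] for d in deps.get(task, ())]
        # A pending task starts once all upstream tasks are done, not before now.
        start = max([now, *upstream_done])
        eta[task] = start + timedelta(minutes=avg_minutes[task])
    return eta


def breaches_soon(eta, final_task, sla, margin=timedelta(minutes=15)):
    """Warn when the estimated finish is within `margin` of the SLA."""
    return eta[final_task] + margin > sla


# Hypothetical baseline: task -> upstream tasks it depends on.
deps = {
    "ods_load": [],
    "dwd_clean": ["ods_load"],
    "dws_agg": ["dwd_clean"],
    "ads_report": ["dws_agg"],
}
avg_minutes = {"ods_load": 30, "dwd_clean": 40, "dws_agg": 50, "ads_report": 20}
now = datetime(2024, 1, 1, 5, 0)
finished = {"ods_load": datetime(2024, 1, 1, 4, 50)}  # ran late
sla = datetime(2024, 1, 1, 7, 0)

eta = predict_finish(deps, avg_minutes, finished, now)
at_risk = breaches_soon(eta, "ads_report", sla)
print("estimated finish:", eta["ads_report"], "| early warning:", at_risk)
```

In this example the report is still expected before the SLA, but the margin is already used up, so the warning fires while there is time to act.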
Security issues: accidental deletion of an entire data warehouse, insufficient permission granularity, and lack of approval processes.
Solution: public recycle bin, directory freeze, backup‑restore across clusters, row‑level and queue‑level permissions, tag‑based access control, and custom approval workflows.
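As a rough illustration of how tag‑based access control and row‑level permissions can combine, the sketch below hides any column carrying a tag the user has not been granted and returns only rows that match the user's row filter. It is not the platform's implementation. The `Column` and `User` types, the `pii` tag, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    tags: frozenset  # sensitivity tags attached to the column


@dataclass
class User:
    name: str
    granted_tags: frozenset  # tags this user is allowed to read
    row_filter: dict         # column -> required value (row-level rule)


def read(rows, columns, user):
    """Return only the rows and columns this user is permitted to see."""
    # Tag-based control: every tag on the column must be granted.
    allowed = [c.name for c in columns if c.tags <= user.granted_tags]
    return [
        {name: row[name] for name in allowed}
        for row in rows
        # Row-level control: the row must satisfy every filter condition.
        if all(row.get(col) == value for col, value in user.row_filter.items())
    ]


# Hypothetical table and user.
columns = [
    Column("order_id", frozenset()),
    Column("region", frozenset()),
    Column("phone", frozenset({"pii"})),
]
rows = [
    {"order_id": 1, "region": "east", "phone": "p-001"},
    {"order_id": 2, "region": "west", "phone": "p-002"},
]
analyst = User("analyst", granted_tags=frozenset(), row_filter={"region": "east"})
result = read(rows, columns, analyst)
print(result)
```

Attaching the rule to a tag rather than to each table means a newly tagged column is protected without anyone editing per‑table grants.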
04 Governance Framework: Modern Data Governance
Traditional data governance suffers from "pollute‑then‑govern", lack of unified metrics, poor data discoverability, and confinement to the big data platform.
Modern data governance should have four characteristics:
Integrated development‑governance from source.
Closed‑loop governance for existing data.
Unified in‑warehouse and out‑of‑warehouse governance.
Data asset portal for easy discovery and consumption.
Implementation includes:
Data standards defining data elements, metrics, and security rules.
Model design based on dimensional modeling linked to standards.
Data development following defined standards and metrics.
Closed‑loop governance with problem discovery, solution tools, and operational mechanisms (e.g., red‑black lists, KPI ties).
Asset scoring across security, cost, value, quality, and standards, with dashboards and recommendations.
Process for governance tickets, owner assignment, and cross‑department collaboration.
Culture building through data competitions, certifications, and training.
Logical data lake for unified governance of external sources (Oracle, MySQL) via metadata registration and mapping.
One‑stop data consumption platform with portal, permission management, and BI integration.
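The asset scoring item in the list above can be sketched as a weighted average across the five dimensions (security, cost, value, quality, standards), with the weakest dimensions surfaced as recommendations. The talk does not give the real scoring rules, so the 0 to 100 scale, the equal weights, and the sample scores below are assumptions for illustration.

```python
DIMENSIONS = ("security", "cost", "value", "quality", "standards")

# Assumed equal weights; a real deployment would tune these per business line.
DEFAULT_WEIGHTS = {dimension: 1 for dimension in DIMENSIONS}


def asset_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-dimension scores on an assumed 0-100 scale."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def weakest_dimensions(scores, count=2):
    """Lowest-scoring dimensions, i.e. where to recommend governance work."""
    return sorted(DIMENSIONS, key=lambda d: scores[d])[:count]


# Hypothetical scores for one business line.
scores = {"security": 90, "cost": 40, "value": 70, "quality": 80, "standards": 60}
overall = asset_score(scores)
focus = weakest_dimensions(scores)
print(f"overall: {overall:.1f}, focus on: {focus}")
```

A single overall number supports the red‑black lists and KPI ties mentioned above, while the per‑dimension breakdown tells an owner what to fix first.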
05 Q&A
Q1: Who defines data and metric standards inside NetEase? A: Business units have independent standards; data governance teams lead standardization and audit, while data teams provide data and fix metadata issues.
Q2: What is a governance baseline? A: A set of inter‑dependent tasks with SLA, monitored for expected completion time; early warnings are issued before a breach.
Q3: Can standards be aligned with industry benchmarks? A: Standards are productized and embedded throughout the data lifecycle, enabling both pre‑ and post‑governance compliance checks.
Conclusion: Modern data governance emphasizes integrated development‑governance, measurable closed‑loop improvement, and a focus on data consumption to unlock value.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
