Modern Data Governance at NetEase DataFan: Evolution, Challenges, and Solutions
This article details NetEase DataFan's journey in building a full‑stack big‑data platform, explains the design‑first data‑mid‑platform approach, analyzes cost, quality, and security problems encountered, and presents the modern data‑governance framework that integrates development, governance, and consumption into a closed loop.
NetEase DataFan shares the development history of its big‑data product line: early work on distributed databases, file systems, and search engines; Hadoop‑based analytics starting in 2009; the launch of a data platform in 2014; commercial exploration in 2017; the formal introduction of the "data productivity" concept in 2020; and the release of Data Governance 2.0 in 2022.
The platform is organized into four layers:
(1) Infrastructure: the NDH distribution (or CDH/CDP) providing storage and compute, with features such as recycle‑bin support and compute‑storage separation.
(2) Data R&D: covering the full lifecycle from design to testing, deployment, and operations, aiming for a DataOps‑style workflow.
(3) Data Mid‑Platform: offering metric systems, model design, and data‑map products that help businesses build their own data mid‑platform.
(4) Data Product: low‑code/no‑code tools such as BI and data portals that let users create scenario‑driven data products and achieve the "everyone uses data" goal.
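As a purely illustrative aid, the layering can be captured in a few lines of Python; the keys and capability names below simply mirror the list above and do not correspond to any real DataFan configuration format.

```python
# Purely illustrative mapping of the four layers to the responsibilities
# named above; these names mirror the article, not an actual config schema.
STACK = {
    "infrastructure":    ["storage", "compute", "recycle bin", "compute-storage separation"],
    "data_rnd":          ["design", "testing", "deployment", "operations"],
    "data_mid_platform": ["metric system", "model design", "data map"],
    "data_product":      ["BI", "data portals", "scenario-driven apps"],
}

def layer_for(capability: str) -> str:
    """Locate which layer owns a capability, e.g. layer_for('data map')."""
    for layer, caps in STACK.items():
        if capability in caps:
            return layer
    raise KeyError(capability)
```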
In the first stage, NetEase adopted a "design‑first, develop‑later" strategy for the data mid‑platform. Rapid business growth in 2018 exposed problems such as inconsistent metric definitions, lack of modeling standards, and massive data duplication, leading to slow development cycles and poor query performance. Analysis showed that over 50% of tasks read raw ODS data directly, with many ad‑hoc queries hitting raw tables, and more than 40% of tables lacking proper layering.
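The kind of lineage audit behind those numbers can be sketched in a few lines. The (task, source_table) edge format and the layer prefixes below are assumptions for illustration, not NetEase's actual metadata schema.

```python
# Minimal sketch of a lineage audit: given (task, source_table) pairs,
# measure how many tasks read raw ODS tables directly and how many tables
# carry no warehouse-layer prefix at all.

LAYER_PREFIXES = ("ods_", "dwd_", "dws_", "ads_", "dim_")  # common warehouse layers

def audit_lineage(edges: list[tuple[str, str]]) -> dict[str, float]:
    tasks = {t for t, _ in edges}
    tables = {s for _, s in edges}
    ods_tasks = {t for t, s in edges if s.startswith("ods_")}   # reads raw ODS directly
    unlayered = {s for s in tables if not s.startswith(LAYER_PREFIXES)}
    return {
        "pct_tasks_reading_ods": 100 * len(ods_tasks) / len(tasks),
        "pct_unlayered_tables": 100 * len(unlayered) / len(tables),
    }

edges = [
    ("daily_gmv", "ods_orders"),         # task reading raw ODS directly
    ("daily_gmv", "dim_users"),
    ("retention", "dws_user_activity"),
    ("legacy_report", "tmp_orders_bak"), # table with no layer prefix
]
print(audit_lineage(edges))
# {'pct_tasks_reading_ods': 33.33..., 'pct_unlayered_tables': 25.0}
```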
To address these issues, NetEase introduced a three‑step solution: metric definition, model definition, and data development. Metric definition standardized atomic and derived metrics, reducing the number of metrics for the Kaola e‑commerce business by about half. Model definition applied dimensional modeling and enforced the use of a shared data layer, pulling ODS data into a common repository. Data development then focused on measuring model completeness, reuse, and compliance, resulting in a reduction of cross‑layer references from 30% to below 10% and an increase in model reuse from 2.4% to 9.6%.
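A hedged sketch of the metric‑definition step: a derived metric is registered as an atomic metric plus a time window and filters, so the same business quantity cannot be redefined twice with different logic. The class and field names are illustrative, not the DataFan metric‑system API.

```python
# Sketch: one atomic metric of record, with derived metrics expressed as
# atomic metric + time window + filters. Names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicMetric:
    name: str          # e.g. "gmv"
    expression: str    # aggregation over a fact table, e.g. "SUM(pay_amount)"
    fact_table: str    # the single table of record for this measure

@dataclass(frozen=True)
class DerivedMetric:
    name: str                      # canonical registered name, e.g. "gmv_7d_app"
    base: AtomicMetric
    time_window: str               # e.g. "7d"
    filters: tuple[str, ...] = ()  # e.g. ("channel = 'app'",)

gmv = AtomicMetric("gmv", "SUM(pay_amount)", "dwd_trade_orders")
gmv_7d_app = DerivedMetric("gmv_7d_app", gmv, "7d", ("channel = 'app'",))
```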
The second stage, called "dynamic governance," tackled three major pain points:
Cost issues: low ROI, inefficient resource usage, and exponential growth in infrastructure costs.
Quality issues: frequent data‑quality incidents reported by business teams, including severe cases causing financial loss.
Security issues: accidental deletions, insufficient permission granularity, and inadequate approval processes.
Solutions included building a data‑asset center to allocate costs down to individual queries and tables, implementing an "onion‑style" data decommissioning process, and establishing baseline‑based task monitoring with early‑warning alerts. These measures took 69 PB of data offline, cut compute resource usage by 38%, and significantly improved data‑quality incident response.
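Baseline monitoring of this kind can be sketched as follows, assuming each critical task carries a promised completion time (its baseline) and a forecast finish derived from upstream progress; the warning margin is an invented value for illustration, not NetEase's actual threshold.

```python
# Sketch of baseline-based early warning: compare a task's forecast finish
# time against its promised baseline and emit an alert level.
from datetime import datetime, timedelta

def check_baseline(task: str,
                   baseline: datetime,
                   forecast_finish: datetime,
                   warn_margin: timedelta = timedelta(minutes=30)) -> str:
    """Return an alert level for one task against its baseline."""
    if forecast_finish > baseline:
        return f"BREACH: {task} forecast {forecast_finish:%H:%M} is past baseline {baseline:%H:%M}"
    if forecast_finish > baseline - warn_margin:
        return f"WARN: {task} is within {warn_margin} of its baseline"
    return f"OK: {task}"

now = datetime(2022, 6, 1, 5, 0)
print(check_baseline("dws_user_activity",
                     baseline=datetime(2022, 6, 1, 7, 0),
                     forecast_finish=now + timedelta(hours=1, minutes=45)))
# WARN: dws_user_activity is within 0:30:00 of its baseline
```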
The final stage presents a modern data‑governance model with four core characteristics: (1) integrated development‑governance, ensuring data quality at the source; (2) a closed‑loop governance process that discovers, resolves, and operates on issues; (3) unified governance of both in‑platform and external data sources (databases, MPP, etc.); and (4) a data‑asset portal that makes assets discoverable and consumable.
Implementation details include defining data standards (data elements), linking them to metrics, models, and security policies, and applying these standards throughout the data lifecycle. Governance metrics (cost, security, value, quality, and standards) are scored and visualized, enabling users to identify low‑performing assets and trigger automated decommissioning or optimization workflows. Continuous operation relies on a governance workflow in which owners receive tickets, resolve issues, and the system tracks improvements.
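One way such scoring might look, as a sketch: per‑dimension scores are combined into a weighted asset health score, and low scorers are routed into decommissioning or optimization tickets. The weights and thresholds below are assumptions for illustration, not DataFan's actual formula.

```python
# Sketch: weighted 0-100 health score across the five governance dimensions,
# with hypothetical cutoffs routing assets into governance tickets.
DIMENSIONS = ("cost", "security", "value", "quality", "standards")
WEIGHTS = {"cost": 0.25, "security": 0.15, "value": 0.25,
           "quality": 0.20, "standards": 0.15}

def health_score(scores: dict[str, float]) -> float:
    """Weighted score across the five governance dimensions."""
    return sum(WEIGHTS[d] * scores[d] for d in DIMENSIONS)

def triage(asset: str, scores: dict[str, float]) -> str:
    s = health_score(scores)
    if s < 40:   # hypothetical cutoff
        return f"{asset}: score {s:.0f} -> open decommission ticket for owner"
    if s < 70:   # hypothetical cutoff
        return f"{asset}: score {s:.0f} -> open optimization ticket"
    return f"{asset}: score {s:.0f} -> healthy"

print(triage("ads_legacy_report",
             {"cost": 20, "security": 80, "value": 10,
              "quality": 50, "standards": 40}))
# ads_legacy_report: score 36 -> open decommission ticket for owner
```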
Additional topics covered are the logical data‑lake unified governance solution for external tables, the one‑stop data‑consumption platform, and a Q&A session addressing standard‑setting responsibilities, baseline management, and alignment with industry standards.