How Alibaba’s Big Data Model Governance Boosted Efficiency and Cut Costs
This report describes Alibaba’s large‑scale data model governance initiative for the DaTao ecosystem. It analyzes current data issues such as naming inconsistencies, low reuse, and application‑layer inefficiencies, then presents a solution covering a model evaluation system, DataWorks co‑development, intelligent modeling, data map enhancements, and a future roadmap. The goal is to improve data health, reduce costs, and increase operational efficiency.
Current State of the Data
Alibaba’s DaTao data system has grown dramatically and supports complex business scenarios. As data volume and the number of developers increased, several problems surfaced: non‑standard table naming, low reuse of common‑layer tables, and inefficient application‑layer practices. These lead to higher storage costs, reduced efficiency, weaker standards, and heavier operational burdens.
1. Normative Issues
Table names often do not follow Alibaba’s big‑data naming standards, making governance difficult.
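A naming check of this kind can be automated. The sketch below is illustrative only: the article does not spell out Alibaba’s naming standard, so the layer prefixes and the `<layer>_<domain>_<description>` pattern are assumptions, and `check_table_name` is a hypothetical helper.

```python
import re

# Assumed convention for illustration: <layer>_<domain>_<description>.
# The layer prefixes below are hypothetical, not Alibaba's actual standard.
LAYERS = ("ods", "dwd", "dws", "dim", "ads")
NAME_PATTERN = re.compile(
    r"^(?:" + "|".join(LAYERS) + r")"   # layer prefix
    r"_[a-z][a-z0-9]*"                  # data domain
    r"_[a-z0-9]+(?:_[a-z0-9]+)*$"       # free-form description
)


def check_table_name(name: str) -> list[str]:
    """Return a list of violations; an empty list means the name conforms."""
    problems = []
    if name != name.lower():
        problems.append("name must be lowercase")
    if not NAME_PATTERN.match(name.lower()):
        problems.append("name must look like <layer>_<domain>_<description>")
    return problems


if __name__ == "__main__":
    for table in ("dwd_trade_order_detail", "tmp_orders", "DWD_Trade_Order"):
        print(table, check_table_name(table))
```

Running such a check across all project tables gives a simple compliance rate that can feed a governance score.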
2. Common‑Layer Reuse Issues
Many common‑layer tables are referenced by fewer than two downstream projects.
The common layer is under‑built or under‑exposed (e.g., references to CDM tables decline while references to ADS tables rise).
Numerous ADS tables share similar logic and fields, leading to code duplication.
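Both reuse symptoms above can be measured from metadata. The following is a minimal sketch, assuming that table references and field lists have already been exported from a metadata store; the function names, the input shapes, and the Jaccard threshold are hypothetical choices, not the method used in the original initiative.

```python
from itertools import combinations


def low_reuse_tables(all_tables, references, min_projects=2):
    """Return common-layer tables referenced by fewer than min_projects projects.

    references: iterable of (table, downstream_project) pairs.
    Tables with no references at all are included in the result.
    """
    projects = {table: set() for table in all_tables}
    for table, project in references:
        projects.setdefault(table, set()).add(project)
    return sorted(t for t, p in projects.items() if len(p) < min_projects)


def similar_table_pairs(table_fields, threshold=0.8):
    """Return (table_a, table_b, score) for pairs whose field sets overlap heavily.

    Similarity is the Jaccard index of the two field-name sets.
    """
    pairs = []
    for a, b in combinations(sorted(table_fields), 2):
        fields_a, fields_b = set(table_fields[a]), set(table_fields[b])
        union = fields_a | fields_b
        if not union:
            continue  # two tables without fields carry no signal
        score = len(fields_a & fields_b) / len(union)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs
```

Field‑name overlap is only a rough proxy for duplicated logic; flagged pairs are candidates for manual review rather than proof of redundancy.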
3. Application‑Layer Efficiency Issues
Proliferation of temporary tables (TDDL, PAI, machine, stress‑test) hampers data management.
Common‑layer tables are scattered across many teams, causing uneven ownership.
Deep dependency chains in ADS tables (often >10 layers) increase complexity.
Cross‑market dependencies among ADS tables affect data stability and accuracy.
Mixed ownership of tables leads to uneven maintenance workload.
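Dependency depth, one of the issues listed above, can be computed from lineage data. This is a sketch under assumptions: lineage is given as a mapping from each table to the tables it reads from, depth counts every layer in the longest upstream chain, and the 10‑layer limit simply mirrors the figure mentioned above. The function names are hypothetical.

```python
def dependency_depths(upstream):
    """Compute the longest upstream chain length for every table in `upstream`.

    upstream: dict mapping table -> list of tables it reads from.
    A table with no upstream tables has depth 1.
    """
    memo = {}
    visiting = set()

    def depth(table):
        if table in memo:
            return memo[table]
        if table in visiting:
            raise ValueError(f"cycle detected at {table}")
        visiting.add(table)
        parents = upstream.get(table, [])
        result = 1 + max((depth(p) for p in parents), default=0)
        visiting.discard(table)
        memo[table] = result
        return result

    return {table: depth(table) for table in upstream}


def tables_over_limit(upstream, limit=10):
    """Return tables whose dependency depth exceeds `limit`, sorted by name."""
    depths = dependency_depths(upstream)
    return sorted(t for t, d in depths.items() if d > limit)
```

The recursion is adequate for a sketch; a very deep or very large lineage graph would call for an iterative topological traversal instead.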
Problem Analysis
A data‑driven analysis reveals gaps in four areas: evaluation, construction, management, and usage. Evaluation lacks a unified scoring system; construction suffers from missing end‑to‑end modeling tools; management relies on ad‑hoc governance; and usage is hindered by the difficulty of finding and trusting data.
Solution
Based on the analysis, the following objectives were set:
Model digitization: build a comprehensive model evaluation framework to assess data health and provide improvement suggestions.
Common‑model sinking: define clear standards for when data should be moved down ("sunk") into the common layer.
Productization: co‑develop a modeling product covering design, review, development, control, and governance.
Daily governance: monitor model health continuously and optimize.
Find‑data efficiency: improve data retrieval, enhance recommendation accuracy, and showcase core data in data catalogs.
Overall Design
1. DataWorks Co‑development
DataWorks, built on MaxCompute/EMR/Hologres, provides a one‑stop big‑data development and governance platform. By deeply collaborating with the DataWorks team, Alibaba contributed years of modeling, development, and operations experience to enhance intelligent modeling, development assistants, and data maps, achieving end‑to‑end productization of data design, development, control, and usage.
Intelligent Data Modeling
DataWorks now supports four major modules: warehouse planning, data standards, dimensional modeling, and data metrics. Features include reverse modeling of existing physical tables, forward visual modeling, Excel‑based modeling, and code‑based modeling. Models can be reviewed and published to engines such as MaxCompute and Hologres, and generated ETL code integrates with DataStudio to boost developer efficiency.
Warehouse planning: customizable data domains, data marts, and naming conventions.
Dimensional modeling: reverse engineering of existing tables, forward modeling of dimension, detail, summary, and application tables, with visual, Excel, and code options.
Model release: supports five engines and auto‑generates ETL scaffolding.
Development Assistant
The assistant provides permission reminders, release controls, and automatic construction of temporary tables during code development.
2. Model Scoring
Model scores are visualized on an internal dashboard, offering governance suggestions that link directly to relevant product pages. Scores can be quickly configured for new projects by providing a project list, producing table‑level, owner‑level, and BU‑level metrics.
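The roll‑up from table‑level scores to owner‑level and BU‑level metrics can be sketched as a simple aggregation. The article does not describe the scoring formula, so this example assumes that each table already has a score between 0 and 100 and that higher levels use an unweighted mean; the field names and `roll_up` are hypothetical.

```python
from collections import defaultdict


def roll_up(table_scores, key):
    """Average table-level scores by the given key (e.g. "owner" or "bu").

    table_scores: list of dicts with at least `key` and "score" entries.
    Returns a dict mapping each key value to its mean score, rounded to 1 decimal.
    """
    groups = defaultdict(list)
    for row in table_scores:
        groups[row[key]].append(row["score"])
    return {k: round(sum(v) / len(v), 1) for k, v in groups.items()}


if __name__ == "__main__":
    rows = [
        {"table": "ads_a", "owner": "alice", "bu": "bu_x", "score": 80},
        {"table": "ads_b", "owner": "alice", "bu": "bu_x", "score": 60},
        {"table": "dwd_c", "owner": "bob", "bu": "bu_x", "score": 100},
    ]
    print(roll_up(rows, "owner"))
    print(roll_up(rows, "bu"))
```

An unweighted mean treats every table equally; a real scoring system might weight tables by storage, compute cost, or downstream usage.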
3. Find‑Data Efficiency
Search & Recommendation: left‑side filter conditions expose high‑frequency options, improving CTR.
Content & Organization – Table Description Upgrade: integrated with Yuque editor for better terminology management.
Data Albums: collaborative maintenance, multi‑user editing, and searchable remarks.
Data Map Integration: tables are marked as model tables with visible model information.
Data Q&A Bot: algorithmic processing of data knowledge enables table lookup and usage via a chatbot.
Summary of Achievements (FY22)
Established a model evaluation system that assesses data health from multiple dimensions and provides governance recommendations.
Delivered intelligent modeling capabilities (warehouse planning, dimensional modeling) in collaboration with DataWorks.
Enhanced the data map with search, recommendation, data album, and table description features, greatly improving data discovery and usage.
Defined clear common‑layer sinking standards, collaboration rules, hand‑over processes, and newcomer training to ensure consistent governance.
Future Plans
Upgrade the DaTao model architecture, covering design, development, operation, and governance standards.
Continue co‑building with DataWorks to further improve common‑layer and application‑layer development efficiency.
Accelerate data‑map adoption, allowing teams to independently configure and maintain data albums.
Deepen integration between intelligent modeling outputs and the data map for rapid model exposure.
Enhance table query and usage capabilities through documentation, Q&A bots, and algorithmic knowledge extraction.
Advance the development assistant for better table recommendation and control.
Upgrade the common‑layer evaluation system with lineage information.
Automate model, table, and service deprecation to reduce manual effort.
For detailed documentation of DataWorks intelligent data modeling, visit the official help page.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
