Mastering DataWorks & MaxCompute: A Complete Guide to Big Data Architecture and Governance
DataWorks, Alibaba Cloud's PaaS platform for data integration, development, quality, and services, pairs with the serverless MaxCompute data warehouse to form an integrated big-data stack. This guide covers the layer and naming conventions (ODS, CDM with its DIM, DWD, and DWS sublayers, and ADS) that keep such an architecture scalable, maintainable, and governable.
Introduction
DataWorks is a key PaaS product of Alibaba Cloud, providing end‑to‑end services such as data integration, data development, data catalog, data quality, and data services through a unified development‑management interface, enabling enterprises to focus on extracting value from data.
MaxCompute is an enterprise‑grade SaaS‑style cloud data warehouse designed for analytical scenarios. Built on a serverless architecture, it delivers fast, fully managed online data warehousing, eliminating traditional platform constraints on scalability and elasticity, minimizing operational effort, and allowing cost‑effective processing of massive datasets.
Data Architecture Selection
To support rapid business growth, we explored a one‑stop data development platform based on the DataWorks + MaxCompute framework. The following diagram shows our current big‑data platform architecture.
MaxCompute Warehouse Standards
Data Model Standards
Layered Data Division
ODS – Data ingestion layer (offline & real‑time), stores raw data; unstructured data is structured here.
CDM – Common data layer, made up of the DIM, DWD, and DWS sublayers:
DIM – Public dimension layer for enterprise‑wide dimensions.
DWD – Detail‑level fact layer modeled on business processes.
DWS – Public aggregated fact layer modeled on analysis subjects.
ADS – Data application layer for customized statistical metrics.
Data flows and table names are organized by business category, business process, and data domain.
Design Principles
Task flow, node, and table names should be clear and understandable.
Data models should be highly cohesive and loosely coupled.
Common foundational logic should be abstracted.
Layer Development Standards
ODS Layer Tables
Table naming: ods_{source_table}_{delta/flag}
Field naming: the original source name, or {keyword}_col when the source name collides with a keyword.
Task naming: same as the output table name.
Each source table is synchronized only once; suffix indicates sync mode (full/incremental).
Lifecycle management based on data retention policies.
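The ODS naming rules above can be sketched as a pair of small helpers. This is only an illustration: the suffix spellings "delta" and "full", the cleaning rule, and the function names are assumptions, since the article shows only the {delta/flag} placeholder.

```python
import re

# Assumption: incremental tables end in "delta" and full tables in "full".
# The article only gives the {delta/flag} placeholder, not the exact values.
SYNC_SUFFIXES = {"incremental": "delta", "full": "full"}


def ods_table_name(source_table: str, sync_mode: str) -> str:
    """Build ods_{source_table}_{suffix} for one source table."""
    if sync_mode not in SYNC_SUFFIXES:
        raise ValueError(f"unknown sync mode: {sync_mode!r}")
    # Lowercase the name and collapse anything non-alphanumeric into "_".
    cleaned = re.sub(r"[^0-9a-z]+", "_", source_table.lower()).strip("_")
    if not cleaned:
        raise ValueError("source table name is empty after cleaning")
    return f"ods_{cleaned}_{SYNC_SUFFIXES[sync_mode]}"


def ods_field_name(source_field: str, reserved: set) -> str:
    """Keep the source name, or append _col when it clashes with a keyword."""
    if source_field.lower() in reserved:
        return f"{source_field}_col"
    return source_field
```

Because the task name matches the output table name, the same string can serve as both the table name and the sync task name.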
DWD Layer (Detail Fact)
Table naming: dwd_{project}_{domain}_{custom_name}_{refresh_flag}
Task naming: same as the output table name.
Storage partitioned by day; lifecycle set according to access span.
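As a rough sketch of day-level partitioning plus a lifecycle, the helper below renders a MaxCompute CREATE TABLE statement for a DWD table. The partition column name ds, the example table and column names, and the 90-day lifecycle are assumptions chosen for illustration, not values from the article.

```python
def dwd_ddl(table: str, columns: dict, lifecycle_days: int) -> str:
    """Render a day-partitioned CREATE TABLE statement with a lifecycle."""
    if not table.startswith("dwd_"):
        raise ValueError("DWD table names must start with 'dwd_'")
    if lifecycle_days <= 0:
        raise ValueError("lifecycle must be a positive number of days")
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        # Assumption: one STRING partition column named ds holds the day.
        "PARTITIONED BY (ds STRING)\n"
        f"LIFECYCLE {lifecycle_days};"
    )


# Hypothetical table and columns, for illustration only.
print(dwd_ddl("dwd_shop_trade_order_di",
              {"order_id": "BIGINT", "amount": "DOUBLE"}, 90))
```

The lifecycle value would in practice be derived from how far back downstream consumers actually read.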
DWS Layer (Aggregated Fact)
Table naming: dws_{project}_{domain}_{custom_name}_{refresh_flag}{period}
Task naming: same as the output table name.
Day‑level partitioning with appropriate lifecycle.
ADS Layer (Data Application)
Table naming: ads_{project}_{custom_name}{suffix}
Suffix: bi for reports and analysis, app for data products.
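The DWD, DWS, and ADS patterns above lend themselves to an automated check, for example in a pre-release lint step. The sketch below is one loose interpretation: it assumes lowercase segments joined by underscores and an underscore before the ADS suffix, and it only counts segments rather than trying to tell project, domain, and custom name apart. The example names in the comments are hypothetical.

```python
import re

# Assumption: each name segment is lowercase letters/digits, starting with a letter.
SEGMENT = r"[a-z][a-z0-9]*"

PATTERNS = {
    # dwd_{project}_{domain}_{custom_name}_{refresh_flag}: at least 4 segments
    "dwd": re.compile(rf"dwd(_{SEGMENT}){{4,}}"),
    # dws_..._{refresh_flag}{period}: flag and period share the last segment
    "dws": re.compile(rf"dws(_{SEGMENT}){{4,}}"),
    # ads_{project}_{custom_name}_{bi|app}: underscore before suffix is assumed
    "ads": re.compile(rf"ads(_{SEGMENT}){{2,}}_(bi|app)"),
}


def is_valid_table_name(name: str) -> bool:
    """Check a DWD, DWS, or ADS table name; other layers are not covered."""
    layer = name.split("_", 1)[0]
    pattern = PATTERNS.get(layer)
    return bool(pattern and pattern.fullmatch(name))


# Hypothetical names: "dwd_shop_trade_order_di" passes,
# while "ads_shop_sales" fails because it lacks the bi/app suffix.
```

A stricter checker could validate each segment against a registry of known projects and domains, which this sketch deliberately leaves out.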
Common Development Standards
Layered call rules:
The application layer (ADS) cannot query ODS directly; it must go through CDM.
DWS should draw on DWD data first.
Each processing task produces only one output table.
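The layered call rules can be checked mechanically once each task's input and output tables are known. The following is a minimal sketch under two assumptions: the layer is read from the table-name prefix, and a DWS task reading ODS directly is reported as a problem, which is a strict reading of "DWS should prioritize DWD data". The function and table names are hypothetical.

```python
def layer_of(table: str) -> str:
    """Return the layer prefix of a table name, e.g. 'dwd' for dwd_x_y."""
    return table.split("_", 1)[0]


def check_task(inputs: list, outputs: list) -> list:
    """Return a list of rule violations for one processing task."""
    problems = []
    if len(outputs) != 1:
        problems.append(
            f"task must produce exactly one output table, got {len(outputs)}"
        )
    for out in outputs:
        for src in inputs:
            if layer_of(src) != "ods":
                continue
            if layer_of(out) == "ads":
                problems.append(
                    f"{out} reads {src}: ADS must go through CDM, not ODS"
                )
            elif layer_of(out) == "dws":
                # Assumption: treat the "prioritize DWD" guideline as a violation.
                problems.append(f"{out} reads {src}: DWS should prefer DWD data")
    return problems
```

An empty list means the task passes; in a real pipeline the inputs and outputs would come from lineage metadata rather than being typed by hand.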
Null‑value handling: Metrics default to 0; dimensions use predefined defaults.
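The null-value rule can be illustrated with a small row-level helper. The column names and the "unknown" default here are hypothetical; the article says only that metrics default to 0 and dimensions use predefined defaults.

```python
def fill_nulls(row: dict, metrics: set, dimension_defaults: dict) -> dict:
    """Replace None values: metrics become 0, dimensions get their default."""
    cleaned = dict(row)  # copy so the caller's row is left untouched
    for col, value in row.items():
        if value is not None:
            continue
        if col in metrics:
            cleaned[col] = 0
        elif col in dimension_defaults:
            cleaned[col] = dimension_defaults[col]
        # Columns that are neither metric nor known dimension stay None.
    return cleaned


# Hypothetical example row and defaults.
row = {"city": None, "gmv": None, "order_cnt": 3}
print(fill_nulls(row, {"gmv", "order_cnt"}, {"city": "unknown"}))
```

Keeping the dimension defaults in one shared mapping helps every layer fill the same placeholder, so joins on defaulted dimensions stay consistent.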
Data Governance Based on DataWorks
Data Integration
Data Integration supports offline (batch) synchronization across heterogeneous sources such as databases and APIs, bringing them together in one place and eliminating data silos.
Two development modes:
Wizard mode – visual interface for most integration scenarios.
Script mode – write JSON scripts for fine‑grained synchronization configuration.
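To give a feel for script mode, the sketch below assembles a reader/writer job as a Python dictionary and serializes it to JSON. Treat every key and value as an assumption: the layout only approximates the general reader, writer, setting, and order shape of a script-mode job, the data source names are hypothetical, and the exact fields should be confirmed against the official Data Integration documentation before use.

```python
import json


def build_sync_job(source_table: str, target_table: str, columns: list) -> str:
    """Return an illustrative script-mode job definition as a JSON string."""
    job = {
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": "mysql",          # assumed reader plugin type
                "category": "reader",
                "name": "Reader",
                "parameter": {
                    "datasource": "my_mysql_source",  # hypothetical name
                    "table": [source_table],
                    "column": columns,
                },
            },
            {
                "stepType": "odps",           # assumed MaxCompute writer type
                "category": "writer",
                "name": "Writer",
                "parameter": {
                    "datasource": "my_odps_target",   # hypothetical name
                    "table": target_table,
                    "column": columns,
                    "partition": "ds=${bizdate}",     # assumed day partition
                    "truncate": True,
                },
            },
        ],
        "setting": {"errorLimit": {"record": "0"}, "speed": {"concurrent": 2}},
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }
    return json.dumps(job, indent=2)


print(build_sync_job("orders", "ods_orders_delta", ["id", "amount"]))
```

Generating the script from code rather than editing it by hand makes it easier to keep the output table name aligned with the ODS naming convention.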
Data Development
Development is organized around business processes. Users create one or more business processes, each containing engine groups, and within each group, nodes, tables, resources, and functions are grouped by component type. Only components used in the current process are displayed.
On DataWorks, create a business process first, then perform development tasks.
All code changes to production tasks must be made in the Data Development UI and go through the release workflow.
Data Operations
After publishing tasks to the production environment, operations can be performed in the Operations Center, including automatic and manual scheduling, task monitoring, resource usage tracking, real‑time task control, alarm configuration, and dedicated dashboards for integration and real‑time synchronization tasks.
Conclusion
Information is a valuable asset used for both operational record‑keeping and analytical decision‑making. Operational systems store data, while DW/BI systems consume it. This article briefly introduced the use of the DataWorks + MaxCompute framework; interested readers can explore the official documentation for deeper details.
BaiPing Technology
Official account of the BaiPing app technology team. Dedicated to enhancing human productivity through technology. | DRINK FOR FUN!
