Mastering DataWorks & MaxCompute: A Complete Guide to Big Data Architecture and Governance

DataWorks, Alibaba Cloud’s comprehensive PaaS platform, combines with the serverless MaxCompute data warehouse to offer an integrated solution for data integration, development, quality, and services. Detailed naming and layering conventions across the ODS, CDM (DIM, DWD, DWS), and ADS layers keep big‑data architectures scalable, maintainable, and well governed.

BaiPing Technology

Mar 14, 2022

Introduction

DataWorks is a key PaaS product of Alibaba Cloud, providing end‑to‑end services such as data integration, data development, data catalog, data quality, and data services through a unified development‑management interface, enabling enterprises to focus on extracting value from data.

MaxCompute is an enterprise‑grade SaaS‑style cloud data warehouse designed for analytical scenarios. Built on a serverless architecture, it delivers fast, fully managed online data warehousing, eliminating traditional platform constraints on scalability and elasticity, minimizing operational effort, and allowing cost‑effective processing of massive datasets.

Data Architecture Selection

To support rapid business growth, we explored a one‑stop data development platform based on the DataWorks + MaxCompute framework. The following diagram shows our current big‑data platform architecture.

MaxCompute Warehouse Standards

Data Model Standards

Layered Data Division

ODS – Data ingestion layer (offline & real‑time), stores raw data; unstructured data is structured here.

CDM – Common data layer, made up of the DIM, DWD, and DWS layers below.

DIM – Public dimension layer for enterprise‑wide dimensions.

DWD – Detail‑level fact layer modeled on business processes.

DWS – Public aggregated fact layer modeled on analysis subjects.

ADS – Data application layer for customized statistical metrics.

Data flow and naming follow business classification, process, and domain segmentation.

Design Principles

Task flow, node, and table names should be clear and understandable.

Data models should be highly cohesive and loosely coupled.

Common foundational logic should be abstracted.

Layer Development Standards

ODS Layer Tables

Table naming: ods_{source_table}_{delta/flag}

Field naming: original source name or {keyword}_col

Task naming: matches the output table name.

Each source table is synchronized only once; suffix indicates sync mode (full/incremental).

Lifecycle management based on data retention policies.
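The ODS naming rule above can be sketched as a small helper. This is a minimal illustration, not part of DataWorks or MaxCompute: the function names are hypothetical, and because the article only shows the suffix as {delta/flag}, the concrete values "delta" (incremental) and "full" are assumptions.

```python
# Hypothetical suffixes: the article only shows "{delta/flag}", so "delta" and
# "full" are assumptions chosen for illustration.
SYNC_SUFFIXES = {"incremental": "delta", "full": "full"}


def ods_table_name(source_table: str, sync_mode: str) -> str:
    """Build an ODS table name of the form ods_{source_table}_{suffix}."""
    if sync_mode not in SYNC_SUFFIXES:
        raise ValueError(f"unknown sync mode: {sync_mode!r}")
    # Normalize the source name so every ODS table is lowercase snake_case.
    normalized = source_table.strip().lower().replace("-", "_")
    return f"ods_{normalized}_{SYNC_SUFFIXES[sync_mode]}"


def ods_task_name(source_table: str, sync_mode: str) -> str:
    """Task names match the output table name, per the convention above."""
    return ods_table_name(source_table, sync_mode)
```

Deriving the task name from the table name in one place keeps the two from drifting apart when a source table is renamed.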

DWD Layer (Detail Fact)

Table naming: dwd_{project}_{domain}_{custom_name}_{refresh_flag}

Task naming: matches the output table name.

Storage partitioned by day; lifecycle set according to access span.
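Day-level partitioning and a lifecycle can both be declared when the table is created. The sketch below renders a MaxCompute CREATE TABLE statement from Python; the helper name, the partition column name ds, and the example table and column names are assumptions made for illustration, not conventions stated in this article.

```python
from typing import Mapping


def dwd_ddl(table: str, columns: Mapping[str, str], lifecycle_days: int) -> str:
    """Render a CREATE TABLE statement for a day-partitioned DWD table."""
    if not table.startswith("dwd_"):
        raise ValueError("DWD table names must start with 'dwd_'")
    if lifecycle_days <= 0:
        raise ValueError("lifecycle_days must be positive")
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        # Assumption: one partition per day, keyed by a string column named ds.
        "PARTITIONED BY (ds STRING)\n"
        # LIFECYCLE is expressed in days; pick it from the expected access span.
        f"LIFECYCLE {lifecycle_days};"
    )


if __name__ == "__main__":
    print(dwd_ddl(
        "dwd_shop_trade_order_di",  # hypothetical table name
        {"order_id": "STRING", "pay_amount": "DECIMAL(18,2)"},
        lifecycle_days=365,
    ))
```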

DWS Layer (Aggregated Fact)

Table naming: dws_{project}_{domain}_{custom_name}_{refresh_flag}{period}

Task naming: matches the output table name.

Day‑level partitioning with appropriate lifecycle.

ADS Layer (Data Application)

Table naming: ads_{project}_{custom_name}{suffix}

Suffix: bi for reports and analysis, app for data products.
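A naming rule like this is easy to check automatically, for example in a review script. The following sketch assumes the suffix is attached with an underscore ("_bi" or "_app") and that names are lowercase; both are assumptions, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: ads_{project}_{custom_name}_{bi|app}, all lowercase.
ADS_PATTERN = re.compile(
    r"^ads_(?P<project>[a-z0-9]+)_(?P<custom_name>[a-z0-9_]+)_(?P<suffix>bi|app)$"
)


def classify_ads_table(name: str) -> Optional[str]:
    """Return the intended consumer of an ADS table, or None if the name
    does not follow the convention."""
    match = ADS_PATTERN.match(name)
    if match is None:
        return None
    return "report" if match.group("suffix") == "bi" else "data product"
```

For instance, a table named ads_shop_daily_sales_bi would be classified as a report table, while a name without a recognized suffix returns None and can be flagged.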

Common Development Standards

Layered call rules:

The application layer cannot directly query ODS; it must go through CDM.

DWS should prioritize DWD data.

Each processing task produces only one output table.

Null‑value handling: Metrics default to 0; dimensions use predefined defaults.
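The hard rules above (no ADS reads from ODS, one output table per task, default values for nulls) can be sketched as plain Python checks. The helper names are hypothetical, the "dim" prefix is an assumption since the article gives no DIM naming rule, and the preference that DWS read from DWD is left out because it is a guideline rather than a strict rule.

```python
from typing import Any, Dict, List, Mapping, Sequence

# Assumption: the layer is the first underscore-separated token of the name.
KNOWN_LAYERS = {"ods", "dim", "dwd", "dws", "ads"}


def layer_of(table: str) -> str:
    """Return the layer prefix of a table name, e.g. 'ods' or 'dws'."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in KNOWN_LAYERS:
        raise ValueError(f"cannot infer layer from table name: {table!r}")
    return prefix


def check_call_rules(outputs: Sequence[str], inputs: Sequence[str]) -> List[str]:
    """Return a list of rule violations for one processing task."""
    violations = []
    if len(outputs) != 1:
        violations.append("a task must produce exactly one output table")
    for out in outputs:
        if layer_of(out) != "ads":
            continue
        for src in inputs:
            if layer_of(src) == "ods":
                violations.append(f"{out} reads {src}: ADS must go through CDM")
    return violations


def fill_nulls(
    row: Mapping[str, Any],
    metric_cols: Sequence[str],
    dimension_defaults: Mapping[str, Any],
) -> Dict[str, Any]:
    """Replace null metrics with 0 and null dimensions with their defaults."""
    cleaned = dict(row)
    for col in metric_cols:
        if cleaned.get(col) is None:
            cleaned[col] = 0
    for col, default in dimension_defaults.items():
        if cleaned.get(col) is None:
            cleaned[col] = default
    return cleaned
```

In practice the null handling would usually live in the SQL of each task; the Python version only makes the rule explicit.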

Data Governance Based on DataWorks

Data Integration

Supports offline (batch) synchronization across heterogeneous sources such as databases and APIs, bringing the data together in one place and eliminating data silos.

Two development modes:

Wizard mode – visual interface for most integration scenarios.

Script mode – write JSON scripts for fine‑grained synchronization configuration.
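To give a feel for script mode, the sketch below builds a reader/writer job description in Python and serializes it to JSON. The field names (steps, stepType, category, order.hops, and so on), the datasource names, and the ds=${bizdate} partition expression are assumptions modeled on a typical reader/writer job layout; verify them against the official Data Integration documentation before use.

```python
import json
from typing import Any, Dict, Sequence


def build_sync_job(
    source_datasource: str,
    source_table: str,
    target_datasource: str,
    target_table: str,
    columns: Sequence[str],
    partition: str = "ds=${bizdate}",  # assumed scheduling-parameter syntax
) -> Dict[str, Any]:
    """Assemble a batch sync job: one reader step, one writer step."""
    return {
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": "mysql",  # assumed source type
                "name": "Reader",
                "category": "reader",
                "parameter": {
                    "datasource": source_datasource,
                    "table": [source_table],
                    "column": list(columns),
                },
            },
            {
                "stepType": "odps",  # MaxCompute target
                "name": "Writer",
                "category": "writer",
                "parameter": {
                    "datasource": target_datasource,
                    "table": target_table,
                    "partition": partition,
                    "truncate": True,  # overwrite the partition on rerun
                    "column": list(columns),
                },
            },
        ],
        "setting": {
            "errorLimit": {"record": "0"},
            "speed": {"concurrent": 2, "throttle": False},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


if __name__ == "__main__":
    job = build_sync_job(
        "mysql_orders", "order_info",       # hypothetical names
        "odps_first", "ods_order_info_delta",
        ["order_id", "pay_amount"],
    )
    print(json.dumps(job, indent=2))
```

Generating the JSON from code rather than editing it by hand makes it easier to keep many sync tasks consistent with the ODS naming rules.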

Data Development

Development is organized around business processes. Users create one or more business processes, each containing engine groups, and within each group, nodes, tables, resources, and functions are grouped by component type. Only components used in the current process are displayed.

On DataWorks, create a business process first, then perform development tasks.

All code changes to production tasks must be made in the Data Development UI and go through the release workflow.

Data Operations

After publishing tasks to the production environment, operations can be performed in the Operations Center, including automatic and manual scheduling, task monitoring, resource usage tracking, real‑time task control, alarm configuration, and dedicated dashboards for integration and real‑time synchronization tasks.

Conclusion

Information is a valuable asset used for both operational record‑keeping and analytical decision‑making. Operational systems store data, while DW/BI systems consume it. This article briefly introduced the use of the DataWorks + MaxCompute framework; interested readers can explore the official documentation for deeper details.
