Mastering Alibaba DataWorks: Data Warehouse Architecture & Modeling Guide
This comprehensive tutorial walks you through Alibaba DataWorks' data warehouse architecture, covering technical stack selection, three‑layer warehouse design (ODS, CDM, ADS), detailed data modeling with DDL examples, storage strategies, dimension and fact table conventions, and best‑practice hierarchical call standards.
01 Technical Architecture Selection
The tutorial uses Alibaba Cloud MaxCompute as an example, but the process and methodology are generic. Before data model design, you must choose the technical architecture. This guide employs MaxCompute together with DataWorks to complete the
data modeling and development workflow. The complete technical architecture diagram is shown below:
DataWorks' data integration handles data collection and basic ETL (which can be built on open‑source components). MaxCompute serves as the offline compute engine. DataWorks provides functions such as data development, data quality, data security, and data management.
02 Data Warehouse Layering
Alibaba recommends dividing the data warehouse into three layers from bottom to top:
ODS (Operation Data Store)– raw data storage.
CDM (Common Data Model)– common data layer.
ADS (Application Data Service)– data‑product layer.
The ODS layer stores unprocessed raw data, matching the source system's structure, and acts as the data preparation area of the warehouse. It primarily completes basic data ingestion to MaxCompute and records historical changes.
The CDM layer includes
DIM dimension tables, DWD and DWS, derived from ODS data. It handles data processing, integration, consistent dimensions, and builds reusable detailed fact tables and aggregated public‑grain metrics.
The ADS layer stores personalized statistical indicator data for data products, generated from CDM and ODS layers.
03 Data Model
1. Data Ingestion Layer (ODS)
ODS stores the most original data obtained from business systems, serving as the source for upper layers. Example ODS tables include order details, user details, product details, etc., stored as two‑dimensional tables. Although ODS is not strictly a modeling layer, proper planning and data synchronization are crucial.
Example ODS tables:
s_auction – auction product information.
s_sale – normal sale product information.
s_users_extra – detailed user information.
s_biz_order_delta – incremental order data.
s_logistics_order_delta – incremental logistics data.
s_pay_order_delta – incremental payment data.
Table naming convention uses the
_deltasuffix for incremental tables. If a column name conflicts with a keyword, append
_col1to resolve.
Table example (s_auction)
<code>CREATE TABLE IF NOT EXISTS s_auction (
id STRING COMMENT '商品ID',
title STRING COMMENT '商品名',
gmt_modified STRING COMMENT '商品最后修改日期',
price DOUBLE COMMENT '商品成交价格,单位元',
starts STRING COMMENT '商品上架时间',
minimum_bid DOUBLE COMMENT '拍卖商品起拍价,单位元',
duration STRING COMMENT '有效期,销售周期,单位天',
incrementnum DOUBLE COMMENT '拍卖价格的增价幅度',
city STRING COMMENT '商品所在城市',
prov STRING COMMENT '商品所在省份',
ends STRING COMMENT '销售结束时间',
quantity BIGINT COMMENT '数量',
stuff_status BIGINT COMMENT '商品新旧程度 0 全新 1 闲置 2 二手',
auction_status BIGINT COMMENT '商品状态 0 正常 1 用户删除 2 下架 3 从未上架',
cate_id BIGINT COMMENT '商品类目ID',
cate_name STRING COMMENT '商品类目名称',
commodity_id BIGINT COMMENT '品类ID',
commodity_name STRING COMMENT '品类名称',
umid STRING COMMENT '买家umid'
) COMMENT '商品拍卖ODS'
PARTITIONED BY (ds STRING COMMENT '格式:YYYYMMDD')
LIFECYCLE 400;</code>To support historical analysis, ODS tables can add a time‑dimension partition field. Storage can be incremental, full, or chain (bitemporal) based on data characteristics.
2. Common Dimension Layer (DIM)
DIM builds consistent enterprise‑wide dimensions based on dimensional modeling principles. Dimension tables (DIM tables) store logical dimensions such as product, region, and time. Naming convention:
dim_{business}_pub_{dimension}[_{custom}]. Example:
dim_pub_area – public area dimension.
dim_asale_itm – A‑company e‑commerce product table.
Dimension table example
<code>CREATE TABLE IF NOT EXISTS dim_asale_itm (
item_id BIGINT COMMENT '商品ID',
item_title STRING COMMENT '商品名称',
item_price DOUBLE COMMENT '商品成交价格_元',
item_stuff_status BIGINT COMMENT '商品新旧程度_0全新1闲置2二手',
cate_id BIGINT COMMENT '商品类目ID',
cate_name STRING COMMENT '商品类目名称',
commodity_id BIGINT COMMENT '品类ID',
commodity_name STRING COMMENT '品类名称',
umid STRING COMMENT '买家ID',
item_status BIGINT COMMENT '商品状态_0正常1用户删除2下架3未上架',
city STRING COMMENT '商品所在城市',
prov STRING COMMENT '商品所在省份'
) COMMENT '商品全量表'
PARTITIONED BY (ds STRING COMMENT '日期,yyyymmdd');</code>3. Detailed Fact Layer (DWD)
DWD is business‑process‑driven, storing the most granular facts. Facts can be transactional, periodic snapshot, or cumulative snapshot. Design principles include single‑dimension association per fact table, inclusion of all relevant facts, consistent grain declaration, and handling of nulls.
Example DWD table (transactional product facts):
<code>CREATE TABLE IF NOT EXISTS dwd_asale_trd_itm_di (
item_id BIGINT COMMENT '商品ID',
item_title STRING COMMENT '商品名称',
item_price DOUBLE COMMENT '商品价格',
item_stuff_status BIGINT COMMENT '商品新旧程度_0全新1闲置2二手',
item_prov STRING COMMENT '商品省份',
item_city STRING COMMENT '商品城市',
cate_id BIGINT COMMENT '商品类目ID',
cate_name STRING COMMENT '商品类目名称',
commodity_id BIGINT COMMENT '品类ID',
commodity_name STRING COMMENT '品类名称',
buyer_id BIGINT COMMENT '买家ID'
) COMMENT '交易商品信息事实表'
PARTITIONED BY (ds STRING COMMENT '日期')
LIFECYCLE 400;</code>4. Common Summary Fact Layer (DWS)
DWS aggregates DWD data based on analysis subjects, producing public‑grain summary tables. Each summary table typically corresponds to a derived metric. Naming convention:
dws_{business}_pub_{domain}_{grain}[_{tag}]_{period}, where period suffixes indicate the statistical window (e.g.,
_1dfor the latest day).
Example DWS tables for recent‑day metrics:
<code>CREATE TABLE IF NOT EXISTS dws_asale_trd_byr_ord_1d (
buyer_id BIGINT COMMENT '买家ID',
buyer_nick STRING COMMENT '买家昵称',
mord_prov STRING COMMENT '收货人省份',
cate_id BIGINT COMMENT '商品类目ID',
cate_name STRING COMMENT '商品类目名称',
confirm_paid_amt_sum_1d DOUBLE COMMENT '最近一天订单已经确认收货的金额总和'
) COMMENT '买家粒度所有交易最近一天汇总事实表'
PARTITIONED BY (ds STRING COMMENT '分区字段YYYYMMDD')
LIFECYCLE 36000;</code>04 Hierarchical Call Standards
After layering the warehouse, call relationships between layers must be defined. ADS should prioritize calling CDM data; direct ODS calls are prohibited when CDM provides the needed view. CDM views must be encapsulated with scheduling to ensure maintainability. Additional guidelines:
ODS data cannot be directly referenced by application tasks; access it via CDM views.
CDM task depth should not exceed ten layers.
Each compute task should output only one table (exceptions allowed).
If multiple tasks write to the same table (different partitions), create a virtual task in DataWorks to aggregate dependencies.
CDM summary layers should call CDM detail layers for additive metrics, avoiding heavy calculations on massive detail tables.
CDM detail layer snapshot facts should call CDM transactional facts to maintain data consistency.
Build CDM public summary layers selectively to prevent excessive reliance on CDM detail data by application layers.
(Content sourced from Alibaba Cloud)
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.