
Mastering DWD Layer Design: Principles, Fact Tables, and Performance Tips

This article provides a comprehensive guide to designing the Data Warehouse Detail (DWD) layer, covering Kimball‑based design principles, step‑by‑step modeling, table and field naming conventions, concrete Hive DDL/DML examples, and optimization techniques such as partitioning, bucketing, and compression.

Ma Wei Says

01 Design Principles and Steps

The DWD layer sits between the Operational Data Store (ODS) and the Data Warehouse Summary (DWS) and stores cleaned, transformed, and standardized detail data. Its design follows Kimball’s dimensional modeling theory, emphasizing fact tables.

Key design principles include:

Divide data by business subject (e.g., orders, users).

Maintain detail granularity (one row per transaction or event).

Apply consistent data cleaning and standardization.

Enforce primary‑key and foreign‑key constraints for integrity.

Optimize performance with partitioning, bucketing, or indexing.

Ensure scalability and extensibility.

Use wide tables with selective denormalization to reduce joins.

Protect sensitive data through masking and access control.

The modeling process consists of four core steps:

Select the business process and define the corresponding fact table.

Declare the grain of the fact table.

Identify the relevant dimensions and, where appropriate, degenerate them into the fact table.

Define measurable facts and their calculation logic.
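As an illustrative sketch, here is how the four steps might map to an online‑payment process (the table and field names match the payment example given later in this article):

```sql
-- Step 1: select the business process -> online payment, modeled as dwd_fact_payment_info
-- Step 2: declare the grain           -> one row per payment transaction
-- Step 3: identify dimensions         -> user, province, time; payment_type is degenerated
-- Step 4: define the facts            -> payment_amount, a summable measure
SELECT user_id, province_id, SUM(payment_amount) AS total_paid
FROM dwd_fact_payment_info
WHERE dt = '2021-05-03'
GROUP BY user_id, province_id;
```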

Advantages of dimension degeneration: fewer joins and faster queries. Disadvantages: increased storage and potential update overhead for frequently changing attributes.
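For example, because payment_type is degenerated into the payment fact table shown later, a breakdown by payment channel needs no join; with a separate dimension table (the dim_payment_type table below is hypothetical), the same query would require one:

```sql
-- With the degenerate dimension: no join needed
SELECT payment_type, SUM(payment_amount) AS amt
FROM dwd_fact_payment_info
WHERE dt = '2021-05-03'
GROUP BY payment_type;

-- Without it (hypothetical dim_payment_type dimension table): one extra join
SELECT d.payment_type_name, SUM(f.payment_amount) AS amt
FROM dwd_fact_payment_info f
JOIN dim_payment_type d ON f.payment_type_id = d.id
WHERE f.dt = '2021-05-03'
GROUP BY d.payment_type_name;
```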

02 Table and Field Design

Fact tables typically contain the following field categories:

Business primary key (e.g., order_id, store_id) to guarantee row uniqueness.

Foreign keys linking to dimension tables, supporting downstream aggregation.

Measure fields storing business metrics such as amount or quantity.

Time fields for temporal analysis.

Partition fields (e.g., dt) to improve query performance.

Degenerate dimension fields for low‑cardinality attributes embedded directly in the fact table.

Field‑naming guidelines:

Use clear, business‑oriented names (e.g., payment_amount, trade_time).

Follow snake_case and prefix conventions (e.g., dwd_fact_, dwd_dim_).

Mark primary keys with an id suffix, foreign keys with _id, and measures with descriptive names.

Include comprehensive comments describing meaning, units, and format.

payment_amount DECIMAL(16,2) COMMENT 'Payment amount, RMB, two‑decimal precision'
payment_time TIMESTAMP COMMENT 'Payment timestamp, format: YYYY-MM-DD HH:MM:SS'
payment_type STRING COMMENT 'Payment type (degenerate dimension, values: Alipay/WeChat)'

03 Fact Table Design

3.1 Transactional Fact Table

Transactional fact tables capture immutable business events. Example: a payment fact table records each payment transaction with dimensions such as time, user, and location.

CREATE EXTERNAL TABLE dwd_fact_payment_info ( 
    id STRING COMMENT 'Payment ID',
    out_trade_no STRING COMMENT 'External business number',
    order_id STRING COMMENT 'Order ID',
    user_id STRING COMMENT 'User ID',
    alipay_trade_no STRING COMMENT 'Alipay transaction number',
    payment_amount DECIMAL(16,2) COMMENT 'Payment amount, RMB, two‑decimal precision',
    subject STRING COMMENT 'Transaction description',
    payment_type STRING COMMENT 'Payment type (degenerate dimension, values: Alipay/WeChat)',
    payment_time TIMESTAMP COMMENT 'Payment time, format: YYYY‑MM‑DD HH:MM:SS',
    province_id STRING COMMENT 'Province ID'
) COMMENT 'Payment fact table'
PARTITIONED BY (dt STRING COMMENT 'Partition field, format: YYYY‑MM‑DD')
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_fact_payment_info/'
TBLPROPERTIES ("parquet.compression"='snappy');

Data loading with cleansing logic:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
INSERT OVERWRITE TABLE dwd_fact_payment_info PARTITION(dt='2021-05-03')
SELECT 
    pi.id,
    pi.out_trade_no,
    pi.order_id,
    pi.user_id,
    pi.alipay_trade_no,
    COALESCE(pi.total_amount,0) AS payment_amount,
    pi.subject,
    pi.payment_type,
    pi.payment_time,
    oi.province_id
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY payment_time DESC) AS rn
    FROM ods_payment_info
    WHERE dt='2021-05-03' AND total_amount>0
) pi
JOIN ods_order_info oi ON pi.order_id = oi.id AND oi.dt = '2021-05-03'  -- partition filter avoids a full scan
WHERE pi.rn = 1;

3.2 Periodic Snapshot Fact Table

Snapshot tables store a full view of business state for each period (e.g., daily cart snapshots). They are refreshed in full each cycle and may be purged periodically.

CREATE EXTERNAL TABLE dwd_fact_cart_info ( 
    id STRING COMMENT 'ID',
    user_id STRING COMMENT 'User ID',
    sku_id STRING COMMENT 'Product ID',
    cart_price DECIMAL(16,2) COMMENT 'Price at add‑to‑cart, RMB',
    sku_num INT COMMENT 'Quantity',
    sku_name STRING COMMENT 'Product name (redundant)',
    create_time TIMESTAMP COMMENT 'Creation time',
    operate_time TIMESTAMP COMMENT 'Modification time',
    is_ordered STRING COMMENT 'Ordered flag (degenerate dimension, 1/0)',
    order_time TIMESTAMP COMMENT 'Order time',
    source_type STRING COMMENT 'Source type',
    source_id STRING COMMENT 'Source ID'
) COMMENT 'Cart snapshot fact table'
PARTITIONED BY (dt STRING COMMENT 'Partition field, format: YYYY‑MM‑DD')
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_fact_cart_info/'
TBLPROPERTIES ("parquet.compression"='snappy');
INSERT OVERWRITE TABLE dwd_fact_cart_info PARTITION(dt='2021-05-03')
SELECT 
    id, user_id, sku_id,
    COALESCE(cart_price,0) AS cart_price,
    sku_num, sku_name,
    create_time, operate_time, is_ordered, order_time,
    source_type, source_id
FROM ods_cart_info
WHERE dt='2021-05-03';
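Because every cycle rewrites a complete snapshot, old partitions can be purged on a rolling basis. A sketch, assuming a 30‑day retention policy:

```sql
-- Drop the partition that falls outside the assumed 30-day retention window
ALTER TABLE dwd_fact_cart_info DROP IF EXISTS PARTITION (dt = '2021-04-03');
```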

3.3 Cumulative Snapshot Fact Table

Cumulative snapshots track the lifecycle of an entity (e.g., coupon usage). They record each state change and are updated incrementally.

CREATE EXTERNAL TABLE dwd_fact_coupon_use ( 
    id STRING COMMENT 'ID',
    coupon_id STRING COMMENT 'Coupon ID',
    user_id STRING COMMENT 'User ID',
    order_id STRING COMMENT 'Order ID',
    coupon_status STRING COMMENT 'Coupon status (degenerate dimension, 1‑unused/2‑used/3‑expired)',
    get_time TIMESTAMP COMMENT 'Acquisition time',
    using_time TIMESTAMP COMMENT 'Use time (order)',
    used_time TIMESTAMP COMMENT 'Payment time'
) COMMENT 'Coupon usage fact table'
PARTITIONED BY (dt STRING COMMENT 'Partition field, format: YYYY‑MM‑DD')
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_fact_coupon_use/'
TBLPROPERTIES ("parquet.compression"='snappy');
SET hive.exec.dynamic.partition.mode=nonstrict;  -- required: the INSERT below uses only dynamic partitioning
INSERT OVERWRITE TABLE dwd_fact_coupon_use PARTITION(dt)
SELECT 
    id, coupon_id, user_id, order_id, coupon_status,
    get_time, using_time, used_time,
    date_format(get_time,'yyyy-MM-dd') AS dt
FROM ods_coupon_use
WHERE dt='2021-05-03';
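The INSERT above only loads newly changed rows. In practice, a cumulative snapshot is usually maintained by merging the affected historical partitions with the day's changes; a sketch of that FULL OUTER JOIN pattern (column handling simplified, milestone-update rules are assumptions):

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dwd_fact_coupon_use PARTITION(dt)
SELECT
    COALESCE(new.id, old.id),
    COALESCE(new.coupon_id, old.coupon_id),
    COALESCE(new.user_id, old.user_id),
    COALESCE(new.order_id, old.order_id),
    COALESCE(new.coupon_status, old.coupon_status),          -- latest status wins
    COALESCE(old.get_time, new.get_time),                    -- earliest milestone is preserved
    COALESCE(new.using_time, old.using_time),
    COALESCE(new.used_time, old.used_time),
    date_format(COALESCE(old.get_time, new.get_time), 'yyyy-MM-dd')  -- partition by acquisition date
FROM (
    -- historical rows whose lifecycle may be touched by today's changes
    SELECT * FROM dwd_fact_coupon_use
    WHERE dt IN (SELECT date_format(get_time,'yyyy-MM-dd') FROM ods_coupon_use WHERE dt='2021-05-03')
) old
FULL OUTER JOIN (
    SELECT * FROM ods_coupon_use WHERE dt = '2021-05-03'
) new
ON old.id = new.id;
```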

04 Optimization Strategies

4.1 Partitioning and Bucketing

Partition data by time (dt) or business dimensions (region_id) to limit scans. Bucketing on high‑cardinality fields such as user_id enables map‑side joins and parallel processing. Example: bucket the cart fact table by user_id into 64 buckets.

CREATE EXTERNAL TABLE dwd_fact_cart_info ( ... )
PARTITIONED BY (dt STRING COMMENT 'Partition field, format: YYYY‑MM‑DD')
CLUSTERED BY (user_id) INTO 64 BUCKETS
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_fact_cart_info/'
TBLPROPERTIES ("parquet.compression"='snappy');
SET hive.enforce.bucketing=true;  -- needed on Hive 1.x; bucketing is always enforced from Hive 2.0 onward
INSERT OVERWRITE TABLE dwd_fact_cart_info PARTITION(dt='2021-05-03')
SELECT ... FROM ods_cart_info WHERE dt='2021-05-03';

4.2 Additional Strategies

Storage format & compression : Use columnar formats like Parquet or ORC with Snappy compression for efficient storage and query performance.
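As an alternative sketch to the Parquet/Snappy combination used above, an ORC variant of the payment table might look like this (only the storage clauses change; the _orc table name is illustrative):

```sql
CREATE EXTERNAL TABLE dwd_fact_payment_info_orc (
    id STRING COMMENT 'Payment ID',
    payment_amount DECIMAL(16,2) COMMENT 'Payment amount, RMB'
    -- remaining columns as in dwd_fact_payment_info
)
PARTITIONED BY (dt STRING COMMENT 'Partition field, format: YYYY-MM-DD')
STORED AS ORC
LOCATION '/warehouse/gmall/dwd/dwd_fact_payment_info_orc/'
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```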

Indexing: older Hive releases support compact and bitmap indexes on high‑frequency query columns, but index support was removed in Hive 3.0. On modern versions, prefer ORC bloom filters or materialized views to accelerate lookups.
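A sketch of the bloom‑filter alternative for ORC‑stored tables (the chosen columns and false‑positive rate are illustrative):

```sql
-- Bloom filters on lookup keys let ORC skip stripes during point queries
CREATE EXTERNAL TABLE dwd_fact_payment_info ( ... )
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress' = 'SNAPPY',
    'orc.bloom.filter.columns' = 'user_id,order_id',  -- high-frequency lookup keys
    'orc.bloom.filter.fpp' = '0.05'                   -- false-positive probability
);
```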

05 Conclusion

The DWD layer is a pivotal component of a modern data warehouse, requiring a balance between business needs, data quality, and performance. As data volume, real‑time requirements, AI integration, and cloud‑native environments evolve, DWD design becomes increasingly challenging and essential.
