Big Data 15 min read

Mastering DWS Layer Design: Principles, Steps, and Best Practices

This article explains the role of the DWS layer in data warehouses, outlines design principles, step‑by‑step modeling, naming conventions, field design, provides concrete DDL/ETL examples, common pitfalls, and how to build reusable, performant summary tables for analytics.

Ma Wei Says
Ma Wei Says
Ma Wei Says
Mastering DWS Layer Design: Principles, Steps, and Best Practices

What is the DWS layer?

The DWS (Data Warehouse Summary) layer sits above the detailed DWD layer and below ADS/DWT, serving as a light‑aggregation, theme‑oriented wide‑table layer that transforms detailed facts into summary tables for BI reports, analytics, and machine‑learning models.

Design Principles

Theme‑driven design – build tables around business subjects such as order analysis or user behavior.

Base aggregation on DWD – use the original star schema (fact + dimension tables) to define aggregation dimensions and metrics.

Light aggregation and wide‑table design – keep summaries flexible while reducing duplicate calculations.

Focus on a single data domain – avoid mixing unrelated domains in one table.

Prioritize reusability – design metrics that can serve multiple analyses.

Explicit statistical period – embed the period (e.g., _1d, _1w) in table/field names.

Balance performance and maintenance – consider query speed versus ETL complexity.

Design Steps

Requirement analysis – clarify upstream metrics, analysis scenarios, and granularity (day, hour, etc.).

Define analysis theme – abstract requirements into a clear business subject.

Determine aggregation dimensions and statistical period – choose dimensions (province, city, category) and appropriate granularity.

Identify data sources and calculation logic – map DWD tables (e.g., dwd_order_detail, dwd_product_info) to the summary.

Design table and field names following naming conventions.

Implement DDL and ETL scripts – create the DWS table and write INSERT‑OVERWRITE jobs.

Naming Conventions

Table names start with dws_, include the business theme, statistical period, and a suffix such as _summary or _wide (e.g., dws_order_summary_1d). All letters are lower‑case and words are separated by underscores. Field names end with _id, _name for dimensions, _count, _amount for metrics, and the partition field ds stores the date in yyyyMMdd format.

Field Design

Classify fields into:

Primary‑key fields – usually a combination of dimension columns.

Dimension fields – describe analysis axes (time, region, product).

Metric fields – store derived indicators such as total amount or order count.

Partition field – ds for date‑based partitioning.

Metadata fields – optional columns like update_time.

Example DDL and ETL

CREATE TABLE IF NOT EXISTS dws_order_summary_1d (
    province STRING COMMENT '省份',
    city STRING COMMENT '城市',
    category_id BIGINT COMMENT '商品类目ID',
    category_name STRING COMMENT '商品类目名称',
    total_amount DECIMAL(10,2) COMMENT '最近一天订单总金额(已支付)',
    order_count BIGINT COMMENT '最近一天订单数量(去重)',
    user_count BIGINT COMMENT '最近一天购买用户数(去重)',
    update_time TIMESTAMP COMMENT '数据更新时间'
) COMMENT '订单粒度最近一天汇总事实表'
PARTITIONED BY (ds STRING COMMENT '分区字段,格式:yyyyMMdd')
STORED AS ORC;
INSERT OVERWRITE TABLE dws_order_summary_1d PARTITION (ds='${bizdate}')
SELECT
    COALESCE(o.province,'unknown') AS province,
    COALESCE(o.city,'unknown') AS city,
    COALESCE(p.category_id,-1) AS category_id,
    COALESCE(p.category_name,'unknown') AS category_name,
    SUM(COALESCE(o.amount,0)) AS total_amount,
    COUNT(DISTINCT o.order_id) AS order_count,
    COUNT(DISTINCT o.user_id) AS user_count,
    CURRENT_TIMESTAMP AS update_time
FROM dwd_order_detail o
LEFT JOIN dwd_product_info p ON o.product_id = p.product_id
WHERE o.order_date='${bizdate}'
  AND o.order_status='paid'
  AND o.amount IS NOT NULL
GROUP BY
    COALESCE(o.province,'unknown'),
    COALESCE(o.city,'unknown'),
    COALESCE(p.category_id,-1),
    COALESCE(p.category_name,'unknown');

Typical Queries

SELECT ds, province, SUM(total_amount) AS gmv
FROM dws_order_summary_1d
WHERE ds='20231001'
GROUP BY ds, province;
SELECT category_name, SUM(total_amount) AS gmv, SUM(order_count) AS orders
FROM dws_order_summary_1d
WHERE ds='20231001'
GROUP BY category_name;

Common Pitfalls

Large data volume causing slow queries – add partitioning (daily/monthly), bucketing, and SQL optimizations.

Frequent schema changes – reserve extension fields and use dynamic partitions, but avoid over‑engineering.

Inconsistent metric definitions – maintain a metric dictionary, document calculation logic, and enforce consistency via metadata tools.

Conclusion

The DWS layer provides a balanced, theme‑oriented aggregation tier that delivers high‑performance support for downstream applications while preserving analytical flexibility. Adhering to clear design principles, naming standards, and governance practices ensures reusability, maintainability, and scalability of summary tables.

big dataSQLData WarehouseETLDesign Principlestable namingDWS Layer
Ma Wei Says
Written by

Ma Wei Says

Follow me! Discussing software architecture and development, AIGC and AI Agents... Sometimes sharing insights on IT professionals' life experiences.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.