Big Data 38 min read

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

This article explains the importance of layering in data warehouse modeling, outlines the four ETL steps, describes common pitfalls, presents a typical technical stack, and details each warehouse layer (ODS, DWD, DWS, ADS) along with best‑practice naming conventions and implementation tips for big‑data environments.

dbaplus Community

Oct 30, 2022

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

1. Significance of Data Warehouse Layering

Organizing data with a clear model enables high‑performance, low‑cost, efficient, and high‑quality big‑data usage. Layering provides a clear data structure, simplifies data lineage tracing, promotes data reuse, reduces duplicate development, breaks complex problems into manageable steps, and isolates business impact by shielding downstream users from source system changes.

2. ETL Operations (Extraction, Transformation, Loading)

ETL extracts data from heterogeneous sources into a temporary layer, cleans and transforms it, then loads it into the warehouse. Extraction handles initial loads and incremental refreshes. Data cleaning removes duplicates, nulls, and invalid records (e.g., via Hive or MapReduce). Transformation normalizes schemas, creates unified dictionaries, and derives missing fields. Loading writes the final data to storage systems such as HBase or MySQL for downstream consumption.

3. Common Pitfalls in Layering

Layers should solve ETL workflow, data flow, permission control, and business needs, not exist for their own sake. Over‑splitting (e.g., DWD, DWT, DWS, DIM, DM) without clear boundaries leads to confusion, especially in complex business scenarios.

4. Technical Architecture

The typical stack centers on Hadoop, Spark, Hive, Sqoop, Zookeeper, and HBase, wrapped by web‑based scheduling platforms. Tasks are organized by layer, allowing thousands of daily jobs to be managed efficiently.

5. Layered Architecture Details

ODS (Operational Data Store) : Stores raw source data with minimal transformation, preserving source schema. It acts as the preparation zone, often partitioned by day, and may include light cleaning (e.g., field normalization, timestamp alignment).

DWD (Data Warehouse Detail) : The detail layer cleans, normalizes, and de‑duplicates ODS data, producing high‑quality, granular fact tables. It may perform dimension degradation (e.g., embedding order_id directly in the fact table) and supports both batch and streaming sources.

DWS (Data Warehouse Service) : Aggregates DWD data into wide, topic‑oriented summary tables (e.g., user behavior, product sales). These tables are typically partitioned by day and stored as Parquet/Impala internal tables for fast OLAP queries.

ADS (Application Data Service) : Provides highly aggregated, application‑ready datasets for reporting, dashboards, and downstream services. Data may reside in MySQL, Oracle, Redis, or analytical stores like Druid.

6. Modeling Principles and Naming Conventions

Use star or snowflake schemas for dimensional modeling. Declare granularity early (e.g., one row per order item). Choose dimensions (time, region, user, product) based on analysis needs. Follow naming rules such as dwd_{business}_{domain}_{process}_{i|f} where i denotes incremental and f full load. Example: dwd_asale_trd_ordcrt_trip_di (daily incremental order fact).

CREATE TABLE IF NOT EXISTS dwd_asale_trd_itm_di (
  item_id BIGINT COMMENT '商品ID',
  item_title STRING COMMENT '商品名称',
  item_price DOUBLE COMMENT '商品价格',
  item_stuff_status BIGINT COMMENT '商品新旧程度_0全新1闲置2二手',
  item_prov STRING COMMENT '商品省份',
  item_city STRING COMMENT '商品城市',
  cate_id BIGINT COMMENT '商品类目ID',
  cate_name STRING COMMENT '商品类目名称',
  commodity_id BIGINT COMMENT '品类ID',
  commodity_name STRING COMMENT '品类名称',
  buyer_id BIGINT COMMENT '买家ID'
) COMMENT '交易商品信息事实表'
PARTITIONED BY (ds STRING COMMENT '日期')
LIFECYCLE 400;

7. Layer Call Standards

ODS can only be consumed by DWD.

DWD may be used by DWS and ADS.

DWS is only callable by ADS.

Preferred flow: ODS → DWD → DWS → ADS (or ODS → DWD → ADS).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

modeling Hive layered architecture ETL Spark

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.