Data Warehouse Overview, Architecture, and Modeling Methodology

This article provides a comprehensive introduction to data warehouses, covering their definition, architectural layers, characteristics, modeling approaches such as Inmon and Kimball, fact and dimension table design, star and snowflake schemas, and best‑practice principles for building scalable, maintainable warehouse solutions in the big‑data ecosystem.

Big Data Technology & Architecture

Sep 13, 2019

Data Warehouse Overview

A data warehouse (DW or DWH) is a strategic data store designed to support enterprise‑wide decision‑making. It provides integrated, subject‑oriented, non‑volatile, and time‑variant data for analytical reporting and business intelligence.

From Traditional to Internet‑Scale Warehouses

The evolution from classic warehouses to modern internet‑scale warehouses is exemplified by Alibaba's data architecture, where the core modeling work resides in the data computation layer, transforming raw operational data into valuable analytical datasets.

Why Direct Access to Operational Systems Fails

Security or policy restrictions prevent direct access to certain business data.

Frequent version changes require constant re‑engineering of analytical queries.

Aggregating data from multiple system versions is difficult.

Hard‑coded column names and inconsistent data formats hinder analysis.

Transactional schemas are not optimized for analytical workloads.

Operational systems lack proper metadata storage and unified data definitions.

Competing resource demands cause analytical workloads to suffer when sharing hardware with OLTP systems.

Data Warehouse Characteristics

1. Subject‑oriented: organized around business subjects.
2. Integrated: consolidates data from disparate operational sources.
3. Non‑volatile: primarily read‑only for analysis.
4. Time‑variant: captures historical snapshots.
5. Summarized: transforms operational data into decision‑ready formats.
6. Large‑scale: handles massive time‑series datasets.
7. Denormalized: often stores redundant data for performance.
8. Metadata‑rich: retains data about data.
9. Multi‑source: ingests both internal and external data.

Purpose and Benefits of a Warehouse

1. Stores massive historical data for deep analysis.
2. Provides business users with easy data access.
3. Unifies disparate sources into a single queryable layer.
4. Offers extensibility to accommodate evolving business needs.
5. Ensures data quality, which is essential for trustworthy decisions.

Layered Modeling Rationale

With exploding data volumes, layered design improves query performance, reduces redundancy, ensures consistent metrics, and balances storage cost with compute efficiency.

Methodologies: Inmon vs. Kimball

Inmon (Top‑Down): Builds an enterprise‑wide 3NF model first, then moves data through stages (ODS → DW → Data Marts). Emphasizes data‑source orientation, extensive ETL, and a unified data model.

Kimball (Bottom‑Up): Starts from business processes, builds dimensional models (facts and dimensions) as data marts, then integrates the marts into a warehouse through shared (conformed) dimensions. Focuses on delivering business‑ready data quickly.

Fact and Dimension Tables

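Fact tables record measurable business events (e.g., orders) at a declared grain. Their design covers measure additivity (additive, semi‑additive, or non‑additive), null handling, consistent measure definitions across tables, periodicity (transaction versus snapshot facts), and aggregation.

Dimension tables describe the entities that give facts their context (e.g., products, cities). Their design covers drill‑down paths, degenerate dimensions (identifiers such as an order number that stay in the fact table without a dimension table of their own), denormalized flat dimensions, hierarchies, and handling of null attributes.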
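Additivity is easiest to see with numbers. The sketch below is illustrative only: the rows, the column layout, and the function names are assumptions rather than anything from the original article. It contrasts an additive measure (quantity sold) with a semi‑additive one (stock on hand), which may be summed across stores but not across dates.

```python
# Hypothetical fact rows: (date, store, quantity_sold, stock_on_hand)
daily_facts = [
    ("2019-09-01", "store_a", 5, 100),
    ("2019-09-01", "store_b", 3, 40),
    ("2019-09-02", "store_a", 7, 93),
    ("2019-09-02", "store_b", 2, 38),
]


def total_quantity(rows):
    """Additive measure: safe to sum across every dimension."""
    return sum(qty for _, _, qty, _ in rows)


def stock_on(rows, date):
    """Semi-additive measure: sum across stores, but only within one date."""
    return sum(stock for d, _, _, stock in rows if d == date)


def closing_stock(rows):
    """Across time, use the latest snapshot instead of summing all dates."""
    latest = max(d for d, _, _, _ in rows)
    return stock_on(rows, latest)
```

Summing stock_on_hand over all four rows would report 271 units, a figure that never existed in any store on any day, whereas closing_stock returns the 131 units actually on hand at the last snapshot.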

Star vs. Snowflake Schemas

The star schema directly links each dimension to the fact table, resulting in some redundancy but simple queries. The snowflake schema normalizes dimensions into sub‑dimensions, reducing storage at the cost of more joins.
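The trade‑off can be sketched with an in‑memory SQLite database. All table and column names here (fact_order, dim_product_star, dim_product_snow, dim_category) are hypothetical and chosen only for illustration. The same question, revenue by category, needs one join in the star layout and two in the snowflake layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category name is stored redundantly on every product row.
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the category is normalized into its own sub-dimension.
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

CREATE TABLE fact_order (
    order_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

INSERT INTO dim_category VALUES (1, 'books'), (2, 'toys');
INSERT INTO dim_product_snow VALUES
    (10, 'novel', 1), (11, 'atlas', 1), (12, 'puzzle', 2);
INSERT INTO dim_product_star VALUES
    (10, 'novel', 'books'), (11, 'atlas', 'books'), (12, 'puzzle', 'toys');
INSERT INTO fact_order VALUES
    (1, 10, 20.0), (2, 11, 35.0), (3, 12, 15.0), (4, 10, 20.0);
""")

# Star schema: one join from the fact table to the flat dimension.
STAR_SQL = """
SELECT p.category_name, SUM(f.amount)
FROM fact_order f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category_name
ORDER BY p.category_name
"""

# Snowflake schema: an extra join to reach the category sub-dimension.
SNOWFLAKE_SQL = """
SELECT c.category_name, SUM(f.amount)
FROM fact_order f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name
"""

star_result = conn.execute(STAR_SQL).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_SQL).fetchall()
```

Both queries return the same totals; the difference lies in where the category name is stored and how many joins are needed to reach it.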

Layered Architecture (ODS → CDM → ADS)

ODS (Operational Data Store): Near‑raw ingestion of transactional and log data, preserving history and enabling analysis of raw records.

CDM (Common Data Model) consists of:

DWD (Detail Data Layer): cleansed, standardized detail data, optionally with dimension attributes degenerated (folded) into fact tables to reduce joins.

DWS (Summary Data Layer): wide tables with aggregated metrics for reuse.

ADS (Application Data Store): Business‑specific, often non‑shared datasets holding complex calculations for downstream applications.
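The flow between layers can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how a production warehouse is implemented: the record fields, the cleansing rules, and the function names build_dwd, build_dws, and build_ads_top_user are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# ODS: near-raw records as ingested (hypothetical order events).
ods_orders = [
    {"order_id": "1", "user": " Alice ", "amount": "10.50", "dt": "2019-09-13"},
    {"order_id": "2", "user": "bob", "amount": "4.00", "dt": "2019-09-13"},
    {"order_id": "3", "user": "ALICE", "amount": "", "dt": "2019-09-13"},
    {"order_id": "4", "user": "alice", "amount": "2.25", "dt": "2019-09-13"},
]


def build_dwd(ods_rows):
    """DWD: cleanse and standardize detail rows; skip rows with no amount."""
    dwd = []
    for row in ods_rows:
        if not row["amount"]:
            continue
        dwd.append({
            "order_id": int(row["order_id"]),
            "user_id": row["user"].strip().lower(),
            "order_amt": Decimal(row["amount"]),
            "dt": row["dt"],
        })
    return dwd


def build_dws(dwd_rows):
    """DWS: aggregate reusable metrics per (day, user)."""
    agg = defaultdict(lambda: {"order_cnt": 0, "order_amt": Decimal("0")})
    for row in dwd_rows:
        key = (row["dt"], row["user_id"])
        agg[key]["order_cnt"] += 1
        agg[key]["order_amt"] += row["order_amt"]
    return dict(agg)


def build_ads_top_user(dws_rows, dt):
    """ADS: application-specific output, here the top spender for one day."""
    candidates = {user: m for (d, user), m in dws_rows.items() if d == dt}
    return max(candidates, key=lambda u: candidates[u]["order_amt"])


dws = build_dws(build_dwd(ods_orders))
top_user = build_ads_top_user(dws, "2019-09-13")
```

Because the cleansing lives in build_dwd, every downstream summary sees the same normalized user_id, which is the point of pushing common logic down to the lower layers.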

Modeling Principles

High cohesion & low coupling: group related data, separate unrelated data.

Separate core and extension models to protect performance.

Balance storage cost with compute performance.

Push common logic down to lower layers for consistency.

Idempotent processing: repeated runs yield the same results.

Standardized naming, data types, and null handling.

Use external Hive tables with columnar formats (ORC/Parquet) and compression.
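The idempotency principle above can be illustrated by replacing a whole partition on every run. The sketch uses SQLite as a stand‑in so it runs anywhere; the table name dws_trade_user_1d and the helper load_partition are hypothetical. In Hive, the analogous pattern is typically an INSERT OVERWRITE into a specific partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dws_trade_user_1d (dt TEXT, user_id TEXT, order_cnt INTEGER)"
)


def load_partition(conn, dt, rows):
    """Replace the whole dt partition so that reruns never duplicate rows."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM dws_trade_user_1d WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO dws_trade_user_1d (dt, user_id, order_cnt) "
            "VALUES (?, ?, ?)",
            [(dt, user_id, cnt) for user_id, cnt in rows],
        )


rows = [("alice", 2), ("bob", 1)]
load_partition(conn, "2019-09-13", rows)
load_partition(conn, "2019-09-13", rows)  # rerun, e.g. after a scheduler retry

row_count = conn.execute(
    "SELECT COUNT(*) FROM dws_trade_user_1d"
).fetchone()[0]
```

An append‑only load would leave four rows after the retry; the delete‑then‑insert inside a single transaction leaves two, no matter how many times the job runs.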

Naming Conventions

Tables follow a pattern based on layer, domain, and granularity, e.g., ods_{domain}_{source}_{table}_{freq}, dwd_{domain}_{entity}_{gran}_{freq}, dws_{domain}_{topic}_{entity}_{gran}_{freq}, ads_{domain}_{purpose}_{gran}_{freq}.

Field names are lowercase and underscore‑separated, carry meaningful suffixes (e.g., _cnt, _price), and avoid SQL keywords. Columns use appropriate data types (e.g., decimal(28,6) for monetary values).
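Conventions like these are easier to keep when they are checked automatically. The following is a minimal sketch of such a check, assuming a fixed set of frequency suffixes; the values in FREQ_SUFFIXES are hypothetical, since the article does not enumerate them. The minimum part counts come from the table patterns listed above.

```python
import re

# Minimum number of underscore-separated parts per layer, per the patterns above.
MIN_PARTS = {"ods": 5, "dwd": 5, "dws": 6, "ads": 5}

# Hypothetical frequency suffixes; adjust to the team's actual convention.
FREQ_SUFFIXES = {"di", "df", "hi"}

# Each part must be non-empty lowercase alphanumerics (e.g. "trade", "1d").
PART_RE = re.compile(r"^[a-z0-9]+$")


def is_valid_table_name(name):
    """Return True if name follows the layer, part-count, and suffix rules."""
    parts = name.split("_")
    layer = parts[0]
    if layer not in MIN_PARTS or len(parts) < MIN_PARTS[layer]:
        return False
    if not all(PART_RE.match(p) for p in parts):
        return False
    return parts[-1] in FREQ_SUFFIXES
```

For example, is_valid_table_name("dws_trade_order_user_1d_di") passes, while "ads_report_di" fails for having too few parts and "DWD_Trade_Order_detail_di" fails because it is not lowercase.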

Overall, adhering to these guidelines helps build a reliable, scalable, and maintainable data warehouse that serves both analytical and business needs.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
