Data Warehouse Overview, Architecture, and Modeling Methodology
This article provides a comprehensive introduction to data warehouses, covering their definition, architectural layers, characteristics, modeling approaches such as Inmon and Kimball, fact and dimension table design, star and snowflake schemas, and best‑practice principles for building scalable, maintainable warehouse solutions in the big‑data ecosystem.
Data Warehouse Overview
A data warehouse (DW or DWH) is a strategic data store designed to support enterprise‑wide decision‑making by providing integrated, subject‑oriented, non‑volatile, and time‑variant data for analytical reporting and business intelligence.
From Traditional to Internet‑Scale Warehouses
The evolution from classic warehouses to modern internet‑scale warehouses is exemplified by Alibaba's data architecture, where the core modeling work resides in the data computation layer, transforming raw operational data into valuable analytical datasets.
Why Direct Access to Operational Systems Fails
Security or policy restrictions prevent direct access to certain business data.
Frequent version changes require constant re‑engineering of analytical queries.
Aggregating data from multiple system versions is difficult.
Hard‑coded column names and inconsistent data formats hinder analysis.
Transactional schemas are not optimized for analytical workloads.
Operational systems lack proper metadata storage and unified data definitions.
Competing resource demands cause analytical workloads to suffer when sharing hardware with OLTP systems.
Data Warehouse Characteristics
1. Subject‑oriented: organized around business subjects.
2. Integrated: consolidates data from disparate operational sources.
3. Non‑volatile: primarily read‑only for analysis.
4. Time‑variant: captures historical snapshots.
5. Summarized: transforms operational data into decision‑ready formats.
6. Large‑scale: handles massive time‑series datasets.
7. Denormalized: often stores redundant data for performance.
8. Metadata‑rich: retains data about data.
9. Multi‑source: ingests both internal and external data.
Purpose and Benefits of a Warehouse
1. Stores massive historical data for deep analysis.
2. Provides business users with easy data access.
3. Unifies disparate sources into a single queryable layer.
4. Offers extensibility to accommodate evolving business needs.
5. Ensures data quality, which is essential for trustworthy decisions.
Layered Modeling Rationale
With exploding data volumes, layered design improves query performance, reduces redundancy, ensures consistent metrics, and balances storage cost with compute efficiency.
Methodologies: Inmon vs. Kimball
Inmon (Top‑Down): Builds an enterprise‑wide 3NF model first, then moves data through stages (ODS → DW → Data Marts). Emphasizes data‑source orientation, extensive ETL, and a unified data model.
Kimball (Bottom‑Up): Starts from business processes, creates dimensional models (facts & dimensions) in data marts, then integrates the marts into a warehouse through shared (conformed) dimensions. Focuses on delivering business‑ready data quickly.
Fact and Dimension Tables
Fact tables store measurable events (e.g., orders) and have properties such as additivity, null‑handling, consistency, periodicity, and aggregation.
Dimension tables describe entities (e.g., products, cities) and include concepts like drill‑down, degenerate dimensions, denormalized flat dimensions, hierarchies, and handling of null attributes.
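To make additivity and degenerate dimensions more concrete, here is a minimal Python sketch. The rows, column names (order_no, amount_cents, and so on), and helper functions are hypothetical and not from the original article. It assumes order_no is kept on the fact row as a degenerate dimension (it has no dimension table of its own), that quantity and amount_cents are additive measures, and that average unit price is a non‑additive ratio that should be recomputed from additive components rather than averaged row by row.

```python
# Hypothetical order-line fact rows. "order_no" is a degenerate dimension:
# it lives on the fact row and has no separate dimension table.
FACT_ORDER_LINES = [
    {"order_no": "A1001", "product_id": 1, "quantity": 2, "amount_cents": 2000},
    {"order_no": "A1001", "product_id": 2, "quantity": 1, "amount_cents": 3000},
    {"order_no": "A1002", "product_id": 1, "quantity": 7, "amount_cents": 7000},
]


def total_by(rows, key, measure):
    """Sum an additive measure grouped by any dimension key."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[measure]
    return totals


def avg_unit_price_cents(rows):
    """Derive the ratio from additive components (total amount / total quantity)."""
    total_amount = sum(r["amount_cents"] for r in rows)
    total_quantity = sum(r["quantity"] for r in rows)
    return total_amount / total_quantity


def naive_avg_of_row_prices(rows):
    """Average the per-row unit prices; this ignores quantity weighting."""
    return sum(r["amount_cents"] / r["quantity"] for r in rows) / len(rows)
```

Grouping amount_cents by order_no or quantity by product_id gives meaningful totals because both measures are additive. The two averaging functions disagree on this data, which is why ratios are usually stored as their numerator and denominator and computed at query time.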
Star vs. Snowflake Schemas
The star schema directly links each dimension to the fact table, resulting in some redundancy but simple queries. The snowflake schema normalizes dimensions into sub‑dimensions, reducing storage at the cost of more joins.
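The trade‑off between the two schemas can be sketched with Python's built‑in sqlite3 module. All table names, column names, and data below are illustrative assumptions rather than anything prescribed by the article. The star variant repeats category_name on every product row, while the snowflake variant moves it into a separate dim_category table and needs one extra join to answer the same question.

```python
import sqlite3

# Star: the category name is stored redundantly on each product row.
STAR_DDL = """
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_name TEXT
);
"""

# Snowflake: the category is normalized into its own sub-dimension.
SNOWFLAKE_DDL = """
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
"""

FACT_DDL = """
CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY,
    product_id INTEGER,
    amount_cents INTEGER
);
"""

STAR_QUERY = """
SELECT p.category_name, SUM(f.amount_cents)
FROM fact_orders f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category_name
ORDER BY p.category_name
"""

SNOWFLAKE_QUERY = """
SELECT c.category_name, SUM(f.amount_cents)
FROM fact_orders f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name
"""


def build_demo():
    """Create an in-memory database holding both schema variants."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_DDL + SNOWFLAKE_DDL + FACT_DDL)
    conn.executemany(
        "INSERT INTO dim_category VALUES (?, ?)",
        [(10, "electronics"), (20, "furniture")],
    )
    conn.executemany(
        "INSERT INTO dim_product_snow VALUES (?, ?, ?)",
        [(1, "laptop", 10), (2, "phone", 10), (3, "desk", 20)],
    )
    conn.executemany(
        "INSERT INTO dim_product_star VALUES (?, ?, ?)",
        [(1, "laptop", "electronics"), (2, "phone", "electronics"), (3, "desk", "furniture")],
    )
    conn.executemany(
        "INSERT INTO fact_orders VALUES (?, ?, ?)",
        [(100, 1, 90000), (101, 2, 50000), (102, 3, 20000), (103, 1, 90000)],
    )
    return conn


def revenue_by_category(conn, query):
    """Run one of the two queries and return (category, total) rows."""
    return conn.execute(query).fetchall()
```

Both queries return the same totals per category. The difference is only in where the category name lives and how many joins are needed to reach it.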
Layered Architecture (ODS → CDM → ADS)
ODS (Operational Data Store): Near‑raw ingestion of transactional and log data, preserving history and enabling raw analysis.
CDM (Common Data Model) consists of:
DWD (Detail Data Layer): cleansed and standardized detail data, possibly with dimension attributes degenerated into the fact tables.
DWS (Summary Data Layer): wide tables with aggregated metrics for reuse.
ADS (Application Data Store): Business‑specific, often non‑shared, complex calculations for downstream applications.
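The flow through the layers can be illustrated with a small in‑memory Python sketch. This is not how a production warehouse is implemented: real layers would be tables populated by ETL jobs. The sample records, field names, and the three functions are hypothetical, and the rule that a missing amount becomes 0 is an assumption made only for the example.

```python
from collections import defaultdict
from decimal import Decimal

# ODS: near-raw records as they might arrive from a source system (made-up data).
ODS_ORDERS = [
    {"order_id": "1", "city": " Hangzhou ", "amount": "120.50", "dt": "2024-01-01"},
    {"order_id": "2", "city": "hangzhou", "amount": "80.00", "dt": "2024-01-01"},
    {"order_id": "3", "city": "Beijing", "amount": None, "dt": "2024-01-01"},
    {"order_id": "4", "city": "Beijing", "amount": "40.25", "dt": "2024-01-02"},
]


def build_dwd(ods_rows):
    """DWD: cleanse and standardize detail rows, one output row per order."""
    dwd = []
    for row in ods_rows:
        dwd.append({
            "order_id": int(row["order_id"]),
            "city": row["city"].strip().lower(),
            # Assumption: a missing amount is standardized to 0, not dropped.
            "amount": Decimal(row["amount"]) if row["amount"] is not None else Decimal("0"),
            "dt": row["dt"],
        })
    return dwd


def build_dws(dwd_rows):
    """DWS: reusable daily summary per city (order count and total amount)."""
    summary = defaultdict(lambda: {"order_cnt": 0, "amount_sum": Decimal("0")})
    for row in dwd_rows:
        key = (row["dt"], row["city"])
        summary[key]["order_cnt"] += 1
        summary[key]["amount_sum"] += row["amount"]
    return dict(summary)


def build_ads(dws_summary):
    """ADS: application-specific result, here the top city by amount per day."""
    best = {}
    for (dt, city), metrics in dws_summary.items():
        if dt not in best or metrics["amount_sum"] > best[dt][1]:
            best[dt] = (city, metrics["amount_sum"])
    return {dt: city for dt, (city, _) in best.items()}
```

Calling build_ads(build_dws(build_dwd(ODS_ORDERS))) walks the data through all three steps. Note that the ADS step reads only the DWS summary, so any other application needing daily city totals could reuse the same DWS output.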
Modeling Principles
High cohesion & low coupling: group related data, separate unrelated data.
Separate core and extension models to protect performance.
Balance storage cost with compute performance.
Push common logic down to lower layers for consistency.
Idempotent processing: repeated runs yield the same results.
Standardized naming, data types, and null handling.
Use external Hive tables with columnar formats (ORC/Parquet) and compression.
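Of the principles above, idempotent processing is the easiest to show in code. The sketch below models a partitioned table as a plain Python dict keyed by a date partition. The function names and data are hypothetical. Overwriting a whole partition on each run is similar in spirit to using INSERT OVERWRITE on a Hive partition, whereas appending makes a rerun duplicate rows.

```python
def load_append(table, dt, rows):
    """Non-idempotent: every rerun adds the same rows to the partition again."""
    table.setdefault(dt, []).extend(rows)


def load_overwrite(table, dt, rows):
    """Idempotent: the partition is replaced, so reruns leave the same state."""
    table[dt] = list(rows)


def run_twice(loader):
    """Simulate a job that is retried once for the same date partition."""
    table = {}
    rows = [
        {"order_id": 1, "amount_cents": 500},
        {"order_id": 2, "amount_cents": 700},
    ]
    for _ in range(2):
        loader(table, "2024-01-01", rows)
    return table
```

After run_twice(load_append) the partition holds four rows, while run_twice(load_overwrite) still holds two, which is the property that makes retries and backfills safe.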
Naming Conventions
Tables follow a pattern based on layer, domain, and granularity, e.g., ods_{domain}_{source}_{table}_{freq}, dwd_{domain}_{entity}_{gran}_{freq}, dws_{domain}_{topic}_{entity}_{gran}_{freq}, ads_{domain}_{purpose}_{gran}_{freq}.
Field names are lowercase, use underscores as separators, carry meaningful suffixes (_cnt, _price), and avoid SQL keywords. Fields should also use appropriate data types (e.g., decimal(28,6) for monetary values).
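Conventions like these are easier to keep when they are checked by code. The following Python sketch assembles table names from the patterns above and validates field names. The helper names, the short keyword list, and example values such as "trade" and "di" are assumptions for illustration only, and a real project would check against the full reserved‑word list of its SQL engine.

```python
import re

# Ordered name parts per layer, following the patterns described above.
LAYER_PARTS = {
    "ods": ("domain", "source", "table", "freq"),
    "dwd": ("domain", "entity", "gran", "freq"),
    "dws": ("domain", "topic", "entity", "gran", "freq"),
    "ads": ("domain", "purpose", "gran", "freq"),
}

# Lowercase words separated by single underscores, starting with a letter.
_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Deliberately tiny, illustrative subset of SQL keywords.
SQL_KEYWORDS = {"select", "from", "where", "group", "order", "table", "join"}


def build_table_name(layer, **parts):
    """Assemble a table name such as dwd_{domain}_{entity}_{gran}_{freq}."""
    if layer not in LAYER_PARTS:
        raise ValueError(f"unknown layer: {layer}")
    expected = LAYER_PARTS[layer]
    if set(parts) != set(expected):
        raise ValueError(f"{layer} tables need exactly these parts: {expected}")
    for name in expected:
        if not _NAME_RE.match(parts[name]):
            raise ValueError(f"invalid value for {name}: {parts[name]!r}")
    return "_".join([layer] + [parts[name] for name in expected])


def is_valid_field_name(name):
    """Check lowercase snake_case and reject the listed SQL keywords."""
    return bool(_NAME_RE.match(name)) and name not in SQL_KEYWORDS
```

For example, build_table_name("dwd", domain="trade", entity="order", gran="detail", freq="di") yields dwd_trade_order_detail_di, and is_valid_field_name rejects both OrderCnt and the bare keyword order while accepting order_cnt.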
Overall, adhering to these guidelines helps build a reliable, scalable, and maintainable data warehouse that serves both analytical and business needs.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
