Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling
This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.
1. Technical Overview
FLINK (1.12.1) – Distributed computing engine
HIVE (3.1.2) – Popular HQL data‑warehouse tool
PHOENIX (5.0.0) – HBase SQL query engine
SPARK (3.0.1) – Distributed computing engine
SQOOP (1.4.7) – Data ingestion and export service
TEZ (0.10.0) – Optimizes MapReduce DAG
YARN (3.1.1) – Distributed resource scheduler
HDFS (3.1.1) – Distributed storage service
ZOOKEEPER (3.4.13) – Distributed coordination service
ALERTMANAGER (0.21.0) – Sends monitoring alerts
GRAFANA (6.5.1) – Visualizes monitoring data
INFLUXDB (1.8.0) – Stores monitoring data
NODEEXPORTER (1.0.0) – Reads node resource metrics
PROMETHEUS (2.18.1) – Pulls monitoring data
USDPMONITOR (1.0.0) – Reads component monitoring metrics
HUE (4.8.0) – Visual management service
ZEPPELIN (0.9.0) – Visual management service
DOLPHINSCHEDULER (1.3.6) – Task scheduling service
2. Types of Tables in a Data Warehouse
2.1 Fact Tables
Fact tables are divided into transactional fact tables (immutable records such as transaction logs) and periodic fact tables (records that change over business cycles, e.g., order status). They contain only primary‑key values that reference dimension tables, store numeric measures, and grow rapidly.
2.2 Dimension Tables
Dimension tables describe business entities (e.g., regions, product categories) and can be regular or static. Each has a single primary key used as a foreign key in fact tables and is typically wide, flat, and non‑normalized with many low‑granularity text attributes.
2.3 Summary of Table Concepts
In a data warehouse, strict normalization is unnecessary because the primary goal is analytical querying, not frequent updates. Fact tables aim to correctly record historical information, while dimension tables are designed to enable flexible aggregation from appropriate perspectives.
2.4 Table Synchronization Strategies
2.4.1 Dimension Tables
Dimension data changes over time; choose full load or slowly changing dimension (SCD) approaches based on business needs. For slowly changing dimensions, use SCD (link) tables or history tables to preserve past values.
2.4.2 Fact Table Synchronization
Transactional Fact Tables
Only daily incremental loads are needed because data never changes after insertion.
Periodic Fact Tables
Daily full load – impractical due to data volume and redundancy.
Daily incremental – cannot reflect data changes for historical snapshots.
Link (SCD) table – combine daily inserts and changes to enable time‑slice snapshots; requires capturing daily new and changed rows.
2.4.3 Design of Link (SCD) Tables
Store only changed rows, avoiding redundant data.
Enable retrieval of historical snapshots.
Add two columns (dw_start_date, dw_end_date) to indicate row validity; dw_end_date defaults to 9999‑12‑31.
When data changes, set the old row's dw_end_date to the change time and insert a new row with dw_start_date set to that time.
Typical workflow: initial full load, then incremental loads into ODS, followed by merging into historical tables.
3. Dimensional Model Design
3.1 Basic Concepts of Dimensional Modeling
Dimensional modeling starts from analytical decision‑making needs, building models that serve analysis requirements and provide good performance for large‑scale complex queries. It is a method specifically for analytical databases, data warehouses, and data marts.
3.2 Three Modeling Patterns
3.2.1 Star Schema
The most common approach: a central fact table surrounded by directly linked dimension tables, forming a star shape. Characteristics:
Dimension tables only connect to the fact table; no inter‑dimension relationships.
Each dimension table’s primary key is a foreign key in the fact table.
Typical star schema diagram shown below.
3.2.2 Snowflake Schema
An extension of the star schema that normalizes dimension tables into multiple related tables. Although more normalized, it is harder to understand, has higher maintenance cost, and lower query performance, so it is less commonly used.
3.2.3 Constellation Schema
Derived from the star schema but based on multiple fact tables sharing dimension tables. In many real‑world scenarios, several fact tables exist within the same dimensional space, and a dimension table may be reused by multiple facts, making the constellation model prevalent in later stages of development.
4. Data Warehouse Theory
4.1 Why Layer the Data?
Clear data structure – each layer has a specific scope, simplifying usage and understanding.
Data lineage tracking – enables quick pinpointing of source issues.
Reduce duplicate development – standardized layers allow reusable intermediate data.
Simplify complex problems – each layer handles a single step, making maintenance easier.
4.2 Data Layers
4.2.1 Operational Data Store (ODS)
Raw data layer that stores source system data after ETL extraction, preserving original granularity and structure, typically partitioned by business domain.
Modeling Approach and Principles
Incremental extraction from source systems, retain timestamps as needed, partition tables periodically, keep data unchanged from source, and organize by business themes.
4.2.2 Detailed Data Layer (DWD)
Cleaned version of ODS data (removing nulls, dirty data, anomalies) with the same structure and granularity, providing detailed historical data for downstream layers.
Modeling Approach and Principles
Add business date fields to support re‑processing, partition by year/month/day, and merge incremental ODS data with previous DWD tables.
4.2.3 Data Service Layer (DWS)
Aggregates and models data from DWD by business themes (e.g., tourism consumption), involving dimensions, facts, metrics, and granularity.
Modeling Approach and Principles
Aggregate and derive new facts.
Join facts from other themes; cross‑domain possible.
May use denormalized designs for performance.
4.2.4 Data Application Layer (ADS)
Provides data products for reporting, dashboards, OLAP, and analytics, typically stored in ES, MySQL, Hive, or Druid.
Modeling Approach and Principles
Keep data volume small.
Use star‑schema dimensional modeling.
Include dimension surrogate keys and measures.
Add business date fields for re‑runs.
Store without partitioning.
4.2.5 Data Mart Layer (DM)
Intermediate layer between DWS and ADS, building theme‑based models that can serve both analysis and ADS, essentially a small data warehouse.
Modeling Approach and Principles
Minimize computation at access time, optimize retrieval.
Use star‑schema dimensional modeling.
Widen facts, pre‑compute measures.
Partition storage.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.