
Understanding Data Warehouses: Concepts, Architecture, Modeling, and Governance

This article provides a comprehensive overview of data warehouses, explaining their purpose, differences from databases, OLTP vs OLAP, traditional versus internet data warehouse models, layered architecture, modeling theories, metric dictionaries, date dimensions, naming conventions, data governance, and incremental synchronization techniques with practical SQL examples.

Big Data Technology & Architecture

From a literal perspective, a data warehouse (DW) is a repository that stores various types of data organized according to specific structures and rules, distinct from traditional relational databases that primarily hold transactional business data.

Database vs Data Warehouse – Databases (MySQL, Oracle, PostgreSQL, etc.) handle OLTP (online transaction processing) workloads requiring high concurrency and transaction support, whereas data warehouses focus on OLAP (online analytical processing) for analytical queries with minimal DML operations.
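The contrast between the two workloads can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in engine; the `orders` table and its columns are hypothetical and not taken from the article.

```python
import sqlite3

# In-memory database used only for illustration; the orders table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, 20.0, "paid"), (2, 10, 35.0, "paid"), (3, 11, 15.0, "created")],
)

# OLTP-style access: a short transaction that touches one row by primary key.
with conn:
    conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (3,))
order_3_status = conn.execute("SELECT status FROM orders WHERE id = 3").fetchone()[0]

# OLAP-style access: a read-only query that scans and aggregates many rows.
revenue_by_user = dict(
    conn.execute(
        "SELECT user_id, SUM(amount) FROM orders WHERE status = 'paid' GROUP BY user_id"
    ).fetchall()
)
print(order_3_status, revenue_by_user)
```

The first statement is typical of what a transactional database serves thousands of times per second; the second is the kind of query a warehouse is organized around.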

The DW ecosystem integrates many existing technologies; traditional DWs rely on relational databases (e.g., Greenplum) while modern cloud‑based solutions leverage cheaper hardware and big‑data technologies, reducing strict architectural constraints.

In recent years, data warehouses have become mainstream in internet companies, evolving from a mysterious, high‑level concept to a widely understood data platform. Common interview questions include definitions of DW, its characteristics, OLTP/OLAP differences, ladder tables, synchronization methods, incremental loading, and ETL.

Traditional vs Internet Data Warehouse – Traditional DWs are built and maintained by dedicated teams and expose data to a limited audience, while internet‑scale DWs emphasize open, self‑service data access, which brings challenges such as poor data quality, duplicated data, inconsistent metric definitions, and governance overhead.

DW Architecture – A typical layered architecture consists of ODS (raw data), DWD (detail wide tables), DWS (data mart/subject‑oriented aggregation), and ADS (application/reporting layer). Inmon’s top‑down (EDW‑DM) and Kimball’s bottom‑up (DM‑DW) modeling approaches are often combined in practice, alongside other models such as Data Vault or Anchor Modeling.
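A minimal sketch of how data might flow through the four layers, again using sqlite3 as a stand-in engine. All table and column names (`ods_order`, `dwd_order_detail`, `dws_city_order_1d`, `ads_city_sales_rank`) are hypothetical examples, and the cleaning rule is an assumption chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- ODS: raw copies of source tables (hypothetical sample data)
CREATE TABLE ods_order (order_id INTEGER, user_id INTEGER, amount_cents INTEGER, dt TEXT);
CREATE TABLE ods_user  (user_id INTEGER, city TEXT);
INSERT INTO ods_order VALUES
  (1, 10, 2000, '2024-01-01'), (2, 10, 3500, '2024-01-01'),
  (3, 11, 1500, '2024-01-01'), (4, 11, NULL, '2024-01-01');
INSERT INTO ods_user VALUES (10, 'Hangzhou'), (11, 'Beijing');

-- DWD: cleaned detail wide table (drop rows without an amount, join in user attributes)
CREATE TABLE dwd_order_detail AS
SELECT o.order_id, o.user_id, u.city, o.amount_cents, o.dt
FROM ods_order o JOIN ods_user u ON u.user_id = o.user_id
WHERE o.amount_cents IS NOT NULL;

-- DWS: subject-oriented daily aggregation
CREATE TABLE dws_city_order_1d AS
SELECT dt, city, COUNT(*) AS order_cnt, SUM(amount_cents) AS amount_cents
FROM dwd_order_detail GROUP BY dt, city;

-- ADS: report-ready result consumed by dashboards
CREATE TABLE ads_city_sales_rank AS
SELECT dt, city, amount_cents FROM dws_city_order_1d;
""")

ads_rows = conn.execute(
    "SELECT city, amount_cents FROM ads_city_sales_rank ORDER BY amount_cents DESC"
).fetchall()
print(ads_rows)
```

Each layer reads only from the layer beneath it, which is what makes lineage and impact analysis tractable later on.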

Metric Dictionary – Metrics (KPIs) are managed centrally, often in Excel or a dedicated system, with standardized coding, types (basic, derived, calculated), business definitions, and ownership to ensure consistent reporting.
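One possible shape for a metric dictionary entry is sketched below. The field names mirror the attributes listed above (code, type, business definition, owner); the class names, the sample metrics, and the duplicate-code rule are assumptions made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from enum import Enum


class MetricType(Enum):
    BASIC = "basic"
    DERIVED = "derived"
    CALCULATED = "calculated"


@dataclass(frozen=True)
class Metric:
    code: str            # standardized metric code, unique across the dictionary
    name: str
    metric_type: MetricType
    definition: str      # business definition agreed with the owner
    owner: str


class MetricDictionary:
    """Central registry that rejects duplicate metric codes."""

    def __init__(self):
        self._metrics = {}

    def register(self, metric):
        if metric.code in self._metrics:
            raise ValueError(f"duplicate metric code: {metric.code}")
        self._metrics[metric.code] = metric

    def get(self, code):
        return self._metrics[code]


# Hypothetical entries, for illustration only.
metrics = MetricDictionary()
metrics.register(Metric("M0001", "order_amount", MetricType.BASIC,
                        "Sum of paid order amounts", "sales_team"))
metrics.register(Metric("M0002", "order_amount_7d", MetricType.DERIVED,
                        "order_amount over the last 7 days", "sales_team"))
print(metrics.get("M0002").definition)
```

Rejecting a second registration of the same code is the simplest guard against two teams publishing different numbers under one metric.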

Date Dimension – A date dimension table stores a full calendar with attributes such as year, quarter, month, week, day, holidays, lunar calendar, and custom business flags, facilitating time‑based analysis. It can be populated via SQL, Java, or Python scripts.
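As a rough sketch of the Python route, the function below generates one row per calendar day using only the standard library. The column names are assumptions; holidays, lunar calendar fields, and custom business flags are left out because they depend on external calendars and business rules.

```python
from datetime import date, timedelta


def build_date_dim(start, end):
    """Return one dict per day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        iso_year, iso_week, iso_weekday = day.isocalendar()
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # e.g. 20240101
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "iso_week": iso_week,                      # ISO 8601 week number
            "day_of_week": iso_weekday,                # 1 = Monday ... 7 = Sunday
            "is_weekend": iso_weekday >= 6,
        })
        day += timedelta(days=1)
    return rows


dim = build_date_dim(date(2024, 1, 1), date(2024, 12, 31))
print(len(dim), dim[0])
```

The resulting rows can be bulk-inserted into a `dim_` table once and then joined on `date_key` by every fact table.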

Naming Conventions – Consistent naming (e.g., prefixes dwd_, dws_, ads_, dim_, tmp_) and word‑root usage help maintain clarity across tables, fields, and metrics, reducing future refactoring effort.
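A convention like this is easier to keep when it is checked automatically. The sketch below validates table names against the prefixes listed above; the exact pattern (lowercase words separated by single underscores) is an assumption and should be adapted to the team's own rules.

```python
import re

# Layer prefixes taken from the convention above.
LAYER_PREFIXES = ("dwd", "dws", "ads", "dim", "tmp")

# Assumed rule: prefix, then lowercase words separated by single underscores.
TABLE_NAME_RE = re.compile(
    r"^(?:%s)_[a-z][a-z0-9]*(?:_[a-z0-9]+)*$" % "|".join(LAYER_PREFIXES)
)


def check_table_name(name):
    """Return True if the table name follows the assumed convention."""
    return TABLE_NAME_RE.fullmatch(name) is not None


for candidate in ("dwd_rack_machine", "rack_machine", "DWD_Rack", "tmp_a"):
    print(candidate, check_table_name(candidate))
```

A check of this kind can run in code review or before a task is published, so that badly named tables never reach production.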

Data Governance – Governance covers data quality, standards, lineage, impact analysis, and continuous improvement processes, addressing issues like inconsistent standards, poor data quality, and unclear change impacts.
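Impact analysis over lineage amounts to a graph traversal: given a table that is about to change, list everything downstream of it. The sketch below uses a hand-written lineage map with hypothetical table names; in practice the map would come from a metadata system or from parsing task definitions.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables built directly from it.
LINEAGE = {
    "ods_order": ["dwd_order_detail"],
    "ods_user": ["dwd_order_detail"],
    "dwd_order_detail": ["dws_city_order_1d", "dws_user_order_1d"],
    "dws_city_order_1d": ["ads_city_sales_rank"],
    "dws_user_order_1d": [],
    "ads_city_sales_rank": [],
}


def downstream_impact(table, lineage):
    """Breadth-first walk returning every table affected by a change to `table`."""
    seen = set()
    order = []
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream_impact("ods_user", LINEAGE))
```

The returned list is the set of owners who should be notified before a schema change goes out.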

Incremental Loading – Incremental sync captures only changed records using create_time, update_time, and primary keys, avoiding full‑load overhead. Example SQL for full load:

-- A full sync usually deletes first, then inserts (xxx stands for the source query)
delete from tmp_a;
insert into tmp_a xxx;
-- Or use insert overwrite directly
insert overwrite table tmp_a xxx;

For incremental tables:

create table tmp_a(
    id bigint,
    create_time datetime,
    update_time datetime
);

Techniques include row_number partitioning, full joins, and left join + union all to derive the latest snapshot.
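Two of these techniques can be sketched side by side. The example runs the SQL through sqlite3 (window functions need SQLite 3.25 or later); the `name` payload column and the `tmp_a_inc` table holding today's changed rows are assumptions added for illustration. The left join variant also assumes the increment holds at most one row per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- tmp_a: yesterday's full snapshot; tmp_a_inc: rows created or updated today (hypothetical)
CREATE TABLE tmp_a     (id INTEGER, name TEXT, create_time TEXT, update_time TEXT);
CREATE TABLE tmp_a_inc (id INTEGER, name TEXT, create_time TEXT, update_time TEXT);
INSERT INTO tmp_a VALUES
  (1, 'a',  '2024-01-01 08:00:00', '2024-01-01 08:00:00'),
  (2, 'b',  '2024-01-01 09:00:00', '2024-01-01 09:00:00');
INSERT INTO tmp_a_inc VALUES
  (2, 'b2', '2024-01-01 09:00:00', '2024-01-02 10:00:00'),
  (3, 'c',  '2024-01-02 11:00:00', '2024-01-02 11:00:00');
""")

# Technique 1: union old and new rows, keep the newest version per primary key.
ROW_NUMBER_SQL = """
SELECT id, name, update_time FROM (
  SELECT id, name, update_time,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
  FROM (SELECT * FROM tmp_a UNION ALL SELECT * FROM tmp_a_inc) AS merged
) AS ranked
WHERE rn = 1
ORDER BY id
"""

# Technique 2: keep old rows that have no newer version, then append all new rows.
LEFT_JOIN_SQL = """
SELECT id, name, update_time FROM (
  SELECT o.id AS id, o.name AS name, o.update_time AS update_time
  FROM tmp_a AS o LEFT JOIN tmp_a_inc AS i ON i.id = o.id
  WHERE i.id IS NULL
  UNION ALL
  SELECT id, name, update_time FROM tmp_a_inc
) AS merged
ORDER BY id
"""

latest_by_row_number = conn.execute(ROW_NUMBER_SQL).fetchall()
latest_by_left_join = conn.execute(LEFT_JOIN_SQL).fetchall()
print(latest_by_row_number)
```

Both queries yield the same latest snapshot here. The row_number form tolerates several versions of one id in the increment, while the left join form avoids a sort over the full history.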

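Ladder Tables – Ladder tables (often called zipper tables, a form of slowly changing dimension) keep every change to a record as a separate row version, so the state of the data at any past date can be reconstructed. Many internet companies now prefer partition‑based approaches instead.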
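The idea of recording every change can be sketched in plain Python. The layout below assumes each version carries a `start_date` and an `end_date`, with `9999-12-31` marking the current version and the end date treated as exclusive; these conventions and the sample data are assumptions, since the article does not specify them.

```python
OPEN_END = "9999-12-31"  # assumed marker for the currently valid version


def apply_changes(history, changes, load_date):
    """Close changed open versions and append new ones; returns a new list.

    history: list of dicts with id, value, start_date, end_date
    changes: dict mapping id -> new value observed on load_date
    """
    result = []
    open_ids = set()
    for row in history:
        is_open = row["end_date"] == OPEN_END
        if is_open:
            open_ids.add(row["id"])
        if is_open and row["id"] in changes and changes[row["id"]] != row["value"]:
            # Close the old version and open a new one starting on load_date.
            result.append({**row, "end_date": load_date})
            result.append({"id": row["id"], "value": changes[row["id"]],
                           "start_date": load_date, "end_date": OPEN_END})
        else:
            result.append(row)
    for key, value in changes.items():
        if key not in open_ids:  # brand-new record
            result.append({"id": key, "value": value,
                           "start_date": load_date, "end_date": OPEN_END})
    return result


def as_of(history, day):
    """Return id -> value as it was on the given ISO date."""
    return {r["id"]: r["value"] for r in history
            if r["start_date"] <= day < r["end_date"]}


history = [{"id": 1, "value": "bronze", "start_date": "2024-01-01", "end_date": OPEN_END}]
history = apply_changes(history, {1: "gold", 2: "bronze"}, "2024-02-01")
print(as_of(history, "2024-01-15"), as_of(history, "2024-02-01"))
```

Compared with a full snapshot per day, this layout stores one row per change rather than one row per record per day, at the cost of a more involved load step.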

Upstream/Downstream Agreements – Upstream systems must provide stable schemas, timestamps, and logical delete flags; downstream consumers (reports, BI tools) need clear documentation of table usage, field definitions, and change notifications.

Task Annotation – In Alibaba DataWorks, tasks are annotated with metadata such as @name, @description, @target, @source, @author, and @modify to improve traceability.

--    @name p_dwd_rack_machine
--    @description rack wide table
--    @target rack.dwd_rack_machine

--    @source owo_ods.kylin__machine_release_his
--    @source owo_ods.kylin__machine_device_his

--    @author yuguiyang 2017-12-25
--    @modify
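
Because the annotations follow a regular `-- @key value` shape, they can be extracted mechanically, for example to build lineage from the `@source` and `@target` entries. The helper below is a hypothetical sketch and not part of DataWorks; it simply scans SQL comment lines with a regular expression.

```python
import re
from collections import defaultdict

# Matches comment lines such as "--    @source owo_ods.kylin__machine_release_his".
ANNOTATION_RE = re.compile(r"^--\s*@(\w+)\s*(.*)$")


def parse_task_header(sql_text):
    """Collect annotation values per key; keys such as @source may repeat."""
    meta = defaultdict(list)
    for line in sql_text.splitlines():
        match = ANNOTATION_RE.match(line)
        if match:
            key, value = match.groups()
            meta[key].append(value.strip())
    return dict(meta)


HEADER = """\
--    @name p_dwd_rack_machine
--    @description rack wide table
--    @target rack.dwd_rack_machine

--    @source owo_ods.kylin__machine_release_his
--    @source owo_ods.kylin__machine_device_his

--    @author yuguiyang 2017-12-25
--    @modify
"""

meta = parse_task_header(HEADER)
print(meta["target"], meta["source"])
```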

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
