Big Data 25 min read

What Is a Data Warehouse? From Basics to Modern Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines the evolution from classic to internet‑scale warehouses, details modeling approaches and layered architectures, discusses KPI dictionaries, date dimensions, naming standards, data governance, incremental loading techniques, and upstream/downstream coordination.

dbaplus Community

Oct 14, 2023

What Is a Data Warehouse?

A Data Warehouse (DW or DWH) is a subject‑oriented, integrated, relatively stable, time‑variant collection of data designed to support management decision‑making.

Database vs. Data Warehouse

Databases store operational (OLTP) data with high concurrency and transaction support, while data warehouses store analytical (OLAP) data optimized for large‑scale queries and rarely involve inserts, updates, or deletes.

Traditional vs. Internet Data Warehouse

Traditional warehouses rely on relational databases (e.g., Greenplum) and heavy hardware; projects often involve large teams and long timelines. Internet‑scale warehouses prioritize lightweight, cloud‑based solutions, with analysts taking more ownership of tables and standards.

Data Warehouse Architecture

The core architecture is a layered model:

ODS (Operational Data Store) – raw data ingestion.

DM (Data Mart) – cleaned and integrated data.

DWD (Detail Layer) – wide tables with fine‑grained records.

DWS (Warehouse/Subject Layer) – aggregated, dimension‑modeled data (often called the data mart layer).

ADS (Application/Data Service) – reporting tables for end‑user consumption.

Modeling approaches include:

Inmon (hub‑and‑spoke) paradigm – top‑down, enterprise‑wide data warehouse.

Kimball paradigm – bottom‑up, bus architecture with dimensional modeling (star and snowflake schemas).

Other models such as Data Vault or Anchor are mentioned but not detailed.

Metric Dictionary (KPI)

KPI (Key Performance Indicator) is a quantifiable metric that reflects business performance. Metrics are categorized as:

Basic metrics – directly calculable (e.g., order count, transaction amount).

Derived metrics – built on basic metrics with additional dimensions (e.g., WeChat orders).

Calculated metrics – complex formulas that cannot be further decomposed (e.g., sell‑through rate, repurchase rate).

Maintaining a metric dictionary in a shared repository (often Excel, though a dedicated system is preferable) ensures consistent definitions and calculations.

Date Dimension

A date dimension is a static calendar table containing year, month, week, day, holiday, lunar calendar, and other derived attributes to support time‑based analysis.

Naming Conventions

Standardized naming reduces ambiguity. Examples include:

Table prefix: dwd_, dws_, ads_, bi_ followed by business domain and granularity.

Intermediate tables: mid_table_name_[0-9|dim].

Temporary tables: tmp_ prefix.

Manual tables: dwd_业务域_manual_xxx.

Dimension tables: dim_ prefix.

Data Governance

Data governance encompasses policies, standards, metadata management, data quality, impact analysis, and monitoring. It aims to treat data as an asset, ensure consistent definitions, and resolve issues such as poor data quality, divergent standards, and unclear downstream impact.

Incremental Loading

Full load copies all records; incremental load transfers only changed rows based on create_time and update_time fields, requiring a primary key for upserts.

Typical SQL patterns:

-- Full load usually: delete then insert
delete from tmp_a;
insert into tmp_a ...;

-- Or use insert overwrite (Hive/Impala)
insert overwrite table tmp_a select ...;

Incremental strategies include:

Row‑number window functions to keep the latest version per primary key.

Full join between incremental and historical tables.

Left join + union all.

Partitioned tables (e.g., daily partitions) with optional “link table” (slowly changing dimension) for change tracking.

Upstream/Downstream Agreements

Upstream systems must provide stable schemas, change logs, and timestamp fields ( create_time, update_time) as well as logical delete flags ( is_delete, is_valid). Downstream consumers (reports, BI tools) need clear documentation of table purposes, field meanings, and versioning.

Comment Standards

In DataWorks scripts, comments follow a structured format:

-- @name p_dwd_rack_machine
-- @description 货架宽表
-- @target rack.dwd_rack_machine
-- @source owo_ods.kylin__machine_release_his
-- @author yuguiyang 2017-12-25
-- @modify

These annotations capture task name, description, target table, source tables, author, and modification history, improving traceability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data data modeling ETL Data Governance

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.