How to Design and Review a Data Warehouse Model: A Complete Guide
This guide walks through a complete data warehouse model design and review process: revision records, project overview, business requirement analysis, conceptual, logical, and physical modeling, ETL design, exception handling, and acceptance criteria, with practical examples and templates.
Document Revision Record
V1.0 (2025-03-12) – Author: 思琪 – Initial draft – Reviewed by supervisor.
V1.1 (2025-03-15) – Author: 李佳 – Added ETL design details – Reviewed by supervisor.
Project Overview
Background
The business needs a data warehouse that supports analytical queries, integrates multiple source systems, and provides a unified data foundation for BI.
Key drivers: improve decision‑making efficiency, consolidate data from disparate systems.
Scope
Mandatory: subject areas, data sources, and target users (e.g., BI team, business departments).
Optional: scope boundaries, such as whether real-time processing or third-party data is included.
Business Requirement Analysis
Requirement List
RQ001 – Support multi‑dimensional sales performance analysis – Priority: High – Domain: Sales – Owner: 王某.
Key Metric Definitions
Metrics include sales amount, user retention rate, and order conversion rate; each metric must define its calculation logic and source tables.
Specify metric dimensions (e.g., product, region), time periods, and any modifiers.
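To keep metric definitions unambiguous, it helps to capture them in a machine-readable registry. The sketch below shows one minimal way to do this in Python; the MetricDefinition structure, the dws_traffic_order_agg table, and the conversion-rate SQL are illustrative assumptions, not definitions from this project.

```python
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    name: str                 # business name of the metric
    calculation: str          # calculation logic, expressed as SQL
    source_tables: list[str]  # upstream tables the metric reads from
    dimensions: list[str]     # dimensions the metric can be sliced by
    time_grain: str           # smallest supported time period

# Hypothetical registry entry for the order conversion rate.
order_conversion_rate = MetricDefinition(
    name="order_conversion_rate",
    calculation=(
        "SELECT dt, COUNT(DISTINCT order_id) * 1.0 / COUNT(DISTINCT visit_id) AS rate "
        "FROM dws_traffic_order_agg GROUP BY dt"
    ),
    source_tables=["dws_traffic_order_agg"],
    dimensions=["product", "region"],
    time_grain="day",
)
```

Storing definitions this way lets the review stage diff metric logic instead of relying on prose descriptions.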
Data Model Design
Data Architecture
Layered architecture: ODS (raw), DWD (clean), DWS (light aggregation), ADS (application).
A model diagram should make clear where each new model sits in the layers, what its granularity is, and which business module it belongs to.
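A naming convention makes each table's layer obvious at a glance. The snippet below sketches one common convention (ods_/dwd_/dws_/ads_ prefixes); the prefixes and the table_name helper are illustrative, not mandated by this guide.

```python
# Hypothetical layer-to-prefix mapping; a widely used convention,
# shown here only as an example.
LAYER_PREFIX = {
    "ODS": "ods_",  # raw copies of source tables
    "DWD": "dwd_",  # cleaned, deduplicated detail records
    "DWS": "dws_",  # lightly aggregated summary tables
    "ADS": "ads_",  # application-facing result tables
}

def table_name(layer: str, subject: str, entity: str) -> str:
    """Build a layered table name, e.g. table_name('DWD', 'sales', 'order') -> 'dwd_sales_order'."""
    return f"{LAYER_PREFIX[layer]}{subject}_{entity}"
```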
Subject‑Area Division
Separate subject areas such as Sales, Customer, Supply Chain, each covering relevant entities and attributes.
Conceptual Model
High‑level ER diagram describing core entities (e.g., Customer, Order, Product) and their relationships.
Logical Model
Star/Snowflake schema with dimension tables and fact tables.
Example dimension table:
Table name: dim_customer
Fields: customer_id (PK), name, region, create_time
Physical Model
Target database platforms may include Hive or ClickHouse.
Define partitioning strategy (e.g., by date), index design, and storage compression (e.g., ORC, ZSTD).
These guidelines are meant to be applied when the models are implemented in code.
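As a concrete illustration, the sketch below creates the dim_customer table from the logical model on a Hive target via PySpark, with a date partition and ORC storage. The dt partition column is an assumption, and ZSTD compression for ORC requires a reasonably recent Hive/ORC version.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dwh-physical-model")
    .enableHiveSupport()
    .getOrCreate()
)

# Partitioning by snapshot date and ORC/ZSTD storage follow the
# guidelines above; the dt column is an illustrative assumption.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id BIGINT COMMENT 'primary key in the source system',
        name        STRING,
        region      STRING,
        create_time TIMESTAMP
    )
    PARTITIONED BY (dt STRING COMMENT 'snapshot date, yyyy-MM-dd')
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZSTD')
""")
```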
ETL Design
Data Source Integration
Supported source types: relational databases, log files, REST APIs.
Extraction frequency can be full load or incremental (CDC).
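For incremental extraction, a simple and robust pattern is a watermark on a last-modified column. The sketch below assumes a DB-API source connection and an orders table with an updated_at column; both are illustrative.

```python
import sqlite3  # stand-in for any DB-API compatible source driver

def extract_incremental(conn, last_watermark: str):
    """Pull only rows changed since the previous run's high-water mark."""
    cur = conn.execute(
        "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Persist the new watermark only after the load commits, so a failed
    # batch is re-extracted on the next run.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark

# Illustrative usage against a local stand-in database.
conn = sqlite3.connect("source.db")
rows, watermark = extract_incremental(conn, "2025-03-11 00:00:00")
```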
Data Processing Flow
Cleaning rules: deduplication, null handling.
Transformation logic: aggregations, joins, derived columns.
Loading strategy: bulk insert into target layers.
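The sketch below shows what these cleaning, transformation, and loading rules might look like as a PySpark job moving data from ODS to DWD; the ods_orders/dwd_orders tables and their columns are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ods-to-dwd").enableHiveSupport().getOrCreate()

dwd_orders = (
    spark.table("ods_orders")
    .dropDuplicates(["order_id"])                        # cleaning: deduplication
    .na.fill({"amount": 0.0})                            # cleaning: null handling
    .withColumn("order_date", F.to_date("create_time"))  # transformation: derived column
)

# Loading: bulk insert into the target layer. Assumes dwd_orders already
# exists (see the physical model) with a matching column layout.
dwd_orders.write.insertInto("dwd_orders", overwrite=True)
```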
Typical pipeline:
Data source → ODS (raw) → DWD (clean) → DWS (light aggregation) → ADS (application)
Exception Handling
Data quality checks (DQC) such as primary‑key uniqueness.
Errors are logged; failed batches are retried according to a configurable policy.
Fallback and emergency procedures are defined for critical failures.
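A minimal sketch of these mechanisms, assuming a Spark session and illustrative configuration values: a primary-key uniqueness check that logs failures, plus a retry wrapper for failed batches.

```python
import logging
import time

log = logging.getLogger("dqc")

def check_pk_unique(spark, table: str, pk: str) -> bool:
    """DQC: primary-key uniqueness. Logs and fails the batch on duplicates."""
    total = spark.table(table).count()
    distinct = spark.table(table).select(pk).distinct().count()
    if total != distinct:
        log.error("DQC failed: %s has %d rows but %d distinct %s values",
                  table, total, distinct, pk)
        return False
    return True

def run_with_retry(job, retries: int = 3, backoff_s: int = 60):
    """Re-run a failed batch according to a configurable policy."""
    for attempt in range(1, retries + 1):
        try:
            return job()
        except Exception:
            log.exception("batch failed (attempt %d/%d)", attempt, retries)
            time.sleep(backoff_s)
    # After retries are exhausted, escalate to the fallback procedure.
    raise RuntimeError("batch exhausted retries; trigger fallback procedure")
```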
Review and Acceptance
Review Findings
Issue: redundant fields in a dimension table.
Suggestion: merge region and city fields.
Resolution: adjust logical model accordingly (owner: 张架, status: resolved).
Acceptance Criteria
Data quality: row counts in target tables must reconcile with the source, and key constraints (e.g., primary-key uniqueness) must hold.
Performance: query response time ≤ 2 seconds; resource usage within defined limits.
Delivery: meet SLA for timeliness and baseline monitoring.
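The count-reconciliation part of the data-quality criterion can be automated. The sketch below assumes date-partitioned source and target tables and a Spark session; all names are illustrative.

```python
def counts_match(spark, source_table: str, target_table: str, dt: str) -> bool:
    """Acceptance check: target row count must equal the source extract's."""
    src = spark.table(source_table).where(f"dt = '{dt}'").count()
    tgt = spark.table(target_table).where(f"dt = '{dt}'").count()
    return src == tgt
```

In production this would be parameterized and scheduled per partition rather than run ad hoc.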