How to Build a Scalable Data Warehouse: Theory, Architecture, and Best Practices

This article outlines practical approaches to building a data warehouse: dimensional modeling, layered architecture, capability building, and real-time and batch processing with Hive, Spark, Flink, and Iceberg. It also discusses governance, security, and two future trends: extracting business value from data and real-time metrics.

Data Thinking Notes

Aug 15, 2024

Introduction

This article describes how a data warehouse is built within a data middle platform. It covers common dimensional-modeling and layering theory, the evolution of the warehouse architecture, and the capabilities built up along the way.

Data Warehouse Construction Theory

Construction starts with business analysis to identify core logic and related tables, then abstracts business and data flows into subject domains. Fact and dimension tables are defined, metrics are aligned with business goals, and dimensional modeling is applied with attention to layering and standards. Physical implementation focuses on development standards, deliverables, and quality.

Data Warehouse Layering

The lowest layer is ODS (source‑aligned data). Above ODS is the DW layer, split into DWD (basic cleaned data, internal use) and DWM (common data for reuse). Additional layers include DIM (dimension), DM (wide tables), ADS (application data), and TMP (temporary tables).
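The layering can also be enforced mechanically. The sketch below is a minimal illustration, assuming a hypothetical convention in which every table name starts with its layer prefix (for example `dwd_order_detail`). The numeric ranks and the rule that a table may only read from the same or a lower layer are assumptions for illustration, not rules stated in the article. TMP tables are exempt from the check.

```python
# Layer prefixes from the article. The ranks are an assumed ordering from
# source-aligned (low) to application-facing (high).
LAYER_RANK = {"ods": 0, "dwd": 1, "dim": 1, "dwm": 2, "dm": 3, "ads": 4}


def layer_of(table: str) -> str:
    """Return the layer prefix of a table name such as 'dwd_order_detail'."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in LAYER_RANK and prefix != "tmp":
        raise ValueError(f"unknown layer prefix in table name: {table!r}")
    return prefix


def invalid_dependencies(target: str, sources: list[str]) -> list[str]:
    """List the sources that sit in a higher layer than the target.

    Temporary tables are skipped on both sides of the dependency.
    """
    target_layer = layer_of(target)
    if target_layer == "tmp":
        return []
    flagged = []
    for source in sources:
        source_layer = layer_of(source)
        if source_layer == "tmp":
            continue
        if LAYER_RANK[source_layer] > LAYER_RANK[target_layer]:
            flagged.append(source)
    return flagged
```

A check like this could run in code review or in the scheduler before a job is registered, so that a DWD table reading from an ADS table is caught early.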

Modeling Principles

High cohesion, low coupling: Group closely related data into a single logical or physical model to reduce dependencies between models.

Public logic sinking: Place shared business logic in the DWM layer so that downstream consumers do not have to reimplement it or deal with its complexity.

Cost-performance balance: Accept some data redundancy in exchange for faster queries, for example in hierarchical region tables.

Consistency: Keep field meanings and naming conventions uniform across the warehouse.

Data rollback: Keep processing logic deterministic, so that rerunning a scheduled job for a past period reproduces the same historical result.
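The data rollback principle is easiest to satisfy when each scheduled run overwrites its own partition instead of appending to it. The following is a minimal sketch under that assumption; the table is modeled as an in-memory dict keyed by date, and `run_daily_job`, the `dt`, `region`, and `amount` fields, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def run_daily_job(table: dict, source_rows: list, dt: str) -> None:
    """Rebuild the partition for `dt` from scratch.

    Replacing the whole partition (instead of appending) makes the job
    safe to rerun: the result depends only on the inputs for `dt`.
    """
    totals = defaultdict(float)
    for row in source_rows:
        if row["dt"] == dt:
            totals[row["region"]] += row["amount"]
    # Overwrite semantics, comparable to INSERT OVERWRITE on one partition.
    table[dt] = dict(totals)


orders = [
    {"dt": "2024-08-14", "region": "north", "amount": 10.0},
    {"dt": "2024-08-14", "region": "north", "amount": 5.0},
    {"dt": "2024-08-15", "region": "south", "amount": 7.0},
]
dws = {}
run_daily_job(dws, orders, "2024-08-14")
first_run = dict(dws)
run_daily_job(dws, orders, "2024-08-14")  # rerun of the same date
assert dws == first_run  # the rerun did not double-count anything
```

An append-based job would have doubled the `north` total on the second run, which is exactly the kind of drift this principle is meant to prevent.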

Metric Management

The OSM (Objective, Strategy, Measurement) model aligns company goals with warehouse metrics. Metrics are built from Hive offline data, MySQL online data, and analytical data, and are assembled into a semantic model that is reviewed before being published to the corporate metric library. Users can look up metric definitions, lineage, and dimensions in the data encyclopedia, which integrates with OA (office automation) tools for convenient access.
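One way to picture the review-then-publish flow is as a small data structure. The sketch below is illustrative only: `MetricDefinition`, `MetricLibrary`, and all of their fields are hypothetical names, and the in-memory library stands in for whatever system the company actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class MetricDefinition:
    name: str
    objective: str      # O: the business goal the metric serves
    strategy: str       # S: the action taken to reach that goal
    measurement: str    # M: how the metric is computed
    source: str         # e.g. "hive" (offline) or "mysql" (online)
    dimensions: list[str] = field(default_factory=list)
    reviewed: bool = False


class MetricLibrary:
    """In-memory stand-in for a corporate metric library."""

    def __init__(self) -> None:
        self._metrics: dict[str, MetricDefinition] = {}

    def publish(self, metric: MetricDefinition) -> None:
        # Only reviewed definitions may be published, and names must be unique.
        if not metric.reviewed:
            raise ValueError(f"metric {metric.name!r} has not passed review")
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} is already published")
        self._metrics[metric.name] = metric

    def lookup(self, name: str) -> MetricDefinition:
        """Return a published definition, raising KeyError if it is unknown."""
        return self._metrics[name]
```

Rejecting duplicate names at publish time is one simple way to keep a single agreed definition per metric.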

Data Warehouse Architecture

The warehouse adopts a Lambda architecture. Batch processing uses Spark + Hive for offline data, while streaming uses Flink + Talos. DW and DM layers are accelerated with OLAP, and the results are unified for downstream consumption.
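The unification step of a Lambda architecture can be sketched as a merge of two views. The example below is a simplified illustration, not the article's implementation: it assumes an additive metric, a batch view that covers everything up to a cutoff, and a speed view that covers only events after that cutoff. The function name `serve_metric` is hypothetical.

```python
def serve_metric(batch_view: dict, speed_view: dict) -> dict:
    """Combine the batch and speed layers into a single serving view.

    batch_view: totals per key computed by the offline job up to its cutoff.
    speed_view: increments per key seen by the streaming job after that cutoff.
    """
    merged = dict(batch_view)  # copy so the inputs are left untouched
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


batch = {"north": 100, "south": 40}
speed = {"south": 3, "east": 1}
print(serve_metric(batch, speed))
```

Non-additive metrics such as distinct counts cannot be merged this way, which is one reason running two separate code paths is costly to maintain.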

Real‑Time Stream State Expiration Issue

In real-time order processing, the order fact table changes frequently while the order detail table is static. Stream state is kept only for a fixed retention period, so a status change that arrives after the state has expired can no longer be matched, and the affected metrics become inaccurate. The solution adds an offline compensation stream that identifies the expired data, merges it with the real-time stream, deduplicates the result, and forwards the corrected data downstream.
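The merge-and-deduplicate step can be sketched in a few lines. This is a minimal illustration that assumes each record carries a hypothetical `order_id` and a comparable `update_time`, and that the latest update should win. The article does not describe the actual deduplication key or rule.

```python
from typing import Iterable


def merge_and_deduplicate(realtime: Iterable[dict], offline_fix: Iterable[dict]) -> list[dict]:
    """Union both streams and keep the latest record per order_id.

    Ties on update_time keep the record seen first, so real-time records
    win over offline corrections that carry the same timestamp.
    """
    latest: dict[str, dict] = {}
    for record in list(realtime) + list(offline_fix):
        current = latest.get(record["order_id"])
        if current is None or record["update_time"] > current["update_time"]:
            latest[record["order_id"]] = record
    return sorted(latest.values(), key=lambda r: r["order_id"])


realtime_rows = [
    {"order_id": "A", "status": "paid", "update_time": 1},
    {"order_id": "B", "status": "paid", "update_time": 2},
]
# Correction for order A, whose late status change was missed by the stream.
offline_rows = [{"order_id": "A", "status": "refunded", "update_time": 9}]
corrected = merge_and_deduplicate(realtime_rows, offline_rows)
```

Here `corrected` holds one record per order, with order A carrying the late `refunded` status recovered by the offline path.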

Iceberg‑Based Batch‑Stream Integration

Iceberg replaces both Hive on the offline path and Talos on the real-time path. It supports structured and unstructured data, provides transactional writes, and allows in-place data updates via MERGE INTO. However, Iceberg commits are tied to checkpoints, so data latency is bounded by the checkpoint interval, which rules out true second-level real-time processing.
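As an illustration of the MERGE INTO upsert pattern, the sketch below renders a statement in the form accepted by Spark SQL on Iceberg tables. The helper `build_merge_sql` and the table and column names are hypothetical. Identifiers are interpolated without escaping, so this is only suitable for trusted, hard-coded names.

```python
def build_merge_sql(target: str, source: str, key: str, columns: list[str]) -> str:
    """Render a MERGE INTO statement that upserts `source` rows into `target`.

    Rows that match on `key` are updated; all other rows are inserted.
    """
    updates = ", ".join(f"t.{c} = s.{c}" for c in columns if c != key)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_merge_sql(
    target="warehouse.dwd_order",   # hypothetical Iceberg table
    source="order_updates",         # hypothetical staging view
    key="order_id",
    columns=["order_id", "status", "amount"],
)
print(sql)
```

With a Spark session that has the Iceberg SQL extensions configured, the resulting string could be passed to `spark.sql(...)`.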

Data Warehouse Capability Building

A unified data architecture supports minute‑level batch‑stream processing with Iceberg and second‑level processing with Flink + Talos. Standards cover table naming, field naming, layer definitions, and DQC checks (integrity, consistency, null‑rate). Security measures include compliance, data classification, least‑privilege access, confidentiality agreements, and cluster isolation, with GDPR compliance for European data.
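Two of the standards mentioned above, table naming and the null-rate check, lend themselves to simple automated rules. The sketch below is a hedged example: the naming pattern (`<layer>_<snake_case_name>`) and the threshold are assumptions chosen for illustration, not the company's actual standards.

```python
import re

# Hypothetical naming rule: <layer>_<snake_case_name>, using the article's layer prefixes.
TABLE_NAME_PATTERN = re.compile(r"(ods|dwd|dwm|dim|dm|ads|tmp)_[a-z][a-z0-9_]*")


def is_valid_table_name(name: str) -> bool:
    """Check a table name against the assumed naming convention."""
    return TABLE_NAME_PATTERN.fullmatch(name) is not None


def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where `column` is missing or None (0.0 for no rows)."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def passes_null_rate_check(rows: list[dict], column: str, threshold: float) -> bool:
    """A DQC rule passes when the null rate does not exceed the threshold."""
    return null_rate(rows, column) <= threshold


sample = [{"user_id": 1}, {"user_id": None}, {}, {"user_id": 2}]
print(is_valid_table_name("dwd_order_detail"))        # naming check
print(null_rate(sample, "user_id"))                   # 2 of 4 rows are null
print(passes_null_rate_check(sample, "user_id", 0.1))
```

Integrity and consistency checks could follow the same shape: a function that computes a measure and a rule that compares it with an agreed threshold.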

Metric Application via Data Encyclopedia

The data encyclopedia presents metric definitions, base information, dimension breakdowns, and lineage, enabling downstream users to quickly understand and consume metrics, with some metrics linked to corporate dashboards.

Summary and Outlook

After years of development, the company now operates a data warehouse widely used by operations and management. The team continuously refines architecture and standards, engages with industry best practices, and looks ahead to two trends: turning data into business value and achieving real‑time metrics.
