Designing Data Warehouse Layers: Principles, Models, and Practical Practices

This article explains why data warehouses should be layered, describes the classic ODS‑DW‑APP model, details each layer’s purpose and implementation techniques, presents an improved layering scheme with dimension and temporary tables, and answers common questions about parallel DWS and DWD processing.

Architecture Digest

May 25, 2017

In recent conversations, the author has often heard people say they work on "big data" without recognizing the need for a well-designed data warehouse, a discipline that covers ETL, scheduling, and modeling as one complete body of theory.

The article focuses on one crucial aspect of data warehouses: how to design data layers, and provides references for further reading.

The discussion targets three typical scenarios: early-stage data projects that feed raw data directly into business applications, mature projects where data usage has become chaotic, and cases where repeated calculations waste resources and performance needs optimization.

Structure of the article: why layering is needed, classic layering models and their functions, two concrete design examples, and practical suggestions.

Layering a data warehouse brings several benefits:

Clear data structure – each layer has a defined scope, making tables easier to locate and use.

Data lineage tracking – quickly locate the source of a problem.

Reduce duplicate development – shared intermediate tables cut down repeated calculations.

Simplify complex problems – break tasks into single‑step layers for easier maintenance.

Mask raw data anomalies.

Isolate business impact – changes in business logic don’t force full data re‑ingestion.

Without layering, table dependencies tend to grow into a tangled web; with layering, they form a clean, well-structured hierarchy. The original article contrasts the two cases with a pair of images.
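To make the lineage benefit concrete, here is a minimal sketch of upstream tracing over a table dependency map. The table names and the `UPSTREAM` mapping are hypothetical and not taken from the article. A real warehouse would more likely derive this information from its scheduler or metadata system.

```python
from collections import deque

# Hypothetical table-level dependencies: each table maps to the tables it reads from.
UPSTREAM = {
    "app_daily_report": ["dws_user_daily"],
    "dws_user_daily": ["ods_orders", "dim_user"],
    "dim_user": ["ods_users"],
    "ods_orders": [],
    "ods_users": [],
}

def trace_upstream(table, upstream=UPSTREAM):
    """Return every table that `table` depends on, nearest dependencies first."""
    seen, order = set(), []
    queue = deque(upstream.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        order.append(current)
        queue.extend(upstream.get(current, []))
    return order

lineage = trace_upstream("app_daily_report")
print(lineage)
```

When a report looks wrong, a walk like this narrows the search to the handful of tables that actually feed it.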

The theoretical three‑tier model consists of:

ODS (Operational Data Store): the layer closest to the source systems, storing data after extraction, cleaning, and loading, usually aligned with source business domains.

DW (Data Warehouse): the core layer, where data is organized by subject using dimensions, facts, metrics (indicators), and granularity.

APP (Data Product): the final layer, providing data for analytics, reports, or downstream services, often stored in Elasticsearch, MySQL, Hive, or Druid.
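The three tiers can be sketched end to end with an in-memory SQLite database. This is only an illustration: the table names, columns, and sample rows are made up, SQLite stands in for the real storage engines, and filtering the anomalous row at the DW step is an assumption rather than something the article prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# ODS: rows kept close to the source system's shape (hypothetical order records).
cur.execute(
    "CREATE TABLE ods_orders (order_id INTEGER, user_id INTEGER, amount REAL, order_date TEXT)"
)
cur.executemany(
    "INSERT INTO ods_orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, 20.0, "2017-05-01"),
        (2, 10, 5.0, "2017-05-01"),
        (3, 11, 7.5, "2017-05-02"),
        (4, 11, None, "2017-05-02"),  # anomalous row: missing amount
    ],
)

# DW: a fact table for the "orders" subject, with the anomalous row filtered out.
cur.execute("""
    CREATE TABLE dw_fact_orders AS
    SELECT order_id, user_id, amount, order_date
    FROM ods_orders
    WHERE amount IS NOT NULL
""")

# APP: a small result table shaped for a daily revenue report.
cur.execute("""
    CREATE TABLE app_daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue, COUNT(*) AS order_count
    FROM dw_fact_orders
    GROUP BY order_date
""")

daily_revenue = cur.execute(
    "SELECT order_date, revenue, order_count FROM app_daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Each step reads only from the layer before it, so a change to the report logic touches `app_daily_revenue` alone and leaves the ODS data as loaded.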

A diagram of this model is shown below:

In practice, data flows from the sources into ODS through tools such as Sqoop or Canal for database extraction, and Flume, Spark Streaming, Storm, or Kafka for log ingestion. A flow diagram follows:

From ODS/DW to the APP layer, two patterns exist: scheduled batch jobs (using Hive, Spark, or MapReduce, and writing results to Hive, HBase, MySQL, Elasticsearch, or Redis) and real-time streams (using Spark Streaming, Storm, or Flink, and writing to Elasticsearch, HBase, or Redis).

An earlier six‑layer design (including a Buffer layer) is described, detailing each layer’s concept, data generation method, storage format (Parquet, Impala tables), retention policy, and naming conventions.

To make the architecture more elegant, the author proposes removing the Buffer layer, merging the Light Summary (DWS) and Subject (DM) layers, and adding dedicated Dimension (DIM) and Temporary (TMP) layers. The revised diagram is shown below:
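A layering scheme like this is easier to keep tidy when a table's layer can be read from its name. This summary does not reproduce the article's actual naming conventions, so the sketch below assumes a hypothetical rule: every table name starts with its layer code followed by an underscore. The set of layer codes is also an assumption, pieced together from the layers mentioned in the text.

```python
# Hypothetical convention: table names begin with a layer code and an underscore.
# The set of codes is assumed from the layers named in the text.
LAYER_PREFIXES = {"ods", "dwd", "dws", "dim", "tmp", "app"}

def layer_of(table_name):
    """Return the layer code for a table name, or raise if the prefix is unknown."""
    prefix = table_name.split("_", 1)[0].lower()
    if prefix not in LAYER_PREFIXES:
        raise ValueError(f"{table_name!r} does not start with a known layer prefix")
    return prefix

def group_by_layer(table_names):
    """Group table names by layer code, preserving the input order within each layer."""
    groups = {}
    for name in table_names:
        groups.setdefault(layer_of(name), []).append(name)
    return groups

grouped = group_by_layer(["ods_orders", "dim_user", "tmp_user_dedup", "dws_user_daily"])
print(grouped)
```

A check like `layer_of` could run when a table is created, so that a misnamed table is rejected before anything depends on it.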

The Q&A section clarifies that DWS and DWD run in parallel rather than sequentially, and that DWS does not depend on DWD: DWS handles lightweight aggregations, while DWD stores cleaned detail data.
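The parallel relationship can be illustrated with a small scheduling sketch that groups layers into stages using the standard library's `graphlib` (Python 3.9+). The dependency map is an assumption for illustration: both DWD and DWS read from ODS, and an APP job is assumed to read from both.

```python
from graphlib import TopologicalSorter

# Each layer maps to the layers it reads from (assumed for illustration).
DEPENDS_ON = {
    "ods": set(),
    "dwd": {"ods"},
    "dws": {"ods"},          # reads ODS directly, not DWD
    "app": {"dwd", "dws"},   # assumption: the product layer uses both
}

def schedule_stages(depends_on):
    """Group nodes into stages; nodes in the same stage do not depend on each other."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are finished
        stages.append(ready)
        sorter.done(*ready)
    return stages

stages = schedule_stages(DEPENDS_ON)
print(stages)
```

Because neither DWD nor DWS lists the other as an input, they land in the same stage and a scheduler is free to run them at the same time.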

In summary, proper data‑layering is vital for clear data lineage, feature generation, and metadata management, and should be considered early in any data‑warehouse project.
