Data Warehouse Architecture Overview: Layers, Sources, Modeling, Storage, and Management
This article explains the logical layered architecture of modern data warehouses, covering data sources, ODS, DW/DWS layers, collection, storage on HDFS, synchronization tools, dimensional modeling (star, snowflake, constellation), metadata management, and task scheduling and monitoring, highlighting best practices for scalable big‑data solutions.
In the Internet industry, besides massive data volumes, latency requirements are stringent, often demanding real‑time processing. The business also changes rapidly, so a data warehouse cannot be built once and for all with a top‑down approach; new services must be integrated quickly and obsolete ones retired easily.
Overall Architecture
The following diagram shows the logical layered architecture of a data warehouse:
Data Sources
Data sources, as the name suggests, are the origins of data. As an internet company grows, the number of data sources increases, spanning various business origins such as SDK event tracking and client‑side reporting.
ODS Layer
The source system tables are often stored unchanged in the ODS (Operational Data Store) layer, also called the staging area. It serves as the source for downstream processing and stores historical incremental or full data.
DW Layer
The detailed layer (DWD) and summary layer (DWS) are the core content of a data warehouse. Their data are generated from ODS via ETL, built according to Kimball’s dimensional modeling theory, and ensure dimensional consistency across sub‑domains through conformed dimensions and a data bus.
DWS Layer
The application summary layer aggregates the detailed data produced in the DWD layer on the Hadoop platform, then synchronizes the results to dedicated serving databases for downstream applications.
Data Collection
Data collection tasks gather data from various sources into storage, possibly performing simple cleaning.
A common example is user‑behavior data collection: SDK event tracking sends data to Kafka in real time, Spark performs lightweight cleaning, and the data is stored in HDFS as one of the data warehouse sources.
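As a rough illustration, the lightweight cleaning step can be sketched as a pure function applied to each raw event. In production this logic would run inside the Spark job; the field names (`user_id`, `event`, `ts`) are illustrative assumptions, not a fixed schema:

```python
import json

def clean_event(raw: str):
    """Lightweight cleaning for one SDK event: parse, validate, normalize.
    The required fields here (user_id, event, ts) are hypothetical."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # drop malformed records
    if not event.get("user_id") or not event.get("event"):
        return None  # drop events missing required fields
    event["event"] = event["event"].strip().lower()  # normalize event name
    event["ts"] = int(event.get("ts", 0))            # coerce timestamp to int
    return event

raw_batch = [
    '{"user_id": "u1", "event": " Click ", "ts": "1700000000"}',
    'not json',
    '{"event": "view"}',
]
cleaned = [e for e in (clean_event(r) for r in raw_batch) if e]
```

Only the first record survives cleaning; the malformed line and the event without a user ID are dropped before anything lands in HDFS.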
Data Storage
As companies scale, data volumes reach petabyte levels; traditional databases can no longer meet storage requirements. HDFS is currently the most suitable storage solution for data warehouses in big‑data environments.
For offline (batch) processing, Hive is the primary choice because of its rich data types, built‑in functions, high‑compression ORC/Parquet formats, and convenient SQL support, offering far higher development efficiency than hand‑written MapReduce. For real‑time processing, Flink is optimal, though at the time of writing it primarily offers Java and Scala APIs.
Data Synchronization
Data synchronization moves data between different storage systems. For example, many applications cannot directly read from HDFS for performance reasons, so aggregated data must be synced to other stores such as MySQL.
Sqoop can achieve this, but it is heavyweight, always launching a MapReduce job and requiring every Hadoop node to access the source database. Alibaba’s open‑source DataX provides a better solution.
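The synchronization step itself reduces to reading exported aggregates and batch‑inserting them into the serving store. A minimal sketch, using SQLite as a stand‑in for MySQL and an illustrative `daily_pv` table:

```python
import sqlite3

# Stand-in for a MySQL serving database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_pv (dt TEXT, page TEXT, pv INTEGER)")

# Rows as they might arrive from a Hive/HDFS export.
rows = [("2024-01-01", "/home", 1200), ("2024-01-01", "/search", 800)]

# Batched inserts keep round-trips low, which is the point of syncing
# aggregates out of HDFS into a store applications can query directly.
conn.executemany("INSERT INTO daily_pv VALUES (?, ?, ?)", rows)
conn.commit()

total = conn.execute("SELECT SUM(pv) FROM daily_pv").fetchone()[0]
```

Tools like DataX generalize exactly this reader/writer pattern across many source and target systems via pluggable connectors.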
Dimensional Modeling
Dimensional modeling is a method specifically for analytical databases, data warehouses, and data marts.
1. Star Schema
Star schema is the most common dimensional modeling approach; the diagram below shows its relational structure:
It consists of a fact table surrounded by dimension tables, with the following characteristics:
a. Dimension tables only relate to the fact table; they do not relate to each other.
b. Each dimension table has a single‑column primary key, which appears as a foreign key in the fact table.
c. The fact table is central, and dimensions are arranged around it in a star shape.
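The star layout can be made concrete with a tiny in‑memory example: fact rows carry foreign keys and measures, and each dimension is a single lookup away. All table contents below are invented for illustration:

```python
# Minimal star schema: one fact table, two dimension tables (illustrative data).
dim_product = {1: {"name": "phone", "category": "electronics"},
               2: {"name": "desk",  "category": "furniture"}}
dim_date = {20240101: {"month": "2024-01"}}

# Each fact row carries foreign keys into the dimensions plus a measure.
fact_sales = [
    {"product_key": 1, "date_key": 20240101, "amount": 500},
    {"product_key": 2, "date_key": 20240101, "amount": 300},
    {"product_key": 1, "date_key": 20240101, "amount": 200},
]

# A typical star-schema query: join facts to a dimension attribute, then group.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0) + row["amount"]
```

Because every dimension is one join from the fact table, queries stay simple and predictable, which is the main practical argument for the star schema.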
2. Snowflake Schema
Snowflake schema extends the star schema; each dimension table can further connect to sub‑dimension tables. The diagram below illustrates its structure:
Dimension tables in a star schema are larger and denormalized; snowflake splits them into smaller, normalized tables. Although more normalized, snowflake is rarely used because it increases development complexity, while redundancy is not a major issue in data warehouses.
3. Constellation Schema
Constellation schema is another extension of the star schema, allowing multiple fact tables:
The models described so far map multiple dimension tables to a single fact table, but in practice several fact tables often share dimensions. As the business matures, most modeling therefore adopts the constellation schema.
4. Comparison of the Three Schemas
Summarizing, the relationship among star, snowflake, and constellation schemas is shown below:
Snowflake further normalizes dimension tables, while constellation allows multiple fact tables within a star‑like layout.
Metadata Management
Metadata, defined as “data about data,” describes the structure, operations, and content of DW/BI systems.
Metadata spans the entire data warehouse lifecycle, driving automated and visualized development.
Metadata is categorized into two types: technical metadata and business metadata.
Technical metadata describes system technical details such as data structures, processing logic, ETL jobs, and front‑end presentation. Common technical metadata includes:
a. Distributed compute storage metadata (tables, columns, partitions, etc.).
b. Distributed compute runtime metadata (job types, input/output, parameters, execution time, etc.).
c. Scheduling task metadata (input/output fields, dependency types, and relationships).
d. Data quality and operations metadata (monitoring rules, alerts, quality metrics, and failures).
e. Data‑application metadata (configurations and runtime information for reports and data products).
Business metadata describes business concepts, relationships, and rules, including terminology, processes, and metrics. Examples are dimension and attribute definitions, business processes, and key performance indicators, which help teams manage and use data effectively.
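A metadata registry can start as something very small: a typed record per table plus a lookup. The fields below (layer, columns, partition keys, owner) are an illustrative subset, not a complete metadata model:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Technical metadata for one warehouse table (fields are illustrative)."""
    name: str
    layer: str                         # e.g. ODS / DWD / DWS
    columns: dict                      # column name -> type
    partition_keys: list = field(default_factory=list)
    owner: str = ""                    # business metadata: accountable owner

# Central registry keyed by table name.
registry = {}

def register(meta: TableMetadata):
    registry[meta.name] = meta

register(TableMetadata(
    name="dwd_user_events",
    layer="DWD",
    columns={"user_id": "string", "event": "string", "ts": "bigint"},
    partition_keys=["dt"],
    owner="growth-team",
))
```

Even this minimal shape already supports the automation the section mentions: code generation, lineage lookups, and visual catalogs all start from a queryable registry like this.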
Task Scheduling and Monitoring
Data warehouse construction involves many programs and tasks (data collection, synchronization, cleaning, analysis, etc.) that require not only timed scheduling but also complex dependency management.
For example, analysis tasks must wait for corresponding collection tasks to finish; synchronization tasks must wait for analysis tasks. Therefore, a robust scheduling and monitoring system is essential as the central hub that dispatches and oversees all tasks.
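The collect → analyze → sync chain above is just a dependency DAG, and dispatching tasks in a valid order is a topological sort. A minimal sketch with Python's standard‑library `graphlib` (task names are illustrative):

```python
from graphlib import TopologicalSorter

# Task dependency graph: task -> set of upstream tasks it must wait for.
# Names mirror the collect -> analyze -> sync chain described above.
deps = {
    "collect_events": set(),
    "analyze_events": {"collect_events"},
    "sync_to_mysql":  {"analyze_events"},
}

# A scheduler must dispatch tasks in dependency order; static_order()
# yields one valid linearization of the DAG.
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler (Airflow, DolphinScheduler, etc.) layers timing, retries, and alerting on top, but this ordering constraint is the core of the dependency management the section describes.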
Conclusion
Building a data warehouse is a comprehensive technical effort, and for enterprises with complex business, it requires a dedicated team working closely with business stakeholders.
Thus, an excellent data‑warehouse modeling team needs solid technical expertise and a clear, thorough understanding of real business needs.
Moreover, architecture should prioritize simplicity and stability over adding newer technologies, as long as the solution meets the requirements.