
Comprehensive Overview of Data Warehouses: Concepts, Evolution, Architecture, and Real‑time vs Offline Practices

This article provides a thorough introduction to data warehouses, traces their evolution, explains construction methodologies, compares offline, Lambda, and Kappa architectures, and presents real‑time warehouse case studies from Alibaba, Meituan, Xiaomi, Netflix, and OPPO, highlighting practical implementation details and challenges.

DataFunTalk

1. Data Warehouse Overview: A data warehouse is a subject‑oriented, integrated, non‑volatile, time‑variant collection of data that supports management decision‑making.

2. Development History: As data volume, data variety, and the demand for real‑time decisions have grown, warehouses have evolved to handle unstructured logs and IoT streams, which in turn requires more capable ETL and storage solutions.

3. Construction Methodology: Warehouse construction emphasizes subject‑oriented modeling and multidimensional analysis. Business processes are modeled as facts and dimensions, typically in a star schema, yielding fact tables and dimension tables that serve analytical queries.
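To make the fact/dimension split concrete, here is a minimal star-schema sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_date, dim_product), the columns, and the sample rows are all hypothetical and chosen only for illustration; they do not come from any system described in the article.

```python
import sqlite3

# In-memory database; all table names, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- The fact table holds measures plus foreign keys into each dimension.
CREATE TABLE fact_sales (
    date_id INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER,
    amount REAL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (1, "2024-01-01", "2024-01"),
    (2, "2024-01-02", "2024-01"),
    (3, "2024-02-01", "2024-02"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (10, "laptop", "electronics"),
    (11, "phone", "electronics"),
    (12, "desk", "furniture"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 10, 1, 1000.0),
    (1, 12, 2, 300.0),
    (2, 11, 3, 1500.0),
    (3, 10, 1, 1000.0),
    (3, 12, 1, 150.0),
])

def sales_by_month_and_category(connection):
    """Join the fact table to both dimensions and aggregate the amount measure."""
    return connection.execute("""
        SELECT d.month, p.category, SUM(f.amount) AS total
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY d.month, p.category
        ORDER BY d.month, p.category
    """).fetchall()

for row in sales_by_month_and_category(conn):
    print(row)
```

The query shape is the point: measures live in the fact table, descriptive attributes live in the dimensions, and an analytical question becomes a join plus a GROUP BY over dimension attributes.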

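4. Architecture Evolution: The architecture has moved from the classic Inmon‑style warehouse and the offline big‑data architecture, to Lambda (an offline batch layer plus a real‑time layer), and finally to Kappa (stream‑only, with reprocessing done by replaying the log). Each step targets latency, resource usage, or the code duplication that comes from maintaining separate batch and streaming paths.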
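The difference between Lambda and Kappa can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how any production system is built: EVENT_LOG, batch_layer, stream_update, lambda_view, and kappa_view are hypothetical names, and a Python list stands in for a retained message log.

```python
from collections import Counter

# Hypothetical retained event log: (event_id, page) click events in arrival order.
EVENT_LOG = [(1, "home"), (2, "cart"), (3, "home"), (4, "pay"), (5, "home")]

def batch_layer(events):
    """Batch path: recompute the whole view from the events it is given."""
    return Counter(page for _, page in events)

def stream_update(state, event):
    """Streaming path: fold one event into the running state."""
    _, page = event
    state[page] += 1
    return state

def lambda_view(log, batch_cutoff):
    """Lambda: serve the batch view (up to the cutoff) merged with the speed view (after it).

    The counting logic exists twice, once in batch_layer and once in stream_update,
    which is the duplication that Kappa is meant to remove.
    """
    batch_view = batch_layer([e for e in log if e[0] <= batch_cutoff])
    speed_view = Counter()
    for event in log:
        if event[0] > batch_cutoff:
            stream_update(speed_view, event)
    return batch_view + speed_view

def kappa_view(log, from_offset=0):
    """Kappa: a single streaming job; reprocessing is a replay of the log from an offset."""
    state = Counter()
    for event in log[from_offset:]:
        stream_update(state, event)
    return state

print(lambda_view(EVENT_LOG, batch_cutoff=3))
print(kappa_view(EVENT_LOG))
```

Both views agree on this log. The trade-off the sketch illustrates is that Lambda needs two implementations kept in sync, while Kappa needs the log to be retained long enough to replay.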

5. Real‑time Warehouse Case (Cainiao): Cainiao's real‑time warehouse uses a two‑layer model consisting of a DWD detail layer and a DWS aggregation layer. Data flows between the layers through message queues; results are served from ADS for OLAP queries and from HBase for key‑value lookups. The case also covers preparation for high‑traffic events such as Double‑11.
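The two-layer flow described above can be illustrated with a small in-process sketch. Everything here is an assumption made for illustration: deques stand in for message queues, a plain dict stands in for the key-value store, and the record fields (order_id, warehouse, status, weight_g) are hypothetical rather than Cainiao's actual schema.

```python
from collections import defaultdict, deque

# Hypothetical raw events arriving on a message queue (a deque stands in for the queue).
raw_queue = deque([
    {"order_id": "o1", "warehouse": "HZ", "status": "shipped", "weight_g": "1200"},
    {"order_id": "o2", "warehouse": "HZ", "status": "SHIPPED ", "weight_g": "800"},
    {"order_id": "o3", "warehouse": "SH", "status": "shipped", "weight_g": "500"},
    {"order_id": None, "warehouse": "SH", "status": "shipped", "weight_g": "100"},
])

def to_dwd(raw):
    """DWD (detail layer): clean and normalize one record; return None if it is unusable."""
    if not raw.get("order_id"):
        return None  # drop records without a key
    return {
        "order_id": raw["order_id"],
        "warehouse": raw["warehouse"],
        "status": raw["status"].strip().lower(),
        "weight_g": int(raw["weight_g"]),
    }

def run_pipeline(source):
    """Move records raw -> DWD -> DWS and return the aggregates keyed for KV lookup."""
    dwd_queue = deque()  # stands in for the queue between the two layers
    while source:
        record = to_dwd(source.popleft())
        if record is not None:
            dwd_queue.append(record)

    # DWS (aggregation layer): per-warehouse order count and total weight.
    dws = defaultdict(lambda: {"orders": 0, "weight_g": 0})
    while dwd_queue:
        record = dwd_queue.popleft()
        aggregate = dws[record["warehouse"]]
        aggregate["orders"] += 1
        aggregate["weight_g"] += record["weight_g"]
    return dict(dws)

kv_store = run_pipeline(raw_queue)  # a dict standing in for the key-value serving store
print(kv_store["HZ"])
```

Keeping the detail layer separate from the aggregation layer means several DWS aggregations can consume the same cleaned stream without repeating the cleaning step.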

6. Offline vs Real‑time Comparison: The two differ in architecture (Kappa versus the offline big‑data architecture) but follow similar modeling approaches. Real‑time systems are considerably more sensitive to fluctuations in data volume and have stricter stability requirements.

7. Flink‑Based Real‑time Warehouse Practices: This part summarizes talks from Flink Forward Asia on large‑scale implementations at Meituan, Xiaomi, Netflix, and OPPO, focusing on streaming ingestion, stream processing, and the shift from batch to unified stream processing.

Tags: big data, Flink, real-time analytics, data warehouse, ETL, Lambda architecture, Kappa architecture
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
