
How to Build a Real-Time Data Warehouse: Architectures, Challenges, and Industry Practices

This article examines the growing demand for real‑time data warehouses, compares mature streaming frameworks, evaluates Lambda, Kappa and hybrid architectures, reviews industry implementations from Didi and OPPO, and proposes a standard‑layer + stream + data‑lake solution with Apache Paimon, Hudi, and Iceberg.

WeiLi Technology Team

Real-Time Data Warehouse Construction Background

Companies increasingly need real‑time data for product decisions and internal governance, but traditional offline warehouses operate on a T+1 schedule with daily batch jobs, which cannot meet low‑latency requirements.

1. Urgent Real‑Time Demand

Business scenarios now require sub‑hour or second‑level data freshness, making the classic offline approach insufficient.

2. Maturing Real‑Time Technologies

Streaming frameworks have evolved through three generations—Storm, Spark Streaming, and Flink—allowing SQL‑based development and tighter integration with offline warehouse designs. Development platforms also provide better support for debugging and operations, reducing costs.

Purpose of Building a Real‑Time Warehouse

1. Solve Traditional Warehouse Issues

The goal is to combine classic warehouse theory with streaming techniques to overcome the low timeliness of offline data.

Business decisions increasingly depend on real‑time data.

Lack of standards for real‑time data leads to poor usability and resource waste.

Platform tools now support real‑time development, lowering costs.

2. Real‑Time Warehouse Use Cases

Real‑time OLAP analysis / interactive queries

Real‑time dashboards

Real‑time business monitoring

Real‑time metric aggregation

Real‑time data service APIs

Real‑Time Warehouse Architecture Design

The original data‑warehouse concept was proposed by Inmon in 1990. With the explosion of data, big‑data tools replaced classic warehouse components, forming an offline big‑data architecture.

As real‑time requirements grew, an acceleration layer was added on top of the offline architecture, creating the Lambda architecture.

Later, with more event‑driven sources, the architecture shifted to a Kappa model that treats streaming as the core.

1. Lambda Architecture

To meet real‑time metric needs, a streaming pipeline is added to the offline warehouse, ingesting data via message queues and performing incremental calculations before merging with batch results.

Maintains two codebases for batch and stream processing.

Uses stream engines (e.g., Flink) for real‑time data and batch engines (e.g., Spark) for offline data.

Duplicate logic increases resource consumption.

Requires many components (Hadoop, Hive, Spark, Oozie, Flink, Kafka, Kudu, etc.), raising operational complexity.
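The defining trait of Lambda is that a serving layer merges the complete-but-stale batch view with the fresh-but-partial speed-layer view. A minimal Python sketch of that merge step (all names and figures here are illustrative, not from any specific framework):

```python
# Minimal sketch of Lambda-style serving: merge a daily batch view
# (complete up to T-1) with speed-layer increments computed since the
# last batch run. Keys and values are hypothetical per-metric counts.

def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Combine per-key counts; the speed layer only holds today's deltas."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"orders": 10_000, "cancellations": 420}  # e.g., from a Spark batch job
speed_view = {"orders": 137, "refunds": 3}             # e.g., from a Flink stream job
print(merge_views(batch_view, speed_view))
```

The duplicated-logic problem the bullets mention follows directly: the batch job and the stream job must both compute the same metrics, so any change must be implemented and validated twice.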


2. Kappa Architecture

Kappa simplifies Lambda by converting all sources to streams and using a single streaming engine for both batch and real‑time processing, reducing operational overhead.

Kappa is essentially Lambda without the batch part.

Historical reprocessing throughput is lower than batch but can be mitigated by adding resources.

Challenges include data loss, out‑of‑order data, and schema synchronization.

Migrating legacy offline data is also a concern.
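Kappa's single-codebase idea is that historical reprocessing is just replaying the retained log through the same streaming logic from offset zero. A toy sketch of that idea, with hypothetical event fields:

```python
# Illustrative sketch of Kappa-style reprocessing: one streaming function
# serves both live consumption and historical replay, so there is a single
# codebase. Event fields are hypothetical.

from typing import Iterable

def process(events: Iterable[dict]) -> dict:
    """One streaming job: aggregate event counts per city."""
    counts: dict = {}
    for event in events:
        counts[event["city"]] = counts.get(event["city"], 0) + 1
    return counts

# The retained log stands in for a Kafka topic with long retention.
log = [{"city": "Beijing"}, {"city": "Shenzhen"}, {"city": "Beijing"}]
live_view = process(log)      # normal, continuous consumption
rebuilt_view = process(log)   # "reprocessing" = re-running from offset 0
assert live_view == rebuilt_view
```

This also shows why log retention matters in Kappa: you can only replay as far back as the message queue keeps data, which is why migrating legacy offline history is a real concern.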

3. Hybrid Architecture

Completely replacing offline ETL with streaming is risky; many organizations adopt a hybrid approach, using both Lambda and Kappa where appropriate.

Lambda
Kappa
Hybrid

4. Deep Dive into Real‑Time Warehouse Architecture

Real‑Time Query Requirements

Understanding industry demands helps evaluate design trade‑offs and maximize value under existing constraints.

Real‑time scenarios are split into two categories: sub‑second/millisecond monitoring and alerting, and minute‑level reporting (e.g., 10‑30 minutes).

Common solutions include:

Lambda architecture

Kappa architecture

Standard layer + stream + batch

Standard layer + stream + data lake

All‑in‑one MPP databases (e.g., ClickHouse, Doris)

Solution 1: Kappa

Data from multiple sources is sent to Kafka, processed by Flink, and written to MySQL/Elasticsearch/HBase/Druid for downstream queries.

Advantages: simple design, real‑time data.

Disadvantages: each new report requires a new Flink job; large data volumes demand sizable Flink clusters and high memory usage.

Solution 2: Standard Layer + Stream

To reduce maintenance cost, data is organized into ODS, DWD, DWS, ADS layers. Raw data lands in ODS, Flink performs real‑time cleaning and transformation to produce DWD, which is then streamed to Kafka. DWS aggregates lightly, and ADS serves business‑specific applications.
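The ODS → DWD → DWS → ADS flow can be sketched as successive transformations. In practice each arrow is a Flink job reading from and writing to Kafka; the record fields below are hypothetical:

```python
# Toy sketch of the layered flow: raw ODS records are cleaned into DWD,
# lightly aggregated into DWS, and reduced to a business-facing ADS metric.

raw_ods = [
    {"order_id": 1, "amount": "25.0", "city": " beijing "},
    {"order_id": 2, "amount": "bad", "city": "shenzhen"},   # dirty record
    {"order_id": 3, "amount": "40.0", "city": "beijing"},
]

def to_dwd(ods_rows):
    """Clean and standardize: drop unparseable rows, normalize fields."""
    out = []
    for r in ods_rows:
        try:
            out.append({"order_id": r["order_id"],
                        "amount": float(r["amount"]),
                        "city": r["city"].strip().lower()})
        except ValueError:
            continue  # a real pipeline would route this to a dead-letter topic
    return out

def to_dws(dwd_rows):
    """Light aggregation: revenue per city."""
    agg = {}
    for r in dwd_rows:
        agg[r["city"]] = agg.get(r["city"], 0.0) + r["amount"]
    return agg

def to_ads(dws_agg):
    """Business-specific metric: top city by revenue."""
    return max(dws_agg.items(), key=lambda kv: kv[1])

print(to_ads(to_dws(to_dwd(raw_ods))))
```

Each layer having a single, named responsibility is exactly the "clear data responsibilities" advantage noted below; the cost is that every layer boundary becomes another Kafka topic and another Flink job to operate.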


Pros: clear data responsibilities per layer.

Cons: multiple Flink jobs increase complexity; heavy Kafka usage raises load; schema management is cumbersome.

Solution 3: Standard Layer + Stream + Batch

Combines real‑time and offline processing by adding Spark‑based batch jobs on HDFS to the streaming pipeline.

Pros: supports both real‑time OLAP and large‑scale offline analytics.

Cons: data quality management is complex; schema unification is difficult; plain HDFS files do not support upserts.

Solution 4: Standard Layer + Stream + Data Lake

To address data‑quality and upsert issues, a unified stream‑batch data‑lake architecture based on Delta Lake / Hudi / Iceberg is adopted.

Iceberg, for example, decouples table storage from compute engines, supports both batch and streaming reads and writes, and integrates with a rich OLAP ecosystem.
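The key capability these table formats add over raw files is upsert: a new record with an existing primary key replaces the old version instead of appending a duplicate. A toy in-memory model of that semantics (not any format's actual implementation):

```python
# Conceptual sketch of the upsert semantics that Delta Lake / Hudi /
# Iceberg / Paimon provide over raw storage: records are keyed, and the
# latest write for a key wins. Field names are hypothetical.

def upsert(table: dict, records: list, key: str = "id") -> dict:
    """Apply a batch of records to a keyed table; last write wins."""
    for rec in records:
        table[rec[key]] = rec
    return table

table = {}
upsert(table, [{"id": 1, "status": "created"}, {"id": 2, "status": "created"}])
upsert(table, [{"id": 1, "status": "paid"}])   # CDC update for key 1
print(sorted(r["status"] for r in table.values()))
```

This is what makes CDC ingestion practical: database change streams are full of updates and deletes, which append-only HDFS files cannot absorb directly.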

Industry Real‑Time Warehouse Cases

1. Didi Ride‑Sharing Real‑Time Warehouse

Didi built a real‑time warehouse for its ride‑sharing business, achieving layered data (ODS, DWD, ADS), reduced resource consumption, and enriched data services.

2. OPPO Real‑Time Computing Platform

OPPO’s solution resembles the standard layer + stream model.

3. Didi Big Data Platform Architecture

Also follows the standard layer + stream approach.

Proposed Real‑Time Warehouse for "Micro‑Carp" Project

Based on the analysis, the recommended architecture is a standard‑layer system combined with stream processing and a data lake.

Current Warehouse Issues

Real‑time and offline warehouses are isolated, creating data islands.

Intermediate data is hard to query and debug.

Complex pipelines cause rollback difficulties.

Kudu integration with HDFS and cloud storage is problematic.

Planned New Architecture

Adopt Apache Paimon as the core lake format, supported by Flink for CDC, and integrate with OSS/S3/COS storage. Complement with Trino/Presto for OLAP and consider Doris/StarRocks for serving.

Technology Options

Apache Paimon

Provides fast ingestion, CDC support, and efficient real‑time analytics using LSM storage; compatible with Flink, Spark, Hive, Trino.
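The LSM storage behind Paimon's fast ingestion works roughly as follows: writes land in an in-memory buffer, get flushed as immutable sorted runs, and reads merge the runs with newer data shadowing older data. A greatly simplified toy sketch (real LSM trees also compact runs in the background):

```python
# Toy sketch of the LSM idea: buffered writes, flush-on-threshold,
# newest-run-wins reads. All class and parameter names are illustrative.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # mutable in-memory buffer
        self.runs = []              # immutable sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}      # flushed as an immutable sorted run

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newer runs shadow older ones
            if key in run:
                return run[key]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 9)                   # newer value stays in the memtable
print(db.get("a"), db.get("b"))  # 9 2
```

Buffering writes in memory and flushing sequentially is what lets LSM-based tables absorb high-throughput CDC streams while still serving keyed reads.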

Apache Hudi

Offers indexed updates, incremental queries, ACID transactions, and CDC ingestion.

Apache Iceberg

Standardized table format with schema evolution, partitioning, snapshotting, and broad engine support.

Migration Plan

Phase 1: Introduce Paimon, test ingestion performance for event data and CDC.

Phase 2: Migrate selected jobs, validate stability in production.

Phase 3: Migrate all workloads and retire legacy components (Kudu, HBase, Druid, Impala).

Summary

The article surveys mainstream real‑time warehouse designs, compares their trade‑offs, and concludes that a standard‑layer + stream + data‑lake architecture best fits the company’s needs, with a phased migration to Apache Paimon and related ecosystem components.

Tags: Big Data, stream processing, Apache Flink, real-time data warehouse, data lake, Lambda architecture, Kappa architecture