
Which Real-Time Data Warehouse Architecture Fits Your Needs? A Deep Dive

This article explains why modern enterprises need real‑time data‑warehouse architectures, breaks down traditional layered warehouse concepts, compares Lambda and Kappa models, evaluates five practical real‑time solutions—including Iceberg‑based lakehouse and MPP databases—provides code snippets, and offers selection guidance with real‑world company examples.

ITPUB

Apr 19, 2022


Why Real‑Time Data Warehouse Architecture Is Needed

As digital transformation accelerates, enterprises generate massive, increasingly complex data, making traditional batch‑oriented warehouses insufficient for timely insights. Real‑time data‑warehouse architectures address the challenge of storing and computing large‑scale, complex data with low latency.

Warehouse Layering and Its Necessity

Data typically flows through four warehouse layers in order: ODS (operational data store, holding raw source data), DWD (the detail layer, holding records after cleaning), DWS (subject‑oriented data marts built from the detail layer), and ADS (application‑specific aggregated data). Layering reduces processing complexity, improves stability, and confines a fix to the affected layer when an error occurs.
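The layer‑by‑layer flow can be pictured with a minimal in‑memory sketch. Everything here is a hypothetical illustration rather than part of any real warehouse: the class name, the "userId,amount" record format, and the cleaning rule (drop malformed or non‑positive rows) are all assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical ODS -> DWD -> DWS -> ADS flow; every name here is invented for illustration.
class LayeredWarehouseSketch {

    // ODS holds raw lines such as "userId,amount".
    // DWD keeps only well-formed rows with a positive integer amount.
    static List<String[]> toDwd(List<String> odsLines) {
        List<String[]> detail = new ArrayList<>();
        for (String line : odsLines) {
            String[] parts = line.split(",");
            if (parts.length != 2) {
                continue; // malformed row: wrong number of fields
            }
            String user = parts[0].trim();
            String amount = parts[1].trim();
            try {
                if (!user.isEmpty() && Long.parseLong(amount) > 0) {
                    detail.add(new String[] {user, amount});
                }
            } catch (NumberFormatException e) {
                // non-numeric amount: drop the row
            }
        }
        return detail;
    }

    // DWS: per-user totals computed from the cleaned detail rows.
    static Map<String, Long> toDws(List<String[]> dwdRows) {
        Map<String, Long> totals = new TreeMap<>();
        for (String[] row : dwdRows) {
            totals.merge(row[0], Long.parseLong(row[1]), Long::sum);
        }
        return totals;
    }

    // ADS: an application-facing answer, here the user with the highest total.
    // Returns null for an empty input; ties go to the first user in key order.
    static String toAds(Map<String, Long> dws) {
        String best = null;
        long bestTotal = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : dws.entrySet()) {
            if (e.getValue() > bestTotal) {
                best = e.getKey();
                bestTotal = e.getValue();
            }
        }
        return best;
    }
}
```

Because each layer only reads the output of the previous one, a bug in the cleaning rule can be fixed in toDwd alone and the downstream layers recomputed without touching their logic.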

From Lambda to Kappa Architecture

Lambda architecture runs a real‑time stream path (e.g., Flink) alongside a batch path (e.g., Spark), and each path writes its results to its own sink. It offers both low‑latency results and historical analysis, but the same logic has to be maintained in two places, and keeping stream and batch results consistent is difficult. Kappa architecture simplifies this by converting every source to a stream and using a single stream‑processing engine. In exchange, it requires all data to arrive as real‑time streams and can incur high development and resource costs.
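The difference between the two models can be sketched with a toy page‑view counter. This is an illustration under stated assumptions, not how Flink or Spark work internally: the Event class, the cutoff timestamp, and both query methods are hypothetical. In Lambda, a query merges a batch view (events before the cutoff) with a speed view (events from the cutoff onward); in Kappa, one function processes the whole event log.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy comparison of Lambda and Kappa; all names are invented for this sketch.
class LambdaVsKappaSketch {

    static final class Event {
        final long timestamp;
        final String page;

        Event(long timestamp, String page) {
            this.timestamp = timestamp;
            this.page = page;
        }
    }

    // Counts page views with fromInclusive <= timestamp < toExclusive.
    static Map<String, Long> count(List<Event> log, long fromInclusive, long toExclusive) {
        Map<String, Long> counts = new HashMap<>();
        for (Event e : log) {
            if (e.timestamp >= fromInclusive && e.timestamp < toExclusive) {
                counts.merge(e.page, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Lambda: the batch path covers history, the speed path covers recent events,
    // and the query merges the two views.
    static Map<String, Long> lambdaQuery(List<Event> log, long batchCutoff) {
        Map<String, Long> batchView = count(log, Long.MIN_VALUE, batchCutoff);
        Map<String, Long> speedView = count(log, batchCutoff, Long.MAX_VALUE);
        Map<String, Long> merged = new HashMap<>(batchView);
        speedView.forEach((page, n) -> merged.merge(page, n, Long::sum));
        return merged;
    }

    // Kappa: a single code path over the full, replayable log.
    static Map<String, Long> kappaQuery(List<Event> log) {
        return count(log, Long.MIN_VALUE, Long.MAX_VALUE);
    }
}
```

The two queries agree here only because both Lambda paths share one count function. In practice the batch and stream paths are usually separate code on separate engines, which is where the consistency problem comes from.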

Five Common Real‑Time Warehouse Solutions

Kappa Architecture: All data is ingested through Kafka, processed by Flink, and written to stores such as MySQL, Elasticsearch, HBase, or Druid. Benefits: simplicity and true real‑time results. Drawbacks: each new report needs a new Flink job, and large data volumes demand large Flink clusters.

Standard Layering + Stream Processing: Retains the ODS/DWD/DWS/ADS layers and uses Flink to clean, transform, and aggregate data at each layer, pushing each layer's results to Kafka for downstream consumption. Benefits: each layer has a clear responsibility. Drawbacks: running multiple Flink clusters increases operational complexity and load.

Standard Layering + Stream + Batch: Adds Spark batch jobs on top of the layered streaming architecture, supporting both real‑time OLAP queries and large‑scale offline analytics. Benefits: uses stream and batch processing where each is strongest. Drawbacks: higher maintenance overhead and harder schema management.

Standard Layering + Stream + Data Lake (Iceberg/Hudi/Delta Lake): Unifies storage in a data lake table format (Iceberg in this article's example) that supports streaming writes with batch reads, upserts, small‑file compaction, and queries from engines such as Hive, Spark, Presto, and Impala. This addresses the consistency, schema, and upsert problems of the previous solutions while keeping every layer queryable in real time.

Full‑Scenario MPP Database (e.g., ClickHouse, StarRocks): Writes real‑time data directly to an MPP store and optionally ingests offline files via Kafka or bulk SQL import. It provides fast OLAP queries without a separate lake layer, which simplifies the architecture for smaller teams.
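The upsert capability mentioned for the data lake solution can be pictured with a small sketch. It is a conceptual model only and does not reflect how Iceberg, Hudi, or Delta Lake store or merge data: the change‑record layout ("U" for upsert, "D" for delete, followed by a key and an optional value) and the class name are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual upsert by primary key; the record layout is hypothetical.
class UpsertSketch {

    // Applies change records in arrival order.
    // {"U", key, value} inserts the key or overwrites its current value.
    // {"D", key} removes the key if it is present.
    static Map<String, String> applyChanges(List<String[]> changes) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] change : changes) {
            String op = change[0];
            String key = change[1];
            if ("U".equals(op)) {
                table.put(key, change[2]);
            } else if ("D".equals(op)) {
                table.remove(key);
            } else {
                throw new IllegalArgumentException("unknown operation: " + op);
            }
        }
        return table;
    }
}
```

With an append‑only sink, the order below would appear as two rows (CREATED and PAID) and every reader would have to deduplicate. With upserts, the table itself holds one current row per key.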

Iceberg Practical Example

Writing streaming data to Iceberg:

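import java.util.concurrent.TimeUnit
import org.apache.spark.sql.streaming.Trigger

// data is a streaming DataFrame; tableIdentifier and checkpointPath are defined elsewhere
data.writeStream.format("iceberg")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES)) // one micro-batch per minute
  .option("path", tableIdentifier) // the target table is passed with the "path" option
  .option("checkpointLocation", checkpointPath)
  .start()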

Compacting Iceberg data files for a single date:

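import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;
import org.apache.iceberg.expressions.Expressions;

Table table = ...
// Rewrite only the files holding rows with date = 2022-03-18, aiming for files of about 500 MB
Actions.forTable(table).rewriteDataFiles()
  .filter(Expressions.equal("date", "2022-03-18"))
  .targetSizeInBytes(500 * 1024 * 1024) // 500 MB
  .execute();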
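To make the effect of a target file size more concrete, here is a rough sketch of how small files might be grouped before being rewritten. This is not Iceberg's actual planning logic. It is a simple greedy grouping under assumed names (CompactionPlanSketch, planGroups), with sizes given in any one consistent unit.

```java
import java.util.ArrayList;
import java.util.List;

// Greedy grouping of files toward a target size; a hypothetical sketch, not Iceberg's planner.
class CompactionPlanSketch {

    // Walks the files in order and starts a new group whenever adding the next
    // file would push the current group past targetSize. A file that is larger
    // than targetSize on its own ends up alone in its group.
    static List<List<Long>> planGroups(List<Long> fileSizes, long targetSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > targetSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

For example, four files of 200, 200, 200, and 100 MB with a 500 MB target fall into two groups, so a rewrite would leave two files where there were four.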

Selection Guidance

Choose Kappa for simple, stream‑centric workloads. Opt for standard layering + stream when data governance, multi‑topic access, and fine‑grained permissions are required. Use layering + stream + batch if both real‑time and heavy offline analytics are needed. Adopt the Iceberg lakehouse for large‑scale, complex scenarios with upsert and schema unification needs. Consider an MPP database for small teams needing a single‑store solution.
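The guidance above can be written down as a small decision helper. The sketch below is only one possible reading: the parameter names are invented, and the order in which the conditions are checked is an assumption, since the guidance does not say which requirement wins when several apply.

```java
// Hypothetical encoding of the selection guidance; the check order is an assumption.
class ArchitectureChooser {

    enum Architecture {
        KAPPA,
        LAYERED_STREAM,
        LAYERED_STREAM_BATCH,
        LAYERED_STREAM_LAKE,
        MPP_DATABASE
    }

    static Architecture choose(boolean smallTeamWantsSingleStore,
                               boolean needsUpsertOrUnifiedSchema,
                               boolean needsHeavyOfflineAnalytics,
                               boolean needsGovernanceOrPermissions) {
        if (smallTeamWantsSingleStore) {
            return Architecture.MPP_DATABASE;
        }
        if (needsUpsertOrUnifiedSchema) {
            return Architecture.LAYERED_STREAM_LAKE;
        }
        if (needsHeavyOfflineAnalytics) {
            return Architecture.LAYERED_STREAM_BATCH;
        }
        if (needsGovernanceOrPermissions) {
            return Architecture.LAYERED_STREAM;
        }
        // Simple, stream-centric workload with none of the needs above.
        return Architecture.KAPPA;
    }
}
```

A team would likely reorder or weight these checks to match its own priorities; the point is only that each option is tied to a concrete requirement.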

Big‑Company Implementations

OPPO and Didi employ architectures similar to the standard‑layering + stream solution, while Bitmain combines standard layering, stream, batch, and ClickHouse for a highly complex stack.

Conclusion and Further Thoughts

Real‑time data‑warehouse design should evolve from business requirements, not be copied wholesale. Most organizations will benefit from a layered approach combined with stream processing, but the exact mix of lakehouse, batch, or MPP technologies depends on team size, data volume, and query latency needs.

