How to Build a Real-Time Data Warehouse with Unified Stream‑Batch Architecture
This article examines the evolution of big‑data architectures, identifies the latency and maintenance issues of classic Lambda designs, and presents a hybrid Lambda‑Kappa solution that unifies streaming and batch processing to achieve minute‑level data freshness and second‑level query latency while reducing development cost.
Big Data Architecture Evolution
The classic offline data warehouse consists of four layers: Operational Data Store, Detail Layer, Summary Layer, and Application Data Store. It offers a simple architecture and low cost, but its data timeliness is poor.
Lambda architecture, introduced by Nathan Marz in 2011, adds a Speed Layer to improve timeliness while retaining the Batch Layer for accuracy. However, it introduces duplicated code, higher resource consumption, and data inconsistency between batch and speed layers.
Kappa architecture, proposed by Jay Kreps in 2014, removes the Batch Layer and uses a single code base for both real‑time and offline processing. This solves many of Lambda's drawbacks, but challenges remain, such as reprocessing historical data (back‑tracking) and complex stream joins.
Background: Limitations of the Existing Lambda‑Based Warehouse
The legacy system, a Lambda‑style warehouse, faced three major problems:
Query latency of tens of minutes due to thousands of tables and extensive joins.
Data delay of hours to days, with some hourly data arriving several hours late.
Inconsistent real‑time and offline data, caused by separate code bases for the two paths, which also increased maintenance effort.
Technical Solution: Unified Stream‑Batch Real‑Time Warehouse
We adopted a hybrid Lambda‑Kappa architecture that makes the following key changes:
Data Flow Decision per Field – Each column is classified as either real‑time or offline based on its timeliness requirement, so no field has to be computed twice by parallel batch and speed pipelines.
Wide‑Table Modeling – Instead of layered modeling, we merge real‑time and offline fields into a minute‑level wide table, reducing join complexity and query latency.
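The two changes above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Field` class, the `max_staleness_minutes` attribute, the 60-minute threshold, and the function names are all hypothetical, chosen only to show one field being routed to exactly one pipeline and the results being merged into a single wide row.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed cutoff: fields that must be fresher than this go to the real-time path.
REALTIME_THRESHOLD_MINUTES = 60


@dataclass(frozen=True)
class Field:
    name: str
    max_staleness_minutes: int  # hypothetical timeliness requirement


def route(field: Field) -> str:
    """Assign a field to exactly one pipeline based on its timeliness need."""
    if field.max_staleness_minutes <= REALTIME_THRESHOLD_MINUTES:
        return "realtime"
    return "offline"


def build_wide_row(
    key: str,
    realtime_values: Dict[str, object],
    offline_values: Dict[str, object],
    fields: List[Field],
) -> Dict[str, object]:
    """Merge real-time and offline fields into one wide-table row.

    Each field is read only from the pipeline it was routed to, so the
    two pipelines never disagree about the same column.
    """
    row: Dict[str, object] = {"key": key}
    for field in fields:
        source = realtime_values if route(field) == "realtime" else offline_values
        row[field.name] = source.get(field.name)
    return row


fields = [Field("order_status", 5), Field("lifetime_value", 1440)]
row = build_wide_row(
    "u1",
    realtime_values={"order_status": "paid", "lifetime_value": 0},
    offline_values={"lifetime_value": 320.5},
    fields=fields,
)
```

Because `lifetime_value` is routed to the offline path, the stale value present in the real-time input is ignored, which is the property that removes divergence between the two paths.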
Key Breakthroughs
Data Update Handling – For mutable database records, we capture binlog changes, write them to a message queue, and use a Copy‑On‑Write mechanism to merge base and delta files every five minutes. This keeps data fresh to within minutes while queries over the merged files return in seconds.
Multi‑Table Join Optimization – Each table produces a base file (full snapshot) and a small delta file (incremental changes). Three successive joins operate on these small deltas, keeping overall latency low.
Database‑Log Correlation – Full database snapshots are cached in a high‑performance store; streaming logs are processed in real time and joined with the cache, then written to files. Hot data is merged every minute, cold data daily, balancing cost and latency.
Data Watermark Management – Wide‑table generation waits for all dependent sources; fast‑producing tables contribute minute‑level data, while slower tables may lag by T+1 or T+2, and the result is still a consistent wide view.
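The base-and-delta merge described under Data Update Handling can be illustrated with a small in-memory sketch. It assumes that the base snapshot is a mapping from primary key to row and that each binlog change has already been reduced to an `(operation, key, row)` tuple; these shapes and the function name `merge_copy_on_write` are assumptions for illustration, and a real system would operate on files rather than dictionaries.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
# (operation, primary key, row image); the row image is None for deletes.
Change = Tuple[str, str, Optional[Row]]


def merge_copy_on_write(base: Dict[str, Row], delta: Iterable[Change]) -> Dict[str, Row]:
    """Apply a delta to a base snapshot and return a new snapshot.

    The base is never modified, so readers that still hold the old
    snapshot keep seeing a consistent view until they switch to the new one.
    """
    merged = dict(base)  # copy the key-to-row mapping; rows are replaced, never mutated
    for op, key, row in delta:
        if op == "delete":
            merged.pop(key, None)
        elif op in ("insert", "update"):
            if row is None:
                raise ValueError(f"{op} for key {key!r} needs a row image")
            merged[key] = dict(row)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return merged


base = {
    "1": {"id": "1", "status": "new"},
    "2": {"id": "2", "status": "new"},
}
delta = [
    ("update", "1", {"id": "1", "status": "paid"}),
    ("delete", "2", None),
    ("insert", "3", {"id": "3", "status": "new"}),
]
merged = merge_copy_on_write(base, delta)
```

Changes are applied in order, so a later change to the same key wins. Running the merge on a fixed schedule (every five minutes in the article) bounds how stale the published snapshot can become.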
Results and Future Planning
The unified architecture reduced data ingestion latency from hours or days to minutes and cut query time from tens of minutes to seconds. Development and maintenance costs dropped because a single code base now serves both real‑time and offline logic, eliminating data divergence.
Future work includes further improving engine query performance and enhancing the user experience of upstream query tools, with an open invitation to the community for collaboration.