Unified Real‑Time and Batch Data Warehouse Architecture with Hudi Lakehouse

This article explains the mainstream Lambda data-warehouse architecture, its benefits, and its challenges. It then introduces Hudi as a lakehouse solution that unifies real-time and offline storage, describes the multi-layer service design, and walks through three practical scenarios: stream processing, real-time multidimensional analysis, and stream-batch data reuse. Together they show how the integrated architecture reduces latency, cost, and operational complexity.

Big Data Technology & Architecture

Sep 18, 2023

Mainstream Data Warehouse Architecture

The Lambda architecture combines real‑time and offline pipelines, allowing batch processing to provide comprehensive and accurate data while stream processing offers low‑latency results, balancing latency, throughput, and fault tolerance. In practice, batch and stream results are merged to satisfy ad‑hoc queries.
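As a minimal sketch of that query-time merge, assume the batch layer has produced totals up to the last T+1 run and the stream layer holds only the increments counted since then. The names `merge_views`, `batch_view`, and `speed_view` are illustrative and do not come from the original article.

```python
from typing import Dict


def merge_views(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Combine batch totals with the stream increments accumulated since the last batch run."""
    merged = dict(batch_view)  # copy so the batch result is left untouched
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


# Hypothetical page-view counts.
batch_view = {"page_a": 100, "page_b": 40}  # computed by the T+1 batch job
speed_view = {"page_a": 3, "page_c": 7}     # counted by the stream job since then
result = merge_views(batch_view, speed_view)
```

The merge is only correct when both pipelines agree on where the batch result ends and the stream increments begin, which is the alignment problem this kind of architecture has to manage.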

Its advantages fall into three areas: clear separation of duties between the batch and stream pipelines, fault tolerance (batch T+1 results can overwrite stream results), and complexity isolation (offline processing is simpler than real-time processing).

However, the architecture also suffers from several issues related to computation, operations, and cost, including data alignment problems between batch and stream results, duplicated development and maintenance effort for batch and stream code, and doubled storage and resource consumption.

Data Lake Solution

Hudi, an open‑source lake‑house framework, addresses the need for a unified real‑time/offline storage layer. Its core features include streaming source/sink for minute‑level data visibility, support for offline batch updates with Hive‑compatible insert/overwrite and upsert/delete capabilities, and seamless integration with engines such as Spark, Flink, and Presto.
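To make the upsert/delete and insert-overwrite capabilities more concrete, the sketch below builds the option map that a Spark job would typically pass to the Hudi datasource. The option keys are the ones documented for Hudi's Spark datasource in 0.x releases and should be checked against the version in use; the table name and field names (`dwd_orders`, `order_id`, `updated_at`, `dt`) and the helper `hudi_write_options` are hypothetical.

```python
from typing import Dict

# Subset of Hudi write operations that matches the features named above.
ALLOWED_OPERATIONS = {"insert", "upsert", "delete", "insert_overwrite"}


def hudi_write_options(table: str, record_key: str, precombine: str,
                       partition: str, operation: str = "upsert") -> Dict[str, str]:
    """Build a Hudi datasource option map (illustrative helper, not part of Hudi)."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,     # primary key
        "hoodie.datasource.write.precombine.field": precombine,    # picks the newest record per key
        "hoodie.datasource.write.partitionpath.field": partition,
        "hoodie.datasource.write.operation": operation,
    }


opts = hudi_write_options("dwd_orders", "order_id", "updated_at", "dt")

# With a Spark DataFrame `df` and a storage path `base_path` (both assumed),
# the write would look roughly like:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```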

Although Hudi provides a unified storage solution, further optimization is needed to meet stringent real‑time warehouse standards.

Lakehouse Integration Requirements

A unified lake-house storage layer must support high-throughput batch reads comparable to Hive tables, allow concurrent updates at partition granularity, deliver stream reads and writes with latency on the order of seconds at millions of requests per second (RPS), guarantee Exactly-Once or At-Least-Once semantics, and integrate with multiple query engines.
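The article does not say how the delivery guarantees are implemented. One common way to obtain exactly-once results on top of at-least-once delivery is an idempotent commit: each writer tags its batches with increasing ids, and the sink ignores any batch it has already committed. The class `IdempotentSink` below is a hypothetical sketch of that general technique, not the system's actual mechanism.

```python
from typing import Dict, List


class IdempotentSink:
    """Toy sink that drops replayed batches, keyed by (writer id, batch id)."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.last_committed: Dict[str, int] = {}  # writer id -> highest committed batch id

    def commit(self, writer_id: str, batch_id: int, rows: List[str]) -> bool:
        if batch_id <= self.last_committed.get(writer_id, -1):
            return False  # replay of an already committed batch: ignore it
        self.rows.extend(rows)
        self.last_committed[writer_id] = batch_id
        return True


sink = IdempotentSink()
first = sink.commit("task-0", 0, ["a", "b"])
replayed = sink.commit("task-0", 0, ["a", "b"])  # retry after a simulated failure
second = sink.commit("task-0", 1, ["c"])
```

Without the batch-id check, the retry would append `a` and `b` a second time, which is the at-least-once behaviour.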

The proposed solution consists of three layers: a persistence layer reusing Hudi’s file layout (base columnar files and log row files), a metadata layer managing tables, partitions, timelines, and providing ACID guarantees, and a service layer containing BTS (in‑memory acceleration) and TMS (table optimization) components.

Data Distribution

The physical layout mirrors Hudi’s concepts: Table, Partition, FileGroup (grouping base and log files), Block (in‑memory space for sorted writes), WAL Log (persistent storage for block eviction), and the relationship between tasks and blocks.
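The containment hierarchy can be sketched with a few data classes. This is a simplified model under stated assumptions: the field names, file names, and the `file_group` helper are illustrative, and Block and WAL are left out because they live in memory and in the service layer rather than in the table layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileGroup:
    file_group_id: str
    base_file: Optional[str] = None                     # columnar base file
    log_files: List[str] = field(default_factory=list)  # row-oriented log files


@dataclass
class Partition:
    path: str
    file_groups: Dict[str, FileGroup] = field(default_factory=dict)


@dataclass
class Table:
    name: str
    partitions: Dict[str, Partition] = field(default_factory=dict)

    def file_group(self, partition_path: str, file_group_id: str) -> FileGroup:
        """Return the file group, creating the partition and group on first use."""
        partition = self.partitions.setdefault(partition_path, Partition(partition_path))
        return partition.file_groups.setdefault(file_group_id, FileGroup(file_group_id))


table = Table("dwd_orders")                      # hypothetical table name
fg = table.file_group("dt=2023-09-18", "fg-0")
fg.base_file = "fg-0_base.parquet"
fg.log_files.append("fg-0.log.1")
```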

Data Model

Each stream‑batch unified table offers two views: an Append‑Only incremental view for real‑time calculations and a snapshot view for offline batch processing, where the snapshot view retains only the latest record per primary key.
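The difference between the two views can be shown on a small change log. In this sketch a record is a tuple of primary key, commit time, and value; that layout and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, str]  # (primary_key, commit_time, value)


def incremental_view(log: List[Record], since: int) -> List[Record]:
    """Append-only view: every change committed after `since`, in log order."""
    return [rec for rec in log if rec[1] > since]


def snapshot_view(log: List[Record]) -> Dict[str, str]:
    """Snapshot view: only the latest value per primary key."""
    latest: Dict[str, Record] = {}
    for rec in log:
        key, commit_time, _ = rec
        if key not in latest or commit_time >= latest[key][1]:
            latest[key] = rec
    return {key: rec[2] for key, rec in latest.items()}


log = [("k1", 1, "a"), ("k2", 2, "b"), ("k1", 3, "c")]
changes = incremental_view(log, since=1)
snapshot = snapshot_view(log)
```

A streaming job would consume `changes` and see both the old and new state of `k1` over time, while a batch job reading `snapshot` sees only the final value.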

Data Read/Write

Workloads are split by type: latency-sensitive streaming jobs go through BTS for acceleration, while throughput-oriented batch jobs read from and write to the storage layer directly. Consistency is maintained across concurrent stream and batch writes.

BTS Architecture

BTS consists of a Master (Block Load Balancer, Block Metadata Manager, Transaction Manager) and Table Servers (Session Manager, DataService, Transaction Manager, MemStore, WAL). It provides fast in‑memory reads/writes, column pruning, predicate push‑down, and durability via WAL.
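The interplay of in-memory writes, WAL durability, column pruning, and predicate push-down can be illustrated with a toy store. This is a sketch only: the class `MemStore` below borrows the component name but is not the BTS implementation, and a Python list stands in for a durable log file.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Optional

Row = Dict[str, Any]


class MemStore:
    """Toy in-memory store that appends to a write-ahead log before applying a write."""

    def __init__(self) -> None:
        self.wal: List[str] = []   # stand-in for a durable log file
        self.rows: List[Row] = []

    def write(self, row: Row) -> None:
        self.wal.append(json.dumps(row, sort_keys=True))  # log first, then apply
        self.rows.append(row)

    def scan(self, columns: List[str],
             predicate: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        """Filter rows inside the store and return only the requested columns."""
        out: List[Row] = []
        for row in self.rows:
            if predicate is None or predicate(row):           # predicate push-down
                out.append({c: row.get(c) for c in columns})  # column pruning
        return out

    @classmethod
    def recover(cls, wal: Iterable[str]) -> "MemStore":
        """Rebuild the in-memory state by replaying the log."""
        store = cls()
        for line in wal:
            store.write(json.loads(line))
        return store


store = MemStore()
store.write({"id": 1, "city": "SH", "amount": 10})
store.write({"id": 2, "city": "BJ", "amount": 25})
hits = store.scan(["id", "amount"], lambda r: r["amount"] > 20)
restored = MemStore.recover(store.wal)
```

Because filtering and projection happen inside `scan`, the caller receives only matching rows and requested columns, which is the point of pushing both operations down to the store.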

Practical Scenarios

Stream Processing: Replacing a complex MQ‑based pipeline with Hudi tables reduces component dependencies, simplifies debugging, and enables low‑cost historical data replay.

Real‑time Multidimensional Analysis: Using Hudi eliminates the need for ClickHouse, lowers storage costs, and allows full‑data queries via Presto.

Stream‑Batch Data Reuse: Storing DWD layer data in Hudi lets offline warehouses reuse the same dataset, saving compute and storage resources and improving data readiness.

Future work focuses on enhancing engine performance (multi‑task concurrent writes, WAL merging, async flush), stability (load balancing, multi‑region deployment, disaster recovery), and business features (Kafka‑like partitioning and consumer groups).

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
