Unified Batch‑Stream Storage with Hudi and LAS: Architecture, Design, and Deployment
This article presents a comprehensive overview of a batch‑stream unified storage solution built on Hudi and the Lakehouse Analysis Service (LAS), covering background challenges, architectural design, data organization, read/write mechanisms, BTS architecture, real‑world deployment scenarios, and future development plans.
Background and Challenges

Traditional data warehouses rely on separate batch (offline) and streaming (real‑time) pipelines, leading to duplicated code, double resource consumption, and inconsistent query semantics. These issues motivate a unified batch‑stream storage approach.
Design Solution

The proposed solution leverages a lakehouse architecture with Hudi as the underlying storage engine, enhanced by an in‑memory service layer (BTS) and Table Service Management (TSM) to provide low‑latency reads, high‑throughput writes, and multi‑engine support (Spark, Flink, Presto).
Data Organization

Data is logically partitioned into tables, file groups, and blocks. Writes first go to a write‑ahead log (WAL) for durability, then to in‑memory blocks, which are periodically flushed to persistent storage (HDFS) following Hudi‑style base and log files.
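The WAL-then-memory-then-flush write path can be sketched as follows. This is a minimal illustrative model, not LAS or BTS code: the class names (`WriteAheadLog`, `MemoryBlock`, `FileGroup`), the block capacity, and the JSON file standing in for a Hudi log file are all assumptions made for the example.

```python
import json
import os


class WriteAheadLog:
    """Append-only durability log: every record lands here before
    the write is acknowledged, so it survives a memory-layer crash."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")


class MemoryBlock:
    """In-memory buffer of recent records for one file group."""

    def __init__(self, capacity=3):
        self.records = []
        self.capacity = capacity

    def add(self, record):
        self.records.append(record)
        return len(self.records) >= self.capacity  # True -> time to flush


class FileGroup:
    """One file group: a WAL, an active in-memory block, and a count of
    blocks already flushed as Hudi-style log files."""

    def __init__(self, wal_path, flush_dir):
        self.wal = WriteAheadLog(wal_path)
        self.block = MemoryBlock()
        self.flush_dir = flush_dir
        self.flushed = 0

    def write(self, record):
        self.wal.append(record)      # 1. durability first
        if self.block.add(record):   # 2. then buffer in memory
            self.flush()             # 3. full blocks go to storage

    def flush(self):
        path = os.path.join(self.flush_dir, f"log_{self.flushed}.json")
        with open(path, "w") as f:
            json.dump(self.block.records, f)
        self.flushed += 1
        self.block = MemoryBlock()   # start a fresh block
```

The key ordering property mirrors the text: a record is never only in memory; the WAL copy exists before the in-memory block accepts it, so flushing can be asynchronous without risking loss.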
Read/Write Mechanism

Batch updates write directly to persistent storage, while streaming reads and writes interact with the BTS memory layer first, falling back to the WAL or persistent files when needed. This design achieves second‑level latency and supports exactly‑once or at‑least‑once semantics.
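The layered fallback can be sketched as a lookup over three layers, newest first. This is a hedged illustration only: the function names and the dict-based layers are assumptions, standing in for the BTS memory blocks, WAL replay, and persistent base/log files described above.

```python
def read_latest(key, memory, wal, persistent):
    """Streaming read path: try the BTS memory layer first, then fall
    back to WAL replay, then to persistent base/log files."""
    for layer in (memory, wal, persistent):  # newest layer first
        if key in layer:
            return layer[key]
    return None


def merged_view(memory, wal, persistent):
    """Batch-style snapshot: newer layers override older ones key by
    key (persistent < WAL < memory)."""
    view = dict(persistent)
    view.update(wal)
    view.update(memory)
    return view
```

The same override order gives both access patterns consistent answers: a point lookup and a full snapshot agree on which version of each key is current.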
BTS Architecture

BTS follows a master‑slave model: the master manages metadata, and Table Servers (slaves) handle block‑level reads/writes, WAL management, and asynchronous compaction/clustering to optimize query performance.
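The master's file-group-to-server mapping can be illustrated with rendezvous (highest-random-weight) hashing, which keeps reassignment minimal when a Table Server leaves. This placement strategy is an assumption chosen for the sketch; the article does not specify which algorithm BTS actually uses.

```python
import hashlib


class BTSMaster:
    """Toy metadata master: maps each file group to a live Table
    Server. With rendezvous hashing, removing a server only moves the
    file groups that were assigned to that server."""

    def __init__(self, servers):
        self.servers = list(servers)

    def assign(self, file_group_id):
        # Each (server, file group) pair gets a deterministic weight;
        # the server with the highest weight owns the file group.
        return max(
            self.servers,
            key=lambda s: hashlib.md5((s + ":" + file_group_id).encode()).hexdigest(),
        )

    def remove_server(self, server):
        self.servers.remove(server)
```

A plain `hash(fg) % len(servers)` scheme would remap most file groups whenever membership changes; the highest-random-weight variant avoids that, which matters when each move invalidates a server's warmed memory blocks.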
Deployment Scenarios

The solution is applied to multiple use cases, including real‑time data processing pipelines, multidimensional OLAP dashboards, log‑based batch‑stream reuse, and internal Feishu data warehouse workloads, demonstrating reduced data duplication and improved latency.
Future Plans

Upcoming work focuses on finer‑grained load balancing, enhanced query indexing, deeper integration with native engines, and vectorized processing of log, block, and Parquet files.
Q&A Highlights

Discussions addressed Hudi vs. Kafka trade‑offs, LAS indexing mechanisms, performance gains from BTS acceleration, and consistency guarantees provided by WAL‑based writes.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.