Unified Batch‑Stream Storage with Hudi and LAS: Architecture, Design, and Deployment
This article presents a comprehensive overview of a batch‑stream unified storage solution built on Hudi and the Lakehouse Analysis Service (LAS), covering background challenges, architectural design, data organization, read/write mechanisms, BTS architecture, real‑world deployment scenarios, and future development plans.
Background and Challenges

Traditional data warehouses rely on separate batch (offline) and streaming (real‑time) pipelines, leading to duplicated code, double resource consumption, and inconsistent query semantics. These issues motivate a unified batch‑stream storage approach.
Design Solution

The proposed solution leverages a lakehouse architecture with Hudi as the underlying storage engine, enhanced by an in‑memory service layer (BTS) and Table Service Management (TSM) to provide low‑latency reads, high‑throughput writes, and multi‑engine support (Spark, Flink, Presto).
Data Organization

Data is logically partitioned into tables, file groups, and blocks. Writes first go to a write‑ahead log (WAL) for durability, then to in‑memory blocks, which are periodically flushed to persistent storage (HDFS) following Hudi‑style base and log files.
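The WAL-then-memory-then-flush write path can be sketched as follows. This is a minimal illustrative model, not LAS or BTS code: the class names (`WriteAheadLog`, `MemoryBlock`, `FileGroup`), the block capacity, and the JSON file standing in for a Hudi log file are all assumptions made for the example.

```python
import json
import os


class WriteAheadLog:
    """Append-only durability log: every record lands here before
    the write is acknowledged, so it survives a memory-layer crash."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")


class MemoryBlock:
    """In-memory buffer of recent records for one file group."""

    def __init__(self, capacity=3):
        self.records = []
        self.capacity = capacity

    def add(self, record):
        self.records.append(record)
        return len(self.records) >= self.capacity  # True -> time to flush


class FileGroup:
    """One file group: a WAL, an active in-memory block, and a count of
    blocks already flushed as Hudi-style log files."""

    def __init__(self, wal_path, flush_dir):
        self.wal = WriteAheadLog(wal_path)
        self.block = MemoryBlock()
        self.flush_dir = flush_dir
        self.flushed = 0

    def write(self, record):
        self.wal.append(record)      # 1. durability first
        if self.block.add(record):   # 2. then buffer in memory
            self.flush()             # 3. full blocks go to storage

    def flush(self):
        path = os.path.join(self.flush_dir, f"log_{self.flushed}.json")
        with open(path, "w") as f:
            json.dump(self.block.records, f)
        self.flushed += 1
        self.block = MemoryBlock()   # start a fresh block
```

The key ordering property mirrors the text: a record is never only in memory; the WAL copy exists before the in-memory block accepts it, so flushing can be asynchronous without risking loss.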
Read/Write Mechanism

Batch updates write directly to persistent storage, while streaming reads and writes interact with the BTS memory layer first, falling back to the WAL or persistent files when needed. This design achieves second‑level latency and supports exactly‑once or at‑least‑once semantics.
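The layered fallback can be sketched as a lookup over three layers, newest first. This is a hedged illustration only: the function names and the dict-based layers are assumptions, standing in for the BTS memory blocks, WAL replay, and persistent base/log files described above.

```python
def read_latest(key, memory, wal, persistent):
    """Streaming read path: try the BTS memory layer first, then fall
    back to WAL replay, then to persistent base/log files."""
    for layer in (memory, wal, persistent):  # newest layer first
        if key in layer:
            return layer[key]
    return None


def merged_view(memory, wal, persistent):
    """Batch-style snapshot: newer layers override older ones key by
    key (persistent < WAL < memory)."""
    view = dict(persistent)
    view.update(wal)
    view.update(memory)
    return view
```

The same override order gives both access patterns consistent answers: a point lookup and a full snapshot agree on which version of each key is current.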
BTS Architecture

BTS follows a master‑slave model: the master manages metadata, and Table Servers (slaves) handle block‑level reads/writes, WAL management, and asynchronous compaction/clustering to optimize query performance.
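The master's file-group-to-server mapping can be illustrated with rendezvous (highest-random-weight) hashing, which keeps reassignment minimal when a Table Server leaves. This placement strategy is an assumption chosen for the sketch; the article does not specify which algorithm BTS actually uses.

```python
import hashlib


class BTSMaster:
    """Toy metadata master: maps each file group to a live Table
    Server. With rendezvous hashing, removing a server only moves the
    file groups that were assigned to that server."""

    def __init__(self, servers):
        self.servers = list(servers)

    def assign(self, file_group_id):
        # Each (server, file group) pair gets a deterministic weight;
        # the server with the highest weight owns the file group.
        return max(
            self.servers,
            key=lambda s: hashlib.md5((s + ":" + file_group_id).encode()).hexdigest(),
        )

    def remove_server(self, server):
        self.servers.remove(server)
```

A plain `hash(fg) % len(servers)` scheme would remap most file groups whenever membership changes; the highest-random-weight variant avoids that, which matters when each move invalidates a server's warmed memory blocks.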
Deployment Scenarios

The solution is applied to multiple use cases, including real‑time data processing pipelines, multidimensional OLAP dashboards, log‑based batch‑stream reuse, and internal Feishu data warehouse workloads, demonstrating reduced data duplication and improved latency.
Future Plans

Upcoming work focuses on finer‑grained load balancing, enhanced query indexing, deeper integration with native engines, and vectorized processing of log, block, and Parquet files.
Q&A Highlights

Discussions addressed Hudi vs. Kafka trade‑offs, LAS indexing mechanisms, performance gains from BTS acceleration, and consistency guarantees provided by WAL‑based writes.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.