LakeSoul: An Open‑Source Real‑Time Data Lakehouse Framework – Design, Architecture, Benchmarks and Future Roadmap
This article introduces LakeSoul, an open‑source end‑to‑end real‑time lakehouse framework, covering its design philosophy; key technologies such as the ELT model, metadata management, and upsert with merge‑on‑read; performance benchmarks; real‑world use cases; and the roadmap for future enhancements.
LakeSoul Design Philosophy
LakeSoul is positioned as an end‑to‑end open‑source real‑time lakehouse framework that adopts an ELT model, allowing data to be ingested into the lake first and then processed in layered models, unifying storage, compute, and AI/BI capabilities.
Background
In cloud‑native environments, object storage provides cheap, scalable storage for massive structured, semi‑structured, and unstructured data. Traditional ETL pipelines suffer from multiple processing chains, inconsistent storage, high maintenance costs, and lack of ACID guarantees. LakeSoul addresses these issues with a unified ELT approach.
LakeSoul Positioning
LakeSoul offers cloud‑native lake‑warehouse construction, low‑code data ingestion supporting both real‑time and batch, high‑throughput upsert, ACID and time‑travel capabilities, and integrated AI/BI support (SQL, Pandas, PyTorch).
Overall Architecture
The top layer is a distributed metadata service that manages schemas and provides ACID‑based concurrency control, scaling to millions of partitions and billions of files. The compute layer integrates engines such as Flink, Spark, and Hive, with Presto support planned. The storage layer connects to HDFS, S3, MinIO, and OSS, using open formats like Parquet and Avro.
Technical Highlights
Metadata Layer
LakeSoul uses PostgreSQL for metadata management, providing primary‑key based tables, transactional concurrency control, snapshot reads, and two‑phase commit for exactly‑once semantics, scaling to billions of metadata entries.
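The snapshot‑read and time‑travel behavior described above can be modeled as a per‑partition commit log: each commit appends the files it produced, and a read at version v sees exactly the files committed at or before v. The following Python sketch is a minimal illustrative model, not LakeSoul's actual PostgreSQL schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionMeta:
    """Per-partition commit log (illustrative model only).
    commits[v] holds the list of files added by version v."""
    commits: list = field(default_factory=list)

    def commit(self, files):
        # Append a new version atomically; returns its version number.
        self.commits.append(list(files))
        return len(self.commits) - 1

    def snapshot_files(self, version):
        # A snapshot read (time travel) sees the union of all files
        # committed at or before the requested version.
        visible = []
        for v in range(version + 1):
            visible.extend(self.commits[v])
        return visible

part = PartitionMeta()
v0 = part.commit(["base-0.parquet"])
v1 = part.commit(["delta-1.parquet"])  # an upsert adds a new delta file
assert part.snapshot_files(v0) == ["base-0.parquet"]
assert part.snapshot_files(v1) == ["base-0.parquet", "delta-1.parquet"]
```

In this model a rollback is simply a read pinned to an earlier version; the real service additionally coordinates concurrent writers with transactional conflict checks.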
Upsert and Merge‑On‑Read (MOR)
Upsert generates new versions with primary‑key‑based hash partitioning and sorting, enabling high‑throughput writes (10⁵+ rows/sec per core) and efficient MOR that merges sorted files at read time, with customizable operators for aggregation or null handling.
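Because every upsert writes a new primary‑key‑sorted file, a merge‑on‑read scan is essentially a k‑way merge of sorted runs in which the newest version of each key wins. This Python sketch (using the standard library's `heapq.merge`) illustrates the idea under simplified assumptions; it is not LakeSoul's Rust implementation.

```python
import heapq

def merge_on_read(sorted_runs):
    """Merge primary-key-sorted file versions at read time, keeping the
    newest value per key. Each run is a list of (primary_key, value);
    runs later in the list are newer. Illustrative sketch only."""
    # Tag each row with its run index so that, for equal keys,
    # rows from newer runs sort after rows from older runs.
    tagged = [
        [(key, idx, value) for key, value in run]
        for idx, run in enumerate(sorted_runs)
    ]
    merged, last_key = [], object()
    # heapq.merge streams rows in (key, run_index) order without
    # materializing all inputs, mirroring a streaming MOR scan.
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            merged[-1] = (key, value)  # newer run overwrites older value
        else:
            merged.append((key, value))
            last_key = key
    return merged

base   = [(1, "a"), (2, "b"), (3, "c")]
delta1 = [(2, "B")]
delta2 = [(3, "C"), (4, "d")]
print(merge_on_read([base, delta1, delta2]))
# -> [(1, 'a'), (2, 'B'), (3, 'C'), (4, 'd')]
```

Sorting on write is what makes this cheap: the read side never re‑sorts, it only merges, so upsert throughput stays high and read amplification stays bounded.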
IO Layer
The IO layer, implemented in Rust, provides language‑agnostic read/write APIs (C, Java, Python) and asynchronous acceleration, delivering 3‑4× read and 1.4× write performance improvements over native Spark‑Parquet.
LakeHouse Ecosystem
LakeSoul supports automatic real‑time ingestion from heterogeneous sources (CDC, Kafka, databases), incremental ODS/DWD/DWS modeling, snapshot reads, rollbacks, and downstream integration with engines like Flink, Spark, Pandas, and PyTorch.
Benchmarks
Using a CCF data‑lake competition dataset (11 files, 10 incremental versions), LakeSoul outperforms Iceberg and Hudi in both copy‑on‑write and merge‑on‑read modes, achieving several‑fold speedups in read and write due to its Rust‑based IO and efficient metadata handling.
Application Cases
LakeSoul enables real‑time large‑wide tables without costly joins, supports multi‑source data synchronization via Flink CDC, and provides low‑code incremental operators (filter, group‑by, join) defined in YAML, guaranteeing exactly‑once semantics.
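The article states that incremental operators are declared in YAML. As a hedged illustration only, with all field names hypothetical rather than LakeSoul's actual configuration schema, such a low‑code pipeline declaration might look like:

```yaml
# Hypothetical pipeline declaration; field names are illustrative,
# not LakeSoul's actual configuration schema.
pipeline:
  source:
    table: ods.orders            # incremental read from an ODS-layer table
  operators:
    - filter: "order_status = 'PAID'"
    - group_by:
        keys: [customer_id]
        aggregates:
          total_amount: "sum(amount)"
  sink:
    table: dws.customer_order_totals
    semantics: exactly-once      # commits coordinated with the metadata layer
```

The appeal of this style is that filter/group‑by/join logic becomes declarative configuration, while the framework supplies the exactly‑once incremental execution underneath.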
Future Roadmap
Further IO performance optimizations and native compaction integration.
Support for non‑primary‑key merge‑into and expanded ecosystem connectors (Presto, Pandas, etc.).
Release incremental read operators, materialized views, and incremental updates.
Donate the project to the Linux Foundation AI & Data open‑source organization to broaden community impact.
Q&A Highlights
LakeSoul’s metadata layer uses PostgreSQL, avoiding small‑file issues of Iceberg/Hudi.
Flink CDC enables one‑click whole‑database synchronization with automatic schema change detection.
Rust‑based IO provides vectorized operations and significant read/write speedups.
Operator‑based MOR simplifies custom merge logic compared to Hudi payloads.
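The operator‑based merge mentioned in the last point can be modeled as a fold over a column's successive versions, oldest to newest, with a pluggable binary operator. The Python sketch below is an illustrative model of the concept, not LakeSoul's actual operator interface.

```python
def non_null_merge(old, new):
    """Keep the newer value unless it is NULL (partial-column upserts)."""
    return old if new is None else new

def sum_merge(old, new):
    """Accumulate instead of overwrite (e.g. incremental counters)."""
    return old + new

def apply_merge_op(merge_op, versions):
    """Fold a column's values across file versions, oldest to newest.
    Illustrative model of operator-based MOR, not LakeSoul's API."""
    result = versions[0]
    for newer in versions[1:]:
        result = merge_op(result, newer)
    return result

# A partial upsert that left a column NULL does not clobber the old value:
assert apply_merge_op(non_null_merge, [10, None, 7]) == 7
assert apply_merge_op(non_null_merge, [10, 5, None]) == 5
# An aggregating operator turns upserts into incremental accumulation:
assert apply_merge_op(sum_merge, [1, 2, 3]) == 6
```

Compared with writing a full record payload class (as in Hudi), swapping one small function per column keeps custom merge logic local and testable.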
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.