Tianqiong OLAP Real‑time Lakehouse Fusion Platform Architecture Practice
This article explains why lake‑warehouse fusion is needed, describes the challenges of integrating real‑time data warehouses with data lakes, and introduces a new StarRocks‑based architecture that supports real‑time ingestion, data cooling, offline loading, and adaptive hot‑cold query rewriting. It closes with future plans and a Q&A.
The article begins by answering why lake‑warehouse fusion is required, highlighting the high cost and latency of pure real‑time data warehouses, the limited performance of data lakes, and the need for unified storage, consistent metadata, and cost‑effective data governance.
It then discusses the difficulties of merging lake and warehouse metadata, providing unified data catalogs, meeting real‑time requirements, and achieving warehouse‑level performance within a lake environment.
Next, a traditional lake‑warehouse monolithic architecture is presented, where data is ingested via MQ (Kafka) to Flink, written to the lake, and queried through external engines such as Presto or Spark, with optional export to OLAP engines like StarRocks for high‑performance queries.
Building on this, a new real‑time lake‑warehouse fusion architecture is introduced. StarRocks is chosen as the core real‑time warehouse due to its vectorized execution, compression, and materialized view capabilities. Data sources (Pulsar, Tube, Kafka, TDW, Hive) are loaded into StarRocks via routine, stream, or broker loads.
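A routine load is the Kafka‑style continuous ingestion path mentioned above. As a minimal sketch, the helper below assembles a StarRocks `CREATE ROUTINE LOAD` statement; the database, job, table, topic, and broker names are all illustrative, and a production job would also tune offsets, concurrency, and error handling:

```python
def build_routine_load(db, job, table, topic, brokers, columns):
    """Assemble a CREATE ROUTINE LOAD statement for a Kafka topic.

    All identifiers here are illustrative placeholders; real jobs also
    configure starting offsets, desired concurrency, and error limits.
    """
    cols = ", ".join(columns)
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f"COLUMNS({cols})\n"
        f'PROPERTIES ("format" = "json")\n'
        f"FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}"\n'
        f");"
    )

stmt = build_routine_load("demo_db", "orders_load", "orders",
                          "orders_topic", "broker1:9092",
                          ["order_id", "amount"])
print(stmt)
```

Stream load (HTTP push) and broker load (bulk file import from HDFS) follow the same pattern with different source clauses.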
After data is stored in StarRocks, scheduled tasks export hot data to the lake (Iceberg, Hudi, Hive) in Parquet format, supporting automatic table creation and schema evolution. The export process creates partition‑level tasks that write data to HDFS and update lake metadata.
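The cooling step described above can be sketched as a planner that selects partitions older than the hot window and emits one export task per partition. This is only an illustration of the partition‑level task shape, assuming date partitions and an HDFS layout; the path scheme and task fields are not the platform's actual API:

```python
from datetime import date, timedelta

def plan_cooling_tasks(partitions, hot_days, today, hdfs_root):
    """Select partitions that have aged out of the hot (warehouse) window
    and emit one Parquet export task per partition. The dt= path layout
    and the task dict shape are illustrative."""
    cutoff = today - timedelta(days=hot_days)
    tasks = []
    for p in sorted(partitions):
        if p < cutoff:
            tasks.append({
                "partition": p.isoformat(),
                "target": f"{hdfs_root}/dt={p.isoformat()}/",
                "format": "parquet",
            })
    return tasks

parts = [date(2023, 5, d) for d in range(1, 11)]
tasks = plan_cooling_tasks(parts, hot_days=7, today=date(2023, 5, 10),
                           hdfs_root="hdfs://warehouse/orders")
# With a 7-day hot window, partitions before 2023-05-03 are exported.
```

After each partition is written, the real system additionally commits the new files to the lake table's metadata (Iceberg/Hudi/Hive) so queries see them atomically.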
For real‑time ingestion, a native Pulsar source is added to StarRocks, allowing high‑throughput consumption (up to 1.65 M messages/s) and seamless write‑through to the warehouse, followed by cooling tasks that move data to the lake.
Offline ingestion is handled by a unified scheduling platform that triggers data imports from Hive or other lake tables into StarRocks, supporting hourly, daily, or monthly granularity and automatic table creation.
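The scheduling granularity above boils down to mapping each run's timestamp to the lake partition that should be imported. A minimal sketch, with assumed partition‑id formats (real tables may use different naming):

```python
from datetime import datetime

def partition_id(ts, granularity):
    """Map a scheduled run timestamp to the partition to import.
    The format strings are illustrative; actual partition naming
    depends on the source table's layout."""
    fmt = {"hourly": "%Y%m%d%H", "daily": "%Y%m%d", "monthly": "%Y%m"}
    return ts.strftime(fmt[granularity])

ts = datetime(2023, 5, 10, 14)
hourly = partition_id(ts, "hourly")    # "2023051014"
daily = partition_id(ts, "daily")      # "20230510"
monthly = partition_id(ts, "monthly")  # "202305"
```

The scheduler then issues a broker load for exactly that partition, creating the target StarRocks table first if it does not exist.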
The system also provides near‑real‑time incremental consumption of lake data back into StarRocks using routine load, enabling precise synchronization of snapshot‑based changes.
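Snapshot‑based incremental consumption amounts to remembering the last snapshot consumed and reading only the files committed after it. A stand‑in sketch of that bookkeeping, assuming a simplified snapshot log (the dict shape is illustrative, not Iceberg's actual metadata model):

```python
def incremental_files(snapshots, last_snapshot_id):
    """Return data files added by snapshots newer than the last one
    consumed, in commit order. A stand-in for walking a lake table's
    snapshot log; the record shape is illustrative."""
    new = [s for s in snapshots if s["id"] > last_snapshot_id]
    files = []
    for s in sorted(new, key=lambda s: s["id"]):
        files.extend(s["added_files"])
    return files

log = [
    {"id": 1, "added_files": ["f1.parquet"]},
    {"id": 2, "added_files": ["f2.parquet", "f3.parquet"]},
    {"id": 3, "added_files": ["f4.parquet"]},
]
changed = incremental_files(log, last_snapshot_id=1)
# ["f2.parquet", "f3.parquet", "f4.parquet"]
```

Feeding only these incremental files to a routine load keeps StarRocks precisely in sync with the lake without full re-imports.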
To support hot‑cold query fusion, the query planner rewrites SQL statements based on metadata that maps hot (warehouse) and cold (lake) partitions, generating separate sub‑queries that are unioned and optimized with push‑down aggregation.
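The rewrite described above can be sketched as splitting one logical query at the hot/cold partition boundary into a warehouse branch and a lake branch joined by UNION ALL. Table names and the `dt` predicate column are illustrative, and this sketch omits the aggregation push‑down the real planner applies to each branch:

```python
def rewrite_hot_cold(select_list, warehouse_table, lake_table,
                     boundary, filter_expr=None):
    """Rewrite a query over one logical table into a UNION ALL of a hot
    (warehouse) branch and a cold (lake) branch, split at a partition
    boundary. Names and the dt column are illustrative."""
    cols = ", ".join(select_list)
    extra = f" AND ({filter_expr})" if filter_expr else ""
    hot = (f"SELECT {cols} FROM {warehouse_table} "
           f"WHERE dt >= '{boundary}'{extra}")
    cold = (f"SELECT {cols} FROM {lake_table} "
            f"WHERE dt < '{boundary}'{extra}")
    return f"{hot}\nUNION ALL\n{cold}"

sql = rewrite_hot_cold(["dt", "amount"], "dwh.orders", "iceberg.orders",
                       boundary="2023-05-03")
print(sql)
```

The boundary itself comes from the hot/cold partition metadata maintained by the cooling tasks, so the rewrite stays correct as partitions age out of the warehouse.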
Performance tests using TPC‑H show that fused queries can achieve up to three times the speed of pure lake queries.
Future work aims to decouple the lake and warehouse further, using SuperSQL to intelligently route queries to the appropriate storage layer and to support adaptive hot‑cold query planning.
The article concludes with a Q&A covering topics such as configuration complexity, the distinction between lake‑warehouse integration and monolithic solutions, cooling data query paths, and the trade‑offs between StarRocks and Flink.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.