How Apache Doris Enables Real‑Time Queries on Hudi Data Lakes
This article explains Apache Doris’s architecture, introduces the Hudi data‑lake format, compares Lambda and Kappa approaches, and details the design and implementation of Doris’s Hudi external table support, including practical steps, code examples, and future roadmap.
Overview of Apache Doris
Doris is an MPP analytical database designed for multidimensional analysis, reporting, and user‑profile analytics. It includes its own analysis and storage engines, supports vectorized execution, is MySQL‑compatible, and consists of Frontend (FE) nodes for query planning and Backend (BE) nodes for execution and storage.
Introduction to Hudi
Hudi is a next‑generation streaming data‑lake platform that provides table‑format management with ACID, MVCC, incremental reads, and support for Spark, Flink, Presto, and Trino.
Technical Background: Real‑Time Data‑Warehouse Architectures
Real‑time warehouse architectures evolved from the Lambda architecture (separate online streaming and offline batch paths) to the Kappa architecture (a single stream‑processing path), and then to data‑lake table formats such as Iceberg, Hudi, and Delta Lake, which add ACID transactions, schema evolution, and time travel.
Limitations of Existing Approaches
Lambda requires duplicate code for online and offline analytics, increasing maintenance complexity.
Kappa relies on a stream engine with weaker batch performance, and message‑queue retention limits how much historical data can be kept and reprocessed.
Data‑lake solutions improve latency but cannot serve external queries directly; they need an additional query service on top.
To bridge this gap, Doris adds native support for querying Hudi tables, enabling analysts to run federated queries across Doris and Hudi data.
Design Options for Doris‑Hudi Integration
Four possible solutions were evaluated:
1. Implement a full Hudi C++ client in BE – long development and maintenance effort.
2. Use a Thrift‑based broker to invoke the Hudi Java client – adds a broker role and incurs data transfer overhead.
3. Embed a JVM via JNI in BE to call the Hudi Java client – higher performance but requires JVM management.
4. Read Hudi Parquet base files directly with the BE Arrow Parquet C++ API, ignoring delta files – highest performance and simplest implementation.
The fourth approach was chosen for the initial release, supporting COW table Snapshot Queries and MOR table Read‑Optimized Queries.
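As a rough illustration of what reading only base files means, the sketch below picks the newest Parquet base file per file group from a file listing and skips delta logs. It is written in Python for brevity and is not Doris code: the function name `latest_base_files` is hypothetical, and the file‑name layout (`<fileId>_<writeToken>_<instantTime>.parquet`) is an assumption about Hudi's naming convention. A real reader would also consult the Hudi timeline so that only committed instants are visible.

```python
from typing import Dict, List, Tuple


def latest_base_files(paths: List[str]) -> List[str]:
    """Return the newest Parquet base file of each file group, skipping delta logs.

    Assumes base files are named <fileId>_<writeToken>_<instantTime>.parquet
    and that instant times are fixed-width timestamps, so string comparison
    orders them chronologically.
    """
    newest: Dict[str, Tuple[str, str]] = {}  # file_id -> (instant_time, path)
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if not name.endswith(".parquet"):
            continue  # delta log files and other non-base files are ignored
        parts = name[: -len(".parquet")].split("_")
        if len(parts) != 3:
            continue  # not in the assumed base-file naming layout
        file_id, _write_token, instant = parts
        if file_id not in newest or instant > newest[file_id][0]:
            newest[file_id] = (instant, path)
    return sorted(path for _, path in newest.values())
```

For a COW table this selection corresponds to the snapshot view. For a MOR table it corresponds to the read‑optimized view, which can lag behind records that still live only in delta logs.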
Implementation Steps
1. Create a Hudi External Table
Specify ENGINE=HUDI and provide Hudi‑specific properties such as the Hive Metastore URI, database, and table name. The table metadata is stored in Doris without moving any data.
CREATE TABLE example_db.t_hudi (
    column1 int,
    column2 string
) ENGINE=HUDI
PROPERTIES (
    "hudi.database" = "hudi_db",
    "hudi.table" = "hudi_table",
    "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
2. Query the Hudi External Table
During analysis, FE retrieves the table schema and file locations from Hive Metastore, then plans a HudiScanNode which:
Collects data‑file paths from the Hudi table.
Generates scan ranges based on those files.
Dispatches HudiScan tasks to BE nodes.
BE reads the Parquet files using the native C++ reader.
This flow enables Doris to query Hudi data without data duplication.
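The planning steps above can be sketched as follows. This is an illustrative Python model rather than the FE's actual implementation: `ScanRange`, `make_scan_ranges`, `assign_round_robin`, the fixed split size, and the round‑robin placement are all assumptions chosen for clarity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ScanRange:
    """A contiguous byte range of one data file, scanned by one task."""
    path: str
    offset: int
    length: int


def make_scan_ranges(files: List[Tuple[str, int]], max_bytes: int) -> List[ScanRange]:
    """Split each (path, size_in_bytes) file into ranges of at most max_bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    ranges: List[ScanRange] = []
    for path, size in files:
        offset = 0
        while offset < size:
            length = min(max_bytes, size - offset)
            ranges.append(ScanRange(path, offset, length))
            offset += length
    return ranges


def assign_round_robin(ranges: List[ScanRange], backends: List[str]) -> Dict[str, List[ScanRange]]:
    """Distribute scan ranges across backend nodes in round-robin order."""
    if not backends:
        raise ValueError("at least one backend is required")
    plan: Dict[str, List[ScanRange]] = {be: [] for be in backends}
    for i, scan_range in enumerate(ranges):
        plan[backends[i % len(backends)]].append(scan_range)
    return plan
```

A production planner would typically also weigh data locality and backend load when placing ranges; round‑robin is used here only to keep the example short.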
Future Roadmap
Current support includes COW Snapshot Queries and MOR Read‑Optimized Queries. Planned enhancements are:
Implement MOR Snapshot Queries by merging base and delta files (requires Avro support, because Hudi delta logs store records in Avro format by default).
Add Incremental Queries for both COW and MOR tables.
Develop native BE interfaces to read Hudi base and delta files (C++/Rust), eliminating the need for Java dependencies.
These efforts are ongoing in collaboration with the Hudi community.
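To make the MOR Snapshot Query item more concrete, here is a minimal Python sketch of merging base rows with delta records. It assumes a latest‑write‑wins policy keyed on a record key, with deletes represented by a `None` payload. `merge_snapshot` and the record layout are hypothetical and do not reflect Hudi's actual log format or the design Doris will adopt.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


def merge_snapshot(
    base_rows: Iterable[Row],
    delta_records: Iterable[Tuple[object, Optional[Row]]],
    key: str = "id",
) -> List[Row]:
    """Apply delta records, in commit order, on top of rows from a base file.

    Each delta record is (record_key, payload); a payload of None marks a
    delete. Later records for the same key overwrite earlier ones.
    """
    merged: Dict[object, Row] = {row[key]: row for row in base_rows}
    for record_key, payload in delta_records:
        if payload is None:
            merged.pop(record_key, None)  # delete (no-op if the key is absent)
        else:
            merged[record_key] = payload  # insert or update
    return list(merged.values())
```

The Read‑Optimized Query supported today corresponds to returning `base_rows` unchanged, which is why it can miss recent updates and deletes.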
ITPUB