How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes
This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.
1. Doris Overview
Apache Doris is a massively parallel processing (MPP) analytical database designed for multi‑dimensional analysis, reporting, and user‑profile queries. It includes its own query and storage engines, supports vectorized execution, and is compatible with the MySQL protocol.
2. Hudi Overview
Apache Hudi provides table‑format management for data lakes, offering ACID transactions, MVCC, updates, deletes, incremental reads, and integration with Spark, Flink, Presto, and Trino.
3. Technical Background: Real‑Time Data Warehousing
Traditional Lambda architecture separates batch and streaming paths, leading to duplicated code, complex maintenance, and high update costs. Kappa architecture merges the two paths into a single streaming pipeline but suffers from weaker batch performance, limited retention in the message queue, potential out‑of‑order data, and extra integration effort. Modern data‑lake table formats (Iceberg, Hudi, Delta Lake) address these issues by supporting ACID transactions, schema evolution, and time travel, yet they still need an additional query service to deliver low‑latency queries.
4. Design Principles for Doris‑Hudi Integration
Hudi is implemented in Java, while the Doris backend (BE) is written in C++, so four integration options were evaluated:
Implement a native Hudi C++ client (high effort, high maintenance).
Use a Thrift broker to forward requests to a Java client (adds broker responsibilities, incurs network overhead).
Launch a JVM via JNI inside BE to call the Java client (maintains community compatibility, good performance, but adds JVM management complexity).
Read Hudi Parquet base files directly with the existing C++ Arrow Parquet API, ignoring delta files (simplest and fastest, but limited to snapshot queries on copy‑on‑write (COW) tables and read‑optimized queries on merge‑on‑read (MOR) tables).
The fourth approach was chosen for the first release.
5. Implementation Details
Creating a Hudi external table in Doris involves specifying ENGINE=HUDI and providing Hive Metastore connection details, database, and table names. The table definition can optionally include the full Hudi schema, which must match the Hive Metastore schema.
CREATE TABLE example_db.t_hudi (
column1 int,
column2 string
) ENGINE=HUDI
PROPERTIES (
"hudi.database" = "hudi_db",
"hudi.table" = "hudi_table",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
When querying a Hudi external table, the FE retrieves metadata from the Hive Metastore, builds a HudiScanNode, generates scan ranges from the listed data files, and dispatches the scan task to BE nodes. BE uses the native Parquet reader to read the base files.
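As a rough illustration of the two points above, the sketch below first declares a second external table without a column list, relying on the statement that the schema in the table definition is optional, and then runs an ordinary aggregate query against the table defined earlier. The table name t_hudi_auto and the filter values are hypothetical and chosen only for this example. The exact syntax accepted may differ between Doris versions.

```sql
-- Assumed variant: no column list, so the schema comes from the Hive Metastore.
-- t_hudi_auto is a hypothetical name used only for illustration.
CREATE TABLE example_db.t_hudi_auto
ENGINE=HUDI
PROPERTIES (
    "hudi.database" = "hudi_db",
    "hudi.table" = "hudi_table",
    "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

-- Query the external table like any other Doris table.
-- FE plans the scan; BE reads only the Parquet base files.
SELECT column2, COUNT(*) AS row_count
FROM example_db.t_hudi
WHERE column1 > 100
GROUP BY column2
ORDER BY row_count DESC
LIMIT 10;
```

Because only base files are read in this first release, a query on a MOR table returns the read‑optimized view and does not include updates that are still held in delta files.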
6. Future Roadmap
Support MOR snapshot queries by merging base and delta files (requires a native Avro reader).
Implement incremental queries for both COW and MOR tables.
Provide a native BE interface to read Hudi delta files, collaborating with the Hudi community.
dbaplus Community