Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

DataFunSummit

Sep 7, 2022

Apache Doris is an MPP analytical database designed for multi‑dimensional analysis, reporting, and user‑profile analytics, offering a vectorized execution engine, MySQL protocol compatibility, and high‑concurrency low‑latency query services.

Hudi is a next‑generation streaming data lake platform that provides table‑format management, ACID transactions, MVCC, incremental reads, and supports engines such as Spark, Flink, Presto, and Trino.

Traditional Lambda and Kappa architectures both have drawbacks for real-time data warehouses. This has prompted the industry to adopt data-lake table formats such as Iceberg, Hudi, and Delta Lake, which bring ACID transactions, schema evolution, and time-travel capabilities.

To enable Doris to analyze Hudi data, four integration approaches were considered: (1) building a full Hudi C++ client, (2) using a Thrift broker to invoke the Hudi Java client, (3) embedding a JVM via JNI in the BE process, and (4) reading Hudi Parquet base files directly with the BE Arrow Parquet C++ API. The fourth approach was chosen for its performance. The initial implementation supports Snapshot Queries on COW (Copy-on-Write) tables and Read-Optimized Queries on MOR (Merge-on-Read) tables.
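The chosen approach works because a Read-Optimized Query only needs the newest Parquet base file of each file group. The following is a minimal sketch of that selection, not the Doris implementation. It assumes the base-file naming pattern `<fileId>_<writeToken>_<instantTime>.parquet` used by Hudi writers; the function name and the sample file names are hypothetical, and a real reader would also consult the Hudi timeline to skip files from uncommitted writes.

```python
from typing import Dict, List, Tuple


def latest_base_files(file_names: List[str]) -> List[str]:
    """Pick the newest Parquet base file for each Hudi file group.

    Assumes names look like <fileId>_<writeToken>_<instantTime>.parquet and
    that instant times are fixed-width timestamps, so string comparison
    orders them chronologically.
    """
    newest: Dict[str, Tuple[str, str]] = {}  # fileId -> (instantTime, name)
    for name in file_names:
        if not name.endswith(".parquet"):
            continue  # delta log files and metadata are ignored here
        parts = name[: -len(".parquet")].split("_")
        if len(parts) != 3:
            continue  # not a base file in the assumed naming pattern
        file_id, _write_token, instant_time = parts
        current = newest.get(file_id)
        if current is None or instant_time > current[0]:
            newest[file_id] = (instant_time, name)
    return sorted(name for _, name in newest.values())


# Hypothetical directory listing: two versions of file group fg1-0,
# one version of fg2-0, and one delta log file.
listing = [
    "fg1-0_0-1-0_20220901120000.parquet",
    "fg1-0_0-2-0_20220902120000.parquet",
    "fg2-0_0-1-0_20220901120000.parquet",
    ".fg2-0_20220901120000.log.1_0-3-0",
]
selected = latest_base_files(listing)
```

Here `selected` holds the 20220902 version of `fg1-0` and the only version of `fg2-0`; the log file is skipped, which is why MOR tables are limited to Read-Optimized Queries under this approach.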

Implementation steps:

1. Create a Hudi external table in Doris by specifying ENGINE=HUDI and providing Hive Metastore URI, database, and table names. The table metadata is stored in Doris without moving any data.

-- Variant 1: no column list; the schema comes from the Hudi table's metadata.
CREATE TABLE example_db.t_hudi ENGINE=HUDI PROPERTIES (
    "hudi.database" = "hudi_db",
    "hudi.table" = "hudi_table",
    "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

-- Variant 2: explicit column list. Both variants create the same table name,
-- so run only one of them.
CREATE TABLE example_db.t_hudi (
    column1 int,
    column2 string
) ENGINE=HUDI PROPERTIES (
    "hudi.database" = "hudi_db",
    "hudi.table" = "hudi_table",
    "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

2. Query the Hudi external table. During analysis, the FE fetches the table metadata from Hive Metastore and plans a HudiScanNode. The FE then obtains the list of data files, generates scan ranges from it, and dispatches scan tasks to the BE nodes, where each BE reads its assigned Parquet files with the native reader.

Obtain Hudi data file locations.

FE adds HudiScanNode to the fragment.

Generate scan ranges from file lists.

Dispatch scan tasks to BE.

BE reads files via native Parquet reader.
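The planning steps above can be sketched in a few lines. This is an illustration only: the source does not describe how Doris sizes scan ranges or assigns them to backends, so the fixed split size, the round-robin assignment, and all names (`ScanRange`, `make_scan_ranges`, `assign_round_robin`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScanRange:
    """A byte range of one data file that a single scan task reads."""
    path: str
    offset: int
    length: int


def make_scan_ranges(files: Dict[str, int], split_size: int) -> List[ScanRange]:
    """Cut each file (path -> size in bytes) into ranges of at most split_size."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    ranges: List[ScanRange] = []
    for path, size in files.items():
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            ranges.append(ScanRange(path, offset, length))
            offset += length
    return ranges


def assign_round_robin(
    ranges: List[ScanRange], backends: List[str]
) -> Dict[str, List[ScanRange]]:
    """Spread scan ranges over backends in turn (a stand-in scheduling policy)."""
    if not backends:
        raise ValueError("at least one backend is required")
    plan: Dict[str, List[ScanRange]] = {be: [] for be in backends}
    for i, scan_range in enumerate(ranges):
        plan[backends[i % len(backends)]].append(scan_range)
    return plan


# Hypothetical file list (sizes in bytes) and backend names.
files = {"a.parquet": 250, "b.parquet": 100}
ranges = make_scan_ranges(files, split_size=128)
plan = assign_round_robin(ranges, ["be1", "be2"])
```

With these inputs, `a.parquet` is cut into two ranges and `b.parquet` into one, and the three ranges alternate between the two backends.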

Future roadmap:

Support MOR Snapshot Queries by adding native Avro reading for delta log files.

Implement Incremental Queries for both COW and MOR tables.

Provide native C++/Rust interfaces for reading Hudi base and delta files directly in BE.
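To show why MOR Snapshot Queries need delta-file support, the sketch below merges base-file rows with delta log records by record key. It is a simplified last-write-wins model under stated assumptions: the function name, the `(key, row)` record shape, and the use of `None` as a delete marker are hypothetical, and real Hudi merging also involves payload classes and ordering fields that are not modeled here.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


def merge_on_read(
    base_rows: Iterable[Row],
    log_records: Iterable[Tuple[int, Optional[Row]]],
    key_field: str,
) -> List[Row]:
    """Apply delta log records, in order, on top of the base-file rows.

    A log record is (key, row); row=None marks a delete (an assumption
    made for this sketch).
    """
    merged: Dict[int, Row] = {row[key_field]: row for row in base_rows}
    for key, row in log_records:
        if row is None:
            merged.pop(key, None)  # delete; ignore keys that are absent
        else:
            merged[key] = row      # insert or update, last write wins
    return [merged[k] for k in sorted(merged)]


# Hypothetical base rows and delta log: update id=2, delete id=1, insert id=3.
base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
log = [(2, {"id": 2, "v": "b2"}), (1, None), (3, {"id": 3, "v": "c"})]
snapshot = merge_on_read(base, log, key_field="id")
```

A Read-Optimized Query would return only the two rows in `base`, while the merged `snapshot` reflects the update, the delete, and the insert.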

The integration of Doris with Hudi is now part of the Apache Doris community, enabling high‑performance, real‑time analytics over data‑lake tables while continuing to evolve with upcoming features.
