How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

dbaplus Community

Sep 14, 2022

1. Doris Overview

Apache Doris is an MPP (massively parallel processing) analytical database designed for multi-dimensional analysis, reporting, and user-profile queries. It ships with its own query and storage engines, supports vectorized execution, and is compatible with the MySQL protocol.

2. Hudi Overview

Apache Hudi provides table‑format management for data lakes, offering ACID transactions, MVCC, updates, deletes, incremental reads, and integration with Spark, Flink, Presto, and Trino.

3. Technical Background: Real‑Time Data Warehousing

The traditional Lambda architecture separates the batch and streaming paths, which leads to duplicated code, complex maintenance, and high update costs. The Kappa architecture merges the two paths but suffers from weaker batch performance, limited message-queue storage, potential out-of-order data, and extra integration effort. Modern data-lake table formats (Iceberg, Hudi, Delta Lake) address these issues by supporting ACID transactions, schema evolution, and time travel, yet they still require additional services for low-latency queries.

4. Design Principles for Doris‑Hudi Integration

Because Hudi is implemented in Java while the Doris backend (BE) is written in C++, four integration options were evaluated:

Implement a native Hudi C++ client (high effort, high maintenance).

Use a Thrift broker to forward requests to a Java client (adds broker responsibilities, incurs network overhead).

Launch a JVM via JNI inside BE to call the Java client (maintains community compatibility, good performance, but adds JVM management complexity).

Read Hudi Parquet base files directly with the existing C++ Arrow Parquet API, ignoring delta files. This is the simplest and fastest option, but it supports only snapshot queries on copy-on-write (COW) tables and read-optimized queries on merge-on-read (MOR) tables.

The fourth approach, reading the Parquet base files directly, was chosen for the first release.
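To make the "base files only" idea concrete, here is a minimal Python sketch of how the newest base file per file group might be picked from a partition listing while delta log files are skipped. It is an illustration, not Doris code: the function name latest_base_files is hypothetical, the file-name pattern is an assumption about Hudi's base-file naming (fileId, write token, instant time), and it assumes all instant times have the same width so they can be compared as strings. It also ignores the commit timeline, which a real reader would consult to exclude uncommitted files.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed base-file naming: <fileId>_<writeToken>_<instantTime>.parquet
# Delta log files are assumed to start with "." and are therefore not matched.
BASE_FILE = re.compile(
    r"^(?P<file_id>[^_.][^_]*)_(?P<token>[^_]+)_(?P<instant>\d+)\.parquet$"
)


def latest_base_files(file_names: Iterable[str]) -> Dict[str, str]:
    """Return the newest base file name for each file group (hypothetical helper)."""
    newest: Dict[str, Tuple[str, str]] = {}
    for name in file_names:
        match = BASE_FILE.match(name)
        if match is None:
            continue  # delta logs and metadata files are ignored
        file_id, instant = match["file_id"], match["instant"]
        # Fixed-width instant strings sort chronologically.
        if file_id not in newest or instant > newest[file_id][0]:
            newest[file_id] = (instant, name)
    return {file_id: name for file_id, (_, name) in newest.items()}
```

On a COW table this selection yields the latest snapshot. On a MOR table it yields the read-optimized view, because any updates still sitting in delta logs are not visible.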

5. Implementation Details

Creating a Hudi external table in Doris involves specifying ENGINE=HUDI and providing Hive Metastore connection details, database, and table names. The table definition can optionally include the full Hudi schema, which must match the Hive Metastore schema.

CREATE TABLE example_db.t_hudi (
    column1 int,
    column2 string
) ENGINE=HUDI
PROPERTIES (
    "hudi.database" = "hudi_db",
    "hudi.table" = "hudi_table",
    "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

When a Hudi external table is queried, the FE (frontend) retrieves the table metadata from the Hive Metastore, builds a HudiScanNode, generates scan ranges from the listed data files, and dispatches the scan tasks to the BE nodes. Each BE reads the base files with its native Parquet reader.
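The planning step can be pictured with a small Python sketch. It assumes a scan range is simply a byte range of one file and that ranges are handed out round-robin. The article does not describe Doris's actual split size or scheduling policy, so ScanRange, build_scan_ranges, and assign_round_robin are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ScanRange:
    """A byte range of one data file (hypothetical structure)."""
    path: str
    offset: int
    length: int


def build_scan_ranges(files: List[Tuple[str, int]], split_size: int) -> List[ScanRange]:
    """Cut each (path, size_in_bytes) file into ranges of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    ranges: List[ScanRange] = []
    for path, size in files:
        for offset in range(0, size, split_size):
            ranges.append(ScanRange(path, offset, min(split_size, size - offset)))
    return ranges


def assign_round_robin(ranges: List[ScanRange], backends: List[str]) -> Dict[str, List[ScanRange]]:
    """Distribute scan ranges over backend nodes in round-robin order."""
    if not backends:
        raise ValueError("at least one backend is required")
    plan: Dict[str, List[ScanRange]] = {be: [] for be in backends}
    for index, scan_range in enumerate(ranges):
        plan[backends[index % len(backends)]].append(scan_range)
    return plan
```

For example, a 250-byte file split at 100 bytes produces three ranges, and two backends would receive them alternately. A real planner would more likely align ranges with Parquet row groups and take data locality into account.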

6. Future Roadmap

Support MOR snapshot queries by merging base and delta files (requires a native Avro reader).

Implement incremental queries for both COW and MOR tables.

Provide a native BE interface to read Hudi delta files, collaborating with the Hudi community.
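The first roadmap item, MOR snapshot queries, amounts to overlaying delta records on top of base-file rows. The Python sketch below shows that idea under strong simplifying assumptions: records are plain dictionaries keyed by a hypothetical "id" field, a delta record is a (key, commit_time, row) tuple in which a row of None means delete, and the latest commit wins. Hudi's real merge semantics are richer than this, and merge_snapshot is an illustrative name rather than an existing API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
# Hypothetical delta record: (record_key, commit_time, row or None for a delete)
LogRecord = Tuple[str, str, Optional[Row]]


def merge_snapshot(
    base_rows: Iterable[Row],
    log_records: Iterable[LogRecord],
    key_field: str = "id",
) -> List[Row]:
    """Overlay delta records on base rows; the latest commit wins."""
    merged: Dict[str, Row] = {str(row[key_field]): row for row in base_rows}
    # Apply delta records in commit order (the sort is stable for equal times).
    for key, _commit_time, row in sorted(log_records, key=lambda rec: rec[1]):
        if row is None:
            merged.pop(key, None)  # delete marker
        else:
            merged[key] = row  # insert or update
    return [merged[key] for key in sorted(merged)]
```

A read-optimized query corresponds to returning base_rows unchanged, which is why the first release can skip delta files entirely at the cost of data freshness on MOR tables.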
