Big Data 12 min read

Integrating Apache Doris with Hudi: Design, Implementation, and Future Plans

This article introduces Apache Doris, an MPP analytical database, and explains how it integrates with the Hudi data lake format, covering architectural features, design choices, implementation steps including external table creation and query processing, and outlines future enhancements for supporting MOR snapshots and incremental queries.

DataFunTalk

Jul 18, 2022

Integrating Apache Doris with Hudi: Design, Implementation, and Future Plans

Guest speaker Du Junling, a Big Data Engineer at ByteDance, presented on the DataFunTalk platform.

Overview of Doris : Doris is an MPP analytical database designed for multi‑dimensional analysis, reporting, and user profiling. It includes its own analysis and storage engines, supports vectorized execution, does not depend on external components, and is compatible with the MySQL protocol.

Well‑designed architecture enabling high‑concurrency, low‑latency queries and high‑throughput interactive analysis.

Supports batch and streaming data loads, updates, and deletes with unique/aggregate data models.

Provides high availability, fault tolerance, and elastic scaling (FE leader/follower failover).

Supports aggregate tables and materialized views with dynamic updates.

MySQL‑compatible protocol for easy client integration.

Doris consists of Frontend (FE) nodes handling request parsing, optimization, planning, metadata management, and Backend (BE) nodes executing plans and storing data.

Introduction to Hudi : Hudi is a next‑generation streaming data‑lake platform offering table‑format management, ACID transactions, MVCC, updates/deletes, incremental reads, and compatibility with Spark, Flink, Presto, Trino, etc.

Hudi defines two table types (Copy‑On‑Write and Merge‑On‑Read) and three query types (Snapshot, Read‑Optimized, Incremental).

Technical background : Real‑time data‑warehouse requirements have evolved from Lambda architecture (separate batch and streaming paths) to Kappa architecture (single path) and finally to data‑lake‑based solutions (Iceberg, Hudi, DeltaLake) that provide ACID, time‑travel, and schema evolution.

To bridge Doris and Hudi for federated analysis, a feature was developed to query Hudi data from Doris.

Design principles : Because Hudi is Java‑based and Doris BE is C++, three integration approaches were considered:

Implement a full Hudi C++ client (long development cycle, high maintenance).

Use a Thrift broker to forward requests to a Java client (adds broker responsibilities, incurs data transfer overhead).

Embed a JVM via JNI in BE to call the Java client (maintains Java logic, good performance, but adds JVM management complexity).

Read Hudi Parquet base files directly with the BE Arrow Parquet C++ API (highest performance, but initially only supports base files, not delta files).

The fourth approach was chosen for the first release, supporting Snapshot Queries for Copy‑On‑Write tables and Read‑Optimized Queries for Merge‑On‑Read tables.

Implementation steps :

Create a Hudi external table in Doris by specifying ENGINE=HUDI and providing Hive Metastore URI, database, and table names. The table metadata is stored in Doris without moving any data. Schema can be fully or partially specified, matching the Hive Metastore definition.

CREATE TABLE example_db.t_hudi ENGINE=HUDI
PROPERTIES (
  "hudi.database" = "hudi_db",
  "hudi.table" = "hudi_table",
  "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

CREATE TABLE example_db.t_hudi (
  column1 int,
  column2 string
) ENGINE=HUDI
PROPERTIES (
  "hudi.database" = "hudi_db",
  "hudi.table" = "hudi_table",
  "hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

Query the Hudi external table: during analysis, the FE retrieves the Hive Metastore address and schema, then plans a HudiScanNode which obtains the list of data files, generates scan ranges, dispatches scan tasks to BE nodes, and BE reads the Parquet files via the native reader.

Obtain Hudi table data locations.

FE adds HudiScanNode to the fragment.

Generate scan ranges from the file list.

Dispatch scan tasks to BE.

BE reads files using the native Parquet reader.

Future roadmap :

Support Snapshot Queries for Merge‑On‑Read tables (requires native reading of delta files in Avro format).

Implement Incremental Queries for both Copy‑On‑Write and Merge‑On‑Read tables.

Develop native C++/Rust interfaces for reading Hudi base and delta files directly in BE.

The presentation concluded with thanks and a reminder to like, share, and follow the content.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL real-time analytics Data Lake Hudi Apache Doris

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.