Big Data 12 min read

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Big Data Technology & Architecture

Sep 2, 2020

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi (Hadoop Updates and Incrementals) is an open‑source data‑lake solution originally developed by Uber. It enables ingest, management and incremental processing of large analytical datasets stored on HDFS‑compatible file systems such as HDFS or S3.

Ingests and manages large analytical datasets on HDFS, reducing ingestion latency.

Built on Spark to perform updates, inserts, and deletes on HDFS data.

Provides stream primitives: upserts (changing datasets) and incremental pulls (fetching changed data).

Supports insert/update operations on Parquet files in HDFS.

Integrates with the Hadoop ecosystem (Spark, Hive, Parquet) via custom InputFormat.

Offers Savepoint functionality for data recovery.

Currently supports Spark 2.x, recommended Spark 2.4.4+.

Basic Architecture

Unlike Kudu, which is an OLTP‑oriented storage system, Hudi is designed for Hadoop‑compatible file systems (HDFS, S3) and relies heavily on Spark for incremental processing and rich query capabilities. It can combine batch and streaming by acting as both source and sink in a pipeline, storing data in distributed file systems.

Hudi works together with query engines such as Hive, Spark and Presto, exposing its own tables that can be queried for snapshot, incremental, and read‑optimized queries.

Timeline

Each Hudi table maintains a Timeline composed of Instant objects. An Instant represents an operation on the table and includes an Instant Action (e.g., COMMIT, CLEAN, DELTA_COMMIT, COMPACTION, ROLLBACK, SAVEPOINT), an Instant Time (monotonically increasing timestamp), and an Instant State (REQUESTED, INFLIGHT, COMPLETED).

Example: Between 10:00 and 10:20, an upsert runs every five minutes, generating a series of COMMIT instants. When conditions are met, CLEAN and COMPACTION actions are triggered in the background (e.g., CLEAN at 10:05, COMPACTION at 10:10).

Data may experience latency between generation and arrival in Hudi; the Timeline allows efficient incremental queries by scanning only the files that changed after a given time, avoiding full scans of older buckets.

Files and Index

Hudi organizes a table under a base path on HDFS, partitioned into directories (partition paths). Each partition contains file groups identified by a unique file ID. A file group holds file slices, each consisting of a base Parquet file and zero or more log files (*.log.*) that capture inserts/updates after the base file was created.

Hudi uses an MVCC design: COMPACTION merges log files into the base file, while CLEAN removes obsolete file slices. Records are indexed by a Hoodie Key (record key + partition path) that maps to a file group ID.

Hudi Table Types

Hudi supports two table types:

Copy‑On‑Write (COW) : Stores data only in columnar files (Parquet). Updates create new versions of file slices; COMPACTION produces new base files. This can cause write amplification because each update rewrites the entire base file.

Merge‑On‑Read (MOR) : Stores data in both columnar (Parquet) base files and row‑oriented log files (Avro). Updates are written to log files; COMPACTION merges logs into new base files. MOR balances read and write amplification.

In COW tables, each INSERT or UPDATE generates a new file slice; queries can filter by COMMIT timestamps to retrieve the appropriate version. In MOR tables, log files are merged during COMPACTION, and queries can be Snapshot (latest view), Incremental (new data after a commit), or Read‑Optimized (only base files).

Hudi Query Types

Snapshot Query : Returns the latest snapshot of data as of the most recent COMMIT or COMPACTION. For COW tables it reads Parquet files; for MOR tables it merges base and log files on‑the‑fly.

Incremental Query : Returns only the data written after a specified COMMIT/COMPACTION, i.e., the newest records.

Read Optimized Query : Returns the latest data that existed before a given COMMIT/COMPACTION, reading only the base Parquet files.

Engine Support Matrix

External query engines such as Hive, Spark SQL and Presto can query Hudi tables directly. Both COW and MOR tables support Snapshot, Incremental and Read‑Optimized queries. The following matrices illustrate the capabilities (images omitted for brevity).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Hive Data Lake Spark Apache Hudi Incremental Processing Query Types

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.