Big Data 21 min read

An Introduction to Apache Hudi: Concepts, Design Principles, and Architecture

This article introduces Apache Hudi, explaining its core concepts, design principles, table architecture, write and compaction mechanisms, and the three query modes that enable efficient batch and incremental processing on modern data lakes.

Big Data Technology & Architecture

Oct 21, 2020

An Introduction to Apache Hudi: Concepts, Design Principles, and Architecture

1. Introduction

Apache Hudi (Hudi) enables storage of massive datasets on top of Hadoop‑compatible storage while offering two primitives that extend traditional batch processing to streaming on a data lake.

The two primitives are:

1) Update/Delete records : Hudi uses fine‑grained file/record‑level indexes to support updates and deletes and provides transactional guarantees for writes. Queries read the latest committed snapshot.

2) Change streams : Hudi can generate an incremental stream of all records that were inserted, updated, or deleted since a given point in time, unlocking new query patterns.

These primitives tightly couple to provide DFS‑based streaming and incremental processing capabilities similar to consuming events from a Kafka topic and maintaining state.

Architectural advantages include:

1) Efficiency gains : Record‑level updates allow Hudi to rewrite only changed records instead of whole partitions, reducing compute resources dramatically.

2) Faster ETL pipelines : Incremental queries let downstream Spark/Hive jobs process only new changes, accelerating data pipelines.

3) Fresh data delivery : By avoiding full‑table rewrites, pipelines finish faster and deliver near‑real‑time data.

4) Unified storage : Real‑time access eliminates the need for separate serving stores or data marts.

2. Design Principles

2.1 Streaming Read/Write

Hudi is built from the ground up for large‑scale ingest and egress, borrowing database indexing ideas to map record keys to file locations and to track record‑level metadata for precise incremental streams.

2.2 Self‑Management

Hudi balances write freshness and read performance by offering three query types (real‑time snapshot, incremental, and read‑optimized) and automatically optimizes parallelism, file sizing, and rollback handling.

2.3 Log‑Structured Storage

Hudi adopts an append‑only, cloud‑friendly log‑structured design that works seamlessly across cloud providers.

2.4 Key‑Value Data Model

Each record has a unique key (optionally with a partition path), which reduces the search space for index lookups.

3. Table Design

Hudi tables consist of three main components:

1) An ordered timeline of metadata, similar to a database transaction log.

2) Hierarchical data files (base files) that store the actual records.

3) Indexes (multiple implementations) that map keys to file IDs.

Key features include upsert support, efficient incremental scans, atomic commits with rollback, MVCC‑style snapshot isolation, file‑size management, self‑compacting updates, and GDPR‑compliant delete capabilities.

3.1 Timeline

The timeline records every instant (operation) performed on the dataset, providing an instant view of the table and supporting ordered retrieval. Instants are stored as .hoodie metadata files, with the latest instant kept in a single file and older ones archived.

Instant components:

• Operation type (e.g., COMMIT, CLEAN, DELTA_COMMIT, COMPACTION, ROLLBACK, SAVEPOINT)

• Timestamp (monotonically increasing)

• State (REQUESTED, INFLIGHT, COMPLETED)

3.2 Data Files

Data is organized into partitions, file groups, and file slices. Each slice contains a base file (Parquet) and zero or more log files that capture updates. Hudi uses MVCC: compaction merges log files into new base files, while cleaning removes obsolete slices.

3.3 Indexes

Hudi provides three index implementations (HBaseIndex, HoodieBloomIndex, InMemoryHashIndex) that map a record key + partition path to a file ID, enabling fast upserts without full table scans. Global indexes ignore partition paths, while non‑global indexes require them, offering better scalability.

4. Write Models

4.1 Copy‑On‑Write (COW) Table

Writes go directly to new Parquet base files; no log files are created. Updates rewrite the entire file slice, while inserts are batched into files that respect the configured maximum size.

4.2 Merge‑On‑Read (MOR) Table

Writes first create log files; a later compaction operation merges logs with base files. MOR supports multiple query types (snapshot, incremental, read‑optimized) and allows fine‑grained control over log file size.

5. Write Operations

5.1 Upsert, Insert, Bulk Insert

• Upsert : Default operation; uses the index to decide insert vs. update, then batches records for optimal file sizing.

• Insert : Skips index lookup for faster ingestion when duplicate keys are acceptable.

• Bulk Insert : Sort‑based writer that scales to hundreds of terabytes for initial dataset loading, focusing on file‑size optimization rather than transactional guarantees.

5.2 Compaction

Compaction merges log files into base files for MOR tables. It can be synchronous (blocking the next write) or asynchronous (running in parallel, providing near‑real‑time freshness).

5.3 Cleaning

Cleaning removes obsolete file slices to reclaim storage. It can be configured by number of commits/delta‑commits retained or by the number of file slices per group.

5.4 DFS Access Optimizations

Hudi mitigates small‑file problems by profiling workloads, reusing existing file groups, caching timeline information, and allowing configurable ratios between base and log file sizes.

6. Query Modes

Depending on the table type, Hudi supports three query patterns:

6.1 Snapshot Query

Returns the latest view of the table after a given commit or delta‑commit. For MOR tables, it merges base and log files on‑the‑fly to provide near‑real‑time snapshots.

6.2 Incremental Query

Returns only the records that changed since a specified commit/delta‑commit, enabling efficient change‑data‑capture pipelines.

6.3 Read‑Optimized Query

Exposes only the latest base files (no log merging), delivering performance comparable to a native Parquet table.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data ETL Data Lake Spark Apache Hudi Incremental Processing

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.