
Apache Hudi Core Concepts: Timeline, Indexes, Table Types & Queries

This article explains Apache Hudi’s core architecture, detailing the timeline mechanism, file layout, indexing strategies, the two main table types (Copy‑On‑Write and Merge‑On‑Read), and various query modes such as snapshot, time‑travel, read‑optimized and incremental queries.

JD Cloud Developers

Overview

Apache Hudi (version 1.0) defines how data is stored and how write and read operations are performed on tables. It introduces widely adopted table types that let users balance trade‑offs based on workload characteristics.

Timeline

The Hudi timeline records every action performed on a table, such as commits, compactions, and cleans. It consists of a series of instants, each identified by an action type, a timestamp, and a state (REQUESTED, INFLIGHT, or COMPLETED). The timeline enables features such as time‑travel queries and incremental reads.
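To make the state model concrete, here is a minimal Python sketch of a timeline as a list of instants. It is a toy model, not Hudi's API: the `Instant` and `State` classes, the `completed_instants` helper, and the short timestamps are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    INFLIGHT = "inflight"
    COMPLETED = "completed"


@dataclass(frozen=True)
class Instant:
    time: str      # sortable timestamp string (shortened here for readability)
    action: str    # e.g. "commit", "compaction", "clean"
    state: State


def completed_instants(timeline):
    """Return completed instants in time order; anything else is ignored."""
    done = [i for i in timeline if i.state is State.COMPLETED]
    return sorted(done, key=lambda i: i.time)


timeline = [
    Instant("001", "commit", State.COMPLETED),
    Instant("003", "commit", State.INFLIGHT),    # still being written
    Instant("002", "clean", State.COMPLETED),
]
visible = completed_instants(timeline)
print([(i.time, i.action) for i in visible])
```

In this model, filtering on COMPLETED is what keeps an in‑flight write invisible to readers. A time‑travel read would pick an upper bound in the sorted list, and an incremental read a lower bound.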

File Layout

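Hudi organizes the data in each partition into file groups. A file group holds one or more file slices; each slice consists of a base file (a snapshot of the group's records as of some commit) and, for Merge‑On‑Read tables, the log files that capture row‑level changes made since that base file was written.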
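The layout can be pictured with a small data model. The sketch below assumes that a file group keeps successive versions as file slices, each pairing one base file with optional log files; the class names and file names are hypothetical and do not follow Hudi's real naming scheme.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    instant_time: str                 # instant that produced the base file
    base_file: str
    log_files: list = field(default_factory=list)  # used by MOR tables only


@dataclass
class FileGroup:
    file_id: str
    slices: list = field(default_factory=list)

    def latest_slice(self):
        """Newest slice by instant time, or None for an empty group."""
        return max(self.slices, key=lambda s: s.instant_time, default=None)


group = FileGroup("fg-1")
group.slices.append(FileSlice("001", "fg-1_001.parquet"))
group.slices.append(FileSlice("002", "fg-1_002.parquet", ["fg-1_002.log.1"]))

latest = group.latest_slice()
print(latest.base_file, latest.log_files)
```

Queries normally care only about the latest slice of each group; older slices stay around to serve time‑travel reads until cleaning removes them.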

Indexes

Hudi provides several index types to speed up record location:

Bloom filter index

Record index

Column (or expression) index

Secondary index

Indexes can be global (covering the whole table) or non‑global (limited to a partition).
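As an illustration of how an index narrows down where a record might live, the following sketch implements a tiny Bloom filter per file and uses it to pick candidate files for a key. It is a simplified stand‑in, not Hudi's implementation: the `BloomFilter` class, its sizes, and the `candidate_files` helper are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def candidate_files(filters, key):
    """Files whose filter cannot rule the key out."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


filters = {"file_a": BloomFilter(), "file_b": BloomFilter()}
filters["file_a"].add("user-1")
filters["file_b"].add("user-2")
print(candidate_files(filters, "user-1"))
```

A lookup for `user-1` always includes `file_a`. It may occasionally include `file_b` as well, because a Bloom filter can give false positives but never false negatives, so a writer still has to check the candidate files for the actual key.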

Table Types

Copy‑On‑Write (COW)

COW tables are optimized for read‑heavy workloads. When a record is updated or deleted, Hudi writes a new base file for the affected file group; no log files are written. Every query therefore reads only base files and never has to merge log files on the fly, which gives high read performance.

Because the entire file group may be rewritten on each write, write latency can be higher, especially when only a few records change.
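The write amplification described above can be seen in a toy model. In this sketch a file group is a list of base‑file versions, each a plain dictionary of records; `cow_upsert` is a hypothetical helper, not a Hudi function.

```python
def cow_upsert(file_group, updates, instant_time):
    """Copy-on-write: apply updates to a copy of the latest base file.

    file_group: list of (instant_time, records) base-file versions.
    updates: dict mapping key -> new row, or None to delete the key.
    """
    latest = dict(file_group[-1][1]) if file_group else {}
    for key, row in updates.items():
        if row is None:
            latest.pop(key, None)
        else:
            latest[key] = row
    # The whole file is written again, even if only one row changed.
    file_group.append((instant_time, latest))
    return file_group


group = [("001", {"a": 1, "b": 2, "c": 3})]
cow_upsert(group, {"b": 20}, "002")

rows_written = len(group[-1][1])
print(rows_written)  # 3: every row was rewritten to change one
```

The earlier version at instant `001` is left untouched, which is what lets readers keep using it until the new version is committed.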

Illustration of the COW workflow:

[Figure: COW timeline diagram]

Each commit creates a new file slice in the file groups it touches; queries see only the latest committed slices, which prevents exposure to in‑flight writes.

The COW design aims to improve table management in three ways:

Atomic file‑level updates instead of rewriting whole tables or partitions.

Incremental consumption of changed data, avoiding unnecessary scans.

Strict file‑size control to maintain query performance.

Merge‑On‑Read (MOR)

MOR tables balance write and read performance. Updates and deletes are written to lightweight log files, either in a row‑based format such as Avro or in the columnar base‑file format, and periodic compaction merges those log files into new base files. During query execution, changes in the log files are merged with the base files on the fly, which gives lower write latency and near‑real‑time data availability.

The trade‑off is that query performance can vary depending on whether log files have been compacted.

[Figure: MOR workflow diagram]

MOR tables support two query modes:

Read Optimized Query: reads only base files, so it is fast but may miss changes that have not been compacted yet; suitable for batch workloads.

Snapshot Query: merges log files with base files to provide the latest view of the data.
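The difference between the two modes, and the effect of compaction, can be sketched with a toy file group. The `MorFileGroup` class below is hypothetical and keeps records in dictionaries; it only mirrors the behaviour described in this section.

```python
class MorFileGroup:
    def __init__(self, base):
        self.base = dict(base)   # stands in for the columnar base file
        self.log = []            # stands in for appended log files

    def upsert(self, key, row):
        self.log.append((key, row))      # cheap append, no base rewrite

    def delete(self, key):
        self.log.append((key, None))     # None marks a delete

    def read_optimized(self):
        return dict(self.base)           # base file only, may be stale

    def snapshot(self):
        merged = dict(self.base)
        for key, row in self.log:        # replay the log in write order
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def compact(self):
        self.base = self.snapshot()      # fold the log into a new base file
        self.log = []


fg = MorFileGroup({"a": 1, "b": 2})
fg.upsert("b", 20)
fg.delete("a")

stale = fg.read_optimized()
fresh = fg.snapshot()
fg.compact()
after_compaction = fg.read_optimized()
print(stale, fresh, after_compaction)
```

Before compaction the read‑optimized view lags behind the snapshot view; after compaction the two agree, and the snapshot read no longer has any log entries to merge.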

Query Types

Snapshot Queries: Return the latest committed snapshot of the table.

Time Travel Queries: Access the table state at a specific past instant, useful for reproducible ML experiments.

Read Optimized Queries (MOR tables only): Use only columnar base files for fast snapshot reads.

Incremental Queries (Latest State): Return rows changed since a given instant.

Incremental Queries (CDC): Provide change‑data‑capture streams with before/after images of each record.
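With Spark, the query type is usually selected through reader options. The helper below is a sketch that builds an options dictionary for each type; the option keys are the ones I recall from Hudi's Spark datasource documentation and should be checked against the Hudi release in use, while the `read_options` function and its mode names (`time_travel`, `cdc`) are hypothetical labels for this example.

```python
def read_options(mode, instant=None):
    """Build Spark reader options for one of the query types listed above."""
    if mode == "snapshot":
        return {"hoodie.datasource.query.type": "snapshot"}
    if mode == "time_travel":
        # Read the table as of a past instant.
        return {"as.of.instant": instant}
    if mode == "read_optimized":
        return {"hoodie.datasource.query.type": "read_optimized"}
    if mode in ("incremental", "cdc"):
        opts = {
            "hoodie.datasource.query.type": "incremental",
            "hoodie.datasource.read.begin.instanttime": instant,
        }
        if mode == "cdc":
            # Assumes CDC logging was enabled on the table when it was written.
            opts["hoodie.datasource.query.incremental.format"] = "cdc"
        return opts
    raise ValueError(f"unknown query mode: {mode}")


incremental = read_options("incremental", "20240101000000")
print(incremental)

# Usage with an existing SparkSession `spark` and a table path (not run here):
# df = spark.read.format("hudi").options(**incremental).load("/path/to/table")
```

Keeping the options in one function makes it easy to switch the same job between a full snapshot read and an incremental read without touching the rest of the pipeline.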

[Figure: Query type comparison]

Trade‑offs Between Table Types

COW offers superior read performance at the cost of slower writes, while MOR provides faster writes and near‑real‑time visibility but may incur variable read latency depending on log compaction.
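In practice the choice comes down to a single writer option. The sketch below picks a table type from a rough workload hint; the option keys follow Hudi's Spark datasource configuration as I understand it, while `write_options`, the `read_heavy` flag, and the table and field names are hypothetical.

```python
def write_options(table_name, record_key, read_heavy):
    """Build Spark writer options, choosing the table type from the workload."""
    table_type = "COPY_ON_WRITE" if read_heavy else "MERGE_ON_READ"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
    }


reporting = write_options("daily_report", "order_id", read_heavy=True)
streaming = write_options("click_events", "event_id", read_heavy=False)
print(reporting["hoodie.datasource.write.table.type"])
print(streaming["hoodie.datasource.write.table.type"])

# Usage with an existing DataFrame `df` (not run here):
# df.write.format("hudi").options(**streaming).mode("append").save("/path/to/table")
```

A boolean flag is a deliberate simplification; a real decision would also weigh update frequency, acceptable read latency, and how often compaction can run.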
