
Understanding Apache Hudi Core Concepts: Timeline, File Layout, and Table Types

This article explains Apache Hudi's architecture, including its timeline mechanism, file layout, indexing strategies, table types (COW and MOR), query options, storage format versioning, backward compatibility, and key configuration settings for managing data lake tables.

JD Tech Talk

Hudi Architecture

Apache Hudi is a data lake storage framework that organizes tables, files, and metadata to support incremental processing and efficient queries.

1. Timeline

1.1 Timeline concept

1.2 Components of the Hudi timeline

1.3 Instant action types on the timeline

1.4 State types on the timeline

1.5 Example of a timeline

2. File Layout

Hudi stores each table under a base path in a directory hierarchy. A table can be partitioned by its schema-defined partition columns. Within each partition, files are organized into file groups, each identified by a UUID. Each file group contains multiple file slices. A file slice consists of a base file (parquet, orc, or hfile) written by a commit at a specific instant, plus the set of log files (.log) written by later commits until the next base file is produced. Hudi uses Multi-Version Concurrency Control (MVCC): compaction merges a slice's log files with its base file to produce a new slice, and cleaning removes unused or old slices to reclaim space. All metadata, including the timeline and the metadata table, resides in a special .hoodie directory under the base path.
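
The relationship between file groups, file slices, compaction, and cleaning can be sketched as a small in-memory model. This is a toy illustration, not Hudi's implementation: the class names, method names, and file names below are hypothetical and do not follow Hudi's real naming scheme.

```python
from dataclasses import dataclass, field
from typing import List
import uuid


@dataclass
class FileSlice:
    base_instant: str                 # instant of the commit that wrote the base file
    base_file: str                    # illustrative file name only
    log_files: List[str] = field(default_factory=list)


@dataclass
class FileGroup:
    file_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    slices: List[FileSlice] = field(default_factory=list)  # oldest first

    def write_base(self, instant: str) -> FileSlice:
        """Start a new slice with a fresh base file."""
        new_slice = FileSlice(instant, f"{self.file_id}_{instant}.parquet")
        self.slices.append(new_slice)
        return new_slice

    def append_log(self, instant: str) -> None:
        """Attach a log file written at `instant` to the latest slice."""
        self.slices[-1].log_files.append(f"{self.file_id}_{instant}.log")

    def compact(self, instant: str) -> FileSlice:
        """Model compaction: base + logs of the latest slice become a new slice."""
        return self.write_base(instant)

    def clean(self, retain: int) -> List[FileSlice]:
        """Keep only the newest `retain` slices and return the removed ones."""
        if retain < 1:
            raise ValueError("retain must be at least 1")
        removed = self.slices[:-retain]
        self.slices = self.slices[-retain:]
        return removed


group = FileGroup()
group.write_base("001")      # first base file
group.append_log("002")      # two later commits land in log files
group.append_log("003")
group.compact("004")         # compaction produces a new slice
removed = group.clean(retain=1)
print(len(group.slices), len(removed))
```

Older slices stay readable until cleaning removes them, which is what lets readers keep a consistent view while writers and compaction move the file group forward.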

3. Index

3.1 Introduction

3.2 Comparison with Hive without indexes

3.3 Hudi index types

3.4 Global vs. non‑global indexes

4. Table Types

4.1 Copy on Write (COW) table

Concept

Working principle

Management improvements over traditional tables

4.2 Merge on Read (MOR) table

Concept

Working principle

4.3 Summary of trade‑offs between COW and MOR

5. Query Types

Snapshot Queries

Incremental Queries

Read Optimized Queries

6. Storage File Organization

Below is the general organization of Hudi table storage files.

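1. Base files

Columnar formats for vectorized reads, columnar compression, and efficient column-based access, suited to analytics and data science
Row-oriented Avro files for fast scans that read entire records
Random-access-optimized HFiles (based on the SSTable format) for efficient lookup of indexed records

2. Log files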

Log files store the incremental changes (updates, inserts, deletes) made to a base file after its creation. They contain blocks, such as data, command, and delete blocks, each of which encodes a specific kind of change. Data blocks encode updates and inserts, and their format can be chosen to suit different needs:

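Row-oriented Avro files for fast, lightweight writes
Random-access-optimized HFiles (based on the SSTable format) for efficient lookup of indexed records
Columnar Parquet files for vectorized log merging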
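
To make the role of the log blocks described above more concrete, here is a minimal sketch of how a reader might apply data and delete blocks on top of a base file. It is a simplified model under stated assumptions: records are plain dictionaries keyed by a record key, command blocks are left out, and the names DataBlock, DeleteBlock, and merge_slice are hypothetical rather than Hudi APIs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Union


@dataclass
class DataBlock:
    records: Dict[str, dict]   # record key -> inserted or updated row


@dataclass
class DeleteBlock:
    keys: List[str]            # record keys to remove


LogBlock = Union[DataBlock, DeleteBlock]


def merge_slice(base: Dict[str, dict], log_blocks: Iterable[LogBlock]) -> Dict[str, dict]:
    """Apply log blocks, in write order, on top of the base file's records."""
    merged = dict(base)                      # never mutate the base file contents
    for block in log_blocks:
        if isinstance(block, DataBlock):
            merged.update(block.records)     # later writes win for the same key
        else:
            for key in block.keys:
                merged.pop(key, None)        # deleting a missing key is a no-op
    return merged


base_records = {"a": {"v": 1}, "b": {"v": 1}}
blocks = [
    DataBlock({"a": {"v": 2}, "c": {"v": 1}}),   # update "a", insert "c"
    DeleteBlock(["b"]),                          # delete "b"
]
print(merge_slice(base_records, blocks))
```

Because the blocks are applied in order, the same function models both a read that merges logs on the fly and a compaction that writes the merged result out as a new base file.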

7. Storage Format Version Control

Elements of Hudi’s storage format (e.g., log format, log block structure, timeline files, schema) are versioned and tied to a monotonically increasing table version number. The version increments whenever a change occurs in the storage layout.

Backwards Compatible Reading

Hudi ensures backward compatibility so that newer software versions can read tables written in recent earlier table versions. The recommended upgrade path is to upgrade all readers first (for example, interactive query engines), then upgrade writers and table services. Hudi can also upgrade a table automatically during a subsequent write, so version upgrades require no downtime.

Backwards Compatible Writing

In complex pipelines where components act as both readers and writers, an upgrade may have to start with the downstream jobs and then progress upstream. Hudi therefore lets new binaries keep writing in the most recent older table version, so they can be rolled out across the whole deployment before the table is finally upgraded to the newer table version, with readers adapting dynamically.
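
The downstream-first rollout order can be derived mechanically from the pipeline's dependency graph. The sketch below uses Python's standard graphlib module (Python 3.9+); the job names and the upgrade_order helper are hypothetical and only illustrate the ordering rule, not any Hudi tooling.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def upgrade_order(consumers: Dict[str, Set[str]]) -> List[str]:
    """Order jobs so that each job comes after all of its downstream consumers.

    `consumers` maps a job name to the set of jobs that read its output.
    """
    # TopologicalSorter treats the mapped values as prerequisites, so passing
    # "job -> its consumers" yields downstream jobs before upstream ones.
    return list(TopologicalSorter(consumers).static_order())


# Hypothetical three-stage pipeline: ingest -> enrich -> report
pipeline = {
    "ingest": {"enrich"},
    "enrich": {"report"},
    "report": set(),
}
print(upgrade_order(pipeline))
```

Each job in the returned list can be moved to the new binaries once every job before it has been upgraded, which keeps every reader at least as new as the writers it consumes from.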

8. Configuration

hoodie.write.table.version (default: latest, optional) – The table version in which the writer stores the table. If the table already exists, this should match its current table version. Set it to a lower version to keep writing in the older format while an upgrade is being rolled out.

hoodie.write.auto.upgrade (default: true, optional) – When enabled, the writer automatically migrates the table to the specified version if the current version is lower.
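
How the two settings interact can be sketched as a small decision function. This is a toy model of the rules stated above, not Hudi's code: the function name, the version numbers 6 and 8, and the choice to reject any other version mismatch with an error are assumptions made for illustration.

```python
from typing import Dict


def plan_table_version(current: int, latest: int, options: Dict[str, str]) -> int:
    """Return the table version the writer ends up writing, in this toy model."""
    # Default "latest": an unset option means the newest version the binary knows.
    target = int(options.get("hoodie.write.table.version", latest))
    auto_upgrade = options.get("hoodie.write.auto.upgrade", "true").lower() == "true"

    if current == target:
        return current        # versions already match, nothing to do
    if current < target and auto_upgrade:
        return target         # table is migrated up to the requested version
    # Assumption of this sketch: any other mismatch is rejected.
    raise ValueError(f"table is at version {current}, writer expects {target}")


# Rolling out new binaries: pin the writer to the table's existing version.
pinned = {"hoodie.write.table.version": "6"}
print(plan_table_version(current=6, latest=8, options=pinned))

# Once every job runs the new binaries, drop the pin and let the writer upgrade.
print(plan_table_version(current=6, latest=8, options={}))
```

In a real job the same two keys would be passed as writer options; the pin is removed only after all readers and writers in the deployment run binaries that understand the newer table version.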
