
Apache Hudi from Zero to One: Introduction to Hudi’s Storage Format (Part 1)

This article introduces Apache Hudi’s storage format, explaining the table layout, metadata and data file organization, the naming conventions of timeline actions, and the trade‑offs between Copy‑on‑Write and Merge‑on‑Read table types for transactional data lakes.

DataFunSummit

Apache Hudi is a transactional data‑lake platform that brings database and data‑warehouse capabilities to object storage. This article introduces Hudi’s storage format, describing the table layout, metadata files, and data files.

Hudi stores table metadata under <base_path>/.hoodie/, including hoodie.properties and the Timeline files that record actions such as commit and deltacommit. Timeline files follow the naming convention <action_timestamp>.<action_type>[.<action_state>]. The state suffix is either requested or inflight; a completed action omits the suffix, as in the example below.

# an example of deltacommit actions on the Timeline
# naming pattern: <action_timestamp>.<action_type>[.<action_state>]
20230827233828740.deltacommit.requested
20230827233828740.deltacommit.inflight
20230827233828740.deltacommit
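
To make the naming convention concrete, here is a minimal Python sketch that splits a Timeline file name into its parts. It is an illustration only, not Hudi's own parsing code: `TimelineInstant` and `parse_instant` are hypothetical names, and the sketch assumes that a name without a state suffix denotes a completed action, as the example above suggests.

```python
from typing import NamedTuple


class TimelineInstant(NamedTuple):
    """Hypothetical container for the parts of a Timeline file name."""
    timestamp: str
    action: str
    state: str


def parse_instant(file_name: str) -> TimelineInstant:
    """Split '<action_timestamp>.<action_type>[.<action_state>]' into parts."""
    parts = file_name.split(".")
    if len(parts) == 2:
        # No state suffix: assumed here to mean the action has completed.
        timestamp, action = parts
        state = "completed"
    elif len(parts) == 3:
        timestamp, action, state = parts
        if state not in ("requested", "inflight"):
            raise ValueError(f"unexpected action state: {state!r}")
    else:
        raise ValueError(f"not a Timeline file name: {file_name!r}")
    return TimelineInstant(timestamp, action, state)


names = [
    "20230827233828740.deltacommit.requested",
    "20230827233828740.deltacommit.inflight",
    "20230827233828740.deltacommit",
]
instants = [parse_instant(n) for n in names]
states = [i.state for i in instants]
```

Because the timestamps in the example have a fixed width, sorting the names as plain strings also orders the actions chronologically.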

Data files are divided into Base Files (columnar, e.g., Apache Parquet), which store the main records, and Log Files (row‑oriented, e.g., Apache Avro), which capture incremental changes. A Base File together with its associated Log Files forms a File Slice, and multiple File Slices compose a File Group. Each File Slice represents one version of the File Group's records, so readers can pick the slice they need while writers add new slices or Log Files without modifying older ones.
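
The relationship between these units can be sketched as a small data model. This is a simplified illustration under stated assumptions, not Hudi's internal classes: `FileSlice`, `FileGroup`, `latest_slice`, and the file names are all hypothetical, and each slice is assumed to be identified by the timestamp of the action that produced its Base File.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileSlice:
    """One Base File plus the Log Files that hold changes made on top of it."""
    base_instant: str          # assumed: timestamp of the action that wrote the Base File
    base_file: str             # placeholder name, not Hudi's real file naming
    log_files: List[str] = field(default_factory=list)


@dataclass
class FileGroup:
    """A set of File Slices that belong together."""
    file_id: str
    slices: List[FileSlice] = field(default_factory=list)

    def latest_slice(self) -> Optional[FileSlice]:
        """Return the slice with the greatest base instant, if any."""
        if not self.slices:
            return None
        return max(self.slices, key=lambda s: s.base_instant)


group = FileGroup(file_id="fg-1")
group.slices.append(FileSlice("20230827233828740", "base-v1", ["log-v1-1", "log-v1-2"]))
group.slices.append(FileSlice("20230828101500000", "base-v2", ["log-v2-1"]))

latest = group.latest_slice()
```

Here a reader that wants the newest data would use `latest`, while the older slice stays available as an earlier version.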

Hudi supports two table types: Copy‑on‑Write (CoW) and Merge‑on‑Read (MoR). CoW rewrites the affected Base Files on each write, which keeps reads fast because no merging is needed at query time, but increases write amplification. MoR appends changes to Log Files instead, which reduces write amplification, but a reader must merge the Base File with its Log Files to see the latest view.
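
The trade‑off can be illustrated with a toy model in Python. This is a deliberately simplified sketch that uses in‑memory dictionaries in place of Parquet and Avro files; `cow_write`, `mor_write`, and `mor_read` are hypothetical helpers, and the number of records written serves only as a rough stand‑in for write amplification.

```python
from typing import Dict, List, Tuple

Records = Dict[str, str]  # record key -> record value


def cow_write(base: Records, updates: Records) -> Tuple[Records, int]:
    """Copy-on-Write: produce a whole new base with the updates applied."""
    new_base = {**base, **updates}
    return new_base, len(new_base)  # every record is written again


def mor_write(log: List[Records], updates: Records) -> int:
    """Merge-on-Read: append only the changed records to the log."""
    log.append(dict(updates))
    return len(updates)  # only the changes are written


def mor_read(base: Records, log: List[Records]) -> Records:
    """Merge the base with the log blocks, oldest first, at read time."""
    view = dict(base)
    for block in log:
        view.update(block)
    return view


base = {"k1": "a", "k2": "b", "k3": "c"}
updates = {"k2": "B"}

cow_base, cow_written = cow_write(base, updates)

mor_log: List[Records] = []
mor_written = mor_write(mor_log, updates)
mor_view = mor_read(base, mor_log)
```

Both approaches end up with the same latest view. In this toy model CoW writes all three records to apply a single update, while MoR writes one record and defers the merge to `mor_read`.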

The article concludes with a brief recap and points to future posts that will cover other aspects of Hudi, such as compaction, cleaning, and query modes.
