
Apache Hudi from Zero to One: Introduction to Hudi’s Storage Format (Part 1)

This article introduces Apache Hudi’s storage format, explaining the table layout, metadata and data file organization, the naming conventions of timeline actions, and the trade‑offs between Copy‑on‑Write and Merge‑on‑Read table types for transactional data lakes.

DataFunSummit

Jun 19, 2024


Apache Hudi is a transactional data‑lake platform that brings database and data‑warehouse capabilities to object storage. This article introduces Hudi’s storage format, describing the table layout, metadata files, and data files.

Hudi stores table metadata under <base_path>/.hoodie/. This directory holds hoodie.properties and the Timeline files, which record actions such as commit and deltacommit. Timeline files are named as follows:

<action_timestamp>.<action_type>[.<action_state>]

The state is one of requested, inflight, or completed. The state suffix is optional: in the example below, the completed action carries no suffix.

# an example of deltacommit actions on Timeline
20230827233828740.deltacommit.requested
20230827233828740.deltacommit.inflight
20230827233828740.deltacommit
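
To make the naming convention concrete, here is a minimal Python sketch that splits a Timeline file name into its timestamp, action type, and state. The `Instant` tuple and `parse_timeline_file` function are hypothetical names chosen for illustration and are not part of Hudi's API. The sketch assumes that a missing state suffix means the action is completed, as in the example above, and it only handles names of the form shown here.

```python
import re
from typing import NamedTuple

# Matches <action_timestamp>.<action_type>[.<action_state>].
# Assumption: only "requested" and "inflight" appear as explicit suffixes.
_TIMELINE_NAME = re.compile(
    r"^(?P<timestamp>\d+)\.(?P<action>[a-z]+)(?:\.(?P<state>requested|inflight))?$"
)


class Instant(NamedTuple):
    """Hypothetical container for one parsed Timeline file name."""
    timestamp: str
    action: str
    state: str


def parse_timeline_file(name: str) -> Instant:
    """Parse a Timeline file name; a missing suffix is treated as completed."""
    match = _TIMELINE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized Timeline file name: {name!r}")
    return Instant(
        timestamp=match["timestamp"],
        action=match["action"],
        state=match["state"] or "completed",
    )


if __name__ == "__main__":
    for file_name in (
        "20230827233828740.deltacommit.requested",
        "20230827233828740.deltacommit.inflight",
        "20230827233828740.deltacommit",
    ):
        print(parse_timeline_file(file_name))
```

Keeping the timestamp as a string preserves its fixed-width form, so instants can be ordered with a plain string comparison.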


Data files come in two kinds: Base Files (columnar, e.g., Apache Parquet), which store the main records, and Log Files (row‑oriented, e.g., Apache Avro), which capture incremental changes. A Base File together with its associated Log Files forms a File Slice, and multiple File Slices compose a File Group. This organization is what enables efficient reads and writes.
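
The containment relationship can be sketched as a small data model. The Python below is only an illustration: the class names, the file names, and the choice to key each File Slice by the timestamp of the action that created its Base File are assumptions made for this sketch, not Hudi's actual classes or on-disk naming.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileSlice:
    """One Base File plus the Log Files associated with it."""
    base_instant: str                  # assumed key: timestamp of the base file's action
    base_file: Optional[str] = None    # columnar file, e.g. Parquet
    log_files: List[str] = field(default_factory=list)  # row-oriented change logs


@dataclass
class FileGroup:
    """A set of File Slices that belong together."""
    file_id: str
    slices: Dict[str, FileSlice] = field(default_factory=dict)

    def slice_for(self, base_instant: str) -> FileSlice:
        """Return the slice for an instant, creating it on first use."""
        return self.slices.setdefault(base_instant, FileSlice(base_instant))

    def latest_slice(self) -> Optional[FileSlice]:
        """Fixed-width timestamps sort lexicographically, so max() finds the newest."""
        if not self.slices:
            return None
        return self.slices[max(self.slices)]


# Hypothetical file names, for illustration only.
group = FileGroup(file_id="fg-1")
group.slice_for("20230827233828740").base_file = "base_v1.parquet"
group.slice_for("20230827233828740").log_files.append("changes_v1.log")
group.slice_for("20230828101500000").base_file = "base_v2.parquet"
```

In this model a reader that wants the newest data for a File Group would call `latest_slice()` and then read that slice's Base File together with its Log Files.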

Hudi supports two table types: Copy‑on‑Write (CoW) and Merge‑on‑Read (MoR). CoW rewrites entire files on each write, which keeps reads low‑latency because nothing has to be merged at query time, but it raises write amplification. MoR appends changes to Log Files instead, which reduces write amplification, but readers must merge Base Files and Log Files to see the latest view.
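
The trade‑off can be illustrated with a toy model. The Python sketch below is not Hudi's implementation: `CowFile` and `MorFile` are hypothetical classes that keep records in dictionaries and count how many records each strategy writes, as a rough stand‑in for write amplification.

```python
from typing import Dict, List

Records = Dict[str, str]  # record key -> value


class CowFile:
    """Toy Copy-on-Write file: every upsert rewrites the whole base file."""

    def __init__(self, records: Records) -> None:
        self.base: Records = dict(records)
        self.records_written = len(self.base)

    def upsert(self, changes: Records) -> None:
        new_base = dict(self.base)
        new_base.update(changes)
        self.base = new_base
        self.records_written += len(new_base)  # unchanged records are rewritten too

    def read(self) -> Records:
        return dict(self.base)  # no merge needed at read time


class MorFile:
    """Toy Merge-on-Read file: upserts are appended to a log, merged when read."""

    def __init__(self, records: Records) -> None:
        self.base: Records = dict(records)
        self.log: List[Records] = []
        self.records_written = len(self.base)

    def upsert(self, changes: Records) -> None:
        self.log.append(dict(changes))
        self.records_written += len(changes)  # only the changed records are written

    def read(self) -> Records:
        merged = dict(self.base)
        for entry in self.log:  # apply log entries in write order
            merged.update(entry)
        return merged


initial = {f"k{i}": "v0" for i in range(100)}
cow, mor = CowFile(initial), MorFile(initial)
cow.upsert({"k1": "v1"})
mor.upsert({"k1": "v1"})
```

Both objects return the same latest view from `read()`, but after the single‑record update the CoW model has written every record twice, while the MoR model has written only one extra record and defers the merge cost to each read.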

The article concludes with a brief recap and points to future posts that will cover other aspects of Hudi, such as compaction, cleaning, and query modes.


