
Apache Hudi from Zero to One: Introduction to Table Services – Compaction, Cleaning, and Indexing (Part 5)

This article introduces Apache Hudi's table services, explaining the concepts, execution modes, and detailed workflows of compaction, cleaning, and indexing, and how they optimize storage layout and read/write performance in large‑scale data lake environments.


Overview

Table services are maintenance jobs that operate on a table without adding new data, improving storage layout and enabling more efficient future reads and writes. They consist of scheduling (generating a plan) and execution (applying the plan). Hudi supports three execution modes: Inline, Semi‑async, and Full‑async.

Execution Modes

In Inline mode, scheduling and execution happen synchronously after a write, which is simple but adds latency. Semi‑async keeps inline scheduling but separates execution, allowing the executor to run as an independent job, possibly on a different cluster. Full‑async fully decouples table‑service execution from writes, useful for managing many tables with dedicated schedulers.
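The three modes map onto a pair of writer configs. As a hedged sketch (option names follow recent Hudi releases and are passed as Spark datasource write options; verify them against your version's configuration reference), compaction's execution mode can be expressed as:

```python
# Sketch of Hudi write options for each table-service execution mode,
# using compaction as the example. Pass via df.write.format("hudi").options(**opts).
# Option names are assumptions based on Hudi's configuration docs.

inline_opts = {
    "hoodie.compact.inline": "true",  # schedule AND execute synchronously after each write
}

semi_async_opts = {
    "hoodie.compact.inline": "false",
    "hoodie.compact.schedule.inline": "true",  # writer only schedules the plan;
    # execution is left to a separate job, possibly on another cluster
}

full_async_opts = {
    "hoodie.compact.inline": "false",
    "hoodie.compact.schedule.inline": "false",
    # a dedicated process both schedules and executes, fully decoupled from writes
}
```

Semi-async is a common middle ground: scheduling stays cheap and inline, while the heavy merge work moves off the write path.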

Compaction

Compaction merges log files into base files for Merge‑On‑Read tables, creating a new file‑slice version. The scheduling step checks the CompactionTriggerStrategy, creates a .compaction.requested entry, and generates a plan based on the CompactionStrategy. Execution loads the serialized operations, runs them in parallel, writes merged records via MergeHandle or CreateHandle, and records a .commit entry.
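The trigger and plan-selection knobs described above can be sketched as write options (a hedged example; option names and the default strategy class are taken from Hudi's docs and may differ across versions):

```python
# Sketch: options controlling when compaction is scheduled and which file
# groups the plan selects. Names are assumptions from Hudi's config reference.
compaction_opts = {
    # CompactionTriggerStrategy values include NUM_COMMITS, TIME_ELAPSED,
    # NUM_AND_TIME, and NUM_OR_TIME
    "hoodie.compact.inline.trigger.strategy": "NUM_COMMITS",
    # with NUM_COMMITS, schedule compaction after this many delta commits
    "hoodie.compact.inline.max.delta.commits": "5",
    # CompactionStrategy implementation used to build the plan
    "hoodie.compaction.strategy":
        "org.apache.hudi.table.action.compact.strategy."
        "LogFileSizeBasedCompactionStrategy",
}
```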

Since version 0.13.0, Hudi also offers experimental Log Compaction to reduce write amplification by compacting log files.
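Enabling the experimental feature is a single flag; as a hedged sketch (both option names are assumptions based on the 0.13.0 docs), it looks like:

```python
# Sketch: enabling experimental log compaction (Hudi >= 0.13.0).
# Option names are assumptions; check your version's config reference.
log_compaction_opts = {
    "hoodie.log.compaction.enable": "true",
    # rough threshold of log blocks before a file group's logs are compacted
    "hoodie.log.compaction.blocks.threshold": "5",
}
```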

Cleaning

Cleaning removes old file‑slice versions to reclaim storage space. The scheduler uses CleaningTriggerStrategy (currently based on commit count) to create a .clean.requested entry after a configured number of commits. Three cleaning policies are supported: by commits, by file version, and by hours. Execution deserializes CleanFileInfo and deletes the targeted files in parallel, then records a .clean operation.
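The three retention policies each come with their own knob. A hedged sketch (option names from Hudi's configuration docs; defaults vary by version):

```python
# Sketch: the three cleaning policies and their retention settings.
# Option names are assumptions from Hudi's config reference.

clean_by_commits = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",   # keep slices needed by last 10 commits
}

clean_by_file_versions = {
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "3",  # keep 3 versions per file group
}

clean_by_hours = {
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "24",     # keep slices written in the last 24h
}

# commit-count trigger: schedule cleaning once every N commits
clean_trigger = {"hoodie.clean.max.commits": "1"}
```

Commit-based retention is the usual default, since it also bounds how far back incremental and time-travel queries can reach.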

Indexing

Introduced experimentally in version 0.11.0, indexing builds indexes in the metadata table; the article points to an external blog and RFC‑45 for details. In inline indexing the writer invokes the updateLocation() API, whereas the current indexing service runs in full‑async mode, keeping indexes up to date without sacrificing write throughput.
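As a hedged sketch of the writer-side configuration (option names are assumptions from Hudi's metadata-indexing docs; the async build itself is typically driven by a separate utility job such as org.apache.hudi.utilities.HoodieIndexer):

```python
# Sketch: writer options for asynchronous metadata-table indexing.
# Option names are assumptions; verify against your Hudi version.
index_opts = {
    "hoodie.metadata.enable": "true",              # metadata table must be on
    "hoodie.metadata.index.async": "true",         # build index partitions async
    "hoodie.metadata.index.column.stats.enable": "true",  # e.g. column-stats index
}
```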

Review

The article covered the concept of table services, detailed the compaction and cleaning processes, and gave a brief overview of indexing, providing readers with a solid understanding of how Hudi optimizes storage and performance.

Tags: Big Data, Indexing, Compaction, Apache Hudi, Cleaning, Table Services
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
