
Apache Hudi from Zero to One: Comprehensive Guide to Write Indexing (Part 4)

This article explains Apache Hudi's write-side indexing: the indexing API, the available index types (simple, Bloom, bucket, HBase, and record-level), and how each one works, so that readers understand how Hudi checks whether a record already exists and how that speeds up updates and deletes.

DataFunSummit

This article, translated from the original English blog, provides a deep dive into Apache Hudi’s write‑side indexing, covering the indexing API and the different index types available for optimizing write operations.

Index API

tagLocation(): Called when a batch of incoming records is processed. For each record it determines whether the record already exists in the table and, if it does, attaches the location information. A record tagged this way has the currentLocation field of its HoodieRecord model populated.

updateLocation(): After data is written, some indexes update the location metadata to stay in sync with the table. This step runs only for index types that require post-IO updates.

isGlobal(): Distinguishes global indexes, which enforce uniqueness across all partitions, from non-global (partition-level) indexes that validate uniqueness only within a partition.

canIndexLogFiles(): Indicates whether an index can handle log files of Merge-on-Read tables, affecting how the writer creates file handles.

isImplicitWithStorage(): Shows whether the index is persisted implicitly together with the data files or stored separately.
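To make the shape of this API concrete, here is a minimal Python sketch that mirrors the five methods. It is not Hudi's Java code: the names `Location`, `Record`, `WriteIndex`, and `InMemoryIndex` are hypothetical, and the dictionary only stands in for real index storage.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for a record's location: a file group in a partition."""
    partition: str
    file_id: str


@dataclass
class Record:
    """Hypothetical stand-in for HoodieRecord; only the fields the index touches."""
    key: str
    partition: str
    current_location: Optional[Location] = None


class WriteIndex(ABC):
    """Toy interface mirroring the five methods described above."""

    @abstractmethod
    def tag_location(self, records: List[Record]) -> List[Record]: ...

    @abstractmethod
    def update_location(self, written: List[Tuple[Record, Location]]) -> None: ...

    @abstractmethod
    def is_global(self) -> bool: ...

    @abstractmethod
    def can_index_log_files(self) -> bool: ...

    @abstractmethod
    def is_implicit_with_storage(self) -> bool: ...


class InMemoryIndex(WriteIndex):
    """Global toy index that keeps key -> location in a dict, apart from data files."""

    def __init__(self) -> None:
        self._locations: Dict[str, Location] = {}

    def tag_location(self, records):
        for record in records:
            # Known keys get a location (update path); unknown keys stay untagged.
            record.current_location = self._locations.get(record.key)
        return records

    def update_location(self, written):
        # Runs after the write so the mapping stays in sync with the table.
        for record, location in written:
            self._locations[record.key] = location

    def is_global(self):
        return True  # keys are looked up without regard to partition

    def can_index_log_files(self):
        return True  # assumption of this toy: the mapping ignores file type

    def is_implicit_with_storage(self):
        return False  # the mapping lives outside the data files
```

In this sketch, a key written to one partition is found again even when the next batch carries it under a different partition, which is the behavior isGlobal() advertises.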

Index Types

1. Simple Index

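The Simple Index is a non-global index and the default index type. It scans all base files in the relevant partitions to find keys that match the incoming records. Its global variant scans the base files of all partitions in the table, which lets it handle records that move between partitions, and it supports log files of Merge-on-Read tables.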

The Simple Index tags each incoming record whose key matches an existing key by filling in its currentLocation. Records without a match keep an empty currentLocation, and the later write steps treat them as inserts.
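The non-global tagging step can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hudi's implementation: the `Table` layout and the `tag_with_simple_index` function are hypothetical, and the base files are plain lists of keys.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table contents: partition -> {base file id -> record keys in that file}.
Table = Dict[str, Dict[str, List[str]]]


def tag_with_simple_index(
    incoming: List[Tuple[str, str]], table: Table
) -> List[Tuple[str, str, Optional[str]]]:
    """Tag (partition, key) pairs with the base file that already holds the key."""
    # Only partitions touched by the incoming batch are scanned (non-global behavior).
    relevant_partitions = {partition for partition, _ in incoming}

    # Scan every base file in those partitions to collect existing keys and locations.
    existing: Dict[Tuple[str, str], str] = {}
    for partition in relevant_partitions:
        for file_id, keys in table.get(partition, {}).items():
            for key in keys:
                existing[(partition, key)] = file_id

    # Behaves like a left outer join: records without a match get None as location.
    return [(p, k, existing.get((p, k))) for p, k in incoming]
```

Because the lookup key includes the partition, the same record key in two partitions is treated as two different records; a global variant would key the lookup on the record key alone and scan every partition.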

2. Bloom Index

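The Bloom Index uses a two-stage filter to reduce the number of keys and files that have to be examined. First, it compares the input keys against the key-range metadata stored in the base-file footers and drops files whose range cannot contain a key. Second, it checks the remaining candidate keys against the deserialized Bloom filters of those files. Because a Bloom filter can return false positives, the key and file pairs that survive both stages still have to be confirmed against the actual keys in the files.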

The Global Bloom Index works the same way but matches keys across the whole table instead of within a single partition, so it can also handle records whose partition changes.
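The two filtering stages can be illustrated with a small, self-contained Python sketch. Everything here is an assumption made for illustration: `TinyBloomFilter`, `BaseFileMeta`, `build_meta`, and `candidate_files` are hypothetical names, and the filter size and hash scheme are arbitrary rather than what Hudi stores in its file footers.

```python
import hashlib
from dataclasses import dataclass
from typing import List


class TinyBloomFilter:
    """Minimal Bloom filter; size and hash count are arbitrary illustrative values."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key: str) -> List[int]:
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


@dataclass
class BaseFileMeta:
    """Hypothetical footer metadata: key range plus a Bloom filter over the keys."""
    file_id: str
    min_key: str
    max_key: str
    bloom: TinyBloomFilter


def build_meta(file_id: str, keys: List[str]) -> BaseFileMeta:
    bloom = TinyBloomFilter()
    for key in keys:
        bloom.add(key)
    return BaseFileMeta(file_id, min(keys), max(keys), bloom)


def candidate_files(key: str, files: List[BaseFileMeta]) -> List[str]:
    # Stage 1: drop files whose [min_key, max_key] range cannot contain the key.
    in_range = [f for f in files if f.min_key <= key <= f.max_key]
    # Stage 2: drop files whose Bloom filter rules the key out.
    return [f.file_id for f in in_range if f.bloom.might_contain(key)]
```

A file returned by `candidate_files` may still be a false positive, so a real lookup would go on to check the actual keys in that file. Range pruning works best when keys are ordered so that files cover mostly disjoint ranges.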

3. Bucket Index

The Bucket Index is hash-based: it maps each key to a bucket that corresponds to a file group, so locating a record requires no reads from the storage layer. Two variants exist: the Simple Bucket Index, which uses a fixed number of buckets, and the Consistent-Hashing Bucket Index, which dynamically re-hashes buckets when a file group grows too large.
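The fixed-bucket mapping comes down to a hash followed by a modulo. The Python sketch below assumes CRC32 as the hash and a made-up file-group naming scheme; `bucket_id` and `file_group_for` are hypothetical helpers, and Hudi's actual hash function and naming may differ.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_group_for(record_key: str, partition: str, num_buckets: int) -> str:
    """Derive a hypothetical file-group name from the partition and the bucket."""
    return f"{partition}/bucket-{bucket_id(record_key, num_buckets):04d}"
```

The location is computed from the key alone, which is why no lookup against storage is needed. It also shows the limitation of the fixed variant: changing `num_buckets` changes the result of the modulo for existing keys, so the bucket count cannot simply be altered after data has been written.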

4. HBase Index

The HBase Index uses an external HBase server to store a global mapping from record keys to file groups. It offers fast lookups and scales horizontally, at the cost of the operational overhead of running an additional service.

5. Record‑Level Index

Introduced in Hudi 0.14.0, this global index stores the key‑to‑file‑group mapping inside the Hudi table itself, avoiding the need for an external server.
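As a rough illustration of how this index might be switched on, the Python sketch below collects writer options into a dictionary. The helper `record_index_options` is hypothetical, and the option keys reflect the Hudi 0.14 configuration as best understood here; they should be checked against the configuration reference for the version in use.

```python
from typing import Dict


def record_index_options(table_name: str) -> Dict[str, str]:
    """Build writer options that select the record-level index (keys assumed for 0.14)."""
    return {
        "hoodie.table.name": table_name,
        # The record-level index is stored in the table's metadata table.
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.record.index.enable": "true",
        # Tell the writer to use the record-level index for tagging.
        "hoodie.index.type": "RECORD_INDEX",
    }
```

With PySpark, such a dictionary would typically be passed to a DataFrame writer through `.options(**opts)` together with the `hudi` format.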

Review

The article covered the Hudi write indexing API, detailed the simple and Bloom index workflows, and gave a brief overview of bucket, HBase, and record‑level indexes, providing readers with a solid understanding of how Hudi validates record existence and optimizes updates and deletions.
