
How Hudi’s New Bucket Index Supercharges Upserts in Massive Data Lakes

Introducing Hudi’s Bucket Index, a hash‑based indexing module that replaces Bloom Filter and HBase indexes, dramatically reducing file reads and writes, improving upsert performance, enabling efficient bucket pruning and joins, and offering practical guidance on design, implementation, and future extensions.

Volcano Engine Developer Services

Mar 3, 2022

Hudi is a streaming data‑lake platform that provides ACID capabilities, supports real‑time incremental consumption and batch updates, and can be written and queried via Spark, Flink, Presto, and other engines.

Each Hudi partition consists of multiple File Groups, each identified by a File ID. A File Group contains a Base File (Parquet) and Delta Files (log files) that record modifications. Hudi uses MVCC: a Compaction task merges Delta Files into a new Base File, and a Clean operation removes obsolete files. Hudi’s index maps each record consistently to a File ID, which makes upserts efficient. Once the first version of a Record Key is assigned to a File Group, that mapping never changes, so all versions of a record reside in the same File Group.

Purpose and Types of Hudi Index

Index Purpose

In traditional Hive warehouses, updating a partition requires three heavy operations: reading all files in the partition, performing a distributed join with the update data, and writing the entire dataset back. For example, updating 100 rows in a partition that holds 100,000 rows across 400 files means:

Read all 100,000 rows from 400 files.

Join the 100 rows to be updated with the full dataset.

Write the updated 100,000 rows to a temporary location and replace the original data.

This raises three questions: is it necessary to read so many files, to update so many files, and to perform a distributed join? Even in the worst case, where each of the 100 updated rows lives in a different file, only 100 files (¼ of the total) actually need to be read and updated.

Therefore, Hudi introduces an index to avoid unnecessary reads and writes, and to eliminate the need for a distributed join.

Avoid reading irrelevant files.

Avoid updating irrelevant files.

Merge updates within a File Group without a full distributed join.
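The three goals above can be illustrated with a minimal sketch. It assumes the index is simply an in-memory map from Record Key to File ID; the class and method names are hypothetical and are not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a real Hudi index is not a plain in-memory map.
final class IndexedUpsertSketch {

    /** Groups updated keys by the File ID the index assigns to them. */
    static Map<String, List<String>> groupByFileId(Map<String, String> index,
                                                   List<String> updatedKeys) {
        Map<String, List<String>> touched = new TreeMap<>();
        for (String key : updatedKeys) {
            String fileId = index.get(key);
            if (fileId == null) {
                continue; // unknown key: a real writer would route it as an insert
            }
            touched.computeIfAbsent(fileId, id -> new ArrayList<>()).add(key);
        }
        return touched;
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("k1", "file-001");
        index.put("k2", "file-001");
        index.put("k3", "file-002");
        index.put("k4", "file-003");

        List<String> updates = new ArrayList<>();
        updates.add("k1");
        updates.add("k2");

        // Only the files that appear as keys here need to be read and rewritten;
        // each group can be merged locally, without a join against the full dataset.
        System.out.println(groupByFileId(index, updates));
    }
}
```

Files that never appear in the result are neither read nor rewritten, and each group is merged within its own File Group.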

Index Types

Bloom Filter Index: Each Parquet file maintains a Bloom filter, stored in the file footer. During the File Group mapping phase, the Bloom filters of all candidate partitions are loaded to test whether a Record Key exists. This is the default index; it is lightweight and has no external dependencies.

HBase Index: Maintains a mapping from each Record Key to its Partition Path and File Group in HBase. Insert tasks batch-get the mapping from HBase. It is heavyweight: queries are efficient for small batches, but it adds the overhead of an external system.

Bucket Index Background

Although Bloom Filter and HBase indexes improve performance, ByteDance’s large‑scale data‑lake use cases still faced challenges: with ~40,000 File Groups and 30 TB of data, upsert speed degraded because Bloom Filter false positives forced costly record‑level scans.

The HBase Index was ruled out because the team wanted to avoid an external dependency. A lightweight, high-performance solution was needed.

ByteDance’s data‑lake team therefore developed a hash‑based Bucket Index, contributed as RFC‑29.

Bucket Index Design Principles

Bucket Index is a hash‑based index similar to a database hash index. Given n buckets, a hash function determines the bucket for each record. Each bucket corresponds to a File Group, ensuring a one‑to‑one mapping between buckets and File Groups.

Unlike the Bloom Filter Index, the Bucket Index provides a deterministic Record Key → File Group mapping with no false positives. Records with the same key always fall into the same bucket.

Bucket Index Data Write Process

The write flow can be illustrated with a real‑time upsert scenario where five new records are inserted into partition 20220203.

Steps:

Estimate the storage size of a single partition and set numBuckets when creating the table.

Generate numBuckets File IDs, one per bucket. Each File ID has its first eight characters replaced with the zero-padded bucket ID (e.g., 00000000-e929-4327-8b0c-7d0d66091321 for bucket 0).

During insertion, compute the hash of the index key and take modulo numBuckets to locate the target bucket and thus the File Group.

After indexing, each record carries a File ID; the engine shuffles data by File ID so that records with the same File ID are processed together. For COW tables, updates are merged into a new Base File; for MOR tables, updates go to the delta log and inserts create new Base Files.
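The File ID generation in the steps above can be sketched as follows. This is an illustration under stated assumptions: the File ID is assumed to start from a random UUID, the bucket ID is assumed to be zero-padded to eight decimal digits (matching the example ID shown), and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.UUID;

// Hypothetical sketch of bucket-prefixed File IDs; not Hudi's actual classes.
final class BucketFileIdSketch {

    /** Replaces the first eight characters of a random UUID with the bucket ID. */
    static String newFileId(int bucketId) {
        String uuid = UUID.randomUUID().toString(); // 36 chars, first group has 8 chars
        return String.format(Locale.ROOT, "%08d", bucketId) + uuid.substring(8);
    }

    /** Recovers the bucket ID from the eight-character File ID prefix. */
    static int bucketIdOf(String fileId) {
        return Integer.parseInt(fileId.substring(0, 8));
    }

    /** Creates one File ID per bucket; position i belongs to bucket i. */
    static List<String> createFileIds(int numBuckets) {
        List<String> fileIds = new ArrayList<>();
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            fileIds.add(newFileId(bucket));
        }
        return fileIds;
    }

    public static void main(String[] args) {
        for (String fileId : createFileIds(4)) {
            System.out.println(bucketIdOf(fileId) + " -> " + fileId);
        }
    }
}
```

Because the bucket ID is embedded in the File ID, the File Group for a bucket can be located from the file name alone, without consulting any separate lookup structure.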

The hash computation is:

(hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets

where hashKeyFields can be a subset of the Record Key; if unspecified, the full Record Key is used.
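As a minimal runnable sketch of this formula, the snippet below models hashKeyFields as a List<String> of field values, which is an assumption made for illustration; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the bucket computation; hashKeyFields is modeled as a List<String>.
final class BucketHashSketch {

    static int bucketId(List<String> hashKeyFields, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // hashCode() may be negative, and % keeps the sign of the dividend in Java.
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is
        // always in [0, numBuckets). Math.abs would not be safe here because
        // Math.abs(Integer.MIN_VALUE) is still negative.
        return (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        List<String> fullKey = Arrays.asList("user-42", "beijing");
        List<String> subset = Arrays.asList("beijing"); // hashing only a subset of the key
        System.out.println(bucketId(fullKey, 16));
        System.out.println(bucketId(subset, 16));
    }
}
```

The same input always yields the same bucket, which is what makes the mapping deterministic across writes.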

Bucket Index Query Optimization

Bucket Index enables two main query optimizations:

Bucket Pruning: Spark can prune entire buckets based on the indexed column, reducing the amount of data scanned.

Bucket Join: When joining a bucketed table with another table, Spark can avoid an extra shuffle because matching keys reside in the same bucket.

Example of bucket pruning:

select * from T1 where city = 'beijing'

Assuming T1 is bucketed on city, only the bucket that 'beijing' hashes to needs to be read.
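To make the pruning step concrete, here is a small sketch of how an equality predicate on the bucketing column could narrow the candidate files. It assumes the hash is taken over the single column value and that File IDs carry an eight-digit bucket prefix; the names are hypothetical and this is not how Spark's planner is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of bucket pruning for an equality predicate.
final class BucketPruningSketch {

    static int bucketId(String hashKeyValue, int numBuckets) {
        return (hashKeyValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    /** Keeps only the File IDs whose prefix matches the bucket of the predicate value. */
    static List<String> pruneFileIds(List<String> fileIds, String city, int numBuckets) {
        String prefix = String.format(Locale.ROOT, "%08d", bucketId(city, numBuckets));
        List<String> kept = new ArrayList<>();
        for (String fileId : fileIds) {
            if (fileId.startsWith(prefix)) {
                kept.add(fileId);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> fileIds = new ArrayList<>();
        for (int bucket = 0; bucket < 4; bucket++) {
            // Suffix reused from the article's example File ID.
            fileIds.add(String.format(Locale.ROOT, "%08d", bucket) + "-e929-4327-8b0c-7d0d66091321");
        }
        // With 4 buckets, 3 of the 4 File Groups are skipped for city = 'beijing'.
        System.out.println(pruneFileIds(fileIds, "beijing", 4));
    }
}
```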

Example of bucket join:

select count(*) from T1 join T2 on T1.city = T2.city

Because T1 is already bucketed on city, its rows are already grouped by the join key, so the join avoids an extra exchange (shuffle) operation.
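The idea behind a bucket join can be sketched in plain Java. The sketch assumes both sides have been laid out with the same hash function and the same number of buckets, which goes beyond what the text states for T2; it is an illustration of the principle, not Spark's execution plan, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: join two identically bucketed tables bucket by bucket.
final class BucketJoinSketch {

    static int bucketId(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    /** Lays out the join-key values of a table into numBuckets buckets. */
    static List<List<String>> bucketize(List<String> cities, int numBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String city : cities) {
            buckets.get(bucketId(city, numBuckets)).add(city);
        }
        return buckets;
    }

    /** Counts matching pairs; bucket b of T1 is only ever compared with bucket b of T2. */
    static long joinCount(List<List<String>> t1Buckets, List<List<String>> t2Buckets) {
        if (t1Buckets.size() != t2Buckets.size()) {
            throw new IllegalArgumentException("both sides need the same number of buckets");
        }
        long count = 0;
        for (int b = 0; b < t1Buckets.size(); b++) {
            Map<String, Long> frequency = new HashMap<>();
            for (String city : t1Buckets.get(b)) {
                frequency.merge(city, 1L, Long::sum);
            }
            for (String city : t2Buckets.get(b)) {
                count += frequency.getOrDefault(city, 0L);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> t1 = Arrays.asList("beijing", "beijing", "shanghai");
        List<String> t2 = Arrays.asList("beijing", "shenzhen", "shanghai", "shanghai");
        System.out.println(joinCount(bucketize(t1, 4), bucketize(t2, 4)));
    }
}
```

Since equal keys always hash to the same bucket, no row ever needs to be compared with a row from a different bucket, which is why the data does not have to be redistributed before the join.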

Practice and Future Plans

Choosing an appropriate numBuckets is critical. It should be based on the estimated table size and future growth, aiming for roughly 3 GB per bucket. Too few buckets reduce parallelism; too many create many small files.
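A rough sizing helper might look like the sketch below. The 3 GB target comes from the guidance above; the growth factor, the treatment of 1 GB as 2^30 bytes, and the class and method names are assumptions. The option keys are the ones I understand Hudi releases with Bucket Index to use, so they should be checked against the configuration reference for the Hudi version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sizing sketch; names and growth factor are illustrative assumptions.
final class BucketSizingSketch {

    // Roughly 3 GB per bucket, treating 1 GB as 2^30 bytes (assumption).
    static final long TARGET_BYTES_PER_BUCKET = 3L * 1024 * 1024 * 1024;

    /** Estimates numBuckets from the expected partition size and a growth factor. */
    static int estimateNumBuckets(long partitionBytes, double growthFactor) {
        double projectedBytes = partitionBytes * growthFactor;
        long buckets = (long) Math.ceil(projectedBytes / TARGET_BYTES_PER_BUCKET);
        return (int) Math.max(1L, buckets);
    }

    /** Builds writer options; keys should be verified against the Hudi version in use. */
    static Map<String, String> writeOptions(int numBuckets, String hashField) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("hoodie.index.type", "BUCKET");
        options.put("hoodie.bucket.index.num.buckets", String.valueOf(numBuckets));
        options.put("hoodie.bucket.index.hash.field", hashField);
        return options;
    }

    public static void main(String[] args) {
        long thirtyGb = 30L * 1024 * 1024 * 1024;
        int numBuckets = estimateNumBuckets(thirtyGb, 1.5); // plan for 50% growth
        System.out.println(writeOptions(numBuckets, "city"));
    }
}
```

Since the number of buckets is fixed when the table is created, it is safer to size for projected growth than for the current data volume.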

Future work includes an extensible Hash Index that can dynamically increase the number of buckets after table creation, avoiding the need for re‑creation when data volume grows.

Conclusion

Hudi’s Bucket Index provides a lightweight, hash‑based solution that eliminates unnecessary file reads and writes, stabilizes upsert performance at scale, and integrates with Spark/Presto to enable bucket pruning and bucket join optimizations. It requires only a simple configuration change to activate, making it easy to adopt for existing Hudi users.
