How Hudi’s New Bucket Index Boosts Upsert Performance in Massive Data Lakes
This article explains the background, design, and practical benefits of Hudi's Bucket Index—a hash‑based indexing mechanism that reduces unnecessary file reads and writes, improves upsert speed on terabyte‑scale datasets, and enables query optimizations such as bucket pruning and bucket join.
Hudi is a streaming data‑lake platform that provides ACID guarantees, supports real‑time incremental consumption and batch updates, and can be written and queried using engines like Spark, Flink, and Presto.
Hudi organizes data into partitions composed of multiple File Groups, each identified by a File ID. Within a File Group, data is stored as Base Files (Parquet) and Delta Files (log files). Hudi uses MVCC and compaction to merge Delta Files into new Base Files and cleans up obsolete files. The index maps each record consistently to a File ID, enabling efficient upserts. Once a Record Key’s first version determines its File Group, that mapping never changes, ensuring all versions of a record stay in the same File Group.
Why an Index Is Needed
In traditional Hive warehouses, updating a small subset of records in a large partition requires reading all files, performing distributed joins, and rewriting the entire partition, which is costly. Hudi’s index avoids these unnecessary reads and writes by quickly locating the relevant File Group.
Existing Hudi Index Types
Bloom Filter Index : Each Parquet file stores a Bloom filter in its footer; during index lookup, the Bloom filters of candidate files are loaded to test whether a Record Key might exist in them. It is lightweight, requires no external system, and is Hudi's default index.
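To make the false-positive trade-off concrete, here is a minimal hand-rolled Bloom filter sketch (not Hudi's implementation; class and method names are illustrative): k salted hash probes set bits in a fixed-size bit array, so a membership test can wrongly say "maybe present" but never wrongly says "absent".

```java
import java.util.BitSet;

public class MiniBloom {
    private final BitSet bits;
    private final int size;
    private final int k; // number of hash probes per key

    MiniBloom(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th probe position by salting the key before hashing.
    private int probe(String key, int i) {
        return ((key + "#" + i).hashCode() & Integer.MAX_VALUE) % size;
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // May return a false positive, but never a false negative.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(key, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        MiniBloom bloom = new MiniBloom(1024, 3);
        bloom.add("record-1");
        // A key that was added always tests positive.
        assert bloom.mightContain("record-1");
        // An absent key usually tests negative, but may not --
        // that uncertainty is what forces the extra file reads.
        System.out.println(bloom.mightContain("record-2"));
    }
}
```

It is exactly these residual "maybe present" answers that force Hudi to read candidate files whose filters match, which is the cost the Bucket Index removes.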
HBase Index : Stores Record Key to Partition Path and File Group mapping in HBase; suitable for small batches but introduces external dependency and higher operational cost.
Bucket Index – Background and Motivation
ByteDance’s data‑lake team faced performance degradation when upserting billions of records across ~40,000 File Groups. Bloom Filter Index suffered from false positives, causing expensive reads, while HBase Index was deemed too heavyweight. A lightweight, high‑performance hash‑based index was needed.
Bucket Index Design Principles
Bucket Index is a hash‑based index inspired by database hash indexes. Given n buckets, a hash function determines the bucket for each record, mapping each bucket to a File Group. Unlike Bloom filters, it has no false positives; identical keys always map to the same bucket.
During table creation, the number of buckets (numBuckets) is estimated based on expected partition size; each bucket corresponds one‑to‑one with a File Group.
Data Write Flow
1. Estimate the partition size and set numBuckets when creating the table.
2. Generate n File IDs, replacing the first 8 characters of each with the zero-padded bucket ID.
3. For each incoming record, compute the hash of the index key (by default the Record Key) and take (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets to obtain the bucket ID, which determines the target File Group.
4. Write the record to the corresponding File Group: for COW tables the update is merged into a new Base File; for MOR tables it is appended to the delta log.
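The bucket mapping in the steps above can be sketched in a few lines of Java; the method names are illustrative, not Hudi's actual API, but the hash expression is the one quoted in the flow.

```java
import java.util.UUID;

public class BucketIndexSketch {
    // Map a record key to a bucket: mask the sign bit so the
    // modulo result is always non-negative, then take mod numBuckets.
    static int bucketId(String recordKey, int numBuckets) {
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Embed the bucket ID into the File ID by replacing the first
    // 8 characters of a fresh UUID with the zero-padded bucket number.
    static String fileIdForBucket(int bucketId) {
        String uuid = UUID.randomUUID().toString();
        return String.format("%08d", bucketId) + uuid.substring(8);
    }

    public static void main(String[] args) {
        int numBuckets = 256;
        String key = "user_42";
        int b = bucketId(key, numBuckets);
        // The mapping is deterministic: the same key always lands in the
        // same bucket, so all versions of a record share one File Group.
        assert b == bucketId(key, numBuckets);
        System.out.println("bucket=" + b + " fileId=" + fileIdForBucket(b));
    }
}
```

Because the bucket is derived purely from the key, locating a record's File Group costs one hash computation with no index lookup and no false positives.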
Query Optimizations with Bucket Index
Bucket Index enables two main optimizations in engines like Spark:
Bucket Pruning : When a query filters on the bucket column (e.g., city = 'beijing'), only the relevant bucket's files are read, drastically reducing scanned data.
Bucket Join : If a bucketed table joins with another table on the bucket column, the join can avoid an extra shuffle because matching records reside in the same bucket.
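A minimal sketch of how bucket pruning works, assuming a hypothetical file listing where each file name starts with its zero-padded bucket ID: the engine hashes the predicate value, and only the file whose prefix matches that bucket needs to be scanned.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BucketPruningSketch {
    static int bucketId(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // Hypothetical listing: one base file per bucket, named by bucket ID.
        List<String> files = IntStream.range(0, numBuckets)
                .mapToObj(i -> String.format("%08d-base.parquet", i))
                .collect(Collectors.toList());

        // Query predicate on the bucket column: city = 'beijing'.
        int target = bucketId("beijing", numBuckets);

        // Prune: keep only files whose prefix matches the target bucket.
        List<String> toScan = files.stream()
                .filter(f -> f.startsWith(String.format("%08d", target)))
                .collect(Collectors.toList());

        System.out.println("scan " + toScan.size() + " of " + files.size() + " files");
    }
}
```

The same property drives bucket joins: two tables bucketed by the same key and bucket count place matching rows in identically numbered buckets, so the engine can join bucket-to-bucket without a shuffle.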
Practical Experience and Future Plans
Choosing an appropriate numBuckets is critical: too few buckets lead to large File Groups and reduced parallelism, while too many create many small files. The team recommends targeting ~3 GB per bucket. Future work includes a scalable hash index that can dynamically expand the number of buckets after table creation, eliminating the need for accurate upfront estimation.
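The ~3 GB sizing rule translates into a simple back-of-the-envelope calculation (an illustrative helper, not a Hudi API): divide the expected partition size by the target bucket size and round up.

```java
public class BucketCount {
    // Target size per bucket from the recommendation above: ~3 GB.
    static final long TARGET_BUCKET_BYTES = 3L * 1024 * 1024 * 1024;

    // Round up so no bucket is expected to exceed the target size.
    static int estimateNumBuckets(long expectedPartitionBytes) {
        return (int) Math.max(1,
                (expectedPartitionBytes + TARGET_BUCKET_BYTES - 1) / TARGET_BUCKET_BYTES);
    }

    public static void main(String[] args) {
        long oneTb = 1L << 40; // 1 TB expected partition size
        System.out.println(estimateNumBuckets(oneTb)); // prints 342
    }
}
```

Since numBuckets is fixed at table creation, it is worth estimating against future data volume, not just the current size; the planned scalable hash index would relax this constraint.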
Conclusion
Hudi’s Bucket Index provides a lightweight, hash‑based solution that eliminates unnecessary file reads, maintains stable upsert performance at scale, and integrates with query engines to enable bucket pruning and bucket join, thereby improving both write and read efficiency for large‑scale data‑lake workloads.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.