How Hudi’s New Bucket Index Boosts Upsert Performance in Massive Data Lakes
This article explains the background, design, and practical benefits of Hudi's Bucket Index—a hash‑based indexing mechanism that reduces unnecessary file reads and writes, improves upsert speed on terabyte‑scale datasets, and enables query optimizations such as bucket pruning and bucket join.
Hudi is a streaming data‑lake platform that provides ACID guarantees, supports real‑time incremental consumption and batch updates, and can be written and queried using engines like Spark, Flink, and Presto.
Hudi organizes data into partitions composed of multiple File Groups, each identified by a File ID. Within a File Group, data is stored as Base Files (Parquet) and Delta Files (log files). Hudi uses MVCC and compaction to merge Delta Files into new Base Files and cleans up obsolete files. The index maps each record consistently to a File ID, enabling efficient upserts. Once a Record Key’s first version determines its File Group, that mapping never changes, ensuring all versions of a record stay in the same File Group.
Why an Index Is Needed
In traditional Hive warehouses, updating a small subset of records in a large partition requires reading all files, performing distributed joins, and rewriting the entire partition, which is costly. Hudi’s index avoids these unnecessary reads and writes by quickly locating the relevant File Group.
Existing Hudi Index Types
Bloom Filter Index : Each Parquet file stores a Bloom filter in its footer; during index lookup, the Bloom filters of candidate files are loaded to test whether a Record Key might exist in them. It is lightweight, requires no external system, and is Hudi's default index.
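To make the false-positive trade-off concrete, here is a minimal hand-rolled Bloom filter sketch (not Hudi's implementation; class and method names are illustrative): k salted hash probes set bits in a fixed-size bit array, so a membership test can wrongly say "maybe present" but never wrongly says "absent".

```java
import java.util.BitSet;

public class MiniBloom {
    private final BitSet bits;
    private final int size;
    private final int k; // number of hash probes per key

    MiniBloom(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th probe position by salting the key before hashing.
    private int probe(String key, int i) {
        return ((key + "#" + i).hashCode() & Integer.MAX_VALUE) % size;
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // May return a false positive, but never a false negative.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(key, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        MiniBloom bloom = new MiniBloom(1024, 3);
        bloom.add("record-1");
        // A key that was added always tests positive.
        assert bloom.mightContain("record-1");
        // An absent key usually tests negative, but may not --
        // that uncertainty is what forces the extra file reads.
        System.out.println(bloom.mightContain("record-2"));
    }
}
```

It is exactly these residual "maybe present" answers that force Hudi to read candidate files whose filters match, which is the cost the Bucket Index removes.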
HBase Index : Stores Record Key to Partition Path and File Group mapping in HBase; suitable for small batches but introduces external dependency and higher operational cost.
Bucket Index – Background and Motivation
ByteDance’s data‑lake team faced performance degradation when upserting billions of records across ~40,000 File Groups. Bloom Filter Index suffered from false positives, causing expensive reads, while HBase Index was deemed too heavyweight. A lightweight, high‑performance hash‑based index was needed.
Bucket Index Design Principles
Bucket Index is a hash‑based index inspired by database hash indexes. Given n buckets, a hash function determines the bucket for each record, mapping each bucket to a File Group. Unlike Bloom filters, it has no false positives; identical keys always map to the same bucket.
During table creation, the number of buckets (numBuckets) is estimated based on expected partition size; each bucket corresponds one‑to‑one with a File Group.
Data Write Flow
1. Estimate the partition size and set numBuckets when creating the table.
2. Generate n File IDs, replacing the first 8 characters of each with the zero-padded bucket ID.
3. For each incoming record, compute the hash of the index key (by default the Record Key) and take (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets to obtain the bucket ID, which determines the target File Group.
4. Write the record to the corresponding File Group: for COW tables the update is merged into a new Base File; for MOR tables it is appended to the delta log.
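The bucket mapping in the steps above can be sketched in a few lines of Java; the method names are illustrative, not Hudi's actual API, but the hash expression is the one quoted in the flow.

```java
import java.util.UUID;

public class BucketIndexSketch {
    // Map a record key to a bucket: mask the sign bit so the
    // modulo result is always non-negative, then take mod numBuckets.
    static int bucketId(String recordKey, int numBuckets) {
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Embed the bucket ID into the File ID by replacing the first
    // 8 characters of a fresh UUID with the zero-padded bucket number.
    static String fileIdForBucket(int bucketId) {
        String uuid = UUID.randomUUID().toString();
        return String.format("%08d", bucketId) + uuid.substring(8);
    }

    public static void main(String[] args) {
        int numBuckets = 256;
        String key = "user_42";
        int b = bucketId(key, numBuckets);
        // The mapping is deterministic: the same key always lands in the
        // same bucket, so all versions of a record share one File Group.
        assert b == bucketId(key, numBuckets);
        System.out.println("bucket=" + b + " fileId=" + fileIdForBucket(b));
    }
}
```

Because the bucket is derived purely from the key, locating a record's File Group costs one hash computation with no index lookup and no false positives.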
Query Optimizations with Bucket Index
Bucket Index enables two main optimizations in engines like Spark:
Bucket Pruning : When a query filters on the bucket column (e.g., city = 'beijing'), only the relevant bucket's files are read, drastically reducing scanned data.
Bucket Join : If a bucketed table joins with another table on the bucket column, the join can avoid an extra shuffle because matching records reside in the same bucket.
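A minimal sketch of how bucket pruning works, assuming a hypothetical file listing where each file name starts with its zero-padded bucket ID: the engine hashes the predicate value, and only the file whose prefix matches that bucket needs to be scanned.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BucketPruningSketch {
    static int bucketId(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // Hypothetical listing: one base file per bucket, named by bucket ID.
        List<String> files = IntStream.range(0, numBuckets)
                .mapToObj(i -> String.format("%08d-base.parquet", i))
                .collect(Collectors.toList());

        // Query predicate on the bucket column: city = 'beijing'.
        int target = bucketId("beijing", numBuckets);

        // Prune: keep only files whose prefix matches the target bucket.
        List<String> toScan = files.stream()
                .filter(f -> f.startsWith(String.format("%08d", target)))
                .collect(Collectors.toList());

        System.out.println("scan " + toScan.size() + " of " + files.size() + " files");
    }
}
```

The same property drives bucket joins: two tables bucketed by the same key and bucket count place matching rows in identically numbered buckets, so the engine can join bucket-to-bucket without a shuffle.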
Practical Experience and Future Plans
Choosing an appropriate numBuckets is critical: too few buckets lead to large File Groups and reduced parallelism, while too many create many small files. The team recommends targeting ~3 GB per bucket. Future work includes a scalable hash index that can dynamically expand the number of buckets after table creation, eliminating the need for accurate upfront estimation.
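The ~3 GB sizing rule translates into a simple back-of-the-envelope calculation (an illustrative helper, not a Hudi API): divide the expected partition size by the target bucket size and round up.

```java
public class BucketCount {
    // Target size per bucket from the recommendation above: ~3 GB.
    static final long TARGET_BUCKET_BYTES = 3L * 1024 * 1024 * 1024;

    // Round up so no bucket is expected to exceed the target size.
    static int estimateNumBuckets(long expectedPartitionBytes) {
        return (int) Math.max(1,
                (expectedPartitionBytes + TARGET_BUCKET_BYTES - 1) / TARGET_BUCKET_BYTES);
    }

    public static void main(String[] args) {
        long oneTb = 1L << 40; // 1 TB expected partition size
        System.out.println(estimateNumBuckets(oneTb)); // prints 342
    }
}
```

Since numBuckets is fixed at table creation, it is worth estimating against future data volume, not just the current size; the planned scalable hash index would relax this constraint.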
Conclusion
Hudi’s Bucket Index provides a lightweight, hash‑based solution that eliminates unnecessary file reads, maintains stable upsert performance at scale, and integrates with query engines to enable bucket pruning and bucket join, thereby improving both write and read efficiency for large‑scale data‑lake workloads.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.