Evolution of ByteDance Data Lake Indexing: Hudi Index Enhancements and Future Directions
This article presents ByteDance's evolution of data lake indexing built on Apache Hudi, detailing traditional update challenges, Hudi's index mechanisms, the introduction of bucket and extensible hash indexes, query optimizations, and upcoming multi‑modal and range index innovations.
ByteDance's data lake architecture, based on Apache Hudi, faced efficiency issues with traditional indexing during large‑scale updates, prompting the development of new index implementations.
Hudi Index Overview : Hudi introduces indexes that map primary keys to file names, reducing I/O by quickly locating relevant files. Supported index types include Bloom Filter (default), HBase, Bucket, and Flink State.
Problems and Challenges : As the number of file groups grew to tens of thousands, Bloom filter lookups became slow: each lookup has to read many file footers, and every false positive forces another file to be opened and checked. The HBase and State indexes avoid this cost but introduce dependency and concurrency limitations.
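To see why false positives hurt at this scale, here is a rough back-of-the-envelope sketch. It uses the standard Bloom filter false-positive approximation and a deliberately simplified model, assumed here and not taken from the talk, in which every file's filter is consulted for a key and the filters behave independently. The function names are illustrative only.

```python
import math


def bloom_fpp(num_keys: int, num_bits: int, num_hashes: int) -> float:
    """Standard approximation of a Bloom filter's false-positive probability."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def expected_false_candidates(num_files: int, fpp: float) -> float:
    """Expected number of files opened by mistake for one key (simplified model)."""
    return num_files * fpp


if __name__ == "__main__":
    fpp = bloom_fpp(num_keys=1000, num_bits=10_000, num_hashes=7)
    for num_files in (100, 10_000, 50_000):
        extra = expected_false_candidates(num_files, fpp)
        print(f"{num_files} files -> about {extra:.1f} wasted file reads per key")
```

Under this model the wasted reads grow linearly with the number of files, which matches the article's observation that the problem appears once file groups reach the tens of thousands.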
ByteDance Index Evolution : The team designed a hash‑based Bucket Index that partitions data into buckets mapped to file groups, enabling lightweight updates. Write operations now compute a hash of the index key to locate the appropriate bucket, creating new file groups only when needed.
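The write path described above can be sketched in a few lines. This is a minimal illustration, not Hudi's implementation: the CRC32 hash, the `BucketIndex` class, and the `fg-…` file group naming are all assumptions made for the example.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map an index key to a bucket using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


class BucketIndex:
    """Toy bucket index: one file group per bucket, created lazily."""

    def __init__(self, num_buckets: int):
        self.num_buckets = num_buckets
        self.file_groups = {}  # bucket id -> file group id (hypothetical naming)

    def locate(self, record_key: str) -> str:
        bucket = bucket_id(record_key, self.num_buckets)
        # A new file group is created only when the bucket has none yet.
        if bucket not in self.file_groups:
            self.file_groups[bucket] = f"fg-{bucket:05d}"
        return self.file_groups[bucket]


if __name__ == "__main__":
    index = BucketIndex(num_buckets=8)
    for key in ("user-1", "user-2", "user-1"):
        print(key, "->", index.locate(key))
```

The point of the design is that locating a record needs only a hash computation: no file footer is read and no external service is consulted.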
To handle growing data, the team introduced a partition‑level bucket count. Bucket metadata is stored in a custom Hudi Metastore, so the bucket count can be raised for selected partitions without rewriting all historical data.
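A partition-level bucket count can be pictured as a small metadata lookup in front of the hash. In the sketch below, a plain dictionary stands in for the metastore; the partition names, the default of 4 buckets, and the function names are hypothetical and do not reflect the real Hudi Metastore API.

```python
import zlib

DEFAULT_BUCKETS = 4  # assumed table-level default

# Stand-in for the metastore: partition path -> bucket count (hypothetical data).
partition_buckets = {
    "dt=2022-01-01": 4,   # older partition keeps its original layout
    "dt=2022-06-01": 16,  # newer, larger partition uses more buckets
}


def buckets_for(partition: str) -> int:
    """Look up the bucket count recorded for a partition."""
    return partition_buckets.get(partition, DEFAULT_BUCKETS)


def locate(partition: str, record_key: str) -> tuple:
    """Return (partition, bucket id) using that partition's own bucket count."""
    num_buckets = buckets_for(partition)
    return partition, zlib.crc32(record_key.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    print(locate("dt=2022-01-01", "user-42"))
    print(locate("dt=2022-06-01", "user-42"))
```

Because each partition carries its own count, raising the count for new partitions leaves the files of older partitions untouched.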
Query Optimizations : Building on Hive‑style bucket pruning, Spark can eliminate the shuffle for joins on bucketed columns, coalesce shuffles when the bucket count is lower than the parallelism, and read fewer files when filter predicates align with the bucket keys.
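Two of these optimizations can be illustrated without Spark. The sketch below is plain Python, not Spark's planner: `prune_files` shows how an equality filter on the bucket key narrows the read to one bucket, and `bucket_join` shows why two tables bucketed the same way can be joined bucket by bucket with no shuffle. All names and the CRC32 hash are assumptions for illustration.

```python
import zlib


def bucket_id(key: str, num_buckets: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def prune_files(files_by_bucket: dict, num_buckets: int, key_value: str) -> list:
    """For a filter like `key = key_value`, only one bucket can hold matches."""
    return files_by_bucket.get(bucket_id(key_value, num_buckets), [])


def bucketize(rows, num_buckets: int) -> list:
    """Lay out (key, value) rows into buckets, as a bucketed table would."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def bucket_join(left_buckets: list, right_buckets: list) -> list:
    """Join bucket i with bucket i only; rows never move between buckets."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        right_by_key = {}
        for key, value in right:
            right_by_key.setdefault(key, []).append(value)
        for key, value in left:
            for right_value in right_by_key.get(key, []):
                joined.append((key, value, right_value))
    return joined


if __name__ == "__main__":
    orders = [(f"user-{i}", f"order-{i}") for i in range(6)]
    users = [(f"user-{i}", f"name-{i}") for i in range(0, 6, 2)]
    print(bucket_join(bucketize(orders, 4), bucketize(users, 4)))
```

The bucket-wise join is only valid when both sides use the same hash function and bucket count, which is why a consistent bucket layout matters for these optimizations.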
Extensible Hash Techniques : ByteDance evaluated Consistent Hash, Linear Hash, and Extensible Hash, ultimately adopting Extensible Hash for its ability to split or merge individual buckets while preserving logical mappings, facilitating seamless query optimizations.
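The ability to split a single bucket can be seen in a textbook version of the technique (usually called extendible hashing). The sketch below is a generic in-memory illustration, not ByteDance's implementation: the SHA-256 hash, the capacity of 4 keys per bucket, and all class names are assumptions. Merging is omitted for brevity.

```python
import hashlib


def h(key: str) -> int:
    """Stable 64-bit hash of a key (SHA-256 prefix, chosen for the example)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class Bucket:
    def __init__(self, depth: int):
        self.depth = depth  # local depth: how many hash bits this bucket uses
        self.keys = set()


class ExtendibleHash:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]  # 2**global_depth slots

    def _slot(self, key: str) -> int:
        return h(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key: str) -> bool:
        return key in self.directory[self._slot(key)].keys

    def insert(self, key: str) -> None:
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.keys or len(bucket.keys) < self.capacity:
                bucket.keys.add(key)
                return
            self._split(bucket)  # only the full bucket is split

    def _split(self, bucket: Bucket) -> None:
        if bucket.depth == self.global_depth:
            # Double the directory; both halves still share the old buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << bucket.depth
        low, high = Bucket(bucket.depth + 1), Bucket(bucket.depth + 1)
        for key in bucket.keys:
            (high if h(key) & bit else low).keys.add(key)
        # Re-point only the slots that referenced the split bucket.
        for i, current in enumerate(self.directory):
            if current is bucket:
                self.directory[i] = high if i & bit else low


if __name__ == "__main__":
    table = ExtendibleHash(capacity=4)
    for i in range(32):
        table.insert(f"key-{i}")
    distinct = {id(b) for b in table.directory}
    print("global depth:", table.global_depth, "buckets:", len(distinct))
```

A split touches one bucket and the directory entries that point to it; every other bucket keeps its contents, which is the property that makes incremental resizing cheap compared with rehashing everything.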
Non‑Primary‑Key Index : By appending data to small file groups without deduplication, non‑primary‑key indexes improve ingestion latency and avoid costly file lookups.
Future Plans : Upcoming features include a Multi‑Modal Index stored in a Hudi table to enable asynchronous secondary index building, and a Range Index that sorts base files by range keys and stores min/max values for efficient point and range queries.
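The Range Index idea, sorting by a range key and keeping min/max values per base file, can be sketched as follows. This is a conceptual illustration only: the `FileStats` structure, the `base-N` file names, and the integer keys are assumptions, and the planned Hudi feature may store and query these statistics differently.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    min_key: int
    max_key: int


def build_range_index(keys, rows_per_file: int) -> list:
    """Sort keys, cut them into base files, and record each file's min/max."""
    ordered = sorted(keys)
    index = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        index.append(FileStats(f"base-{start // rows_per_file}", chunk[0], chunk[-1]))
    return index


def files_for_range(index: list, lo: int, hi: int) -> list:
    """Keep only files whose [min, max] interval overlaps the query [lo, hi]."""
    return [f.name for f in index if f.max_key >= lo and f.min_key <= hi]


def files_for_point(index: list, key: int) -> list:
    """A point query is a range query with equal bounds."""
    return files_for_range(index, key, key)


if __name__ == "__main__":
    index = build_range_index(range(100), rows_per_file=10)
    print(files_for_point(index, 25))
    print(files_for_range(index, 18, 31))
```

Because the files are sorted by the range key, their intervals do not overlap, so a query typically touches only a handful of files regardless of table size.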
The article concludes with a FAQ covering bucket column considerations, multi‑column support, Hudi usage scenarios at ByteDance, migration experiences, and roadmap items such as secondary indexing and transaction support.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.