
Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

DataFunTalk

Apache Hudi 0.11.0 was released on April 30, 2022, introducing a series of new features and optimizations. This presentation deep‑dives into four main areas: multi‑level index, Spark SQL enhancements, Flink integration improvements, and other functional upgrades.

01 Multi‑Level Index

The multi‑level (also called multi‑modal) index is introduced to improve query performance on massive tables. It stores scalable, serverless metadata in a dedicated metadata table (implemented as a Hudi MOR table) and supports both synchronous and asynchronous index creation. The design ensures transactional updates, low‑latency point/range/prefix lookups, and efficient data skipping via column statistics. Benchmarks show HFile‑based metadata lookups can be 10‑100× faster than Parquet/Avro for millions of entries.

Key performance gains include:

File listing optimization: using metadata tables to list files reduces I/O by 2‑20×.

Data skipping: column stats enable file pruning, yielding 10‑30× faster queries on wide tables.

Bloom‑filter index: replacing footer‑based bloom filter reads with the metadata table's bloom filter index improves upsert speed by ~3× for tables with large file counts.
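These gains are all switched on through configuration. The sketch below shows how the metadata table and its index partitions might be enabled from Spark SQL; the key names follow Hudi's 0.11.0 configuration reference, but defaults and exact spellings should be verified against the docs for the version in use.

```sql
-- Writer side: build the metadata table with column-stats and bloom-filter partitions
set hoodie.metadata.enable=true;
set hoodie.metadata.index.column.stats.enable=true;
set hoodie.metadata.index.bloom.filter.enable=true;

-- Reader side: prune files using the column-stats index (data skipping)
set hoodie.enable.data.skipping=true;
```

With data skipping on, queries that filter on indexed columns only open the file groups whose min/max statistics overlap the predicate, which is where the 10‑30× speedups on wide tables come from.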

02 Spark SQL New Features

Hudi now supports updating or deleting records using non‑primary‑key fields, and introduces time‑travel queries via the timestamp as of syntax (Spark 3.2+). Example query:

select * from hudi_tbl timestamp as of '20210728141108100'

Additional capabilities include CALL commands for snapshot management, clustering, compaction, and other table operations, providing a richer procedural interface comparable to stored procedures in traditional databases.
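A short sketch of these new SQL capabilities is below. The table and column names (hudi_tbl, price, category, ts) are illustrative, and the procedure names follow the Hudi documentation but should be checked against the 0.11.0 procedure list before use.

```sql
-- Update and delete can now filter on non-primary-key fields
update hudi_tbl set price = price * 1.1 where category = 'books';
delete from hudi_tbl where ts < '20220101000000';

-- CALL commands expose table services as procedures
call show_commits(table => 'hudi_tbl', limit => 5);
call run_clustering(table => 'hudi_tbl');
```

The CALL interface is what makes snapshot management and table services scriptable from plain SQL, without dropping to the Java/Scala client.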

03 Flink Integration Improvements

Version 0.11.0 adds support for Flink 1.13.x and 1.14.x, complex data types, and a DFS‑based Flink HoodieCatalog. The new Bucket Index replaces Bloom‑filter indexing for massive tables (e.g., 34 TB, 500 billion records), reducing false positives and improving write performance by hashing keys to file groups.
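A Flink SQL table definition using the new bucket index might look like the following sketch. The schema, storage path, and bucket count are illustrative assumptions; 'index.type' = 'BUCKET' and 'hoodie.bucket.index.num.buckets' are the documented option names.

```sql
create table hudi_tbl (
  uuid varchar(36),
  name varchar(10),
  ts timestamp(3),
  primary key (uuid) not enforced
) with (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_tbl',
  'table.type' = 'MERGE_ON_READ',
  -- Bucket index: record keys hash to a fixed number of buckets per partition,
  -- so locating a record is a pure hash computation with no bloom-filter lookups
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '64'
);
```

Because the key‑to‑file‑group mapping is deterministic, the bucket index avoids the false‑positive file scans a bloom filter incurs at very large file counts; the trade‑off is that the bucket count is fixed at table creation and should be sized for expected data volume.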

04 Other Features and Enhancements

Spark DataSource query optimization: for tables using the default OverwriteWithLatestAvroPayload, queries read only the selected columns, reducing memory and CPU usage on wide tables.

Schema evolution for Spark 3.1/3.2: enable hoodie.schema.on.read.enable=true to add, drop, rename, or reorder columns.
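A minimal sketch of schema evolution in Spark SQL follows; the table and column names are hypothetical, and the ALTER TABLE forms mirror those in the Hudi schema evolution docs.

```sql
-- Must be set before any evolution DDL
set hoodie.schema.on.read.enable=true;

alter table hudi_tbl add columns (discount double);
alter table hudi_tbl rename column discount to rebate;
alter table hudi_tbl drop column rebate;
```

Because the evolution is resolved on read, existing data files are not rewritten; old files are reconciled against the latest schema at query time.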

Savepoints and restores via CALL commands, with MOR table support.
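As a sketch, savepoint management through the CALL interface might look as follows; the procedure names and parameters (create_savepoint, rollback_to_savepoint, and the instant timestamps) are taken from the Hudi procedure docs but should be double‑checked for 0.11.0.

```sql
-- Pin a commit so cleaning/archival will not remove it
call create_savepoint(table => 'hudi_tbl', commit_time => '20220430153000000');

-- Restore the table to the saved state
call rollback_to_savepoint(table => 'hudi_tbl', instant_time => '20220430153000000');
```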

Pulsar commit callbacks for downstream jobs (configuration keys: hoodie.write.commit.callback.pulsar.topic, hoodie.write.commit.callback.pulsar.broker.service.url).
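Wiring up the Pulsar callback is a configuration exercise; a sketch is below. The topic name and broker URL are placeholders, and the callback class name is an assumption based on Hudi's utilities module naming — confirm it against the 0.11.0 release notes.

```sql
set hoodie.write.commit.callback.on=true;
-- Assumed class name; verify against your Hudi distribution
set hoodie.write.commit.callback.class=org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallback;
set hoodie.write.commit.callback.pulsar.broker.service.url=pulsar://localhost:6650;
set hoodie.write.commit.callback.pulsar.topic=hudi-commits;
```

Once enabled, each successful commit publishes an event to the topic, letting downstream jobs trigger incremental processing without polling the timeline.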

Catalog synchronization: BigQuery COW support, DataHub schema sync, AWS Glue Data Catalog integration.

Overall, Apache Hudi 0.11.0 delivers substantial read/write performance improvements, richer SQL capabilities, tighter Flink integration, and broader ecosystem compatibility for large‑scale data lake workloads.

Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
