
Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

DataFunTalk

Apache Hudi 0.11.0 was released on April 30, 2022, introducing a series of new features and optimizations. This presentation deep‑dives into four main areas: multi‑level index, Spark SQL enhancements, Flink integration improvements, and other functional upgrades.

01 Multi‑Level Index

The multi‑level (also called multi‑modal) index is introduced to improve query performance on massive tables. It stores scalable, serverless metadata in a dedicated metadata table (implemented as a Hudi MOR table) and supports both synchronous and asynchronous index creation. The design ensures transactional updates, low‑latency point/range/prefix lookups, and efficient data skipping via column statistics. Benchmarks show HFile‑based metadata lookups can be 10‑100× faster than Parquet/Avro for millions of entries.

Key performance gains include:

File listing optimization: using metadata tables to list files reduces I/O by 2‑20×.

Data skipping: column stats enable file pruning, yielding 10‑30× faster queries on wide tables.

Bloom‑filter index: replacing footer‑based bloom filter reads with the metadata table's bloom filter index improves upsert speed by ~3× for tables with large file counts.
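These gains are all switched on through configuration. The sketch below shows how the metadata table and its index partitions might be enabled from Spark SQL; the key names follow Hudi's 0.11.0 configuration reference, but defaults and exact spellings should be verified against the docs for the version in use.

```sql
-- Writer side: build the metadata table with column-stats and bloom-filter partitions
set hoodie.metadata.enable=true;
set hoodie.metadata.index.column.stats.enable=true;
set hoodie.metadata.index.bloom.filter.enable=true;

-- Reader side: prune files using the column-stats index (data skipping)
set hoodie.enable.data.skipping=true;
```

With data skipping on, queries that filter on indexed columns only open the file groups whose min/max statistics overlap the predicate, which is where the 10‑30× speedups on wide tables come from.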

02 Spark SQL New Features

Hudi now supports updating or deleting records using non‑primary‑key fields, and introduces time‑travel queries via the timestamp as of syntax (Spark 3.2+). Example query:

select * from hudi_tbl timestamp as of '20210728141108100'

Additional capabilities include CALL commands for snapshot management, clustering, compaction, and other table operations, providing a richer procedural interface comparable to stored procedures in traditional databases.
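A short sketch of these new SQL capabilities is below. The table and column names (hudi_tbl, price, category, ts) are illustrative, and the procedure names follow the Hudi documentation but should be checked against the 0.11.0 procedure list before use.

```sql
-- Update and delete can now filter on non-primary-key fields
update hudi_tbl set price = price * 1.1 where category = 'books';
delete from hudi_tbl where ts < '20220101000000';

-- CALL commands expose table services as procedures
call show_commits(table => 'hudi_tbl', limit => 5);
call run_clustering(table => 'hudi_tbl');
```

The CALL interface is what makes snapshot management and table services scriptable from plain SQL, without dropping to the Java/Scala client.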

03 Flink Integration Improvements

Version 0.11.0 adds support for Flink 1.13.x and 1.14.x, complex data types, and a DFS‑based Flink HoodieCatalog. The new Bucket Index replaces Bloom‑filter indexing for massive tables (e.g., 34 TB, 500 billion records), reducing false positives and improving write performance by hashing keys to file groups.
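A Flink SQL table definition using the new bucket index might look like the following sketch. The schema, storage path, and bucket count are illustrative assumptions; 'index.type' = 'BUCKET' and 'hoodie.bucket.index.num.buckets' are the documented option names.

```sql
create table hudi_tbl (
  uuid varchar(36),
  name varchar(10),
  ts timestamp(3),
  primary key (uuid) not enforced
) with (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_tbl',
  'table.type' = 'MERGE_ON_READ',
  -- Bucket index: record keys hash to a fixed number of buckets per partition,
  -- so locating a record is a pure hash computation with no bloom-filter lookups
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '64'
);
```

Because the key‑to‑file‑group mapping is deterministic, the bucket index avoids the false‑positive file scans a bloom filter incurs at very large file counts; the trade‑off is that the bucket count is fixed at table creation and should be sized for expected data volume.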

04 Other Features and Enhancements

Spark DataSource query optimization: for tables using the default OverwriteWithLatestAvroPayload, queries read only the selected columns, reducing memory and CPU usage on wide tables.

Schema evolution for Spark 3.1/3.2: enable hoodie.schema.on.read.enable=true to add, drop, rename, or reorder columns.
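A minimal sketch of schema evolution in Spark SQL follows; the table and column names are hypothetical, and the ALTER TABLE forms mirror those in the Hudi schema evolution docs.

```sql
-- Must be set before any evolution DDL
set hoodie.schema.on.read.enable=true;

alter table hudi_tbl add columns (discount double);
alter table hudi_tbl rename column discount to rebate;
alter table hudi_tbl drop column rebate;
```

Because the evolution is resolved on read, existing data files are not rewritten; old files are reconciled against the latest schema at query time.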

Savepoints and restores via CALL commands, with MOR table support.
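As a sketch, savepoint management through the CALL interface might look as follows; the procedure names and parameters (create_savepoint, rollback_to_savepoint, and the instant timestamps) are taken from the Hudi procedure docs but should be double‑checked for 0.11.0.

```sql
-- Pin a commit so cleaning/archival will not remove it
call create_savepoint(table => 'hudi_tbl', commit_time => '20220430153000000');

-- Restore the table to the saved state
call rollback_to_savepoint(table => 'hudi_tbl', instant_time => '20220430153000000');
```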

Pulsar commit callbacks for downstream jobs (configuration keys: hoodie.write.commit.callback.pulsar.topic, hoodie.write.commit.callback.pulsar.broker.service.url).
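Wiring up the Pulsar callback is a configuration exercise; a sketch is below. The topic name and broker URL are placeholders, and the callback class name is an assumption based on Hudi's utilities module naming — confirm it against the 0.11.0 release notes.

```sql
set hoodie.write.commit.callback.on=true;
-- Assumed class name; verify against your Hudi distribution
set hoodie.write.commit.callback.class=org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallback;
set hoodie.write.commit.callback.pulsar.broker.service.url=pulsar://localhost:6650;
set hoodie.write.commit.callback.pulsar.topic=hudi-commits;
```

Once enabled, each successful commit publishes an event to the topic, letting downstream jobs trigger incremental processing without polling the timeline.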

Catalog synchronization: BigQuery COW support, DataHub schema sync, AWS Glue Data Catalog integration.

Overall, Apache Hudi 0.11.0 delivers substantial read/write performance improvements, richer SQL capabilities, tighter Flink integration, and broader ecosystem compatibility for large‑scale data lake workloads.

Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
