Apache Hudi 0.11.0 Release Highlights: Multi-Modal Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities
The Apache Hudi 0.11.0 release introduces multi-modal metadata indexing, improved data skipping, asynchronous indexing, extensive Spark and Flink integration improvements, a slim utilities bundle, and metadata synchronization with BigQuery, AWS Glue, and DataHub, along with a bucket index and encryption support.
Multi-Modal Index
In version 0.11.0, the Spark writer enables the metadata table by default, with synchronous updates and metadata-table-based file listing, to speed up partition and file listing on large Hudi tables. Readers must also set hoodie.metadata.enable=true to benefit.
The metadata table now includes two new indexes: a Bloom filter index for key lookup and file-level pruning during writes, and a column statistics index for pruning files by column-value range during reads. Both are disabled by default and can be enabled by setting hoodie.metadata.index.bloom.filter.enable and hoodie.metadata.index.column.stats.enable to true.
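As a minimal sketch, the flags above could be passed as Spark writer options as follows. The helper name write_with_metadata_indexes and the table_options argument are hypothetical; the write itself assumes a Spark session with a Hudi bundle on the classpath.

```python
# Writer-side options for the metadata table and its two new indexes.
METADATA_INDEX_OPTIONS = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}


def write_with_metadata_indexes(df, base_path, table_options):
    """Append `df` (a Spark DataFrame) to a Hudi table with both indexes on.

    `table_options` is a hypothetical dict holding the usual table settings
    (table name, record key, pre-combine field, and so on).
    """
    options = {**table_options, **METADATA_INDEX_OPTIONS}
    df.write.format("hudi").options(**options).mode("append").save(base_path)
    return options
```

Keeping the flags in one dictionary makes it easy to reuse the same settings across jobs that write to the same table.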
Data Skipping Using Metadata Table
Column statistics support in the metadata table lets data skipping rely on the metadata table's Column Stats Index (CSI) instead of a bespoke index implementation, so it works for all datasets whether or not they run layout optimizations such as clustering. To use it, set hoodie.enable.data.skipping=true on both writer and reader, and enable the metadata table and its column stats index.
Data skipping also understands standard functions in query filters. For example, if a column ts stores timestamps as strings, a predicate such as date_format(ts, "MM/dd/yyyy") < "04/01/2022" is supported. Currently, data skipping is available only for COW tables and MOR tables in read-optimized mode.
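A reader-side sketch of the settings described above might look like this. The function name and the base_path argument are illustrative, and the table is assumed to have been written with the column stats index enabled.

```python
# Reader-side options: the metadata table plus data skipping.
DATA_SKIPPING_READ_OPTIONS = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
}

# Filter taken from the text: `ts` is assumed to store timestamps as strings.
TS_FILTER = 'date_format(ts, "MM/dd/yyyy") < "04/01/2022"'


def read_with_data_skipping(spark, base_path):
    """Load a Hudi table with data skipping on and apply the date filter."""
    return (
        spark.read.format("hudi")
        .options(**DATA_SKIPPING_READ_OPTIONS)
        .load(base_path)
        .filter(TS_FILTER)
    )
```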
Asynchronous Index
A new asynchronous indexing service can build different kinds of indexes (files, Bloom filter, column stats) in the metadata table without blocking ingestion. The indexer adds a new “indexing” action to the timeline, and a lock provider must be configured so that it can coordinate safely with in-flight writers.
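The writer configuration for this setup could be sketched as below. Only the column stats flag comes from the text; the async flag, the concurrency mode, and the lock provider class are recalled from Hudi's configuration reference and should be treated as assumptions to verify against the version in use.

```python
# Sketch of writer properties for asynchronous indexing.
ASYNC_INDEXING_OPTIONS = {
    "hoodie.metadata.enable": "true",
    # Assumption: lets the index be built by the async indexer, not inline.
    "hoodie.metadata.index.async": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    # Assumption: a lock provider needs multi-writer concurrency control.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # Assumption: in-process locks only help when writer and indexer share a
    # JVM; separate jobs would need an external provider such as ZooKeeper.
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}


def to_properties(options):
    """Render options as key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(options.items()))
```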
Spark Data Source Improvements
MOR queries on tables with no log files (except incremental queries) now use the vectorized Parquet reader, which lets decoding take advantage of modern CPU vector instructions.
When a standard record payload is used (for example, OverwriteWithLatestAvroPayload), MOR tables fetch only the strictly necessary columns (primary key, pre-combine key) in addition to those referenced by the query, which substantially reduces wasted data throughput and decode cost, especially for wide tables.
Spark‑Based Schema‑on‑Read
Version 0.11.0 adds experimental Spark SQL DDL support (ALTER TABLE) for Spark 3.1.x and 3.2.1, allowing easy schema evolution.
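The kind of DDL this enables can be sketched as a list of Spark SQL statements. The table and column names are made up for illustration, and the hoodie.schema.on.read.enable setting is recalled from Hudi's schema evolution documentation rather than stated in this article.

```python
def schema_evolution_statements(table):
    """Return example Spark SQL statements for evolving a Hudi table schema."""
    return [
        # Assumption: schema-on-read has to be switched on for the session.
        "SET hoodie.schema.on.read.enable=true",
        f"ALTER TABLE {table} ADD COLUMNS (discount DOUBLE)",
        f"ALTER TABLE {table} ALTER COLUMN quantity TYPE BIGINT",
        f"ALTER TABLE {table} RENAME COLUMN note TO remarks",
    ]


def apply_schema_evolution(spark, table):
    """Run the statements in order; needs Spark 3.1.x or 3.2.1 with Hudi."""
    for statement in schema_evolution_statements(table):
        spark.sql(statement)
```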
Spark SQL Enhancements
Non‑primary‑key fields can now be used for updates or deletes.
Time‑travel queries are supported via timestamp as of syntax (Spark 3.2+ only).
A CALL command is added for invoking additional operations on Hudi tables.
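The three enhancements above could look roughly like this in practice. The table, columns, and timestamp are hypothetical, and show_commits is used as one example of a procedure that CALL can invoke.

```python
def spark_sql_examples(table, as_of):
    """Return example statements, keyed by the feature they illustrate."""
    return {
        # Update and delete filtered on a non-primary-key column (`city`).
        "update": f"UPDATE {table} SET price = price * 2 WHERE city = 'Paris'",
        "delete": f"DELETE FROM {table} WHERE city = 'Paris'",
        # Time travel (Spark 3.2+ only).
        "time_travel": f"SELECT * FROM {table} TIMESTAMP AS OF '{as_of}'",
        # CALL with named arguments.
        "call": f"CALL show_commits(table => '{table}', limit => 10)",
    }
```

Each value can be passed to spark.sql(...) on a session that has the Hudi Spark SQL extension configured.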
Spark Version and Bundles
Spark 3.2 support is added, with the bundle hudi-spark3.2-bundle or hudi-spark3-bundle (legacy name).
Spark 3.1 remains supported via hudi-spark3.1-bundle.
Spark 2.4 remains supported via hudi-spark2.4-bundle or hudi-spark-bundle (legacy name).
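A small helper can make the version-to-bundle mapping explicit. This is a sketch: the function name is hypothetical, and the Maven coordinate layout (group org.apache.hudi, Scala suffix on the artifact) is an assumption to check against the repository before use.

```python
SUPPORTED_SPARK_LINES = {"2.4", "3.1", "3.2"}


def hudi_spark_bundle(spark_version, scala_version="2.12", hudi_version="0.11.0"):
    """Return the version-specific bundle coordinate for a Spark version."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in SUPPORTED_SPARK_LINES:
        raise ValueError(f"No Hudi {hudi_version} bundle for Spark {spark_version}")
    artifact = f"hudi-spark{major_minor}-bundle_{scala_version}"
    return f"org.apache.hudi:{artifact}:{hudi_version}"
```

For example, hudi_spark_bundle("3.2.1") yields a coordinate suitable for the --packages option of spark-submit.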
Simplified Utilities Bundle
The new hudi-utilities-slim-bundle excludes dependencies that may conflict with other frameworks such as Spark. hudi-utilities-slim-bundle works with Spark 3.1 and 2.4. hudi-utilities-bundle continues to support Spark 3.1 as in 0.10.x.
Flink Integration Improvements
Support for Flink 1.13.x and 1.14.x.
Complex data types (Map, Array) are now supported, including nested structures.
A DFS‑based Flink catalog (identifier hudi) is added, creatable via API or CREATE CATALOG syntax.
Bucket index support for both UPSERT and BULK_INSERT operations; enable with index.type=BUCKET.
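The catalog and the bucket index can be sketched as Flink SQL statements built in Python, for example to pass to PyFlink's TableEnvironment.execute_sql. The catalog identifier hudi and index.type=BUCKET come from the text; the catalog.path option, the bucket-count option, and all table and column names are assumptions for illustration.

```python
def create_catalog_sql(name, catalog_path):
    """Flink SQL for a DFS-based Hudi catalog rooted at `catalog_path`."""
    return (
        f"CREATE CATALOG {name} WITH ("
        f"'type' = 'hudi', 'catalog.path' = '{catalog_path}')"
    )


def create_bucket_table_sql(table, path, num_buckets=4):
    """Flink SQL for a Hudi table with nested complex types and a bucket index."""
    return (
        f"CREATE TABLE {table} (\n"
        "  id STRING PRIMARY KEY NOT ENFORCED,\n"
        "  tags ARRAY<STRING>,\n"
        "  attrs MAP<STRING, ARRAY<INT>>,\n"
        "  ts TIMESTAMP(3)\n"
        ") WITH (\n"
        "  'connector' = 'hudi',\n"
        f"  'path' = '{path}',\n"
        "  'index.type' = 'BUCKET',\n"
        f"  'hoodie.bucket.index.num.buckets' = '{num_buckets}'\n"
        ")"
    )
```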
Google BigQuery Integration
Hudi tables can be queried as external tables from BigQuery by configuring org.apache.hudi.gcp.bigquery.BigQuerySyncTool as the sync tool for HoodieDeltaStreamer. This experimental feature applies only to hive‑style partitioned COW tables.
AWS Glue Metadata Sync
Hudi tables can sync directly to AWS Glue Data Catalog using org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool as the HoodieDeltaStreamer sync implementation.
DataHub Metadata Sync
Table schema and last commit timestamps can be synchronized to DataHub via org.apache.hudi.sync.datahub.DataHubSyncTool.
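The three sync tools are selected the same way, so one sketch covers them: a helper that builds the sync-related HoodieDeltaStreamer arguments. The helper and its short target names are hypothetical; the BigQuery extras reflect the hive-style-partitioning and drop-partition-columns requirements, and each tool also needs its own connection settings, which are omitted here.

```python
SYNC_TOOLS = {
    "bigquery": "org.apache.hudi.gcp.bigquery.BigQuerySyncTool",
    "glue": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
    "datahub": "org.apache.hudi.sync.datahub.DataHubSyncTool",
}


def deltastreamer_sync_args(targets):
    """Build the sync-related command-line arguments for HoodieDeltaStreamer."""
    classes = ",".join(SYNC_TOOLS[target] for target in targets)
    args = ["--enable-sync", "--sync-tool-classes", classes]
    if "bigquery" in targets:
        # BigQuery sync expects hive-style partitions without partition
        # columns stored in the data files.
        args += [
            "--hoodie-conf", "hoodie.datasource.write.hive_style_partitioning=true",
            "--hoodie-conf", "hoodie.datasource.write.drop.partition.columns=true",
        ]
    return args
```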
Encryption
Spark 3.2 support brings Parquet 1.12 with it, which adds Parquet encryption capabilities for COW tables.
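Because the feature comes from Parquet itself, it is configured through Parquet's columnar-encryption properties. The sketch below assembles them; the property names follow Parquet and Spark documentation, the helper names and sample columns are hypothetical, and InMemoryKMS is a mock key service meant only for demos (it may require Parquet's test artifact on the classpath).

```python
import base64


def parquet_encryption_conf(footer_key_id, column_keys, master_keys):
    """Build Hadoop properties for Parquet columnar encryption.

    column_keys: master key id -> list of column names to encrypt.
    master_keys: master key id -> raw 16-byte key (demo use only).
    """
    key_list = ", ".join(
        f"{key_id}:{base64.b64encode(raw).decode('ascii')}"
        for key_id, raw in master_keys.items()
    )
    column_spec = ";".join(
        f"{key_id}:{','.join(columns)}" for key_id, columns in column_keys.items()
    )
    return {
        "parquet.crypto.factory.class":
            "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory",
        # Mock KMS: never use outside of local experiments.
        "parquet.encryption.kms.client.class":
            "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS",
        "parquet.encryption.key.list": key_list,
        "parquet.encryption.footer.key": footer_key_id,
        "parquet.encryption.column.keys": column_spec,
    }


def as_spark_conf(hadoop_conf):
    """Prefix keys so they can be set on the Spark session builder."""
    return {f"spark.hadoop.{key}": value for key, value in hadoop_conf.items()}
```

A real deployment would replace the mock KMS class with a client for an actual key management service.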
Bucket Index
Version 0.11.0 introduces a lightweight bucket index that hashes record keys to assign records to buckets (each bucket maps to a file group). For Spark, set the index type to BUCKET and set hoodie.storage.layout.partitioner.class to org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner. For Flink, set index.type=BUCKET.
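The idea can be sketched in two parts: the Spark options for enabling the index, and a toy function showing how hashing a record key picks a bucket. The bucket-count option is an assumption, and bucket_for uses CRC32 as a stand-in; it is not Hudi's actual hash function.

```python
import zlib

SPARK_BUCKET_INDEX_OPTIONS = {
    "hoodie.index.type": "BUCKET",
    "hoodie.storage.layout.partitioner.class":
        "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner",
    # Assumption: the number of buckets is fixed through this option.
    "hoodie.bucket.index.num.buckets": "8",
}


def bucket_for(record_key, num_buckets):
    """Toy illustration: map a record key to a bucket id by hashing.

    CRC32 is used only because it is deterministic across runs.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets
```

Because the bucket is a pure function of the key, a writer can locate the target file group without looking anything up, which is what keeps this index lightweight.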
Savepoints and Recovery
Disaster recovery features now include support for MOR tables, extending the existing COW table savepoint and recovery capabilities.
Pulsar Write Commit Callback
In addition to the existing HTTP and Kafka callbacks, a Pulsar implementation (HoodieWriteCommitPulsarCallback) of the org.apache.hudi.callback.HoodieWriteCommitCallback interface is added for commit notifications.
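Enabling the Pulsar callback might look like the sketch below. Every key and the fully qualified class name are recalled from Hudi's configuration reference, not taken from this article, so treat them as assumptions to verify; the broker URL and topic are placeholders.

```python
def pulsar_callback_options(broker_url, topic):
    """Writer options that send a notification to Pulsar after each commit."""
    return {
        # Assumption: callbacks are off unless switched on explicitly.
        "hoodie.write.commit.callback.on": "true",
        # Assumption: package of the Pulsar callback implementation.
        "hoodie.write.commit.callback.class":
            "org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallback",
        "hoodie.write.commit.callback.pulsar.broker.service.url": broker_url,
        "hoodie.write.commit.callback.pulsar.topic": topic,
    }
```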
HiveSchemaProvider
The new org.apache.hudi.utilities.schema.HiveSchemaProvider allows HoodieDeltaStreamer to fetch schemas from user‑defined Hive tables.
Migration Guide
Bundle usage updates: Spark 3.0.x bundles are no longer officially supported; users are encouraged to upgrade to Spark 3.1 or 3.2 and to adopt version-specific bundles (e.g., hudi-spark3.2-bundle) instead of the legacy bundle names. The Spark and Utilities bundles no longer require the extra spark-avro package at runtime.
Configuration Updates
For MOR tables, hoodie.datasource.write.precombine.field is required for both write and read.
Set hoodie.datasource.write.drop.partition.columns=true only when using the BigQuery integration.
For Spark readers that rely on extracting physical partition paths, set hoodie.datasource.read.extract.partition.values.from.path=true to keep the existing behavior.
The default Spark index type changed from BLOOM to SIMPLE; if you relied on the BLOOM default, set the index type explicitly.
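A pre-upgrade check along these lines can catch two of the changes above before a job fails or silently switches index. The function name and its arguments are hypothetical; hoodie.index.type is assumed to be the key that selects the index.

```python
PRECOMBINE_KEY = "hoodie.datasource.write.precombine.field"
INDEX_TYPE_KEY = "hoodie.index.type"


def options_for_0_11(options, table_type, keep_bloom_index=False):
    """Return a copy of `options` adjusted for the 0.11.0 defaults.

    Raises ValueError if a MOR table has no pre-combine field configured.
    """
    updated = dict(options)
    if table_type == "MERGE_ON_READ" and PRECOMBINE_KEY not in updated:
        raise ValueError(f"{PRECOMBINE_KEY} is required for MOR tables")
    if keep_bloom_index:
        # Pin the old default only when no index type was chosen explicitly.
        updated.setdefault(INDEX_TYPE_KEY, "BLOOM")
    return updated
```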
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
