Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Big Data Technology & Architecture

May 4, 2022

Multi‑Mode Index

In version 0.11.0, Spark writers enable the metadata table by default, with synchronous updates and metadata-table-based file listing, to improve partition and file-listing performance on large Hudi tables. Readers must enable the metadata table explicitly (hoodie.metadata.enable=true) to benefit.

The metadata table now includes two new indexes: a Bloom filter index for file‑level pruning during writes, and a column statistics index for column‑value range pruning during reads. Both are disabled by default and can be enabled via hoodie.metadata.index.bloom.filter.enable and hoodie.metadata.index.column.stats.enable.
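As a rough illustration, the writer-side options could be collected as below. The two index keys are the ones named above and `hoodie.metadata.enable` is the metadata table switch. The helper name `metadata_index_options` is hypothetical, and passing the result to a Spark writer is only hinted at in a comment.

```python
def metadata_index_options(bloom_filter=True, column_stats=True):
    """Build Hudi writer options that turn on the metadata table indexes."""
    def flag(value):
        return "true" if value else "false"

    return {
        # The metadata table must be on for either index to be built.
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.bloom.filter.enable": flag(bloom_filter),
        "hoodie.metadata.index.column.stats.enable": flag(column_stats),
    }


opts = metadata_index_options()
# With PySpark and a Hudi bundle on the classpath (assumption), these would be
# passed as: df.write.format("hudi").options(**opts)...
```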

Data Skipping Using Metadata Table

Column statistics support in the metadata table lets data skipping rely on the column stats index (CSI) instead of a custom index implementation, and it works for all datasets regardless of whether layout optimizations have been applied. To use it, set hoodie.enable.data.skipping=true and turn on both the metadata table and the column stats index.

Filters may use standard functions and common expressions, for example date_format(ts, "MM/dd/yyyy") < "04/01/2022". Currently, data skipping is available only for COW tables and for MOR tables in read-optimized mode.
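The idea behind skipping on column statistics can be sketched in a few lines of plain Python. This is a conceptual model only, not Hudi's implementation: the `ColumnStats` class, the file names, and the value ranges are invented for illustration. The reader options at the end use the data-skipping key named above together with `hoodie.metadata.enable`.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Per-file min/max for one column (hypothetical, simplified)."""
    file_name: str
    min_value: int
    max_value: int


def files_to_scan(stats, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper]."""
    return [
        s.file_name
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    ColumnStats("f1.parquet", 0, 99),
    ColumnStats("f2.parquet", 100, 199),
    ColumnStats("f3.parquet", 200, 299),
]

# A filter such as `col BETWEEN 150 AND 220` cannot match anything in f1.
selected = files_to_scan(stats, 150, 220)

# Reader-side switches; the column stats index must already exist in the
# metadata table for skipping to take effect.
reader_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
}
```

Files whose range cannot satisfy the predicate are never opened, which is why the benefit grows with the number of files.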

Asynchronous Index

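A new asynchronous indexing service can build the available index types (files, Bloom filter, column stats) in the metadata table without blocking ingestion. The indexer adds a new "indexing" action to the timeline. Although indexing itself does not block writers, a lock provider must be configured so that the indexer and in-flight writers coordinate safely.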
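A minimal sketch of the properties a writer might carry when indexing runs asynchronously. The keys `hoodie.metadata.index.async`, `hoodie.write.concurrency.mode`, and `hoodie.write.lock.provider` are not named in this article and are included here as assumptions about the relevant Hudi settings. The lock provider usually needs further provider-specific settings, which are omitted. The helper names are hypothetical.

```python
def async_index_options(lock_provider_class):
    """Options for a writer that coexists with an asynchronous indexer."""
    return {
        "hoodie.metadata.enable": "true",
        # Assumed key: tells the writer the index is built out of band.
        "hoodie.metadata.index.async": "true",
        "hoodie.metadata.index.column.stats.enable": "true",
        # Assumed keys: writer and indexer coordinate through a lock.
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.write.lock.provider": lock_provider_class,
    }


def to_properties(options):
    """Render options as key=value lines, sorted for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(options.items()))


props = to_properties(
    async_index_options(
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"
    )
)
```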

Spark Data Source Improvements

MOR queries on tables with no log files (except incremental queries) now use the vectorized Parquet reader, which lets decoding take advantage of modern CPU vector instructions.

When a standard record payload implementation is used, MOR queries read only the columns the query references plus the strictly necessary ones (primary key, pre-combine key), which substantially reduces data throughput and decode cost, especially for wide tables.

Spark‑Based Schema‑on‑Read

Version 0.11.0 adds experimental Spark SQL DDL support (ALTER TABLE) for Spark 3.1.x and 3.2.1, allowing easy schema evolution.
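The statements involved might look like the following sketch. The table and column names are made up, the helper is hypothetical, and the `hoodie.schema.on.read.enable` switch is an assumption not mentioned in this article. Running the statements requires a Spark session with Hudi's SQL extensions, which is only indicated in a comment.

```python
def schema_evolution_statements(table, column, column_type):
    """Return Spark SQL statements that add one column to a Hudi table."""
    return [
        # Assumed switch for the experimental schema-on-read support.
        "SET hoodie.schema.on.read.enable=true",
        f"ALTER TABLE {table} ADD COLUMNS ({column} {column_type})",
    ]


statements = schema_evolution_statements("hudi_orders", "discount", "double")
# With a Spark 3.1.x or 3.2.1 session: for s in statements: spark.sql(s)
```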

Spark SQL Enhancements

Non‑primary‑key fields can now be used for updates or deletes.

Time-travel queries are supported via the TIMESTAMP AS OF syntax (Spark 3.2+ only).

A CALL command is added for invoking additional operations on Hudi tables.
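The three enhancements above could be exercised with statements like these. This is a sketch under assumptions: the table `hudi_orders` and its columns are invented, and `show_commits` is used as an example procedure name for CALL without this article confirming it.

```python
def time_travel_query(table, as_of):
    """Build a time-travel SELECT (needs Spark 3.2+ per the note above)."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{as_of}'"


examples = [
    time_travel_query("hudi_orders", "2022-04-01 00:00:00"),
    # `customer` is assumed to be a non-primary-key column.
    "UPDATE hudi_orders SET price = price * 2 WHERE customer = 'acme'",
    "DELETE FROM hudi_orders WHERE customer = 'acme'",
    # Assumed procedure name and arguments.
    "CALL show_commits(table => 'hudi_orders', limit => 10)",
]
# With a Hudi-enabled Spark session: for sql in examples: spark.sql(sql)
```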

Spark Version and Bundles

Spark 3.2 is now supported; use hudi-spark3.2-bundle or hudi-spark3-bundle (legacy bundle name).

Spark 3.1 remains supported via hudi-spark3.1-bundle.

Spark 2.4 remains supported via hudi-spark2.4-bundle or hudi-spark-bundle (legacy bundle name).
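The mapping from Spark version to version-specific bundle can be written down as a small lookup. The function name is hypothetical. Real Maven artifact IDs also carry a Scala version suffix, which is deliberately left out of this sketch.

```python
VERSIONED_BUNDLES = {
    "3.2": "hudi-spark3.2-bundle",
    "3.1": "hudi-spark3.1-bundle",
    "2.4": "hudi-spark2.4-bundle",
}


def spark_bundle_for(spark_version):
    """Return the version-specific Hudi bundle for a Spark version string."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in VERSIONED_BUNDLES:
        raise ValueError(f"no Hudi 0.11.0 bundle listed for Spark {spark_version}")
    return VERSIONED_BUNDLES[major_minor]


bundle = spark_bundle_for("3.2.1")
```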

Simplified Utilities Bundle

The new hudi-utilities-slim-bundle excludes dependencies that may conflict with other frameworks such as Spark. hudi-utilities-slim-bundle works with Spark 3.1 and 2.4. hudi-utilities-bundle continues to support Spark 3.1 as in 0.10.x.

Flink Integration Improvements

Support for Flink 1.13.x and 1.14.x.

Complex data types (Map, Array) are now supported, including nested structures.

A DFS‑based Flink catalog (identifier hudi) is added, creatable via API or CREATE CATALOG syntax.

Bucket index support for both UPSERT and BULK_INSERT operations; enable with index.type=BUCKET.

Google BigQuery Integration

Hudi tables can be queried as external tables from BigQuery by configuring org.apache.hudi.gcp.bigquery.BigQuerySyncTool as the sync tool for HoodieDeltaStreamer. This experimental feature applies only to hive‑style partitioned COW tables.

AWS Glue Metadata Sync

Hudi tables can sync directly to AWS Glue Data Catalog using org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool as the HoodieDeltaStreamer sync implementation.

DataHub Metadata Sync

Table schema and last commit timestamps can be synchronized to DataHub via org.apache.hudi.sync.datahub.DataHubSyncTool.
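The three sync tools above plug into HoodieDeltaStreamer in the same way, so selecting them can be sketched as one helper. The class names come from the text. The `--enable-sync` and `--sync-tool-classes` flags are an assumption about the DeltaStreamer command line, and each tool needs its own additional settings that are not shown.

```python
SYNC_TOOLS = {
    "bigquery": "org.apache.hudi.gcp.bigquery.BigQuerySyncTool",
    "glue": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
    "datahub": "org.apache.hudi.sync.datahub.DataHubSyncTool",
}


def sync_args(*targets):
    """Build DeltaStreamer arguments that select one or more sync tools."""
    if not targets:
        raise ValueError("at least one sync target is required")
    classes = ",".join(SYNC_TOOLS[target] for target in targets)
    # Flag names are assumed, see the note above.
    return ["--enable-sync", "--sync-tool-classes", classes]


args = sync_args("glue", "datahub")
```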

Encryption

Spark 3.2 support brings in Parquet 1.12, which adds encryption capabilities for COW tables.

Bucket Index

Version 0.11.0 introduces a lightweight bucket index that hashes record keys to assign records to buckets (each bucket maps to a file group). For Spark, set the index type to BUCKET and set hoodie.storage.layout.partitioner.class to org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner. For Flink, set index.type=BUCKET.
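The routing idea is simple enough to sketch: a stable hash of the record key, taken modulo a fixed bucket count, picks the bucket. The sketch below uses CRC32 purely for illustration and is not Hudi's actual hash function. The bucket count of 8 is arbitrary, and `hoodie.index.type` is assumed to be the key behind "index type".

```python
import zlib


def bucket_for(record_key, num_buckets):
    """Map a record key to a bucket id in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # CRC32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


keys = ["order-1", "order-2", "order-3", "order-1"]
assignments = [bucket_for(key, 8) for key in keys]

# Spark writer options taken from the description above.
spark_bucket_opts = {
    "hoodie.index.type": "BUCKET",
    "hoodie.storage.layout.partitioner.class":
        "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner",
}
```

Because the bucket is a pure function of the key, locating the file group for an update needs no index lookup, which is what makes this index lightweight.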

Savepoints and Recovery

Disaster recovery features now include support for MOR tables, extending the existing COW table savepoint and recovery capabilities.

Pulsar Write Commit Callback

In addition to the existing HTTP and Kafka callbacks, a Pulsar callback (HoodieWriteCommitPulsarCallback) is added for commit notifications. Like the others, it implements org.apache.hudi.callback.HoodieWriteCommitCallback.

HiveSchemaProvider

The new org.apache.hudi.utilities.schema.HiveSchemaProvider allows HoodieDeltaStreamer to fetch schemas from user‑defined Hive tables.

Migration Guide

Bundle usage updates: Spark 3.0.x bundles are no longer officially supported; users are encouraged to adopt version‑specific bundles (e.g., hudi‑spark3.2‑bundle). The Spark/Utilities bundles no longer require the extra spark‑avro package.

Configuration Updates

For MOR tables, hoodie.datasource.write.precombine.field is required for both write and read.

When using BigQuery integration, set hoodie.datasource.write.drop.partition.columns=true.

For Spark readers that rely on extracting partition values from the physical partition path, set hoodie.datasource.read.extract.partition.values.from.path=true.

Default Spark index type changed from BLOOM to SIMPLE; adjust configurations if relying on BLOOM.
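A small pre-upgrade check along these lines can flag the two changes most likely to affect existing jobs. This is a sketch only: the function is hypothetical, `hoodie.index.type` is assumed to be the key for the Spark index type, and the table type strings are assumed values.

```python
PRECOMBINE_KEY = "hoodie.datasource.write.precombine.field"
INDEX_TYPE_KEY = "hoodie.index.type"  # assumed key for the Spark index type


def migration_warnings(options, table_type):
    """List 0.11.0 configuration changes that affect the given job options."""
    warnings = []
    if table_type == "MERGE_ON_READ" and PRECOMBINE_KEY not in options:
        warnings.append(f"{PRECOMBINE_KEY} is required for MOR tables")
    if INDEX_TYPE_KEY not in options:
        warnings.append(
            f"{INDEX_TYPE_KEY} is unset: the default is now SIMPLE, not BLOOM"
        )
    return warnings


legacy_job = {"hoodie.table.name": "hudi_orders"}
found = migration_warnings(legacy_job, "MERGE_ON_READ")
```

A job that depended on the old default can keep its behavior by setting the index type to BLOOM explicitly.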

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
