
Apache Hudi 0.12.0 Release Highlights: Presto Connector, Archive Beyond Savepoint, File‑System Locks, Deltastreamer Termination, Spark & Flink Support, Performance Improvements, and Configuration Updates

The Apache Hudi 0.12.0 release introduces a native Presto connector, archive‑beyond‑savepoint capability, file‑system based locking, new deltastreamer termination strategies, expanded Spark and Flink support, numerous performance enhancements, and a series of configuration and API updates for better data‑lake management.


Presto‑Hudi Connector

Starting with PrestoDB 0.275, users can query Hudi tables using a native Hudi connector, offering functionality comparable to the Hive connector. See the Presto documentation for details.
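As a sketch, a Presto catalog file for the new connector might look like the following (the metastore URI and file location are examples; consult the Presto Hudi connector documentation for the full option list):

```properties
# etc/catalog/hudi.properties -- registers a "hudi" catalog in Presto
connector.name=hudi
# Example metastore endpoint; point this at your Hive metastore
hive.metastore.uri=thrift://localhost:9083
```

Hudi tables registered in the metastore can then be queried through the `hudi` catalog just as they previously were through the Hive connector.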

Archive Beyond Savepoint

Hudi supports savepoint and restore for backup and point‑in‑time recovery. With hoodie.archive.beyond.savepoint enabled, archival can proceed past savepointed commits instead of stopping at the earliest savepoint, so specific older commits can be retained long term and queried with as.of.instant semantics against any savepointed instant.

Note: Enabling this feature currently disables restore support; the limitation will be relaxed in future releases (see HUDI‑4500).
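Once archival has moved past a savepointed commit, that instant remains queryable via time travel. A hypothetical Spark SQL query (table name and instant are placeholders; on older Spark versions, the as.of.instant datasource read option is the equivalent):

```sql
-- Read the table as of a savepointed instant (placeholder values)
SELECT * FROM hudi_trips TIMESTAMP AS OF '20220812091000000';
```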

File‑System Based Lock

Optimistic concurrency control now supports a file‑system based lock provider, avoiding external systems. Configure it with:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider

Deltastreamer Termination Strategy

A new post‑write termination strategy interface lets the deltastreamer shut down gracefully in continuous mode, for example once no new data arrives. The interface:

/**
 * Post write termination strategy for deltastreamer in continuous mode.
 */
public interface PostWriteTerminationStrategy {

  /**
   * Returns whether deltastreamer needs to be shutdown.
   * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses.
   * @return true if deltastreamer has to be shutdown. false otherwise.
   */
  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>>
      scheduledCompactionInstantAndWriteStatuses);

}

The built‑in NoNewDataTerminationStrategy can be used, or users may implement custom strategies.
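As a sketch, the strategy is wired in via the deltastreamer's --post-write-termination-strategy-class option when running in continuous mode (flag and class names as of 0.12; verify them against your version):

```
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  <your usual deltastreamer arguments> \
  --continuous \
  --post-write-termination-strategy-class \
    org.apache.hudi.utilities.deltastreamer.NoNewDataTerminationStrategy
```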

Spark 3.3 Support

Version 0.12.0 adds Spark 3.3 support via the hudi-spark3.3-bundle or hudi-spark3-bundle. Spark 3.2, 3.1, and 2.4 remain supported.
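For example, a spark-shell session can pull the new bundle from Maven (coordinates as published for 0.12.0; the serializer and extensions settings follow the Hudi quickstart):

```
spark-shell \
  --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```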

Spark SQL Support Improvements

Call Procedure now supports upgrade, downgrade, bootstrap, clean, rollback, and repair.

ANALYZE TABLE is now supported for collecting table and column statistics.

SQL syntax for creating, dropping, showing, and refreshing indexes is added.
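A few hypothetical invocations tying these together (table names, instants, and index columns are placeholders, and the index DDL shown is a sketch of the new syntax; see the Hudi SQL procedures documentation for exact signatures):

```sql
-- Roll the table back to an earlier instant via Call Procedure
CALL rollback_to_instant(table => 'hudi_trips', instant_time => '20220810094200000');

-- Create, inspect, and drop an index (syntax sketch)
CREATE INDEX idx_city ON hudi_trips (city);
SHOW INDEXES FROM hudi_trips;
DROP INDEX idx_city ON hudi_trips;
```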

Flink 1.15 Support

Flink 1.15.x integrates with Hudi using the hudi-flink1.15-bundle. Earlier Flink versions continue to be supported.

Flink Integration Improvements

Data skipping is now available in batch read mode; enable it by setting metadata.enabled, hoodie.metadata.index.column.stats.enable, and read.data.skipping.enabled to true.

A new Hive‑metastore‑backed catalog is available under the catalog type hudi; select it with 'mode' = 'hms' (the default 'dfs' mode remains available).

Flink INSERT now supports asynchronous clustering using clustering.schedule.enabled and clustering.async.enabled.
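Putting the Flink pieces together, a sketch of a Flink SQL session (paths, names, and column definitions are placeholders; option keys as documented for 0.12):

```sql
-- Hypothetical HMS-backed Hudi catalog
CREATE CATALOG hudi_catalog WITH (
  'type' = 'hudi',
  'mode' = 'hms',
  'catalog.path' = 'hdfs:///warehouse/hudi'  -- placeholder path
);

-- Table with batch-mode data skipping and async clustering enabled
CREATE TABLE hudi_trips (
  uuid STRING PRIMARY KEY NOT ENFORCED,
  city STRING,
  ts   TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/hudi/hudi_trips',  -- placeholder path
  'metadata.enabled' = 'true',
  'hoodie.metadata.index.column.stats.enable' = 'true',
  'read.data.skipping.enabled' = 'true',
  'clustering.schedule.enabled' = 'true',
  'clustering.async.enabled' = 'true'
);
```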

Performance Improvements

Key performance gains include:

Narrowed the write performance gap when ingesting through the Spark datasource.

All built‑in key generators now use a high‑performance Spark‑specific API.

Replaced UDFs in bulk insert with RDD transformations to lower serde cost.

Optimized column‑statistics index for data skipping.

Benchmark results against TPC‑DS are available in the Hudi blog.

Migration Guide

Several API and configuration defaults have changed. The default table version is now 5, and automatic upgrades are performed for older tables. Notable configuration updates: hoodie.bulkinsert.sort.mode default changed from GLOBAL_SORT to NONE. hoodie.datasource.hive_sync.partition_value_extractor default switched to MultiPartKeysValueExtractor.

Various META_SYNC_* settings are now inferred from other configs.

API Updates

The getRecordKey method in SparkKeyGeneratorInterface now returns UTF8String instead of String:

// Before
String getRecordKey(InternalRow row, StructType schema);

// After
UTF8String getRecordKey(InternalRow row, StructType schema);

Fallback Partition

When partition values are null, Hudi falls back to a default partition. The fallback changed from __HIVE_DEFAULT_PARTITION__ to default after 0.9.0, and was switched back to __HIVE_DEFAULT_PARTITION__ in 0.12.0. A validation skip can be enabled via hoodie.skip.default.partition.validation.

Bundle Updates

A separate hudi-aws-bundle extracts the AWS‑related dependencies (Glue sync, CloudWatch metrics, and the DynamoDB lock provider) out of the core bundles.

Added Spark 3.3 support (hudi-spark3.3-bundle).

Continued support for Spark 3.2, 3.1, 2.4 bundles.

Added Flink 1.15 support (hudi-flink1.15-bundle) and continued support for Flink 1.14 and 1.13 bundles.

Acknowledgments

Thanks to all contributors of the 0.12.0 release. Data‑lake enthusiasts are welcome to join the Apache Hudi community and star/fork the repository at https://github.com/apache/hudi.

Tags: Flink, Presto, Spark, Apache Hudi
Written by Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies