Apache Hudi 0.5.1 Release Highlights and Upgrade Guide
The Apache Hudi 0.5.1 release upgrades the Spark, Avro, Parquet, and Kafka dependencies and adds Scala 2.11 and 2.12 support, a new timeline layout, CLI enhancements, renamed DeltaStreamer parameters and Kafka offset enum values, a relocated key-generator package, a new Hive sync option, a dynamic Bloom filter, bulk-insert support in HDFSParquetImporter, and WASB/WASBS cloud storage support.
After roughly three months of development, the Apache Hudi community announced version 0.5.1, the second Apache‑incubated release, bringing a series of important upgrades and new features.
Dependency upgrades: Spark was upgraded from 2.1.0 to 2.4.4, Avro from 1.7.7 to 1.8.2, Parquet from 1.8.1 to 1.10.1, and Kafka from 0.8.2.1 to 2.0.0 (via the spark‑streaming‑kafka artifact transition). Important: Hudi 0.5.1 requires Spark 2.4 or higher.
Scala support: Hudi now supports Scala 2.11 and 2.12. Artifact names have been renamed to include the Scala version suffix, e.g., hudi-spark_{scala_version}, hudi-utilities_{scala_version}, etc.
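The naming rule above can be sketched as a small helper. This is only an illustration: the function name hudi_artifact is hypothetical, and the list of accepted Scala versions simply mirrors the two versions named in this article.

```python
# Scala versions named in this article for Hudi 0.5.1.
SUPPORTED_SCALA = ("2.11", "2.12")


def hudi_artifact(module: str, scala_version: str) -> str:
    """Return the artifact name with the Scala suffix, e.g. hudi-spark_2.11.

    `hudi_artifact` is a hypothetical helper, not part of Hudi.
    """
    if scala_version not in SUPPORTED_SCALA:
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"{module}_{scala_version}"


print(hudi_artifact("hudi-spark", "2.11"))      # hudi-spark_2.11
print(hudi_artifact("hudi-utilities", "2.12"))  # hudi-utilities_2.12
```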
Timeline layout: Hudi 0.5.1 adds a new timeline layout that avoids renaming timeline metadata files. New tables enable the new layout by default; existing tables keep it disabled. To enable it on an existing table, set the write configuration hoodie.timeline.layout.version=1, or run the CLI command repair overwrite-hoodie-props to add hoodie.timeline.layout.version=1 to hoodie.properties. Upgrade Hudi readers to 0.5.1 before upgrading writers.
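Before and after switching layouts, it can help to confirm what a table's hoodie.properties actually contains. The following is a minimal read-only sketch that parses the file's text; the helper name is hypothetical, and treating a missing property as version 0 (the old layout) is an assumption based on the statement that existing tables keep the new layout disabled.

```python
LAYOUT_KEY = "hoodie.timeline.layout.version"


def timeline_layout_version(properties_text: str) -> int:
    """Return the timeline layout version found in hoodie.properties text.

    Assumption: a missing property means the old layout, reported here as 0.
    """
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and Java-properties style comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == LAYOUT_KEY:
            return int(value.strip())
    return 0


# Hypothetical file contents; "trips" is a made-up table name.
sample = "hoodie.table.name=trips\nhoodie.timeline.layout.version=1\n"
print(timeline_layout_version(sample))                       # 1
print(timeline_layout_version("hoodie.table.name=trips\n"))  # 0
```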
CLI enhancement: The repair overwrite-hoodie-props command can rewrite the hoodie.properties file, allowing table name changes or activation of the new timeline layout. Note that temporary query failures may occur while the file is being rewritten.
DeltaStreamer changes: The parameter for specifying table type changed from --storage-type to --table-type. Additionally, the Kafka reset‑offset enum values were renamed (LARGEST → LATEST, SMALLEST → EARLIEST) and are configured via auto.offset.reset.
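For jobs launched from scripts, the two renames can be applied mechanically. The sketch below is an assumption-laden illustration: the helper names are hypothetical, it only handles the flag when it is passed as a separate argument (not in --flag=value form), and it matches the old offset values case-insensitively while leaving anything else untouched.

```python
# Old -> new Kafka reset-offset values, as listed above.
OFFSET_RENAMES = {"LARGEST": "LATEST", "SMALLEST": "EARLIEST"}


def migrate_args(args):
    """Replace the old --storage-type flag with --table-type."""
    return ["--table-type" if a == "--storage-type" else a for a in args]


def migrate_offset_reset(props):
    """Return a copy of props with auto.offset.reset mapped to the new value."""
    updated = dict(props)
    value = updated.get("auto.offset.reset")
    if value is not None:
        # Unknown or already-migrated values pass through unchanged.
        updated["auto.offset.reset"] = OFFSET_RENAMES.get(value.upper(), value)
    return updated


print(migrate_args(["--storage-type", "MERGE_ON_READ"]))
print(migrate_offset_reset({"auto.offset.reset": "LARGEST"}))
```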
Spark‑shell usage: When exploring Hudi with spark-shell, include the extra package --packages org.apache.spark:spark-avro_2.11:2.4.4 as described in the quick‑start guide.
Key generator relocation: The key‑generator classes have moved to the package org.apache.hudi.keygen. If you override the key‑generator class via hoodie.datasource.write.keygenerator.class, update the fully‑qualified class name accordingly.
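Updating the configured class name can be sketched as follows. This assumes the built-in key generators previously sat directly under org.apache.hudi (for example org.apache.hudi.SimpleKeyGenerator); that old location is an assumption, not something stated in this article, and relocate_keygen is a hypothetical helper.

```python
OLD_PACKAGE = "org.apache.hudi"          # assumed pre-0.5.1 location
NEW_PACKAGE = "org.apache.hudi.keygen"   # location named in this article


def relocate_keygen(class_name: str) -> str:
    """Map an old built-in key-generator class name to the new package.

    Names outside the assumed old package (custom or already relocated
    classes) are returned unchanged.
    """
    package, _, simple = class_name.rpartition(".")
    if package == OLD_PACKAGE and simple.endswith("KeyGenerator"):
        return f"{NEW_PACKAGE}.{simple}"
    return class_name


options = {"hoodie.datasource.write.keygenerator.class": "org.apache.hudi.SimpleKeyGenerator"}
key = "hoodie.datasource.write.keygenerator.class"
options[key] = relocate_keygen(options[key])
print(options[key])  # org.apache.hudi.keygen.SimpleKeyGenerator
```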
Hive sync tool: For MOR tables, the Hive sync tool now registers the read-optimized (RO) table with a _ro suffix. Use the --skip-ro-suffix option to keep the original table name during synchronization.
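Downstream queries that reference the table by name may need to account for the suffix. A minimal sketch of the naming rule, with a hypothetical helper and a made-up table name:

```python
def ro_table_name(table: str, skip_ro_suffix: bool = False) -> str:
    """Name under which the MOR table is registered in Hive, per the rule above.

    `skip_ro_suffix` mirrors the --skip-ro-suffix option; this helper itself
    is hypothetical and not part of the Hive sync tool.
    """
    return table if skip_ro_suffix else f"{table}_ro"


print(ro_table_name("trips"))                       # trips_ro
print(ro_table_name("trips", skip_ro_suffix=True))  # trips
```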
hudi-hadoop-mr-bundle: This bundle now shades the Avro library to support real-time queries. If you implement custom record-merge logic by providing your own HoodieRecordPayload, relocate the Avro dependency in the jar that contains that implementation in the same way, for example:
<relocation>
  <pattern>org.apache.avro.</pattern>
  <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
</relocation>
Additional enhancements: DeltaStreamer has improved delete support and adds integration with AWS Database Migration Service (DMS). Dynamic Bloom filter support is available via the configuration hoodie.bloom.index.filter.type=DYNAMIC_V0. HDFSParquetImporter can perform bulk inserts using --command bulkinsert. Finally, Hudi 0.5.1 adds support for Azure WASB and WASBS cloud storage.
For more details, refer to the official release page: https://hudi.apache.org/releases.html#release-051-incubating .
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
