
Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark-based data ingestion tool likened to a Swiss Army knife. It details the tool's core options, including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features, and guides readers through practical pipeline setup.


Apache Hudi Streamer is a Spark application designed to simplify end‑to‑end data ingestion pipelines for Hudi tables, offering a rich set of configurable options that make it a "Swiss Army knife" for lakehouse data loading.

Key command‑line options include:

--table-type (COPY_ON_WRITE or MERGE_ON_READ), --target-table, and --target-base-path to define the target Hudi table.

--continuous to run the streamer in a perpetual mode; without it the job runs once.

--min-sync-interval-seconds to enforce a minimum pause between ingestion cycles when used with continuous mode.

--op to choose the write operation (UPSERT, INSERT, BULK_INSERT).

--filter-dupes (hoodie.combine.before.insert) to de‑duplicate records before insert operations.

--props and --hoodie-conf for supplying arbitrary Hudi configuration properties, with the latter taking precedence.
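Putting these options together, a minimal one-shot invocation might look like the sketch below. The bundle jar name, table name, and base path are placeholders, and the entry-point class has moved over time (older releases use org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer):

```shell
# Sketch of a one-shot ingestion run; jar, paths, and table name are placeholders.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --target-table user_events \
  --target-base-path s3://my-lake/user_events \
  --op UPSERT \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --props kafka-source.properties
```

Adding --continuous (optionally with --min-sync-interval-seconds) turns the same command into a long-running pipeline.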

Source configuration is abstracted via the Source interface; users specify the fully‑qualified class name with --source-class and configure source‑specific properties (e.g., Kafka sources require hoodie.streamer.source.kafka.topic). The --source-limit flag can cap the amount of data read per extraction.
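As an illustration, a properties file for a Kafka source might look like the following sketch. The topic, broker, and key/partition fields are placeholders, and older releases spell the property prefix hoodie.deltastreamer.* instead of hoodie.streamer.*:

```shell
# Write a source properties file to pass via --props (all values are placeholders).
cat > kafka-source.properties <<'EOF'
hoodie.streamer.source.kafka.topic=user_events
bootstrap.servers=kafka:9092
auto.offset.reset=earliest
hoodie.datasource.write.recordkey.field=event_id
hoodie.datasource.write.partitionpath.field=event_date
EOF
```

Combined with --source-limit, each ingestion round's read from this topic can then be capped.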

Transformer classes are supplied via --transformer-class. Multiple transformers are applied sequentially, allowing lightweight data transformations such as field addition, removal, or flattening before writing.
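As a sketch, two of the built-in transformers could be chained like this. The SQL statement and the added column are illustrative; the SQL-based transformer substitutes the <SRC> token with the incoming dataset:

```shell
# Flags appended to a HoodieStreamer spark-submit command (illustrative SQL).
# Transformers listed in --transformer-class are applied left to right.
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer,org.apache.hudi.utilities.transform.FlatteningTransformer \
  --hoodie-conf 'hoodie.streamer.transformer.sql=SELECT *, current_timestamp() AS ingested_at FROM <SRC>'
```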

Table service management can be run asynchronously. Options like --max-pending-compactions and --max-pending-clustering limit concurrent compaction or clustering jobs. Fair scheduling for these services is configured through Spark's spark.scheduler.allocation.file and spark.scheduler.mode=FAIR.
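A minimal allocation file could be sketched as follows; the pool names and weights are illustrative and should match the pools your deployment actually submits its write and table-service jobs to:

```shell
# Sketch of a Spark fair-scheduler allocation file; pool names and weights
# are illustrative placeholders.
cat > /tmp/fairscheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="ingestion">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
  </pool>
  <pool name="compaction">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
  </pool>
</allocations>
EOF
```

The file is then activated with --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/tmp/fairscheduler.xml on the spark-submit command line.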

Data catalog synchronization is enabled with --enable-sync and --sync-tool-classes, supporting tools such as:

org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
org.apache.hudi.gcp.bigquery.BigQuerySyncTool
org.apache.hudi.hive.HiveSyncTool
org.apache.hudi.sync.datahub.DataHubSyncTool

These sync tools update external metastore information after each write.
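For instance, Hive metastore sync might be enabled with flags along these lines; the metastore URI, database, and table names are placeholders:

```shell
# Flags appended to a HoodieStreamer spark-submit command; values are placeholders.
  --enable-sync \
  --sync-tool-classes org.apache.hudi.hive.HiveSyncTool \
  --hoodie-conf hoodie.datasource.hive_sync.metastore.uris=thrift://hive-metastore:9083 \
  --hoodie-conf hoodie.datasource.hive_sync.database=analytics \
  --hoodie-conf hoodie.datasource.hive_sync.table=user_events \
  --hoodie-conf hoodie.datasource.hive_sync.mode=hms
```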

Additional features include schema providers (e.g., SchemaRegistryProvider for Kafka), checkpointing options (--checkpoint, --initial-checkpoint-provider), graceful termination (--post-write-termination-strategy-class), and bootstrap initialization (--run-bootstrap).
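As one sketch, wiring a Confluent Schema Registry in as the schema provider could look like this; the registry URL and subject are placeholders, and older releases use the hoodie.deltastreamer.* property prefix:

```shell
# Flags appended to a HoodieStreamer spark-submit command; URL and subject
# are placeholders.
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --hoodie-conf hoodie.streamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/user_events-value/versions/latest
```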

The article concludes with a recap of the Hudi Streamer workflow and encourages readers to join the Apache Hudi community for further discussion.

Tags: Big Data, Streaming, Spark, Apache Hudi, Data ingestion
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
