Hudi Clustering After Batch Processing: Merging Small Files Before Streaming

This guide details how to execute Apache Hudi file clustering after a batch job and before streaming, using Spark commands to merge numerous small HDFS files into larger ones, configure clustering and cleaning policies, and verify the results with HDFS counts.

Big Data Technology & Architecture

Oct 13, 2022

This article explains how to perform file clustering with Apache Hudi after a batch job and before streaming, merging many small files into larger ones to improve query performance and reduce I/O overhead.

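After a bulk_insert batch completes, the table's HDFS directory contains thousands of tiny files. They can be counted with hdfs dfs -count, whose columns are directory count, file count, content size in bytes, and path. In the listing below, partition 1 alone holds 1067 files totaling about 571 MB, roughly 535 KB per file: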

[hadoop@p0-tklfrna-tklrna-device02 hudi_clustering]$ hdfs dfs -count /flk_hudi/chdrpf_hudi_test03/*
           7            7        32637997 /flk_hudi/chdrpf_hudi_test03/.hoodie
           1          1067      571117942 /flk_hudi/chdrpf_hudi_test03/1
           ... (additional lines omitted for brevity) ...
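
To make the small-file problem concrete, the count output can be turned into an average file size per path. The following is a minimal Python sketch, not part of Hudi or Hadoop: the names CountRow, parse_count_output, and average_file_size are hypothetical helpers, and the sample text is just the two rows shown above.

```python
from typing import List, NamedTuple


class CountRow(NamedTuple):
    dirs: int
    files: int
    size_bytes: int
    path: str


def parse_count_output(text: str) -> List[CountRow]:
    """Parse `hdfs dfs -count` rows: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        # Skip blank lines and anything that is not a well-formed count row.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
            continue
        rows.append(CountRow(int(parts[0]), int(parts[1]), int(parts[2]), parts[3]))
    return rows


def average_file_size(row: CountRow) -> float:
    """Mean bytes per file, or 0.0 when the path holds no files."""
    return row.size_bytes / row.files if row.files else 0.0


# The two rows shown in the listing above (hypothetical variable name).
SAMPLE = """\
           7            7        32637997 /flk_hudi/chdrpf_hudi_test03/.hoodie
           1          1067      571117942 /flk_hudi/chdrpf_hudi_test03/1
"""

if __name__ == "__main__":
    for row in parse_count_output(SAMPLE):
        avg_kib = average_file_size(row) / 1024
        print(f"{row.path}: {row.files} files, avg {avg_kib:.0f} KiB")
```

An average far below the HDFS block size, as in partition 1 here, is the usual sign that a table would benefit from clustering.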

The clustering process is driven by a Spark job submitted via spark-submit. A minimal clustering configuration is shown first:

[hadoop@p0-tklfrna-tklrna-device02 hudi_clustering]$ cat /home/hadoop/hudi_clustering/clusteringjob.properties
hoodie.clustering.inline.max.commits=2
hoodie.clustering.plan.strategy.max.num.groups=40

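The advanced configuration adds parameters that control file sizes and grouping: a target output file size of 1 GiB (1073741824 bytes), at most 2 GiB (2147483648 bytes) per clustering group, and a small-file limit of 600 MiB (629145600 bytes), below which files become clustering candidates: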

[hadoop@p0-tklfrna-tklrna-device02 ~]$ cat /home/hadoop/hudi_clustering/clusteringjob.properties
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=2
hoodie.clustering.plan.strategy.max.num.groups=40
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
hoodie.clustering.plan.strategy.small.file.limit=629145600
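
The byte values in this file are easier to reason about in MiB and GiB. The Python sketch below parses the properties shown above and derives two things: the human-readable sizes, and an upper bound on the data one clustering plan can cover, assuming (as the Hudi configuration reference describes it) that this bound is max.bytes.per.group multiplied by max.num.groups. The helper names parse_properties and to_mib are hypothetical.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def to_mib(num_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


# Contents of clusteringjob.properties as shown above.
CLUSTERING_PROPS = """\
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=2
hoodie.clustering.plan.strategy.max.num.groups=40
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
hoodie.clustering.plan.strategy.small.file.limit=629145600
"""

props = parse_properties(CLUSTERING_PROPS)
prefix = "hoodie.clustering.plan.strategy."
target_bytes = int(props[prefix + "target.file.max.bytes"])
group_bytes = int(props[prefix + "max.bytes.per.group"])
small_limit = int(props[prefix + "small.file.limit"])
num_groups = int(props[prefix + "max.num.groups"])

# Assumed upper bound on data handled by a single clustering plan.
plan_cap_bytes = group_bytes * num_groups

if __name__ == "__main__":
    print(f"target file size : {to_mib(target_bytes):.0f} MiB")
    print(f"bytes per group  : {to_mib(group_bytes):.0f} MiB")
    print(f"small file limit : {to_mib(small_limit):.0f} MiB")
    print(f"plan upper bound : {to_mib(plan_cap_bytes) / 1024:.0f} GiB")
```

Under that assumption, a table larger than the bound would need more than one clustering round, or larger values for these two settings.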

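Clustering runs in two steps: the job is first scheduled, which records a clustering plan as an instant on the Hudi timeline, and the plan is then executed against that instant time. The spark-submit command below is the scheduling step, identified by the --schedule flag; the instant time itself does not appear in this listing: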

spark-submit \
  --master yarn \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hdfs://nameservice1/utility_jars/hudi-utilities-bundle_2.12-0.10.0.jar \
  --schedule \
  --base-path hdfs://nameservice1/flk_hudi/chdrpf_hudi_test03 \
  --table-name chdrpf_hudi_test03 \
  --props file:///home/hadoop/hudi_clustering/clusteringjob.properties \
  --spark-memory 16g \
  > /home/hadoop/hudi_clustering/clusteringjob.log 2>&1
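
Because scheduling and execution differ only in one or two arguments, the command line can be assembled programmatically. The Python sketch below builds the argument list for both steps from the values used above. It is a sketch under assumptions: the function clustering_command is hypothetical, and the execution step relies on the --instant-time option of HoodieClusteringJob, which does not appear in the listing above, so check it against the Hudi version in use.

```python
import shlex
from typing import List, Optional

JAR = "hdfs://nameservice1/utility_jars/hudi-utilities-bundle_2.12-0.10.0.jar"
BASE_PATH = "hdfs://nameservice1/flk_hudi/chdrpf_hudi_test03"
TABLE_NAME = "chdrpf_hudi_test03"
PROPS = "file:///home/hadoop/hudi_clustering/clusteringjob.properties"


def clustering_command(instant_time: Optional[str] = None) -> List[str]:
    """Build the spark-submit argv: schedule a plan, or execute a given instant."""
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--class", "org.apache.hudi.utilities.HoodieClusteringJob",
        JAR,
    ]
    if instant_time is None:
        # Scheduling step, as in the command shown above.
        cmd.append("--schedule")
    else:
        # Assumption: --instant-time selects the previously scheduled plan.
        cmd += ["--instant-time", instant_time]
    cmd += [
        "--base-path", BASE_PATH,
        "--table-name", TABLE_NAME,
        "--props", PROPS,
        "--spark-memory", "16g",
    ]
    return cmd


if __name__ == "__main__":
    # Print shell-quoted commands instead of running them.
    print(" ".join(shlex.quote(arg) for arg in clustering_command()))
    print(" ".join(shlex.quote(arg) for arg in clustering_command("20220826105913373")))
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without Spark; passing the list to subprocess.run would execute it.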

After the clustering job finishes, the Hudi timeline contains a replacecommit entry (e.g., 20220826105913373.replacecommit) marking the clustering instant. The file counts confirm the effect, with partition 1 dropping from 1067 files to 5:

[hadoop@p0-tklfrna-tklrna-device02 hudi_clustering]$ hdfs dfs -count /flk_hudi/chdrpf_hudi_test03/*
           7           10        39759457 /flk_hudi/chdrpf_hudi_test03/.hoodie
           1            5        295730057 /flk_hudi/chdrpf_hudi_test03/1
           ... (additional lines omitted for brevity) ...
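
The effect on partition 1 can be quantified from the two count listings. The Python sketch below is plain arithmetic on the figures reported in this article (1067 files and 571117942 bytes before, 5 files and 295730057 bytes after); the function name summarize is hypothetical.

```python
from typing import Dict, Tuple

MIB = 1024 * 1024

# (file count, total bytes) for partition 1, taken from the listings.
BEFORE: Tuple[int, int] = (1067, 571117942)
AFTER: Tuple[int, int] = (5, 295730057)


def summarize(before: Tuple[int, int], after: Tuple[int, int]) -> Dict[str, float]:
    """Compare file counts and average file sizes before and after clustering."""
    before_files, before_bytes = before
    after_files, after_bytes = after
    return {
        "file_reduction_factor": before_files / after_files,
        "avg_before_mib": before_bytes / before_files / MIB,
        "avg_after_mib": after_bytes / after_files / MIB,
        "size_ratio": after_bytes / before_bytes,
    }


if __name__ == "__main__":
    for key, value in summarize(BEFORE, AFTER).items():
        print(f"{key}: {value:.2f}")
```

The file count falls by a factor of about 213, and the average file grows from roughly half a MiB to roughly 56 MiB. The total size also shrinks, which the listings show but the article does not explain.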

Clustering writes new, larger files but leaves the replaced files on HDFS until a cleaner removes them, so a cleaning run is required. The cleaning configuration file (hudi_cleaning.properties) is:

# hudi_cleaning.properties
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.cleaner.delete.bootstrap.base.file=false
hoodie.commits.archival.batch=60
hoodie.archive.merge.small.file.limit.bytes=104857600
hoodie.compact.inline=false
hoodie.parquet.small.file.limit=124857600
hoodie.cleaner.parallelism=800
hoodie.cleaner.incremental.mode=true
hoodie.keep.max.commits=3
hoodie.keep.min.commits=2
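
The retention settings in this file depend on one another. The Python sketch below checks two ordering rules against the values above. It assumes, based on Hudi's configuration validation as commonly documented, that hoodie.keep.max.commits must exceed hoodie.keep.min.commits and that hoodie.keep.min.commits must exceed hoodie.cleaner.commits.retained. The function check_retention is a hypothetical helper, not a Hudi API.

```python
from typing import Dict, List

# Retention-related subset of hudi_cleaning.properties as shown above.
CLEANING_PROPS: Dict[str, str] = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "1",
    "hoodie.keep.max.commits": "3",
    "hoodie.keep.min.commits": "2",
}


def check_retention(props: Dict[str, str]) -> List[str]:
    """Return a list of violated ordering rules (empty when consistent)."""
    retained = int(props["hoodie.cleaner.commits.retained"])
    keep_min = int(props["hoodie.keep.min.commits"])
    keep_max = int(props["hoodie.keep.max.commits"])
    problems = []
    # Assumed rule: archival must keep more commits than the cleaner retains.
    if keep_min <= retained:
        problems.append("hoodie.keep.min.commits must be > hoodie.cleaner.commits.retained")
    # Assumed rule: the archival upper bound must exceed the lower bound.
    if keep_max <= keep_min:
        problems.append("hoodie.keep.max.commits must be > hoodie.keep.min.commits")
    return problems


if __name__ == "__main__":
    issues = check_retention(CLEANING_PROPS)
    print("consistent" if not issues else "\n".join(issues))
```

With commits.retained=1, keep.min.commits=2, and keep.max.commits=3, both rules hold with the smallest possible margins, which keeps very little history on the timeline.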

The cleaning job is executed with another spark-submit command:

spark-submit \
  --class org.apache.hudi.utilities.HoodieCleaner \
  hdfs://nameservice1/utility_jars/hudi-utilities-bundle_2.12-0.10.0.jar \
  --props file:///home/hadoop/hudi_clustering/hudi_cleaning.properties \
  --target-base-path hdfs://nameservice1/flk_hudi/chdrpf_hudi_test03 \
  > /home/hadoop/hudi_clustering/clusteringjob_cleaning.log 2>&1

After cleaning, the Hudi timeline shows a clean entry (e.g., 20220826114108591.clean) confirming that obsolete files have been removed. With the small files merged into larger ones, subsequent streaming writes start from a reasonable file count.

Key timestamps in the Hudi timeline:

20220826105913373.replacecommit: clustering completed
20220826114108591.clean: cleaning completed
20220826114317026.commit: new data write completed
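
The timeline entries above encode both a time and an action in the file name. The Python sketch below splits each name into those two parts and computes the gaps between the three events. It assumes the 17-digit instant format yyyyMMddHHmmssSSS seen in these names; Instant and parse_instant are hypothetical names, not Hudi classes.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Instant(NamedTuple):
    time: datetime
    action: str


def parse_instant(name: str) -> Instant:
    """Split '<17-digit timestamp>.<action>' into a datetime and the action."""
    ts, action = name.split(".", 1)
    if len(ts) != 17 or not ts.isdigit():
        raise ValueError(f"unexpected instant timestamp: {ts!r}")
    # First 14 digits are yyyyMMddHHmmss, the last 3 are milliseconds.
    base = datetime.strptime(ts[:14], "%Y%m%d%H%M%S")
    return Instant(base + timedelta(milliseconds=int(ts[14:])), action)


# The three timeline entries reported in this article.
TIMELINE = [
    "20220826105913373.replacecommit",
    "20220826114108591.clean",
    "20220826114317026.commit",
]

if __name__ == "__main__":
    instants = [parse_instant(name) for name in TIMELINE]
    for prev, cur in zip(instants, instants[1:]):
        print(f"{prev.action} -> {cur.action}: {cur.time - prev.time}")
```

For these entries, cleaning completed about 42 minutes after the clustering instant, and the next data write landed about 2 minutes after that.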

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
