Managing Small Files in Apache Hudi and Spark Optimization Guide

The article explains how Apache Hudi automatically manages file sizes to avoid small‑file issues, details key configuration parameters, provides a step‑by‑step example of file merging, and offers practical Spark tuning recommendations for optimal performance in data‑lake workloads.

Big Data Technology & Architecture

Mar 4, 2022

This article records the author's exploration of Apache Hudi's small‑file handling features and related Spark tuning techniques, providing practical guidance for production data‑lake environments.

Small File Handling

Apache Hudi offers a self‑managing file size feature so users do not need to manually maintain tables. A large number of small files degrades query performance because the query engine must repeatedly open, read, and close files. In streaming data‑lake use cases, each write may produce only a tiny amount of data, potentially leading to many small files.

Write‑time vs. Post‑write Small File Optimization

A common approach is to let writes produce many small files and merge them afterwards. Until that merge runs, queries still have to read the small files, which can hurt query SLAs. Hudi's clustering operation can merge small files efficiently; a dedicated article will cover clustering in detail.

Alternatively, Hudi can size files at write time: during insert/upsert operations it works toward a configured target file size.

Core Configuration

This article focuses on automatic small-file merging in COPY_ON_WRITE tables. The key parameters are:

hoodie.parquet.max.file.size: Maximum size of a data file; Hudi tries to keep files at or below this size.

hoodie.parquet.small.file.limit: Files smaller than this value are considered small files.

hoodie.copyonwrite.insert.split.size: Number of inserted records grouped together within a partition. It should match the number of records that fit into a single file, which can be derived from the maximum file size and the average record size.

For example, if hoodie.parquet.max.file.size is 120 MB and hoodie.parquet.small.file.limit is 100 MB, any file under 100 MB is treated as a small file. Setting the limit to 0 disables automatic small‑file handling.
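As a rough sketch, the three settings can be collected into a plain options dictionary. The byte values mirror the 120 MB / 100 MB example above; the 1 KB average record size and the helper names `derive_insert_split_size` and `is_small_file` are assumptions made for illustration, not part of Hudi. In a PySpark job, a dictionary like this would typically be passed to the Hudi writer through `.options(**hudi_options)`.

```python
MB = 1024 * 1024


def derive_insert_split_size(max_file_size_bytes, avg_record_size_bytes):
    """Approximate how many records fit into one max-sized file."""
    return max_file_size_bytes // avg_record_size_bytes


def is_small_file(file_size_bytes, small_file_limit_bytes):
    """A file is small when it is below the limit; a limit of 0 disables the check."""
    return small_file_limit_bytes > 0 and file_size_bytes < small_file_limit_bytes


max_file_size = 120 * MB
small_file_limit = 100 * MB
avg_record_size = 1024  # assumption: roughly 1 KB per record

hudi_options = {
    "hoodie.parquet.max.file.size": str(max_file_size),
    "hoodie.parquet.small.file.limit": str(small_file_limit),
    "hoodie.copyonwrite.insert.split.size": str(
        derive_insert_split_size(max_file_size, avg_record_size)
    ),
}

for key, value in hudi_options.items():
    print(f"{key}={value}")
```

The real average record size depends on the schema and on Parquet compression, so the derived split size is only a starting point.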

Illustrative Example

Assume a partition contains files File_1 (40 MB), File_2 (80 MB), File_3 (90 MB), File_4 (130 MB), and File_5 (105 MB). With the above configuration, the processing steps are:

Step 1: Route updates to the files that already contain the affected records, which may increase the size of those files.

Step 2: Identify small files (File_1, File_2, File_3) based on the 100 MB limit.

Step 3: Distribute new records to small files until each reaches the 120 MB max size (e.g., add 80 MB to File_1, 40 MB to File_2, 30 MB to File_3).

Step 4: If records remain after the existing small files are filled, create new file groups according to hoodie.copyonwrite.insert.split.size (e.g., 120k records per new file), resulting in new files File_6, File_7, and File_8.

After this ingestion round, every file except File_8 is adequately sized. File_8 holds only the leftover records; if it ends up below the small-file limit, it becomes a candidate for filling in the next round. Subsequent ingestions repeat the process to keep the table free of small files.
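The assignment logic in Steps 2 to 4 can be approximated with a small simulation. This is a simplified sketch, not Hudi's implementation: it works in megabytes rather than record counts, skips Step 1 (updates), and assumes an incoming batch of 450 MB, a figure chosen only so that three new files appear. The function name `plan_inserts` is hypothetical.

```python
def plan_inserts(file_sizes_mb, incoming_mb, small_limit_mb=100, max_size_mb=120):
    """Fill small files up to the max size, then spill the rest into new files."""
    sizes = dict(file_sizes_mb)
    remaining = incoming_mb

    # Steps 2 and 3: top up every file below the small-file limit.
    for name, size in file_sizes_mb.items():
        if remaining <= 0:
            break
        if size < small_limit_mb:
            added = min(max_size_mb - size, remaining)
            sizes[name] = size + added
            remaining -= added

    # Step 4: whatever is left goes into new files of at most max_size_mb.
    next_id = len(sizes) + 1
    while remaining > 0:
        chunk = min(max_size_mb, remaining)
        sizes[f"File_{next_id}"] = chunk
        remaining -= chunk
        next_id += 1
    return sizes


files = {"File_1": 40, "File_2": 80, "File_3": 90, "File_4": 130, "File_5": 105}
result = plan_inserts(files, incoming_mb=450)
for name, size in result.items():
    print(name, size)
```

With these inputs, File_1 to File_3 are filled to 120 MB, File_6 and File_7 are created at 120 MB each, and File_8 receives the 60 MB remainder, which a later round would treat as a small file.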

Spark + Hudi Optimization

When writing data to Hudi via Spark, consider the following tuning aspects:

Input Parallelism: Set hoodie.[insert|upsert|bulkinsert].shuffle.parallelism to at least input_data_size/500MB to ensure adequate shuffle parallelism.
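The rule of thumb above is a ceiling division. A minimal sketch, assuming "500MB" means 500 × 1024 × 1024 bytes; the helper name `min_shuffle_parallelism` is made up for this example:

```python
def min_shuffle_parallelism(input_size_bytes, bytes_per_task=500 * 1024 * 1024):
    """Smallest parallelism that keeps each shuffle task at or under bytes_per_task."""
    # -(-a // b) is integer ceiling division; never go below one task.
    return max(1, -(-input_size_bytes // bytes_per_task))


input_size = 100 * 1024 ** 3  # assumption: a 100 GiB input batch
parallelism = min_shuffle_parallelism(input_size)
print(f"hoodie.upsert.shuffle.parallelism={parallelism}")
```

The result is a lower bound; a higher value may still be appropriate depending on the cluster.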

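Off-heap Memory: Parquet writes need a fair amount of off-heap memory. If executors or the driver fail for lack of it, raise spark.yarn.executor.memoryOverhead or spark.yarn.driver.memoryOverhead (named spark.executor.memoryOverhead and spark.driver.memoryOverhead since Spark 2.3).

Spark Memory: Hudi needs to read a whole file into memory when merging or compacting, so give executors enough memory for that and leave some storage memory available (e.g., via spark.memory.storageFraction).

File Size Limits: Tune limitFileSize (the writer setting that corresponds to hoodie.parquet.max.file.size) to balance write latency against the number of files.

Time-Series / Log Data: These workloads tend to have many more records per partition. Adjust the Bloom filter parameters (bloomFilterFPP(), bloomFilterNumEntries()) to reach the target index lookup time, and consider record keys prefixed with the event time so that range pruning can speed up lookups.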
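To illustrate the event-time prefix idea, here is a minimal sketch of a key builder. The key layout (a UTC timestamp, an underscore, then the original id) and the function name `time_prefixed_key` are assumptions for illustration; the article does not prescribe a format. The point is that such keys sort lexicographically in event-time order, so each file covers a narrow key range.

```python
from datetime import datetime, timezone


def time_prefixed_key(event_time, record_id):
    """Build a record key whose string order follows the event time (UTC)."""
    return f"{event_time.astimezone(timezone.utc):%Y%m%d%H%M%S}_{record_id}"


# Hypothetical events, deliberately out of order; datetimes must be timezone-aware.
events = [
    (datetime(2022, 3, 4, 12, 0, 0, tzinfo=timezone.utc), "b"),
    (datetime(2022, 3, 4, 11, 59, 59, tzinfo=timezone.utc), "z"),
    (datetime(2022, 3, 5, 0, 0, 0, tzinfo=timezone.utc), "a"),
]

keys = [time_prefixed_key(ts, rid) for ts, rid in events]
print(sorted(keys))
```

A fixed-width timestamp is what makes string ordering match time ordering; a variable-width prefix would not give the same guarantee.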

GC Tuning: Follow Spark GC best practices and prefer the G1 or CMS collector. The example options below use CMS, which was removed in JDK 14, so they apply only to older JVMs:

-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof

OutOfMemory Errors: Mitigate OOM by lowering the memory fractions so that Spark spills to disk instead of failing, trading some speed for stability, e.g.:

spark.memory.fraction = 0.2
spark.memory.storageFraction = 0.2
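One way to apply these settings is to render them as `--conf` arguments for `spark-submit`. The sketch below uses the memory fractions from this section and a subset of the GC flags shown earlier. Passing the flags through `spark.executor.extraJavaOptions` is an assumption about how the job is launched, and `to_spark_submit_args` is a hypothetical helper.

```python
import shlex

spark_conf = {
    "spark.memory.fraction": "0.2",
    "spark.memory.storageFraction": "0.2",
    # Assumption: JVM flags are handed to executors via extraJavaOptions.
    "spark.executor.extraJavaOptions": (
        "-XX:+UseConcMarkSweepGC "
        "-XX:CMSInitiatingOccupancyFraction=70 "
        "-XX:+HeapDumpOnOutOfMemoryError"
    ),
}


def to_spark_submit_args(conf):
    """Turn a config mapping into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_spark_submit_args(spark_conf)
# shlex.join quotes values that contain spaces, such as the JVM flag string.
print("spark-submit " + shlex.join(submit_args))
```

Keeping the settings in one mapping makes it easier to review them together and to reuse the same values across jobs.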

A complete production‑grade Spark configuration example is provided in the original article.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
