
Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Data Thinking Notes

Background

Small files are a long‑standing pain point in data‑warehouse environments because they consume excessive storage space and degrade query performance. Effective governance of these files is essential for maintaining Hive’s efficiency and stability.

How Small Files Are Generated

Daily batch tasks and dynamic-partition inserts (via Spark 2 or MapReduce) produce large numbers of small files, which in turn cause a surge in Map tasks in downstream jobs.

The more Reduce tasks a job has, the more small files it produces, since each Reduce writes its own output file.

Source data may already contain many small files, e.g., from APIs or Kafka.

Real‑time data ingestion into Hive also creates many small files.

Impact of Small Files

From Hive’s perspective, each small file triggers a separate Map task, each launching a JVM, leading to massive resource waste and performance loss.

In HDFS, each file, directory, and block object occupies roughly 150 bytes of NameNode memory, so tens of millions of small files bloat the NameNode heap, slowing metadata operations and increasing read/write latency.

Storage consumption also suffers; in one example, merging reduced the figure from 280 KB to 249 KB.

Solutions

2.1 Use Spark 3 to Merge Small Files

Spark’s Adaptive Query Execution (AQE) can automatically merge small partitions. Spark 3.2+ introduces the Rebalance operation, which leverages AQE to balance partitions, merge overly small files, and split skewed partitions.
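A minimal sketch of combining AQE with the Spark 3.2+ REBALANCE hint (the table and column names here are placeholders, not from a real schema):

<code>-- Enable AQE so Spark can coalesce small shuffle partitions at runtime
set spark.sql.adaptive.enabled=true;
set spark.sql.adaptive.coalescePartitions.enabled=true;
-- Advisory target size per output partition (tune toward your HDFS block size)
set spark.sql.adaptive.advisoryPartitionSizeInBytes=128m;

-- REBALANCE asks AQE to even out partitions before the write,
-- merging tiny ones and splitting skewed ones (Spark 3.2+)
insert overwrite table db.target_table partition(ds)
select /*+ REBALANCE(ds) */ id, event_date, ds
from db.source_table
where ds='${lst1date}';</code>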

2.2 Reduce the Number of Reduce Tasks

<code>set mapred.reduce.tasks=100;  -- fix the Reduce count (e.g., roughly one Reduce per ten Mappers)</code>
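Rather than hard-coding the count, Hive can also derive it from input volume; a sketch with commonly used settings (the values are illustrative assumptions, tune them to your cluster):

<code>set hive.exec.reducers.bytes.per.reducer=256000000;  -- aim for ~256 MB of input per reducer
set hive.exec.reducers.max=200;                      -- cap the total reducer count</code>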

2.3 Distribute By Rand()

Appending distribute by rand() to a query forces a shuffle that spreads rows evenly across reducers, so each output file ends up with a similar size.

<code>insert overwrite table xxx.xxx partition(ds)
select t0.id, t0.xxx_date, t0.xxx_type, t0.ds
from xxx.xxx t0
where t0.ds='${lst1date}'
and xxx=xxx
distribute by rand()</code>
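One caveat: distribute by rand() produces as many files per partition as there are reducers. To pin the file count explicitly, a common variant is to bucket rows into a fixed number of random groups (the factor 10 below is an illustrative choice, not from the original article):

<code>-- Write roughly 10 files per partition by hashing rows into 10 random buckets
insert overwrite table xxx.xxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.xxx
where ds='${lst1date}'
distribute by cast(rand() * 10 as int)</code>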

2.4 Add a Post‑Ingestion Cleanup Task

Run a cleanup job after data transfer to merge small files before downstream consumption.
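For ORC-backed tables, one merge primitive such a cleanup job can use is Hive's CONCATENATE, which stitches small files together in place without a full rewrite (a sketch; the table and partition names are placeholders, and this works only for ORC/RCFile storage):

<code>-- Merge small ORC files within a single partition in place
alter table xxx.ods_table partition (ds='${lst1date}') concatenate;</code>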

2.5 Daily Scheduled Merge for Real‑Time Data

For real‑time tasks that write to Hive, schedule a daily Spark 3 job to consolidate the previous day’s small files.

<code>set hive.exec.dynamic.partition.mode=nonstrict;  -- allow fully dynamic partition inserts
set spark.sql.hive.convertInsertingPartitionedTable=false;  -- keep the write Hive-compatible so Impala can read it
set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;  -- repartition before write to merge small files
insert overwrite table xxx.ods_kafka_xxxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.ods_kafka_xxxx
where ds='${lst1date}';</code>

2.6 Hive Parameters for Merging

<code>set hive.merge.mapfiles=true;      -- merge map‑only task output
set hive.merge.mapredfiles=true;   -- merge map‑reduce task output
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- combine before map phase</code>

2.7 Spark 2 Setting

<code>set spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=2048M;  -- advisory size for final-stage output partitions</code>

Existing Small File Handling

3.1 Dynamic Partition Refresh with Spark 3

Example code to rewrite partitions dynamically:

<code>set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.hive.convertInsertingPartitionedTable=false;
set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;
insert overwrite table xxx.xxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.xxx
where ds<='2023-04-20' and ds>='2022-04-20';</code>

3.2 Rebuild Table

If the table is unpartitioned, consider dropping and recreating it, then load data with Spark 3.
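A minimal sketch of the rebuild, assuming a create-table-as-select into a fresh table followed by a rename swap (table names are placeholders; the REBALANCE hint requires Spark 3.2+):

<code>-- Rewrite the unpartitioned table with Spark 3, letting AQE coalesce output files
create table xxx.target_new as
select /*+ REBALANCE */ * from xxx.target;

-- Swap the rebuilt table into place, keeping the old copy as a fallback
alter table xxx.target rename to xxx.target_bak;
alter table xxx.target_new rename to xxx.target;</code>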

Problem Points Encountered

When using Spark 3 with dynamic partitions, inserts into a fixed-date partition merged down to a single file, while dynamic-partition inserts still left many small files. The cause was the missing setting

spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true

Real‑time ingestion still generates many small files; a daily Spark 3 job can first re‑process historical data and then merge the previous day’s (t‑1) files on a schedule.

When Spark 3 writes to a Hive table that Impala also reads, the parameter

set spark.sql.hive.convertInsertingPartitionedTable=false

must be added; otherwise the written data may not be visible to Impala.

Tool‑Based Small File Governance

NetEase DataFlow EasyData provides a “Small File Governance” service that automatically generates Spark 3 scheduled tasks to merge files daily, validates results, and rolls back on failure, ensuring data quality.

Identify tables with high small‑file counts via trends and storage metrics.

Configure automatic daily scans and merges (excluding tables with very large partition counts).

Monitor task execution similar to regular offline job operations.

Effectiveness

Across X tables, the total small‑file count dropped from 1,217,927 to 680,133, achieving a 44.1% reduction.

Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.