
Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Data Thinking Notes

Background

Small files are a long‑standing pain point in data‑warehouse environments because they consume excessive storage space and degrade query performance. Effective governance of these files is essential for maintaining Hive’s efficiency and stability.

How Small Files Are Generated

Daily batch tasks and dynamic-partition inserts (via Spark 2 or MapReduce) produce large numbers of small files, which in turn cause a surge in Map tasks in downstream jobs.

The more Reduce tasks a job has, the more small files it produces, since each Reduce writes its own output file.

Source data may already contain many small files, e.g., from APIs or Kafka.

Real‑time data ingestion into Hive also creates many small files.

Impact of Small Files

From Hive’s perspective, each small file triggers a separate Map task, each launching a JVM, leading to massive resource waste and performance loss.

In HDFS, each file, directory, and block object occupies roughly 150 bytes of NameNode memory, so tens of millions of small files bloat the NameNode heap, slowing metadata operations and increasing read/write latency.

Storage consumption also suffers; in one example, merging reduced the figure from 280 KB to 249 KB.

Solutions

2.1 Use Spark 3 to Merge Small Files

Spark’s Adaptive Query Execution (AQE) can automatically merge small partitions. Spark 3.2+ introduces the Rebalance operation, which leverages AQE to balance partitions, merge overly small files, and split skewed partitions.
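A minimal sketch of combining AQE with the Spark 3.2+ REBALANCE hint (the table and column names here are placeholders, not from a real schema):

<code>-- Enable AQE so Spark can coalesce small shuffle partitions at runtime
set spark.sql.adaptive.enabled=true;
set spark.sql.adaptive.coalescePartitions.enabled=true;
-- Advisory target size per output partition (tune toward your HDFS block size)
set spark.sql.adaptive.advisoryPartitionSizeInBytes=128m;

-- REBALANCE asks AQE to even out partitions before the write,
-- merging tiny ones and splitting skewed ones (Spark 3.2+)
insert overwrite table db.target_table partition(ds)
select /*+ REBALANCE(ds) */ id, event_date, ds
from db.source_table
where ds='${lst1date}';</code>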

2.2 Reduce the Number of Reduce Tasks

<code>set mapred.reduce.tasks=100;  -- fix the Reduce count (e.g., roughly one Reduce per ten Mappers)</code>
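Rather than hard-coding the count, Hive can also derive it from input volume; a sketch with commonly used settings (the values are illustrative assumptions, tune them to your cluster):

<code>set hive.exec.reducers.bytes.per.reducer=256000000;  -- aim for ~256 MB of input per reducer
set hive.exec.reducers.max=200;                      -- cap the total reducer count</code>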

2.3 Distribute By Rand()

Appending distribute by rand() to a query forces a shuffle that spreads rows evenly across reducers, so each output file ends up with a similar size.

<code>insert overwrite table xxx.xxx partition(ds)
select t0.id, t0.xxx_date, t0.xxx_type, t0.ds
from xxx.xxx t0
where t0.ds='${lst1date}'
and xxx=xxx
distribute by rand()</code>
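One caveat: distribute by rand() produces as many files per partition as there are reducers. To pin the file count explicitly, a common variant is to bucket rows into a fixed number of random groups (the factor 10 below is an illustrative choice, not from the original article):

<code>-- Write roughly 10 files per partition by hashing rows into 10 random buckets
insert overwrite table xxx.xxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.xxx
where ds='${lst1date}'
distribute by cast(rand() * 10 as int)</code>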

2.4 Add a Post‑Ingestion Cleanup Task

Run a cleanup job after data transfer to merge small files before downstream consumption.
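For ORC-backed tables, one merge primitive such a cleanup job can use is Hive's CONCATENATE, which stitches small files together in place without a full rewrite (a sketch; the table and partition names are placeholders, and this works only for ORC/RCFile storage):

<code>-- Merge small ORC files within a single partition in place
alter table xxx.ods_table partition (ds='${lst1date}') concatenate;</code>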

2.5 Daily Scheduled Merge for Real‑Time Data

For real‑time tasks that write to Hive, schedule a daily Spark 3 job to consolidate the previous day’s small files.

<code>set hive.exec.dynamic.partition.mode=nonstrict;  -- allow fully dynamic partition inserts
set spark.sql.hive.convertInsertingPartitionedTable=false;  -- keep the write Hive-compatible so Impala can read it
set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;  -- repartition before write to merge small files
insert overwrite table xxx.ods_kafka_xxxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.ods_kafka_xxxx
where ds='${lst1date}';</code>

2.6 Hive Parameters for Merging

<code>set hive.merge.mapfiles=true;      -- merge map‑only task output
set hive.merge.mapredfiles=true;   -- merge map‑reduce task output
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- combine before map phase</code>

2.7 Spark 2 Setting

<code>set spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=2048M;  -- advisory size for final-stage output partitions</code>

Existing Small File Handling

3.1 Dynamic Partition Refresh with Spark 3

Example code to rewrite partitions dynamically:

<code>set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.hive.convertInsertingPartitionedTable=false;
set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;
insert overwrite table xxx.xxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.xxx
where ds<='2023-04-20' and ds>='2022-04-20';</code>

3.2 Rebuild Table

If the table is unpartitioned, consider dropping and recreating it, then load data with Spark 3.
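A minimal sketch of the rebuild, assuming a create-table-as-select into a fresh table followed by a rename swap (table names are placeholders; the REBALANCE hint requires Spark 3.2+):

<code>-- Rewrite the unpartitioned table with Spark 3, letting AQE coalesce output files
create table xxx.target_new as
select /*+ REBALANCE */ * from xxx.target;

-- Swap the rebuilt table into place, keeping the old copy as a fallback
alter table xxx.target rename to xxx.target_bak;
alter table xxx.target_new rename to xxx.target;</code>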

Problem Points Encountered

When using Spark 3 with dynamic partitions, inserts into a fixed-date partition merged down to a single file, while dynamic-partition inserts still left many small files. The cause was the missing setting

spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true

Real‑time ingestion still generates many small files; a daily Spark 3 job can first re‑process historical data and then merge the previous day’s (t‑1) files on a schedule.

When Spark 3 writes to a Hive table that Impala also reads, the parameter

set spark.sql.hive.convertInsertingPartitionedTable=false

must be added; otherwise the written data may not be visible to Impala.

Tool‑Based Small File Governance

NetEase DataFlow EasyData provides a “Small File Governance” service that automatically generates Spark 3 scheduled tasks to merge files daily, validates results, and rolls back on failure, ensuring data quality.

Identify tables with high small‑file counts via trends and storage metrics.

Configure automatic daily scans and merges (excluding tables with very large partition counts).

Monitor task execution similar to regular offline job operations.

Effectiveness

Across X tables, the total small‑file count dropped from 1,217,927 to 680,133, achieving a 44.1% reduction.

Written by Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.