Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impact of small-file proliferation in Hive and Spark environments and presents several mitigation techniques: Spark 3 adaptive query execution, lowering the number of reduce tasks, DISTRIBUTE BY RAND(), post-processing clean-up jobs, Hive and Spark configuration tweaks, and automated tooling. Together they improve performance and storage efficiency.

Big Data Technology & Architecture

May 5, 2023

Background: Small files come from several sources: daily batch tasks and dynamic-partition inserts running on Spark 2 or the MapReduce engine, jobs with a high number of reducers, source data that already consists of many small files (API extracts, Kafka), and real-time data landing in Hive.

Impact: In Hive, every small file becomes its own input split, so a query over many small files launches many Map tasks, each starting a JVM, which wastes resources and degrades performance. In HDFS, the NameNode keeps the metadata of every file in memory (roughly 150 bytes per small file), so a large number of small files inflates NameNode memory usage, slows the NameNode down, and increases read/write latency. Storage overhead is also significant.

Solution 2.1: Use Spark 3 adaptive query execution (AQE), which automatically coalesces small shuffle partitions. Spark 3.2+ adds the REBALANCE operation, which evens out partition sizes before the write and so merges small output files.
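A minimal sketch of how Solution 2.1 might look in a Spark 3 SQL session, assuming AQE is not already switched on by the cluster defaults. The 256m target size and the table and column names (demo_db.target_table, demo_db.source_table, id, event_type) are illustrative assumptions, not values from the original setup:

```sql
-- Enable adaptive query execution and coalescing of small shuffle partitions (Spark 3.x).
set spark.sql.adaptive.enabled=true;
set spark.sql.adaptive.coalescePartitions.enabled=true;
-- Size AQE aims for when it coalesces partitions; 256m is only an example value.
set spark.sql.adaptive.advisoryPartitionSizeInBytes=256m;

-- Spark 3.2+: the REBALANCE hint asks Spark to even out the query output before the write.
-- demo_db.target_table and demo_db.source_table are hypothetical names.
insert overwrite table demo_db.target_table partition (ds = '${lst1date}')
select /*+ REBALANCE */ id, event_type
from demo_db.source_table
where ds = '${lst1date}';
```

The REBALANCE hint only takes effect while AQE is enabled, so the two parts of the sketch belong together.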

Solution 2.2: Reduce the number of reducers, e.g.:

set mapred.reduce.tasks=100;  -- set reduce tasks, mapper:reduce = 10:1

insert overwrite table xxx.xxx partition(ds='${lst1date}')
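As a hedged variation on the setting above, the reducer count can either be fixed or derived by Hive from the input size. The sketch below shows both options; the numeric values are assumptions chosen for illustration and should be tuned per job:

```sql
-- Option A: fix the reducer count.
-- mapreduce.job.reduces is the current name of the deprecated mapred.reduce.tasks property.
set mapreduce.job.reduces=100;

-- Option B: let Hive estimate the reducer count from the input size instead.
-- -1 tells Hive to compute the number of reducers itself.
set mapreduce.job.reduces=-1;
-- Input bytes handled per reducer (256 MB here, an example value).
set hive.exec.reducers.bytes.per.reducer=268435456;
-- Upper bound on the number of reducers Hive may choose (example value).
set hive.exec.reducers.max=200;
```

Option B tends to adapt better when the daily data volume fluctuates, while Option A gives a predictable number of output files.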

Solution 2.3: Use DISTRIBUTE BY RAND() to spread rows evenly across shuffle partitions, at the cost of an extra shuffle stage. The clause goes at the end of the insert query:

where t0.ds='${lst1date}' and xxx=xxx distribute by rand()

Solution 2.4: Add a clean-up task after data transfer that re-partitions the data and merges small files, so that they do not propagate downstream.
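The clean-up task in Solution 2.4 could be sketched as a rewrite of one day's partition in place. This is an illustration under assumptions, not the original job: demo_db.ods_events and its columns are hypothetical, and the bucket count of 10 is an arbitrary example. Bounding the random key is a variation on plain distribute by rand() that caps the number of output files:

```sql
-- Hypothetical clean-up job: rewrite one partition of demo_db.ods_events in place.
-- cast(rand() * 10 as int) produces integers 0..9, so rows are shuffled on at most
-- 10 distinct keys and the partition ends up with at most about 10 files.
insert overwrite table demo_db.ods_events partition (ds = '${lst1date}')
select id, event_type
from demo_db.ods_events
where ds = '${lst1date}'
distribute by cast(rand() * 10 as int);
```

Because rand() is non-deterministic, retried tasks can route rows differently from the first attempt, so it is worth comparing row counts before and after the rewrite.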

Solution 2.5: For real-time tasks, schedule a daily merge job on the Hive table after the data lands. Example settings and statement:

set hive.exec.dynamic.partition.mode=nonstrict;

set spark.sql.hive.convertInsertingPartitionedTable=false;

set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;

insert overwrite table xxx.ods_kafka_xxxx partition(ds) select id, xxx_date, xxx_type, ds from xxx.ods_kafka_xxxx where ds='${lst1date}'

Solution 2.6: Hive parameters to merge small files:

set hive.merge.mapfiles=true;

set hive.merge.mapredfiles=true;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
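The merge switches above are usually paired with size thresholds. A sketch of those companion settings follows; the byte values are example assumptions, not figures from the original environment:

```sql
-- Target size of the files produced by the merge step (bytes).
set hive.merge.size.per.task=256000000;
-- A merge step is triggered when the average output file size falls below this (bytes).
set hive.merge.smallfiles.avgsize=128000000;
-- Upper bound on a combined input split when reading through CombineHiveInputFormat (bytes).
set mapreduce.input.fileinputformat.split.maxsize=256000000;
-- hive.merge.mapfiles / hive.merge.mapredfiles apply to the MapReduce engine;
-- on Tez the corresponding switch is hive.merge.tezfiles.
set hive.merge.tezfiles=true;
```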

Spark 3 parameter (advisory partition size for the final stage under adaptive execution):

set spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=2048M;

Handling existing small files (section 3.1): the affected partitions are rewritten with a Spark 3 dynamic-partition insert overwrite, using settings and statements similar to those above.

Problem points:

(1) If spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true is missing, merging is incomplete for inserts into static partitions.

(2) Real-time data keeps generating many small files, so both a historical back-fill and a daily merge job are required.

(3) When the table is also queried through Impala, set spark.sql.hive.convertInsertingPartitionedTable=false to keep the data consistent.

Tooling: NetEase EasyData data‑governance service provides a UI to monitor small‑file trends, automatically generate Spark 3 tasks for daily merging, validate results, and replace data in production tables.
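Without such a service, the effect of a merge job can be checked by hand. The query below is one possible sketch for Hive and is not how EasyData works internally; it reuses the article's placeholder table name and relies on Hive's INPUT__FILE__NAME virtual column:

```sql
-- Count the distinct data files behind one partition, before and after the merge job.
-- In Spark SQL the input_file_name() function plays the role of INPUT__FILE__NAME.
select ds, count(distinct INPUT__FILE__NAME) as file_count
from xxx.ods_kafka_xxxx
where ds = '${lst1date}'
group by ds;
```

Running it before and after the daily merge gives a per-partition file count that can be tracked over time.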

Effect: After optimizing x tables, the total number of small files dropped from 1,217,927 to 680,133, a reduction of about 44 %.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
