Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on the NameNode and on MapReduce jobs, and walks through configuration and compression techniques (input and output file merging, Hive file formats, and partition optimization) for managing storage and resource consumption in big-data environments.

Big Data Technology & Architecture

Dec 9, 2020

1. Small File Problems in Hive

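Hive stores table data in HDFS, which is designed for large files. Aggregated tables, however, often produce many small files, and these cause two problems. First, the NameNode keeps the metadata of every file, directory, and block in memory at roughly 150 bytes per object, so its memory usage grows with the number of files rather than with the data volume. Second, each small file typically becomes its own input split, so MapReduce jobs launch many short map tasks whose startup overhead outweighs the work they do.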

2. Causes of Hive Small Files

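Increasing the number of reducers to speed up a query also increases the number of output files, because each reducer writes its own file. By default Hive estimates the reducer count by dividing the input size by hive.exec.reducers.bytes.per.reduce (1 GB before Hive 0.14, 256 MB from 0.14 onward), so lowering that value or forcing a higher reducer count yields more, smaller files.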
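The reducer settings involved can be sketched as follows. The values are illustrative assumptions rather than tuned recommendations, and the right numbers depend on the cluster and the data volume.

```sql
-- Sketch only: values below are assumptions for illustration.
-- Target amount of input data per reducer, in bytes.
-- A larger value means fewer reducers and therefore fewer output files.
set hive.exec.reducers.bytes.per.reduce=256000000;
-- Upper bound on the number of reducers Hive may choose on its own.
set hive.exec.reducers.max=200;
-- -1 lets Hive estimate the reducer count; a positive number pins it.
set mapred.reduce.tasks=-1;
```

Pinning mapred.reduce.tasks to a large number is the usual way a query ends up writing many small files, so leaving it at -1 and adjusting the bytes-per-reducer target is often the gentler option.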

3. Solving Small Files

Two approaches are recommended:

Input file merging before the map phase.

Output file merging after the reduce phase.

4. Configuring Map Input Merge

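-- Maximum input size per map task (bytes); determines the number of files after merging
set mapred.max.split.size=256000000;
-- Minimum split size on a single node (bytes); determines whether files spread across data nodes need to be merged
set mapred.min.split.size.per.node=100000000;
-- Minimum split size under a single switch, i.e. rack (bytes); determines whether files spread across racks need to be merged
set mapred.min.split.size.per.rack=100000000;
-- Merge small files before the map phase runs
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;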

5. Configuring Hive Result Merge

set hive.merge.mapfiles=true;  -- Merge output files at the end of a map-only job
set hive.merge.mapredfiles=true; -- Merge output files at the end of a map-reduce job
set hive.merge.size.per.task=256000000; -- Target size of each merged file (bytes); must be a literal number, not an expression
set hive.merge.smallfiles.avgsize=16000000; -- Launch a separate merge job when the average output file size is below this (bytes)

When merging is triggered, Hive launches an extra map-only job whose number of mappers is roughly the total output size divided by hive.merge.size.per.task. The merge runs only if the matching flag is enabled (hive.merge.mapfiles for map-only jobs, hive.merge.mapredfiles for map-reduce jobs) and the average output file size is smaller than hive.merge.smallfiles.avgsize.
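One common way to apply these settings to a partition that already contains small files is to rewrite the partition onto itself. The following is a minimal sketch, not the article's own example: the table sales_agg, its columns shop_id and amount, and the partition value are hypothetical.

```sql
-- Enable output merging for both map-only and map-reduce jobs.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=16000000;

-- Hypothetical table and columns: rewrite one partition in place.
-- If the files written by this query average less than 16 MB,
-- Hive appends a merge job that targets files of about 256 MB.
INSERT OVERWRITE TABLE sales_agg PARTITION (dt='2020-12-09')
SELECT shop_id, amount
FROM sales_agg
WHERE dt='2020-12-09';
```

With these values, a partition holding about 2.5 GB of output would be merged by roughly ten mappers, in line with the division described above.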

6. Compression Handling

When compression is enabled, input merging works regardless of file format, but output merging only works if the compressed results are stored as SequenceFile. The article covers the following file formats and codecs:

TextFile (no compression, Deflate, Gzip, Bzip2, Lzo, Lz4, Snappy)

SequenceFile (Deflate, Gzip)

RCFile (Gzip)

ORCFile (ZLIB, Snappy)

Parquet (Snappy)

Avro (Snappy)

For each combination the workflow is the same: create the table with the target format, set the Hive and MapReduce compression properties, and insert the data.
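As an illustration of that workflow, here is a hedged sketch for the SequenceFile with Gzip combination. The table names logs_seq_gzip and logs_raw and their columns are hypothetical; the property names are the current MapReduce ones, and older clusters may use the deprecated mapred.output.* equivalents instead.

```sql
-- Hypothetical target table stored as SequenceFile.
CREATE TABLE logs_seq_gzip (
  id  BIGINT,
  msg STRING
)
STORED AS SEQUENCEFILE;

-- Compress the final query output.
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
-- BLOCK compresses groups of records together, which usually compresses
-- better than RECORD and keeps the SequenceFile splittable.
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- Hypothetical source table.
INSERT OVERWRITE TABLE logs_seq_gzip
SELECT id, msg
FROM logs_raw;
```

Swapping the codec class (for example to org.apache.hadoop.io.compress.DefaultCodec for Deflate) gives the other SequenceFile variant in the list.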

7. File Format Overview

Hive supports TextFile, SequenceFile, RCFile, ORCFile, Parquet, and Avro. Compression algorithms include Deflate, Gzip, Bzip2, Lzo, Lz4, Snappy, and Zlib. ORC and Parquet provide columnar storage with better compression and query performance than RCFile.
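For the columnar formats, the codec is typically declared on the table itself rather than through session settings. The sketch below assumes hypothetical tables logs_orc_snappy, logs_parquet_snappy, and logs_raw; support for the parquet.compression table property depends on the Hive version in use.

```sql
-- ORC table compressed with Snappy via a table property.
CREATE TABLE logs_orc_snappy (
  id  BIGINT,
  msg STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Parquet table compressed with Snappy via a table property.
CREATE TABLE logs_parquet_snappy (
  id  BIGINT,
  msg STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

-- Hypothetical source table; the same SELECT can feed either target.
INSERT OVERWRITE TABLE logs_orc_snappy
SELECT id, msg
FROM logs_raw;
```

Declaring the codec in TBLPROPERTIES keeps it attached to the table, so later inserts do not depend on each session remembering to set it.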

8. Partition Optimization

Too many partitions increase NameNode load and create small files. Recommendations:

For small or static tables, avoid partitions and use data merging.

For large tables, use only day‑level partitions; hour‑level only for extremely large raw data; avoid month partitions.

For snapshot tables, keep only daily partitions.
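The day-level recommendation can be sketched as follows. The tables events_daily and events_raw and their columns are hypothetical, and the DISTRIBUTE BY clause is one possible way to limit files per partition, not something the article prescribes.

```sql
-- Hypothetical table partitioned by day only.
CREATE TABLE events_daily (
  user_id BIGINT,
  event   STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Allow partitions to be created from the data itself.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- DISTRIBUTE BY dt sends all rows of one day to the same reducer,
-- so each partition is written by a single reducer instead of
-- every reducer writing a small file into every partition.
INSERT OVERWRITE TABLE events_daily PARTITION (dt)
SELECT user_id, event, dt
FROM events_raw
DISTRIBUTE BY dt;
```

The trade-off is skew: a single very large day is handled by one reducer, so this pattern suits loads where the per-day volume is moderate.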

9. Resource Consumption Summary

Data warehouses consume CPU, memory, and disk. Computation mainly stresses CPU and memory, while storage consumes memory (NameNode metadata) and disk. Optimizing model design and careful table definition reduce overall resource usage.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
