Handling Small Files in Hive: Configuration, Compression, and File Format Optimization
The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.
1. Small File Problems in Hive
Hive stores table data in HDFS, which handles large files efficiently. Aggregated tables, however, often produce many small files. Each file's metadata occupies about 150 bytes of NameNode memory, so the file count drives NameNode memory usage, and MapReduce jobs slow down because each small file typically gets its own map task.
2. Causes of Hive Small Files
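Increasing the number of reducers to speed up queries also creates more output files, because each reducer writes its own file. By default Hive derives the reducer count from the input size divided by hive.exec.reducers.bytes.per.reduce (1 GB by default in older releases, 256 MB since Hive 0.14), so a lower threshold or a manually raised reducer count means more, smaller files.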
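As a rough illustration of that trade-off, the settings below show the two usual ways the reducer count, and with it the output file count, gets decided. The values are arbitrary examples, not recommendations.

```sql
-- Let Hive derive the reducer count: roughly input bytes / bytes.per.reduce.
-- Lowering this value yields more reducers and therefore more output files.
set hive.exec.reducers.bytes.per.reduce=256000000;
-- Upper bound on the derived reducer count.
set hive.exec.reducers.max=100;
-- Alternatively, force a fixed reducer count for the session.
set mapred.reduce.tasks=20;
```

With mapred.reduce.tasks=20, a query with a reduce phase can write up to 20 files into each output directory, however small the result is.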
3. Solving Small Files
Two approaches are recommended:
Input file merging before the map phase.
Output file merging after the reduce phase.
4. Configuring Map Input Merge
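-- Maximum input size per map task; determines the number of splits after merging
set mapred.max.split.size=256000000;
-- Minimum split size on a single node; determines whether leftover files spread across data nodes are merged
set mapred.min.split.size.per.node=100000000;
-- Minimum split size within a single rack (switch); determines whether leftover files spread across racks are merged
set mapred.min.split.size.per.rack=100000000;
-- Merge small files before the map phase runs
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
5. Configuring Hive Result Merge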
-- Merge small files at the end of a map-only job
set hive.merge.mapfiles=true;
-- Merge small files at the end of a map-reduce job
set hive.merge.mapredfiles=true;
-- Target size of each merged file, in bytes
set hive.merge.size.per.task=256000000;
-- Launch a separate merge job when the average output file size is below this value, in bytes
set hive.merge.smallfiles.avgsize=16000000;
When merging, Hive launches an extra map-only job; the number of mappers is roughly the total output size divided by hive.merge.size.per.task. The merge runs only if the matching flag is enabled (hive.merge.mapfiles for map-only jobs, hive.merge.mapredfiles for map-reduce jobs) and the average output file size is below hive.merge.smallfiles.avgsize.
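The sketch below puts these settings in front of an aggregation query to show where the merge step would apply. The table and column names (dw.page_views_raw, dw.page_views_daily, page_id) are hypothetical and only serve the illustration.

```sql
-- Hypothetical source and target tables; adjust names to your schema.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=16000000;

-- The group by adds a reduce phase, so hive.merge.mapredfiles governs the merge.
insert overwrite table dw.page_views_daily
select page_id, count(*) as views
from dw.page_views_raw
group by page_id;
```

As a worked example under these settings: if the query wrote 200 files totalling about 1 GB, the average file size would be about 5 MB, which is below the 16 MB threshold, so the merge job would run with roughly 1,000,000,000 / 256,000,000 ≈ 4 mappers and leave about 4 files.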
6. Compression Handling
When results are stored compressed, merging inputs before the map phase places no restriction on the output storage format, but output merging only works when the table is stored as SequenceFile. Examples for different file types and codecs are provided:
TextFile (no compression, Deflate, Gzip, Bzip2, Lzo, Lz4, Snappy)
SequenceFile (Deflate, Gzip)
RCFile (Gzip)
ORCFile (ZLIB, Snappy)
Parquet (Snappy)
Avro (Snappy)
Each example shows table creation, setting Hive and MapReduce compression properties, and inserting data.
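The individual examples are not reproduced here. A minimal sketch of one combination, SequenceFile with Gzip, might look like the following; the table and column names (logs_seq_gzip, logs_text, id, msg) are assumptions made for illustration.

```sql
-- Hypothetical target table stored as SequenceFile.
create table logs_seq_gzip (id string, msg string)
stored as sequencefile;

-- Compress the final query output.
set hive.exec.compress.output=true;
set mapred.output.compress=true;
-- BLOCK compresses groups of records together rather than one record at a time.
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- logs_text is a hypothetical existing source table.
insert overwrite table logs_seq_gzip
select id, msg from logs_text;
```

The mapred.* names match the style used elsewhere in this article; newer Hadoop releases also accept the mapreduce.output.fileoutputformat.compress* equivalents.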
7. File Format Overview
Hive supports TextFile, SequenceFile, RCFile, ORCFile, Parquet, and Avro. Compression algorithms include Deflate, Gzip, Bzip2, Lzo, Lz4, Snappy, and Zlib. ORC and Parquet provide columnar storage with better compression and query performance than RCFile.
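For the columnar formats, compression is usually declared on the table itself rather than through session settings. A hedged sketch for ORC with Snappy follows, again with hypothetical table and column names:

```sql
-- Hypothetical ORC table; orc.compress accepts NONE, ZLIB, or SNAPPY.
create table logs_orc_snappy (id string, msg string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

-- logs_text is a hypothetical existing source table.
insert overwrite table logs_orc_snappy
select id, msg from logs_text;
```

If the table property is omitted, ORC falls back to ZLIB, which generally compresses tighter than Snappy at a higher CPU cost.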
8. Partition Optimization
Too many partitions increase NameNode load and create small files. Recommendations:
For small or static tables, avoid partitions and use data merging.
For large tables, use only day‑level partitions; hour‑level only for extremely large raw data; avoid month partitions.
For snapshot tables, keep only daily partitions.
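The day-level recommendation could be expressed as in the sketch below. The table names, columns, and the dt partition key are assumptions for illustration, not taken from the article.

```sql
-- Hypothetical large fact table partitioned by day only.
create table dw.orders_daily (order_id string, amount double)
partitioned by (dt string)
stored as orc;

-- Allow the partition value to come from the query result.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Load a single day; dw.orders_raw is a hypothetical source table with a dt column.
insert overwrite table dw.orders_daily partition (dt)
select order_id, amount, dt
from dw.orders_raw
where dt = '2024-01-01';
```

Restricting the load to one day keeps each run to a single partition directory, which limits how many new files a run can create.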
9. Resource Consumption Summary
Data warehouses consume CPU, memory, and disk. Computation mainly stresses CPU and memory, while storage consumes memory (NameNode metadata) and disk. Optimizing model design and careful table definition reduce overall resource usage.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
