Hive Performance Tuning: Understanding Map and Reduce Counts

This article explains how Hive determines the number of map and reduce tasks based on input file size and block configuration, discusses when to increase or decrease map counts, and provides practical commands for adjusting reducer settings to optimize large‑scale data processing.

Big Data Technology & Architecture

Oct 31, 2020

In Hive, the number of map tasks is determined mainly by the number of input files, their sizes, and the HDFS block size (128 MB by default; run set dfs.block.size; to view the current value). A file larger than one block is split into multiple blocks, each of which generates a map task, while a file smaller than one block occupies a single block and gets its own map task.

For example, a single 780 MB file creates seven map tasks (six 128 MB blocks and one 12 MB block), whereas three files of 10 MB, 20 MB, and 150 MB result in four map tasks after block splitting (blocks of 10 MB, 20 MB, 128 MB, and 22 MB).
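The arithmetic behind these examples can be sketched in a few lines of Python. This is a simplified model under the article's assumptions (one map task per block, blocks never span files); the function name estimate_map_tasks is hypothetical, and real jobs can differ, for instance when Hive combines small files into a single split.

```python
import math

def estimate_map_tasks(file_sizes_mb, block_size_mb=128):
    """Simplified estimate: one map task per block, and every file
    (however small) occupies at least one block of its own."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

# One 780 MB file: six full 128 MB blocks plus one 12 MB block.
print(estimate_map_tasks([780]))          # 7

# Files of 10 MB, 20 MB, and 150 MB: 1 + 1 + 2 blocks.
print(estimate_map_tasks([10, 20, 150]))  # 4
```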

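More map tasks are not always better. When the input consists of many tiny files, each file becomes its own map task, and the time spent starting and initializing the tasks can exceed the time spent processing the data; because only a limited number of maps can run at once, the extra tasks also queue up rather than run in parallel. Conversely, large files with complex logic may benefit from additional maps to reduce per‑task data volume.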

To increase map count when a single large file contains millions of records, the article suggests splitting the file into multiple parts using a random distribution:

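-- Use 10 reducers so that the CTAS below writes 10 output files.
set mapreduce.job.reduces=10;
-- distribute by rand() sends each row of a to a randomly chosen reducer.
create table a_1 as
select * from a
distribute by rand();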

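Because each of the ten reducers writes one output file, table a_1 ends up with ten files of roughly equal size, and a later query that reads a_1 instead of a can use ten map tasks to process the same data.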
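Why random distribution yields evenly sized files can be illustrated with a small simulation. This is only a sketch of the idea, not of Hive's internals: it assumes each row is assigned to one of the reducers uniformly at random, and the function name distribute_randomly is hypothetical.

```python
import random
from collections import Counter

def distribute_randomly(num_rows, num_reducers, seed=0):
    """Assign each row to a uniformly random reducer and return
    the number of rows each reducer receives."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_reducers) for _ in range(num_rows))
    return [counts[r] for r in range(num_reducers)]

# 100,000 rows spread over 10 reducers: each gets close to 10,000 rows.
sizes = distribute_randomly(100_000, 10)
print(sizes)
```

With enough rows, the per-reducer counts stay close to the average, so no single output file (and no single downstream map task) carries a disproportionate share of the data.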

When adjusting reduce tasks, two settings matter. The default amount of data per reducer is 256 MB (hive.exec.reducers.bytes.per.reducer=256000000) and the default maximum number of reducers is 1009 (hive.exec.reducers.max=1009); these are the defaults in Hive 0.14 and later. Hive estimates the number of reducers as N = min(maxReducers, ceil(totalInputSize / bytesPerReducer)). Alternatively, the count can be set directly, for example with set mapreduce.job.reduces=15;.
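The reducer formula can be checked with a short Python sketch. It assumes default values of 256,000,000 bytes per reducer and a cap of 1009 reducers, and rounds the division up; the function name estimate_reducers is hypothetical, and the result is only an estimate of what Hive would choose for a given input size.

```python
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=256_000_000,
                      max_reducers=1009):
    """N = min(max_reducers, ceil(total_input_bytes / bytes_per_reducer)),
    with a floor of one reducer."""
    wanted = max(1, math.ceil(total_input_bytes / bytes_per_reducer))
    return min(max_reducers, wanted)

# 10 GB of input: 10e9 / 256e6 = 39.06..., rounded up to 40 reducers.
print(estimate_reducers(10 * 10**9))   # 40

# 500 GB of input would want 1954 reducers, so the cap of 1009 applies.
print(estimate_reducers(500 * 10**9))  # 1009
```

Lowering bytes_per_reducer raises the reducer count for the same input, which is the same lever as changing hive.exec.reducers.bytes.per.reducer in a session.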

However, increasing reducers indiscriminately also incurs overhead and may produce many small output files, which can be problematic for downstream jobs. The article concludes that both map and reduce counts should be balanced to handle large data volumes while keeping per‑task workload reasonable.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
