Understanding Shuffle in Hadoop MapReduce and Spark
This article explains the concept and workflow of shuffle in Hadoop MapReduce and Spark, covering map‑side buffering, spill and merge, the reduce‑side copy‑merge‑reduce pipeline, and the reasons for sorting and file merging, and comparing the Hash‑Shuffle and Sort‑Shuffle implementations with their performance trade‑offs.
Shuffle Overview
Shuffle means to rearrange data; in MapReduce it refers to the process that takes the unordered intermediate output from the map phase and reorganizes it according to a partitioning rule so that reducers can consume it. It occurs between map output and reduce input and consists of a map‑side and a reduce‑side part.
MapReduce Shuffle
Map‑side Shuffle
Before shuffle, the input data is split and each split is processed by a MapTask, producing key‑value pairs (intermediate results). These intermediate results are first stored in an in‑memory circular buffer, where each pair is assigned a partition attribute and serialized into a byte array.
When the buffer reaches a threshold, a spill thread writes the buffered data to temporary files on disk, sorting and optionally combining the records by key.
After all spills are done, the temporary spill files are merged into a final output file per map task, which is then made available to reducers.
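The buffer‑spill‑merge cycle above can be sketched in a few lines. This is a simplified in‑memory simulation, not Hadoop's actual implementation: the spill threshold, the partition function `hash(key) % num_reducers`, and the final list‑based merge are all illustrative stand‑ins.

```python
from collections import defaultdict

def spill(buffer, num_reducers):
    """One spill: assign each record a partition, then sort each partition by key."""
    parts = defaultdict(list)
    for key, value in buffer:
        parts[hash(key) % num_reducers].append((key, value))
    return {p: sorted(kvs) for p, kvs in parts.items()}

def map_side_shuffle(pairs, num_reducers, spill_threshold=4):
    """Simulate map-side shuffle: buffer records, spill sorted runs when the
    buffer fills, then merge all runs into one sorted output per partition."""
    buffer, spills = [], []
    for kv in pairs:
        buffer.append(kv)
        if len(buffer) >= spill_threshold:   # threshold reached: spill to "disk"
            spills.append(spill(buffer, num_reducers))
            buffer = []
    if buffer:                                # final spill of the leftover buffer
        spills.append(spill(buffer, num_reducers))
    # Merge phase: combine each partition's sorted runs into one sorted sequence.
    merged = {}
    for p in range(num_reducers):
        runs = [s[p] for s in spills if p in s]
        merged[p] = sorted(kv for run in runs for kv in run)
    return merged
```

In the real framework each spill goes to a temporary file and the merge is a streaming multi‑way merge rather than a full re‑sort, but the per‑partition, key‑sorted end state is the same.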
Reduce‑side Shuffle
Reducers continuously poll the JobTracker to learn which map tasks have finished. Once a map task is complete, the reduce‑side shuffle begins, consisting of three stages: copy, merge, and reduce.
Each reduce task fetches the partitioned data from every map task using the offset information stored in index files.
Fetched data is merged, sorted, and processed by the user‑defined reduce logic.
The final output is written back to HDFS.
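The copy‑merge‑reduce sequence can be illustrated with a streaming multi‑way merge over the sorted runs fetched from each map task. This is a sketch of the idea, not Hadoop's code; `reduce_fn` stands in for the user‑defined reduce logic.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(sorted_runs, reduce_fn):
    """Merge key-sorted runs (one per map task) into a single sorted stream,
    then apply the reduce function to each key's group of values."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        values = [v for _, v in group]
        results.append((key, reduce_fn(key, values)))
    return results
```

A word‑count reducer, for example, would pass `lambda k, vs: sum(vs)` as `reduce_fn`. Because every input run is already sorted, `heapq.merge` streams the data without loading it all into memory, which mirrors why the map side sorts before handing data over.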
Why Sorting Is Needed
Sorting groups identical keys together, enabling the optional combine step.
Reducers process data key‑by‑key; without sorting they would need to scan the entire dataset for each key, which is inefficient.
With sorted input, a change in key signals that all values for the previous key have been seen, so the reducer can emit each group's result immediately instead of buffering the whole dataset.
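The third point can be made concrete: over key‑sorted input, a single sequential pass suffices, and each key change marks the end of a complete group. This is an illustrative sketch (summing values per key), not framework code.

```python
def grouped_sums_sorted(pairs):
    """One sequential pass over key-sorted pairs: a key change means the
    previous key's group is complete and its result can be emitted at once."""
    out, cur_key, cur_sum = [], None, 0
    for key, value in sorted(pairs):
        if key != cur_key:
            if cur_key is not None:
                out.append((cur_key, cur_sum))  # previous group is finished
            cur_key, cur_sum = key, 0
        cur_sum += value
    if cur_key is not None:
        out.append((cur_key, cur_sum))          # flush the last group
    return out
```

Without the sort, the same grouping would need either a hash table over all keys held in memory or a full rescan of the data for every key.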
Why File Merging Is Needed
Spill operations generate many small files; merging prevents a proliferation of tiny files that would waste cluster resources.
Fewer files reduce the number of open file handles.
Merging preserves sort order within each partition of the final output file, so each reducer receives a fully sorted slice (ordering is per partition, not global across partitions).
Spark Shuffle
Spark’s shuffle builds on the MapReduce model but introduces optimizations. Shuffle write corresponds to the map side, and shuffle read corresponds to the reduce side.
Spark provides two shuffle manager implementations: HashShuffleManager and SortShuffleManager. The default changed from hash‑based to sort‑based in Spark 1.2.
Hash Shuffle
Normal Mechanism
Each mapper creates a bucket for every reducer (M × R buckets). This can produce many small files.
Optimized Mechanism
Enabling spark.shuffle.consolidateFiles=true lets map tasks that run sequentially on the same executor core reuse that core's file group instead of creating new files, reducing the file count from M × R to cores × R.
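The difference in file counts is easy to quantify. The helper below just encodes the two formulas from the text (M × R versus cores × R); the specific numbers in the test are illustrative.

```python
def hash_shuffle_files(num_map_tasks, num_reducers, cores, consolidate):
    """Number of shuffle files written on the map side under hash shuffle.
    Normal: every map task writes one file per reducer (M x R).
    Consolidated: map tasks sharing a core reuse its file group (cores x R)."""
    return (cores if consolidate else num_map_tasks) * num_reducers
```

For a job with 1,000 map tasks, 200 reducers, and 16 cores, consolidation cuts the file count from 200,000 to 3,200, which is why the optimized mechanism matters at scale.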
Sort Shuffle
Normal Mechanism
Data is first written to an in‑memory structure (Map for aggregation operators, Array for non‑aggregation). When a threshold is reached, the data is spilled to disk after sorting by key, producing multiple temporary files that are later merged into a single file per task, along with an index file.
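The single‑file‑plus‑index layout can be sketched as follows. This is a toy model, not Spark's binary format: partitions are assumed to be contiguous ids starting at 0, records are newline‑delimited text, and the index is a plain list of byte offsets.

```python
import io

def write_sort_shuffle_output(partitioned_records):
    """Write every partition's key-sorted records back to back into one data
    blob, and build an index of byte offsets marking partition boundaries.
    partitioned_records: dict {partition_id: list of (key, value) strings},
    with partition ids 0..n-1."""
    data = io.BytesIO()
    index = [0]
    for pid in sorted(partitioned_records):
        for key, value in sorted(partitioned_records[pid]):
            data.write(f"{key}\t{value}\n".encode())
        index.append(data.tell())   # end offset of this partition
    return data.getvalue(), index

def read_partition(data, index, pid):
    """What a shuffle read does conceptually: use the index to seek straight
    to one reducer's slice of the data file."""
    start, end = index[pid], index[pid + 1]
    return data[start:end].decode()
```

The index file is what lets a reducer fetch only its own partition from each map task's single output file, instead of the map task keeping one file open per reducer.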
Bypass Mechanism
If the number of shuffle read tasks does not exceed spark.shuffle.sort.bypassMergeThreshold (default 200) and the operator is not aggregation‑type, Spark skips sorting entirely: it writes hash‑partitioned records directly to disk and merges the per‑partition files afterwards, which avoids the sorting cost for small jobs.
Spark Shuffle Summary
Shuffle partitions map‑side data and delivers it to the appropriate reducers. Its performance directly impacts overall job throughput. Hadoop’s shuffle is sort‑based, while Spark supports both hash‑based and sort‑based shuffles, with the latter becoming default after Spark 1.2.
Hash‑Shuffle can generate many small files; its optimized version reduces file count but may still suffer when reducers are many. Sort‑Shuffle reduces file count by merging spills into a single file per task and can bypass sorting for small jobs, offering better performance.
Differences Between Spark and MapReduce Shuffle
Both perform partitioning of mapper output, but Hadoop always sorts map output by key before combine/reduce, whereas Spark sorts by key only when the operation requires it (e.g. sortByKey); otherwise, even under SortShuffleManager, records are ordered by partition id and aggregated with hash structures rather than key‑sorted.
Hadoop’s pipeline is clearly divided into map, spill, merge, shuffle, sort, reduce stages; Spark integrates these steps within its DAG of stages and transformations.