Understanding Spark Shuffle: Write and Read Mechanisms Compared to Hadoop MapReduce
This article explains how Spark implements shuffle write and shuffle read, compares its high‑level and low‑level processes with Hadoop MapReduce, and details the internal data structures, memory‑disk trade‑offs, and configuration options that affect performance.
In the previous chapter we discussed the physical execution graph of a job and how records flow through compute() to produce results, but we have not yet covered how data moves from a ShuffleDependency to the next stage.
Comparison of Hadoop MapReduce and Spark Shuffle Process
If you are familiar with the shuffle process in Hadoop MapReduce, you might imagine Spark's shuffle to be similar, yet there are notable differences and connections.
From a high‑level perspective, there is no major difference. Both systems partition the output of the mapper (or ShuffleMapTask in Spark) and send each partition to a reducer (or the next stage's ShuffleMapTask or ResultTask). Reducers use an in‑memory buffer to aggregate data before applying reduce() (or subsequent Spark operations).
From a low‑level perspective, the differences are significant. Hadoop MapReduce is sort‑based; records must be sorted before combine() and reduce(). Spark defaults to a hash‑based approach using HashMap for aggregation, without pre‑sorting. Users can enable sort‑based shuffle by setting spark.shuffle.manager=sort.
From an implementation perspective, Hadoop defines clear stages (map, spill, merge, shuffle, sort, reduce), whereas Spark only has stages and a series of transformations, so operations like spill and merge are embedded within transformations.
We define shuffle write as the partitioning and persisting of data, and shuffle read as the reducer fetching and aggregating that data. In Spark, shuffle write is added to the end of a ShuffleMapStage, where each output record is partitioned and persisted to local disk as a FileSegment. Each ShuffleMapTask creates R buffers (where R is the number of reducers) of size spark.shuffle.file.buffer.kb (default 32KB).
Bucket is a generic term for the storage location of a ShuffleMapTask 's output after partitioning.
The ShuffleMapTask writes records to the appropriate bucket determined by partitioner.partition(record.getKey()), producing FileSegment s that reducers later fetch.
Two main issues arise: (1) the large number of FileSegment s generated (M tasks × R reducers), and (2) memory consumption of buffers (cores × R × 32KB). Spark addresses the first issue with file consolidation ( spark.shuffle.consolidateFiles=true), allowing multiple tasks on the same core to share a single shuffle file.
Shuffle Read
When reading shuffle data, Spark waits for all ShuffleMapTask s of the parent stage to finish before fetching. The amount of data fetched at once is limited by spark.reducer.maxMbInFlight (default 48MB), referred to as the soft buffer.
Shuffle read processes data incrementally: each fetched record is immediately aggregated into a HashMap (or ExternalAppendOnlyMap if spilling to disk). The aggregation function must be commutative.
Fetched FileSegment s are stored in the soft buffer; processed data can reside in memory only ( AppendOnlyMap) or in memory plus disk ( ExternalAppendOnlyMap), depending on the configuration spark.shuffle.spill.
The memory‑disk balance is managed by monitoring the fraction of executor memory allocated to shuffle ( spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, default 0.24) and using a ShuffleMemoryMap to track usage per reducer.
Typical Transformations and Their Shuffle Read Implementations
reduceByKey(func) : fetches records and aggregates them in a HashMap using the provided commutative function.
groupByKey(numPartitions) : similar to reduceByKey but concatenates values without a map‑side combine.
distinct(numPartitions) : inserts records into a HashMap, discarding duplicates.
cogroup(otherRDD, numPartitions) : shares a single HashMap across multiple ShuffleDependency s, storing values from each RDD in separate ArrayBuffer s.
join / intersection : also rely on cogroup logic.
sortByKey(ascending, numPartitions) : collects records into an array and sorts them, without using a HashMap.
coalesce(numPartitions, shuffle = true) : has a ShuffleDependency but directly streams records to the next RDD without aggregation.
HashMap Implementations in Shuffle Read
AppendOnlyMap is an open hash table optimized for append‑only use cases; it does not support removal of keys. It expands using quadratic probing and can provide a sorted iterator via destructiveSortedIterator().
ExternalAppendOnlyMap wraps an AppendOnlyMap and spills to disk when memory limits are reached. Spilled maps are later merged using a heap‑based merge algorithm that combines records with the same hash key.
Memory usage is estimated using SizeTrackingAppendOnlyMap and SizeEstimator, and spilling respects the buffer size spark.shuffle.file.buffer.kb and batch size spark.shuffle.spill.batchSize (default 10,000).
Discussion
Compared to the fixed shuffle‑combine‑merge‑reduce pipeline of Hadoop MapReduce, Spark offers flexible shuffle‑aggregate strategies tailored to each transformation, balancing memory and disk usage.
This chapter covered how Spark performs shuffle without sorting records, integrating shuffle into the RDD computation chain, and highlighted differences with Hadoop MapReduce. The next chapter will explore deployment topology and inter‑process communication, including how shuffle data locations are determined.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
