Understanding Spark Shuffle: Stages, Evolution, and Source Code Structure
This article explains the concept of Spark Shuffle, details its two-phase write and read processes, describes the evolution from Hash‑based to Sort‑based and Tungsten‑based shuffles across Spark versions, and outlines the relevant source‑code components in Spark 2.1.
In distributed computation frameworks such as Spark or Hadoop MapReduce, data is partitioned by key and scattered across the cluster; when a new partitioning is required (e.g., for sorting), the data must be reshuffled, which is the process called Shuffle.
1. Spark Shuffle Two Phases – Spark divides a job into stages; Shuffle consists of a Write phase (the last step of the parent stage) and a Read phase (the first step of the child stage). The Write phase runs on ShuffleMapTask and the Read phase on ResultTask.
The tasks involved are:
Shuffle Write
Map‑side combine (if needed)
Write to a local output file
Shuffle Read
Block fetch
Reduce‑side combine
Sort (if needed)
During the Write phase, after the last map computation of a stage, Spark may aggregate results, then writes the partitioned data to the local disk of the executor. The Read phase starts when a reduce task fetches the shuffled blocks, aggregates by key, optionally sorts, and produces a new RDD.
2. Evolution of Spark Shuffle Implementations
Spark 0.8 and earlier – Hash‑based Shuffle: partitions are hashed, no sorting, producing M×R intermediate files.
Spark 0.8.1 – File consolidation to reduce the number of intermediate files.
Spark 0.9 – Introduction of ExternalAppendOnlyMap for spilling to disk during combine.
Spark 1.1 – Sort‑based Shuffle added (still defaulting to Hash). Uses ShuffleBlockManager split into ShuffleManager with HashShuffleManager and SortShuffleManager.
Spark 1.2 – Default shuffle switched to Sort‑based.
Spark 1.4 – Tungsten‑Sort based Shuffle introduced, using off‑heap memory and unsafe operations.
Spark 1.6 – Tungsten‑Sort merged into Sort‑based Shuffle; manager classes unified.
Spark 2.0 – Hash‑based Shuffle removed; only Sort‑based Shuffle remains.
Key classes for each implementation include: org.apache.spark.storage.ShuffleBlockManager (Hash) org.apache.spark.shuffle.sort.SortShuffleManager (Sort) org.apache.spark.shuffle.unsafe.UnsafeShuffleManager (Tungsten‑Sort)
3. Spark Shuffle Source Code Structure (Spark 2.1 example)
Shuffle Write entry chain:
org.apache.spark.scheduler.ShuffleMapTask#runTask
---> org.apache.spark.shuffle.sort.SortShuffleManager#getWriter
---> org.apache.spark.shuffle.sort.SortShuffleWriter#write (ordinary sort)
---> org.apache.spark.shuffle.sort.UnsafeShuffleWriter#write (Tungsten‑sort)Important dependencies:
org.apache.spark.util.collection.ExternalSorter // sorts by (partition id, key)
---> org.apache.spark.util.collection.PartitionedAppendOnlyMap org.apache.spark.shuffle.sort.ShuffleExternalSorter // Java implementationShuffle Read entry chain:
org.apache.spark.rdd.ShuffledRDD#compute
---> org.apache.spark.shuffle.sort.SortShuffleManager#getReader
---> org.apache.spark.shuffle.BlockStoreShuffleReader#readKey dependencies for reading:
org.apache.spark.Aggregator // combine logic
---> org.apache.spark.util.collection.ExternalAppendOnlyMap
org.apache.spark.util.collection.ExternalSorter // optional final sortingThe article concludes with a list of reference materials and recommended readings for deeper exploration of Spark Shuffle mechanisms.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
