Magnet: A Push‑Based Shuffle Service that Scales to Petabyte‑Level Data Processing
LinkedIn’s massive Spark workloads suffer from shuffle bottlenecks caused by tiny shuffle blocks, unreliable RPC connections, and data skew, so the authors design Magnet—a push‑merge shuffle service that merges blocks into large chunks, improves disk I/O, tolerates failures, and cuts end‑to‑end job time by nearly 30% regardless of hardware.
Abstract
Apache Spark’s shuffle phase moves all intermediate data between map and reduce tasks. At LinkedIn, rapid growth in data volume and cluster size makes shuffle a scalability bottleneck, increasing job latency, causing failures, and raising infrastructure cost.
1. Introduction
Distributed data‑processing frameworks such as Hadoop and Spark rely on shuffle to transfer intermediate data. Spark’s external shuffle service (ESS) stores shuffle data on disk, providing fault tolerance but still suffers from three major problems at LinkedIn’s scale:
Full‑mesh RPC connections between map and reduce tasks become unreliable as node failures become common.
Shuffle blocks are tiny (average size ≈ tens of KB); reading them from HDDs incurs high IOPS cost and wastes ~15 % of Spark compute resources.
As the number of map tasks (M) and reduce tasks (R) grows, the number of blocks (M × R) grows quadratically, further shrinking block size.
Replacing HDDs with SSDs is impractical at LinkedIn’s petabyte‑scale, and cloud‑native deployments face similar network‑I/O issues.
2. Background and Motivation
In LinkedIn’s YARN‑based Spark deployment, each executor registers with a local ESS. The ESS tracks the location of shuffle files and serves them to reducers. However, the following issues remain:
2.2 Disk I/O inefficiency
Small random reads from HDDs dominate shuffle latency; a study of 5 000 delayed stages shows that most have block sizes of only a few KB, leading to long fetch times.
2.3 Full‑mesh reliability
Each shuffle requires M × R RPC connections; with thousands of nodes, intermittent node failures cause connection failures that trigger costly retries and stage failures.
2.4 Reduce‑task placement
Data locality is less beneficial than expected because tiny blocks force reducers to fetch data from many remote nodes, and network bandwidth cannot be fully utilized due to disk I/O limits.
3. System Design
Magnet is a push‑merge shuffle service that retains Spark’s sort‑based on‑disk shuffle semantics while addressing the three problems above.
3.1 Overview
Spark Driver coordinates map tasks, reduce tasks, and Magnet services.
Map tasks push their sorted shuffle blocks to a remote Magnet service instead of writing them locally.
Magnet merges blocks belonging to the same shuffle partition into large files.
Reduce tasks read the merged files (co‑located with the task when possible) to avoid many small reads.
3.2 Push‑Merge Shuffle
Each map task divides its shuffle file into MB‑sized chunks, randomizes their order, and pushes them via a dedicated thread pool. Magnet assigns each chunk to a specific Magnet service based on partition ID, ensuring that all chunks for a partition converge on the same service.
3.2.1 Remote‑push preparation
To keep CPU and memory overhead low on the shuffle service, Magnet does not sort or buffer blocks in memory; sorting remains in the map task. The service only appends received chunks to the appropriate merged file.
3.2.2 Merging on the service
For each active partition Magnet maintains metadata (bitmap of received mapper IDs, current offset, and currentMapId) stored in a ConcurrentHashMap<>. This metadata detects duplicate blocks, orders appends, and enables recovery after partial failures.
3.3 Reliability Enhancements
Magnet adopts a best‑effort approach:
If a map task fails before pushing, it simply retries without affecting the service.
If a block push fails after retries, Magnet discards the block and the reducer falls back to the original unmerged block.
If merging fails due to duplicate or corrupted data, the reducer also falls back to the unmerged block.
This double‑copy strategy adds only modest storage overhead (temporary shuffle data occupies a few hundred TB even when daily PB‑scale data is processed).
3.4 Flexible Deployment
Magnet can be deployed in two modes:
Co‑located clusters : Magnet runs on the same nodes as Spark executors, allowing reducers to read merged files locally for better data locality.
Disaggregated cloud clusters : Magnet runs on compute nodes while merged files reside on separate storage nodes; the driver selects services with lower load to balance traffic.
Dynamic resource allocation is respected: if executors are released, Magnet services outside the active executor set are chosen, and new executors are launched near the selected services.
4. Implementation Optimizations
Magnet is built on Apache Spark 2.3 and requires only internal changes, so existing Spark applications run unchanged.
4.1 Parallel data transfer
Map tasks push blocks using a dedicated thread pool, overlapping CPU‑intensive map work with network‑intensive push. Reduce tasks fetch merged files in MB‑sized slices, allowing parallel fetch and computation similar to Spark’s asynchronous RPC model.
4.2 Low‑overhead service
The Magnet service performs no sorting or in‑memory buffering; it directly appends received chunks to disk, keeping CPU usage comparable to the native ESS.
4.3 Disk I/O optimization
Although Magnet writes merged data twice (original block then merged file), sequential writes benefit from OS page cache and disk buffers, yielding higher throughput than many small random reads. Small‑block writes remain stable across block sizes.
5. Evaluation
5.1 Setup
Two environments were used:
A synthetic pressure‑test framework that generates high‑concurrency shuffle traffic on a single shuffle service.
A production‑like benchmark cluster (200 nodes, 56 CPU cores, 256 GB RAM per node, 10 Gbps Ethernet, 6 HDDs per node with 100 MB/s sequential read).
Table 1 (not reproduced) lists three synthetic workloads with identical total data but varying block sizes.
5.2 Synthetic Workload Results
5.2.1 Completion time
Fetching 150 GB of 10 KB blocks from HDDs takes ~4 h with native Spark; Magnet reduces this to ~5 min by merging blocks into MB‑sized reads. Block size has little impact on push time because pushes read large local files.
5.2.2 Disk I/O
For fetch operations, read throughput increases dramatically with block size, confirming that HDD IOPS limit small random reads. Write throughput during pushes remains stable across block sizes.
5.2.3 Resource usage
Magnet’s shuffle service consumes 20‑50 % CPU (virtual cores) and ~300 MB heap/off‑heap memory, similar to the native service, despite handling PB‑scale data.
5.3 Production Workload Evaluation
Three real jobs were run on a 200‑node cluster (200 executors, 5 GB memory each):
Job 1: small SQL query (< 100 GB shuffle) – minimal benefit.
Job 2: CPU‑intensive (~ 400 GB shuffle) – 45 % overall runtime reduction, 68 % reduction in reduce‑stage time.
Job 3: I/O‑intensive (~ 800 GB shuffle) – 81 % reduction in reduce‑stage time.
When jobs 2 and 3 ran concurrently, Magnet’s benefits amplified, demonstrating robustness under mixed workloads.
6. Related Work
Riffle [41] pulls blocks to a merge service and caches them in memory, incurring higher memory pressure than Magnet’s executor‑side buffering. Sailfish [33] and Cosco [4] delegate shuffle storage to external systems (KFS, Facebook’s storage), limiting deployment flexibility. iShuffle [??] pushes blocks to reducers but cannot improve reliability or locality in Spark’s two‑level scheduler. Other proposals (HD‑shuffle [31], MapReduce‑Online [20], Hadoop‑A [39]) address shuffle efficiency but target different execution models or require specialized hardware.
7. Conclusion
Magnet introduces a push‑merge shuffle mechanism that merges tiny shuffle blocks into large sequential reads, improves disk I/O, tolerates push/merge failures, and enhances data locality for reduce tasks. Evaluations on synthetic and production workloads show up to 30 % end‑to‑end job time reduction and substantial I/O gains, proving Magnet’s effectiveness for petabyte‑scale Spark deployments.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
