Big Data 15 min read

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

ITPUB

Mar 29, 2018

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

Overview

Apache Hadoop is an open‑source framework that runs on a cluster of commodity machines, enabling them to communicate and cooperate to store and process massive data sets in a highly distributed manner. Its original core consists of the Hadoop Distributed File System (HDFS) and a distributed computation engine that executes programs as MapReduce jobs.

MapReduce Basics

MapReduce, popularized by Google, provides a simple programming model for parallel and scalable processing of large data sets. Users express their computation as map and reduce functions that operate on key‑value pairs. Hadoop offers a high‑level API for implementing custom map and reduce functions in various languages.

The framework runs a series of map tasks on subsets of the input data; each map task invokes the map function. After all map tasks finish, reduce tasks are launched to process the intermediate data produced by the maps, invoking the reduce function. Map and reduce tasks run independently, supporting parallelism and fault tolerance.

Hadoop abstracts away the complexities of parallelization, scheduling, resource management, inter‑machine communication, and fault handling, making it feasible to develop distributed applications that process terabytes of data across hundreds or thousands of machines.

2‑Stage MapReduce (2MR) Architecture

This diagram shows the division of tasks into a Map side and a Reduce side.

Classic MR Architecture

JobClient requests a new job ID from JobTracker.

Job output specifications are checked.

Input splits are calculated.

Resources (job JAR, config files, input splits) are copied to a JobTracker‑managed directory named after the job ID.

JobClient calls JobTracker.submitJob() to signal that the job is ready.

JobTracker queues the submission and hands it to the scheduler for initialization.

The scheduler creates a list of tasks; for each split, a Map task is generated.

TaskTrackers periodically send heartbeats to JobTracker.

Shuffle and Combine

The overall shuffle process consists of three parts: Map‑side shuffle, a sort phase, and Reduce‑side shuffle. Data moves from the output of map tasks to the input of reduce tasks.

On the Map side, a Combiner can perform a local reduce to lessen the amount of data transferred. The combiner must not change the final result and is suitable when the map output key/value types match the reducer’s input.

Map Shuffle Details

Input : Map tasks read splits from HDFS blocks.

Partitioning : The Partitioner hashes the key and assigns it to a reducer based on the number of reduce tasks.

Buffering : Map output is written to an in‑memory buffer, serialized as byte arrays.

Spill : When the buffer reaches a threshold (default 80 % of 100 MB), data is spilled to disk in a temporary spill_file.

Combiner : If configured, the combiner runs during spill to reduce data size.

Merge Phase

Multiple spill files generated by a map task are merged into a single sorted output before being written to HDFS.

Reduce Shuffle Details

Copy : Reducer fetches map output files from TaskTrackers via HTTP.

Merge : Fetched data is merged in memory or on disk (memory‑to‑memory, memory‑to‑disk, disk‑to‑disk) depending on size thresholds.

Reducer Input : The final merged file becomes the reducer’s input, after which the reducer writes its output back to HDFS.

YARN (Yet Another Resource Negotiator)

YARN, also known as MapReduce v2, is a generic resource‑management layer that decouples resource negotiation from job scheduling, allowing multiple processing frameworks to run on the same cluster.

Limitations of Classic MapReduce

JobTracker is a single point of failure and a scalability bottleneck.

JobTracker’s dual responsibilities (resource management and task coordination) overload the process.

TaskTracker’s fixed map/reduce slots lead to poor resource utilization, especially when tasks have heterogeneous CPU or memory demands.

Rigid slot allocation prevents using idle slots for other workloads.

Addressing Scalability

YARN separates the two duties of JobTracker into distinct daemons:

ResourceManager handles cluster‑wide resource allocation.

ApplicationMaster coordinates the execution of a single application’s tasks.

NodeManager replaces TaskTracker, managing containers rather than fixed slots.

This design eliminates the single‑point‑of‑failure problem, enables dynamic container allocation, and supports multiple processing frameworks beyond MapReduce.

Advantages of YARN

Faster MapReduce execution.

Support for multiple frameworks (e.g., Spark, Flink).

Easier framework upgrades.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data MapReduce YARN Hadoop Shuffle

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.