Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

21CTO

May 17, 2018

Apache Hadoop is an open-source framework that lets a cluster of commodity machines store and process massive data sets in a distributed manner. It originally consisted of two parts: HDFS for storage and a MapReduce engine for computation.

MapReduce Model

MapReduce, introduced by Google, is a simple programming model that expresses a computation as Map and Reduce functions operating on key-value pairs. Hadoop's native API for writing these functions is Java, and Hadoop Streaming allows them to be written in other languages.

The framework runs Map tasks on independent subsets of the input data, groups the intermediate results by key, and passes each group to a Reduce task. Because tasks do not depend on one another, they can run in parallel, and a failed task can simply be rerun.
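As a rough illustration of the model, the single-process sketch below counts words. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for this example, and an in-memory dictionary stands in for the framework's grouping of intermediate results by key.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Emit (word, 1) for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)


def run_job(records):
    """Run map_fn over every record, group by key, then reduce each key."""
    intermediate = defaultdict(list)
    # Map phase: every record is processed independently of the others.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Grouping by key stands in for the framework's shuffle.
            intermediate[out_key].append(out_value)
    # Reduce phase: every key is reduced independently, in sorted key order.
    return [reduce_fn(k, intermediate[k]) for k in sorted(intermediate)]


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
result = run_job(lines)
```

In a real cluster the loop over records would be split across many Map tasks and the loop over keys across many Reduce tasks; the functions themselves would not need to change.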

The Hadoop infrastructure abstracts parallelism, scheduling, resource management, inter‑machine communication, and fault handling, making it easy to develop distributed applications that process terabytes of data across hundreds or thousands of nodes.

MR Architecture

Shuffle and Combine

The shuffle phase spans both the Map and Reduce sides and includes sorting by key. A Combiner is an optional function that runs on the Map side to pre-aggregate the output for each key, which reduces the amount of data written to disk and transferred during the shuffle.
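The effect of map-side pre-aggregation can be sketched in a few lines. This is an illustrative stand-in rather than Hadoop's Combiner interface: `map_words` and `combine` are hypothetical helpers, and the example assumes the reduce operation is a sum, which is safe to apply to partial results because addition is associative and commutative.

```python
from collections import Counter


def map_words(line):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Sum the values for each key locally, before anything is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


raw = map_words("a b a c a b")
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
```

Since the framework may apply a Combiner zero, one, or several times, the job's final result should not depend on whether it runs.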

Map Shuffle Process

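A Map task reads its input split (typically one HDFS block), and a Partitioner assigns each output record to a Reduce task. Records are buffered in memory. When the buffer passes a threshold, its contents are sorted, optionally run through a Combiner, and spilled to local disk. When the task finishes, the spill files are merged into a single partitioned, sorted output file.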
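The partition, buffer, and spill steps can be sketched as follows. This is a simplified model under stated assumptions: `MapOutputBuffer` and `partition` are hypothetical names, the threshold counts records where Hadoop measures bytes, a CRC32 hash stands in for the key's `hashCode()`, and each spill is kept as an in-memory list instead of a file on local disk.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key using a deterministic hash."""
    # Assumption: CRC32 replaces the key.hashCode() a Java partitioner would use.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


class MapOutputBuffer:
    """Toy map-side buffer that spills sorted runs when it fills up."""

    def __init__(self, num_reducers, spill_threshold):
        self.num_reducers = num_reducers
        self.spill_threshold = spill_threshold  # in records, for simplicity
        self.buffer = []
        self.spills = []  # each entry stands in for one spill file

    def collect(self, key, value):
        """Tag a map output record with its partition and buffer it."""
        self.buffer.append((partition(key, self.num_reducers), key, value))
        if len(self.buffer) >= self.spill_threshold:
            self.spill()

    def spill(self):
        """Sort buffered records by (partition, key) and write them out."""
        if self.buffer:
            self.spills.append(sorted(self.buffer))
            self.buffer = []


buf = MapOutputBuffer(num_reducers=2, spill_threshold=3)
for word in ["a", "b", "c", "a", "d", "e", "f"]:
    buf.collect(word, 1)
spills_before_flush = len(buf.spills)
buf.spill()  # final flush when the map task ends
```

Sorting each spill by partition first keeps all records bound for one Reduce task contiguous, which is what lets a reducer later fetch only its own slice of the map output.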

Reduce Shuffle Process

Reduce tasks launch fetcher threads that copy their partition of each map output over HTTP from the nodes that ran the Map tasks (the TaskTrackers in classic MapReduce). The fetched segments are merged in memory, spilling to disk when they do not fit, and the final merge produces a single sorted input for the Reduce function.
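The merge step works because every fetched segment is already sorted by key, so a k-way merge yields one sorted stream without re-sorting. The sketch below assumes the segments are small in-memory lists; `merge_and_reduce` is a hypothetical helper, not part of Hadoop.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_and_reduce(sorted_runs, reduce_fn):
    """Merge key-sorted runs, then call reduce_fn once per distinct key."""
    # heapq.merge streams the runs lazily and keeps the output sorted by key.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    results = []
    # groupby works here because equal keys are adjacent in the merged stream.
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append((key, reduce_fn(key, [value for _, value in group])))
    return results


# Three map outputs fetched by one reducer, each already sorted by key.
runs = [
    [("apple", 2), ("cherry", 1)],
    [("apple", 1), ("banana", 4)],
    [("banana", 1), ("cherry", 5)],
]
totals = merge_and_reduce(runs, lambda key, values: sum(values))
```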

YARN Overview

YARN (Yet Another Resource Negotiator) is the next-generation runtime that decouples cluster resource management from per-application job scheduling. It introduces a global ResourceManager, a per-application ApplicationMaster, and a per-node NodeManager to replace the single JobTracker/TaskTracker model.

The ApplicationMaster coordinates all tasks of an application, while NodeManagers provide dynamic containers instead of fixed Map/Reduce slots, improving resource utilization and supporting multiple processing frameworks.
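The division of roles can be sketched with three toy classes. They borrow YARN's component names for readability but are hypothetical and do not reflect the real YARN API: the sketch assumes memory is the only resource, uses a first-fit placement policy, and has the ResourceManager place containers directly, whereas real YARN uses pluggable schedulers and launches containers through the NodeManagers.

```python
class NodeManager:
    """Tracks the free memory on one worker node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Cluster-wide arbiter: decides where a container may run."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Return the name of the first node with room, or None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return node.name
        return None


class ApplicationMaster:
    """Per-application coordinator: requests a container for each task."""

    def __init__(self, rm, task_memory_mb):
        self.rm = rm
        self.task_memory_mb = task_memory_mb

    def run_tasks(self, task_ids):
        """Ask the ResourceManager for one container per task."""
        return {
            task_id: self.rm.allocate(self.task_memory_mb[task_id])
            for task_id in task_ids
        }


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
demands = {"map-0": 1024, "map-1": 1024, "reduce-0": 2048, "map-2": 1024, "big-0": 4096}
am = ApplicationMaster(rm, demands)
placements = am.run_tasks(list(demands))
```

Containers here are sized per task, so a map task and a reduce task draw from the same pool of memory instead of from separate slot types.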

Limitations of Classic MapReduce

The original JobTracker is a single point of failure and a scalability bottleneck, handling both cluster resource management and task coordination. Fixed Map/Reduce slots lead to under‑utilization when only one type of task is active.
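The cost of fixed slots can be shown with a little arithmetic. The functions and numbers below are hypothetical, and the comparison assumes every task needs the same amount of resources, so one slot is equivalent to one generic container.

```python
def slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of slots busy when each slot accepts only one task type."""
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)


def container_utilization(total_containers, pending_maps, pending_reduces):
    """Fraction of capacity busy when any container can run any task."""
    busy = min(total_containers, pending_maps + pending_reduces)
    return busy / total_containers


# A node with 8 map slots and 4 reduce slots during a job's map phase:
# 20 map tasks are waiting and no reduce task is ready yet.
fixed = slot_utilization(8, 4, pending_maps=20, pending_reduces=0)
flexible = container_utilization(12, pending_maps=20, pending_reduces=0)
print(f"fixed slots: {fixed:.0%}, generic containers: {flexible:.0%}")
```

Under these assumptions a third of the node sits idle with fixed slots even though map tasks are queued, while a generic pool of the same size stays fully occupied.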

Scalability Solutions with YARN

By separating responsibilities, YARN lets cluster-wide resource management (ResourceManager) and per-job scheduling (ApplicationMaster) scale independently. It also supports multiple processing frameworks and eliminates the rigid slot model, which improves resource utilization and makes framework upgrades easier.

YARN Advantages

Faster MapReduce execution

Support for multiple processing frameworks

Easier framework upgrades
