Understanding Hadoop MapReduce Architecture and YARN: Components, Workflow, and Optimization
This article explains Hadoop's distributed storage and processing framework, details the MapReduce programming model, describes the classic JobTracker/TaskTracker architecture, outlines the shuffle and combine phases, and introduces YARN as a scalable replacement with its ResourceManager, ApplicationMaster, and NodeManager components.
Apache Hadoop is an open‑source software framework that can be installed on a cluster of commodity machines, enabling them to communicate and cooperate to store and process massive data sets in a highly distributed manner. Its core components are the Hadoop Distributed File System (HDFS) and a distributed computation engine that runs programs as MapReduce jobs.
MapReduce, popularized by Google, is a simple programming model useful for processing large data sets in parallel and at scale. Users express their computation as Map and Reduce functions that operate on key‑value pairs, and Hadoop provides a high‑level API for implementing custom map and reduce logic in various languages.
Hadoop’s infrastructure runs MapReduce jobs as a series of map and reduce tasks. A Map task invokes the map function on a subset of the input data; once all map calls finish, the corresponding Reduce tasks process the intermediate data generated by the maps. The map and reduce tasks run independently, supporting parallelism and fault tolerance.
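The model described above can be illustrated with a small in-process word count. This is a minimal plain-Python sketch, not Hadoop's API: the names map_fn, reduce_fn, and run_job are hypothetical, and the grouping step stands in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts seen for one word."""
    yield word, sum(counts)


def run_job(lines):
    # Map phase: each line stands in for one record of an input split.
    intermediate = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            intermediate[key].append(value)  # group values by key

    # Reduce phase: one reduce call per distinct key, in sorted key order.
    output = {}
    for key in sorted(intermediate):
        for out_key, out_value in reduce_fn(key, intermediate[key]):
            output[out_key] = out_value
    return output


result = run_job(["the quick brown fox", "the lazy dog", "The fox"])
```

Because each map call only sees its own record and each reduce call only sees the values for its own key, the calls can be distributed across machines without changing the result.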
MapReduce Architecture
The classic MapReduce workflow involves JobClient, JobTracker, and TaskTracker:
JobClient requests a new JobID from JobTracker.
It checks the job output specification.
It computes input splits for the job.
It copies the job’s JAR, configuration files, and split information to a directory named after the JobID on the JobTracker’s file system.
It calls JobTracker.submitJob() to signal that the job is ready for execution.
JobTracker enqueues the submission, schedules it, and initializes the job.
During initialization, JobTracker creates one map task for each input split; the tasks are then run by TaskTrackers on the worker nodes.
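TaskTrackers send periodic heartbeats to JobTracker, which uses them to track which nodes are alive and to assign tasks to TaskTrackers that have free slots.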
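The split computation and the one-map-task-per-split rule from the steps above can be sketched as follows. This is a simplified illustration in plain Python: the split-size rule has the same shape as the one in Hadoop's FileInputFormat, but the function names, the 128 MB block size, the 300 MB file, and the task naming are assumptions, and details such as block locality and non-splittable files are left out.

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same shape as the FileInputFormat rule:
    # max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))


def compute_splits(file_length, split_size):
    """Return (offset, length) pairs that together cover the whole file."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024
split_size = compute_split_size(block_size=128 * MB)  # assumed block size
splits = compute_splits(file_length=300 * MB, split_size=split_size)

# One map task per split (hypothetical task names).
map_tasks = [f"map_{i:06d}" for i in range(len(splits))]
```

With these assumed sizes the file yields three splits, so the job gets three map tasks; the last split is shorter than the others because it only covers the remainder of the file.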
Shuffle and Combine
The shuffle phase spans both the map and reduce sides and includes a sort stage. Combine runs on the map side to pre‑aggregate data before it is spilled to disk, reducing network and I/O traffic.
Map Shuffle Details
Input : Map tasks read splits from HDFS blocks.
Partitioning : A Partitioner assigns each key to a reduce task; the default implementation hashes the key and takes the result modulo the number of reduce tasks.
Spill : When the in‑memory buffer (default 100 MB) reaches its spill threshold (80% by default), its contents are sorted and written to temporary spill files on disk.
Combiner : Optional pre‑reduce aggregation that must not alter the final result.
Merge : Multiple spill files are merged into a single sorted, partitioned output file, which the reducers then fetch.
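The map-side steps above (partition, spill, combine, merge) can be sketched in plain Python. This is a toy model under stated assumptions, not Hadoop's implementation: the buffer limit counts records rather than bytes, java_string_hash is a stand-in hash chosen only because it is deterministic, and the combiner is applied on every spill and on the final merge, whereas Hadoop decides at run time whether and how often to run it.

```python
from heapq import merge
from itertools import groupby


def java_string_hash(s):
    """Java-style String.hashCode(), used here as a deterministic stand-in."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key, num_reducers):
    # Shape of a hash partitioner: non-negative hash modulo reducer count.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def sort_key(record):
    return record[0], record[1]  # order by partition, then by key


def combine(sorted_records):
    """Sum combiner: collapse adjacent records sharing (partition, key)."""
    for (part, key), group in groupby(sorted_records, key=sort_key):
        yield part, key, sum(value for _, _, value in group)


def map_side_shuffle(records, num_reducers, buffer_limit):
    spills, buffer = [], []

    def spill():
        buffer.sort(key=sort_key)             # sort before writing
        spills.append(list(combine(buffer)))  # one sorted "spill file"
        buffer.clear()

    for key, value in records:
        buffer.append((partition(key, num_reducers), key, value))
        if len(buffer) >= buffer_limit:
            spill()
    if buffer:
        spill()

    # Merge all sorted spills into one sorted run and combine once more.
    return list(combine(merge(*spills, key=sort_key)))


records = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
map_output = map_side_shuffle(records, num_reducers=2, buffer_limit=4)
```

Summation is associative and commutative, so applying the combiner once, twice, or not at all gives the same totals. That property is what the requirement that a combiner must not alter the final result amounts to in practice.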
Reduce Shuffle Details
Copy : Reducer fetches map output files from TaskTrackers via HTTP.
Merge : Three merge strategies are used – memory‑to‑memory, memory‑to‑disk, and disk‑to‑disk – to combine the fetched data into one input file for the reducer.
Reducer Input : The final merged file is loaded into memory (or streamed from disk) and processed by the reduce function, after which results are written back to HDFS.
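The reduce-side flow above can be sketched as a k-way merge followed by one reduce call per key. This is a minimal illustration in plain Python: the fetched lists stand in for map output segments already copied to the reducer, and reduce_side and sum_reducer are hypothetical names. It does not model the memory and disk merge stages separately.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def reduce_side(map_outputs, reduce_fn):
    """map_outputs: key-sorted lists of (key, value), one per map task."""
    # Merge: combine the sorted segments into a single sorted stream.
    merged = merge(*map_outputs, key=itemgetter(0))

    results = []
    # Reduce: call reduce_fn once per key with all of that key's values.
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reduce_fn(key, values))
    return results


def sum_reducer(key, values):
    yield key, sum(values)


# Assumed sample data: three map tasks, each output already sorted by key.
fetched = [
    [("apple", 2), ("cherry", 1)],
    [("apple", 1), ("banana", 4)],
    [("banana", 1), ("cherry", 5)],
]
final_output = reduce_side(fetched, sum_reducer)
```

Because every segment arrives sorted, the merge is a streaming operation and the reducer never needs all values for all keys in memory at once, only the values for the key it is currently processing.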
YARN (Yet Another Resource Negotiator)
YARN is the next‑generation execution framework that replaces the classic MapReduce architecture. It separates cluster‑wide resource management from per‑application scheduling and monitoring, allowing multiple processing models to run on the same cluster.
ResourceManager : Global cluster manager that arbitrates resources among competing applications.
ApplicationMaster : Per‑application lightweight process that coordinates the execution of its tasks, monitors progress, restarts failed tasks, and aggregates counters.
NodeManager : Replaces TaskTracker; it launches containers for tasks without fixed map/reduce slots, enabling dynamic resource allocation.
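The difference between fixed slots and containers can be illustrated with a toy allocator. This is a heavily simplified sketch in plain Python, not YARN's API or scheduling algorithm: the class and method names mirror the component names only for readability, the first-fit policy and all resource figures are assumptions, and here the ResourceManager starts containers directly, whereas in YARN the ApplicationMaster requests containers and then asks NodeManagers to launch them.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, container_id, mb, vcores):
        # Reserve exactly the requested resources; there are no fixed slots.
        self.free_mb -= mb
        self.free_vcores -= vcores
        self.containers.append(container_id)


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 0

    def allocate(self, mb, vcores):
        """First-fit: return (node name, container id), or None if full."""
        for node in self.nodes:
            if node.free_mb >= mb and node.free_vcores >= vcores:
                self.next_id += 1
                container_id = f"container_{self.next_id:04d}"
                node.launch(container_id, mb, vcores)
                return node.name, container_id
        return None


rm = ResourceManager([
    NodeManager("node1", free_mb=4096, free_vcores=4),
    NodeManager("node2", free_mb=8192, free_vcores=8),
])

am = rm.allocate(mb=1024, vcores=1)                      # ApplicationMaster
tasks = [rm.allocate(mb=2048, vcores=1) for _ in range(5)]  # task containers
```

Each request states its own memory and CPU needs, so a node can host a mix of small and large containers until its capacity is used up, rather than a preconfigured number of map slots and reduce slots.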
The classic MapReduce design suffers from scalability bottlenecks because a single JobTracker handles both resource management and task coordination, leading to high overhead and under‑utilization of CPU and memory. YARN addresses these issues by separating these responsibilities, supporting faster MapReduce computation, multi‑framework workloads, and easier framework upgrades.
Overall, Hadoop’s MapReduce model provides a powerful abstraction for distributed batch processing, while YARN extends the platform to be more flexible, scalable, and suitable for a broader range of data‑intensive applications.
Architects' Tech Alliance
