Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article introduces Hadoop's core modules (Common, HDFS, YARN, MapReduce), the design of a high-availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program.

Big Data Technology & Architecture

Apr 1, 2019

Hadoop is an Apache open-source framework inspired by Google's papers on the Google File System and MapReduce. It enables distributed processing of massive datasets on clusters that scale from a single server to thousands of machines, and it handles failures in software rather than relying on expensive hardware.

Over the past decade Hadoop has evolved into a broader ecosystem that includes projects such as HBase, Hive, and Spark. Like the Spring framework, Hadoop is built on a few fundamental modules:

Common: provides shared utilities for file system access, I/O, serialization, and RPC.

HDFS: a distributed file system designed for high throughput on commodity hardware.

YARN: a resource-management layer that schedules and monitors cluster resources.

MapReduce: a programming model that performs parallel computation on large data sets.

The Common module supplies the low‑level implementations needed by the other components, such as file handling and network communication.

HDFS (Hadoop Distributed File System) stores files as blocks replicated across multiple DataNodes, while a NameNode manages the namespace and metadata. In a basic deployment there is a single NameNode, which makes it a single point of failure. A typical high-availability deployment therefore includes three ZooKeeper nodes, DNS and NTP services, two NameNodes (one active, one standby), shared storage (NFS) for the EditLogs, and many DataNodes. ZooKeeper, together with a ZKFailoverController process running beside each NameNode, detects a failed active NameNode and triggers failover. The shared storage holds the transaction log, which the standby NameNode replays continuously so that it can take over with an up-to-date namespace.
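The block-and-replica idea can be illustrated with a small sketch in plain Scala that needs no Hadoop libraries. It assumes a 128 MB block size and a replication factor of 3 (common defaults in Hadoop 2.x), and it places replicas round-robin, which is a deliberate simplification and not HDFS's rack-aware placement policy. The names `HdfsBlockSketch`, `Block`, `blockLengths`, and `place` are hypothetical and exist only for this illustration.

```scala
object HdfsBlockSketch {
  // One block of a file: its index, its byte length, and the DataNodes holding a copy.
  final case class Block(index: Int, length: Long, replicas: List[String])

  // Split a file of `fileSize` bytes into fixed-size blocks; the last block may be shorter.
  def blockLengths(fileSize: Long, blockSize: Long): List[Long] = {
    require(fileSize >= 0 && blockSize > 0)
    val full = (fileSize / blockSize).toInt
    val rest = fileSize % blockSize
    List.fill(full)(blockSize) ++ (if (rest > 0) List(rest) else Nil)
  }

  // Assign every block to `replication` distinct DataNodes, rotating the starting node.
  // Assumption: simple round-robin placement, NOT the rack-aware policy real HDFS uses.
  def place(fileSize: Long, blockSize: Long, dataNodes: Vector[String], replication: Int): List[Block] = {
    require(replication >= 1 && replication <= dataNodes.size)
    blockLengths(fileSize, blockSize).zipWithIndex.map { case (len, i) =>
      val nodes = List.tabulate(replication)(r => dataNodes((i + r) % dataNodes.size))
      Block(i, len, nodes)
    }
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
    place(300 * mb, 128 * mb, Vector("dn1", "dn2", "dn3", "dn4"), 3).foreach(println)
  }
}
```

Because every block lives on three different nodes in this sketch, losing any single DataNode still leaves two copies of each block, which is the property the NameNode's metadata has to track.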

Understanding distributed file systems begins with the concept of a file: an ordered sequence of bytes abstracted from physical storage. A file system organizes files into directories and handles allocation, deallocation, and metadata, so users only need to know file names and paths. Distributed file systems extend this model across multiple machines, providing network-transparent access, data replication, and fault tolerance.
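As a rough illustration of how replication yields fault tolerance, the following sketch reads a file from the first reachable replica. It is a toy model under stated assumptions: nodes are in-memory maps rather than networked servers, and `ReplicaReadSketch`, `Node`, and `read` are hypothetical names, not part of any Hadoop API.

```scala
object ReplicaReadSketch {
  // A toy storage node: either reachable with some files, or down.
  final case class Node(name: String, alive: Boolean, files: Map[String, Array[Byte]])

  // Try the replicas in order and return the bytes from the first live node holding the file.
  // The caller only supplies a path; which machine answers is hidden from it.
  def read(path: String, replicas: List[Node]): Option[Array[Byte]] =
    replicas.collectFirst {
      case node if node.alive && node.files.contains(path) => node.files(path)
    }

  def main(args: Array[String]): Unit = {
    val data = "hello hadoop".getBytes("UTF-8")
    val nodes = List(
      Node(name = "node-a", alive = false, files = Map("/logs/app.txt" -> data)),
      Node(name = "node-b", alive = true, files = Map("/logs/app.txt" -> data))
    )
    // node-a is down, so the read is served by node-b.
    println(read("/logs/app.txt", nodes).map(bytes => new String(bytes, "UTF-8")))
  }
}
```

The read fails only when every replica is unavailable, which is why the replication factor directly bounds how many simultaneous node failures a file can survive.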

MapReduce splits a job into independent map tasks, each processing one input split. A shuffle-and-sort phase then groups the intermediate output by key, and reduce tasks aggregate each group. In Hadoop 1.x a JobTracker (master) and TaskTrackers (workers) schedule, monitor, and re-run failed tasks; from Hadoop 2 onward YARN takes over those duties. Users define map and reduce functions, configure input and output paths, and submit the job as a JAR or executable.
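The three phases can be mimicked on ordinary Scala collections, which may help before reading the Hadoop version. This is only a single-process sketch: the real framework runs the tasks on different machines and moves intermediate data over the network. `MapReduceSketch` and its functions are hypothetical names chosen for this example.

```scala
object MapReduceSketch {
  // Map phase: turn one input line into (word, 1) pairs.
  def mapLine(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle and sort: group the intermediate pairs by key and order the keys.
  def shuffle(pairs: Seq[(String, Int)]): Seq[(String, Seq[Int])] =
    pairs.groupBy(_._1).toSeq.sortBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: collapse all values for one key into a single count.
  def reduce(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

  // Each element of `splits` stands in for one input split handled by one map task.
  def run(splits: Seq[Seq[String]]): Seq[(String, Int)] = {
    val intermediate = splits.flatMap(split => split.flatMap(mapLine))
    shuffle(intermediate).map { case (word, counts) => reduce(word, counts) }
  }

  def main(args: Array[String]): Unit = {
    val splits = Seq(Seq("to be or not", "to be"), Seq("be quick"))
    run(splits).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

Since `mapLine` looks at one line in isolation and `reduce` looks at one key in isolation, both steps can be spread across machines without coordination; only the shuffle needs to see data from every map task.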

Below is a complete Scala implementation of a word‑count MapReduce program that can be run on a Hadoop cluster:

import java.io.IOException
import java.util.StringTokenizer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}

// JavaConversions is deprecated; JavaConverters works on Scala 2.11 and 2.12.
// On Scala 2.13 and later, prefer scala.jdk.CollectionConverters._ instead.
import scala.collection.JavaConverters._

object WordCount {
  def main(args: Array[String]): Unit = {
    // Expect two arguments: the input path and an output path that does not exist yet.
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>")
      System.exit(2)
    }
    // Job.getInstance replaces the deprecated Job constructor.
    val job = Job.getInstance(new Configuration(), "WordCount")
    job.setJarByClass(classOf[WordMapper])
    job.setMapperClass(classOf[WordMapper])
    // Summation is associative and commutative, so the reducer can double as the combiner.
    job.setCombinerClass(classOf[WordReducer])
    job.setReducerClass(classOf[WordReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    job.setNumReduceTasks(1)
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    // Block until the job finishes and map success or failure to the exit code.
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

class WordMapper extends Mapper[Object, Text, Text, IntWritable] {
  // Reused across calls to avoid allocating new Writables for every token.
  private val one = new IntWritable(1)
  private val word = new Text()

  @throws[IOException]
  @throws[InterruptedException]
  override def map(key: Object, value: Text, context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    // Split the line on whitespace and emit (token, 1) for every token.
    val tokens = new StringTokenizer(value.toString)
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken())
      context.write(word, one)
    }
  }
}

class WordReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val result = new IntWritable()

  @throws[IOException]
  @throws[InterruptedException]
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    // Sum every count emitted for this word.
    result.set(values.asScala.map(_.get()).sum)
    context.write(key, result)
  }
}

YARN (Yet Another Resource Negotiator) manages cluster resources and schedules applications. It consists of a global ResourceManager, a NodeManager on every node, an ApplicationMaster for each application, and Containers that encapsulate allocated resources (CPU, memory, disk, network).

The ResourceManager has two main components: the Scheduler, which allocates resources to running applications, and the ApplicationsManager, which accepts job submissions and launches each application's ApplicationMaster. NodeManagers launch and monitor Containers on their node and report resource usage to the ResourceManager. Each ApplicationMaster negotiates Containers from the Scheduler and works with the NodeManagers to run its tasks inside them.
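To make the idea of Containers as resource bundles more concrete, here is a minimal sketch of a scheduler handing out memory and vcores reported by NodeManagers. It uses a simple first-fit rule, which is an assumption made for brevity and not how YARN's real schedulers decide. All names (`YarnAllocationSketch`, `NodeCapacity`, `Request`, `Container`, `allocate`) are hypothetical and unrelated to the actual YARN API.

```scala
object YarnAllocationSketch {
  // Free capacity reported by one NodeManager.
  final case class NodeCapacity(node: String, memoryMb: Int, vcores: Int)
  // A request from an ApplicationMaster for one container.
  final case class Request(app: String, memoryMb: Int, vcores: Int)
  // A granted container: a resource bundle pinned to one node.
  final case class Container(app: String, node: String, memoryMb: Int, vcores: Int)

  // First-fit: grant each request on the first node with enough free memory and vcores.
  // Returns the granted containers and the requests that could not be satisfied.
  def allocate(nodes: List[NodeCapacity], requests: List[Request]): (List[Container], List[Request]) = {
    val start = (nodes, List.empty[Container], List.empty[Request])
    val (_, granted, pending) = requests.foldLeft(start) { case ((free, ok, waiting), req) =>
      free.indexWhere(n => n.memoryMb >= req.memoryMb && n.vcores >= req.vcores) match {
        case -1 => (free, ok, req :: waiting) // no node fits, so the request stays pending
        case i =>
          val n = free(i)
          val shrunk = n.copy(memoryMb = n.memoryMb - req.memoryMb, vcores = n.vcores - req.vcores)
          (free.updated(i, shrunk), Container(req.app, n.node, req.memoryMb, req.vcores) :: ok, waiting)
      }
    }
    (granted.reverse, pending.reverse)
  }

  def main(args: Array[String]): Unit = {
    val nodes = List(NodeCapacity("nm1", 4096, 2), NodeCapacity("nm2", 8192, 4))
    val requests = List(Request("app1", 2048, 1), Request("app1", 2048, 1),
                        Request("app2", 4096, 2), Request("app2", 8192, 4))
    val (granted, pending) = allocate(nodes, requests)
    granted.foreach(println)
    println(s"pending: $pending")
  }
}
```

In this run the last request stays pending because no node has 8192 MB free any more, mirroring how an ApplicationMaster has to wait until Containers are released.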

In conclusion, the article offers a high‑level tour of Hadoop's core components, the design of a fault‑tolerant HDFS cluster, the MapReduce programming model with a runnable Scala example, and the YARN resource‑management architecture, laying a solid groundwork for further exploration of big‑data technologies.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
