Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data Technology & Architecture

Apr 2, 2019

Hadoop solves large‑scale distributed data processing using the MapReduce framework, which serves both as a programming model and a computation engine. Developers write programs based on the MapReduce model, then the framework distributes and runs them on a Hadoop cluster.

The MapReduce model consists of two simple yet powerful stages: map and reduce. Each map call receives a <key, value> pair, processes it, and emits intermediate <key, value> pairs; pairs with identical keys are then grouped and passed to the reduce stage, which aggregates the values.

As a concrete illustration, the classic WordCount program counts word frequencies in massive text collections. The mapper and reducer classes of the Java implementation are shown below (the job driver is omitted):

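import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: input is <offset, line of text>, output is <word, 1>.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();  // reused across calls to avoid allocating per token

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      // Split the line on whitespace and emit <word, 1> for every token.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: input is <word, [1, 1, ...]>, output is <word, total>.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum all counts emitted for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}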

The map function extracts each word from a line of text and emits <word, 1>. The framework groups identical words, forming <word, [1,1,1,…]>, which the reduce function sums to produce the final count <word, total>.
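To make the <word, 1> to <word, [1,1,1,…]> to <word, total> flow easier to follow, here is a minimal in-memory sketch that imitates the three steps with plain Java collections. It does not use Hadoop at all; the class name WordCountFlow and its method names are hypothetical, chosen only for this illustration, and the grouping that Hadoop performs across machines is approximated here by a single sorted map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory imitation of the WordCount data flow; no Hadoop classes are used. */
class WordCountFlow {

    /** Map step: split one line into words and emit a (word, 1) pair per token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return out;
    }

    /** Grouping step (done by the framework in Hadoop): collect the values of each key. */
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum the list of ones collected for a single word. */
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map over every line, groups the pairs by word, then reduces each group. */
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> totals = new TreeMap<>();
        group(intermediate).forEach((word, ones) -> totals.put(word, reduce(word, ones)));
        return totals;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(count(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

In a real job the map calls run on many machines and the grouping happens during the shuffle, but the per-record logic is the same as in this single-process version.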

Job execution in classic Hadoop (MapReduce 1) involves three kinds of processes: the user application that submits the job, the JobTracker that schedules map and reduce tasks, and the TaskTracker daemons (usually co-located with HDFS DataNodes) that run the individual map and reduce tasks. The workflow runs as follows: the application stores the job JAR in HDFS and submits the job to the JobTracker; the JobTracker builds a tree of tasks for the job; it then allocates tasks, where possible, to TaskTrackers on nodes that hold the relevant data blocks; finally, each map task reads its input data and each reduce task writes its output.
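The idea of allocating tasks to nodes that already hold the relevant data blocks can be sketched as a toy scheduler. This is only an illustration of the locality preference, not the JobTracker's actual code: the class LocalityScheduler, the MapTask type, the host names, and the fallback rule (hand out the oldest pending task when nothing is local) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy scheduler: hands out map tasks, preferring ones whose input block is on the asking node. */
class LocalityScheduler {

    /** A pending map task and the (hypothetical) hosts holding replicas of its input block. */
    static final class MapTask {
        final String id;
        final List<String> blockHosts;

        MapTask(String id, String... blockHosts) {
            this.id = id;
            this.blockHosts = Arrays.asList(blockHosts);
        }
    }

    private final List<MapTask> pending = new ArrayList<>();

    void submit(MapTask task) {
        pending.add(task);
    }

    /**
     * Called when a worker reports a free slot. Returns the id of a task whose block is local
     * to that host if one exists, otherwise the oldest pending task, or null if none remain.
     */
    String assign(String host) {
        for (Iterator<MapTask> it = pending.iterator(); it.hasNext();) {
            MapTask t = it.next();
            if (t.blockHosts.contains(host)) {
                it.remove();
                return t.id;
            }
        }
        return pending.isEmpty() ? null : pending.remove(0).id;
    }

    public static void main(String[] args) {
        LocalityScheduler s = new LocalityScheduler();
        s.submit(new MapTask("map-0", "node1", "node2"));
        s.submit(new MapTask("map-1", "node3", "node1"));
        s.submit(new MapTask("map-2", "node2", "node3"));
        System.out.println(s.assign("node3")); // map-1: first pending task with a replica on node3
        System.out.println(s.assign("node4")); // map-0: nothing local, falls back to the oldest task
        System.out.println(s.assign("node2")); // map-2
        System.out.println(s.assign("node2")); // null: no tasks left
    }
}
```

Moving the computation to the data in this way avoids copying input blocks over the network before a map task can start.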

Between the map and reduce phases, the framework performs a shuffle. Each map task writes its output to local disk, and a partitioner assigns each intermediate <key, value> pair to a specific reduce task, typically using the key's hash code. The default partitioner, HashPartitioner, implements this as:

/** Use {@link Object#hashCode()} to partition. */
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
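The following stand-alone sketch copies that arithmetic into a plain Java class so it can be tried without a cluster. The class name PartitionDemo and the naivePartition method are hypothetical; the latter exists only to show why the `& Integer.MAX_VALUE` mask is there: Java's `%` keeps the sign of the dividend, so a negative hash code would otherwise yield a negative, invalid partition number.

```java
/** Stand-alone copy of the hash-partitioning arithmetic, for experimentation without Hadoop. */
class PartitionDemo {

    /** Same expression as the partitioner above: clear the sign bit, then take the remainder. */
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Naive variant without the mask; can return a negative (invalid) partition number. */
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String word : new String[] {"hello", "world", "hadoop"}) {
            System.out.println(word + " -> reducer " + getPartition(word, reducers));
        }
        Integer negative = -7; // Integer.hashCode() is the int value itself
        System.out.println(naivePartition(negative, reducers)); // -3
        System.out.println(getPartition(negative, reducers));   // 1
    }
}
```

Because the result depends only on the key and the number of reduce tasks, every map task sends a given key to the same reducer, which is what makes the later grouping possible.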

During the shuffle, each reduce task fetches its partition of the map outputs over HTTP, then sorts and merges the data so that it receives all values for a given key together. The shuffle is often the most performance-critical part of a large-scale batch job, because it moves intermediate data through local disks and across the network, and understanding it is essential for writing efficient MapReduce programs.
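As a rough picture of what the shuffle achieves, the sketch below takes the output of several map tasks, routes every key to a reducer with the hash-partitioning expression, and builds a sorted, grouped input for each reducer. It is a single-process simplification under stated assumptions: the class ShuffleDemo and its methods are hypothetical, the HTTP transfer and on-disk merge are replaced by in-memory collections, and each emitted key stands for a <key, 1> pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the shuffle: partition map outputs, then group and sort per reducer. */
class ShuffleDemo {

    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * mapOutputs holds one list of emitted keys per map task (each key stands for a (key, 1) pair).
     * Returns, for every reducer, a sorted map from key to the values gathered from all map tasks.
     */
    static List<Map<String, List<Integer>>> shuffle(List<List<String>> mapOutputs, int numReducers) {
        List<Map<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            // TreeMap keeps keys in sorted order, standing in for the sort and merge step.
            reducerInputs.add(new TreeMap<String, List<Integer>>());
        }
        for (List<String> oneMapOutput : mapOutputs) {
            for (String key : oneMapOutput) {
                Map<String, List<Integer>> target = reducerInputs.get(partition(key, numReducers));
                target.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
            }
        }
        return reducerInputs;
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = Arrays.asList(
                Arrays.asList("hello", "world"),   // output of a first map task
                Arrays.asList("hello", "hadoop")); // output of a second map task
        List<Map<String, List<Integer>>> inputs = shuffle(mapOutputs, 2);
        for (int r = 0; r < inputs.size(); r++) {
            System.out.println("reducer " + r + " <- " + inputs.get(r));
        }
    }
}
```

Both occurrences of "hello" end up in the same reducer's input even though they came from different map tasks, which is the guarantee the reduce function relies on.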

In summary, MapReduce abstracts the complexities of data distribution, task scheduling, and inter‑node communication, allowing developers to focus on writing simple map and reduce functions while the framework handles the heavy lifting of distributed computation.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
