
Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide

This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.


What is MapReduce?

MapReduce is a programming model for processing large, complex data sets: the work is expressed as a map step that transforms input records into key-value pairs and a reduce step that aggregates them. The term is also used loosely for the family of technologies, such as Apache Hadoop, that implement this model.

When to Use MapReduce

MapReduce is especially suitable for big-data problems: it splits the processing work into small chunks that many machines handle in parallel, so data sets that would overwhelm a single traditional system can be processed in reasonable time.

Typical scenarios include:

Counting and statistics

Sorting

Filtering

Aggregation
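Before moving to a real cluster, the three phases — map, shuffle/sort, and reduce — can be sketched in-process with plain PHP. This is only an illustration of the data flow (the sample input is invented); on Hadoop each phase runs distributed across many nodes:

```php
<?php
// Minimal in-process sketch of the MapReduce phases, using word count.
$lines = ["the quick brown fox", "the lazy dog"];

// Map: emit one (word, 1) pair per word.
$pairs = [];
foreach ($lines as $line) {
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $pairs[] = [$word, 1];
    }
}

// Shuffle/sort: group the pairs by key, sorted by key.
$groups = [];
foreach ($pairs as $pair) {
    list($word, $count) = $pair;
    $groups[$word][] = $count;
}
ksort($groups);

// Reduce: sum the counts for each key, e.g. "the\t2".
foreach ($groups as $word => $counts) {
    printf("%s\t%d\n", $word, array_sum($counts));
}
```

The shuffle step is the part Hadoop performs for you between the mapper and reducer: it guarantees that all values for the same key arrive at one reducer, sorted.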

Apache Hadoop

This guide uses Apache Hadoop, the de‑facto standard open‑source platform for developing MapReduce solutions. Hadoop clusters can be rented or built on cloud providers such as Amazon, Google, or Microsoft.

Key advantages of Hadoop are:

Scalability – add new nodes without changing code

Cost‑effectiveness – runs on commodity hardware

Flexibility – schema‑less, can handle any data structure

Fault tolerance – failed nodes are taken over by others

Hadoop also supports “streaming” applications, allowing developers to choose the language for mapper and reducer scripts. In this article PHP is used as the primary language.

Hadoop Installation

Detailed installation and configuration of Apache Hadoop are beyond the scope of this article; readers can find many online resources for their platforms.

Mapper

The mapper converts input lines into key-value pairs. For the word-count example, it emits one word\t1 pair (word, tab, count of one) per word.

#!/usr/bin/php
<?php
// Mapper: read lines from STDIN and emit "word<TAB>1" for every word.
while ($line = fgets(STDIN)) {
    $line = trim($line);
    $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $key) {
        printf("%s\t%d\n", $key, 1);
    }
}
?>

Reducer

The reducer receives sorted key‑value pairs, aggregates the values, and outputs the final result.

#!/usr/bin/php
<?php
// Reducer: input arrives sorted by key, so counts for the same word
// are adjacent; sum them and emit one total per word.
$last_key = NULL;
$running_total = 0;
while ($line = fgets(STDIN)) {
    $line = trim($line);
    list($key, $count) = explode("\t", $line);
    if ($last_key === $key) {
        $running_total += $count;
    } else {
        if ($last_key !== NULL)
            printf("%s\t%d\n", $last_key, $running_total);
        $last_key = $key;
        $running_total = $count;
    }
}
if ($last_key !== NULL)
    printf("%s\t%d\n", $last_key, $running_total);
?>

Running the Word‑Count on Hadoop

Test locally:

head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php

Run on a Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -file ./mapper.php -mapper "./mapper.php" \
 -file ./reducer.php -reducer "./reducer.php" \
 -input "hello/mobydick.txt" \
 -output "hello/result"

(The -file options ship the local scripts to every node in the cluster.)

View the output:

hdfs dfs -cat hello/result/part-00000

[Figure: the mapping, shuffling, and reducing process]

Calculating Annual Gold Prices

A practical example computes the average yearly gold price from a small dataset, demonstrating that the same logic scales to larger collections.

Download the dataset and place it in HDFS:

wget -O data.csv https://raw.githubusercontent.../a.csv
hdfs dfs -mkdir goldprice
hdfs dfs -copyFromLocal ./data.csv goldprice/data.csv

Mapper extracts year and price:

#!/usr/bin/php
<?php
// Mapper: from each "YYYY-MM,price" record, emit "year<TAB>price".
while ($line = fgets(STDIN)) {
    $line = trim($line);
    preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);
    if ($matches) {
        printf("%s\t%.3f\n", $matches[1], $matches[2]);
    }
}
?>
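The regex is worth a quick sanity check. Assuming records of the form "YYYY-MM,price" (the sample line below is invented for illustration), the first lazy group captures everything before the first hyphen (the year) and the last group captures everything after the final comma (the price):

```php
<?php
// Check the mapper's extraction on one hypothetical record.
$line = "1950-01,34.73";
preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);
printf("%s\t%.3f\n", $matches[1], $matches[2]);  // prints "1950\t34.730"
```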

Reducer computes running average per year:

#!/usr/bin/php
<?php
// Reducer: keys arrive sorted, so all prices for one year are adjacent;
// keep a running total and item count to compute the year's average.
$last_key = NULL;
$running_total = 0;
$running_average = 0;
$number_of_items = 0;
while ($line = fgets(STDIN)) {
    $line = trim($line);
    list($key, $count) = explode("\t", $line);
    if ($last_key === $key) {
        $number_of_items++;
        $running_total += $count;
        $running_average = $running_total / $number_of_items;
    } else {
        if ($last_key !== NULL)
            printf("%s\t%.4f\n", $last_key, $running_average);
        $last_key = $key;
        $number_of_items = 1;
        $running_total = $count;
        $running_average = $count;
    }
}
if ($last_key !== NULL)
    printf("%s\t%.4f\n", $last_key, $running_average);
?>
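The running average maintained above is just the running total divided by the item count, so it always equals the plain arithmetic mean of the values seen so far. A small check with hypothetical prices for a single year:

```php
<?php
// Verify the reducer's running-average logic against a plain mean.
$prices = [34.730, 35.100, 34.900];  // invented sample values
$running_total = 0;
$number_of_items = 0;
foreach ($prices as $price) {
    $number_of_items++;
    $running_total += $price;
    $running_average = $running_total / $number_of_items;
}
printf("%.4f\n", $running_average);  // prints "34.9100"
```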

Run locally and on Hadoop using the same commands as the word‑count example, then retrieve the results.

Bonus: Generating Charts

Results can be visualized with gnuplot. After retrieving the output file:

hdfs dfs -get goldprice/result/part-00000 gold.dat

Create a gnuplot script (gold.plot) and run it with gnuplot gold.plot to produce chart.png:

# Gnuplot script for plotting yearly gold prices
set terminal png
set output "chart.png"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"
Translator: Du Jiang (21CTO community initiator)
Author: Glenn De Backer
Original: https://www.simplicity.be/article/big-data-php/

Tags: big data, data processing, PHP, MapReduce, Hadoop, Gnuplot