Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide
This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.
What is MapReduce?
MapReduce is a programming model and an associated set of tools for processing large and complex data sets: input is transformed into key‑value pairs (the map step), which are then grouped by key and aggregated (the reduce step).
When to Use MapReduce
MapReduce is especially suitable for big‑data problems. It splits processing work into tiny chunks that can be handled by many machines in parallel, offering faster performance than traditional software systems.
Typical scenarios include:
Counting and statistics
Sorting
Filtering
Aggregation
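Before touching Hadoop, the map → shuffle → reduce flow can be sketched in plain PHP. This is an in‑memory illustration of the idea only, not how Hadoop actually distributes work; the sample lines are made up:

```php
<?php
// Map: turn each input line into (word, 1) pairs.
$lines = ["the quick brown fox", "the lazy dog"];
$pairs = [];
foreach ($lines as $line) {
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $pairs[] = [$word, 1];
    }
}

// Shuffle/sort: order pairs by key, as Hadoop does between map and reduce,
// so that all pairs for the same word end up adjacent.
usort($pairs, fn($a, $b) => strcmp($a[0], $b[0]));

// Reduce: sum the values for each key.
$counts = [];
foreach ($pairs as [$word, $n]) {
    $counts[$word] = ($counts[$word] ?? 0) + $n;
}

print_r($counts);
```

On a cluster, the map and reduce steps run as separate processes on many machines, with the framework handling the sort and the data movement in between.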
Apache Hadoop
This guide uses Apache Hadoop, the de‑facto standard open‑source platform for developing MapReduce solutions. Hadoop clusters can be rented or built on cloud providers such as Amazon, Google, or Microsoft.
Key advantages of Hadoop are:
Scalability – add new nodes without changing code
Cost‑effectiveness – runs on commodity hardware
Flexibility – schema‑less, can handle any data structure
Fault tolerance – failed nodes are taken over by others
Hadoop also supports “streaming” applications, allowing developers to choose the language for mapper and reducer scripts. In this article PHP is used as the primary language.
Hadoop Installation
Detailed installation and configuration of Apache Hadoop are beyond the scope of this article; readers can find many online resources for their platforms.
Mapper
The mapper converts input lines into key‑value pairs. For a word‑count example, each word becomes word\t1.
#!/usr/bin/php
<?php
// Read input lines from standard input.
while ($line = fgets(STDIN)) {
    // Strip leading and trailing whitespace.
    $line = trim($line);
    // Split the line into words on whitespace.
    $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
    // Emit one "word<TAB>1" pair per word.
    foreach ($words as $key) {
        printf("%s\t%d\n", $key, 1);
    }
}
?>
Reducer
The reducer receives sorted key‑value pairs, aggregates the values, and outputs the final result.
#!/usr/bin/php
<?php
$last_key = NULL;
$running_total = 0;

// Input arrives sorted by key, so all counts for a word are adjacent.
while ($line = fgets(STDIN)) {
    $line = trim($line);
    list($key, $count) = explode("\t", $line);
    if ($last_key === $key) {
        // Same word as before: keep accumulating.
        $running_total += $count;
    } else {
        // New word: flush the previous word's total first.
        if ($last_key !== NULL)
            printf("%s\t%d\n", $last_key, $running_total);
        $last_key = $key;
        $running_total = $count;
    }
}
// Flush the final word.
if ($last_key !== NULL)
    printf("%s\t%d\n", $last_key, $running_total);
?>
Running the Word‑Count on Hadoop
Test locally:
head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php

Run on a Hadoop cluster:
hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
-mapper "./mapper.php" \
-reducer "./reducer.php" \
-input "hello/mobydick.txt" \
-output "hello/result"

View the output:

hdfs dfs -cat hello/result/part-00000

(Diagram: the mapping and reducing process)
Calculating Annual Gold Prices
A practical example computes the average yearly gold price from a small dataset, demonstrating that the same logic scales to larger collections.
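The mapper below assumes CSV rows shaped like 1950-01,34.73 (year‑month, then price); since the dataset URL is truncated, the exact column layout is an assumption here. A quick standalone check of the kind of regex the mapper uses to pull out year and price:

```php
<?php
// Hypothetical sample row: "YYYY-MM,price" — adjust to the real dataset layout.
$line = "1950-01,34.73";

// Lazily capture everything before the first "-" (the year),
// then everything after the last comma (the price).
if (preg_match('/^(.*?)\-(?:.*),(.*)$/', $line, $m)) {
    printf("%s\t%.3f\n", $m[1], $m[2]);  // emits "1950<TAB>34.730"
}
```

Testing the regex locally like this before submitting a job is cheap insurance against a cluster run that silently emits nothing.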
Download the dataset and place it in HDFS:
wget https://raw.githubusercontent.../a.csv
hdfs dfs -mkdir goldprice
hdfs dfs -copyFromLocal ./data.csv goldprice/data.csv

Mapper extracts year and price:
#!/usr/bin/php
<?php
while ($line = fgets(STDIN)) {
    $line = trim($line);
    // Capture the year (before the first "-") and the price (after the comma).
    preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);
    if ($matches) {
        printf("%s\t%.3f\n", $matches[1], $matches[2]);
    }
}
?>
Reducer computes the running average per year:
#!/usr/bin/php
<?php
$last_key = NULL;
$running_total = 0;
$running_average = 0;
$number_of_items = 0;

while ($line = fgets(STDIN)) {
    $line = trim($line);
    list($key, $count) = explode("\t", $line);
    if ($last_key === $key) {
        // Same year: update the running average incrementally.
        $number_of_items++;
        $running_total += $count;
        $running_average = $running_total / $number_of_items;
    } else {
        // New year: flush the previous year's average first.
        if ($last_key !== NULL)
            printf("%s\t%.3f\n", $last_key, $running_average);
        $last_key = $key;
        $number_of_items = 1;
        $running_total = $count;
        $running_average = $count;
    }
}
// Flush the final year.
if ($last_key !== NULL)
    printf("%s\t%.3f\n", $last_key, $running_average);
?>

Run locally and on Hadoop using the same commands as the word‑count example, then retrieve the results.
Bonus: Generating Charts
Results can be visualized with gnuplot. After retrieving the output file:
hdfs dfs -get goldprice/result/part-00000 gold.dat

Create a gnuplot script (gold.plot) and run it to produce chart.jpg:
# Gnuplot script file for generating gold prices
set terminal jpeg
set output "chart.jpg"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"

Translator: Du Jiang (21CTO community initiator). Author: Glenn De Backer. Original: https://www.simplicity.be/article/big-data-php/
