
How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.

ITPUB

Dec 14, 2023


Python and Hadoop have long been a classic combination for large‑scale data processing. The article begins with an overview of Hadoop’s core components—HDFS for distributed storage, MapReduce for parallel computation, and YARN for resource scheduling—explaining how they work together to handle data that exceeds the capacity of a single machine.

HDFS Distributed File System

HDFS splits large files into fixed-size blocks (128 MB by default in Hadoop 3.x), stores each block on multiple DataNodes, and keeps the file-to-block metadata in the NameNode. When a client writes a file, it asks the NameNode for block allocations and then writes the blocks to the assigned DataNodes. By default the system keeps three replicas of every block for fault tolerance.

Reading data follows a simple flow: the client requests metadata from the NameNode, receives block locations, and then reads the blocks directly from the appropriate DataNodes.
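The block and replica bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the 128 MB block size, the DataNode names, and the round-robin placement are assumptions for illustration (real HDFS uses a rack-aware placement policy).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # default replica count mentioned above


def plan_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the last block only holds the remaining bytes
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to distinct DataNodes.

    The returned dict (block index -> DataNode names) mimics the kind of
    metadata a NameNode hands back to a reading client.
    """
    copies = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + k) % len(datanodes)] for k in range(copies)]
        for block in range(num_blocks)
    }


sizes = plan_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
layout = place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
print(len(sizes), "blocks (MB):", [s // (1024 * 1024) for s in sizes])
for block, nodes in layout.items():
    print("block", block, "->", nodes)
```

On a single-node setup there is only one DataNode, so `place_replicas` would return one copy per block, which is why such installations usually cannot honor a replication factor of three.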

MapReduce Framework

MapReduce divides a complex job into a map phase, a shuffle phase, and a reduce phase, with an optional combiner that pre-aggregates map output before the shuffle. The map phase processes input records and emits intermediate key/value pairs (tab-separated text lines under Hadoop Streaming); the shuffle phase sorts the pairs and groups the values by key; the reduce phase aggregates the values for each key and writes the final results back to HDFS.
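To make the data flow concrete, here is a minimal in-memory sketch of the phases, assuming a word-count job. The function names and the sample records are illustrative and not part of Hadoop; a real job runs these phases in parallel across many tasks.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (word, 1) pair per word, like a word-count mapper."""
    for record in records:
        for word in record.split():
            yield word, 1


def combine(pairs):
    """Optional local pre-aggregation of one mapper's output."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle(pairs):
    """Group all values by key and hand the keys over in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Aggregate the grouped values into one result per key."""
    return [(key, sum(values)) for key, values in groups]


records = ["a b a", "b c"]  # illustrative input, not from the article
result = reduce_phase(shuffle(combine(map_phase(records))))
print(result)
```

Dropping the `combine` call leaves the result unchanged; the combiner only reduces how much data has to cross the shuffle.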

YARN Resource Management

YARN coordinates resources across the cluster: a client submits an application to the ResourceManager, which allocates a container for the ApplicationMaster; the ApplicationMaster then negotiates further containers, launches tasks in them through the NodeManagers, and deregisters from the ResourceManager once the job completes.

Setting Up a Single‑Node Hadoop Environment

Assuming a CentOS 7 server with JDK 1.8 installed, the guide creates a dedicated hadoop user and group, assigns directory permissions, grants sudo rights, sets a password, disables the firewall, and configures password‑less SSH for the hadoop user.

# run as root
groupadd hadoop
useradd -r -g hadoop hadoop
mkdir -p /home/hadoop
chown -R hadoop:hadoop /usr/local/
chown -R hadoop:hadoop /home/hadoop   # only the hadoop user's own home
# /tmp is already world-writable, so no chown is needed there
vim /etc/sudoers   # add "hadoop ALL=(ALL) ALL"
passwd hadoop
systemctl stop firewalld
systemctl disable firewalld
# run the remaining commands as the hadoop user (su - hadoop)
ssh-keygen -t rsa
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chmod 700 /home/hadoop/.ssh
chmod 600 /home/hadoop/.ssh/authorized_keys
# replace <hostname> with this machine's host name
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub <hostname>

Installing Hadoop

Download Hadoop 3.2.0, extract it into /usr/local, add HADOOP_HOME and the updated PATH to /etc/profile, then source the profile to apply the variables.

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -zxvf hadoop-3.2.0.tar.gz -C /usr/local
# append the next two lines to /etc/profile
export HADOOP_HOME=/usr/local/hadoop-3.2.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# reload the profile in the current shell
source /etc/profile

Verify the installation with hadoop version, which should display the Hadoop version and repository information.

Installing Python 3

Because CentOS 7 ships with Python 2.7, the guide compiles Python 3.7.4 from source after installing required development libraries.

yum install -y gcc make zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel libffi-devel readline-devel tk-devel
wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
tar -zxvf Python-3.7.4.tgz
cd Python-3.7.4
./configure
make && make install
python3 --version   # should show 3.7.4

Writing the MapReduce Programs in Python

The mapper reads each line from standard input, splits it into words, and emits word\t1 for each word.

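import sys

# Hadoop Streaming feeds each input line to the mapper on standard input
for line in sys.stdin:
    # split() without arguments splits on any whitespace and drops empty strings
    for word in line.split():
        # emit "word<TAB>1"; the tab separates the key from the value
        print('%s\t%s' % (word, 1))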

The reducer sums the counts for each word. It relies on Hadoop delivering the mapper output sorted by key, so all lines for one word arrive consecutively.

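import sys

handler_word = None   # word whose counts are currently being summed
handler_count = 0

for line in sys.stdin:
    try:
        # a line without a tab or with a non-numeric count raises ValueError
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue
    if handler_word == word:
        handler_count += count
    else:
        # input is sorted by key, so a new word means the previous one is complete
        if handler_word is not None:
            print('%s\t%s' % (handler_word, handler_count))
        handler_word = word
        handler_count = count

# flush the last word
if handler_word is not None:
    print('%s\t%s' % (handler_word, handler_count))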
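Because the reducer's input arrives sorted by key, the same aggregation can also be written with itertools.groupby. The following is an alternative sketch rather than the article's reducer; the function names and the sample lines are made up for illustration.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping lines that are not 'word<TAB>int'."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab, so not a key/value line
        try:
            value = int(count)
        except ValueError:
            continue  # non-numeric count
        yield word, value


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Sorted sample input, including one malformed line that gets skipped
sample = ["flume\t1\n", "flume\t1\n", "hadoop\t1\n", "broken line\n", "hive\t1\n"]
for word, total in reduce_counts(sample):
    print("%s\t%d" % (word, total))
```

In an actual reducer script the loop would iterate over `reduce_counts(sys.stdin)` instead of the sample list. groupby only merges adjacent equal keys, so it depends on sorted input in the same way as the hand-written version.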

Preparing Input Data

Create /home/hadoop/input/data.input with sample words:

hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm

Running the Job with Hadoop Streaming

Execute the following command to launch the Python MapReduce job:

hadoop jar /usr/local/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-streaming-3.2.0.jar \
  -file /home/hadoop/python/mapper.py -mapper "python3 mapper.py" \
  -file /home/hadoop/python/reducer.py -reducer "python3 reducer.py" \
  -input /home/hadoop/input/data.input -output /home/hadoop/output

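hadoop jar runs the streaming jar. Each -file option ships a local script to the tasks' working directory, which is why -mapper and -reducer can refer to the scripts by bare file name. -input names the input file and -output the output directory, which must not exist yet or the job fails. Hadoop 3.x reports -file as deprecated in favor of the generic -files option, but still accepts it.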
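If the job is launched from a script rather than typed by hand, the argument list can be assembled in Python. This is a sketch under the paths used in this guide; `build_streaming_command` is a hypothetical helper, not part of Hadoop, and it only builds and prints the command.

```python
import os
import shlex

# Path of the streaming jar as used in this guide
STREAMING_JAR = "/usr/local/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-streaming-3.2.0.jar"


def build_streaming_command(mapper, reducer, input_path, output_path,
                            jar=STREAMING_JAR, interpreter="python3"):
    """Return the argument list for the streaming job (hypothetical helper)."""
    return [
        "hadoop", "jar", jar,
        # ship each script, then call it by its bare file name
        "-file", mapper,
        "-mapper", "%s %s" % (interpreter, os.path.basename(mapper)),
        "-file", reducer,
        "-reducer", "%s %s" % (interpreter, os.path.basename(reducer)),
        "-input", input_path,
        "-output", output_path,
    ]


command = build_streaming_command(
    "/home/hadoop/python/mapper.py",
    "/home/hadoop/python/reducer.py",
    "/home/hadoop/input/data.input",
    "/home/hadoop/output",
)
# Print a shell-safe rendering; subprocess.run(command, check=True) would launch it
print(" ".join(shlex.quote(arg) for arg in command))
```

Passing the arguments as a list avoids shell quoting problems with the "python3 mapper.py" values, which must reach Hadoop as single arguments.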

Verifying the Result

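After the job finishes, list the output directory and view part-00000 to see the word counts. The paths in this guide resolve against Hadoop's default filesystem; the guide does not change fs.defaultFS, and with the stock value (file:///) that is the local disk, which is why plain ll and cat work here. On an installation configured for HDFS, use hdfs dfs -ls and hdfs dfs -cat instead.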

ll /home/hadoop/output
cat /home/hadoop/output/part-00000
# Expected output (example)
flume   2
hadoop  3
hbase   1
hive    2
kafka   1
mapreduce       1
spark   2
sqoop   1
storm   2
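As a cross-check that does not involve Hadoop at all, the expected counts can be recomputed from the sample input with collections.Counter. This is a small sketch; `parse_part_file` is a hypothetical helper for reading the tab-separated job output.

```python
from collections import Counter

# The sample input from the "Preparing Input Data" step
INPUT_TEXT = """\
hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm
"""


def expected_counts(text):
    """Count whitespace-separated words, mirroring the mapper's split()."""
    return Counter(text.split())


def parse_part_file(lines):
    """Turn 'word<TAB>count' lines from part-00000 into a dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t", 1)
        counts[word] = int(count)
    return counts


expected = expected_counts(INPUT_TEXT)
for word in sorted(expected):
    print("%s\t%d" % (word, expected[word]))

# With the real job output available, a comparison could look like:
# with open("/home/hadoop/output/part-00000") as handle:
#     assert parse_part_file(handle) == dict(expected)
```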

Conclusion

Even in the era of large language models like ChatGPT, offline batch processing remains essential for handling massive datasets. Hadoop’s reliable HDFS storage and flexible MapReduce model, combined with Python’s ease of use for data analysis, continue to provide a powerful solution for large‑scale word‑count and other batch analytics tasks.
