
Master Word Count with Python & Hadoop: A Step‑by‑Step Guide

This tutorial walks you through Hadoop’s core components, sets up a single‑node Hadoop cluster on CentOS 7, installs Python 3, writes mapper and reducer scripts in Python, and runs a Hadoop‑Streaming word‑count job to demonstrate classic big‑data processing techniques.

Tencent Cloud Developer

Introduction

This article explains why the Python‑Hadoop combination remains valuable for big‑data processing in 2023, using a hands‑on word‑count project to illustrate the underlying technology.

Hadoop Principles and Architecture

Hadoop is an open‑source distributed computing framework maintained by the Apache Software Foundation. Its two original core components are HDFS, a distributed file system, and MapReduce, a parallel computation engine. Hadoop Common supplies the shared libraries the other modules depend on, and YARN handles cluster resource scheduling. Together they provide reliable storage and computation for large‑scale data processing.

HDFS Distributed File System

When a large file is written to HDFS, the client first asks the NameNode to create the file's metadata. The file is then split into blocks (128 MB by default), and each block is stored on a DataNode chosen by the NameNode and replicated to other DataNodes, three replicas by default, for fault tolerance.
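The block arithmetic can be sketched in a few lines of Python. This is only an illustration of the splitting rule described above, not HDFS code; the `plan_blocks` helper and the 300 MB file size are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)
REPLICATION = 3                 # default replication factor


def plan_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block a file would be split into."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


file_size = 300 * 1024 * 1024           # hypothetical 300 MB file
blocks = plan_blocks(file_size)
print(len(blocks))                      # 3 blocks: 128 MB, 128 MB, 44 MB
print(sum(blocks) * REPLICATION)        # raw bytes stored across all replicas
```

The last block only occupies as much space as it actually holds, so a 300 MB file costs roughly 900 MB of raw storage with three replicas.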

The read flow involves the client requesting block locations from the NameNode, then reading the blocks directly from the DataNodes.

Client requests metadata from NameNode.

NameNode returns block locations.

Client reads blocks from the appropriate DataNodes.

After all blocks are read, the client closes the input stream.
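The read steps above can be mimicked with a toy in-memory model. Everything here (the `NAMENODE` and `DATANODES` dictionaries, the node names, the `read_file` helper) is hypothetical and exists only to show the division of labour: the NameNode answers with locations, and the data itself comes from the DataNodes.

```python
# Toy model: file path -> ordered list of (block_id, DataNodes holding a replica)
NAMENODE = {
    "/data/sample.txt": [("blk_1", ["dn1", "dn2", "dn3"]),
                         ("blk_2", ["dn2", "dn3", "dn4"])],
}
# Toy model: DataNode name -> {block_id: block contents}
DATANODES = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"hdfs"},
    "dn3": {"blk_1": b"hello ", "blk_2": b"hdfs"},
    "dn4": {"blk_2": b"hdfs"},
}


def read_file(path, unavailable=()):
    """Look up block locations, then read each block from the first live replica."""
    data = b""
    for block_id, replicas in NAMENODE[path]:   # metadata request and response
        for node in replicas:                   # read directly from a DataNode
            if node not in unavailable:
                data += DATANODES[node][block_id]
                break
        else:
            raise IOError("no live replica for " + block_id)
    return data


print(read_file("/data/sample.txt"))                       # b'hello hdfs'
print(read_file("/data/sample.txt", unavailable={"dn1"}))  # b'hello hdfs'
```

The second call shows why replicas matter: with one node marked unavailable, the read still succeeds from another copy.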

MapReduce Distributed Computing Framework

MapReduce divides a large computation into map and reduce phases. The data flow is input → map → (optional combiner) → shuffle → reduce → output. The map phase processes each input record and emits key/value pairs (in Hadoop Streaming, one key\tvalue line per pair); an optional combiner pre‑aggregates the map output locally; the shuffle phase sorts the pairs and groups values by key; the reduce phase aggregates the values for each key and writes the final result back to HDFS.
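To make the phases concrete, here is a minimal single-process sketch of the same flow. It runs entirely in memory and is not how Hadoop executes a job; the function names and the two sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key and order the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["hadoop spark hadoop", "spark hive"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'hadoop': 2, 'hive': 1, 'spark': 2}
```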

YARN Resource Management

YARN handles resource allocation and job scheduling. The typical workflow is:

Client submits an application request to the ResourceManager.

ResourceManager allocates resources and starts an ApplicationMaster on a NodeManager.

ApplicationMaster registers with the ResourceManager, requests containers from it, and launches the tasks in those containers through the NodeManagers.

Containers report status back to the ApplicationMaster.

When all tasks finish, the ApplicationMaster deregisters.
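The message sequence above can be written down as a toy trace. This is a simplified model with hypothetical names (`run_application`, the task labels); it assumes every task succeeds and ignores retries, heartbeats, and scheduling queues.

```python
def run_application(task_names):
    """Return the ordered list of events for one toy application run."""
    events = [
        "client -> ResourceManager: submit application",
        "ResourceManager -> NodeManager: start ApplicationMaster",
        "ApplicationMaster -> ResourceManager: register",
    ]
    status = {}
    for name in task_names:
        events.append("ApplicationMaster: launch container for " + name)
        status[name] = "RUNNING"
    for name in task_names:
        # In this toy model every container finishes successfully.
        status[name] = "SUCCEEDED"
        events.append("container(" + name + ") -> ApplicationMaster: SUCCEEDED")
    if all(state == "SUCCEEDED" for state in status.values()):
        events.append("ApplicationMaster -> ResourceManager: deregister")
    return events


for event in run_application(["map-0", "reduce-0"]):
    print(event)
```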

Setting Up a Single‑Node Hadoop Environment

Assuming a fresh CentOS 7 installation with JDK 1.8, the steps are:

Create a hadoop group and user:

groupadd hadoop
useradd -r -g hadoop hadoop

Create the home directory and give the hadoop user ownership of /usr/local, /tmp, and /home:

mkdir -p /home/hadoop
chown -R hadoop:hadoop /usr/local/
chown -R hadoop:hadoop /tmp/
chown -R hadoop:hadoop /home/

Grant sudo rights by editing /etc/sudoers and adding:

hadoop ALL=(ALL) ALL

Set a password for the Hadoop user:

passwd hadoop

Disable the firewall:

systemctl stop firewalld
systemctl disable firewalld

Configure password‑less SSH for the Hadoop user:

ssh-keygen -t rsa
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chmod 700 /home/hadoop/.ssh
chmod 644 /home/hadoop/.ssh/authorized_keys
chmod 600 /home/hadoop/.ssh/id_rsa
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hostname_or_IP

Install Hadoop in Local Mode

Download Hadoop and extract it into /usr/local, the location that HADOOP_HOME points to:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -zxvf hadoop-3.2.0.tar.gz -C /usr/local

Add the following to /etc/profile and source it:

HADOOP_HOME=/usr/local/hadoop-3.2.0
PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export PATH HADOOP_HOME

Set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh to point to your JDK directory, e.g., /usr/local/jdk1.8.0_321. Verify the installation with hadoop version.
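If you prefer to script the check, the version can be read from the command's output. The sketch below assumes the first line of `hadoop version` has the form `Hadoop <version>`; the `parse_hadoop_version` helper is hypothetical, and `capture_output` requires Python 3.7 or later.

```python
import re
import shutil
import subprocess


def parse_hadoop_version(output):
    """Extract the version string from the first line of `hadoop version` output."""
    lines = output.strip().splitlines()
    match = re.match(r"Hadoop\s+(\S+)", lines[0]) if lines else None
    return match.group(1) if match else None


if shutil.which("hadoop"):  # only run the real command when it is on PATH
    result = subprocess.run(["hadoop", "version"], capture_output=True, text=True)
    print(parse_hadoop_version(result.stdout))
else:
    print("hadoop not found on PATH; was /etc/profile sourced?")
```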

Installing Python 3

CentOS 7 ships with Python 2.7.5. To install Python 3.7.4:

Install build dependencies:

yum install -y zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel libffi-devel readline-devel tk-devel

Download and extract Python source:

wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
tar -zxvf Python-3.7.4.tgz

Compile and install:

cd Python-3.7.4
./configure
make && make install

Confirm the installation with python3 --version.

Writing the MapReduce Programs in Python

Mapper (mapper.py)

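import sys

# Read input lines from stdin and emit "word<TAB>1" for every word.
for line in sys.stdin:
    # split() with no arguments also discards leading and trailing whitespace.
    for word in line.split():
        print('%s\t%s' % (word, 1))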

Reducer (reducer.py)

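import sys

# Hadoop Streaming sorts mapper output by key, so identical words arrive consecutively.
handler_word = None
handler_count = 0
for line in sys.stdin:
    line = line.strip()
    try:
        # Skip lines that lack a tab separator or a numeric count.
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue
    if handler_word == word:
        handler_count += count
    else:
        # A new word begins: emit the total for the previous one.
        if handler_word is not None:
            print('%s\t%s' % (handler_word, handler_count))
        handler_word = word
        handler_count = count
# Emit the total for the final word.
if handler_word is not None:
    print('%s\t%s' % (handler_word, handler_count))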
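As an alternative to tracking the current word by hand, the same reduction can be expressed with `itertools.groupby`. This is a sketch of a different style, not the reducer used in this article; the `parse` and `reduce_counts` names are hypothetical, and like the reducer above it relies on the input already being sorted by word.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping lines that are not 'word<TAB>int'."""
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        try:
            value = int(count)
        except ValueError:
            continue
        yield word, value


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


sample = ["flume\t1\n", "flume\t1\n", "hadoop\t1\n", "not a valid line\n"]
for word, total in reduce_counts(sample):
    print("%s\t%s" % (word, total))
```

In a streaming job, `sys.stdin` would be passed to `reduce_counts` in place of the `sample` list.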

Preparing Input Data

Create /home/hadoop/input/data.input with sample text:

hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm
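Before running the job, it can help to know what the answer should be. The short check below counts the same sample text with `collections.Counter`; it is only a reference calculation to compare against the job output and involves no Hadoop at all.

```python
from collections import Counter

# The same four lines as data.input.
SAMPLE = """\
hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm
"""

expected = Counter(SAMPLE.split())
for word in sorted(expected):
    print("%s\t%d" % (word, expected[word]))
```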

Running the Hadoop‑Streaming Job

Execute the following command as the hadoop user. The output directory must not already exist, or the job will fail:

hadoop jar /usr/local/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-streaming-3.2.0.jar \
  -file /home/hadoop/python/mapper.py -mapper "python3 mapper.py" \
  -file /home/hadoop/python/reducer.py -reducer "python3 reducer.py" \
  -input /home/hadoop/input/data.input -output /home/hadoop/output

The job logs show 100% map and reduce completion, confirming successful execution.
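If the job needs to be launched from a script, the same command can be assembled as an argument list. This sketch only rebuilds the command shown above from its parts; the `build_streaming_command` helper is hypothetical and uses no options beyond those already in the command.

```python
import os
import shlex


def build_streaming_command(jar, mapper, reducer, input_path, output_path):
    """Return the argument list for the streaming job shown above."""
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", "python3 " + os.path.basename(mapper),
        "-file", reducer, "-reducer", "python3 " + os.path.basename(reducer),
        "-input", input_path, "-output", output_path,
    ]


cmd = build_streaming_command(
    "/usr/local/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-streaming-3.2.0.jar",
    "/home/hadoop/python/mapper.py",
    "/home/hadoop/python/reducer.py",
    "/home/hadoop/input/data.input",
    "/home/hadoop/output",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the job and raise an exception if Hadoop exits with a non-zero status.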

Verifying the Result

List the output directory and view part-00000:

ll /home/hadoop/output
cat /home/hadoop/output/part-00000

The file contains word‑frequency pairs, e.g.:

flume   2
hadoop  3
hbase   1
hive    2
kafka   1
mapreduce       1
spark   2
sqoop   1
storm   2
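To work with the result programmatically, the output lines can be parsed back into a dictionary. The sketch below assumes each line is a word followed by whitespace and an integer count, as in the listing above; the `parse_counts` helper and the shortened sample string are hypothetical.

```python
def parse_counts(text):
    """Parse 'word<TAB>count' lines into a dict, ignoring blank lines."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last run of whitespace so the count is always the final field.
        word, count = line.rsplit(None, 1)
        counts[word.strip()] = int(count)
    return counts


sample_output = "flume\t2\nhadoop\t3\nhbase\t1\n"
counts = parse_counts(sample_output)
print(counts["hadoop"])               # 3
print(max(counts, key=counts.get))    # hadoop
```

Reading /home/hadoop/output/part-00000 and passing its contents to `parse_counts` would give the full word-frequency table.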

Conclusion

Even with modern large‑language models, offline batch processing remains essential for massive data validation and analysis. Hadoop’s HDFS provides reliable storage, while its support for multiple languages—including Python—makes it a robust choice for big‑data workflows. The Python‑Hadoop combination thus continues to be a practical solution for large‑scale data processing in the AI era.
