How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster
This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.
Python and Hadoop have long been a classic combination for large‑scale data processing. The article begins with an overview of Hadoop’s core components—HDFS for distributed storage, MapReduce for parallel computation, and YARN for resource scheduling—explaining how they work together to handle data that exceeds the capacity of a single machine.
HDFS Distributed File System
HDFS splits large files into fixed‑size blocks, stores each block on multiple DataNodes, and maintains metadata in the NameNode. When a client writes a file, it contacts the NameNode for block allocation, writes blocks to DataNodes, and the system keeps three replicas by default for fault tolerance.
Reading data follows a simple flow: the client requests metadata from the NameNode, receives block locations, and then reads the blocks directly from the appropriate DataNodes.
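To make the block arithmetic concrete, here is a toy Python model of how a file could be cut into blocks and each block assigned to three DataNodes. It is only a sketch: 128 MB and 3 are the Hadoop 3 defaults for block size and replication, but the helper plan_blocks, the DataNode names, and the round-robin placement are assumptions made for illustration. Real HDFS placement is decided by the NameNode and is rack-aware.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize in Hadoop 3 (128 MB)
REPLICATION = 3                 # default dfs.replication


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes) tuples.

    Hypothetical helper: placement is simple round-robin, not HDFS's real policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - i * block_size)
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # made-up DataNode names
    for block in plan_blocks(300 * 1024 * 1024, nodes):
        print(block)
```

A 300 MB file yields two full 128 MB blocks and one shorter final block, each listed with the three nodes that would hold its replicas in this toy model.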
MapReduce Framework
MapReduce divides a job into a map phase, a shuffle-and-sort phase, and a reduce phase, with an optional combiner that pre-aggregates map output before it is shuffled. The map phase processes input records and emits intermediate key\tvalue pairs; the shuffle phase sorts the pairs and groups values by key; the reduce phase aggregates the values for each key and writes the final results back to HDFS.
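The three phases can be mimicked in a few lines of ordinary Python. The sketch below runs entirely in memory on one machine, so it illustrates the data flow only, not Hadoop's distribution or fault tolerance. The names run_mapreduce, wc_map, and wc_reduce are made up for this example.

```python
from collections import defaultdict


def wc_map(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    """Reduce: sum all counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every record and collect the emitted pairs.
    pairs = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: handle keys in sorted order, one reduce call per key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    lines = ["hadoop hive", "hive spark hadoop hive"]
    print(run_mapreduce(lines, wc_map, wc_reduce))
```

Keeping the map and reduce functions separate from the driver mirrors how Hadoop lets the job author supply only those two pieces while the framework handles grouping.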
YARN Resource Management
YARN coordinates resources across the cluster: a client submits an application to the ResourceManager, which allocates a container in which the ApplicationMaster starts; the ApplicationMaster then negotiates further containers from the ResourceManager, launches tasks in them through the NodeManagers, and deregisters from the ResourceManager once the job completes.
Setting Up a Single‑Node Hadoop Environment
Assuming a CentOS 7 server with JDK 1.8 installed, the guide creates a dedicated hadoop user and group, assigns directory permissions, grants sudo rights, sets a password, disables the firewall, and configures password‑less SSH for the hadoop user.
groupadd hadoop
useradd -r -g hadoop hadoop
mkdir -p /home/hadoop
chown -R hadoop.hadoop /usr/local/
chown -R hadoop.hadoop /tmp/
chown -R hadoop.hadoop /home/
vim /etc/sudoers # add "hadoop ALL=(ALL) ALL"
passwd hadoop
systemctl stop firewalld
systemctl disable firewalld
ssh-keygen -t rsa
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chmod 700 /home/hadoop/.ssh
chmod 644 /home/hadoop/.ssh/authorized_keys
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hostname # replace hostname with this server's host name
Installing Hadoop
Download Hadoop 3.2.0 and extract it under /usr/local so that it matches the HADOOP_HOME path used here, add the HADOOP_HOME and PATH exports to /etc/profile, then source the profile to apply the variables.
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -zxvf hadoop-3.2.0.tar.gz
export HADOOP_HOME=/usr/local/hadoop-3.2.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
source /etc/profile
Verify the installation with hadoop version, which should display the Hadoop version and repository information.
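The same check can be scripted. The following sketch uses only the Python standard library (it needs Python 3.7 or later, so run it once python3 is available) to report whether HADOOP_HOME is set and whether the hadoop launcher is reachable on PATH, and it calls hadoop version only if the launcher is found. check_hadoop is a hypothetical helper name, not part of Hadoop.

```python
import os
import shutil
import subprocess


def check_hadoop():
    """Return a dict describing the local Hadoop installation (hypothetical helper)."""
    info = {
        "HADOOP_HOME": os.environ.get("HADOOP_HOME"),  # None if the variable is unset
        "launcher": shutil.which("hadoop"),            # None if hadoop is not on PATH
        "version": None,
    }
    if info["launcher"]:
        # Equivalent to typing `hadoop version` in the shell.
        result = subprocess.run([info["launcher"], "version"],
                                capture_output=True, text=True)
        if result.returncode == 0 and result.stdout:
            # The first output line names the release.
            info["version"] = result.stdout.splitlines()[0]
    return info


if __name__ == "__main__":
    for key, value in check_hadoop().items():
        print(f"{key}: {value}")
```

A launcher value of None usually means /etc/profile has not been sourced in the current shell or the PATH export is missing.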
Installing Python 3
Because CentOS 7 ships with Python 2.7, the guide compiles Python 3.7.4 from source after installing required development libraries.
yum install -y zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel zlib* libffi-devel readline-devel tk-devel
wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
tar -zxvf Python-3.7.4.tgz
cd Python-3.7.4
./configure
make && make install
python3 --version # should show 3.7.4
Writing the MapReduce Programs in Python
The mapper reads each line from standard input, splits it into words, and emits word\t1 for each word.
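import sys

# Hadoop streaming feeds the input to the mapper line by line on standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit one tab-separated "word<TAB>1" pair per word.
        print('%s\t%s' % (word, 1))
The reducer aggregates the counts for each word, relying on the mapper output arriving sorted by word, which Hadoop's shuffle guarantees.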
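import sys

handler_word = None  # word currently being counted
handler_count = 0    # running total for handler_word

for line in sys.stdin:
    line = line.strip()
    try:
        # Each input line is "word<TAB>count"; skip anything malformed.
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue
    if handler_word == word:
        handler_count += count
    else:
        # Input is sorted by word, so a new word means the previous one is complete.
        if handler_word:
            print('%s\t%s' % (handler_word, handler_count))
        handler_word = word
        handler_count = count

# Emit the total for the last word.
if handler_word:
    print('%s\t%s' % (handler_word, handler_count))
Preparing Input Data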
Create /home/hadoop/input/data.input with sample words:
hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm
Running the Job with Hadoop Streaming
Execute the following command to launch the Python MapReduce job:
hadoop jar /usr/local/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-streaming-3.2.0.jar \
-file /home/hadoop/python/mapper.py -mapper "python3 mapper.py" \
-file /home/hadoop/python/reducer.py -reducer "python3 reducer.py" \
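-input /home/hadoop/input/data.input -output /home/hadoop/output
The command names the streaming jar, ships the two scripts with the job (-file), sets the commands that run them (-mapper and -reducer), and gives the input file (-input) and the output directory (-output), which must not exist before the job runs.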
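The same pipeline can be rehearsed without a cluster. Hadoop streaming passes text lines to the mapper, sorts the mapper output by key, and passes the sorted lines to the reducer; the sketch below reproduces that sequence in one Python process using the sample data from this article. The function names are made up for the example, and the reducer uses itertools.groupby rather than the hand-written loop in reducer.py, so treat it as an equivalent illustration and not as the article's script.

```python
from itertools import groupby

SAMPLE = """hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm"""


def map_lines(lines):
    """Mimic mapper.py: one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum counts for runs of identical words; input must be sorted by word."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(text):
    mapped = map_lines(text.splitlines())
    shuffled = sorted(mapped)  # stands in for Hadoop's sort between the phases
    return list(reduce_lines(shuffled))


if __name__ == "__main__":
    print("\n".join(run_local(SAMPLE)))
```

With the real scripts, the equivalent rehearsal is the shell pipeline cat /home/hadoop/input/data.input | python3 mapper.py | sort | python3 reducer.py, run from /home/hadoop/python.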
Verifying the Result
After the job finishes, list the output directory and view part-00000 to see word counts:
ll /home/hadoop/output
cat /home/hadoop/output/part-00000
# Expected output (example)
flume 2
hadoop 3
hbase 1
hive 2
kafka 1
mapreduce 1
spark 2
sqoop 1
storm 2
Conclusion
Even in the era of large language models like ChatGPT, offline batch processing remains essential for handling massive datasets. Hadoop’s reliable HDFS storage and flexible MapReduce model, combined with Python’s ease of use for data analysis, continue to provide a powerful solution for large‑scale word‑count and other batch analytics tasks.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
