
Introduction to Hadoop: Architecture, HDFS, MapReduce, and Common Commands

This article introduces Hadoop as a widely used big‑data framework, explains its core components HDFS and MapReduce, describes the cluster node roles, presents typical command‑line usage and a sample MapReduce workflow, and offers guidance for further learning.

360 Quality & Efficiency

As product lines generate and depend on more data, QA engineers increasingly need to understand large‑scale data processing tools. This article introduces one of the most widely used, Hadoop.

Hadoop is an Apache open‑source project inspired by Google's papers on the Google File System (GFS) and MapReduce. Its two core components are HDFS, a distributed file system that splits very large files into fixed‑size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later) and replicates each block across several DataNodes, and MapReduce, a programming model for running distributed computation over those files.
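The block and replication arithmetic can be made concrete with a small sketch. It assumes the 64 MB block size mentioned above and a replication factor of 3 (the HDFS default); the function names are illustrative and are not Hadoop commands.

```shell
#!/bin/sh
# Number of HDFS blocks needed for a file: ceil(file_mb / block_mb).
# The last block may be only partially filled.
hdfs_block_count() {
    file_mb=$1
    block_mb=$2
    echo $(( (file_mb + block_mb - 1) / block_mb ))
}

# Raw cluster storage in MB: every byte is stored once per replica.
hdfs_raw_mb() {
    file_mb=$1
    replication=$2
    echo $(( file_mb * replication ))
}

# A 1024 MB file with 64 MB blocks and replication factor 3.
hdfs_block_count 1024 64    # prints 16
hdfs_raw_mb 1024 3          # prints 3072

# A 200 MB file: three full 64 MB blocks plus one partial block.
hdfs_block_count 200 64     # prints 4
```

With 128 MB blocks the same 1024 MB file would need 8 blocks, so the NameNode would track half as many entries for it.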

Hadoop consists of a cluster side and a client side. The cluster contains three types of nodes: NameNode (the master that keeps the namespace and metadata in memory), Secondary NameNode (a helper that periodically snapshots metadata), and DataNode (the slaves that store actual data blocks on disk).

NameNode: manages the file system namespace, handles client read/write requests, and stores metadata. Unless NameNode high availability (introduced in Hadoop 2.x) is configured, it is a single point of failure.

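Secondary NameNode: periodically merges the NameNode's edit log into the file system image (a checkpoint) and keeps a copy of that metadata snapshot. Despite its name, it is not a hot standby and cannot take over automatically; its latest checkpoint can only help rebuild the metadata after a NameNode failure.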

DataNode: stores data blocks, performs block read/write operations, and provides data retrieval services.

MapReduce, proposed by Google, breaks a complex computation into a series of Map and Reduce steps, similar to the shell pipeline:

    cat word.txt | sort | uniq -c

The MapReduce workflow can be summarized in three stages:

Map task: extracts key‑value pairs from each input line.

Sort & Shuffle: groups values by key.

Reduce task: filters, aggregates, or otherwise processes each <key, list of values> group.
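The three stages above can be imitated on a single machine with ordinary shell tools. The following is a minimal sketch, not Hadoop itself: `sort` stands in for the shuffle, and the function names `map_words`, `reduce_counts`, and `word_count` are made up for illustration.

```shell
#!/bin/sh
# Map: split each input line into words and emit one "word<TAB>1" pair per word.
map_words() {
    tr -s '[:space:]' '\n' | awk 'NF { printf "%s\t1\n", $0 }'
}

# Reduce: input arrives sorted by key, so equal keys are adjacent.
# Sum the counts of each run of equal keys and emit "word<TAB>total".
reduce_counts() {
    awk -F '\t' '
        NR > 1 && $1 != key { printf "%s\t%d\n", key, total; total = 0 }
        { key = $1; total += $2 }
        END { if (NR > 0) printf "%s\t%d\n", key, total }
    '
}

# Sort plays the role of the shuffle: it brings identical keys together.
word_count() {
    map_words | LC_ALL=C sort | reduce_counts
}

printf 'to be or not\nto be\n' | word_count
```

On a real cluster the map and reduce steps run on many nodes in parallel, and the framework routes all pairs that share a key to the same reducer; the local pipeline only mirrors the data flow.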

Typical Hadoop commands (shown in the original figure) allow users to interact with the file system and launch jobs; common commands include hdfs dfs -ls, hdfs dfs -put, hadoop jar, etc.
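Since the original figure is not reproduced here, the sketch below shows how such a session might look. It uses the standard `hdfs dfs` subcommands and `hadoop jar`; every path, the jar name, and the `run` wrapper are placeholders assumed for illustration (the examples jar shipped with Hadoop normally carries a version suffix and lives inside the installation directory). Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a machine with a configured Hadoop client to run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Placeholder names, not taken from the article.
LOCAL_FILE=${LOCAL_FILE:-word.txt}
HDFS_IN=${HDFS_IN:-/user/demo/input}
HDFS_OUT=${HDFS_OUT:-/user/demo/output}
JOB_JAR=${JOB_JAR:-hadoop-mapreduce-examples.jar}

hdfs_basics() {
    run hdfs dfs -mkdir -p "$HDFS_IN"                            # create an HDFS directory
    run hdfs dfs -put "$LOCAL_FILE" "$HDFS_IN"                   # upload a local file
    run hdfs dfs -ls "$HDFS_IN"                                  # list the directory
    run hadoop jar "$JOB_JAR" wordcount "$HDFS_IN" "$HDFS_OUT"   # launch a job
    run hdfs dfs -cat "$HDFS_OUT/part-r-00000"                   # typical reducer output file
    run hdfs dfs -get "$HDFS_OUT" ./output                       # download the results
}

hdfs_basics
```

Note that a MapReduce job fails if its output directory already exists, so the output path has to be removed or changed before rerunning.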

A practical MapReduce example processes advertising click logs to produce per‑advertiser metrics. The example references three shell scripts (mapper.sh, reducer.sh, and run.sh), which implement the map, reduce, and job‑submission steps respectively.
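The original scripts are not reproduced in this summary, so the following is only a sketch of what such a mapper and reducer could look like. The log format (tab‑separated timestamp, advertiser ID, and an event that is either "show" or "click"), the function names, and the paths in the closing comment are all assumptions made for illustration.

```shell
#!/bin/sh
# Assumed (hypothetical) log format: timestamp<TAB>advertiser_id<TAB>event,
# where event is "show" or "click". Other lines are ignored.

# Map step: emit "advertiser_id<TAB>shows<TAB>clicks" for every valid line.
ad_mapper() {
    awk -F '\t' '
        $3 == "show"  { printf "%s\t1\t0\n", $2 }
        $3 == "click" { printf "%s\t0\t1\n", $2 }
    '
}

# Reduce step: input is sorted by advertiser_id, so sum each run of equal keys.
ad_reducer() {
    awk -F '\t' '
        function emit() { printf "%s\t%d\t%d\n", key, shows, clicks }
        NR > 1 && $1 != key { emit(); shows = 0; clicks = 0 }
        { key = $1; shows += $2; clicks += $3 }
        END { if (NR > 0) emit() }
    '
}

# Local rehearsal of the job; sort stands in for Hadoop's shuffle.
printf '%s\t%s\t%s\n' \
    t1 adv_a show \
    t2 adv_b show \
    t3 adv_a click \
    t4 adv_a show \
    t5 adv_b bogus |
    ad_mapper | LC_ALL=C sort | ad_reducer

# On a cluster, the same steps, saved as mapper.sh and reducer.sh, could be
# submitted through Hadoop Streaming (jar location and paths are placeholders):
#   hadoop jar "$STREAMING_JAR" \
#       -files mapper.sh,reducer.sh \
#       -input /logs/ad_clicks -output /reports/ad_metrics \
#       -mapper mapper.sh -reducer reducer.sh
```

Rehearsing the pipeline locally on a small log sample is a cheap way to check the mapper and reducer logic before submitting a job to the cluster.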

In conclusion, readers are encouraged to study “Hadoop: The Definitive Guide” for deeper knowledge and to explore other big‑data frameworks such as Spark, Storm, and Flink, choosing the tool that best fits their data characteristics and performance requirements.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
