Hadoop Explained: Architecture, Core Components, and Real-World Applications
This article gives an overview of Hadoop: its history and key characteristics, the HDFS storage layer, the MapReduce processing engine, the YARN resource manager, real-world application scenarios, and the major components of the broader Hadoop ecosystem.
1. Hadoop Technology Overview
Big data has become a new driving force for economic and social development. With the rise of cloud computing and the mobile internet, generating and moving massive volumes of data has become routine, and Hadoop’s distributed architecture is one of the most widely used big‑data technologies.
1.1 Development History
Hadoop was created by Doug Cutting, the founder of Apache Lucene. It originated in Nutch, an open‑source web‑search project that itself began as a sub‑project of Lucene. The evolution of Hadoop is illustrated in the diagram below.
1.2 Key Features
High reliability – data is replicated across multiple nodes; failed tasks are automatically rescheduled.
High scalability – new nodes can be added easily to expand the cluster.
High efficiency – data is processed in parallel on the nodes where it resides.
High fault tolerance – HDFS stores multiple replicas; if a node fails, another replica is used.
Low cost – Hadoop is open source and runs on inexpensive commodity machines, so there is no need for high‑end servers.
Java‑based core – Hadoop itself is written in Java, but applications can also be developed in other languages such as C++ or Python (for example, through Hadoop Pipes or Hadoop Streaming).
2. HDFS – Hadoop Distributed File System
HDFS is a distributed file system designed to run on commodity hardware while providing strong fault tolerance. It splits files into blocks and replicates each block across multiple DataNodes. The architecture consists of a NameNode, a Secondary NameNode, and many DataNodes.
2.1 NameNode
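The NameNode holds the file system metadata: file names, directories, the mapping of files to blocks, block locations, replica information, and DataNode status. Metadata is kept in memory. The namespace is persisted in the fsimage file and the edits log, while block locations are rebuilt from the block reports that DataNodes send.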
2.2 Secondary NameNode
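The Secondary NameNode periodically merges the edits log into fsimage to create a new checkpoint, then copies the result back to the NameNode. Despite its name, it is not a hot standby for the NameNode.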
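The checkpoint cycle can be pictured as replaying an edit log onto the last snapshot. The sketch below is a toy in-memory model, not Hadoop's real fsimage or edits format; the operation names (`mkdir`, `create`, `delete`) and the `checkpoint` function are hypothetical and exist only for illustration.

```python
# Toy model of an HDFS checkpoint: replay the edit log onto the last
# namespace snapshot. All names are illustrative assumptions, not
# Hadoop's actual on-disk format or API.

def checkpoint(fsimage, edits):
    """Return a new snapshot with every logged edit applied in order."""
    namespace = set(fsimage)  # copy, so the old image stays untouched
    for op, path in edits:
        if op in ("mkdir", "create"):
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return namespace

old_image = {"/", "/logs"}
edit_log = [
    ("create", "/logs/a.txt"),
    ("mkdir", "/tmp"),
    ("delete", "/logs/a.txt"),
]

new_image = checkpoint(old_image, edit_log)
print(sorted(new_image))  # ['/', '/logs', '/tmp']
```

Once the merged image exists, the edit log can start again from empty, which is what keeps NameNode restarts from having to replay a very long log.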
2.3 DataNode
DataNodes store the actual data blocks (default 128 MB per block in Hadoop 2.x and 3.x). Each block is replicated (default three copies). The NameNode tracks block locations.
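As a quick worked example of the defaults above, the following sketch computes how many blocks a file occupies and how much raw cluster storage its replicas consume. The `block_layout` helper is hypothetical, and the 1000 MB file size is an arbitrary assumption.

```python
import math

BLOCK_SIZE_MB = 128  # default block size mentioned above
REPLICATION = 3      # default replication factor mentioned above

def block_layout(file_size_mb):
    """Return (number of blocks, size of last block, raw storage used)."""
    if file_size_mb <= 0:
        raise ValueError("file size must be positive")
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The final block only holds the remaining data, not a full 128 MB.
    last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
    raw_storage_mb = file_size_mb * REPLICATION
    return num_blocks, last_block_mb, raw_storage_mb

blocks, last, raw = block_layout(1000)
print(blocks, last, raw)  # 8 104 3000
```

A 1000 MB file therefore becomes seven full blocks plus one 104 MB block, and with three replicas it takes about 3000 MB of raw disk across the cluster.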
3. MapReduce – Hadoop Computing Engine
MapReduce is the core programming model for large‑scale data processing on Hadoop. It consists of two phases:
Map : reads input splits from HDFS, transforms each record into key‑value pairs, and emits intermediate data.
Reduce : receives grouped intermediate keys, aggregates values, and writes the final results back to HDFS.
The full execution flow includes input splitting, the Map phase, Shuffle & Sort, the Reduce phase, and output writing.
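The flow above can be imitated in a single process with a word count, the usual introductory MapReduce example. This is only a sketch of the programming model in plain Python: it does not use the Hadoop API, the function names are hypothetical, and the in-memory list `splits` stands in for input splits read from HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle & Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)

# Stand-in for input splits; on a cluster each split feeds one map task.
splits = ["big data big cluster", "data node"]

intermediate = [pair for line in splits for pair in map_phase(line)]
results = [reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate)]
print(results)  # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

On a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network, but the contract between the phases is the same.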
4. YARN – Resource Management
YARN (Yet Another Resource Negotiator) separates resource management from data processing. It consists of:
ResourceManager (RM) : global scheduler that allocates containers to applications.
NodeManager (NM) : runs on each node, reports resource usage, and launches containers.
ApplicationMaster (AM) : negotiates resources for a specific application and coordinates its tasks.
Client Application : submits jobs to the RM, which allocates a container and launches an AM for the job.
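The interaction between these roles can be sketched with a toy allocator. Everything here is a simplified assumption: the classes are hypothetical stand-ins rather than YARN's API, memory is the only resource, and first-fit placement replaces the pluggable schedulers that YARN actually uses.

```python
class NodeManager:
    """Toy node that tracks only its free memory."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Toy global scheduler using first-fit placement (an assumption)."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return (node.name, memory_mb)
        return None  # the cluster has no capacity for this request

rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])

# The client submits a job: the RM first grants a container for the AM.
am_container = rm.allocate(1024)
# The AM then negotiates containers for its tasks.
task_containers = [rm.allocate(2048) for _ in range(3)]

print(am_container)     # ('nm1', 1024)
print(task_containers)  # [('nm1', 2048), ('nm2', 2048), None]
```

The third task request returns `None` because neither node has 2048 MB left. In this toy model the caller simply sees the failure, whereas a real application would wait until resources are released.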
5. Real‑World Application Scenarios
Hadoop is employed across many industries, including:
Online travel (e.g., Expedia, Ctrip)
Mobile data platforms (e.g., China Mobile’s BigCloud)
E‑commerce (e.g., Alibaba’s Taobao and Tmall)
Fraud detection in finance and government
IT security and malware analysis (e.g., Qihoo 360)
Healthcare analytics (e.g., IBM Watson)
Search engines (e.g., Yahoo, Baidu)
Social platforms (e.g., Tencent, Facebook)
6. Hadoop Ecosystem
Beyond the core, Hadoop’s ecosystem includes many complementary projects:
Hive : data‑warehouse framework that translates SQL‑like queries (HiveQL) into MapReduce jobs; newer versions can also run on Tez or Spark.
ZooKeeper : coordination service for maintaining configuration and naming.
HBase : column‑family NoSQL database for random, real‑time read/write access.
Spark : fast, in‑memory processing engine that improves on MapReduce for iterative algorithms.
Flume : reliable, distributed system for collecting and moving large volumes of log data.
Kafka : distributed publish/subscribe messaging system for real‑time data pipelines.
This summary is adapted from the book “Hadoop and Big Data Mining” (ISBN 9787111709473) with permission from the publisher.