Big Data 20 min read

What Makes Hadoop the Backbone of Modern Big Data Processing?

This article provides a comprehensive overview of Hadoop, covering its history, core features, the HDFS storage framework, MapReduce computation engine, YARN resource manager, real‑world application scenarios, and the surrounding ecosystem of tools such as Hive, Spark and Kafka.

Python Crawling & Data Mining

Oct 16, 2022

What Makes Hadoop the Backbone of Modern Big Data Processing?

01 Hadoop技术概述

In the era of big data, Hadoop has become one of the most widely used and representative distributed technologies, offering a framework for storing and processing massive datasets.

1 Hadoop的发展历史

Hadoop originated from the Apache Lucene project and the Nutch web crawler, evolving into a mature open‑source platform.

2 Hadoop的特点

高可靠性 – Data is replicated across multiple nodes, allowing automatic recovery from node failures.

高扩展性 – New nodes can be added easily to expand the cluster.

高效性 – Data is processed in parallel across nodes, ensuring fast computation.

高容错性 – HDFS stores multiple replicas; if a node fails, other replicas are used.

低成本 – Hadoop is open‑source and free to download.

可构建在廉价机器上 – It runs on standard commodity servers.

基于Java – Primarily written in Java, but supports other languages such as C++ and Python.

3 Hadoop存储框架—HDFS

HDFS is a distributed file system designed for commodity hardware, offering strong fault tolerance and high throughput for large files.

HDFS简介及架构

HDFS stores data across multiple nodes, splits files into blocks, and replicates each block on several nodes.

分布式文件系统 – The file system spans many cluster nodes.

数据分布在多个节点上 – Files are broken into blocks stored on different nodes with replicas.

数据从多个节点读取 – Reading a file involves fetching blocks from multiple nodes.

HDFS follows a master/slave architecture with a NameNode, Secondary NameNode, and multiple DataNodes.

NameNode

Stores metadata and handles client requests, keeping file attributes, block locations, and DataNode information in memory and persisting them to fsimage and edits files.

Secondary NameNode

Periodically merges edits into the fsimage, backs up the new fsimage, and resets the edits file.

DataNode

Actually stores the data blocks; default block size is 128 MB with three replicas for fault tolerance.

4 Hadoop计算引擎—MapReduce

MapReduce is Hadoop’s core computation framework for parallel processing of large data sets.

Map（映射） – Reads data from HDFS, transforms it into key‑value pairs, and passes them to the Reduce phase.

Reduce（规约） – Receives intermediate key‑value pairs, groups them by key, processes each group, and writes the results back to HDFS.

5 Hadoop资源管理器—YARN

YARN manages cluster resources, enabling efficient scheduling and execution of various applications such as MapReduce, Hive, HBase, and Spark.

ResourceManager (RM) – Global scheduler that allocates resources to applications.

NodeManager (NM) – Runs on each node, reports resource usage, and launches containers.

ApplicationMaster (AM) – Negotiates resources for a specific application and coordinates its tasks.

Client Application – Submits jobs to the RM, which creates an AM to manage execution.

02 Hadoop应用场景介绍

Hadoop is widely applied across many industries, including:

Online travel platforms (e.g., Expedia, Ctrip)

Mobile data services (e.g., China Mobile’s BigCloud)

E‑commerce (e.g., Alibaba’s Taobao and Tmall)

Fraud detection in finance and government

IT security and malware detection (e.g., Qihoo 360)

Healthcare analytics (e.g., IBM Watson)

Search engines (e.g., Yahoo, Baidu, Alibaba)

Social platforms (e.g., Tencent, Facebook)

03 Hadoop生态系统

The Hadoop ecosystem has grown to include many complementary tools that provide specialized services.

Hive – SQL‑like data warehouse on Hadoop.

ZooKeeper – Coordination service for distributed applications.

HBase – Scalable, column‑oriented NoSQL database.

Spark – Fast, in‑memory data processing engine.

Flume – Reliable, distributed log collection system.

Kafka – Distributed publish/subscribe messaging platform.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

MapReduce Distributed Computing YARN HDFS Hadoop

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.