
Understanding Hadoop HDFS and MapReduce: Principles, Architecture, and Sample Code

This article explains the origins of big‑data technologies, details the architecture and read/write mechanisms of Hadoop's HDFS, describes the MapReduce programming model, and provides complete Java code examples for a simple distributed file‑processing job using Maven dependencies.


The article begins with a brief history of big‑data foundations, citing Google's foundational papers (the Google File System in 2003, MapReduce in 2004, and Bigtable in 2006) that inspired Doug Cutting to create Hadoop, the open‑source framework for large‑scale data processing.

It then introduces HDFS (Hadoop Distributed File System) as the storage layer for massive data, comparing it to traditional single‑node file systems and explaining its core components: NameNode (metadata management) and DataNode (block storage). The text describes how files are split into blocks, replicated across DataNodes, and how the NameNode tracks block locations.
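As a rough illustration of the block model described above, the arithmetic can be sketched in plain Java (assuming Hadoop's defaults of a 128 MB block size and a replication factor of 3; both are configurable via dfs.blocksize and dfs.replication):

```java
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // dfs.blocksize default in Hadoop 2+
    static final int REPLICATION = 3;                   // dfs.replication default

    // Number of blocks a file of the given size is split into (last block may be partial).
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total raw bytes stored across the cluster, counting every replica.
    static long rawBytesStored(long fileSize) {
        return fileSize * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));     // 8 blocks
        System.out.println(rawBytesStored(oneGiB)); // 3 GiB of raw storage
    }
}
```

Note that HDFS does not pad the final block: a 1 byte file still occupies one block entry in the NameNode's metadata but only 1 byte (times the replication factor) on disk.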

The read/write workflow of HDFS is detailed: a client contacts the NameNode to create a file, receives block allocation, writes blocks to DataNodes, receives acknowledgments, and finally updates metadata. Reading follows the reverse process, with the client retrieving block locations from the NameNode and streaming data from the appropriate DataNodes.
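The write path described above can be modeled as a toy simulation in plain Java, with the NameNode reduced to a metadata map and each DataNode to a block store. All class and method names here are illustrative, not the Hadoop API:

```java
import java.util.*;

public class HdfsWriteSketch {
    // "NameNode": tracks which block ids make up each file.
    static Map<String, List<Integer>> metadata = new HashMap<>();
    // Three "DataNodes", each a simple block store keyed by block id.
    static List<Map<Integer, byte[]>> dataNodes =
            List.of(new HashMap<>(), new HashMap<>(), new HashMap<>());
    static int nextBlockId = 0;

    // Client write: obtain a block allocation, push the bytes to every replica,
    // then commit the block into the file's metadata once all replicas have it.
    static void writeBlock(String file, byte[] data) {
        int blockId = nextBlockId++;                   // block allocation from NameNode
        for (Map<Integer, byte[]> dn : dataNodes)      // pipeline to each replica
            dn.put(blockId, data);                     // DataNode stores and acknowledges
        metadata.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId); // metadata update
    }

    // Client read: look up block locations, then stream from one replica.
    static byte[] readBlock(String file, int index) {
        int blockId = metadata.get(file).get(index);
        return dataNodes.get(0).get(blockId); // any replica would serve
    }

    public static void main(String[] args) {
        writeBlock("/logs/app.log", "hello".getBytes());
        System.out.println(new String(readBlock("/logs/app.log", 0))); // hello
    }
}
```

The real protocol pipelines replicas serially (client to DataNode 1, DataNode 1 to DataNode 2, and so on) rather than fanning out from the client as this sketch does.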

Next, the article shifts to the MapReduce computation framework, outlining its two phases—Map (transform input records into key‑value pairs) and Reduce (aggregate values for each key). A word‑count example illustrates how Map emits <word,1> pairs and Reduce sums them to produce final frequencies.
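The word‑count semantics can be simulated in plain Java without a cluster. This is a sketch of the two phases, not the Hadoop Mapper/Reducer API itself: the map step emits <word,1> pairs and the reduce step groups by key and sums:

```java
import java.util.*;

public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: tokenize each input record and emit <word, 1> pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));

        // Shuffle + reduce phase: group the pairs by key and sum the values.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

In real Hadoop, the shuffle step (grouping values by key between Map and Reduce) happens across the network; here it collapses into a single in-memory merge.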

To demonstrate practical usage, the article provides complete Java source code for a Hadoop MapReduce job that processes JSON and CSV files. The Mapper parses each line, detects the format, and emits cleaned data using helper methods cleanJsonData and cleanCsvData. The Reducer simply forwards the Mapper output. The Driver class configures the job, sets the Mapper and Reducer classes, specifies input and output paths, and launches the job via a Maven‑built JAR.
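A minimal sketch of the format‑detection logic such a Mapper might use. The helper names cleanJsonData and cleanCsvData come from the article; their bodies below are hypothetical placeholders (the real helpers would presumably parse with Jackson and commons-csv respectively):

```java
public class FormatSketch {
    // Heuristic: a line starting with '{' or '[' is treated as JSON, otherwise CSV.
    static boolean looksLikeJson(String line) {
        String t = line.trim();
        return t.startsWith("{") || t.startsWith("[");
    }

    // Hypothetical stand-in: the real helper would validate via a JSON parser.
    static String cleanJsonData(String line) {
        return line.trim();
    }

    // Hypothetical stand-in: normalizes whitespace around field separators.
    static String cleanCsvData(String line) {
        return line.trim().replaceAll("\\s*,\\s*", ",");
    }

    static String clean(String line) {
        return looksLikeJson(line) ? cleanJsonData(line) : cleanCsvData(line);
    }

    public static void main(String[] args) {
        System.out.println(clean("  {\"id\": 1}  ")); // {"id": 1}
        System.out.println(clean("a , b ,c"));        // a,b,c
    }
}
```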

Code snippets are presented in their original form, wrapped in ... tags to preserve formatting. Maven dependencies required for the job include hadoop-common, hadoop-mapreduce-client-core, jackson, and commons-csv. The article concludes by encouraging readers to explore Hadoop's source code, noting that the framework is written in Java and that its concepts are approachable for those willing to study them.
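For reference, the dependency section of such a job's pom.xml might look like the following. The artifact coordinates are the standard ones for these libraries; the version numbers are illustrative, not taken from the article:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.6</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.3.6</version>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.15.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.10.0</version>
  </dependency>
</dependencies>
```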

Tags: Java, Big Data, Maven, MapReduce, distributed file system, HDFS, Hadoop
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
