
Understanding Hadoop: Architecture, HDFS, and MapReduce

This article explains Hadoop, the Apache-managed open-source platform for storing massive data sets on distributed server clusters and running robust, efficient analytics through its two core components: HDFS for storage and the Java-based MapReduce framework for processing. It also touches on Hadoop's modularity, high availability, and the tooling that has grown up around it.

Big data is an overwhelming term, and discussing big data inevitably brings up Hadoop. Unfortunately, most big‑data evangelists, even professionals, cannot clearly explain what Hadoop actually is or what it does, leaving their non‑technical managers completely confused.

It is well known that Hadoop is an open‑source software platform managed by the Apache Software Foundation, but what exactly is Hadoop? Simply put, Hadoop is a method for storing massive amounts of data on a distributed server cluster and running distributed analytics applications.

Hadoop is designed to be a very robust system: even if individual servers or entire racks fail, the big-data analytics applications running on the cluster are not interrupted. Hadoop is also efficient, because it moves computation to the nodes where the data already lives instead of shuffling large volumes of data back and forth across the network.

Below is the official Apache definition:

The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of servers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of servers.

If we look more closely, Hadoop has another impressive property: it is almost completely modular, meaning you can swap any of its components for a different software tool. This makes Hadoop's architecture exceptionally flexible without sacrificing reliability or efficiency.

Hadoop Distributed File System (HDFS)

If the word Hadoop leaves your mind blank, remember this: Hadoop has two main components, a data-processing framework and a distributed file system for storage (HDFS).

HDFS is like the basket of the Hadoop system: you neatly place data inside it, and it waits there for the data-analysis chef to turn it into a tasty dish served to the CEO's table. You can analyze the data inside Hadoop itself, or you can extract, transform, and load (ETL) it out of Hadoop into other tools for analysis.
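To make the "basket" idea concrete, here is a minimal sketch of writing a file into HDFS and reading it back through Hadoop's Java FileSystem API. It assumes the Hadoop client libraries are on the classpath and that fs.defaultFS in the cluster configuration points at a running HDFS; the class name HdfsBasketDemo, the path /data/sales.csv, and the sample record are illustrative placeholders, not part of the original article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasketDemo {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath,
    // including the fs.defaultFS address of the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/sales.csv");  // illustrative path

    // "Place data in the basket": write a file into the distributed file system.
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.write("2024-01-01,store-7,199.00\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back; the file's blocks may be replicated across several
    // DataNodes, but callers see one logical file.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```

The same put and get operations are also available from the command line through the hdfs dfs utility; either way, the point is that data sits in HDFS waiting for whichever processing tool comes along to consume it.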

Data‑processing framework and MapReduce

As the name suggests, the data‑processing framework is the tool that processes data. Specifically, Hadoop’s data‑processing framework is the Java‑based system MapReduce, which you will hear about more often than HDFS because:

1. MapReduce is the tool that actually completes data‑processing tasks

2. MapReduce often drives its users crazy

In a conventional relational database, data is located and analyzed with SQL (Structured Query Language). The non-relational databases grouped under the label NoSQL also have query languages of their own; the name simply means they are not limited to SQL.

One point that is easy to get confused about: Hadoop is not a true database. It can store and retrieve data, but it has no query language of its own. Hadoop is closer to a data-warehousing system, so it needs a system like MapReduce to actually process the data.

MapReduce runs a series of tasks, each a separate Java application that can access data and extract useful information. Using MapReduce instead of a query language makes Hadoop’s data‑analysis capabilities more powerful and flexible, but also greatly increases technical complexity.
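As an illustration of what such a task looks like, below is a minimal sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are illustrative, not taken from the original article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every word in an input line, emit the pair (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum all the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure the job and point it at input/output paths in HDFS.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar and submitted with the hadoop jar command, a job like this reads its input from HDFS and writes its results back to HDFS; it is exactly the kind of "separate Java application that can access data" described above, and it shows why working directly with MapReduce takes considerably more effort than writing a query.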

Many tools now make Hadoop easier to use; Hive, for example, translates SQL-like query statements into MapReduce jobs. Even so, the complexity and limitations of MapReduce (it handles only batch, one-task-at-a-time processing) mean Hadoop is used more often as a data warehouse than as a data-analysis tool. Reference: Hadoop is the poor man's ETL.

Another unique aspect of Hadoop is that all its functions are distributed, unlike the centralized architecture of traditional databases.

Tags: MapReduce, Distributed Computing, HDFS, Hadoop
Written by Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
