
What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Programmer DD

Apr 13, 2021


HDFS Overview and Application Scenarios

HDFS (Hadoop Distributed File System) is a distributed file system modeled on the design described in Google's GFS paper and built to run on commodity hardware. It offers high fault tolerance and high throughput for large-scale data access, and it can store data sets ranging from terabytes to petabytes.

HDFS suits large-file storage and streaming data access. It is a poor fit for workloads with many small files (every file and block consumes NameNode memory), random writes, or low-latency reads.

Typical Application Scenarios

Website user behavior data storage

Ecosystem data storage

Weather data storage

HDFS Position in FusionInsight

In FusionInsight HD, HDFS serves as the fundamental storage layer, providing a distributed, highly fault‑tolerant, and linearly scalable file system.

System Design Goals

Hardware failure tolerance: hardware is assumed to be unreliable, so the system must detect failures and recover from them automatically.

Streaming data access: applications read data as a stream, so the design favors throughput over response time.

Large data volumes: typical files range from gigabytes to terabytes, and a single cluster can hold petabytes of data.

Data consistency: files follow a write-once-read-many (WORM) model. Once written and closed, a file is not modified in place; data can only be appended.

Multi-platform support: runs on diverse hardware platforms.

Data locality: computation is moved close to the data to reduce network load.

Basic System Architecture

The architecture consists of three components: NameNode, DataNode, and Client.

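NameNode stores the file system namespace and metadata, such as the directory tree and the mapping of files to blocks.

DataNode stores the actual data blocks and reports them to the NameNode through periodic heartbeats and block reports.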

Client interacts with HDFS, obtaining block locations from the NameNode and reading/writing data to DataNodes.

Data Write Process

Client creates a file via the HDFS API.

NameNode creates a file node in its metadata.

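Client starts writing data and asks the NameNode for a new block; the NameNode returns a block ID and the DataNodes that will hold its replicas.

Client streams the block to the first DataNode, which forwards it to the next DataNode in the pipeline until every replica is written.

DataNodes send acknowledgements back through the pipeline; once all data is acknowledged, the client closes the file.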
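The pipelined part of the write path can be illustrated with a small in-memory simulation. This is only a sketch: the DataNode class, the write_block function, and the tiny packet size are hypothetical stand-ins, not the HDFS client API, and no networking or failure handling is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode (hypothetical): keeps received packets per block ID in memory."""
    name: str
    blocks: dict = field(default_factory=dict)

    def receive(self, block_id, packet, downstream):
        # Store the packet locally, then forward it to the next node in the pipeline.
        self.blocks.setdefault(block_id, []).append(packet)
        if downstream:
            acked = downstream[0].receive(block_id, packet, downstream[1:])
        else:
            acked = []
        # Acks travel back upstream, so the last node in the pipeline is listed first.
        return acked + [self.name]


def write_block(block_id, data, pipeline, packet_size=4):
    """Split data into packets and send each one through the whole pipeline."""
    acks = []
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        acks.append(pipeline[0].receive(block_id, packet, pipeline[1:]))
    return acks


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
acks = write_block("blk_1", b"hello hdfs", nodes)
```

The client only talks to the first node, yet every node ends up with the full block, which is why the client's outbound bandwidth does not grow with the replication factor.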

Data Read Process

Client opens a file via the HDFS API.

NameNode provides block locations.

Client reads data from the nearest DataNodes based on block locations.

After reading, the client closes the file.
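Choosing the "nearest" DataNode can be sketched as sorting replicas by their distance in a rack tree. The example below assumes two-level locations of the form /rack/host and counts the hops to the closest common ancestor; the location strings are made up for illustration, and a real cluster derives its topology from its own configuration.

```python
def distance(a, b):
    """Tree distance between two '/rack/host' locations (hops via the common ancestor)."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def order_replicas(client, replicas):
    """Return the replica locations sorted from closest to farthest from the client."""
    return sorted(replicas, key=lambda loc: distance(client, loc))


client = "/rack1/host1"  # hypothetical client location
replicas = ["/rack2/host5", "/rack1/host2", "/rack1/host1"]
ordered = order_replicas(client, replicas)
```

With these inputs a replica on the client's own host has distance 0, one on the same rack has distance 2, and one on another rack has distance 4, so the local replica is tried first.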

Key Features

Unified file system view for users.

RPC‑based communication between components.

Space reclamation and dynamic replica management.

Data organized in blocks stored on underlying OS file systems.

Access via Java API, HTTP, or shell commands.
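To make the block organization mentioned in the list above concrete, the sketch below computes how a file would be cut into fixed-size blocks. It assumes a 128 MB block size, which is the usual default in recent Hadoop releases but is configurable per cluster and per file; the function name is illustrative only.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in practice)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 300 MB file needs two full blocks plus one partial block.
layout = split_into_blocks(300 * 1024 * 1024)
```

The last block only occupies as much space as the data it holds, so a file slightly larger than one block does not waste a full second block on disk.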

Metadata Persistence

NameNode persists its metadata in two files: FsImage, a checkpoint of the whole namespace, and EditLog, a record of the changes made since that checkpoint. During startup, the NameNode loads FsImage into memory and then replays the EditLog entries to bring the metadata up to date.
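The startup sequence amounts to replaying a log on top of a snapshot. The following sketch uses a plain dictionary as the namespace and a made-up set of operations (mkdir, create, add_block, delete); the real on-disk formats and operation codes are different and far richer.

```python
import copy


def load_namespace(fsimage, edit_log):
    """Rebuild the in-memory namespace: start from the snapshot, then replay edits in order."""
    namespace = copy.deepcopy(fsimage)  # leave the snapshot itself untouched
    for op, path, *args in edit_log:
        if op == "mkdir":
            namespace[path] = {"type": "dir"}
        elif op == "create":
            namespace[path] = {"type": "file", "blocks": []}
        elif op == "add_block":
            namespace[path]["blocks"].append(args[0])
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown edit log operation: {op}")
    return namespace


# Hypothetical snapshot and edits recorded after it was taken.
fsimage = {
    "/": {"type": "dir"},
    "/logs": {"type": "dir"},
    "/logs/old.log": {"type": "file", "blocks": ["blk_1"]},
}
edit_log = [
    ("create", "/logs/new.log"),
    ("add_block", "/logs/new.log", "blk_2"),
    ("delete", "/logs/old.log"),
]
namespace = load_namespace(fsimage, edit_log)
```

The longer the log grows, the longer this replay takes, which is why the snapshot is periodically refreshed by merging the accumulated edits into it.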

High Availability (HA)

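HA pairs the active NameNode with a standby NameNode. JournalNodes hold the shared edit log so the standby stays in sync, ZKFC (ZooKeeper Failover Controller) monitors NameNode health and triggers failover, and ZooKeeper coordinates the election of the active node. Together they keep the service available when a NameNode fails.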
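The role of the JournalNodes can be illustrated with simple quorum arithmetic. The sketch assumes the majority rule used by a quorum-based journal, where an edit counts as durable once more than half of the JournalNodes have acknowledged it; the function names are illustrative.

```python
def majority(journal_nodes):
    """Smallest number of JournalNodes that forms a majority."""
    return journal_nodes // 2 + 1


def edit_is_durable(acks, journal_nodes):
    """An edit is considered durable once a majority of JournalNodes acknowledged it."""
    return acks >= majority(journal_nodes)


def tolerated_failures(journal_nodes):
    """How many JournalNodes can be down while a majority is still reachable."""
    return (journal_nodes - 1) // 2
```

Under this rule three JournalNodes tolerate one failure and five tolerate two, which is why such groups are normally deployed in odd numbers.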

Federation

Federation introduces multiple NameNodes, each managing an independent portion of the namespace while sharing the same pool of DataNodes. This improves scalability and throughput and isolates workloads from one another.
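One way to picture a partitioned namespace is a client-side mount table that maps path prefixes to the NameNode responsible for them. The sketch below is written in that spirit; the mount points and the nameservice URIs (ns1, ns2) are hypothetical, and it does not reproduce the configuration format of any real mount-table implementation.

```python
# Hypothetical mount table: path prefix -> target location on a specific namespace.
MOUNT_TABLE = {
    "/user": "hdfs://ns1/user",
    "/data": "hdfs://ns2/data",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Map a client-visible path to the namespace that owns it."""
    for prefix, target in mount_table.items():
        # Match whole path components only, so "/userx" does not match "/user".
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    raise KeyError(f"no mount point covers {path}")


location = resolve("/user/alice/report.csv")
```

Because each prefix is served by a different NameNode, heavy metadata traffic under /data does not compete with requests under /user.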

Storage Policies

Hierarchical storage types, roughly from fastest to slowest: RAM_DISK, SSD, DISK, ARCHIVE.

Tag‑based policies allow directories to be associated with storage tags, directing blocks to specific DataNodes.

Node‑group policies enable placement of critical data on high‑reliability node groups.
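The hierarchical storage types are easiest to understand through the policies that combine them. The sketch below encodes the standard Apache Hadoop policies (HOT, WARM, COLD, ALL_SSD, ONE_SSD, LAZY_PERSIST) as "type for the first replica, type for the remaining replicas". Fallback storage types are left out, and the tag-based and node-group policies described above are not modeled.

```python
# policy name -> (storage type of the first replica, storage type of the other replicas)
POLICIES = {
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),
    "ALL_SSD": ("SSD", "SSD"),
    "ONE_SSD": ("SSD", "DISK"),
    "HOT": ("DISK", "DISK"),
    "WARM": ("DISK", "ARCHIVE"),
    "COLD": ("ARCHIVE", "ARCHIVE"),
}


def replica_storage_types(policy, replication=3):
    """Return the preferred storage type of each replica under the given policy."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    first, rest = POLICIES[policy]
    return [first] + [rest] * (replication - 1)


warm = replica_storage_types("WARM")
```

Moving a directory from HOT to WARM to COLD as its data ages shifts replicas from DISK toward ARCHIVE without changing how clients address the files.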

Colocation (Same‑Node Placement)

Files that are frequently joined are stored on the same DataNode to minimize network traffic during processing.
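The idea behind colocation can be sketched as a table that pins a shared identifier to a fixed set of DataNodes, so every file written with that identifier lands on the same nodes. Everything here is hypothetical: the ColocationTable class, the identifier strings, and the round-robin assignment are illustrations, not the FusionInsight interface.

```python
class ColocationTable:
    """Toy placement table: files sharing a locator ID get the same DataNodes."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.locators = {}

    def nodes_for(self, locator_id):
        # Assign a node set the first time a locator is seen, then always reuse it.
        if locator_id not in self.locators:
            start = (len(self.locators) * self.replication) % len(self.datanodes)
            self.locators[locator_id] = [
                self.datanodes[(start + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
        return self.locators[locator_id]


table = ColocationTable(["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"], replication=2)
orders = table.nodes_for("region-east")     # e.g. the orders file for one region
customers = table.nodes_for("region-east")  # the customers file for the same region
clicks = table.nodes_for("region-west")
```

A join between the two region-east files can then run on nodes that already hold both inputs, which is the network saving the section describes.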

Common Shell Commands

Typical HDFS shell commands are used for file system operations: hdfs dfs -ls lists a directory, hdfs dfs -put uploads a local file, and hdfs dfs -rm deletes a file.
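When these commands are driven from a script, it helps to build the argument list explicitly rather than concatenating strings. The helper below is a minimal sketch that only covers the three subcommands named above; hdfs_dfs is a hypothetical function and the paths are placeholders.

```python
import shlex


def hdfs_dfs(action, *paths):
    """Build the argument list for an 'hdfs dfs' command (ls, put, or rm only)."""
    if action not in {"ls", "put", "rm"}:
        raise ValueError(f"unsupported action: {action}")
    return ["hdfs", "dfs", f"-{action}", *paths]


# Upload a local file, list the target directory, then remove the file again.
put_cmd = hdfs_dfs("put", "access.log", "/logs/access.log")
ls_cmd = hdfs_dfs("ls", "/logs")
rm_cmd = hdfs_dfs("rm", "/logs/access.log")
printable = shlex.join(ls_cmd)
```

On a machine with a configured Hadoop client, each list could be passed to subprocess.run(cmd, check=True); the sketch stops short of running anything so that it works without a cluster.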


