
Introduction to HDFS: Architecture, Components, and Operations

This article provides a comprehensive overview of HDFS, covering its role as a distributed file system, the concepts of blocks, NameNode and DataNode responsibilities, replication, edit logs, snapshots, high‑availability mechanisms, and practical considerations for managing large‑scale data storage.


HDFS (Hadoop Distributed File System) is a widely used distributed file system that abstracts a cluster of machines as a single logical storage unit, allowing users to read and write files as if they were on one machine while the system handles distribution across many nodes.

1. HDFS Overview

When data volumes grow beyond the capacity of a single server, files are split and stored across multiple machines, which introduces management challenges. HDFS solves this by providing a unified interface for distributed storage, hiding the complexity of multiple servers from the user.

In HDFS, files are divided into 128MB blocks (configurable) and each block is replicated on several DataNodes for fault tolerance.

2. How HDFS Works

A 1 GB file, for example, is split into eight 128MB blocks. Each block is stored on different DataNodes, and the metadata about block locations is managed by the NameNode.
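As a back-of-the-envelope illustration (this is arithmetic only, not HDFS's actual splitting code), the block count for a file follows directly from the block size:

```python
# Illustrative sketch: how a file size maps to HDFS blocks.
# 128 MB is the default block size (the dfs.blocksize setting).
BLOCK_SIZE = 128 * 1024 * 1024

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of `file_size` bytes is split into."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than block_size
    return sizes

one_gib = 1024 * 1024 * 1024
print(len(split_into_blocks(one_gib)))  # 8
```

Note that the last block of a file only occupies as much space as its actual data, so a 130 MB file uses one full 128 MB block plus one 2 MB block.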

The NameNode stores metadata such as file paths, block IDs, and their locations. All read and write operations first contact the NameNode to obtain block location information before communicating with the appropriate DataNodes.

Write: the client asks the NameNode where to place each new block.

Read: the client asks the NameNode which DataNodes hold the required blocks.
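The two lookups above can be modeled with a toy in-memory NameNode. This is purely an illustration of the metadata shape (path → block IDs → replica locations); real HDFS clients talk to the NameNode over RPC, and the class and method names here are invented for the sketch:

```python
# Toy model of NameNode metadata: which blocks make up a file,
# and which DataNodes hold a replica of each block.
class ToyNameNode:
    def __init__(self) -> None:
        self.file_blocks: dict[str, list[str]] = {}      # path -> ordered block IDs
        self.block_locations: dict[str, list[str]] = {}  # block ID -> DataNodes

    def add_block(self, path: str, block_id: str, datanodes: list[str]) -> None:
        """Write path: record a new block for `path`, replicated on `datanodes`."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = datanodes

    def get_block_locations(self, path: str) -> list[tuple[str, list[str]]]:
        """Read path: which DataNodes hold each block of `path`?"""
        return [(b, self.block_locations[b]) for b in self.file_blocks.get(path, [])]

nn = ToyNameNode()
nn.add_block("/logs/app.log", "blk_1", ["dn1", "dn2", "dn3"])
nn.add_block("/logs/app.log", "blk_2", ["dn2", "dn3", "dn4"])
print(nn.get_block_locations("/logs/app.log"))
```

Once the client has the block locations, it streams data directly to or from the DataNodes; the NameNode never sits on the data path itself.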

2.1 HDFS Replication

To avoid data loss when a node fails, each block is replicated on multiple DataNodes. This mirrors the replication strategies used in systems like Kafka and Elasticsearch.
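HDFS's default placement policy is rack-aware: the first replica goes on the writer's node, and the second and third go to two different nodes on another rack, so that a whole-rack failure cannot lose all copies. A very simplified sketch of that idea (the function and data layout are invented for illustration; the real policy handles many more cases):

```python
# Simplified rack-aware replica placement (replication factor 3):
# replica 1 on the writer's node, replicas 2 and 3 on a different rack.
def place_replicas(writer: str, racks: dict[str, list[str]]) -> list[str]:
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    remote_rack = next(r for r in racks if r != writer_rack)
    second, third = racks[remote_rack][:2]
    return [writer, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # ['dn1', 'dn3', 'dn4']
```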

2.2 NameNode Internals

The NameNode keeps metadata in memory for fast access and persists changes by appending records to an editlog file. Because appends are sequential I/O, they are efficient.

Over time the editlog can become large, slowing NameNode restarts. To mitigate this, the NameNode periodically creates a snapshot called fsimage and merges the edit log into it, similar to a database checkpoint.
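The editlog-plus-checkpoint pattern can be sketched as follows. This uses JSON lines for readability; it is not the real on-disk format of the edit log or fsimage, just the idea: mutations are appended sequentially, and a checkpoint folds them into a snapshot so a restart need not replay a huge log.

```python
import json
import os

def append_edit(log_path: str, op: dict) -> None:
    """Record a metadata mutation with a cheap sequential append."""
    with open(log_path, "a") as f:
        f.write(json.dumps(op) + "\n")

def checkpoint(fsimage_path: str, log_path: str) -> dict:
    """Merge the edit log into the fsimage snapshot, then truncate the log."""
    state: dict = {}
    if os.path.exists(fsimage_path):
        with open(fsimage_path) as f:
            state = json.load(f)
    with open(log_path) as f:
        for line in f:
            op = json.loads(line)
            if op["op"] == "create":
                state[op["path"]] = op["blocks"]
            elif op["op"] == "delete":
                state.pop(op["path"], None)
    with open(fsimage_path, "w") as f:
        json.dump(state, f)
    open(log_path, "w").close()  # edits are now in the snapshot
    return state

import tempfile
tmp = tempfile.mkdtemp()
log = os.path.join(tmp, "edits.log")
image = os.path.join(tmp, "fsimage.json")
append_edit(log, {"op": "create", "path": "/a", "blocks": ["blk_1"]})
append_edit(log, {"op": "create", "path": "/b", "blocks": ["blk_2"]})
append_edit(log, {"op": "delete", "path": "/a"})
print(checkpoint(image, log))  # {'/b': ['blk_2']}
```

After the checkpoint, a restart only needs to load the snapshot and replay whatever small tail of edits arrived since, which is exactly why the merge keeps NameNode restarts fast.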

A separate service, the SecondaryNameNode, performs this merging task, allowing the primary NameNode to focus on client requests.

2.3 High Availability

Since the NameNode is a single point of failure, a standby NameNode is introduced. ZooKeeper coordinates failover between the active-standby pair, and a shared edit log (implemented via JournalNodes) ensures both NameNodes see identical metadata.

2.4 DataNode Operation

DataNodes store the actual block data and periodically send heartbeats and block reports to the NameNode. The block reports include checksums and timestamps, enabling the NameNode to detect corrupted blocks.
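Checksum-based corruption detection can be sketched with a plain CRC32. (Real HDFS stores checksums per fixed-size chunk in a sidecar metadata file next to each block; this single-checksum-per-block version is a simplification for illustration.)

```python
import zlib

# Simplified corruption detection: store a checksum when a block is written,
# recompute and compare when it is read or scanned.
def checksum(block_data: bytes) -> int:
    return zlib.crc32(block_data)

def verify(block_data: bytes, stored_checksum: int) -> bool:
    """True if the block's bytes still match the checksum recorded at write time."""
    return zlib.crc32(block_data) == stored_checksum

data = b"some block bytes"
ck = checksum(data)
print(verify(data, ck))            # True: block is intact
print(verify(data + b"\x00", ck))  # False: corruption detected
```

When the NameNode learns that a replica is corrupt, it schedules re-replication from one of the healthy copies, which is why the replication factor matters for durability and not just availability.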

Conclusion

The concepts presented in HDFS—block replication, metadata management, edit logs, snapshots, and high availability—are common across many distributed systems such as Kafka, Elasticsearch, Zookeeper, and Redis.

Future articles will explore MapReduce and further compare persistence mechanisms across these frameworks.

Tags: big data, replication, distributed file system, HDFS, NameNode, DataNode
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
