
Introduction to HDFS: Architecture, Features, Replication, Rack Awareness, and Metadata Management

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), covering its streaming data access model, key characteristics, master‑slave architecture, block storage and replication mechanisms, rack‑aware placement strategy, and how the NameNode manages metadata and checkpoints.


HDFS (Hadoop Distributed File System) is a highly fault‑tolerant file system designed for high‑throughput access to large‑scale datasets, storing each file as a sequence of blocks spread across inexpensive commodity hardware.

Key characteristics include streaming data access, suitability for very large files (gigabytes to terabytes), write‑once‑read‑many semantics with a single writer per file, and a focus on high aggregate throughput rather than low‑latency access.

The architecture follows a Master/Slave model with a NameNode managing the namespace and client operations, and multiple DataNodes storing the actual data blocks. The NameNode also tracks block‑to‑DataNode mappings.
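The two tables the NameNode maintains can be illustrated with a minimal in‑memory sketch. This is a toy model for intuition only, not Hadoop's actual implementation; all class and method names here are hypothetical.

```python
class NameNode:
    """Toy model of the NameNode's two core lookup tables:
    the namespace (file path -> ordered block IDs) and the
    block map (block ID -> DataNodes holding a replica)."""

    def __init__(self):
        self.namespace = {}   # file path -> [block_id, ...]
        self.block_map = {}   # block_id -> {datanode, ...}

    def create_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)
        for b in block_ids:
            self.block_map.setdefault(b, set())

    def report_block(self, datanode, block_id):
        # DataNodes tell the NameNode what they hold via block reports.
        self.block_map.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        # A client asks the NameNode where each block lives, then
        # reads block data directly from the DataNodes.
        return [(b, sorted(self.block_map[b])) for b in self.namespace[path]]

nn = NameNode()
nn.create_file("/logs/app.log", ["blk_1", "blk_2"])
nn.report_block("dn-a", "blk_1")
nn.report_block("dn-b", "blk_1")
nn.report_block("dn-b", "blk_2")
print(nn.locate("/logs/app.log"))
# [('blk_1', ['dn-a', 'dn-b']), ('blk_2', ['dn-b'])]
```

Note the division of labor this implies: the NameNode serves only metadata, while bulk data transfer flows directly between clients and DataNodes.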

Files are split into fixed‑size blocks (128 MB by default in recent Hadoop versions; only the last block may be smaller), and each block is replicated for fault tolerance. The replication factor is configurable per file, and writes are append‑only, which simplifies consistency.
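The block‑splitting rule is simple enough to sketch directly. The helper below is illustrative (not Hadoop code) and uses a tiny block size in place of the real 128 MB default:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks; only the last
    block may be shorter, and HDFS does not pad it."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 300-byte "file" with a toy 128-byte block size:
blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

Because the final block occupies only as much storage as it needs, many small files still waste NameNode metadata (one entry per block), which is why HDFS favors large files.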

HDFS employs a rack‑aware placement policy. With the default replication factor of three, the first replica is placed on the writer's node (or a random node if the writer is outside the cluster), the second on a node in a different rack, and the third on a different node in that same remote rack. This balances reliability against write cost: the file survives a whole‑rack failure, yet only one replica crosses the rack boundary during the write, and reads can often be served from a nearby replica.
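The default placement policy can be sketched as a small function over a rack topology. This is a simplified model under assumed names (`topology`, `place_replicas` are hypothetical); the real chooser also weighs disk space and load:

```python
import random

def place_replicas(writer_node, topology, rng=random.Random(0)):
    """Sketch of the default 3-replica placement: replica 1 on the
    writer's node, replica 2 on a node in a different rack,
    replica 3 on another node in replica 2's rack.
    `topology` maps rack name -> list of node names."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])
    replicas.append(second)
    # Third replica: same remote rack, different node.
    replicas.append(rng.choice([n for n in topology[remote_rack] if n != second]))
    return replicas

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))  # e.g. ['n1', 'n4', 'n3']
```

The key invariant is that the three replicas span exactly two racks: losing any single rack leaves at least one live copy.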

Metadata is stored on the NameNode using an EditLog that records every namespace change and an FsImage that holds a complete snapshot of the namespace. Periodic checkpoints (historically performed by the Secondary NameNode, or by a Standby NameNode in HA deployments) replay the EditLog over the FsImage to produce a new FsImage, after which the old log can be truncated.
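The checkpoint step is essentially "replay log over snapshot". The toy model below uses a dict as the FsImage and a list of tuples as the EditLog; these structures and operation names are illustrative, not Hadoop's on‑disk formats:

```python
def apply_edit(fsimage, edit):
    """Apply one logged namespace operation to the snapshot."""
    op, path = edit[0], edit[1]
    if op == "create":
        fsimage[path] = {"replication": edit[2]}
    elif op == "set_replication":
        fsimage[path]["replication"] = edit[2]
    elif op == "delete":
        fsimage.pop(path, None)

def checkpoint(fsimage, editlog):
    """Replay the EditLog over the FsImage to build a new FsImage;
    afterwards the old log can be discarded."""
    new_image = dict(fsimage)
    for edit in editlog:
        apply_edit(new_image, edit)
    return new_image, []  # new snapshot, empty log

image = {"/a": {"replication": 3}}
log = [("create", "/b", 2), ("set_replication", "/b", 3), ("delete", "/a")]
new_image, new_log = checkpoint(image, log)
print(new_image)  # {'/b': {'replication': 3}}
```

Keeping the log short this way bounds NameNode restart time, since a restart must replay whatever portion of the EditLog has not yet been checkpointed.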

In summary, the article introduces HDFS fundamentals, its architecture, replication strategy, rack awareness, and metadata management, providing a solid foundation for further exploration of Hadoop’s design.

Tags: replication, distributed file system, metadata management, HDFS, Hadoop, rack awareness
Written by Big Data Technology Architecture, exploring open source big data and AI technologies.
