
Understanding HDFS Architecture, NameNode HA, and Read/Write Processes

This article explains the concepts and architecture of HDFS, the high‑availability mechanisms of NameNode including quorum‑based shared storage, the detailed read and write workflows of the distributed file system, and discusses its typical use cases and limitations.

Architects' Tech Alliance

With the rapid growth of Internet data, the limits of single-machine storage were quickly exceeded, prompting the adoption of distributed file systems and MapReduce-style computation. Hadoop implements Google's published designs (GFS and MapReduce) by integrating a distributed file system, HDFS, with a batch-processing platform.

1. HDFS Concept and Architecture

HDFS follows a master/slave design, typically consisting of an active NameNode and multiple DataNodes, with a standby NameNode for redundancy. Core components include:

NameNode – manages the namespace, metadata, and block‑to‑DataNode mapping.

DataNode – stores actual data blocks and handles client read/write requests.

Block – a file is split into one or more blocks distributed across DataNodes.

Edits – transaction log of namespace changes, persisted on the NameNode.

FSImage – a snapshot of the NameNode’s metadata stored on local disk.

When the NameNode restarts, it loads FSImage and Edits into memory. Clients query the NameNode for block locations, then read from the nearest DataNode replicas, preferring local, same‑rack, then remote nodes. Replication across racks improves fault tolerance.
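The replica-preference order described above (local node, then same rack, then remote) can be sketched as a tiny distance function. This is a conceptual illustration, not Hadoop's actual `NetworkTopology` code; the node and rack names are made up.

```python
def pick_replica(client_node, client_rack, replicas):
    """Pick the closest replica: local node first, then same rack, then remote.

    Each replica is a (node, rack) tuple; all names here are illustrative.
    """
    def distance(replica):
        node, rack = replica
        if node == client_node:
            return 0          # local: read from the same machine
        if rack == client_rack:
            return 1          # same rack: one switch hop away
        return 2              # different rack: cross-rack traffic
    return min(replicas, key=distance)

replicas = [("dn3", "rack2"), ("dn1", "rack1"), ("dn7", "rack3")]
print(pick_replica("dn5", "rack1", replicas))  # ('dn1', 'rack1'), the same-rack copy
```

In the real cluster the NameNode computes these distances from its configured rack topology and returns the block's DataNode list already sorted by proximity.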

2. NameNode High‑Availability (HA) Mechanism

Hadoop 2.x introduces HA to eliminate the NameNode single point of failure. The architecture involves an active NameNode, a standby NameNode, ZooKeeper, the ZKFailoverController (ZKFC), and a shared storage system, the Quorum Journal Manager (QJM).

ZKFC monitors NameNode health and triggers automatic failover via ZooKeeper's ActiveStandbyElector. The shared storage (QJM) consists of multiple JournalNodes that store EditLogs using a Paxos-style quorum to guarantee consistency. The standby NameNode periodically reads EditLogs from the JournalNodes to stay synchronized.
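The quorum rule behind QJM is simple: an edit is durable once a strict majority of JournalNodes acknowledge it, so the system tolerates the failure of a minority. A minimal sketch of that commit condition (not Hadoop's actual QJM protocol code):

```python
def quorum_commit(acks, journal_nodes):
    """An edit entry is committed once a strict majority of the
    JournalNode ensemble has acknowledged the write."""
    needed = journal_nodes // 2 + 1   # e.g. 2 of 3, or 3 of 5
    return len(acks) >= needed

print(quorum_commit({"jn1", "jn2"}, 3))  # True: 2 of 3 is a majority
print(quorum_commit({"jn1"}, 3))         # False: 1 of 3 is not
```

This is why JournalNode ensembles are deployed in odd sizes: a 3-node ensemble survives one failure, a 5-node ensemble survives two.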

The failover steps are:

HealthMonitor periodically calls HAServiceProtocol on each NameNode.

On status change, it notifies ZKFailoverController.

ZKFailoverController invokes ActiveStandbyElector for leader election.

ActiveStandbyElector interacts with ZooKeeper to elect the new active NameNode.

The elected node is switched to Active via HAServiceProtocol.
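The five steps above can be sketched as a toy controller loop. The class and method names here are illustrative only, not Hadoop's actual `ZKFailoverController` or `HAServiceProtocol` API, and the real election runs through ZooKeeper ephemeral nodes rather than a local list scan.

```python
class ToyFailoverController:
    """Toy stand-in for ZKFC: health-check the NameNodes and, when the
    active one fails, elect a healthy replacement."""

    def __init__(self, nodes):
        self.nodes = nodes            # name -> "healthy" or "failed"
        self.active = None

    def health_check(self):
        # Steps 1-2: HealthMonitor polls each NameNode; on a status
        # change affecting the active node, trigger an election.
        if self.active is None or self.nodes[self.active] == "failed":
            self.elect()

    def elect(self):
        # Steps 3-5: the elector picks a healthy node and transitions
        # it to Active (in reality, via ZooKeeper and HAServiceProtocol).
        healthy = [n for n, s in self.nodes.items() if s == "healthy"]
        self.active = healthy[0] if healthy else None

zkfc = ToyFailoverController({"nn1": "healthy", "nn2": "healthy"})
zkfc.health_check()           # nn1 becomes active
zkfc.nodes["nn1"] = "failed"
zkfc.health_check()           # failover: nn2 takes over
print(zkfc.active)            # nn2
```

Note that real deployments also need fencing of the old active node, so that a NameNode that merely lost its ZooKeeper session cannot keep serving writes.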

3. HDFS Read/Write Process

Read Flow: The client contacts the NameNode via RPC to obtain block locations, then reads blocks directly from the nearest DataNodes. The client streams data through DFSInputStream, which transparently fetches subsequent blocks as needed, keeping the NameNode's role lightweight.
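The key property of the read path is that the NameNode hands out metadata once and the data itself flows straight from DataNodes. A minimal simulation of that split, with dictionaries standing in for the NameNode's block map and the DataNodes' block stores (all names invented for illustration):

```python
def read_file(namenode, datanodes, path):
    """Conceptual HDFS read: look up the ordered block list once,
    then pull each block directly from a DataNode that holds it."""
    data = b""
    for block_id, holders in namenode[path]:  # metadata from NameNode
        dn = holders[0]                       # nearest replica, in reality
        data += datanodes[dn][block_id]       # bulk data from DataNode
    return data

namenode = {"/a.txt": [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]}
datanodes = {"dn1": {"blk_1": b"hello "}, "dn2": {"blk_2": b"world"}}
print(read_file(namenode, datanodes, "/a.txt"))  # b'hello world'
```

Because no file bytes ever pass through the NameNode, its load stays proportional to metadata operations rather than to data volume.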

Write Flow:

The client sends an RPC request to the NameNode.

The NameNode validates the file and permissions.

The client splits data into packets, requests new blocks, and receives a list of suitable DataNodes.

Packets are pipelined through the DataNode chain; each node stores the packet and forwards it.

The last DataNode returns an acknowledgment that propagates back to the client.

If a DataNode fails, the pipeline is rebuilt, and the NameNode assigns a new replica to maintain the replication factor.
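The pipelined replication in steps 4-5 can be sketched as a recursive hand-off: each node in the chain stores the packet, forwards it downstream, and the acknowledgment propagates back once the tail has stored it. This is a simplified model, not the actual DataNode transfer protocol; node names are illustrative.

```python
def pipeline_write(packet, chain, stores):
    """Toy HDFS write pipeline: each DataNode in the chain stores the
    packet and forwards it; the ack travels back from the last node."""
    if not chain:
        return True                              # past the tail: ack begins
    head, rest = chain[0], chain[1:]
    stores[head].append(packet)                  # store the packet locally
    return pipeline_write(packet, rest, stores)  # forward downstream

stores = {"dn1": [], "dn2": [], "dn3": []}
ok = pipeline_write(b"pkt-0", ["dn1", "dn2", "dn3"], stores)
print(ok, [len(v) for v in stores.values()])  # True [1, 1, 1]
```

The pipeline shape is why the client only uploads each packet once: replication bandwidth is spread across the DataNode chain instead of multiplying the client's outbound traffic by the replication factor.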

DFSOutputStream Internals

Data is buffered, divided into 64 KB packets, each containing 512 B chunks with checksums. Packets are placed in a dataQueue; a DataStreamer thread sends them to the first DataNode, moving them to an ackQueue. A ResponseProcessor thread receives acknowledgments and removes successful packets from the ackQueue. On errors, failed packets are discarded, a new pipeline is created, and transmission resumes.
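The packetization described above (64 KB packets of 512 B checksummed chunks) can be sketched directly. HDFS uses CRC32C for chunk checksums; the standard library's `zlib.crc32` stands in here, and the exact packet framing is simplified.

```python
import zlib

CHUNK = 512                   # bytes per checksummed chunk
PACKET = 64 * 1024            # target packet payload size

def packetize(data):
    """Split data into ~64 KB packets of 512 B chunks, pairing each
    chunk with a CRC32 checksum (a stand-in for HDFS's CRC32C)."""
    packets = []
    for p in range(0, len(data), PACKET):
        payload = data[p:p + PACKET]
        chunks = [(payload[c:c + CHUNK], zlib.crc32(payload[c:c + CHUNK]))
                  for c in range(0, len(payload), CHUNK)]
        packets.append(chunks)
    return packets

pkts = packetize(b"x" * (PACKET + 1000))
print(len(pkts), len(pkts[0]))  # 2 128: two packets, the first fully filled
```

Chunk-level checksums let a reader detect a corrupt 512 B region and re-fetch just that block from another replica, rather than distrusting a whole 128 MB block.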

4. Usage Scenarios and Drawbacks

HDFS is suited for:

Write-once, read-many workloads and large files (block sizes of 64 MB–256 MB).

High fault tolerance via multiple replicas placed across racks.

High-throughput batch processing that can tolerate higher latency.

Its drawbacks are:

High access latency, making it unsuitable for low-latency, interactive workloads.

Large numbers of small files consume excessive NameNode memory, since all block metadata is held in RAM.

Only one writer per file at a time; no random writes, only appends.

Original source: http://tech.weli.cn/2019/03/06/hdfs-basic/

Author: Zero Degree, backend technology lead at WeLi, contributor to large‑scale data platforms.

Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
