Understanding HDFS Architecture, NameNode HA, and Read/Write Processes
This article explains the concepts and architecture of HDFS, the high‑availability mechanisms of the NameNode including quorum‑based shared storage, the detailed read and write workflows of the distributed file system, and its typical use cases and limitations.
As Internet data rapidly outgrew what a single machine could store, distributed file systems and MapReduce were adopted; Hadoop implements Google’s GFS and MapReduce ideas by integrating a distributed file system (HDFS) with a batch‑processing platform.
1. HDFS Concept and Architecture
HDFS follows a master/slave design: a single active NameNode, an optional standby NameNode for redundancy, and multiple DataNodes. Core components include:
NameNode – manages the namespace, metadata, and block‑to‑DataNode mapping.
DataNode – stores actual data blocks and handles client read/write requests.
Block – the unit of storage; a file is split into fixed‑size blocks (128 MB by default in Hadoop 2.x) that are replicated across DataNodes.
Edits – transaction log of namespace changes, persisted on the NameNode.
FSImage – a snapshot of the NameNode’s metadata stored on local disk.
When the NameNode restarts, it loads the FSImage into memory and replays the Edits log to reconstruct the latest namespace. Clients query the NameNode for block locations, then read from the nearest DataNode replicas, preferring local, then same‑rack, then remote nodes. Replication across racks improves fault tolerance.
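The local/same‑rack/remote preference can be sketched as a simple distance function. The names below (`Replica`, `choose_replica`) are illustrative, not Hadoop’s actual API:

```python
# Hypothetical sketch of HDFS's replica-selection preference when a client
# reads a block: local node first, then same rack, then any remote replica.
from dataclasses import dataclass

@dataclass
class Replica:
    host: str
    rack: str

def choose_replica(replicas, client_host, client_rack):
    """Return the 'closest' replica: local host, then same rack, then remote."""
    def distance(r):
        if r.host == client_host:
            return 0          # local read: no network hop
        if r.rack == client_rack:
            return 1          # same rack: one switch hop
        return 2              # off-rack: cross-switch traffic
    return min(replicas, key=distance)

replicas = [Replica("dn3", "rack2"), Replica("dn1", "rack1"), Replica("dn7", "rack3")]
print(choose_replica(replicas, client_host="dn1", client_rack="rack1").host)  # → dn1
```

A client co‑located on `dn1` reads locally; a client on another host in `rack2` would fall back to `dn3`.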
2. NameNode High‑Availability (HA) Mechanism
Hadoop 2.x introduces HA to eliminate the NameNode single point of failure. The architecture involves an active NameNode, a standby NameNode, ZooKeeper, the ZKFailoverController (ZKFC), and a shared storage system (Quorum Journal Manager, QJM).
ZKFC monitors NameNode health and triggers automatic failover via ZooKeeper’s ActiveStandbyElector. The shared storage (QJM) consists of multiple JournalNodes that store EditLogs; a write is committed once a majority (quorum) of JournalNodes acknowledge it, a Paxos‑style protocol that guarantees consistency. The standby NameNode periodically reads EditLogs from the JournalNodes to stay synchronized.
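The quorum rule itself is simple. A minimal sketch (not the real QJM protocol) of the majority test:

```python
# Minimal sketch of QJM's quorum rule: an edit-log write is committed once a
# strict majority of JournalNodes acknowledge it.

def quorum_commit(acks, total_journal_nodes):
    """Commit succeeds when acks form a strict majority of JournalNodes."""
    return acks > total_journal_nodes // 2

# With 3 JournalNodes, 2 acks commit; 1 ack does not.
print(quorum_commit(2, 3))  # → True
print(quorum_commit(1, 3))  # → False
```

This is why JournalNode clusters are deployed in odd sizes (3, 5, …): a 3‑node ensemble tolerates one failure, a 5‑node ensemble tolerates two.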
The failover steps are:
A HealthMonitor inside each ZKFC periodically probes its local NameNode via HAServiceProtocol RPCs.
On status change, it notifies ZKFailoverController.
ZKFailoverController invokes ActiveStandbyElector for leader election.
ActiveStandbyElector interacts with ZooKeeper to elect the new active NameNode.
The elected node is switched to Active via HAServiceProtocol.
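The election in step 4 relies on ZooKeeper’s ephemeral znodes: whichever ZKFC first creates the lock znode wins, and the znode disappears automatically if its session dies. A toy model of that behavior, using a plain dict in place of ZooKeeper:

```python
# Toy model of ActiveStandbyElector's leader election: whichever ZKFC first
# creates the ephemeral lock znode becomes Active; the other stays Standby.
# FakeZooKeeper just mimics "create if absent" + session-bound ephemerality.

class FakeZooKeeper:
    def __init__(self):
        self.znodes = {}

    def create_ephemeral(self, path, owner):
        if path in self.znodes:
            return False           # lock already held -> caller stays Standby
        self.znodes[path] = owner  # caller wins election -> becomes Active
        return True

    def session_expired(self, owner):
        # ephemeral znodes vanish when their owner's session dies
        self.znodes = {p: o for p, o in self.znodes.items() if o != owner}

zk = FakeZooKeeper()
LOCK = "/hadoop-ha/mycluster/lock"        # illustrative path, not the real one
print(zk.create_ephemeral(LOCK, "nn1"))   # → True  (nn1 becomes Active)
print(zk.create_ephemeral(LOCK, "nn2"))   # → False (nn2 stays Standby)
zk.session_expired("nn1")                 # nn1 crashes; its lock is released
print(zk.create_ephemeral(LOCK, "nn2"))   # → True  (failover: nn2 Active)
```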
3. HDFS Read/Write Process
Read Flow : The client contacts the NameNode via RPC to obtain block locations, then reads blocks directly from the nearest DataNodes. The client streams data through DFSInputStream, which transparently fetches subsequent blocks as needed, keeping the NameNode’s role lightweight.
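The read path can be sketched in a few lines, with plain dicts standing in for the NameNode and DataNode RPC endpoints (an assumed simplification; the real client uses DFSInputStream over Hadoop RPC):

```python
# Hedged sketch of the HDFS read path: the client asks the (mocked) NameNode
# for block locations once, then streams each block directly from a DataNode.

def read_file(path, namenode, datanodes):
    blocks = namenode[path]                   # RPC: getBlockLocations
    data = b""
    for block_id, hosts in blocks:
        replica = hosts[0]                    # pick the nearest replica
        data += datanodes[replica][block_id]  # stream block from DataNode
    return data

namenode = {"/logs/a.txt": [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]}
datanodes = {"dn1": {"blk_1": b"hello "}, "dn2": {"blk_2": b"world"}}
print(read_file("/logs/a.txt", namenode, datanodes))  # → b'hello world'
```

Note that file bytes never pass through the NameNode, which is what keeps its role lightweight.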
Write Flow :
The client sends an RPC request to the NameNode.
The NameNode validates the file and permissions.
The client splits data into packets, requests new blocks, and receives a list of suitable DataNodes.
Packets are pipelined through the DataNode chain; each node stores the packet and forwards it.
The last DataNode returns an acknowledgment that propagates back to the client.
If a DataNode fails, the pipeline is rebuilt, and the NameNode assigns a new replica to maintain the replication factor.
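Steps 4–5 above can be sketched as a chain of store‑and‑forward hops, with the acknowledgment implied by the call returning back up the chain (node names and function are illustrative):

```python
# Illustrative sketch of the write pipeline: each DataNode stores the packet
# and forwards it to the next; the ack travels back along the same chain.

def pipeline_write(packet, pipeline, stores):
    """Send `packet` through the DataNode chain; return True if fully acked."""
    if not pipeline:
        return True                       # last node reached: ack flows back
    head, rest = pipeline[0], pipeline[1:]
    stores[head].append(packet)           # this DataNode persists the packet
    return pipeline_write(packet, rest, stores)  # forward downstream

stores = {"dn1": [], "dn2": [], "dn3": []}
ok = pipeline_write(b"pkt-0", ["dn1", "dn2", "dn3"], stores)
print(ok, [len(stores[d]) for d in ("dn1", "dn2", "dn3")])  # → True [1, 1, 1]
```

The pipelined design means the client sends each packet once, regardless of the replication factor, instead of sending one copy per replica.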
DFSOutputStream Internals
Data is buffered, divided into 64 KB packets, each containing 512 B chunks with checksums. Packets are placed in a dataQueue; a DataStreamer thread sends them to the first DataNode, moving them to an ackQueue. A ResponseProcessor thread receives acknowledgments and removes successful packets from the ackQueue. On errors, failed packets are discarded, a new pipeline is created, and transmission resumes.
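The chunk‑and‑packet framing described above can be illustrated with the same sizes (512 B chunks, 64 KB packet payload); the framing here is deliberately simplified relative to the real wire format:

```python
# Sketch of DFSOutputStream-style packetizing: data is cut into 512-byte
# chunks, each paired with a CRC32 checksum, and grouped into packets holding
# up to 64 KB of payload. Sizes match the text; the framing is simplified.
import zlib

CHUNK = 512
PACKET_PAYLOAD = 64 * 1024

def packetize(data):
    chunks = [(data[i:i + CHUNK], zlib.crc32(data[i:i + CHUNK]))
              for i in range(0, len(data), CHUNK)]
    per_packet = PACKET_PAYLOAD // CHUNK          # 128 chunks per packet
    return [chunks[i:i + per_packet] for i in range(0, len(chunks), per_packet)]

packets = packetize(b"x" * (70 * 1024))           # 70 KB of data
print(len(packets), len(packets[0]))              # → 2 128
```

Each receiving DataNode can recompute the checksum per chunk, so corruption is detected at 512‑byte granularity rather than per block.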
4. Usage Scenarios and Drawbacks
Suited to write‑once, read‑many workloads and large files (block sizes of 64 MB–256 MB are typical).
High fault tolerance via multiple replicas.
Optimized for batch throughput rather than interactive access; its high latency makes it unsuitable for low‑latency scenarios.
Small files consume excessive NameNode memory.
A file has a single writer at a time; random writes are not supported, only appends.
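The small‑files cost can be made concrete with a back‑of‑the‑envelope estimate. A commonly cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the exact figure varies by Hadoop version, so treat this as an approximation:

```python
# Back-of-the-envelope illustration of the small-files problem.
BYTES_PER_OBJECT = 150  # rule-of-thumb estimate, not an exact constant

def namenode_heap_mb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)   # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 1024 / 1024

# 10 million one-block files vs. the same data packed into 100k large files:
print(round(namenode_heap_mb(10_000_000)))                    # → 2861 (MB)
print(round(namenode_heap_mb(100_000, blocks_per_file=100)))  # far smaller
```

The metadata footprint scales with object count, not data volume, which is why consolidating small files (e.g., into SequenceFiles or HAR archives) is the standard remedy.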
Original source: http://tech.weli.cn/2019/03/06/hdfs-basic/
Author: Zero Degree, backend technology lead at WeLi, contributor to large‑scale data platforms.
Architects' Tech Alliance