
Design Principles and Architecture of HDFS (Hadoop Distributed File System)

This article explains HDFS's design goals, master/slave architecture, namespace management, block replication strategies, fault tolerance mechanisms, metadata persistence, communication protocols, robustness features, data organization, access methods, and space reclamation, providing a comprehensive overview of Hadoop's distributed storage system.


1. Premise and Design Goals

1. Hardware failure is the norm rather than the exception; an HDFS cluster may consist of hundreds or thousands of servers, and any component can fail at any time, so error detection and rapid, automatic recovery are core architectural goals of HDFS.

2. Applications running on HDFS differ from typical applications; they are primarily stream‑read and batch‑processing oriented, emphasizing high throughput over low latency for data access.

3. HDFS is designed to support massive data sets, with typical file sizes ranging from gigabytes to terabytes, and a single HDFS instance should be able to handle tens of millions of files.

4. HDFS applications use a write‑once‑read‑many access model; once a file is created, written, and closed, it is never modified. This simplifies consistency concerns and enables high‑throughput access. Typical examples include the MapReduce framework and web‑crawler applications.

5. The cost of moving computation is lower than moving data. The closer the computation is to the data it processes, the more efficient it becomes, especially at massive scales. Moving compute to the data, rather than moving data to the compute, is preferable, and HDFS provides interfaces to support this.

6. Portability: HDFS is designed to be easily portable across heterogeneous hardware and software platforms.

2. NameNode and DataNode

HDFS adopts a master/slave architecture. An HDFS cluster consists of one NameNode and many DataNodes. The NameNode is a central server that manages the filesystem namespace and client access to files. DataNodes, typically one per machine, manage the storage on their local disks. Files are split into blocks that are stored across the DataNode collection. The NameNode handles namespace operations (open, close, rename) and decides block‑to‑DataNode mappings, while DataNodes create, delete, and replicate blocks under the NameNode’s direction. Both components can run on ordinary inexpensive Linux machines and are implemented in Java.

Having a single NameNode greatly simplifies the system architecture. The NameNode stores and manages all HDFS metadata, but user data never flows through it: file reads and writes go directly to the DataNodes.

3. Filesystem Namespace

HDFS supports a traditional hierarchical file organization similar to most file systems; users can create directories and perform create, delete, move, and rename operations on files. HDFS does not currently support user quotas, access permissions, or symbolic links, though the architecture does not preclude adding these features. The NameNode maintains the namespace, recording any modifications. Applications can set the replication factor for a file, which is also stored by the NameNode.

4. Data Replication

HDFS is designed to reliably store massive files across a large cluster. Each file is stored as a sequence of blocks; all blocks except the last have the same size. For fault tolerance, every block is replicated. Block size and replication factor are configurable and can be changed after file creation. HDFS enforces a write‑once model with a single writer at any time. The NameNode periodically receives heartbeats and BlockReports from each DataNode; heartbeats indicate liveliness, while BlockReports list all blocks stored on the DataNode.

1. Replica placement is key to HDFS reliability and performance. HDFS uses a rack‑aware strategy to improve reliability, efficiency, and network bandwidth utilization. In a typical large cluster spanning multiple racks, intra‑rack bandwidth is higher than inter‑rack bandwidth.

Through a Rack Awareness process, the NameNode determines each DataNode’s rack ID. A simple (non‑optimized) strategy places replicas on separate racks, preventing loss of an entire rack and allowing reads from multiple racks. This improves load balancing under failures but increases write cost because data must be transferred across racks.

In most cases, the replication factor is 3, with one replica on the local rack, one on another node in the same rack, and the third on a node in a different rack. Since rack failures are rarer than node failures, this strategy maintains reliability while improving write performance.
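The 3-replica placement described above can be sketched as a small function. This is an illustrative simplification, not the NameNode's actual placement code; the `topology` dict (rack id to node list) and node names are hypothetical.

```python
import random

def place_replicas(writer, topology, replication=3):
    """Choose DataNodes for a block per the strategy above: one replica on
    the writer's node, one on another node in the same rack, and one on a
    node in a different rack. `topology` maps rack id -> list of nodes."""
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    replicas = [writer]
    # Second replica: a different node within the writer's rack.
    same_rack = [n for n in topology[local_rack] if n != writer]
    if same_rack:
        replicas.append(random.choice(same_rack))
    # Third replica: any node on a different rack, so a whole-rack failure
    # cannot destroy all copies.
    remote = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
    if remote:
        replicas.append(random.choice(remote))
    return replicas[:replication]
```

Because two of the three replicas share a rack, a write crosses the inter-rack link only once, which is where the write-performance benefit comes from.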

2. Replica selection aims to reduce bandwidth consumption and read latency; HDFS prefers the reader to fetch the nearest replica, first checking the same rack, then the same data center if the cluster spans multiple data centers.
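The nearest-replica preference can be modeled with a simple network-distance metric. The tiered distances below (same node, same rack, same data center, different data center) are an assumed illustration of "nearest", not HDFS's internal values.

```python
def distance(a, b):
    """Hypothetical network distance between two nodes, each described as a
    (datacenter, rack, host) tuple; lower means closer."""
    if a == b:
        return 0          # same node
    if a[:2] == b[:2]:
        return 2          # same datacenter, same rack
    if a[0] == b[0]:
        return 4          # same datacenter, different rack
    return 6              # different datacenters

def nearest_replica(reader, replicas):
    """Pick the replica closest to the reader, checking same-rack first."""
    return min(replicas, key=lambda r: distance(reader, r))
```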

3. SafeMode: After startup, the NameNode enters SafeMode, during which it does not replicate blocks. It waits until a configurable percentage of blocks have reached their minimum replica count before exiting SafeMode and then schedules replication for any blocks still below their target factor.
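The SafeMode exit condition amounts to a fraction check over the block map. A minimal sketch, where the threshold and minimum replica count are placeholder values, not HDFS defaults:

```python
def can_exit_safemode(block_replica_counts, min_replicas=1, threshold=0.999):
    """Return True once the fraction of blocks that have reached their
    minimum replica count meets the configured threshold."""
    if not block_replica_counts:
        return True
    safe = sum(1 for count in block_replica_counts if count >= min_replicas)
    return safe / len(block_replica_counts) >= threshold
```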

5. Persistence of Filesystem Metadata

The NameNode stores HDFS metadata. Any operation that modifies file metadata is recorded in an EditLog transaction log. The entire namespace, including block‑to‑file mappings and file attributes, is stored in an FsImage file on the NameNode’s local filesystem.

The NameNode keeps the full namespace and block map in memory. Upon startup, it reads the EditLog and FsImage from disk, applies the EditLog transactions to the in‑memory FsImage, flushes the updated FsImage back to disk, and truncates the old EditLog. This process is called a checkpoint and currently occurs only at startup.
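The checkpoint step, replaying EditLog transactions against the in-memory image, can be sketched as follows. The namespace is modeled as a flat dict of path to metadata and the transaction tuples are illustrative shapes, not HDFS's real on-disk log format.

```python
def checkpoint(fsimage, editlog):
    """Apply EditLog transactions to a copy of the in-memory FsImage and
    return the updated image; the caller would flush this to disk and
    truncate the old EditLog."""
    image = dict(fsimage)
    for op, *args in editlog:
        if op == "create":
            path, meta = args
            image[path] = meta
        elif op == "rename":
            old, new = args
            image[new] = image.pop(old)
        elif op == "delete":
            (path,) = args
            image.pop(path, None)
    return image
```

Replaying at startup is what lets the NameNode append cheap log records during normal operation instead of rewriting the large FsImage on every metadata change.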

DataNodes store block data as isolated files on the local filesystem. They use heuristics to create subdirectories for optimal file distribution. When a DataNode starts, it scans its local filesystem, builds a list of all HDFS blocks, and sends a BlockReport to the NameNode.

6. Communication Protocols

All HDFS communication is built on TCP/IP. Clients connect to the NameNode via a configurable port using ClientProtocol RPC calls. DataNodes communicate with the NameNode using DatanodeProtocol RPC calls. The NameNode only responds to RPC requests; it does not initiate them.

7. Robustness

HDFS’s primary goal is reliable data storage under failures. The three common failure types are NameNode failures, DataNode failures, and network partitions.

1. Disk errors, heartbeat detection, and re‑replication: DataNodes send periodic heartbeats to the NameNode. A DataNode that stops sending heartbeats is marked dead and receives no further I/O requests; any blocks that consequently fall below their replication factor trigger the NameNode to schedule re‑replication.
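Dead-node detection from heartbeat timestamps is a simple sweep. A minimal sketch; the 10-minute timeout is an assumed value, not necessarily the configured default.

```python
def find_dead_datanodes(last_heartbeat, now, timeout=600.0):
    """Return the DataNodes whose most recent heartbeat is older than the
    timeout. `last_heartbeat` maps node name -> timestamp in seconds."""
    return {node for node, t in last_heartbeat.items() if now - t > timeout}
```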

2. Cluster balancing: HDFS's architecture is compatible with balancing schemes that move data from heavily used DataNodes to those with free space, or that create additional replicas when demand for a file spikes; these schemes are not yet implemented.

3. Data integrity: Clients compute checksums for each block and store them as hidden files. When reading, the client verifies the block’s checksum against the stored value and can fetch another replica if mismatched.
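The client-side verify-on-read flow can be sketched with a CRC32 checksum (CRC32 is an assumption here, standing in for the checksum stored in the hidden file):

```python
import zlib

def write_with_checksum(data: bytes):
    """Return the block plus its CRC32, standing in for the data file and
    the hidden checksum file written alongside it."""
    return data, zlib.crc32(data)

def read_verified(block: bytes, stored_crc: int) -> bytes:
    """Verify a block against its stored checksum on read; a mismatch means
    the client should fetch another replica of the block."""
    if zlib.crc32(block) != stored_crc:
        raise IOError("checksum mismatch: fetch another replica")
    return block
```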

4. Metadata disk errors: FsImage and EditLog are critical; corruption would invalidate the entire cluster. The NameNode can be configured to maintain multiple copies of these files, synchronizing updates across copies.

The NameNode is a single point of failure; manual intervention is required if it crashes, and automatic failover to a standby NameNode is not yet implemented.

5. Snapshots: Snapshots would allow point‑in‑time copies for recovery, but HDFS currently does not support this feature.

8. Data Organization

1. Data blocks: HDFS‑compatible applications handle large data sets with a write‑once‑read‑many model. A typical block size is 64 MB; files are split into 64 MB chunks stored on different DataNodes.
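The chunking rule, fixed-size blocks with only the last one allowed to be smaller, is straightforward to sketch:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size noted above

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file contents into fixed-size blocks; every block has the same
    size except possibly the last."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```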

2. Procedure: When a client creates a file, data is first cached in a local temporary file. Once the temporary file exceeds a block size, the client contacts the NameNode to allocate a block and receive DataNode identifiers. The client then flushes the data to the designated DataNodes. Upon file close, any remaining data is flushed, and the client notifies the NameNode, which then commits the file creation to persistent storage.

3. Pipelined replication: For a replication factor of 3, the client obtains a list of three DataNodes. It streams data to the first DataNode, which writes a small chunk locally and simultaneously forwards it to the second DataNode. The second DataNode does the same, forwarding to the third, which only stores the data. This creates a pipeline replication flow.
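The pipeline flow above can be simulated in a few lines. DataNode storage is modeled as in-memory lists and the network hop as a function call; chunk sizes and node names are illustrative.

```python
def receive(chunk, pipeline, stores):
    """A DataNode persists the chunk locally, then forwards it to the next
    node in the pipeline; the last node only stores."""
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    stores[head].append(chunk)    # write this small portion to local storage
    receive(chunk, rest, stores)  # forward the same portion downstream

def pipeline_write(chunks, datanodes):
    """Client side: stream each chunk to the first DataNode only; the
    pipeline propagates it to the rest."""
    stores = {dn: [] for dn in datanodes}
    for chunk in chunks:
        receive(chunk, datanodes, stores)
    return stores
```

The point of the pipeline is that the client sends each chunk over the network once; replication bandwidth is paid by the DataNodes forwarding to each other, and in the real system forwarding overlaps with receiving the next portion.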

9. Accessibility

HDFS offers multiple access methods: the DFSShell command‑line tool, Java APIs, C language bindings, and a web interface. A WebDAV interface is under development.

10. Space Reclamation

1. File deletion and recovery: When a user or application deletes a file, HDFS renames it and moves it to the /trash directory instead of deleting it immediately. Files remain in /trash for a configurable period (default 6 hours) and can be restored before permanent deletion, after which the associated blocks are freed.
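The trash mechanism, rename into /trash on delete, free blocks only after the retention window, can be sketched as follows. The flat-dict namespace and timestamps in seconds are modeling assumptions.

```python
TRASH_RETENTION = 6 * 3600  # the default 6-hour window described above

def delete(namespace, trash, path, now):
    """Instead of freeing blocks, move the file's entry under /trash,
    remembering when it was deleted so it can still be restored."""
    trash["/trash" + path] = (namespace.pop(path), now)

def purge_expired(trash, now, retention=TRASH_RETENTION):
    """Permanently drop trash entries older than the retention window; at
    this point the associated blocks would actually be reclaimed."""
    expired = [p for p, (_, deleted_at) in trash.items() if now - deleted_at > retention]
    for p in expired:
        del trash[p]
    return expired
```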

2. Reducing replication factor: When a file’s replication factor is decreased, the NameNode selects excess replicas for removal. The next heartbeat informs the relevant DataNode to delete the corresponding blocks and release space, with a possible delay between the setReplication call and actual space reclamation.
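Excess-replica selection after a setReplication call reduces to partitioning the current replica set. A deliberately naive sketch; the real NameNode also weighs rack placement and free space when choosing which copies to drop.

```python
def select_excess_replicas(replica_nodes, new_factor):
    """Given the nodes currently holding a block's replicas and a lowered
    replication factor, return (nodes to keep, nodes told to delete the
    block on their next heartbeat). Keeping the first N is arbitrary."""
    return replica_nodes[:new_factor], replica_nodes[new_factor:]
```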

Tags: big data, Replication, Distributed Storage, HDFS, NameNode, DataNode
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
