Big Data 16 min read

Understanding HDFS: Design Goals, Architecture, and Data Replication

This article explains HDFS’s core design principles, including fault tolerance, high‑throughput data access, master‑slave architecture with Namenode and Datanodes, namespace management, block replication strategies, safe mode, metadata persistence, communication protocols, robustness mechanisms, and file operations such as creation, deletion, and space reclamation.

ITFLY8 Architecture Home
ITFLY8 Architecture Home
ITFLY8 Architecture Home
Understanding HDFS: Design Goals, Architecture, and Data Replication

Design Goals

Hardware errors are normal; HDFS must detect and recover automatically across thousands of servers. Applications primarily perform streaming reads for batch processing, emphasizing high throughput over low latency. Files are large (GB–TB) and HDFS should handle tens of millions of files. The write‑once‑read‑many model simplifies consistency and enables high‑throughput access, as seen in MapReduce or web crawlers. Moving computation to data is more efficient than moving data to computation, especially at massive scale. Portability across heterogeneous hardware is also a goal.

Namenode and Datanode

HDFS uses a master/slave architecture with a single Namenode managing the namespace and client access, and multiple Datanodes storing blocks. Files are split into blocks distributed across Datanodes. Namenode handles namespace operations and block‑to‑Datanode mapping; Datanodes create, delete, and replicate blocks under Namenode’s direction. Both run on ordinary Linux machines and are Java‑based.

A single‑node Namenode simplifies architecture; client data reads/writes occur directly on Datanodes.

Filesystem Namespace

HDFS supports a hierarchical namespace similar to traditional file systems. Users can create, delete, move, and rename directories and files. While user quotas and permissions are not supported, the design does not preclude future implementation. Replication factor is stored in the namespace.

Data Replication

Files are stored as sequences of blocks, each replicated for fault tolerance. Block size and replication factor are configurable; the default replication factor is three. Replicas are placed using a rack‑aware strategy: one replica on the local rack, another on a different node of the same rack, and the third on a different rack, improving reliability and read performance.

When a reader is on the same rack as a replica, it reads that replica first; across data‑center clusters, the local data‑center replica is preferred.

SafeMode

After startup, Namenode enters SafeMode, during which it does not replicate blocks. It waits until a configurable percentage of blocks meet their minimum replication before exiting SafeMode and initiating any needed replications.

Metadata Persistence

Namenode records all metadata changes in an EditLog and stores the complete namespace and block map in FsImage. On startup, it loads FsImage, replays EditLog entries, and writes a new FsImage, truncating the old EditLog—a process called checkpointing.

Communication Protocols

All HDFS communication is built on TCP/IP. Clients talk to Namenode via ClientProtocol; Datanodes use DatanodeProtocol. Both are RPC‑based, with Namenode responding to requests rather than initiating them.

Robustness

HDFS handles failures of Namenodes, Datanodes, and network partitions. Heartbeats detect dead Datanodes, triggering replication of under‑replicated blocks. Cluster balancing plans can move data from overloaded Datanodes to free ones (not yet implemented). Data integrity is ensured via checksums stored as hidden files; mismatched blocks are re‑fetched from other replicas. FsImage and EditLog can be mirrored to protect against disk corruption. Namenode is a single point of failure; manual intervention is required to recover a failed Namenode.

Snapshots

Snapshot functionality, which would allow point‑in‑time recovery, is not currently supported.

Data Organization

Data is stored in blocks (default 64 MB). Clients cache data locally in a temporary file; when the buffer exceeds a block size, the client contacts Namenode for block allocation and Datanode locations, then streams data to the first Datanode, which pipelines it to the next replicas (pipeline replication).

Accessibility

HDFS can be accessed via DFSShell, Java API, C API, web browsers, and upcoming WebDAV support.

Space Reclamation

Deleted files are moved to a /trash directory and can be restored within a configurable retention period; after that, they are permanently removed and their blocks freed. Reducing a file’s replication factor causes excess replicas to be deleted during the next heartbeat cycle.

References

HDFS Java API: http://hadoop.apache.org/core/docs/current/api/

HDFS source code: http://hadoop.apache.org/core/version_control.html

Original article: http://www.blogjava.net/killme2008/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

distributed storageHDFS
ITFLY8 Architecture Home
Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.