Understanding HDFS: Design Goals, Architecture, and Data Replication
This article explains HDFS’s core design principles, including fault tolerance, high‑throughput data access, master‑slave architecture with Namenode and Datanodes, namespace management, block replication strategies, safe mode, metadata persistence, communication protocols, robustness mechanisms, and file operations such as creation, deletion, and space reclamation.
Design Goals
Hardware errors are normal; HDFS must detect and recover automatically across thousands of servers. Applications primarily perform streaming reads for batch processing, emphasizing high throughput over low latency. Files are large (GB–TB) and HDFS should handle tens of millions of files. The write‑once‑read‑many model simplifies consistency and enables high‑throughput access, as seen in MapReduce or web crawlers. Moving computation to data is more efficient than moving data to computation, especially at massive scale. Portability across heterogeneous hardware is also a goal.
Namenode and Datanode
HDFS uses a master/slave architecture with a single Namenode managing the namespace and client access, and multiple Datanodes storing blocks. Files are split into blocks distributed across Datanodes. Namenode handles namespace operations and block‑to‑Datanode mapping; Datanodes create, delete, and replicate blocks under Namenode’s direction. Both run on ordinary Linux machines and are Java‑based.
A single‑node Namenode simplifies architecture; client data reads/writes occur directly on Datanodes.
Filesystem Namespace
HDFS supports a hierarchical namespace similar to traditional file systems. Users can create, delete, move, and rename directories and files. While user quotas and permissions are not supported, the design does not preclude future implementation. Replication factor is stored in the namespace.
Data Replication
Files are stored as sequences of blocks, each replicated for fault tolerance. Block size and replication factor are configurable; the default replication factor is three. Replicas are placed using a rack‑aware strategy: one replica on the local rack, another on a different node of the same rack, and the third on a different rack, improving reliability and read performance.
When a reader is on the same rack as a replica, it reads that replica first; across data‑center clusters, the local data‑center replica is preferred.
SafeMode
After startup, Namenode enters SafeMode, during which it does not replicate blocks. It waits until a configurable percentage of blocks meet their minimum replication before exiting SafeMode and initiating any needed replications.
Metadata Persistence
Namenode records all metadata changes in an EditLog and stores the complete namespace and block map in FsImage. On startup, it loads FsImage, replays EditLog entries, and writes a new FsImage, truncating the old EditLog—a process called checkpointing.
Communication Protocols
All HDFS communication is built on TCP/IP. Clients talk to Namenode via ClientProtocol; Datanodes use DatanodeProtocol. Both are RPC‑based, with Namenode responding to requests rather than initiating them.
Robustness
HDFS handles failures of Namenodes, Datanodes, and network partitions. Heartbeats detect dead Datanodes, triggering replication of under‑replicated blocks. Cluster balancing plans can move data from overloaded Datanodes to free ones (not yet implemented). Data integrity is ensured via checksums stored as hidden files; mismatched blocks are re‑fetched from other replicas. FsImage and EditLog can be mirrored to protect against disk corruption. Namenode is a single point of failure; manual intervention is required to recover a failed Namenode.
Snapshots
Snapshot functionality, which would allow point‑in‑time recovery, is not currently supported.
Data Organization
Data is stored in blocks (default 64 MB). Clients cache data locally in a temporary file; when the buffer exceeds a block size, the client contacts Namenode for block allocation and Datanode locations, then streams data to the first Datanode, which pipelines it to the next replicas (pipeline replication).
Accessibility
HDFS can be accessed via DFSShell, Java API, C API, web browsers, and upcoming WebDAV support.
Space Reclamation
Deleted files are moved to a /trash directory and can be restored within a configurable retention period; after that, they are permanently removed and their blocks freed. Reducing a file’s replication factor causes excess replicas to be deleted during the next heartbeat cycle.
References
HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
HDFS source code: http://hadoop.apache.org/core/version_control.html
Original article: http://www.blogjava.net/killme2008/
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
