Understanding HDFS: Architecture, Read/Write Operations, Component Roles, Commands, and Pros & Cons
This article provides a comprehensive overview of HDFS, covering its purpose, architecture, read/write mechanisms, replication strategies, component responsibilities, common command‑line tools, and the advantages and disadvantages of using Hadoop Distributed File System for large‑scale data storage.
HDFS (Hadoop Distributed File System) is a fault‑tolerant distributed file system designed to run on commodity hardware, offering high throughput data access for massive data sets by relaxing some POSIX constraints and focusing on streaming reads.
The architecture consists of a master NameNode that stores metadata, multiple DataNode workers that hold actual data blocks, a SecondaryNameNode that assists with checkpointing, and client processes that interact with the NameNode to locate blocks.
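Write operation: The client calls create, and the NameNode checks that the file does not already exist and that the client has permission, then records the new file in the namespace. The client splits the data into blocks (128 MB by default in Hadoop 2.x and later), and for each block the NameNode allocates a pipeline of DataNodes. The client streams the block to the first DataNode in small packets; each DataNode stores a packet and forwards it to the next DataNode, and acknowledgements travel back along the pipeline to the client. Once the last packet is acknowledged, the client closes the file and the NameNode marks it complete.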
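The write path can be sketched as a toy simulation. This is not the Hadoop client API: ToyDataNode, split_into_blocks, and write_block are hypothetical names, and the 64 KB packet size is an assumption made for the sketch. It only shows how a file maps to blocks and how each packet passes through every DataNode in the pipeline before it counts as acknowledged.

```python
from dataclasses import dataclass

BLOCK_SIZE = 128 * 1024 * 1024  # default block size cited in the text
PACKET_SIZE = 64 * 1024         # assumed packet size for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block needed for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


@dataclass
class ToyDataNode:
    """Hypothetical stand-in for a DataNode; it only counts stored bytes."""
    name: str
    stored: int = 0


def write_block(pipeline, block_len, packet_size=PACKET_SIZE):
    """Stream one block through the pipeline; return the number of acked packets."""
    acked = 0
    sent = 0
    while sent < block_len:
        size = min(packet_size, block_len - sent)
        for node in pipeline:   # each node stores the packet, then forwards it
            node.stored += size
        acked += 1              # the ack counts once every node has stored the packet
        sent += size
    return acked
```

In this model a 300 MB file becomes two full blocks plus one partial block, and every DataNode in the pipeline ends up holding a full copy of each block written through it.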
Read operation: The client calls open, and the NameNode returns, for each block, the DataNodes that hold a replica, sorted by network distance from the client. The client then reads each block directly from the closest DataNode; the file data itself never passes through the NameNode.
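A minimal sketch of the "closest replica first" ordering, assuming each node is identified by a hypothetical two-level path such as /rack1/node2. The distance here is the number of hops to the nearest common ancestor in that tree, so it is 0 for the same node, 2 for the same rack, and 4 for different racks. The function names are illustrative, not Hadoop APIs.

```python
def network_distance(a, b):
    """Hops between two nodes given as paths like '/rack1/node2'."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Steps up from each node to the deepest shared ancestor.
    return (len(pa) - common) + (len(pb) - common)


def order_replicas(client, replicas):
    """Sort replica locations so the closest one to the client comes first."""
    return sorted(replicas, key=lambda node: network_distance(client, node))


locations = ["/rack2/node5", "/rack1/node2", "/rack1/node1"]
print(order_replicas("/rack1/node1", locations))
```

A client running on a DataNode that already holds a replica therefore reads locally, and a remote rack is used only when no closer copy exists.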
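Replication placement has evolved (for the default replication factor of three): the original strategy placed the first replica on a node in the local rack, the second on a different node in the same rack, and the third on a node in a different rack; newer versions place the first replica on the client's node (or on a random node if the client is not running on a DataNode), the second on a node in a different rack, and the third on another node in the same rack as the second. This keeps one copy local to the writer and one rack's worth of protection against rack failure, while the write pipeline crosses racks only once.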
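The newer placement rule can be sketched as follows. This is a simplified illustration, not Hadoop's actual block placement code: choose_targets and the racks mapping are hypothetical, and the sketch raises an error where a real cluster would fall back to other nodes.

```python
import random


def choose_targets(writer, racks, rng=None):
    """Pick three replica targets: the writer's node, a node on a remote
    rack, and a second node on that same remote rack.

    racks maps a rack name to the list of DataNode names on that rack.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}

    # First replica: the writer's own node if it is a DataNode, else a random node.
    first = writer if writer in rack_of else rng.choice(sorted(rack_of))

    # Second and third replicas: two distinct nodes on one rack other than the first's.
    remote = [r for r in sorted(racks) if r != rack_of[first] and len(racks[r]) >= 2]
    if not remote:
        raise ValueError("sketch needs a remote rack with at least two nodes")
    second, third = rng.sample(racks[rng.choice(remote)], 2)
    return [first, second, third]


cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(choose_targets("dn1", cluster, random.Random(0)))
```

Whatever nodes the random choice selects, the result always spans exactly two racks, which is the property the policy is designed around.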
Component roles:
NameNode: manages the namespace, block‑to‑file mapping, replication policies, and client read/write requests.
DataNode: stores the actual block data, sends heartbeats and block reports to the NameNode, and serves read/write traffic.
SecondaryNameNode: merges FsImage and EditLog for checkpointing and backup, but does not act as a hot standby.
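The checkpointing role of the SecondaryNameNode can be illustrated with a toy model: the FsImage is a snapshot of the namespace, the EditLog is the list of changes made since that snapshot, and a checkpoint replays the log onto the snapshot to produce a new image. The dictionary layout and the three operation names below are assumptions made for this sketch and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay edit-log entries onto a copy of the namespace snapshot.

    fsimage:  dict mapping a path to its metadata, e.g. {"replication": 3}
    edit_log: list of tuples such as ("create", path, replication),
              ("set_replication", path, replication) or ("delete", path)
    """
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = {"replication": args[0]}
        elif op == "set_replication":
            new_image[path]["replication"] = args[0]
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


image = {"/logs/a.log": {"replication": 3}}
edits = [
    ("create", "/logs/b.log", 3),
    ("set_replication", "/logs/a.log", 2),
    ("delete", "/logs/b.log"),
]
print(checkpoint(image, edits))
```

After a checkpoint the merged image replaces the old one and the replayed log entries can be discarded, which keeps NameNode restart time bounded.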
Common HDFS commands mirror Linux file operations. For example, hadoop fs -ls /tmp/ lists the files under /tmp/, and hdfs fsck <path> checks the health of the files under a path; with the -files -blocks -locations options it also reports block locations. The hadoop fs command set includes cat, chgrp, chmod, chown, cp, df, ls, put, rm, mkdir, and others.
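For scripting, the two commands above can be assembled as argument lists. The helper names are hypothetical, and the sketch only builds and prints the command lines; actually running them assumes a host with the Hadoop client installed and configured.

```python
import shlex


def ls_command(path):
    """Argument list for listing an HDFS directory."""
    return ["hadoop", "fs", "-ls", path]


def fsck_command(path, show_locations=False):
    """Argument list for checking file health, optionally with block locations."""
    argv = ["hdfs", "fsck", path]
    if show_locations:
        argv += ["-files", "-blocks", "-locations"]
    return argv


print(shlex.join(ls_command("/tmp/")))
print(shlex.join(fsck_command("/tmp/", show_locations=True)))
```

On a configured client, either list could be passed to subprocess.run to execute the command and capture its output.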
Advantages of HDFS:
Supports storage of massive data volumes.
Detects and quickly recovers from hardware failures.
Provides streaming data access.
Simplified consistency model.
High fault tolerance.
Runs on commodity hardware.
Disadvantages of HDFS:
Not suitable for low‑latency access.
Poor handling of a large number of small files.
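Limited file modification: files are write-once, and existing data cannot be changed in place (appending is supported from Hadoop 2.x).
No concurrent writers: a file can be written by only one client at a time.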
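The small-files limitation listed above comes from the NameNode keeping every file and block object in memory. A rough estimate, assuming the commonly quoted rule of thumb of about 150 bytes of NameNode heap per object (an assumption, not a measured figure), shows why the same data volume is far cheaper to track as large files. The function names are hypothetical.

```python
BYTES_PER_OBJECT = 150            # assumed rule-of-thumb heap cost per object
BLOCK_SIZE = 128 * 1024 * 1024    # default block size cited in the text


def namenode_objects(num_files, file_size, block_size=BLOCK_SIZE):
    """Count one object per file plus one per block of that file."""
    blocks_per_file = -(-file_size // block_size)  # ceiling division
    return num_files * (1 + blocks_per_file)


def estimated_heap_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Approximate NameNode memory needed to track the given files."""
    return namenode_objects(num_files, file_size, block_size) * BYTES_PER_OBJECT


MB = 1024 * 1024
small = estimated_heap_bytes(10_000_000, 1 * MB)   # ten million 1 MB files
large = estimated_heap_bytes(78_125, 128 * MB)     # same volume as 128 MB files
print(small, large)
```

Under these assumptions the ten million small files need over a hundred times more NameNode memory than the same data stored as block-sized files.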
Hadoop 2.x introduced NameNode Federation and NameNode HA, which address namespace scalability and the NameNode single point of failure, respectively.
360 Tech Engineering
