Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features
This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.
HDFS Application Scenarios, Principles, and Basic Architecture
HDFS is designed for large‑scale batch processing, supporting GB, TB, and even PB‑level data sets across thousands of nodes, and is widely used in big‑data analytics.
Advantages of HDFS
High fault tolerance : data is automatically replicated across multiple nodes; lost replicas are recreated.
Optimized for batch processing : computation moves to the data rather than the other way around.
Scalable for massive data : can handle millions of files and clusters with 10K+ nodes.
Streaming file access : write‑once, read‑many model guarantees consistency.
Runs on inexpensive hardware : reliability is achieved through replication rather than expensive machines.
Disadvantages of HDFS
High latency : not suitable for millisecond‑level random reads.
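Small‑file inefficiency : every file, however small, consumes NameNode memory for its metadata, and with many small files the time spent seeking exceeds the time spent reading data.
Single‑writer limitation : a file can have only one writer at a time, and existing data cannot be modified in place; only appends are supported.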
Metadata overhead : NameNode stores all block metadata in memory, limiting the total number of blocks.
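To make the small‑file and metadata costs concrete, here is a rough back‑of‑the‑envelope sketch in Python. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure is an approximation and does not come from this article, and the function name is purely illustrative.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact HDFS figure


def namenode_heap_estimate(num_files, file_size, block_size=128 * 1024 * 1024):
    """Roughly estimate NameNode heap (bytes) for num_files files of file_size bytes."""
    # Every non-empty file needs at least one block, even if it is far smaller
    # than block_size.
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


# The same 1 TiB of data stored two ways.
one_tib = 1024 ** 4
small = namenode_heap_estimate(one_tib // 1024 ** 2, 1024 ** 2)  # 1 MiB files
large = namenode_heap_estimate(one_tib // 1024 ** 3, 1024 ** 3)  # 1 GiB files
print(small, large)
```

Under these assumptions, the 1 MiB layout needs a few hundred megabytes of heap while the 1 GiB layout needs about a megabyte, which is why consolidating small files matters.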
Core Architecture and Data Blocks
HDFS stores files as fixed‑size blocks (default 128 MB, configurable). Each block is replicated (default three copies) and distributed across DataNodes. The NameNode holds metadata in memory, while DataNodes store the actual block files.
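The block arithmetic can be sketched in a few lines of Python. This is a toy calculation rather than HDFS code; the function names are illustrative, and the defaults mirror the 128 MB block size and three replicas mentioned above.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (bytes) of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the last block holds only the remaining bytes
    return sizes


def raw_storage(file_size, replication=3):
    """Bytes consumed across all DataNodes once every block is replicated."""
    return file_size * replication


blocks = split_into_blocks(300 * MB)
print([b // MB for b in blocks])     # block sizes in MB
print(raw_storage(300 * MB) // MB)   # total replicated size in MB
```

A 300 MB file therefore occupies two full blocks plus one 44 MB block, and 900 MB of raw disk space across the cluster.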
Write and Read Processes
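During a write, the client contacts the NameNode for block allocation and streams data to the first DataNode of a pipeline; each DataNode forwards the data to the next, and acknowledgements travel back along the pipeline once the block is replicated. During a read, the client asks the NameNode for the block locations and then fetches each block directly from the nearest DataNode that holds a replica.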
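How a "nearest" DataNode might be chosen can be illustrated with a small Python sketch. It is modeled on the tree‑distance idea behind Hadoop's rack awareness (count the hops up to the closest common ancestor and back down), but it is a simplified stand‑in, and the node paths such as /dc1/rack1/node1 are hypothetical.

```python
def topology_distance(a, b):
    """Hops from node a up to the closest common ancestor and down to node b."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def nearest_replica(client, replicas):
    """Pick the replica location with the smallest topology distance to the client."""
    return min(replicas, key=lambda node: topology_distance(client, node))


client = "/dc1/rack1/node1"
replicas = ["/dc1/rack2/node5", "/dc1/rack1/node3", "/dc2/rack9/node7"]
print(nearest_replica(client, replicas))
```

In this model a replica on the same node has distance 0, one on the same rack 2, one on another rack 4, and one in another data center 6, so the same‑rack replica is chosen here.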
Access Methods
HDFS can be accessed via:
HDFS Shell commands
Java API (org.apache.hadoop.fs)
REST API
FUSE (Filesystem in Userspace)
C/C++ libhdfs
Thrift for multi‑language clients (Python, PHP, C#, etc.)
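As an illustration of the REST route, WebHDFS exposes file system operations under URLs of the form http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>. The Python sketch below only builds such a URL and sends nothing over the network. The host name is hypothetical, and port 50070 is assumed as the Hadoop 2.x NameNode HTTP default; check your cluster's configuration before relying on either.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL; no network request is made."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List the contents of /hdfs/data on a hypothetical NameNode.
print(webhdfs_url("namenode.example.com", 50070, "/hdfs/data", "LISTSTATUS"))
```

Any HTTP client can then issue a GET request against the resulting URL.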
Common Shell Commands (examples)
Show the Hadoop version: hdfs version.
Upload a local file: bin/hadoop fs -copyFromLocal /local/data /hdfs/data.
Delete a directory recursively: bin/hadoop fs -rmr /hdfs/data (deprecated in Hadoop 2.x in favor of -rm -r).
Create a directory: bin/hadoop fs -mkdir /hdfs/data.
Cluster Management Scripts
Start all services: start-all.sh (or start-dfs.sh, start-yarn.sh).
Start a single service: hadoop-daemon.sh start namenode.
Refresh node list after adding/removing DataNodes: bin/hdfs dfsadmin -refreshNodes.
Balancing Data
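Redistribute blocks to even out disk usage across DataNodes: bin/start-balancer.sh -threshold <percentage>. The threshold is the allowed deviation, in percentage points, of each DataNode's disk usage from the cluster average; a lower threshold yields better balance but takes longer.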
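The effect of the threshold can be sketched as follows. This is a simplified Python model and not the balancer's actual algorithm: it only compares each node's usage with the cluster‑wide average, and the node names, capacities, and the default threshold value used in the function are made up for illustration.

```python
def classify_nodes(nodes, threshold=10.0):
    """Split DataNodes into over-utilized, under-utilized and balanced sets.

    nodes maps a node name to (used_bytes, capacity_bytes); threshold is in
    percentage points, like the balancer's -threshold argument.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    cluster_pct = 100.0 * total_used / total_cap
    result = {"over": [], "under": [], "balanced": []}
    for name, (used, cap) in sorted(nodes.items()):
        node_pct = 100.0 * used / cap
        if node_pct > cluster_pct + threshold:
            result["over"].append(name)
        elif node_pct < cluster_pct - threshold:
            result["under"].append(name)
        else:
            result["balanced"].append(name)
    return result


nodes = {"dn1": (90, 100), "dn2": (50, 100), "dn3": (10, 100)}
print(classify_nodes(nodes, threshold=10.0))
print(classify_nodes(nodes, threshold=50.0))
```

With a threshold of 10, dn1 and dn3 fall outside the allowed band around the 50% cluster average and would need blocks moved; with a threshold of 50, all three nodes already count as balanced and no data would move.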
Quota Management
Set directory space quota: hdfs dfsadmin -setSpaceQuota 128M /test.
Set a name quota (the maximum number of file and directory names under the directory): hdfs dfsadmin -setQuota 100 /test.
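Space quotas interact with replication and block size in a way that is easy to overlook. The Python sketch below models the documented rule that every replica counts against the space quota and that a block allocation fails if a full block would not fit. It is a simplified model under those assumptions, with illustrative function names, and not a reimplementation of the NameNode's checks.

```python
MB = 1024 ** 2


def parse_size(text):
    """Convert a size such as '128M' or '1g' to bytes (binary multiples)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def can_allocate_block(used, quota, block_size=128 * MB, replication=3):
    """True if one more full block, counted once per replica, fits in the quota."""
    return used + block_size * replication <= quota


print(can_allocate_block(0, parse_size("128M")))       # quota from the example above
print(can_allocate_block(0, parse_size("1G")))         # a larger quota
print(can_allocate_block(768 * MB, parse_size("1G")))  # same quota, mostly used
```

Under this model, a 128M space quota combined with 128 MB blocks and three replicas leaves no room for even a single block, so the quota should be sized as a multiple of block size times replication factor.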
Advanced Hadoop 2.0 Features
NameNode High Availability (HA)
NameNode Federation
Snapshots (read‑only point‑in‑time copies)
In‑memory cache for hot data
Access Control Lists (ACLs) for fine‑grained permissions
Heterogeneous storage hierarchy (disk, SSD, RAM) with APIs to control placement and quotas
ACL Example
hdfs dfs -setfacl -m user:tom:rw- /bank/exchange
hdfs dfs -setfacl -m group:team2:r-- /bank/exchange
Snapshot Example
Enable snapshots on a directory: bin/hdfs dfsadmin -allowSnapshot /mydir.
Create a snapshot: bin/hdfs dfs -createSnapshot /mydir snap1.
Delete a snapshot: bin/hdfs dfs -deleteSnapshot /mydir snap1.
Snapshots are read‑only and are accessed through the .snapshot directory inside the snapshottable directory, for example /mydir/.snapshot/snap1.
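A file captured in a snapshot is reachable at <directory>/.snapshot/<snapshot name>/<relative path>. The small Python helper below shows how such a path is composed; the helper name and the file logs/app.log are hypothetical and serve only as an illustration.

```python
import posixpath


def snapshot_path(snapshottable_dir, snapshot_name, relative_path=""):
    """Return the read-only path of a file or directory as captured in a snapshot."""
    parts = [snapshottable_dir, ".snapshot", snapshot_name]
    if relative_path:
        parts.append(relative_path.lstrip("/"))
    return posixpath.join(*parts)


print(snapshot_path("/mydir", "snap1"))
print(snapshot_path("/mydir", "snap1", "logs/app.log"))
```

Restoring an accidentally deleted file then amounts to copying it from the snapshot path back into the live directory.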
Cache Management
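Caching is managed at the directory level (caching a directory covers only the files directly in it, not its subdirectories) and is organized into pools, similar to YARN resource pools. A pool is created with hdfs cacheadmin -addPool, and paths are added to or removed from the cache with hdfs cacheadmin -addDirective and hdfs cacheadmin -removeDirective (example omitted for brevity).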
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
