Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features

This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.

Big Data Technology & Architecture

Aug 16, 2020

HDFS Application Scenarios, Principles, and Basic Architecture

HDFS is designed for large‑scale batch processing, supporting GB, TB, and even PB‑level data sets across thousands of nodes, and is widely used in big‑data analytics.

Advantages of HDFS

High fault tolerance : data is automatically replicated across multiple nodes; lost replicas are recreated.

Optimized for batch processing : computation moves to the data rather than the other way around.

Scalable for massive data : handles millions of files and scales to clusters of 10K+ nodes.

Streaming file access : write‑once, read‑many model guarantees consistency.

Runs on inexpensive hardware : reliability is achieved through replication rather than expensive machines.

Disadvantages of HDFS

High latency : not suitable for millisecond‑level random reads.

Small‑file inefficiency : every file, however small, consumes NameNode memory for its metadata, and with many small files the time spent locating data exceeds the time spent reading it.

Single‑writer limitation : a file can have only one writer at a time, and data can only be appended, not modified in place.

Metadata overhead : NameNode stores all block metadata in memory, limiting the total number of blocks.
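To make the small-file and metadata points concrete, here is a rough sketch that counts the metadata objects the NameNode would track for the same amount of data stored as one large file versus many small ones. The 150-bytes-per-object figure is a commonly quoted rule of thumb and an assumption here, not a number from this article; the function name is hypothetical.

```python
# Rule-of-thumb assumption: each file or block object costs ~150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in bytes


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count metadata objects: one per file plus one per block of that file."""
    objects = 0
    for size in file_sizes:
        blocks = -(-size // block_size)  # ceiling division; an empty file has 0 blocks
        objects += 1 + blocks
    return objects


one_big_file = namenode_objects([1024 ** 3])             # 1 GiB in a single file
many_small_files = namenode_objects([1024 ** 2] * 1024)  # 1 GiB as 1024 files of 1 MiB

print(one_big_file, one_big_file * BYTES_PER_OBJECT)
print(many_small_files, many_small_files * BYTES_PER_OBJECT)
```

Under these assumptions the same gigabyte costs 9 objects as one file and 2048 objects as small files, which is why large numbers of small files exhaust NameNode memory long before the DataNode disks fill up.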

Core Architecture and Data Blocks

HDFS stores files as fixed‑size blocks (default 128 MB, configurable). Each block is replicated (default three copies) and distributed across DataNodes. The NameNode holds metadata in memory, while DataNodes store the actual block files.
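As a small illustration of the block model, the sketch below splits a file size into block lengths and computes the raw storage consumed across all replicas. The function names are hypothetical, and the defaults mirror the 128 MB block size and three replicas mentioned above.

```python
MIB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MIB):
    """Return the length of each block for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block is only as long as the data left over, not a full block.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def raw_storage(file_size, replication=3):
    """Bytes consumed on DataNode disks once every block is replicated."""
    return file_size * replication


blocks = split_into_blocks(300 * MIB)
print([b // MIB for b in blocks])      # block lengths in MiB
print(raw_storage(300 * MIB) // MIB)   # raw MiB used across the cluster
```

A 300 MiB file therefore becomes two full blocks plus one 44 MiB block, and occupies 900 MiB of raw disk with three replicas.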

Write and Read Processes

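During a write, the client asks the NameNode to allocate a block and streams the data to the first DataNode in a pipeline; each DataNode forwards the data to the next, and acknowledgements travel back along the pipeline once the replicas are written. During a read, the client asks the NameNode for the block locations and then fetches each block directly from the nearest DataNode that holds a replica.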
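The write and read paths can be sketched as a toy in-memory model. This is not the HDFS client protocol: the node names, the `stores` dictionary, and the distance table are all hypothetical stand-ins used only to show the order of operations.

```python
def pipeline_write(packet, pipeline, stores):
    """Forward a packet through the DataNodes in order; return the ack order."""
    for node in pipeline:
        # Each node stores the packet, then passes it downstream.
        stores.setdefault(node, []).append(packet)
    # Acks flow back upstream: last node first, first node last.
    return list(reversed(pipeline))


def nearest_replica(replica_nodes, distance):
    """Pick the replica holder with the smallest network distance to the client."""
    return min(replica_nodes, key=lambda node: distance[node])


stores = {}
acks = pipeline_write(b"packet-0", ["dn1", "dn2", "dn3"], stores)
print(acks)
print(nearest_replica(["dn1", "dn3"], {"dn1": 4, "dn3": 2}))
```

In this model the client treats the write as complete only after the acknowledgement from the first node in the pipeline arrives, which implies every downstream node has stored the packet.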

Access Methods

HDFS can be accessed via:

HDFS Shell commands

Java API (org.apache.hadoop.fs)

REST API

FUSE (Filesystem in Userspace)

C/C++ libhdfs

Thrift for multi‑language clients (Python, PHP, C#, etc.)
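For the REST route, a minimal sketch of how a request URL might be assembled, assuming the REST API in question is WebHDFS with its `/webhdfs/v1/<path>?op=...` layout. The host name and port below are placeholders (the NameNode HTTP port depends on the Hadoop version and configuration), and the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS-style URL for the given HDFS path and operation."""
    query = {"op": op}
    query.update(params or {})
    # quote() leaves "/" untouched, so the HDFS path keeps its structure.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Placeholder NameNode address; adjust to the cluster in use.
print(webhdfs_url("namenode.example.com", 50070, "/hdfs/data", "LISTSTATUS"))
print(webhdfs_url("namenode.example.com", 50070, "/hdfs/data/a.txt", "OPEN",
                  {"user.name": "tom"}))
```

The resulting URL can be passed to any HTTP client; the sketch deliberately stops short of sending the request so that it runs without a cluster.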

Common Shell Commands (examples)

Show the Hadoop version: hdfs version.

Upload a local file: bin/hadoop fs -copyFromLocal /local/data /hdfs/data.

Delete a directory recursively: bin/hadoop fs -rmr /hdfs/data (deprecated in Hadoop 2.x in favor of bin/hadoop fs -rm -r /hdfs/data).

Create a directory: bin/hadoop fs -mkdir /hdfs/data.

Cluster Management Scripts

Start all services: start-all.sh, or start HDFS and YARN separately with start-dfs.sh and start-yarn.sh.

Start a single service: hadoop-daemon.sh start namenode.

Refresh node list after adding/removing DataNodes: bin/hdfs dfsadmin -refreshNodes.

Balancing Data

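Re‑distribute blocks to achieve balanced disk usage: bin/start-balancer.sh -threshold <percentage>. The threshold is the allowed deviation, in percentage points, of each DataNode's disk utilization from the cluster average (default 10). A lower threshold yields better balance but takes longer.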
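The meaning of the threshold can be shown with a short sketch that classifies DataNodes as over- or under-utilized relative to the cluster average. It models only the balance criterion, not how the balancer actually moves blocks; the function name and the sample usage figures are made up for illustration.

```python
def classify_nodes(usage, threshold=10.0):
    """usage maps node -> (used_bytes, capacity_bytes); returns (over, under)."""
    total_used = sum(used for used, _ in usage.values())
    total_capacity = sum(cap for _, cap in usage.values())
    cluster_pct = 100.0 * total_used / total_capacity

    over, under = [], []
    for node, (used, cap) in sorted(usage.items()):
        node_pct = 100.0 * used / cap
        if node_pct > cluster_pct + threshold:
            over.append(node)    # candidate source of blocks
        elif node_pct < cluster_pct - threshold:
            under.append(node)   # candidate target for blocks
    return over, under


usage = {"dn1": (90, 100), "dn2": (50, 100), "dn3": (10, 100)}
print(classify_nodes(usage, threshold=10))
print(classify_nodes(usage, threshold=45))
```

With the sample figures the cluster average is 50 percent, so a threshold of 10 flags dn1 and dn3, while a threshold of 45 already counts the cluster as balanced. This is why a lower threshold means more data to move and a longer run.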

Quota Management

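Set directory space quota: hdfs dfsadmin -setSpaceQuota 128M /test. The space quota is charged for every replica, so a file of 1 GB with three replicas consumes 3 GB of quota.

Set name quota (maximum number of files and directories): hdfs dfsadmin -setQuota 100 /test.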
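A simplified sketch of how a space quota interacts with replication, assuming the quota is charged for every replica and that size suffixes such as M are binary multiples. Both helper names are hypothetical, and the real NameNode accounting has additional details this model ignores.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a size such as '128M' into bytes (binary multiples assumed)."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def fits_space_quota(file_size, quota, used=0, replication=3):
    """True if writing the file keeps replicated usage within the quota."""
    return used + file_size * replication <= quota


quota = parse_size("128M")
print(fits_space_quota(parse_size("40M"), quota))  # 120M of replicas fits
print(fits_space_quota(parse_size("50M"), quota))  # 150M of replicas does not
```

Under these assumptions a 128M quota with three replicas leaves room for only about 42 MB of user data, which is worth keeping in mind when choosing quota values.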

Advanced Hadoop 2.0 Features

NameNode High Availability (HA)

NameNode Federation

Snapshots (read‑only point‑in‑time copies)

In‑memory cache for hot data

Access Control Lists (ACLs) for fine‑grained permissions

Heterogeneous storage hierarchy (disk, SSD, RAM) with APIs to control placement and quotas

ACL Example

hdfs dfs -setfacl -m user:tom:rw- /bank/exchange

hdfs dfs -setfacl -m group:team2:r-- /bank/exchange
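To show how the entries passed to -setfacl are structured, here is a small sketch that parses a `type:name:permissions` string into its parts. The parser is illustrative only and handles just the three-field form used in the examples above.

```python
def parse_acl_entry(spec):
    """Split an ACL entry like 'user:tom:rw-' into type, name, and permission flags."""
    entry_type, name, perms = spec.split(":")
    if len(perms) != 3:
        raise ValueError(f"expected three permission characters, got {perms!r}")
    return {
        "type": entry_type,
        "name": name,
        "read": perms[0] == "r",
        "write": perms[1] == "w",
        "execute": perms[2] == "x",
    }


print(parse_acl_entry("user:tom:rw-"))
print(parse_acl_entry("group:team2:r--"))
```

Read this way, the two commands grant user tom read and write access and give group team2 read-only access to /bank/exchange, without changing the file's owner or primary group.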

Snapshot Example

Enable snapshots on a directory: bin/hdfs dfsadmin -allowSnapshot /mydir.

Create a snapshot: bin/hdfs dfs -createSnapshot /mydir snap1.

Delete a snapshot: bin/hdfs dfs -deleteSnapshot /mydir snap1.

Snapshots are read‑only and are exposed under the .snapshot subdirectory of the snapshottable directory, for example /mydir/.snapshot/snap1.
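A short sketch of where a file from a snapshot can be found, assuming snapshots appear under `<directory>/.snapshot/<snapshot name>`. The helper name is hypothetical and the file path is an invented example.

```python
import posixpath


def snapshot_path(snapshottable_dir, snapshot_name, relative_path=""):
    """Build the HDFS path of a file as it existed in the given snapshot."""
    path = posixpath.join(snapshottable_dir, ".snapshot", snapshot_name, relative_path)
    return path.rstrip("/")


print(snapshot_path("/mydir", "snap1"))
print(snapshot_path("/mydir", "snap1", "reports/2020.csv"))
```

A path built this way can be handed to ordinary read commands, which is how files are typically restored from a snapshot: by copying them back out of the read-only snapshot path.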

Cache Management

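Caching is configured per path: caching a directory covers only the files directly inside it, not its subdirectories. Cache directives are grouped into pools, similar to YARN resource pools. A pool is created with hdfs cacheadmin -addPool, and a path is added to or removed from the cache with hdfs cacheadmin -addDirective and hdfs cacheadmin -removeDirective (example omitted for brevity).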

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
