Understanding HDFS Architecture: Key Components, Protocols, and Limitations
This article explains HDFS’s master‑slave architecture, detailing the roles of NameNode and DataNode, namespace management, communication protocols, client functions, common configuration parameters, maintenance commands, and the inherent limitations of a single‑NameNode design.
HDFS Architecture
Overview
HDFS uses a master/slave model consisting of a single NameNode and multiple DataNodes. The NameNode manages the file system namespace and regulates client access to files. Each DataNode runs as a process that serves client read and write requests and creates, deletes, and replicates data blocks on instruction from the NameNode, storing the blocks as files on its local Linux file system.
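To make the division of labor concrete, here is a minimal sketch of the metadata side only. It is a toy model, not HDFS code: the class name `ToyNameNode`, the 128 MiB block size, the replication factor of 3, and the round-robin placement are all assumptions chosen for illustration (real HDFS placement is rack-aware and more involved).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MiB
REPLICATION = 3                 # assumed replication factor


class ToyNameNode:
    """Metadata only: which blocks make up a file and which DataNodes hold them."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        if not self.datanodes:
            raise ValueError("at least one DataNode is required")
        self.block_map = {}       # path -> [(block_id, [datanode, ...]), ...]
        self._next_block_id = 0

    def add_file(self, path, size_bytes):
        # Ceiling division: a final partial block still occupies one block entry.
        n_blocks = -(-size_bytes // BLOCK_SIZE)
        replicas = min(REPLICATION, len(self.datanodes))
        blocks = []
        for _ in range(n_blocks):
            # Hypothetical round-robin placement; each replica lands on a distinct node.
            start = self._next_block_id % len(self.datanodes)
            holders = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(replicas)
            ]
            blocks.append((self._next_block_id, holders))
            self._next_block_id += 1
        self.block_map[path] = blocks
        return blocks

    def locate(self, path):
        """What a client would ask for before reading: block ids and their holders."""
        return self.block_map[path]


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
for block_id, holders in nn.add_file("/logs/app.log", 300 * 1024 * 1024):
    print(block_id, holders)
```

The point of the sketch is that the NameNode never touches file contents. It only records where blocks live, and the DataNodes named in `holders` would store and serve the actual bytes.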
Namespace Management
The HDFS namespace consists of directories, files, and blocks.
In HDFS 1.0 there is only one namespace and one NameNode that manages it.
HDFS organizes files hierarchically, so users can create, delete, move, and rename directories and files much as they would in a regular file system.
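The namespace operations listed above can be sketched with a small in-memory tree. This is an illustrative model under assumed names (`ToyNamespace` and its methods are hypothetical, not part of any Hadoop API), and it ignores permissions, blocks, and persistence.

```python
import posixpath


class ToyNamespace:
    """A single hierarchical namespace: absolute path -> "dir" or "file"."""

    def __init__(self):
        self.entries = {"/": "dir"}

    def _subtree(self, path):
        # The entry itself plus everything stored beneath it.
        prefix = path.rstrip("/") + "/"
        return [p for p in self.entries if p == path or p.startswith(prefix)]

    def mkdir(self, path):
        # Creates missing parents along the way, similar to `mkdir -p`.
        current = "/"
        for part in (p for p in path.split("/") if p):
            current = posixpath.join(current, part)
            if self.entries.setdefault(current, "dir") != "dir":
                raise NotADirectoryError(current)

    def create(self, path):
        parent = posixpath.dirname(path)
        if self.entries.get(parent) != "dir":
            raise FileNotFoundError(parent)
        self.entries[path] = "file"

    def rename(self, src, dst):
        if src not in self.entries:
            raise FileNotFoundError(src)
        # Moving a directory moves its whole subtree; only metadata changes.
        for old in self._subtree(src):
            self.entries[dst + old[len(src):]] = self.entries.pop(old)

    def delete(self, path):
        for old in self._subtree(path):
            del self.entries[old]

    def listdir(self, path):
        return sorted(
            p for p in self.entries
            if p != path and posixpath.dirname(p) == path
        )


ns = ToyNamespace()
ns.mkdir("/data/raw")
ns.create("/data/raw/a.csv")
ns.rename("/data/raw", "/data/archive")
print(ns.listdir("/data/archive"))
```

Note that a rename here rewrites only dictionary keys. That mirrors the idea that moving a file in the namespace is a metadata operation rather than a copy of the underlying data.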
Communication Protocol
All data transfers occur over the network because HDFS is a distributed file system.
Protocols are built on top of TCP/IP.
Clients initiate TCP connections to the NameNode on a configurable port and interact via the client protocol.
NameNode and DataNode communicate using the DataNode protocol.
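Both the client protocol and the DataNode protocol are wrapped in a Remote Procedure Call (RPC) abstraction. By design, the NameNode never initiates an RPC; it only responds to requests issued by clients and DataNodes.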
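One consequence of the NameNode never initiating calls is that any instruction it has for a DataNode must travel back as the reply to a request the DataNode made, such as a periodic heartbeat. The sketch below models that direction of control. The class, the method names, and the 30-second timeout are hypothetical values for illustration, not HDFS defaults.

```python
class HeartbeatNameNode:
    """Toy model: the NameNode only reacts to calls made by DataNodes."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # illustrative value, in seconds
        self.last_seen = {}   # datanode -> time of its last heartbeat
        self.pending = {}     # datanode -> commands waiting for its next heartbeat

    def queue_command(self, datanode, command):
        # The NameNode cannot push this; it waits until the DataNode calls in.
        self.pending.setdefault(datanode, []).append(command)

    def handle_heartbeat(self, datanode, now):
        """Handler invoked BY a DataNode; the return value carries queued commands."""
        self.last_seen[datanode] = now
        return self.pending.pop(datanode, [])

    def live_datanodes(self, now):
        # A DataNode counts as live if it has called in within the timeout window.
        return sorted(
            dn for dn, seen in self.last_seen.items()
            if now - seen <= self.heartbeat_timeout
        )


nn = HeartbeatNameNode()
nn.queue_command("dn1", "replicate blk_1 to dn2")
print(nn.handle_heartbeat("dn1", now=0.0))   # queued command is delivered in the reply
print(nn.handle_heartbeat("dn1", now=3.0))   # nothing pending this time
```

Timestamps are passed in explicitly so the example stays deterministic. A real system would read a clock instead.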
Client
The client is the most common way users interact with HDFS; a client library is provided with the deployment.
The HDFS client exposes a file system interface that abstracts most implementation complexities.
Strictly speaking, the client is not part of HDFS itself.
It supports operations such as open, read, and write, and provides a shell-like command line for data access.
HDFS also offers a Java API for programmatic access.
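As a rough illustration of the shell-like command line, the sketch below assembles `hdfs dfs` invocations as argument lists. The helper names `dfs_command` and `run` are hypothetical, the paths are made up, and actually executing a command assumes a Hadoop installation with `hdfs` on the `PATH`.

```python
import subprocess


def dfs_command(action, *args):
    """Build an `hdfs dfs -<action> ...` argument list without running it."""
    return ["hdfs", "dfs", f"-{action}", *args]


def run(argv):
    """Run a prepared command and return its standard output.

    Requires a working Hadoop client on this machine; raises
    CalledProcessError if the command exits with a non-zero status.
    """
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


# Typical file operations, expressed as commands (paths are examples only).
examples = [
    dfs_command("mkdir", "-p", "/data/raw"),
    dfs_command("put", "local.csv", "/data/raw/local.csv"),
    dfs_command("ls", "/data/raw"),
    dfs_command("cat", "/data/raw/local.csv"),
]
for argv in examples:
    print(" ".join(argv))
```

Building the argument list separately from running it keeps the example usable on a machine without a cluster, and avoids passing paths through a shell.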
Limitations of HDFS Architecture
Having a single NameNode simplifies design but introduces several clear limitations:
Namespace limitation: the NameNode stores metadata in memory, so the number of objects it can manage is bounded by available RAM.
Performance bottleneck: every metadata operation goes through the single NameNode, so its throughput caps that of the whole file system.
Isolation issue: a single namespace prevents isolation of different applications.
Cluster availability: failure of the sole NameNode renders the entire cluster unavailable.
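The namespace limitation lends itself to a back-of-the-envelope estimate. The sketch below assumes roughly 150 bytes of NameNode heap per namespace object (file, directory, or block). That figure is a commonly quoted rule of thumb, not a number from this article, and actual usage varies with Hadoop version and path lengths.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def estimate_namenode_heap(n_files, blocks_per_file=1, n_dirs=0,
                           bytes_per_object=BYTES_PER_OBJECT):
    """Approximate heap needed to hold the namespace metadata, in bytes."""
    n_blocks = n_files * blocks_per_file
    n_objects = n_files + n_blocks + n_dirs
    return n_objects * bytes_per_object


# 10 million single-block files: 20 million objects in total.
heap = estimate_namenode_heap(10_000_000, blocks_per_file=1)
print(f"{heap / 1e9:.1f} GB")
```

Under this assumption, the object count rather than the total data volume drives memory use, which is why many small files strain a NameNode more than a few large ones of the same combined size.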
Common HDFS Configuration Parameters
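As an illustration, the sketch below renders a few widely used `hdfs-site.xml` properties in Hadoop's `<configuration>`/`<property>` layout. The selection of properties is an assumption made for this example, and the directory paths are hypothetical placeholders. The values shown for `dfs.replication` (3) and `dfs.blocksize` (134217728 bytes, i.e. 128 MiB) match the usual defaults in recent Hadoop releases, but check the documentation for the version in use.

```python
import xml.etree.ElementTree as ET

# Example settings only; the paths are placeholders, not recommendations.
HDFS_SITE = {
    "dfs.replication": "3",                        # replicas kept per block
    "dfs.blocksize": "134217728",                  # block size in bytes (128 MiB)
    "dfs.namenode.name.dir": "/path/to/namenode",  # where the NameNode keeps metadata
    "dfs.datanode.data.dir": "/path/to/datanode",  # where a DataNode stores blocks
}


def render_hdfs_site(properties):
    """Render a name -> value mapping as an hdfs-site.xml style document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(render_hdfs_site(HDFS_SITE))
```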
Common HDFS Maintenance Commands
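For orientation, the sketch below collects a handful of standard administrative commands with a one-line description each. The choice of commands is an assumption for this example rather than a list taken from the source, and the helper `describe` is hypothetical. Running the commands requires a configured Hadoop installation and, for most of them, administrator privileges.

```python
# Each entry: (argument list, what the command is used for).
MAINTENANCE_COMMANDS = [
    (["hdfs", "dfsadmin", "-report"],
     "Summarize cluster capacity and the state of each DataNode"),
    (["hdfs", "dfsadmin", "-safemode", "get"],
     "Show whether the NameNode is currently in safe mode"),
    (["hdfs", "fsck", "/", "-files", "-blocks"],
     "Check file system health and list the blocks of each file"),
    (["hdfs", "balancer"],
     "Redistribute blocks so DataNode disk usage evens out"),
]


def describe(commands):
    """Format the commands as aligned 'command  # purpose' lines."""
    rendered = [(" ".join(argv), purpose) for argv, purpose in commands]
    width = max(len(cmd) for cmd, _ in rendered)
    return [f"{cmd.ljust(width)}  # {purpose}" for cmd, purpose in rendered]


for line in describe(MAINTENANCE_COMMANDS):
    print(line)
```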
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
