HDFS Overview: Architecture, Features, Data Management and Storage Policies
This article provides a comprehensive overview of HDFS, covering basic file system concepts, HDFS architecture, high availability, federation, replica placement, storage policies, colocation, data integrity, and key design considerations for large‑scale distributed storage.
Basic Overview of File Systems
File system definition: a method for storing and organizing computer data that makes access and lookup easy.
File name: used to locate the storage location within the file system.
Metadata: data that stores file attributes such as name, length, owning user/group, and storage location.
Block: the smallest unit of file storage; the storage medium is divided into fixed regions that are allocated as needed.
Overview of HDFS
HDFS (Hadoop Distributed File System) was designed based on Google’s GFS paper and is the distributed file system component of the Hadoop framework, managing files deployed across many independent physical machines.
It can be used in various scenarios such as website user behavior data storage, ecosystem data storage, and meteorological data storage.
Advantages and Disadvantages of HDFS
In addition to the common features of other distributed file systems, HDFS has its own characteristics:
High fault tolerance: designed on the assumption that hardware failure is the norm rather than the exception; lost replicas are detected and rebuilt automatically.
High throughput: provides high throughput for applications that access large amounts of data.
Large‑file storage: supports storage of data at the TB‑PB scale.
Unsuitable Scenarios
Low‑latency data access applications (e.g., tens of milliseconds) – HDFS is optimized for high throughput at the cost of higher latency.
Massive numbers of small files – NameNode loads all metadata into memory, limiting the total number of files by available memory.
Multiple concurrent writers or arbitrary file modifications – HDFS currently allows only a single writer that appends to the end of a file.
Streaming Data Access
Streaming data access: after a dataset is generated, long‑running analyses read most or all of the data, making total read time more important than the latency of the first record.
Random data access: requires low latency for locating, querying, or modifying data and is typical of traditional relational databases.
HDFS Architecture
HDFS architecture consists of three components: NameNode, DataNode, and Client.
NameNode: generates and stores the file system metadata; a single active instance runs (with a standby in HA deployments).
DataNode: stores the actual data blocks and reports them to the NameNode; multiple instances run.
Client: accesses HDFS on behalf of applications, obtaining data from NameNode and DataNode; multiple instances run alongside the business logic.
HDFS Data Write Process
The application calls the HDFS Client API to request file creation.
HDFS Client contacts NameNode, which creates a file node in the metadata.
The application invokes the write API.
As data arrives, the client requests a block ID and a list of target DataNodes from the NameNode, establishes a write pipeline to those DataNodes, and streams the data to DataNode1, which forwards it to DataNode2, which in turn forwards it to DataNode3.
Upon completion, the client receives an acknowledgment.
After all data is confirmed, the application closes the file.
Finally, the client contacts NameNode to confirm write completion, and NameNode persists the metadata.
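The write pipeline above can be sketched as a small simulation. This is illustrative only: the class and method names are simplified stand-ins for the real Hadoop client, and rack-aware placement is omitted.

```python
# Illustrative sketch (not the real Hadoop client API): a client asks the
# NameNode for a block and a pipeline of DataNodes, writes to the first
# DataNode, and each DataNode forwards the data downstream.

class NameNode:
    def __init__(self):
        self.metadata = {}          # path -> list of (block_id, target DataNodes)
        self.next_block = 0

    def create(self, path):
        self.metadata[path] = []    # step 2: create a file node in the metadata

    def allocate_block(self, path, datanodes, replication=3):
        block_id = self.next_block
        self.next_block += 1
        targets = datanodes[:replication]   # real HDFS applies rack-aware placement here
        self.metadata[path].append((block_id, targets))
        return block_id, targets

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data
        if pipeline:                # forward downstream, as in the replication pipeline
            pipeline[0].write(block_id, data, pipeline[1:])

nn = NameNode()
dns = [DataNode(f"dn{i}") for i in range(4)]
nn.create("/logs/a.txt")
block_id, targets = nn.allocate_block("/logs/a.txt", dns)
targets[0].write(block_id, b"hello", targets[1:])   # the client writes to the first DataNode only
```

Note that the client sends the data once; the pipeline, not the client, produces the second and third replicas.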
HDFS Data Read Process
The application calls the HDFS Client API to open a file.
Client contacts NameNode to obtain file information (block list and DataNode locations).
The application invokes the read API.
Client contacts the appropriate DataNodes based on the metadata (preferring locality) to retrieve the required blocks.
Client may communicate with multiple DataNodes to gather all blocks.
After reading, the application closes the connection.
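A miniature version of this read path, with the NameNode's metadata reduced to a plain dictionary; all names here are illustrative, not the real Hadoop API.

```python
# Illustrative sketch of the read path: the client gets the block list from
# the NameNode's metadata, then fetches each block from one of its replicas.

block_map = {                      # path -> ordered list of (block_id, replica locations)
    "/logs/a.txt": [(0, ["dn1", "dn2", "dn3"]), (1, ["dn2", "dn3", "dn4"])],
}
datanode_storage = {               # datanode -> {block_id: bytes}
    "dn1": {0: b"hello "}, "dn2": {0: b"hello ", 1: b"world"},
    "dn3": {0: b"hello ", 1: b"world"}, "dn4": {1: b"world"},
}

def read_file(path):
    data = b""
    for block_id, locations in block_map[path]:      # block list from the NameNode
        replica = locations[0]                       # real clients prefer the closest replica
        data += datanode_storage[replica][block_id]  # fetch the block from a DataNode
    return data

assert read_file("/logs/a.txt") == b"hello world"
```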
Key Features of HDFS
High Availability (HA)
HA is achieved using ZooKeeper to implement active‑standby NameNode pairs, solving the single‑point‑failure problem.
ZooKeeper stores HA state files and master‑standby information; an odd number of ZooKeeper nodes (≥3) is recommended.
Active NameNode provides services; standby synchronizes metadata and acts as a hot backup.
ZKFC (ZooKeeper Failover Controller) monitors NameNode master‑standby status.
JournalNode (JN) stores the EditLog generated by the active NameNode; the standby loads the EditLog from JN to sync metadata.
ZKFC Controls NameNode Master‑Standby Arbitration
ZKFC uses ZooKeeper’s distributed lock to arbitrate master‑standby roles and, via a command channel, controls the NameNode state.
Metadata Synchronization
The active NameNode writes EditLog to both local storage and JournalNode while updating in‑memory metadata.
The standby monitors EditLog changes on JN, loads the new EditLog into memory, and generates identical metadata.
FSImage (metadata snapshot) remains on each node’s disk; it is periodically written from memory to local disk.
Metadata Persistence
EditLog records user operations and is used with FSImage to generate new file system images.
FSImage stores periodic snapshots of the file system.
FSImage.ckpt: the standby merges the FSImage and EditLog in memory, writes the result to disk as FSImage.ckpt, and promotes it to become the new FSImage (the checkpoint).
EditLog.new: when a time interval or size limit is reached, the active NameNode rolls the log and writes subsequent operations to EditLog.new; once the checkpoint covering the old EditLog completes, EditLog.new replaces it as the current EditLog.
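The merge step can be sketched as replaying the EditLog over the last FSImage. This is a minimal illustration; real checkpoints operate on serialized files, not dictionaries.

```python
# Illustrative sketch of checkpointing: replaying the EditLog over the last
# FSImage yields the new image, after which the old log can be discarded.

fsimage = {"/a": "file", "/b": "file"}                  # last persisted snapshot
editlog = [("create", "/c"), ("delete", "/b"), ("create", "/d")]

def checkpoint(image, log):
    new_image = dict(image)
    for op, path in log:                                # replay operations in order
        if op == "create":
            new_image[path] = "file"
        elif op == "delete":
            new_image.pop(path, None)
    return new_image                                    # would be written out as FSImage.ckpt

assert checkpoint(fsimage, editlog) == {"/a": "file", "/c": "file", "/d": "file"}
```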
HDFS Federation
Reason: a single active NameNode becomes a scalability and performance bottleneck as clusters grow, potentially consuming hundreds of gigabytes of memory.
Use case: ultra‑large file storage (e.g., internet companies storing user behavior data, telecom call history, voice data). A rough estimate is 1 GB of NameNode heap per 1 million blocks, i.e., about 128 TB of data per gigabyte of heap at the default 128 MB block size.
Federation concept: each NameNode manages its own namespace (directory tree), similar to mounting disks in Linux; directories are isolated, and each NameNode has its own standby.
Block pool: a set of blocks belonging to a specific namespace; each NameNode maintains a namespace volume that includes metadata and all blocks of that namespace.
NameNodes operate independently; failure of one does not affect others.
DataNodes register with all NameNodes and store blocks for every block pool.
Namespace (NS): the logical directory structure containing directories, files, and blocks managed by a NameNode.
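The rough heap estimate above checks out arithmetically (using decimal units, as the estimate does):

```python
# Back-of-the-envelope check of the estimate above: roughly 1 GB of NameNode
# heap per 1 million blocks, so with the default 128 MB block size one GB of
# heap covers about 128 TB of data (decimal units: 1 TB = 1,000,000 MB).

BLOCK_SIZE_MB = 128
BLOCKS_PER_GB_HEAP = 1_000_000

data_tb = BLOCKS_PER_GB_HEAP * BLOCK_SIZE_MB / 1_000_000
assert data_tb == 128.0
```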
Data Replication Mechanism
Replica Distance Calculation
Distance(Rack1/D1, Rack1/D1) = 0 (same server).
Distance(Rack1/D1, Rack1/D3) = 2 (different servers in the same rack).
Distance(Rack1/D1, Rack2/D1) = 4 (servers in different racks).
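The distance metric can be written directly from these examples. This is a sketch; real HDFS derives the topology from a configured rack-awareness script, and each level of separation adds 2 to the distance.

```python
# Sketch of the replica distance metric: distance counts hops to the nearest
# common ancestor in the topology tree (node -> rack), so nodes on the same
# server are at distance 0, same rack 2, different racks 4.

def distance(a, b):
    """a and b are (rack, node) pairs."""
    if a == b:
        return 0            # same server
    if a[0] == b[0]:
        return 2            # different servers in the same rack
    return 4                # servers in different racks

assert distance(("Rack1", "D1"), ("Rack1", "D1")) == 0
assert distance(("Rack1", "D1"), ("Rack1", "D3")) == 2
assert distance(("Rack1", "D1"), ("Rack2", "D1")) == 4
```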
Replica Placement Strategy
First replica is placed on the local node.
Second replica is placed on a node in a remote rack.
Third replica: if the first two are in the same rack, choose a node in a different rack; otherwise choose a different node in the same rack as the first replica. Additional replicas are placed randomly.
If the client machine is a DataNode, the first replica is stored locally; otherwise a random DataNode is chosen.
Rack1: rack identifier.
D1: DataNode identifier.
B1: block identifier on a node.
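The placement rules above can be expressed as a short function. This is a simplified sketch: real HDFS also checks node load, free space, and full rack topology before accepting a candidate.

```python
# Illustrative implementation of the placement rules described above:
# first replica on the writer's node (or a random node for off-cluster
# clients), second in a remote rack, third back in the first replica's rack.

import random

def place_replicas(topology, writer, replication=3):
    """topology: {node: rack}; writer: client node, or None if off-cluster."""
    nodes = list(topology)
    first = writer if writer in topology else random.choice(nodes)
    chosen = [first]
    # second replica: a node in a different (remote) rack
    remote = [n for n in nodes if topology[n] != topology[first]]
    chosen.append(random.choice(remote))
    # third replica: the first two are in different racks here, so pick
    # another node in the first replica's rack; extra replicas would be random
    same_rack = [n for n in nodes
                 if topology[n] == topology[first] and n not in chosen]
    chosen.append(random.choice(same_rack))
    return chosen[:replication]

topology = {"d1": "rack1", "d2": "rack1", "d3": "rack2", "d4": "rack2"}
replicas = place_replicas(topology, writer="d1")
assert replicas[0] == "d1"                                  # local first replica
assert topology[replicas[1]] != "rack1"                     # remote rack
assert topology[replicas[2]] == "rack1" and replicas[2] != "d1"
```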
Configuring HDFS Data Storage Policies
By default, NameNode automatically selects DataNodes for replica placement, but real‑world scenarios may require:
Tiered storage on heterogeneous devices within a DataNode.
Directory‑based tagging to store data on specific DataNodes according to importance.
Storing critical data on highly reliable node groups in heterogeneous clusters.
Tiered Storage Configuration
Configure DataNode for Tiered Storage
HDFS tiered storage framework provides four storage types: RAM_DISK, DISK, ARCHIVE, and SSD.
By combining these types, administrators can create policies suited to different scenarios.
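How a policy maps each replica to a storage type can be sketched as follows. The policy names mirror HDFS's built-in policies (HOT, WARM, COLD, ONE_SSD, ALL_SSD), but treat the exact layouts here as illustrative.

```python
# Sketch of how a storage policy assigns a storage type to each replica:
# the policy lists preferred types in order, and the last listed type
# applies to all remaining replicas.

POLICIES = {
    "HOT":     ["DISK"],              # all replicas on DISK
    "WARM":    ["DISK", "ARCHIVE"],   # first replica on DISK, the rest on ARCHIVE
    "COLD":    ["ARCHIVE"],
    "ONE_SSD": ["SSD", "DISK"],       # first replica on SSD, the rest on DISK
    "ALL_SSD": ["SSD"],
}

def storage_types(policy, replication=3):
    layout = POLICIES[policy]
    return [layout[min(i, len(layout) - 1)] for i in range(replication)]

assert storage_types("ONE_SSD") == ["SSD", "DISK", "DISK"]
assert storage_types("HOT") == ["DISK", "DISK", "DISK"]
```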
Label‑Based Storage Configuration
Configure DataNode with Labels
Users can flexibly configure block placement by assigning label expressions to directories/files and labeling DataNodes.
When a label‑based placement policy is used, the system selects DataNodes matching the file’s label expression, then chooses an appropriate node within that set.
In short: assign labels to DataNodes and to data; the system stores data on DataNodes with matching labels.
Node‑Group Storage Configuration
Configure DataNode for Node‑Group Storage
Critical data can be forced to reside in highly reliable node groups by modifying DataNode storage policies.
Constraints:
The first replica must be placed in a forced rack group (e.g., rack group 2); if none is available, the write fails.
The second replica is placed on a node in the client's local rack; when the client's rack has no available node, a random node in the rack group is chosen.
The third replica is placed in another rack group.
All replicas should reside in different rack groups; excess replicas are placed randomly.
If additional replicas or block failures require re‑replication and a forced rack group lacks nodes, replication is postponed until a node becomes available.
In summary, critical data can be forced onto designated servers.
Colocation (Same‑Node Distribution)
Definition: storing related data, or data that is likely to be joined, on the same storage node.
When files A and D need to be joined, colocating their blocks on the same DataNode avoids moving large amounts of data between nodes, dramatically reducing network bandwidth consumption and resource usage during the join.
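The idea can be sketched with a hypothetical allocator that pins all files sharing a locator id to one DataNode. The locator concept follows HDFS colocation, but this API is invented for illustration.

```python
# Illustrative sketch of colocation: files that share a locator id are pinned
# to the same DataNode, so a later join reads both files locally.

import random

class ColocationAllocator:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.locator_to_node = {}

    def node_for(self, locator):
        # the first file with this locator picks a node; later files reuse it
        if locator not in self.locator_to_node:
            self.locator_to_node[locator] = random.choice(self.datanodes)
        return self.locator_to_node[locator]

alloc = ColocationAllocator(["dn1", "dn2", "dn3"])
node_a = alloc.node_for("orders")   # file A
node_d = alloc.node_for("orders")   # file D, joined with A later
assert node_a == node_d             # colocated: the join moves no blocks over the network
```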
HDFS Data Integrity Assurance
HDFS is designed to guarantee data integrity and handle component failures reliably.
Rebuilding lost replicas when a DataNode fails.
Cluster-wide data balancing to keep block distribution even.
Metadata reliability through write‑ahead logs and replication on active‑standby NameNodes.
Snapshot mechanism for quick recovery from accidental deletions.
Safe mode: the NameNode enters safe mode at startup, or when too many blocks lose their replicas; the file system then accepts only read requests, preventing further damage.
Other Key Design Points of HDFS
Unified file system view for users.
Space reclamation mechanisms, including a recycle bin and dynamic replica count adjustment.
Data is stored as blocks on the underlying OS file system.
Access methods: Java API, HTTP, and command‑line tools.
Disk utilization example: a 100 GB disk with 30 GB used has a 30 % utilization rate.
Load balancing prevents hotspot nodes caused by uneven data distribution.
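The balancing rule can be sketched as flagging nodes whose utilization deviates from the cluster average by more than a threshold (10 % here, matching the HDFS Balancer's default threshold):

```python
# Sketch of the load-balancing idea above: compute each node's utilization,
# then flag nodes that deviate from the cluster average by more than a
# threshold as candidates for rebalancing.

THRESHOLD = 10.0                           # percentage points, Balancer's default

def utilization(used_gb, capacity_gb):
    return used_gb / capacity_gb * 100

assert utilization(30, 100) == 30.0        # the 100 GB / 30 GB example above

nodes = {"dn1": (90, 100), "dn2": (30, 100), "dn3": (60, 100)}
avg = sum(utilization(u, c) for u, c in nodes.values()) / len(nodes)
over = [n for n, (u, c) in nodes.items() if utilization(u, c) - avg > THRESHOLD]
under = [n for n, (u, c) in nodes.items() if avg - utilization(u, c) > THRESHOLD]
assert avg == 60.0 and over == ["dn1"] and under == ["dn2"]
```

Blocks would then be moved from over‑utilized to under‑utilized nodes until every node falls within the threshold.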
Review Questions
What is HDFS and what is it suitable for? – A distributed file system running on commodity hardware, ideal for large‑file storage, streaming data access, and big‑data workloads.
What roles does HDFS contain? – NameNode, DataNode, and Client.
Briefly describe HDFS read and write flows. Read: Client contacts NameNode for block locations, then contacts DataNodes to retrieve blocks, and finally closes the connection. Write: Client contacts NameNode to create a file node, establishes a pipeline with DataNodes, writes data (which is replicated), closes the file, and confirms completion with NameNode.