What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features
This article gives an overview of HDFS: its design goals, core components, data read/write workflows, high-availability mechanisms, federation, storage policies, colocation, and practical usage scenarios. Together these explain why it is the foundational distributed file system for large-scale data processing.
HDFS Overview and Application Scenarios
HDFS (Hadoop Distributed File System) is a distributed file system whose design is based on Google's GFS paper and which runs on commodity hardware. It offers high fault tolerance and high-throughput access to large data sets, and it can store files in the terabyte-to-petabyte range.
It is well suited to large-file storage and streaming data access. It is a poor fit for workloads with many small files, random writes, or low-latency reads.
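The cost of many small files can be made concrete with a little arithmetic. The sketch below counts the namespace objects (files plus blocks) the NameNode would have to track for the same amount of data stored two ways. The 128 MiB block size and the figure of roughly 150 bytes of NameNode heap per object are assumptions (a common default and a widely quoted rule of thumb), not values taken from this article.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size: 128 MiB
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb heap cost per namespace object

def block_count(size):
    """Number of blocks needed for a file of `size` bytes (ceiling division)."""
    return -(-size // BLOCK_SIZE)

def namenode_objects(file_sizes):
    """One object per file plus one object per block."""
    return len(file_sizes) + sum(block_count(s) for s in file_sizes)

def estimated_heap_bytes(file_sizes):
    """Rough NameNode heap estimate under the assumptions above."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT

one_large = [1024 ** 3]                  # a single 1 GiB file
many_small = [1024 * 1024] * 1024        # the same 1 GiB as 1024 files of 1 MiB

large_objects = namenode_objects(one_large)    # 1 file + 8 blocks
small_objects = namenode_objects(many_small)   # 1024 files + 1024 blocks
```

Under these assumptions the small-file layout needs over 200 times as many namespace objects for the same data, which is why HDFS favors fewer, larger files.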
Typical Application Scenarios
Website user behavior data storage
Ecosystem data storage
Weather data storage
HDFS Position in FusionInsight
In FusionInsight HD, HDFS serves as the fundamental storage layer, providing a distributed, highly fault‑tolerant, and linearly scalable file system.
System Design Goals
Hardware failure tolerance: hardware is assumed to be unreliable, so the system must detect failures and recover from them automatically.
Streaming data access: applications read data sequentially, so the design favors throughput over response time.
Large data volumes: supports files ranging from gigabytes to petabytes.
Data consistency: uses a Write-Once-Read-Many (WORM) model; once written, a file can be appended to but not modified in place.
Multi-platform support: runs on diverse hardware platforms.
Data locality: computation is moved close to the data to reduce network load.
Basic System Architecture
The architecture consists of three components: NameNode, DataNode, and Client.
NameNode stores metadata and namespace information.
DataNode stores actual data blocks and reports them to the NameNode.
Client interacts with HDFS, obtaining block locations from the NameNode and reading/writing data to DataNodes.
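The division of labor between the three components can be sketched as a toy model. This is an in-memory illustration only, not the Hadoop implementation: the class name, method names, paths, and block IDs are all hypothetical. It shows the NameNode holding only metadata (which blocks make up a file, and which DataNodes reported each block), which is what a client asks for before contacting DataNodes directly.

```python
from dataclasses import dataclass, field

@dataclass
class NameNode:
    # path -> ordered list of block IDs (the namespace)
    files: dict = field(default_factory=dict)
    # block ID -> set of DataNode names that reported holding a replica
    block_locations: dict = field(default_factory=dict)

    def create_file(self, path, block_ids):
        self.files[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Record which blocks a DataNode says it stores."""
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def get_block_locations(self, path):
        """Return (block ID, sorted DataNode names) for each block of the file."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[path]]

nn = NameNode()
nn.create_file("/logs/clicks.log", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
nn.block_report("dn3", ["blk_2"])
locations = nn.get_block_locations("/logs/clicks.log")
```

Note that no file data passes through the toy NameNode; it only answers the question of where the blocks live.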
Data Write Process
Client creates a file via the HDFS API.
NameNode creates a file node in its metadata.
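As the client writes data, it requests new blocks; the NameNode returns block IDs and the DataNodes that should store them.
Client streams each block to the first chosen DataNode, which forwards it along a pipeline so that every chosen DataNode stores a replica.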
DataNodes acknowledge completion; client closes the file.
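The write steps above can be sketched as a small simulation. It is a simplified model under stated assumptions: the tiny 4-byte block size, the round-robin placement (standing in for the NameNode's real placement decision), and all class and function names are hypothetical. It does show the two essential ideas: data is split into blocks, and each block travels down a pipeline of DataNodes that each keep a replica.

```python
def split_into_blocks(data, block_size):
    """Cut the byte string into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, payload, downstream):
        """Store the block, forward it to the rest of the pipeline,
        and return the names of all nodes that stored it, in pipeline order."""
        self.blocks[block_id] = payload
        if downstream:
            stored = downstream[0].receive(block_id, payload, downstream[1:])
        else:
            stored = []
        return [self.name] + stored

def write_file(data, datanodes, block_size=4, replication=3):
    """Hypothetical client-side write; assumes replication <= len(datanodes)."""
    placements = {}
    for index, payload in enumerate(split_into_blocks(data, block_size)):
        block_id = f"blk_{index}"
        # Stand-in for the NameNode's placement decision: rotate through nodes.
        pipeline = [datanodes[(index + r) % len(datanodes)]
                    for r in range(replication)]
        placements[block_id] = pipeline[0].receive(block_id, payload, pipeline[1:])
    return placements

nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
placements = write_file(b"0123456789", nodes)
```

The client only sends each block once, to the head of the pipeline; replication happens node to node, which keeps the client's outbound traffic independent of the replication factor.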
Data Read Process
Client opens a file via the HDFS API.
NameNode provides block locations.
Client reads data from the nearest DataNodes based on block locations.
After reading, the client closes the file.
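The read path can be sketched in the same spirit. The example below assumes a simple two-level topology in which a node is identified by a (rack, host) pair, and uses the distances 0 (same host), 2 (same rack), and 4 (different rack). The data structures and function names are hypothetical; only the idea of reading each block from the closest replica comes from the steps above.

```python
def distance(a, b):
    """Network distance between two (rack, host) pairs in a two-level topology."""
    if a == b:
        return 0          # same host
    if a[0] == b[0]:
        return 2          # same rack, different host
    return 4              # different rack

def pick_replica(client, replicas):
    """Choose the replica with the smallest distance to the client."""
    return min(replicas, key=lambda node: distance(client, node))

def read_file(client, block_locations, storage):
    """Read blocks in order, each from its closest replica.
    Returns the reassembled bytes and the nodes that served each block."""
    data, sources = b"", []
    for block_id, replicas in block_locations:
        node = pick_replica(client, replicas)
        sources.append(node)
        data += storage[node][block_id]
    return data, sources

client = ("rack1", "host-a")
block_locations = [
    ("blk_0", [("rack2", "host-x"), ("rack1", "host-b")]),
    ("blk_1", [("rack1", "host-a"), ("rack2", "host-x")]),
]
storage = {
    ("rack2", "host-x"): {"blk_0": b"hello ", "blk_1": b"world"},
    ("rack1", "host-b"): {"blk_0": b"hello "},
    ("rack1", "host-a"): {"blk_1": b"world"},
}
data, sources = read_file(client, block_locations, storage)
```

Here the first block is served from the same rack and the second from the client's own host, so no block crosses racks.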
Key Features
Unified file system view for users.
RPC‑based communication between components.
Space reclamation and dynamic replica management.
Data organized in blocks stored on underlying OS file systems.
Access via Java API, HTTP, or shell commands.
Metadata Persistence
NameNode maintains FsImage (snapshot of the namespace) and EditLog (record of recent changes). During startup, FsImage is loaded into memory and EditLog entries are applied to bring the metadata up to date.
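The startup sequence described above amounts to "load a snapshot, then replay a log". A minimal sketch, assuming a dictionary as the snapshot and a list of tuples as the log; the operation names (`create`, `add_block`, `delete`) are hypothetical stand-ins, not the real EditLog opcodes.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown edit operation: {op}")

def load_namespace(fsimage, edit_log):
    """Copy the snapshot into memory, then replay the edits in order."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace

fsimage = {"/a.txt": ["blk_1"], "/old.txt": ["blk_2"]}
edit_log = [
    ("create", "/b.txt"),
    ("add_block", "/b.txt", "blk_3"),
    ("delete", "/old.txt"),
]
namespace = load_namespace(fsimage, edit_log)
```

Because edits are replayed in order on top of the snapshot, a longer log means a slower startup, which is the usual motivation for periodically merging the log into a fresh snapshot.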
High Availability (HA)
HA adds a standby NameNode, ZooKeeper for coordination, ZKFC for failover control, and JournalNodes for shared edit logs, ensuring continuous service during NameNode failures.
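The role of the JournalNodes can be illustrated with a quorum write: an edit counts as durable only when a majority of JournalNodes have stored it, so the standby can always find the committed edits on some surviving node. The sketch below is a simplified model with hypothetical class and function names; it omits recovery of partially written edits, which a real system must handle.

```python
class JournalNode:
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def append(self, edit):
        """Store the edit and acknowledge it, unless this node is down."""
        if not self.up:
            return False
        self.edits.append(edit)
        return True

def log_edit(journal_nodes, edit):
    """Send the edit to every JournalNode; it is committed only if a
    strict majority acknowledge it."""
    acks = sum(jn.append(edit) for jn in journal_nodes)
    return acks > len(journal_nodes) // 2

journals = [JournalNode(), JournalNode(), JournalNode(up=False)]
committed = log_edit(journals, "mkdir /data")        # 2 of 3 acknowledge

journals[1].up = False
committed_after = log_edit(journals, "mkdir /tmp")   # only 1 of 3 acknowledges
```

With three JournalNodes the log tolerates one failure; once two are down, writes can no longer be committed.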
Federation
Federation introduces multiple NameNodes, each managing a portion of the namespace, improving scalability, throughput, and isolation between workloads.
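One way to picture federation is as a client-side mount table that maps path prefixes to the NameNode responsible for them. The sketch below resolves a path by longest matching prefix. The mount points and NameNode names are hypothetical, and this is an illustration of the idea rather than the actual Hadoop client code.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that subtree.
MOUNT_TABLE = {
    "/user": "nn-users",
    "/data": "nn-data",
    "/data/logs": "nn-logs",
}

def resolve(path, mount_table=MOUNT_TABLE):
    """Return the NameNode for `path` using the longest matching prefix."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/database" does not match "/data".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[best]

logs_owner = resolve("/data/logs/2024/app.log")
raw_owner = resolve("/data/raw/events.csv")
```

Each NameNode only holds the metadata for its own subtree, so a heavy workload under one prefix does not consume memory or request capacity on the others.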
Storage Policies
Hierarchical storage types, from fastest to slowest: RAM_DISK, SSD, DISK, ARCHIVE.
Tag‑based policies allow directories to be associated with storage tags, directing blocks to specific DataNodes.
Node‑group policies enable placement of critical data on high‑reliability node groups.
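To show how storage types turn into a placement decision, the sketch below models the built-in storage policies of Apache HDFS (HOT, WARM, COLD, ONE_SSD, ALL_SSD, LAZY_PERSIST) as "storage type for the first replica, storage type for the rest". It covers only the preferred placement and ignores fallback behavior; the tag-based and node-group policies mentioned above are not modeled, and the function name is hypothetical.

```python
# policy name -> (storage type for the first replica, storage type for the rest)
POLICIES = {
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),
    "ALL_SSD": ("SSD", "SSD"),
    "ONE_SSD": ("SSD", "DISK"),
    "HOT": ("DISK", "DISK"),
    "WARM": ("DISK", "ARCHIVE"),
    "COLD": ("ARCHIVE", "ARCHIVE"),
}

def storage_types(policy, replication=3):
    """Preferred storage type for each replica of a block under `policy`."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    first, rest = POLICIES[policy]
    return [first] + [rest] * (replication - 1)

warm = storage_types("WARM")
one_ssd = storage_types("ONE_SSD", replication=2)
```

A WARM file, for example, keeps one replica on ordinary disk for reasonably fast reads and pushes the remaining replicas to cheaper archival storage.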
Colocation (Same‑Node Placement)
Files that are frequently joined are stored on the same DataNode to minimize network traffic during processing.
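A minimal sketch of the colocation idea: files are tagged with a group, and every file in a group is placed on the same set of DataNodes, so a join between them can run without moving data across the network. The class, the group names, and the round-robin choice of nodes are all hypothetical; this does not reproduce any real colocation API.

```python
class ColocationPlacer:
    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.group_nodes = {}   # group name -> DataNodes chosen for that group

    def place(self, path, group):
        """Return the DataNodes for `path`; all files in a group share them."""
        if group not in self.group_nodes:
            # First file of a new group: pick nodes round-robin by group order.
            start = len(self.group_nodes) % len(self.datanodes)
            self.group_nodes[group] = [
                self.datanodes[(start + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
        return list(self.group_nodes[group])

placer = ColocationPlacer(["dn1", "dn2", "dn3", "dn4"], replication=2)
orders = placer.place("/sales/orders", "sales")
customers = placer.place("/sales/customers", "sales")
clicks = placer.place("/web/clicks", "web")
```

The two sales files land on identical nodes, while the unrelated web data goes elsewhere.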
Common Shell Commands
Typical HDFS shell commands (e.g., hdfs dfs -ls, hdfs dfs -put, hdfs dfs -rm) are used for file system operations.
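For scripting, these commands can be assembled as argument lists and handed to the `hdfs` CLI. The sketch below uses only the three subcommands named above; the paths are hypothetical, and the helper runs a command only if an `hdfs` executable is found on the PATH.

```python
import shutil
import subprocess

def hdfs_dfs(*args):
    """Build the argument list for `hdfs dfs <args>`."""
    return ["hdfs", "dfs", *args]

def run(argv):
    """Run the command if the hdfs CLI is installed; otherwise return None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)

# Hypothetical paths, for illustration only.
ls_cmd = hdfs_dfs("-ls", "/user/demo")                    # list a directory
put_cmd = hdfs_dfs("-put", "local.txt", "/user/demo/")    # upload a local file
rm_cmd = hdfs_dfs("-rm", "/user/demo/local.txt")          # delete a file
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.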
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"