Inside HDFS: How NameNode and DataNode Manage Big Data Writes and Reads

This article explains the fundamentals of distributed file systems, focusing on Hadoop’s HDFS architecture, the separation of metadata and data via NameNode and DataNode, and detailed step‑by‑step write and read processes, including replication, fault recovery, and block splitting across nodes.

ITPUB

Mar 19, 2016

1. Distributed File System

A distributed file system is a type of distributed system that provides storage across multiple machines. It automatically spreads data blocks over different nodes so that storage can scale out for massive data workloads.

2. Separation of Metadata and Data: NameNode and DataNode

In HDFS, metadata (file names, inode numbers, the blocks that make up each file, block locations, replication factor, etc.) is held by the NameNode, while the actual file contents reside on a cluster of DataNodes. The NameNode acts as the master: it manages the metadata and monitors DataNode health. The DataNodes act as slaves (workers) that store the block replicas.
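The split can be pictured with a small in-memory model. This is only an illustrative sketch: the class names `ToyNameNode`, `ToyDataNode`, and `FileMeta`, and the path `/logs/zhou.log`, are hypothetical and are not part of any Hadoop API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FileMeta:
    """Per-file metadata a NameNode-like master would track (hypothetical)."""
    replication: int
    block_ids: List[int] = field(default_factory=list)


@dataclass
class ToyNameNode:
    # path -> file metadata (replication factor, ordered block ids)
    files: Dict[str, FileMeta] = field(default_factory=dict)
    # block id -> names of the DataNodes that hold a replica
    block_locations: Dict[int, Set[str]] = field(default_factory=dict)


@dataclass
class ToyDataNode:
    name: str
    # block id -> raw bytes; a DataNode stores blocks, not file names
    blocks: Dict[int, bytes] = field(default_factory=dict)


namenode = ToyNameNode()
datanodes = {n: ToyDataNode(n) for n in ("A", "B", "D")}

# Register one file made of a single block replicated on three nodes.
namenode.files["/logs/zhou.log"] = FileMeta(replication=3, block_ids=[1])
namenode.block_locations[1] = {"A", "B", "D"}
for node_name in namenode.block_locations[1]:
    datanodes[node_name].blocks[1] = b"hello hdfs"
```

Note that the master never holds file bytes and the workers never hold file names, which is the separation this section describes.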

3. HDFS Write Process

The client requests the NameNode to create a file (e.g., zhou.log).

The NameNode replies with a list of DataNodes (e.g., A, B, D) where the first block should be written.

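The client sends the block to the first DataNode in the list (A), which stores it and forwards it to the next DataNode (B), which in turn forwards it to the last one (D). This chain of DataNodes is the replication pipeline.

Each DataNode acknowledges receipt back up the chain: D confirms to B, B confirms to A, and A confirms to the client.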

After all replicas confirm, the client receives a final acknowledgment that the write is complete.
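The write steps above can be sketched as a store-and-forward chain. This is a simplified model, not the real HDFS protocol (which streams a block as small packets): the function `pipeline_write` and the node order `A, B, D` are assumptions made for illustration.

```python
from typing import Dict, List


def pipeline_write(block: bytes, pipeline: List[str],
                   stores: Dict[str, List[bytes]]) -> List[str]:
    """Store `block` on every node in `pipeline`, forwarding node to node.

    Returns the node names in the order their acknowledgments travel
    back toward the client: the last node in the chain acks first.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    stores.setdefault(head, []).append(block)         # head keeps its replica
    downstream = pipeline_write(block, rest, stores)  # then forwards the block
    return downstream + [head]                        # head acks after downstream


stores: Dict[str, List[bytes]] = {}
acks = pipeline_write(b"first block", ["A", "B", "D"], stores)
```

With this ordering, the client treats the block as written only once the ack from the first node in the chain arrives, because that ack implies every node further down has already confirmed.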

4. HDFS Read Process

The client asks the NameNode for the locations of the blocks that compose the desired file.

The NameNode returns the block metadata, including the IP addresses of the DataNodes holding each replica.

The client contacts the appropriate DataNodes, requesting the needed blocks.

Each DataNode streams its block back to the client; the client assembles the blocks to reconstruct the complete file.
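A minimal sketch of the read path, assuming the metadata is available as plain dictionaries. The names `read_file`, `block_map`, `locations`, and `live_nodes` are hypothetical, and falling back to the next replica when a node is unreachable is a simplification of what a real client does.

```python
from typing import Dict, List


def read_file(path: str,
              block_map: Dict[str, List[int]],
              locations: Dict[int, List[str]],
              live_nodes: Dict[str, Dict[int, bytes]]) -> bytes:
    """Reassemble a file from its blocks, in order."""
    chunks = []
    for block_id in block_map[path]:       # ordered block ids from the master
        for node in locations[block_id]:   # try each replica holder in turn
            data = live_nodes.get(node, {}).get(block_id)
            if data is not None:
                chunks.append(data)        # block fetched from this node
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)


block_map = {"/logs/zhou.log": [1, 2]}
locations = {1: ["A", "B"], 2: ["B", "D"]}
# Node A is absent here to mimic an unreachable DataNode.
live_nodes = {"B": {1: b"hello ", 2: b"world"}, "D": {2: b"world"}}
content = read_file("/logs/zhou.log", block_map, locations, live_nodes)
```

The master is consulted only for metadata; the bytes themselves flow directly between the client and the nodes that hold the blocks.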

5. Fast Recovery via Replication

DataNodes send heartbeat messages to the NameNode every few seconds (3 seconds by default). If no heartbeat arrives from a DataNode within a timeout (roughly 10 minutes by default), the NameNode marks that DataNode as dead, stops directing reads and writes to it, and schedules new copies of the blocks it held on other DataNodes. This restores the configured replication factor (three by default) and guards against data loss.
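The detection and repair logic can be sketched as follows. The timeout formula reflects how stock Hadoop is generally understood to derive it from `dfs.heartbeat.interval` (3 s) and `dfs.namenode.heartbeat.recheck-interval` (5 min), giving 630 s, which is the "about 10 minutes" figure; treat those defaults as an assumption to verify against your Hadoop version. The helper functions are hypothetical.

```python
from typing import Dict, List, Set

HEARTBEAT_INTERVAL_S = 3   # assumed default of dfs.heartbeat.interval
RECHECK_INTERVAL_S = 300   # assumed default recheck interval (5 minutes)


def dead_node_timeout(heartbeat_s: int = HEARTBEAT_INTERVAL_S,
                      recheck_s: int = RECHECK_INTERVAL_S) -> int:
    """Seconds of silence after which a node is considered dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def find_dead_nodes(last_heartbeat: Dict[str, float], now: float,
                    timeout_s: float) -> List[str]:
    """Names of nodes whose last heartbeat is older than the timeout."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout_s)


def replicas_needed(block_locations: Dict[int, List[str]], dead: Set[str],
                    replication: int = 3) -> Dict[int, int]:
    """Map block id -> number of new copies needed to restore replication."""
    needed = {}
    for block_id, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < replication:
            needed[block_id] = replication - len(live)
    return needed


timeout = dead_node_timeout()
dead = find_dead_nodes({"A": 0, "B": 600, "D": 990}, now=1000, timeout_s=timeout)
todo = replicas_needed({1: ["A", "B", "D"], 2: ["B", "D", "E"]}, set(dead))
```

Here node A has been silent for 1000 s, so it is declared dead, and block 1 needs one new copy while block 2 is unaffected.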

6. Splitting Files Across DataNodes

Files are divided into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later; the last block of a file may be smaller). The blocks of a file are spread across different DataNodes, which allows large datasets to be processed in parallel. When writing, the client asks the NameNode for a list of target DataNodes for each new block, writes that block, and then repeats the process for the next block.
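The block arithmetic is easy to check by hand. The sketch below uses hypothetical helper names and a tiny block size for the byte-level example; the 128 MB figure is used only for the size calculation.

```python
from typing import List

MB = 1024 * 1024


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed for a file (ceiling division)."""
    return -(-file_size // block_size)


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut `data` into consecutive chunks of at most `block_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# A 300 MB file with 128 MB blocks needs three blocks: 128 + 128 + 44 MB.
n_blocks = block_count(300 * MB, 128 * MB)
last_block_mb = (300 * MB - 2 * 128 * MB) // MB

# Same idea at byte scale, with a 4-byte block size.
pieces = split_into_blocks(b"abcdefghij", 4)
```

Only the final block is short; every earlier block is exactly one block size long, which keeps the file-to-block mapping simple for the master.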
