
Mastering HDFS: Architecture, Read/Write, and Best Practices Explained

This article provides a comprehensive overview of HDFS, covering its purpose, architecture, read/write processes, component roles, command-line tools, replica placement strategies, and the advantages and disadvantages of using Hadoop's distributed file system for large-scale data storage.

360 Zhihui Cloud Developer

Main Content

What is HDFS and what is it used for?

What does the HDFS architecture look like?

How does HDFS perform read/write operations and where are replicas placed?

What are the roles of the various HDFS components?

What file operation commands does HDFS provide?

What are the strengths and weaknesses of HDFS?

Introduction

A company needs to back up about 200 TB of read‑only MySQL databases so that the data is not lost and can be restored quickly. Storing everything on a single machine carries risks such as disk failure, I/O bottlenecks, and poor resource utilization, so a distributed storage system such as HDFS, Ceph, or S3 is worth considering.

What Is HDFS?

HDFS (Hadoop Distributed File System) is a fault‑tolerant, high‑throughput distributed file system designed to run on commodity hardware. It relaxes some POSIX requirements to enable streaming access to file data. It was originally built as infrastructure for the Apache Nutch project and is now a core component of Apache Hadoop.

HDFS Architecture

The architecture consists of the following components:

Client: initiates file operations.

NameNode: the master that stores metadata (namespace, block mapping, replica policies) and handles client requests.

DataNode: the slave that stores actual data blocks and executes commands from the NameNode.

SecondaryNameNode: assists the NameNode by periodically merging the FsImage and EditLog into a new checkpoint, which also serves as a backup copy of the metadata; it is not a standby NameNode.

Read/Write Process

Write File

The client calls create to start a new file; the NameNode checks that the file does not already exist and that the client has permission to create it.

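After validation, the client streams data to DataNodes block by block (default block size 128 MB). The DataNodes chosen for a block form a pipeline: the client sends each packet to the first DataNode, which forwards it to the next, and so on. Each packet is made up of chunks, and each chunk carries its own checksum.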

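Acknowledgements travel back along the pipeline for each packet, not once per block; the client treats a packet as written only after every DataNode in the pipeline has acknowledged it.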

When all blocks are written, the client closes the file.
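
The block and chunk arithmetic in the write path can be sketched in a few lines of Python. This is an illustration only, not the HDFS client: `split_into_blocks` and `chunk_checksums` are hypothetical helper names, the 512-byte chunk size is an assumption based on the usual `dfs.bytes-per-checksum` default, and `zlib.crc32` stands in for whatever checksum type the cluster is configured to use.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024   # default block size cited in the text
CHUNK_SIZE = 512                 # assumed bytes per checksum chunk


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 per chunk, as a stand-in for HDFS chunk checksums."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


if __name__ == "__main__":
    mb = 1024 * 1024
    for offset, length in split_into_blocks(300 * mb):
        print(f"block at offset {offset // mb} MB, length {length // mb} MB")
    print(len(chunk_checksums(b"x" * 1300)), "chunk checksums")
```

A 300 MB file yields two full 128 MB blocks and a final 44 MB block; the last block only occupies as much disk space as the data it holds.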

Read File

The client calls open to access a file.

The NameNode returns the DataNode locations for each block, ordered so that the DataNodes closest to the client come first.

The client reads data directly from the DataNodes, verifying checksums; the NameNode is not in the data path.

After reading a block, the client proceeds to the next block until the file is fully read, then closes it.
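
The "closest DataNode first" ordering in the read path can be sketched as follows. This is a simplified model, not Hadoop code: nodes are written as hypothetical `/rack/node` paths, and the distances 0, 2, and 4 (same node, same rack, different rack) are an assumption that mirrors the usual convention for a two-level rack topology.

```python
def network_distance(a, b):
    """Distance between two '/rack/node' paths: 0 same node, 2 same rack, 4 otherwise."""
    if a == b:
        return 0
    rack_a = a.rsplit("/", 1)[0]
    rack_b = b.rsplit("/", 1)[0]
    return 2 if rack_a == rack_b else 4


def order_replicas(client, replicas):
    """Sort replica locations so the one closest to the client comes first."""
    return sorted(replicas, key=lambda node: network_distance(client, node))


if __name__ == "__main__":
    locations = ["/rack2/node5", "/rack1/node2", "/rack1/node1"]
    print(order_replicas("/rack1/node1", locations))
```

A client that is not itself a DataNode sees distance 2 or 4 to every replica, so it simply reads from a replica in its own rack when one exists.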

Replica Placement

Two strategies are illustrated for a typical three‑replica placement:

Old strategy: Replica 1 on a different node in the same rack, Replica 2 on another node in the same rack, Replica 3 on a node in a different rack.

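New strategy: Replica 1 on the client’s node, Replica 2 on a node in a different rack, Replica 3 on another node in the same rack as Replica 2. Keeping the first replica local avoids a network hop on the write and lets a reader on that node use the nearest copy, while placing the other two replicas in a single remote rack limits cross‑rack write traffic and still survives the loss of an entire rack.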
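
A minimal sketch of the new strategy, assuming the writer runs on a DataNode, the cluster has at least two racks, and the chosen remote rack has at least two nodes. The `/rack/node` naming and the function names are hypothetical, and the real placement policy has fallbacks (for example for full or failed nodes) that are left out here.

```python
import random


def rack_of(node):
    """Return the rack part of a '/rack/node' path."""
    return node.rsplit("/", 1)[0]


def place_replicas(writer, nodes, rng=random):
    """Pick three nodes following the 'new strategy' described above."""
    first = writer  # replica 1: the writer's own node
    remote = [n for n in nodes if rack_of(n) != rack_of(first)]
    second = rng.choice(remote)  # replica 2: any node in a different rack
    same_remote_rack = [
        n for n in nodes if rack_of(n) == rack_of(second) and n != second
    ]
    third = rng.choice(same_remote_rack)  # replica 3: same rack as replica 2
    return [first, second, third]


if __name__ == "__main__":
    cluster = ["/r1/n1", "/r1/n2", "/r2/n3", "/r2/n4", "/r3/n5", "/r3/n6"]
    print(place_replicas("/r1/n1", cluster))
```

The result always spans exactly two racks, so only one of the three copies has to cross a rack boundary twice over during the write pipeline.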

Component Roles and Startup

The NameNode starts, loads the FsImage into memory, and replays the EditLog to reconstruct the namespace.

The DataNodes start, register with the NameNode, and send their block reports.

Once the NameNode exits safe mode, clients can create directories, upload files, and so on; each change is recorded in the EditLog.
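
The startup sequence can be pictured with a toy model in which block reports fill in the block‑to‑DataNode map and the NameNode leaves safe mode once enough blocks are accounted for. The function names are hypothetical, and the 0.999 threshold is an assumption modeled on the usual `dfs.namenode.safemode.threshold-pct` default; check your cluster's configuration for the real value.

```python
from collections import defaultdict

SAFEMODE_THRESHOLD = 0.999  # assumed value, modeled on a common default


def build_block_map(block_reports):
    """Map each block ID to the set of DataNodes that reported it."""
    block_map = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            block_map[block_id].add(datanode)
    return block_map


def can_leave_safe_mode(expected_blocks, block_map, min_replicas=1,
                        threshold=SAFEMODE_THRESHOLD):
    """True once enough known blocks have at least min_replicas reported."""
    if not expected_blocks:
        return True
    safe = sum(
        1 for b in expected_blocks
        if len(block_map.get(b, ())) >= min_replicas
    )
    return safe / len(expected_blocks) >= threshold


if __name__ == "__main__":
    reports = {"dn1": ["b1", "b2"], "dn2": ["b2"]}
    block_map = build_block_map(reports)
    print(can_leave_safe_mode(["b1", "b2"], block_map))
    print(can_leave_safe_mode(["b1", "b2", "b3"], block_map))
```

Note that the block locations come only from the reports; in this model, as in HDFS, they are rebuilt at startup and not read from the FsImage.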

NameNode Functions

Manage the HDFS namespace.

Maintain block‑to‑DataNode mapping.

Configure replica policies.

Process client read/write requests.

SecondaryNameNode

It assists the NameNode by periodically merging the FsImage and EditLog into a new checkpoint. That checkpoint can serve as a backup of the metadata, but the SecondaryNameNode does not take over client services if the NameNode fails.
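
Conceptually, a checkpoint replays the logged operations on top of the last snapshot and then starts a fresh, empty log. The sketch below models the namespace as a set of paths and the EditLog as a list of `(operation, path)` pairs; both representations and the function names are simplifications invented for illustration, not the on‑disk formats.

```python
def apply_edits(fsimage, edit_log):
    """Replay edit-log operations on a copy of the namespace snapshot."""
    namespace = set(fsimage)
    for op, path in edit_log:
        if op in ("mkdir", "create"):
            namespace.add(path)
        elif op == "delete":
            # Remove the path itself and anything beneath it.
            prefix = path.rstrip("/") + "/"
            namespace = {
                p for p in namespace
                if p != path and not p.startswith(prefix)
            }
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edit_log):
    """Return (new_fsimage, empty_edit_log), as a checkpoint would."""
    return apply_edits(fsimage, edit_log), []


if __name__ == "__main__":
    image = {"/", "/data"}
    log = [("mkdir", "/data/logs"), ("create", "/data/logs/a.txt"),
           ("delete", "/data/logs"), ("create", "/data/b.txt")]
    new_image, new_log = checkpoint(image, log)
    print(sorted(new_image), new_log)
```

Without periodic checkpoints the EditLog keeps growing, and a restarting NameNode has to replay all of it before it can serve requests.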

HDFS Command‑Line Tools

hadoop fs -ls /tmp/
hadoop fs -[cat|chgrp|chmod|chown|count|cp|df|get|ls|put|mv|rm|mkdir|tail] <args>
hdfs fsck <path> [-move | -delete] [-files [-blocks [-locations | -racks]]] [-blockId <blk_Id>]
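
When these commands are driven from a script, it helps to build the argument list separately from running it. The wrapper below is a hypothetical sketch: `fsck_command` and `run` are names made up for this example, and it assumes the `hdfs` binary is on `PATH` when a command is actually executed. It only uses the `-files`, `-blocks`, and `-locations` options listed above.

```python
import shutil
import subprocess


def fsck_command(path, files=False, blocks=False, locations=False):
    """Build the argument list for an `hdfs fsck` call on one path."""
    cmd = ["hdfs", "fsck", path]
    if files or blocks or locations:
        cmd.append("-files")      # -blocks is only meaningful with -files
    if blocks or locations:
        cmd.append("-blocks")     # -locations is only meaningful with -blocks
    if locations:
        cmd.append("-locations")
    return cmd


def run(cmd):
    """Run the command if its binary is on PATH; return stdout or None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    command = fsck_command("/tmp", locations=True)
    print(" ".join(command))
    print(run(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.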

Advantages and Disadvantages

Advantages

Supports massive data storage.

Detects and quickly recovers from hardware failures.

Provides high‑throughput streaming access.

Simple consistency model: files are written once and read many times.

High fault tolerance.

Runs on commodity hardware.

Disadvantages

Not suitable for low‑latency access.

Inefficient for a large number of small files.

Limited file modification: existing data cannot be changed in place, and Hadoop 2.x only supports appending to the end of a file.

No concurrent writers: a file can be written by only one client at a time.
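
The small‑files problem comes from the NameNode keeping every file and block in memory. A back‑of‑the‑envelope estimate, assuming the commonly quoted rule of thumb of roughly 150 bytes of heap per namespace object (an approximation, not a measured figure) and a hypothetical helper name:

```python
BYTES_PER_OBJECT = 150            # rule-of-thumb assumption, not a measurement
BLOCK_SIZE = 128 * 1024 * 1024    # default block size cited in the text


def namenode_heap_bytes(file_count, file_size, block_size=BLOCK_SIZE,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode memory for file_count files of file_size bytes each."""
    blocks_per_file = -(-file_size // block_size)   # ceiling division
    objects = file_count * (1 + blocks_per_file)    # one file entry plus its blocks
    return objects * bytes_per_object


if __name__ == "__main__":
    mb = 1024 * 1024
    small = namenode_heap_bytes(1024 * 1024, 1 * mb)   # 1 TiB as 1 MB files
    large = namenode_heap_bytes(8192, 128 * mb)        # 1 TiB as 128 MB files
    print(small // mb, "MiB vs", large / mb, "MiB")
```

Under these assumptions the same 1 TiB of data needs 128 times more NameNode memory when stored as 1 MB files than as 128 MB files, which is why small files are usually packed into larger ones before loading.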

Hadoop 2.x introduced NameNode Federation for horizontal scaling of the namespace and NameNode HA to eliminate the NameNode as a single point of failure.
