
Understanding RAID and Its Role in HDFS Architecture

This article explains the storage challenges of big data, introduces RAID technologies and their variants, and shows how the principles of RAID are applied in the Hadoop Distributed File System (HDFS) to achieve scalable, reliable, and high‑performance data storage and processing.

Big Data Technology & Architecture

Apr 3, 2019


Big data technologies must first solve the problem of storing massive amounts of data, which involves three core issues: storage capacity, read/write speed, and reliability.

RAID (Redundant Array of Independent Disks) was created to address these challenges by combining multiple disks to increase capacity, improve throughput, and provide fault tolerance.

Common RAID levels include:

RAID0 : Stripes data across N disks so reads and writes proceed in parallel, giving up to roughly N-fold throughput in the ideal case. It offers no redundancy; a single disk failure destroys all data.

RAID1 : Mirrors data on two disks, providing high reliability; any single disk can fail without data loss.

RAID10 : Combines RAID0 striping and RAID1 mirroring, delivering both performance and reliability at the cost of 50% storage efficiency.

RAID3 : Uses N‑1 data disks plus one dedicated parity disk; can recover from a single disk failure but suffers from parity‑disk bottlenecks under heavy write workloads.

RAID5 : Distributes parity across all disks in a rotating fashion, so parity writes are spread evenly instead of concentrating on one disk as in RAID3. Like RAID3, it tolerates a single disk failure.

RAID6 : Stores two independent parity blocks, allowing the system to survive the simultaneous failure of two disks.
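The parity idea behind RAID3 and RAID5 can be illustrated with a small sketch. It assumes simple byte-wise XOR parity over equally sized blocks; the block contents and the helper name `xor_blocks` are made up for illustration, and real controllers work on disk stripes rather than Python byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks in one stripe (hypothetical contents, equal length).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]; rebuild it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Because XOR is its own inverse, any one missing block equals the XOR of the remaining blocks and the parity. This also shows why a single parity block cannot recover from two simultaneous failures, which is the gap RAID6 closes.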

While RAID is typically implemented on a single server, the same principles are extended to large‑scale clusters in the Hadoop Distributed File System (HDFS).

HDFS treats a cluster of many servers as a single massive storage pool, achieving petabyte‑scale capacity. Its key components are:

NameNode : Manages metadata such as file paths, block IDs, and locations, similar to a file allocation table.

DataNode : Stores actual data blocks and handles read/write operations. Blocks are replicated (default three copies) across different DataNodes to ensure high availability.
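To get a feel for what replication costs, the following sketch estimates how many blocks a file occupies and how much raw disk space it consumes cluster-wide. The 128 MiB block size is an assumption (it is the default in recent Hadoop releases but is not stated in this article), and `hdfs_footprint` is a hypothetical helper, not an HDFS API.

```python
import math

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Each replica stores the file's actual bytes, so raw usage scales with
    # the replication factor, not with the padded block size.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1024**3)  # a 1 GiB file
print(blocks, raw / 1024**3)
```

With three replicas, usable capacity is about one third of raw capacity, which is lower than RAID10's 50% but lets the data survive the loss of whole servers rather than single disks.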

When a DataNode fails, the NameNode detects the missing heartbeats, identifies the blocks that need additional replicas, and instructs other DataNodes to create new copies, preserving the configured replication factor.
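The recovery step can be sketched as a small planning function. This is a simplified model, not NameNode code: the function name, the node names, and the dictionary layout are assumptions for illustration, and targets are picked alphabetically, whereas real HDFS placement also considers racks and node load.

```python
def plan_rereplication(block_locations, live_nodes, replication=3):
    """Return {block_id: [new target nodes]} for under-replicated blocks.

    block_locations: dict mapping block ID -> set of node names holding a replica.
    live_nodes: set of node names still sending heartbeats.
    """
    plan = {}
    for block_id, holders in sorted(block_locations.items()):
        alive = holders & live_nodes
        missing = replication - len(alive)
        if missing <= 0 or not alive:
            continue  # fully replicated, or no surviving replica to copy from
        candidates = sorted(live_nodes - alive)
        plan[block_id] = candidates[:missing]
    return plan

locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn1", "dn3", "dn4", "dn5"}  # dn2 has stopped sending heartbeats
plan = plan_rereplication(locations, live)
print(plan)
```

Each affected block is copied from a surviving replica to a node that does not already hold it, which restores the configured replication factor.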

HDFS write workflow:

The client calls the HDFS API to create a file.

NameNode allocates free blocks and returns the IDs and target DataNode locations.

The client streams data to the first DataNode, which writes to its local disk and forwards the data to the next DataNode, and so on, forming a pipeline.

After all data is written, the client notifies the NameNode, which marks the file as complete and ready for reading.
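The pipeline in the third step can be modelled in a few lines. This is an in-memory toy, assuming a hypothetical `DataNode` class and 4-byte packets; it omits acknowledgements, checksums, and failure handling, all of which the real protocol includes.

```python
class DataNode:
    """Toy DataNode: stores packets locally, then forwards them downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.disk = {}  # block_id -> bytearray

    def receive(self, block_id, packet):
        self.disk.setdefault(block_id, bytearray()).extend(packet)
        if self.downstream is not None:
            self.downstream.receive(block_id, packet)

def build_pipeline(names):
    """Chain DataNodes so the first name is the head of the pipeline."""
    node = None
    for name in reversed(names):
        node = DataNode(name, downstream=node)
    return node

def pipeline_nodes(head):
    """Walk the chain from head to tail."""
    nodes = []
    while head is not None:
        nodes.append(head)
        head = head.downstream
    return nodes

head = build_pipeline(["dn1", "dn2", "dn3"])
payload = b"hello hdfs pipeline"
for offset in range(0, len(payload), 4):  # client streams 4-byte packets
    head.receive("blk_1", payload[offset:offset + 4])

replicas = [bytes(n.disk["blk_1"]) for n in pipeline_nodes(head)]
assert replicas == [payload] * 3
```

The client only ever sends data to the first node, so its outbound bandwidth is used once per block regardless of the replication factor.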

In practice, rather than moving large datasets to a single processing node, computation is sent to the nodes where the data resides, a concept embodied by Hadoop’s MapReduce framework.
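A word count gives a rough picture of moving computation to the data. The sketch below is plain Python rather than the Hadoop MapReduce API, and the node names and sample lines are invented: each "node" counts words in the lines it already stores, and only the small partial counts are merged.

```python
from collections import Counter

# Hypothetical layout: each node holds some lines of a larger dataset.
node_data = {
    "dn1": ["raid hdfs", "hdfs block"],
    "dn2": ["block replica", "raid parity"],
}

def local_map(lines):
    """Runs on the node that stores the lines; returns partial word counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Merge the partial counts produced on each node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [local_map(lines) for lines in node_data.values()]
totals = reduce_counts(partials)
print(dict(totals))
```

Only the per-node summaries cross the network, not the raw lines, which is the point of scheduling work where the blocks live.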

Overall, RAID provides the foundational techniques for improving storage performance and reliability, and HDFS scales these ideas to distributed environments to meet the demands of big data.

Common RAID configurations diagram.

Comparison of several RAID technologies.

HDFS block replication strategy.

HDFS architecture diagram.

HDFS file write operation flow.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
