How Hadoop Implements Distributed File Systems: From GFS Theory to Practice
This article explains the fundamentals of distributed file systems by linking Google’s GFS, MapReduce, and BigTable concepts to Hadoop’s open‑source implementation, covering terminology, architecture, server roles, data distribution, RPC protocols, file operations, fault recovery, consistency, load balancing, and garbage collection.
Distributed File System Basics
The term “distributed” here refers specifically to systems built around Google’s three core components—GFS, MapReduce, and BigTable—and Hadoop serves as the open‑source Java implementation of these ideas.
Terminology Mapping
A concise term‑mapping table aligns Hadoop’s names (NameNode, DataNode, Block, etc.) with their GFS equivalents (Master, ChunkServer, Chunk, etc.) to clarify concepts.
Core Architecture
Three main server types form the backbone: the master server (NameNode/Master) stores the namespace and file-to-block metadata, data servers (DataNode/ChunkServer) hold the replicated blocks, and a client library provides API access. The NameNode is a single point of control, and of failure; the SecondaryNameNode is not a hot standby but periodically merges the edit log into a fresh namespace checkpoint so that NameNode restarts stay fast.
Data Distribution
Files are split into 64 MB blocks, each replicated on three data servers by default. The NameNode keeps the namespace and block metadata entirely in memory; it does not persist block locations, which are instead rebuilt dynamically from the block reports DataNodes send at startup.
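The arithmetic of this layout is simple but worth making concrete. The following sketch (the class name BlockLayout is ours, not a Hadoop API) shows how a file length maps onto 64 MB blocks and how much raw cluster storage three-way replication consumes:

```java
// Illustrative sketch of HDFS block accounting; constants mirror the
// classic defaults described above (64 MB blocks, 3 replicas).
public class BlockLayout {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB
    static final int REPLICATION = 3;                 // default replica count

    /** Number of blocks a file of the given length occupies. */
    static long numBlocks(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Raw bytes consumed cluster-wide once every block is replicated. */
    static long rawBytes(long fileLength) {
        return fileLength * REPLICATION;
    }

    public static void main(String[] args) {
        long len = 200L * 1024 * 1024;      // a 200 MB file
        System.out.println(numBlocks(len)); // 4: three full blocks + an 8 MB tail
        System.out.println(rawBytes(len));  // 600 MB of raw storage
    }
}
```

Note that the final block is usually short: unlike a traditional file system, HDFS does not pad the tail block out to 64 MB on disk.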
Server‑to‑Server Protocols
Hadoop uses its own RPC layer for control messages between clients, the NameNode, and DataNodes, while bulk data transfer uses a separate streaming socket protocol (DataTransferProtocol) that bypasses the RPC layer for efficiency.
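On the control path, Hadoop's RPC hands the client a dynamic proxy that implements a protocol interface, so a remote call looks like an ordinary method call. The sketch below shows only that shape: the interface name NamenodeProtocolSketch and the stub are ours, and the "server" is an in-process object rather than a socket endpoint.

```java
import java.lang.reflect.Proxy;

// Minimal sketch of the proxy-based RPC idea used for control messages.
// In real Hadoop RPC the invocation handler serializes the call and
// ships it over a socket; here it simply forwards to a local stub.
public class RpcSketch {
    interface NamenodeProtocolSketch {
        String[] getBlockLocations(String path);
    }

    // Stand-in for the server-side implementation behind the RPC layer.
    static class NamenodeStub implements NamenodeProtocolSketch {
        public String[] getBlockLocations(String path) {
            return new String[] { "datanode-1", "datanode-2", "datanode-3" };
        }
    }

    @SuppressWarnings("unchecked")
    static <T> T getProxy(Class<T> protocol, T server) {
        return (T) Proxy.newProxyInstance(
            protocol.getClassLoader(), new Class<?>[] { protocol },
            (proxy, method, args) -> method.invoke(server, args));
    }

    public static void main(String[] args) {
        NamenodeProtocolSketch nn =
            getProxy(NamenodeProtocolSketch.class, new NamenodeStub());
        // The caller never sees the transport; it just invokes the interface.
        System.out.println(nn.getBlockLocations("/user/data/file")[0]); // datanode-1
    }
}
```

This is why the RPC path suits small, latency-sensitive metadata calls, while 64 MB block payloads go over the dedicated streaming protocol instead.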
File Operations
Directory operations (create, delete, rename, permission changes) are handled by the client invoking NameNode RPCs, which update the namespace and write logs. Reads involve the client locating block replicas via the NameNode and streaming data directly from a chosen DataNode. Writes are pipeline‑based, requiring coordinated block replication and lease management to ensure consistency.
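The two-step read path described above can be simulated end to end with in-memory maps standing in for the NameNode's namespace and the DataNodes' storage. Every name in this sketch is illustrative, not a Hadoop class:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the HDFS read path: resolve path -> block ids at the
// "NameNode", then fetch each block from one of its replicas.
public class ReadPathSketch {
    // path -> ordered block ids (the NameNode's namespace)
    static final Map<String, List<String>> namespace = new HashMap<>();
    // block id -> replica bytes keyed by datanode (the DataNodes' storage)
    static final Map<String, Map<String, byte[]>> replicas = new HashMap<>();

    /** "NameNode" call: resolve a path to its ordered block ids. */
    static List<String> getBlocks(String path) {
        return namespace.getOrDefault(path, List.of());
    }

    /** "DataNode" call: read one replica; a real client prefers the nearest. */
    static byte[] readBlock(String blockId) {
        return replicas.get(blockId).values().iterator().next();
    }

    static byte[] read(String path) {
        List<byte[]> parts = new ArrayList<>();
        int total = 0;
        for (String blockId : getBlocks(path)) { // step 1: locate blocks
            byte[] part = readBlock(blockId);    // step 2: stream each block
            parts.add(part);
            total += part.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    public static void main(String[] args) {
        namespace.put("/user/demo", List.of("blk_1", "blk_2"));
        replicas.put("blk_1", Map.of("dn1", "hello ".getBytes(), "dn2", "hello ".getBytes()));
        replicas.put("blk_2", Map.of("dn2", "world".getBytes(), "dn3", "world".getBytes()));
        System.out.println(new String(read("/user/demo"))); // hello world
    }
}
```

The key property the sketch preserves is that block bytes never pass through the NameNode: it answers only the metadata question, and the data flows client-to-DataNode.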
Distributed Support
Fault tolerance is achieved through client write leases, DataNode heartbeats and block reports, and NameNode checkpointing with a SecondaryNameNode. Data integrity relies on generation stamps, CRC-32 checksums computed for every fixed-size chunk of a block (512 bytes by default), and lease-recovery mechanisms. Load balancing places replicas across racks and rebalances blocks as the cluster fills unevenly, while garbage collection removes obsolete and over-replicated replicas.
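The per-chunk checksum idea is easy to demonstrate with the JDK's CRC-32 implementation. This sketch follows HDFS in spirit only (one checksum per 512-byte chunk); the class and method names are ours:

```java
import java.util.zip.CRC32;

// Sketch of per-chunk data integrity: one CRC-32 checksum for each
// 512-byte chunk of a block, so corruption is localized to a chunk.
public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512; // HDFS default chunk size

    static long[] checksum(byte[] data) {
        int n = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[n];
        CRC32 crc = new CRC32();
        for (int i = 0; i < n; i++) {
            crc.reset();
            int from = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - from);
            crc.update(data, from, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the index of the first corrupt chunk, or -1 if all match. */
    static int verify(byte[] data, long[] expected) {
        long[] actual = checksum(data);
        for (int i = 0; i < actual.length; i++)
            if (actual[i] != expected[i]) return i;
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = "replicated block payload".getBytes();
        System.out.println(verify(block, checksum(block))); // -1: no corruption
    }
}
```

When a read detects a checksum mismatch, an HDFS client reports the bad replica and retries from another DataNode, which is what makes localized, per-chunk detection valuable.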
Conclusion
The design principles of distributed file systems—replicated block storage, centralized metadata, robust logging, and lease‑based coordination—underpin many modern big‑data platforms.
ITFLY8 Architecture Home
