Understanding 4 Major Distributed File Systems: HDFS, CephFS, GFS, and TFS
This article provides a concise overview of four key distributed file systems—HDFS, CephFS, GFS, and TFS—explaining their architectures, strengths, weaknesses, and typical application scenarios for large‑scale data storage and processing.
Hello, I am mikechen. Distributed file storage is the foundation of large‑scale architectures, and here I focus on four major distributed file systems.
HDFS
HDFS is a core component of the Apache Hadoop project, designed for batch processing of big data. It follows a master/slave architecture.
NameNode (master) manages metadata such as directory structures and file‑to‑block mappings, but does not store actual data. DataNode (slave) stores the actual data blocks, typically sized at 128 MB or 256 MB.
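As a rough illustration, the file-to-block mapping that the NameNode tracks can be sketched like this; the block size matches a common HDFS default, but the helper name and tuple layout are illustrative, not Hadoop APIs:

```python
# Sketch of HDFS-style file-to-block splitting (illustrative, not Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, offset, length) tuples for a file of file_size bytes."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file needs three blocks: two full 128 MB blocks and a 44 MB tail.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))    # 3
print(blocks[-1][2])  # length of the final, partial block
```

Note that the last block occupies only as much space as it needs; a 300 MB file does not consume 3 x 128 MB.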
Advantages: high throughput for sequential reads/writes of large files and tight integration with the Hadoop ecosystem (MapReduce, Hive, Spark, etc.).
Disadvantages: poor low‑latency random access and unsuitable for small files due to metadata overhead.
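The small-file problem is easy to estimate with the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block). The helper below is a back-of-the-envelope sketch under that assumption, not an exact formula:

```python
# Back-of-the-envelope estimate of NameNode memory for small files.
# The ~150 bytes/object figure is a widely quoted rule of thumb, not exact.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Estimate NameNode heap (GB) for num_files files of blocks_per_file blocks."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million one-block files already cost roughly 28 GB of metadata RAM,
# regardless of how tiny each file is:
print(round(namenode_heap_gb(100_000_000), 1))
```

This is why packing many small records into fewer large files (e.g., via sequence files) is the standard advice on HDFS.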
Typical use case: offline big‑data processing platforms and log storage.
CephFS
CephFS is the distributed file system component of Ceph, using the CRUSH algorithm for data placement.
At Ceph’s core is the CRUSH algorithm, which deterministically maps objects to storage devices by hashing, so any client can compute where data lives without consulting a central lookup table.
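The "hash instead of look up" idea can be sketched with rendezvous (highest-random-weight) hashing. Real CRUSH additionally walks a weighted hierarchy of failure domains, so this is only a simplified approximation, and the OSD names are made up:

```python
# Simplified CRUSH-style placement via rendezvous (highest-random-weight)
# hashing. Real CRUSH also respects device weights and failure domains.
import hashlib

def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs for an object purely by hashing, with no central table."""
    scored = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest(),
        reverse=True,  # take the highest-scoring devices
    )
    return scored[:replicas]

osds = [f"osd.{i}" for i in range(8)]
# Every client computes the same placement independently and deterministically:
assert place("rbd_data.1234", osds) == place("rbd_data.1234", osds)
```

Because placement is computed rather than stored, adding or removing a device only remaps the objects whose winning devices change, rather than reshuffling everything.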
Key components: OSD (Object Storage Device) daemons store data objects (usually one OSD per physical disk), Monitors maintain cluster membership and the cluster maps, and MDS (Metadata Server) manages CephFS file metadata.
Workflow: clients obtain metadata from MDS and then read/write data directly with OSDs.
Application scenario: cloud computing storage backends, e.g., OpenStack.
GFS
Google File System (GFS) is a distributed file system developed at Google; its design inspired HDFS.
GFS also uses a master/slave architecture similar to HDFS.
Master manages metadata such as namespace, access control, and block locations. Chunkservers store actual data chunks (typically 64 MB each) with multiple replicas.
Workflow: clients first contact the Master for metadata, then interact directly with Chunkservers for data I/O.
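That two-step read path can be sketched as follows. The 64 MB chunk size comes from the GFS design, while the replica table and chunkserver names are stand-ins for the Master's metadata:

```python
# Sketch of the GFS read path: the client asks the Master only for metadata
# (which chunk, which chunkservers), then fetches bytes directly from a replica.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, per the GFS design

def locate(file_offset: int) -> tuple[int, int]:
    """Step 1 (client): turn a byte offset into (chunk_index, offset_in_chunk)."""
    return divmod(file_offset, CHUNK_SIZE)

# Stand-in for the Master's metadata: chunk index -> replica chunkservers.
chunk_locations = {0: ["cs-1", "cs-4", "cs-7"], 1: ["cs-2", "cs-5", "cs-8"]}

index, offset = locate(100 * 1024 * 1024)  # byte offset 100 MB falls in chunk 1
replicas = chunk_locations[index]          # step 2: Master returns locations
# Step 3: the client reads at `offset` within the chunk from any replica,
# so bulk data never flows through the Master.
```

Keeping the Master out of the data path is what lets a single metadata master scale to very large clusters.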
TFS
TFS (Taobao File System) is a distributed file system developed at Taobao, optimized for storing massive numbers of small files.
TFS also follows a master/slave design but focuses on small‑file performance.
Typical use cases: e‑commerce product images, user avatars, short videos, and document storage.
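The core trick that makes small files cheap in TFS-style systems is to pack many files into large fixed-size blocks and encode the location into the returned file name, so the name server tracks blocks rather than individual files. Below is a hedged sketch of that idea; the byte layout and name format are illustrative, not TFS's actual codec:

```python
# Sketch of TFS-style self-describing file names: (block_id, file_id) is
# packed into the name itself, so no per-file metadata lookup is needed.
# The struct layout and "T" prefix here are illustrative only.
import base64
import struct

def encode_name(block_id: int, file_id: int) -> str:
    raw = struct.pack(">II", block_id, file_id)  # 8 bytes of location info
    return "T" + base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_name(name: str) -> tuple[int, int]:
    s = name[1:]
    raw = base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))  # restore padding
    return struct.unpack(">II", raw)

name = encode_name(block_id=4096, file_id=17)
# Later, a client can route a read straight to the right block:
assert decode_name(name) == (4096, 17)
```

Because the name itself says where the data lives, serving an image is one hash-free lookup into the right block, which is why this design suits billions of tiny objects.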
