Understanding 4 Major Distributed File Systems: HDFS, CephFS, GFS, and TFS
This article provides a concise overview of four key distributed file systems—HDFS, CephFS, GFS, and TFS—explaining their architectures, strengths, weaknesses, and typical application scenarios for large‑scale data storage and processing.
Hello, I am mikechen. Distributed file storage is the foundation of large‑scale architectures, and here I focus on four major distributed file systems.
HDFS
HDFS is a core component of the Apache Hadoop project, designed for batch processing of big data. It follows a master/slave architecture.
NameNode (master) manages metadata such as directory structures and file‑to‑block mappings, but does not store actual data. DataNode (slave) stores the actual data blocks, typically sized at 128 MB or 256 MB.
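As a rough illustration, the file-to-block mapping that the NameNode tracks can be sketched like this; the block size matches a common HDFS default, but the helper name and tuple layout are illustrative, not Hadoop APIs:

```python
# Sketch of HDFS-style file-to-block splitting (illustrative, not Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, offset, length) tuples for a file of file_size bytes."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file needs three blocks: two full 128 MB blocks and a 44 MB tail.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))    # 3
print(blocks[-1][2])  # length of the final, partial block
```

Note that the last block occupies only as much space as it needs; a 300 MB file does not consume 3 x 128 MB.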
Advantages: high throughput for sequential reads/writes of large files and tight integration with the Hadoop ecosystem (MapReduce, Hive, Spark, etc.).
Disadvantages: poor low‑latency random access and unsuitable for small files due to metadata overhead.
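The small-file problem is easy to estimate with the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block). The helper below is a back-of-the-envelope sketch under that assumption, not an exact formula:

```python
# Back-of-the-envelope estimate of NameNode memory for small files.
# The ~150 bytes/object figure is a widely quoted rule of thumb, not exact.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Estimate NameNode heap (GB) for num_files files of blocks_per_file blocks."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million one-block files already cost roughly 28 GB of metadata RAM,
# regardless of how tiny each file is:
print(round(namenode_heap_gb(100_000_000), 1))
```

This is why packing many small records into fewer large files (e.g., via sequence files) is the standard advice on HDFS.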
Typical use case: offline big‑data processing platforms and log storage.
CephFS
CephFS is the distributed file system component of Ceph, using the CRUSH algorithm for data placement.
At Ceph’s core is the CRUSH algorithm, which deterministically maps objects to storage devices by hashing, so any client can compute where data lives without consulting a central lookup table.
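The "hash instead of look up" idea can be sketched with rendezvous (highest-random-weight) hashing. Real CRUSH additionally walks a weighted hierarchy of failure domains, so this is only a simplified approximation, and the OSD names are made up:

```python
# Simplified CRUSH-style placement via rendezvous (highest-random-weight)
# hashing. Real CRUSH also respects device weights and failure domains.
import hashlib

def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs for an object purely by hashing, with no central table."""
    scored = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest(),
        reverse=True,  # take the highest-scoring devices
    )
    return scored[:replicas]

osds = [f"osd.{i}" for i in range(8)]
# Every client computes the same placement independently and deterministically:
assert place("rbd_data.1234", osds) == place("rbd_data.1234", osds)
```

Because placement is computed rather than stored, adding or removing a device only remaps the objects whose winning devices change, rather than reshuffling everything.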
Key components: OSD (Object Storage Device) daemons store data objects (usually one OSD per physical disk), Monitors maintain cluster membership and the cluster maps, and MDS (Metadata Server) manages CephFS file metadata.
Workflow: clients obtain metadata from MDS and then read/write data directly with OSDs.
Application scenario: cloud computing storage backends, e.g., OpenStack.
GFS
Google File System (GFS) is a distributed file system developed at Google; its design inspired HDFS.
GFS also uses a master/slave architecture similar to HDFS.
Master manages metadata such as namespace, access control, and block locations. Chunkservers store actual data chunks (typically 64 MB each) with multiple replicas.
Workflow: clients first contact the Master for metadata, then interact directly with Chunkservers for data I/O.
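That two-step read path can be sketched as follows. The 64 MB chunk size comes from the GFS design, while the replica table and chunkserver names are stand-ins for the Master's metadata:

```python
# Sketch of the GFS read path: the client asks the Master only for metadata
# (which chunk, which chunkservers), then fetches bytes directly from a replica.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, per the GFS design

def locate(file_offset: int) -> tuple[int, int]:
    """Step 1 (client): turn a byte offset into (chunk_index, offset_in_chunk)."""
    return divmod(file_offset, CHUNK_SIZE)

# Stand-in for the Master's metadata: chunk index -> replica chunkservers.
chunk_locations = {0: ["cs-1", "cs-4", "cs-7"], 1: ["cs-2", "cs-5", "cs-8"]}

index, offset = locate(100 * 1024 * 1024)  # byte offset 100 MB falls in chunk 1
replicas = chunk_locations[index]          # step 2: Master returns locations
# Step 3: the client reads at `offset` within the chunk from any replica,
# so bulk data never flows through the Master.
```

Keeping the Master out of the data path is what lets a single metadata master scale to very large clusters.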
TFS
TFS (Taobao File System) is a distributed file system developed at Taobao, optimized for storing massive numbers of small files.
TFS also follows a master/slave design but focuses on small‑file performance.
Typical use cases: e‑commerce product images, user avatars, short videos, and document storage.
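The core trick that makes small files cheap in TFS-style systems is to pack many files into large fixed-size blocks and encode the location into the returned file name, so the name server tracks blocks rather than individual files. Below is a hedged sketch of that idea; the byte layout and name format are illustrative, not TFS's actual codec:

```python
# Sketch of TFS-style self-describing file names: (block_id, file_id) is
# packed into the name itself, so no per-file metadata lookup is needed.
# The struct layout and "T" prefix here are illustrative only.
import base64
import struct

def encode_name(block_id: int, file_id: int) -> str:
    raw = struct.pack(">II", block_id, file_id)  # 8 bytes of location info
    return "T" + base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_name(name: str) -> tuple[int, int]:
    s = name[1:]
    raw = base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))  # restore padding
    return struct.unpack(">II", raw)

name = encode_name(block_id=4096, file_id=17)
# Later, a client can route a read straight to the right block:
assert decode_name(name) == (4096, 17)
```

Because the name itself says where the data lives, serving an image is one hash-free lookup into the right block, which is why this design suits billions of tiny objects.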
