Design and Implementation of a Custom Distributed File System (DFS) for Scalable Data Storage
This article describes the design, architecture, and key features of a Linux‑based distributed file system that emphasizes horizontal scalability, data consistency, fault tolerance, and efficient handling of both small and large files without traditional metadata services.
Keyword Explanation
Logical cluster: a group of storage machines that share the same groupname and syncgroup.
Abstract
We designed and implemented a Linux‑based Distributed File System (DFS) aimed at large‑scale data storage and massive data access with good scalability. Although similar to many industry DFS solutions, our DFS is tailored to specific business problems, offering better adaptability, maintainability, and usability.
DFS has been deployed in production, currently running on nine data nodes storing 600 GB, with expectations to exceed 1 TB soon and eventually handle around 20 TB.
Introduction
Our DFS stores article chapters for a reading‑oriented website, where traditional relational databases are unsuitable due to the massive amount of text data. Existing solutions like FastDFS cannot meet our requirements for horizontal scaling and performance, prompting us to build a custom DFS.
The system shares many goals with other distributed file systems (integrity, consistency, recoverability, scalability, reliability, availability, cost). It differs in that it needs neither public interfaces nor metadata handling, which leads to a KV‑style design.
By removing metadata, the architecture is simplified: a tracker node handles state control and load balancing, while storage nodes manage file operations and synchronization.
Design Overview
Goal
The system must scale horizontally by simply adding machines, with automatic handling of storage capacity and throughput.
Vertical scaling (adding disks or capacity) must also be automated.
Failures of low‑cost commodity machines must be detected within 30 seconds, and the failed machine must be bypassed.
Multi‑master mode is required for high read traffic.
Support both small (KB) and medium (MB) files; large (GB) files are handled by external solutions.
Ensure data safety and consistency.
Frequent data modifications must be supported without excessive file‑hole overhead.
Data re‑balancing is avoided; only disaster recovery or sync moves data.
The system must tolerate network jitter in both client‑to‑storage and internal communication.
Architecture
The design consists of a tracker node (state control and load balancing) and storage nodes. The tracker monitors storage nodes through periodic health checks and routes client requests to healthy nodes. Storage nodes handle file CRUD operations and synchronization, and are organized into logical clusters and syncgroups for mirroring.
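The tracker's role can be sketched as a heartbeat table plus a routing function. This is a minimal illustration, not the system's actual protocol: the `Tracker` class, its method names, and the round-robin routing policy are assumptions. Only the 30-second detection window comes from the goals above.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; from the 30-second failure detection goal


class Tracker:
    """Hypothetical tracker: records heartbeats and routes around silent nodes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the timeout can be tested
        self._last_seen = {}     # storage node id -> time of last heartbeat
        self._cursor = 0         # round-robin position (assumed policy)

    def heartbeat(self, node_id):
        """Called whenever a storage node reports in."""
        self._last_seen[node_id] = self._clock()

    def alive_nodes(self):
        """Nodes whose last heartbeat is within the timeout window."""
        now = self._clock()
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen <= HEARTBEAT_TIMEOUT)

    def route(self):
        """Pick a healthy storage node for the next client request."""
        nodes = self.alive_nodes()
        if not nodes:
            raise RuntimeError("no storage node available")
        node = nodes[self._cursor % len(nodes)]
        self._cursor += 1
        return node
```

A node that stops sending heartbeats simply drops out of `alive_nodes()` once the window expires, so clients are routed around it without any explicit removal step.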
Process Internal Structure Model
DFS uses a three‑module thread model: mainsocket module (single‑threaded, accepts connections), network module (thread pool, processes socket buffers), and task module (thread pool, executes actual tasks and assembles responses).
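The three stages can be pictured as threads connected by queues. The sketch below is an assumption-laden illustration: raw byte messages stand in for accepted sockets, the `"OP key"` wire format is invented for the example, and the function and queue names are hypothetical.

```python
import queue
import threading


def run_pipeline(raw_messages, net_workers=2, task_workers=2):
    """Push messages through mainsocket -> network pool -> task pool."""
    net_q, task_q, responses = queue.Queue(), queue.Queue(), queue.Queue()

    def main_socket():
        # Single thread: stands in for accept(); it only hands work on.
        for raw in raw_messages:
            net_q.put(raw)

    def network_worker():
        # Network pool: turns buffered bytes into parsed requests.
        while True:
            raw = net_q.get()
            if raw is None:          # sentinel: shut down
                break
            op, _, key = raw.decode().partition(" ")
            task_q.put((op, key))

    def task_worker():
        # Task pool: executes the request and assembles the response.
        while True:
            request = task_q.get()
            if request is None:
                break
            op, key = request
            responses.put(f"OK {op} {key}")

    acceptor = threading.Thread(target=main_socket)
    net_pool = [threading.Thread(target=network_worker) for _ in range(net_workers)]
    task_pool = [threading.Thread(target=task_worker) for _ in range(task_workers)]
    for thread in [acceptor, *net_pool, *task_pool]:
        thread.start()

    # Drain stage by stage so no request is lost during shutdown.
    acceptor.join()
    for _ in net_pool:
        net_q.put(None)
    for thread in net_pool:
        thread.join()
    for _ in task_pool:
        task_q.put(None)
    for thread in task_pool:
        thread.join()
    return sorted(responses.get() for _ in range(responses.qsize()))
```

Keeping the accepting thread free of parsing and disk work means a slow request cannot stall new connections, which is presumably the motivation for separating the modules.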
Class Directory Service
A lightweight class‑directory service provides storage server location without storing actual file directories, similar to Ceph’s CRUSH algorithm but simplified.
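The article does not say how the service maps a file to its storage servers. As one possible reading, the sketch below hashes the file key onto a fixed, ordered list of logical clusters. The `locate` function, the directory layout, and the use of SHA-1 are all assumptions made for illustration.

```python
import hashlib


def locate(key, directory):
    """Map a file key to one logical cluster.

    directory: ordered list of (groupname, [storage addresses]) tuples.
    Assumes the list of clusters stays fixed; changing its length would
    remap existing keys.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(directory)
    return directory[index]


# Hypothetical directory with two logical clusters of mirrored nodes.
directory = [
    ("group-a", ["10.0.0.1:9000", "10.0.0.2:9000"]),
    ("group-b", ["10.0.0.3:9000", "10.0.0.4:9000"]),
]
groupname, servers = locate("book/42/chapter/7", directory)
```

Because the location is computed from the key, no per-file directory entry has to be stored or looked up.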
Storage Model
Files are classified as "singlefile" (large files) or "chunkfile" (small files). Chunkfiles store metadata (creation time, delete flag, length, version "opver") to support MVCC‑style consistency.
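A chunkfile record with the metadata fields listed above might be laid out as a fixed-size header followed by the payload. The field order, field widths, and function names here are assumptions; the article names the fields but not their encoding.

```python
import struct

# Hypothetical little-endian layout with no padding:
# creation time (u64), opver (u64), payload length (u32), delete flag (u8).
HEADER = struct.Struct("<QQIB")


def pack_record(created, opver, payload, deleted=False):
    """Serialize one small file as header + payload."""
    return HEADER.pack(created, opver, len(payload), int(deleted)) + payload


def unpack_record(buf, offset=0):
    """Read the record at offset; return it and the offset of the next one."""
    created, opver, length, deleted = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    record = {
        "created": created,
        "opver": opver,
        "deleted": bool(deleted),
        "payload": buf[start:start + length],
    }
    return record, start + length
```

Storing the length in the header lets a reader walk a chunkfile record by record, and the delete flag lets a file be marked as removed without rewriting the chunk.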
Data Consistency Issues
Consistency is achieved using a monotonically increasing operation version (opver) instead of timestamps, avoiding clock skew problems across storage nodes.
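The effect of version-based ordering can be shown with a replica that accepts an operation only when its opver is newer than the one it already holds. This is a sketch of the general idea; the `Replica` class is hypothetical, and how the real system assigns opver values is not described in the article.

```python
class Replica:
    """Hypothetical replica that resolves conflicts by opver, not wall-clock time."""

    def __init__(self):
        self.store = {}  # key -> (opver, value)

    def apply(self, key, opver, value):
        """Apply an operation unless an equal or newer version is already stored."""
        current = self.store.get(key)
        if current is not None and current[0] >= opver:
            return False             # stale or duplicate operation: ignore
        self.store[key] = (opver, value)
        return True


# Two replicas receive the same operations in different orders.
a, b = Replica(), Replica()
a.apply("ch1", 1, "draft")
a.apply("ch1", 2, "final")
b.apply("ch1", 2, "final")
b.apply("ch1", 1, "draft")           # arrives late and is rejected
```

Both replicas end up holding version 2, regardless of delivery order or of what their local clocks say.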
Thread Safety
DFS employs object‑threading and object pools to minimize lock contention, serializing conflicting operations within the task module.
File Hole Issue
To limit file holes, DFS reserves extra space (about 20% of the initial allocation) so that modify operations can be applied in place. Modifications that outgrow the reserve reuse reclaimed space, which is tracked in a skip‑list.
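The allocation policy can be sketched as follows. Several details are assumptions: the `ChunkAllocator` class is hypothetical, offsets and sizes are plain integers, and a sorted Python list searched with `bisect` stands in for the skip‑list, since both give ordered lookup by size.

```python
import bisect

RESERVE_PERCENT = 20  # extra space reserved on top of the initial size


class ChunkAllocator:
    """Hypothetical allocator for slots inside one chunkfile."""

    def __init__(self):
        self.end = 0     # next unused offset at the tail of the chunkfile
        self.free = []   # reclaimed slots as sorted (capacity, offset) pairs
        self.slots = {}  # key -> (offset, capacity)

    def _allocate(self, size):
        # Round the reserve up so small files still get some slack.
        capacity = size + (size * RESERVE_PERCENT + 99) // 100
        # Smallest reclaimed slot that is large enough, if any.
        i = bisect.bisect_left(self.free, (capacity, -1))
        if i < len(self.free):
            cap, offset = self.free.pop(i)
            return offset, cap
        offset = self.end
        self.end += capacity
        return offset, capacity

    def write(self, key, size):
        """Return the offset at which `size` bytes for `key` are stored."""
        old = self.slots.get(key)
        if old is not None and size <= old[1]:
            return old[0]                     # fits in the reserve: in place
        self.slots[key] = self._allocate(size)
        if old is not None:
            # The old slot becomes a hole that later writes can reuse.
            bisect.insort(self.free, (old[1], old[0]))
        return self.slots[key][0]
```

Small edits stay in place and create no hole; only an edit that exceeds the reserve relocates the data, and the slot it leaves behind is offered to later writes instead of being wasted.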
Data Integrity
Two log types are used: binlog (records CRUD operations) and synclog (records synchronized data). Real‑time sync uses a gossip‑based push model, while single‑disk recovery uses a pull model to rebuild lost data.
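The interplay of the two logs and the two sync directions can be sketched with in-memory nodes. This is a simplification under stated assumptions: the `StorageNode` class and its methods are hypothetical, peers are called directly rather than over the network, and the push is a plain fan-out to every peer rather than a real gossip protocol.

```python
class StorageNode:
    """Hypothetical storage node with a binlog, a synclog, and push/pull sync."""

    def __init__(self, name):
        self.name = name
        self.data = {}     # key -> (opver, value)
        self.binlog = []   # operations accepted locally, in order
        self.synclog = []  # operations received from peers, in order
        self.peers = []

    def _apply(self, key, opver, value):
        current = self.data.get(key)
        if current is None or opver > current[0]:
            self.data[key] = (opver, value)

    def write(self, key, opver, value):
        """Client write: apply, record in the binlog, push to peers."""
        entry = (key, opver, value)
        self._apply(*entry)
        self.binlog.append(entry)
        for peer in self.peers:      # real-time push
            peer.receive(entry)

    def receive(self, entry):
        """Peer push: apply and record in the synclog."""
        self._apply(*entry)
        self.synclog.append(entry)

    def recover_from(self, peer):
        """Pull: rebuild lost data by replaying a surviving peer's logs."""
        for entry in peer.binlog + peer.synclog:
            self._apply(*entry)


a, b = StorageNode("a"), StorageNode("b")
a.peers, b.peers = [b], [a]
a.write("ch1", 1, "text")            # pushed to b as it happens
replacement = StorageNode("a2")
replacement.recover_from(b)          # pulled after a disk loss
```

Keeping pushed operations in a separate synclog lets a node tell its own writes apart from replicated ones, which avoids pushing the same operation back to the peer it came from.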
Runtime State
Each storage node maintains state files for persistent metadata and temporary runtime information, enabling graceful restarts.
Machine Changes and Data Migration
Adding or removing machines does not trigger data migration; logical clusters are fixed, and replication is limited to the number of nodes within a cluster.
