How a Custom Linux‑Based Distributed File System Achieves Scalability and Consistency
This article describes the design and implementation of a Linux‑based distributed file system (DFS) that targets large‑scale data storage and access, detailing its architecture, goals, storage model, consistency mechanisms, thread‑safety approach, synchronization strategies, and recovery processes to ensure high availability and data integrity.
Keyword Explanation
Logical cluster: A group of storage machines sharing the same groupname and syncgroup.
Abstract
We designed and implemented a distributed file system (DFS) built on Linux, aimed at massive data storage, massive data access, and good scalability. DFS runs on inexpensive Linux machines while still providing strong consistency and integrity.
Although similar to many existing DFS solutions, our system focuses on the specific business problems we face, without being based on academic papers, giving it better adaptability, maintainability, and usability.
DFS has already been deployed online to serve a book library, currently operating with nine data nodes storing about 600 GB, with expectations to exceed 1 TB soon and later support another business requiring around 20 TB.
This document discusses DFS architecture and features for distributed file usage and presents test data.
Introduction
Our DFS stores article chapter content for a reading‑oriented website, where traditional relational databases are unsuitable due to large article volumes. Existing file systems lack horizontal scalability, data integrity, and consistency. FastDFS was considered but its full‑mirror approach did not meet our performance and scalability needs.
Thus we built our own DFS, sharing common goals with other DFS (integrity, consistency, recoverability, scalability, reliability, availability, cost) while differing in that it does not need to be compatible with public interfaces or handle file indexing.
DFS does not store metadata; article content is treated as a key‑value attribute stored in a relational database, simplifying the system compared to GFS or HBase by removing metadata servers and the single point of failure.
Data integrity and safety are critical; the first version uses real‑time sync and single‑disk recovery, with plans to add error‑retry sync and integrity checks.
DFS supports frequent modifications, not just append‑only operations, by sparsifying files and reserving space for updates.
Without a centralized metadata server, consistency is handled via a custom consistency model discussed later.
DFS does not implement caching; storage is sparse and accessed randomly, leaving caching responsibilities to the business layer.
DFS does not need to be POSIX‑compatible and runs as user‑space processes on Linux, interacting only with standard I/O interfaces.
Design Overview
Goals
The system must scale horizontally with minimal manual intervention; adding machines should automatically increase storage capacity and throughput.
Vertical scaling (adding disks or mount points) must also be automated.
The system must tolerate frequent failures of cheap machines, detecting faults within 30 seconds and routing traffic away.
Multi‑master architecture is preferred over simple master‑slave to handle massive traffic.
Support both small (KB) and medium (MB) files; large (GB) files are handled by external solutions like TFS.
Ensure data safety and consistency.
Support frequent data modifications, handling version control and avoiding excessive space waste.
Adding machines or disks should not trigger costly data rebalancing.
The system must tolerate network instability both for client‑to‑storage and inter‑storage communication.
Architecture
DFS consists of a simplified design without metadata servers, retaining only a tracker (state control and load balancer) and storage servers. This mirrors FastDFS’s approach but removes the metadata single point of failure.
The tracker performs health checks on storages, maintains their status, and routes client requests. Without a metadata server, the tracker becomes a lightweight, horizontally scalable component.
Storages handle file CRUD operations and synchronization. Within a logical cluster, storages are divided into syncgroups; each syncgroup mirrors its data, while different syncgroups operate in parallel, eliminating inter‑group dependencies and allowing seamless capacity expansion without data migration.
Both tracker and storage run as user‑space processes on Linux, exposing simple APIs without POSIX compatibility concerns.
Process Internal Structure Model
The system uses three logical modules: mainsocket (single‑threaded acceptor), network (thread pool handling socket I/O), and task (thread pool processing actual requests). The mainsocket distributes incoming sockets to network threads via load‑balancing; network threads read data into task contexts and forward them to task modules, which serialize responses back through the network.
Class Directory Service
Although DFS eliminates central metadata, a lightweight class directory service is provided to locate storage nodes. It returns a single suitable storage based on groupname, syncgroup, heartbeat, and disk capacity, similar to Ceph’s CRUSH but with a fixed selection algorithm.
Storage Model
Files are categorized as large (single‑file) or small (chunk‑file) based on a configurable threshold. Large files are stored individually; small files are aggregated into chunk files with embedded metadata (creation time, deletion flag, last modify time, total length, actual length, and an MVCC‑like version called opver).
Chunk‑file names include machine‑unique IDs, timestamps, sequence numbers, and thread IDs to guarantee global uniqueness.
Multiple mount points (up to 256 per storage) are supported, with configurable balancing strategies (round‑robin, max‑free‑space, etc.).
Data Consistency Issues
Consistency is primarily managed via the monotonic opver version, avoiding unreliable timestamps across machines. During sync, the higher opver indicates the newer version.
Thread Safety
DFS employs object‑threading and object pools to minimize lock contention. Insert operations are randomly distributed, while modify/delete/read operations are serialized within a task module thread, ensuring no concurrent writes to the same data block.
Disk writes are also thread‑isolated: each thread writes to its own file until full, reducing file‑level lock contention.
File Hole Problem
Because DFS supports frequent modify/delete operations, file holes can appear in chunk files, wasting space. DFS reserves ~20 % extra space for expected modifications; when modifications exceed the reserve, a skip‑list reuses abandoned space to mitigate holes.
Data Integrity
DFS maintains two types of logs: binlog (records CRUD operations) and synclog (records successful synchronizations). Synclog acts as a backup of binlog on remote machines and is crucial for single‑disk recovery.
Synchronization
DFS sync consists of intra‑syncgroup mirroring (real‑time sync) and single‑disk recovery. Real‑time sync uses a gossip‑style push algorithm, where each storage pushes its binlog entries to peers in the same group and syncgroup, employing a distributed sync state machine.
Real‑time Sync
Storages periodically push new binlog entries to remote storages; the remote side records them in its synclog. The tracker only provides storage location information; the actual sync occurs directly between storages.
Single‑Disk Recovery
When a disk fails, the affected storage pulls missing data and logs from a designated master storage within the same syncgroup (pull‑based). Recovery first ensures binlog and synclog are present, then replays them to restore data. The process can be paused and resumed without data loss.
Runtime State
Each storage maintains state files: persistent metadata for machine identity and transient files used during recovery, which are removed after completion.
Machine Changes and Data Migration
Adding or removing machines does not trigger data rebalancing; logical clusters are predefined, and data is only replicated within a cluster. Backup factor equals the number of machines in a logical cluster minus one, providing predictable redundancy without costly migrations.
Source: 94geek
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
