Big Data 32 min read

Design and Implementation of a Custom Distributed File System (DFS) for Scalable Data Storage

This article describes the design, architecture, and key features of a Linux‑based distributed file system that emphasizes horizontal scalability, data consistency, fault tolerance, and efficient handling of both small and large files without traditional metadata services.

Architecture Digest

Mar 26, 2016

Design and Implementation of a Custom Distributed File System (DFS) for Scalable Data Storage

Keyword Explanation

Logical cluster: a group of storage machines that share the same groupname and syncgroup.

Abstract

We designed and implemented a Linux‑based Distributed File System (DFS) aimed at large‑scale data storage and massive data access with good scalability. Although similar to many industry DFS solutions, our DFS is tailored to specific business problems, offering better adaptability, maintainability, and usability.

DFS has been deployed in production, currently running on nine data nodes storing 600 GB, with expectations to exceed 1 TB soon and eventually handle around 20 TB.

Introduction

Our DFS stores article chapters for a reading‑oriented website, where traditional relational databases are unsuitable due to the massive amount of text data. Existing solutions like FastDFS cannot meet our requirements for horizontal scaling and performance, prompting us to build a custom DFS.

The system shares many goals with other DFS (integrity, consistency, recoverability, scalability, reliability, availability, cost) but differs by eliminating the need for public interfaces and metadata handling, leading to a KV‑style design.

By removing metadata, the architecture is simplified: a tracker node handles state control and load balancing, while storage nodes manage file operations and synchronization.

Design Overview

Goal

The system must scale horizontally by simply adding machines, with automatic handling of storage capacity and throughput.

Vertical scaling (adding disks or capacity) must also be automated.

Failure of cheap machines should be detected within 30 seconds and bypassed.

Multi‑master mode is required for high read traffic.

Support both small (KB) and medium (MB) files; large (GB) files are handled by external solutions.

Ensure data safety and consistency.

Frequent data modifications must be supported without excessive file‑hole overhead.

Data re‑balancing is avoided; only disaster recovery or sync moves data.

The system must tolerate network jitter both client‑to‑storage and internal communication.

Architecture

The design consists of a tracker node (state control and load balancing) and storage nodes. Tracker monitors storage health, performs health checks, and routes client requests. Storage nodes handle file CRUD operations and synchronization, organized into logical clusters and syncgroups for mirroring.

Process Internal Structure Model

DFS uses a three‑module thread model: mainsocket module (single‑threaded, accepts connections), network module (thread pool, processes socket buffers), and task module (thread pool, executes actual tasks and assembles responses).

Class Directory Service

A lightweight class‑directory service provides storage server location without storing actual file directories, similar to Ceph’s CRUSH algorithm but simplified.

Storage Model

Files are classified as "singlefile" (large files) or "chunkfile" (small files). Chunkfiles store metadata (creation time, delete flag, length, version "opver") to support MVCC‑style consistency.

Data Consistency Issues

Consistency is achieved using a monotonically increasing operation version (opver) instead of timestamps, avoiding clock skew problems across storage nodes.

Thread Safety

DFS employs object‑threading and object pools to minimize lock contention, serializing conflicting operations within the task module.

File Hole Issue

DFS reserves extra space (≈20% of initial allocation) for modify operations; larger modifications trigger reuse of reclaimed space via a skip‑list.

Data Integrity

Two log types are used: binlog (records CRUD operations) and synclog (records synchronized data). Real‑time sync uses a gossip‑based push model, while single‑disk recovery uses a pull model to rebuild lost data.

Runtime State

Each storage node maintains state files for persistent metadata and temporary runtime information, enabling graceful restarts.

Machine Changes and Data Migration

Adding or removing machines does not trigger data migration; logical clusters are fixed, and replication is limited to the number of nodes within a cluster.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

scalability Data Consistency Synchronization storage architecture Distributed File System metadata-less design

Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.