How 3FS Revolutionizes AI Storage with High‑Throughput Distributed Filesystem
3FS, DeepSeek’s high‑performance parallel file system, is engineered for AI workloads: ultra‑low‑latency, high‑throughput storage over RDMA, strong consistency via CRAQ, and seamless cloud‑native integration. This article walks through its architecture, an end‑to‑end deployment, performance benchmarks, and cost‑saving strategies for large‑scale model training and inference.
Background
As AI models enter the trillion‑parameter era, the computational demand and multimodal training data grow exponentially, creating new challenges for distributed parallel file storage. DeepSeek’s 3FS (Fire‑Flyer File System) is introduced to provide high‑throughput storage for AI workloads.
3FS Architecture Overview
Design Philosophy
3FS abandons the traditional FUSE kernel path in favour of user‑space zero‑copy RDMA transfers, maximising hardware performance and focusing on the large files and high bandwidth typical of AI workloads.
Key Features
Hardware Performance : Bypasses the FUSE kernel layer and uses user‑space zero‑copy RDMA transmission.
AI‑Centric Design : Drops the “one‑size‑fits‑all” approach of generic file systems, concentrating on large files and high bandwidth.
Separation of Compute and Storage : Allows independent scaling of storage and compute resources.
Strong Data Consistency : Implements the CRAQ (Chain Replication with Apportioned Queries) protocol to guarantee strong consistency across nodes.
Storage Throughput Optimisation : Uses Direct I/O and RDMA to avoid OS cache overhead and achieve high I/O throughput.
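To make the CRAQ consistency guarantee above concrete, here is a minimal sketch of the protocol's read/write paths. It is a deliberate simplification: writes flow head to tail and commit at the tail; any replica may serve a read, but if its copy is "dirty" (an in‑flight uncommitted write) it must consult the tail for the committed version. All class and method names are illustrative, not 3FS's actual code.

```python
# Minimal CRAQ (Chain Replication with Apportioned Queries) sketch.

class Replica:
    def __init__(self):
        self.clean = {}   # key -> committed value
        self.dirty = {}   # key -> value written but not yet acked by the tail

class Chain:
    def __init__(self, length=3):
        self.replicas = [Replica() for _ in range(length)]

    def write(self, key, value):
        # Propagate down the chain; each node marks the new version dirty.
        for r in self.replicas:
            r.dirty[key] = value
        # The tail commits; the ack flows back and every copy becomes clean.
        for r in reversed(self.replicas):
            r.clean[key] = r.dirty.pop(key)

    def read(self, key, replica_idx=0):
        r = self.replicas[replica_idx]
        if key in r.dirty:
            # Apportioned query: ask the tail for the committed version.
            return self.replicas[-1].clean.get(key)
        return r.clean.get(key)
```

The payoff over plain chain replication is that reads spread across all replicas instead of hitting only the tail, while staleness is still impossible.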
Applicable Scenarios
Data preparation: hierarchical directory structures for massive intermediate outputs.
Data loading: random access to training samples without pre‑fetching.
Checkpoint storage: high‑throughput parallel checkpoint access for large‑scale training.
Model inference: KVCache interface provides a cost‑effective DRAM alternative with higher throughput.
Software Design
The system consists of four components: Client, Cluster Manager, Meta Service, and Storage Service, all communicating over RDMA.
Cluster Manager
Provides high availability with a primary‑backup architecture and uses etcd for failover.
Cluster change sync: real‑time node and configuration updates.
Health management: periodic heartbeats from Meta and Storage services.
Meta Service
Stateless service with multi‑instance scalability; metadata is persisted in FoundationDB, while file data is tracked at chunk granularity with CRAQ ensuring consistency across replicas.
Storage Service
Handles data persistence via a Chunk Engine composed of Chunk Allocator and MetaStore.
Client
Uses a FUSE client to connect to any Meta Service, retrieve node information, and perform I/O on the appropriate Storage Server.
Replication Strategy
Currently supports a three‑replica policy (no erasure coding). Default ChunkSize is 1 MiB and StripeSize is 16.
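The defaults above can be turned into a worked example: with a 1 MiB ChunkSize and a StripeSize of 16, a file offset maps to a chunk index and a stripe slot. The round‑robin chain selection here is an assumption for illustration; real placement comes from 3FS's generated chain table.

```python
# Sketch: mapping a file offset to a chunk index and a stripe slot,
# using the defaults cited above (ChunkSize = 1 MiB, StripeSize = 16).

CHUNK_SIZE = 1 << 20   # 1 MiB
STRIPE_SIZE = 16       # chunks are spread across 16 chains

def locate(offset: int):
    chunk_index = offset // CHUNK_SIZE
    chain_slot = chunk_index % STRIPE_SIZE  # assumed round-robin placement
    chunk_offset = offset % CHUNK_SIZE
    return chunk_index, chain_slot, chunk_offset

# A read at the 40 MiB mark lands in chunk 40, stripe slot 8, offset 0.
print(locate(40 * (1 << 20)))  # (40, 8, 0)
```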
I/O Model
For random sample reads during training, 3FS employs asynchronous Direct I/O via io_uring to avoid frequent system‑call context switches and eliminate file‑cache overhead.
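The batching idea behind this I/O model can be sketched as follows. 3FS does it in C++ with io_uring and Direct I/O; Python has no portable io_uring binding, so this stand‑in uses a thread pool plus os.pread to show the same principle: submit many independent random reads at once rather than one blocking syscall at a time.

```python
# Conceptual analogue of batched asynchronous random reads (not io_uring).

import os
from concurrent.futures import ThreadPoolExecutor

def batched_random_reads(path, requests):
    """requests: list of (offset, length); returns bytes in request order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=8) as pool:
            # Submit all reads up front; each os.pread is positionally
            # independent, so no seek coordination is needed.
            futures = [pool.submit(os.pread, fd, length, offset)
                       for offset, length in requests]
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```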
FFRecord Format
FFRecord is a binary sequence format optimised for 3FS, compatible with PyTorch’s Dataset and DataLoader interfaces, enabling efficient data loading for training.
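A simplified writer/reader illustrates the idea behind such a format: a header of record offsets and lengths followed by raw record bytes, so any sample can be fetched with a single ranged read. This layout is an assumption for illustration and is not the exact on‑disk format of the FFRecord library (which also stores checksums).

```python
# Simplified FFRecord-style container: [count | (offset, length) table | data].

import struct

def write_records(path, records):
    with open(path, "wb") as f:
        n = len(records)
        f.write(struct.pack("<Q", n))
        offset = 8 + 16 * n  # header: count + one (offset, length) pair each
        for rec in records:
            f.write(struct.pack("<QQ", offset, len(rec)))
            offset += len(rec)
        for rec in records:
            f.write(rec)

def read_record(path, index):
    with open(path, "rb") as f:
        f.seek(8 + 16 * index)            # jump straight to the index entry
        offset, length = struct.unpack("<QQ", f.read(16))
        f.seek(offset)
        return f.read(length)             # one ranged read per sample
```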
Deployment at Mashang Consumer Finance (MSXF)
The following steps outline the end‑to‑end deployment of 3FS in a production environment.
Hardware Setup
4 storage ECS instances (each with 128 CPU, 1 TiB RAM, 100 Gbps RDMA NIC, 8 × 3.84 TiB NVMe SSDs).
4 compute ECS instances (each with 192 CPU, 512 GiB RAM, 100 Gbps RDMA NIC).
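A back‑of‑envelope check of the storage tier above: 4 nodes with 8 NVMe drives of 3.84 TiB each, under three‑way replication. The usable figure assumes no overhead beyond replication, which is an approximation.

```python
# Raw vs usable capacity for the spec list above.

raw_tib = 4 * 8 * 3.84        # ~122.88 TiB raw
usable_tib = raw_tib / 3      # three replicas of every chunk -> ~40.96 TiB
print(f"raw: {raw_tib:.2f} TiB, usable: {usable_tib:.2f} TiB")
```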
Software Installation
Install required packages, libfuse ≥ 3.16, Rust toolchain, and FoundationDB.
# for Ubuntu 22.04
apt install cmake libuv1-dev liblz4-dev liblzma-dev libdouble-conversion-dev libdwarf-dev libunwind-dev libaio-dev libgflags-dev libgoogle-glog-dev libgtest-dev libgmock-dev clang-format-14 clang-14 clang-tidy-14 lld-14 libgoogle-perftools-dev google-perftools libssl-dev gcc-12 g++-12 libboost-all-dev build-essential
# Install libfuse 3.16
wget https://github.com/libfuse/libfuse/releases/download/fuse-3.16.1/fuse-3.16.1.tar.gz
apt install -y build-essential meson ninja-build pkg-config libudev-dev
tar -xzf fuse-3.16.1.tar.gz
cd fuse-3.16.1
mkdir build && cd build
meson setup ..
ninja
sudo ninja install
sudo ldconfig
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install FoundationDB
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-server_7.3.63-1_amd64.deb
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-clients_7.3.63-1_amd64.deb
sudo dpkg -i foundationdb-{server,clients}_7.3.63-1_amd64.deb
sudo systemctl start foundationdb
fdbcli --exec "status"
Building 3FS
# Clone and build
git clone https://github.com/deepseek-ai/3fs
cd 3fs
git submodule update --init --recursive
./patches/apply.sh
cmake -S . -B build -DCMAKE_CXX_COMPILER=clang++-14 -DCMAKE_C_COMPILER=clang-14 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake --build build -j 56
Service Deployment
Deploy Cluster Manager, Meta Service, Storage Service, and FUSE Client on the respective nodes using systemd unit files. Configure etcd addresses, token authentication, and RDMA endpoints as shown in the original configuration snippets.
Data Placement
Generate chain tables and placement policies with the provided Python scripts, then upload them via admin_cli to the management service.
# Generate chain table
python3 ~/3fs/deploy/data_placement/src/model/data_placement.py -ql -relax -type CR --num_nodes 4 --replication_factor 3 --min_targets_per_disk 6
python3 ~/3fs/deploy/data_placement/src/setup/gen_chain_table.py --chain_table_type CR --node_id_begin 10001 --node_id_end 10004 --num_disks_per_node 8 --num_targets_per_disk 6 --target_id_prefix 1 --chain_id_prefix 9 --incidence_matrix_path output/DataPlacementModel-v_4-b_8-r_6-k_3-λ_2-lb_2-ub_2/incidence_matrix.pickle
Performance Evaluation
Using fio, single‑client read bandwidth reaches 9.5 GB/s (approaching the 12.5 GB/s line rate of a 100 Gbps RDMA link), write bandwidth reaches 3.3 GB/s, and read IOPS roughly 117 K. Multi‑client tests show near‑linear scaling of aggregate bandwidth.
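A quick sanity check on these numbers: a 100 Gbps link carries at most 12.5 GB/s of raw line rate, so 9.5 GB/s of measured fio read bandwidth is about 76% of the wire, consistent with the network being the bottleneck once protocol overhead is accounted for.

```python
# Relating measured fio bandwidth to the RDMA link's line rate.

link_gbps = 100
line_rate_gbs = link_gbps / 8            # 100 Gbps -> 12.5 GB/s
read_gbs = 9.5                           # measured single-client read bw
utilization = read_gbs / line_rate_gbs   # fraction of the wire consumed
print(f"line rate {line_rate_gbs} GB/s, utilization {utilization:.0%}")
```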
# Random read bandwidth test
fio -numjobs=64 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=randread -bs=4M --group_reporting -size=100M -time_based -runtime=120 -name=iotest -directory=/3fs/stage/iotest
# Random read IOPS test
fio -numjobs=64 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=randread -bs=4K --group_reporting -size=100M -time_based -runtime=120 -name=iotest -directory=/3fs/stage/iotest
Benefits
Accelerated AI Workflows : High‑throughput data loading and fast checkpoint storage reduce training iteration time.
Unified Storage Strategy : Integration with CSI enables seamless use of 3FS in Kubernetes, breaking the POSIX/HDFS/S3 silos.
Cost Reduction : Combining 3FS with object storage (OSS) moves 90 % of data to cheap storage, cutting overall storage cost by >50 % while keeping hot data on high‑performance 3FS.
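The tiering arithmetic behind that claim can be illustrated directly. The prices here are hypothetical placeholders (cold object storage at one tenth the unit cost of hot NVMe‑backed 3FS); only the 90% cold fraction comes from the text.

```python
# Tiered-storage cost illustration with assumed relative prices.

hot_price, cold_price = 1.0, 0.1   # relative $/TiB; the 10x ratio is assumed
cold_fraction = 0.90               # share of data moved to object storage

all_hot = 1.0 * hot_price
tiered = (1 - cold_fraction) * hot_price + cold_fraction * cold_price
saving = 1 - tiered / all_hot
print(f"cost saving: {saving:.0%}")
```

Under these assumed prices the saving is about 81%, so the ">50%" figure in the text is comfortably plausible even with a smaller hot/cold price gap.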
Conclusion
The deployment demonstrates a complete, production‑grade 3FS stack that supports large‑scale AI training and inference, delivers multi‑fold performance gains over traditional storage, and provides a cost‑effective tiered storage solution when coupled with object storage.
Instant Consumer Technology Team