How 3FS, vePFS, and CloudFS Stack Up in AI Training Workloads – A Deep Dive
This article compares 3FS, vePFS, and CloudFS across metadata and data planes, presents detailed benchmark results for AI training scenarios, analyzes architectural trade‑offs, and draws insights for future cloud‑native file storage development.
Introduction
In the previous article we analyzed the implementation principles of each 3FS component. The 3FS team does not aim to build a universally best-in-class product; it aims to find the optimal solution for one specific scenario: DeepSeek's own large-model training and inference.
This article extracts representative cases from AI and other scenarios, compares 3FS with typical industry solutions on metadata and data dimensions, presents objective test results, and derives insights for storage practitioners and future product directions.
AI Scenario Mainstream Solutions
AI workflows consist of data collection, preprocessing, training, and inference. Most companies rely on elastic compute and storage resources on public clouds: object storage handles the massive file counts of collection and preprocessing, while file storage serves training and inference. Three typical file-storage approaches are:
Self‑developed distributed file storage (e.g., Meta’s Tectonic, DeepSeek’s 3FS, HDFS/Ceph‑based systems)
Parallel file storage (e.g., Volcano Engine vePFS, Alibaba CPFS, AWS FSx for Lustre, Tencent CFS Turbo)
File‑cache acceleration layer (e.g., JuiceFS, Alluxio, Volcano CloudFS, Baidu RapidFS, Tencent GooseFS)
Systems Compared
vePFS
The client uses a private protocol with a FUSE-like split: a kernel-space VFS module handles POSIX calls and hands them to a user-space component that communicates with the metadata and storage services over shared memory. The metadata service is symmetric (each storage node also stores metadata) and supports distributed locks, hash-based directory-tree distribution, Fileset isolation, and both replica and erasure-coding (EC) modes. The storage service supports replica and EC layouts, chunked writes with a WAL for small I/O, and single-sided RDMA communication.
Key product capabilities include linear capacity scaling, TiB/s throughput, millions of IOPS, sub‑millisecond latency, data insight, bidirectional sync with object storage, audit logs, IAM‑based directory permissions, quota, QoS, multi‑level reliability, and deep integration with Volcano Engine MLP and VKE.
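The hash-based directory distribution described above can be sketched as follows. This is a minimal illustration, not vePFS internals; `NUM_NODES` and `place_dentry` are hypothetical names, and the hash choice is an assumption.

```python
import hashlib

NUM_NODES = 8  # hypothetical count of symmetric metadata/storage nodes

def place_dentry(parent_inode: int, name: str) -> int:
    """Scatter a directory entry across nodes by hashing (parent inode, name).

    This spreads a hot directory's entries over all metadata nodes instead of
    pinning the whole directory to a single node.
    """
    key = f"{parent_inode}/{name}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_NODES

# Entries of one large training-data directory land on many nodes:
nodes = {place_dentry(42, f"shard-{i:04d}.tfrecord") for i in range(10_000)}
```

The design choice this illustrates: scattering trades directory locality (a `readdir` now touches many nodes) for balanced load on metadata-hot directories, which suits training datasets with millions of files per directory.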
CloudFS
Client accesses via FUSE or SDK. Metadata service (DanceNN) uses a federation architecture with subtree splitting and master‑slave high‑availability. Storage service combines near‑compute cache and a data lake; cache can run on idle GPU memory or local disks, supports chain and star replication, variable‑size chunks, and RDMA/NVMe‑SSD optimizations.
Features include write cache for checkpoint acceleration, lifecycle policies, read cache, zero‑code data migration, and integration with object storage.
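The federation-style subtree splitting mentioned above can be sketched like this. The map and routing function are illustrative assumptions, not DanceNN's actual data structures.

```python
# Sketch of subtree-based metadata partitioning: each namespace prefix is
# owned by one metadata shard, and longest-prefix match routes a path to its
# shard. Names (SUBTREE_MAP, route) are hypothetical, not DanceNN internals.

SUBTREE_MAP = {
    "/": 0,              # default shard
    "/datasets": 1,      # hot training-data subtree on its own shard
    "/checkpoints": 2,   # checkpoint subtree isolated for write bursts
}

def route(path: str) -> int:
    """Return the metadata shard owning `path` via longest-prefix match."""
    best, shard = "", SUBTREE_MAP["/"]
    for prefix, s in SUBTREE_MAP.items():
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and len(prefix) > len(best):
            best, shard = prefix, s
    return shard
```

Compared with hash scattering, subtree splitting keeps a directory's metadata on one shard (fast `readdir`, cheap rename within a subtree) at the cost of manual or policy-driven rebalancing when one subtree gets hot.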
Comparison of Architecture
Client: vePFS uses a private protocol; CloudFS uses FUSE or an SDK; 3FS uses FUSE plus USRBIO.
Metadata service: vePFS scatters directories randomly by hash; CloudFS splits the namespace by subtree; 3FS separates the metadata service from compute, storing metadata in FoundationDB.
Storage service: vePFS offers replica and EC modes; CloudFS combines a cache layer with a data lake; 3FS offers replica only (EC is reserved for the future).
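3FS's metadata-on-KV approach can be sketched as a transactional create. A real deployment runs this as a FoundationDB transaction; here a dict plus a lock stands in for the KV layer, and the key layout is an illustrative assumption, not 3FS's actual schema.

```python
# Sketch of 3FS-style metadata on a transactional KV store. A dict guarded by
# a lock stands in for FoundationDB; keys ("dentry/...", "inode/...") are
# hypothetical. Creating a file is one atomic multi-key transaction.
import itertools
import threading

kv: dict[str, dict] = {}
kv_lock = threading.Lock()   # stand-in for the KV store's transaction isolation
inode_ids = itertools.count(1)

def create_file(parent_ino: int, name: str) -> int:
    """Create a file: atomically write a dentry key and an inode key."""
    with kv_lock:             # a real system retries on transaction conflicts
        dentry_key = f"dentry/{parent_ino}/{name}"
        if dentry_key in kv:
            raise FileExistsError(name)
        ino = next(inode_ids)
        kv[dentry_key] = {"ino": ino}
        kv[f"inode/{ino}"] = {"size": 0, "nlink": 1}
        return ino
```

The extra hops of this path (read the dentry key, allocate an inode, commit the transaction) are the price of decoupling metadata from compute; the payoff is that metadata scales with the KV store rather than with bespoke metadata servers.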
Test Environment
All three systems were deployed on identical servers (Server A) with the configurations shown in the figures.
Metadata Tests
Three representative AI‑training scenarios were used: small‑file creation (scenario 1), small‑file concurrent read (scenario 2), and POSIX compatibility (scenario 3). Key results:
vePFS outperforms CloudFS and 3FS in small-file create/delete thanks to its shorter request path.
3FS metadata writes suffer from the long KV-based path and FoundationDB transaction overhead.
POSIX compatibility tests show near-complete support in vePFS, while CloudFS and 3FS lack pipe (FIFO), block-device, character-device, and socket files.
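The small-file scenarios above can be reproduced in miniature with an mdtest-style probe: time create/stat/unlink of many empty files under one directory on each system's mount point. The function name is ours, and absolute numbers depend entirely on the backend; only relative comparisons are meaningful.

```python
# Minimal mdtest-style probe of small-file metadata cost. Point `root` at a
# mount of the file system under test; here a temp directory is used so the
# sketch is self-contained and runnable anywhere.
import os
import tempfile
import time

def metadata_ops_per_sec(root: str, n: int = 1000) -> float:
    d = os.path.join(root, "mdbench")
    os.makedirs(d, exist_ok=True)
    start = time.perf_counter()
    for i in range(n):
        path = os.path.join(d, f"f{i}")
        open(path, "w").close()   # create
        os.stat(path)             # stat
        os.unlink(path)           # delete
    elapsed = time.perf_counter() - start
    os.rmdir(d)
    return 3 * n / elapsed        # create+stat+delete ops per second

with tempfile.TemporaryDirectory() as tmp:
    print(f"{metadata_ops_per_sec(tmp):.0f} metadata ops/s")
```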
Data‑Plane Tests
Benchmarks covered storage‑node write throughput, single‑client write, storage‑node read, single‑client read, space‑amplification, and write‑amplification. Highlights:
vePFS and CloudFS use star (fan-out) replication for higher write throughput; 3FS's chain replication limits it.
Single-client write: vePFS and CloudFS achieve high rates; 3FS is lower but tunable.
Read throughput: the CloudFS SDK posts the best single-client reads; vePFS is limited by CRC computation and NUMA effects; 3FS with USRBIO is comparable to CloudFS.
Space amplification: 3FS suffers with unaligned small files; vePFS's fixed-size chunks cause fragmentation; CloudFS's variable-size chunks mitigate the waste.
Write amplification: 3FS incurs read-modify-write on small-file overwrites, driving amplification high; vePFS pays only the replica write overhead.
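The amplification effects above follow from simple arithmetic, sketched here under stated assumptions: 3 replicas, a 512 KiB fixed chunk size, and read-modify-write (RMW) on partial-chunk overwrites. The numbers are illustrative, not measured values from the tests.

```python
# Back-of-envelope amplification math. Assumptions (not measured values):
# 3 replicas, fixed 512 KiB chunks, RMW on sub-chunk overwrites.

REPLICAS = 3
CHUNK = 512 * 1024          # assumed fixed chunk size: 512 KiB

def space_amplification(file_size: int) -> float:
    """Allocated bytes / logical bytes for fixed-size chunks (one replica)."""
    chunks = -(-file_size // CHUNK)          # ceiling division
    return chunks * CHUNK / file_size

def write_amplification(overwrite_size: int) -> float:
    """Bytes moved / bytes written for a sub-chunk overwrite with RMW.

    Each replica reads the old chunk and writes the full new chunk back.
    """
    io_per_replica = CHUNK + CHUNK           # read old chunk + write new chunk
    return REPLICAS * io_per_replica / overwrite_size

print(space_amplification(4 * 1024))   # a 4 KiB file occupies a full chunk: 128.0
print(write_amplification(4 * 1024))   # a 4 KiB overwrite moves 3 MiB: 768.0
```

This is why variable-size chunks (sized to the file) collapse the space term toward 1.0, and why avoiding RMW, or aligning writes to chunk boundaries, matters so much for small-file overwrite workloads.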
Conclusions
3FS’s layered design simplifies architecture but incurs longer metadata paths and higher space waste for small files. vePFS offers richer POSIX features and value‑added capabilities (quota, QoS, snapshots). CloudFS provides flexible cache and data‑lake integration. For DeepSeek’s closed‑loop AI training, 3FS’s read‑optimized design is a good fit.
Insights and Future Outlook
Key challenges identified for cloud file storage include bidirectional data flow with object storage, metadata scalability to trillions of files, decoupling performance from capacity, extreme client‑side performance, and near‑real‑time data insight. Volcano Engine’s AI Cloud‑Native file product line (TOS, NAS, vePFS, CloudFS, unified FSX, data flow, data insight) aims to address these problems.
Team Recruitment
The Volcano Engine file‑storage team (responsible for vePFS and NAS) is hiring for distributed file‑storage product engineers and system engineers. Links to job postings are provided.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.