Design and Implementation of a Cloud‑Native AI Storage Acceleration System (PCache) for Large‑Scale Model Training
This article examines the challenges of AI storage for massive models, describes Ant Group's multi‑cloud, high‑availability PCache architecture, and details its GPU‑mixed deployment, metadata services, data‑link optimizations, and performance results that enable petabyte‑scale training with low cost and high stability.
The rapid growth of data and AI workloads has made storage a critical bottleneck for large‑model training, prompting both academia and industry to explore "Storage for AI" solutions.
Background and Challenges include exponential dataset size growth, small‑file random access patterns, parameter‑scale mismatches, and the widening gap between compute (GPU/TPU) performance and storage bandwidth.
Industry Solutions are typically divided into independent file systems (e.g., GPFS, GFS/HDFS) and hybrid approaches that combine local caches with large‑capacity object storage.
Ant Group's PCache Architecture is a cloud‑native, multi‑cloud AI storage system built on Kubernetes, offering three layers: a performance layer (GPU‑proximate SSD cache), a capacity layer (object storage), and a persistence layer (HexS). It provides POSIX (FUSE) and SDK (Python/Java) interfaces.
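The read path implied by this layering can be sketched as a tiered lookup: a GPU-proximate SSD cache fronts the large-capacity object store, and a miss promotes the object into the cache. This is an illustrative sketch only, not PCache's actual API; the class and method names are assumptions.

```python
# Illustrative sketch (not PCache's actual API): a tiered read path where a
# GPU-proximate SSD cache fronts a large-capacity object store. On a cache
# miss the object is fetched from the capacity layer and promoted.

class TieredStore:
    def __init__(self):
        self.ssd_cache = {}      # performance layer: GPU-proximate SSD
        self.object_store = {}   # capacity layer: object storage

    def put(self, key, data):
        self.object_store[key] = data

    def read(self, key):
        if key in self.ssd_cache:          # hot path: local SSD hit
            return self.ssd_cache[key]
        data = self.object_store[key]      # miss: fetch from capacity layer
        self.ssd_cache[key] = data         # promote for subsequent reads
        return data

store = TieredStore()
store.put("dataset/shard-0001", b"training samples")
print(store.read("dataset/shard-0001"))   # first read misses, then promotes
```

In a real deployment the persistence layer (HexS) would sit behind the capacity layer as a third fallback; it is omitted here for brevity.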
GPU Mixed Deployment places storage services on GPU nodes, reducing network latency and leveraging east‑west bandwidth for linear throughput scaling as GPU clusters grow.
Metadata Service reduces metadata overhead via file folding and sharding, sustaining small‑file access performance at the scale of billions of files.
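One plausible shape for this design, sketched under assumptions (the article does not specify PCache's internals): metadata entries for sibling files are "folded" under their parent-directory key, and directories are hash-routed to a fixed number of shards so lookups stay cheap at scale. All names and the shard count here are illustrative.

```python
# Hedged sketch of metadata folding + hash sharding (not PCache internals):
# entries are grouped ("folded") under their parent directory, and each
# directory is routed to one of NUM_SHARDS metadata shards by a stable hash.

import hashlib

NUM_SHARDS = 256
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_of(path: str) -> int:
    # Stable hash so every client routes a given directory to the same shard.
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def put_meta(path: str, meta: dict) -> None:
    parent, _, name = path.rpartition("/")
    folded = shards[shard_of(parent)].setdefault(parent, {})  # fold siblings
    folded[name] = meta

def get_meta(path: str) -> dict:
    parent, _, name = path.rpartition("/")
    return shards[shard_of(parent)][parent][name]

put_meta("/data/imgs/000001.jpg", {"size": 12345})
print(get_meta("/data/imgs/000001.jpg"))
```

Folding keeps a directory's many small-file entries in one shard record, so a `ls`-style scan touches one shard instead of millions of rows.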
Data‑Link Optimizations include multi‑path writes, RDMA support, client‑side FUSE acceleration, metadata caching, dynamic replica adjustment, and automatic failover, achieving 3‑4× throughput gains and up to 20‑30 GB/s sustained throughput in multi‑client tests.
AI‑Native Features such as dynamic data eviction/loading based on access patterns and checkpoint write optimizations (e.g., DP‑group scattering) further enhance training efficiency.
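DP-group scattering can be sketched as follows: rather than every data-parallel rank writing the full (replicated) checkpoint, each rank writes only its assigned slice, spreading write bandwidth across the group. The partitioning scheme below is illustrative, not PCache's actual one.

```python
# Hedged sketch of DP-group checkpoint scattering (illustrative partitioning):
# each data-parallel rank writes a disjoint subset of the checkpoint tensors,
# so the aggregate write is spread across the group instead of replicated.

def shards_for_rank(keys, rank, world_size):
    # Round-robin assignment of checkpoint tensors to ranks.
    return [k for i, k in enumerate(sorted(keys)) if i % world_size == rank]

state = {"layer0.w": b"...", "layer0.b": b"...",
         "layer1.w": b"...", "layer1.b": b"..."}
world = 2
assigned = [shards_for_rank(state, r, world) for r in range(world)]

# Every tensor is written exactly once across the DP group.
written = sorted(k for part in assigned for k in part)
print(written)
```

On restore, each rank reads back its slice and the group reassembles the full state, so both the write and the read of a checkpoint scale with the DP-group size.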
Multi‑Cloud Synchronization provides cross‑site data sync with high throughput (15 TB/h) and integrity checks, supporting seamless checkpoint migration between AIDC sites.
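The integrity check in such a sync can be sketched as a digest comparison: after copying an object, the destination's checksum is verified against the source before the transfer is acknowledged. This is illustrative; the article does not document PCache's actual check.

```python
# Sketch of checksum-verified cross-site sync (assumed mechanism): the
# destination copy's SHA-256 digest must match the source's before the
# object is considered synchronized.

import hashlib

def sync_object(src: dict, dst: dict, key: str) -> bool:
    data = src[key]
    dst[key] = data                      # transfer (stand-in for the network copy)
    # Verify integrity end to end before acknowledging the sync.
    return (hashlib.sha256(dst[key]).hexdigest()
            == hashlib.sha256(data).hexdigest())

site_a = {"ckpt/step-500": b"tensor bytes"}
site_b = {}
print(sync_object(site_a, site_b, "ckpt/step-500"))
```

Combined with the quoted 15 TB/h throughput, this lets a checkpoint land at a second AIDC site and be trusted immediately for resumed training.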
Monitoring and Management offer end‑to‑end IO link inspection, offline analysis, quota control, read‑only datasets, and recycle‑bin capabilities.
The roadmap targets scaling from petabytes to exabytes of data, new data interfaces beyond files, and continued performance improvements for future large‑model training.
AntData
Ant Data leverages Ant Group's leading technological innovation in big data, databases, and multimedia, with years of industry practice. Through long-term technology planning and continuous innovation, we strive to build world-class data technology and products.