Design and Implementation of a Cloud‑Native AI Storage Acceleration System (PCache) for Large‑Scale Model Training
This article examines the challenges of AI storage for massive models, describes Ant Group's multi‑cloud, high‑availability PCache architecture, and details its GPU‑mixed deployment, metadata services, data‑link optimizations, and performance results that enable petabyte‑scale training with low cost and high stability.
The rapid growth of data and AI workloads has made storage a critical bottleneck for large‑model training, prompting both academia and industry to explore "Storage for AI" solutions.
Background and Challenges include exponential dataset size growth, small‑file random access patterns, parameter‑scale mismatches, and the widening gap between compute (GPU/TPU) performance and storage bandwidth.
Industry Solutions are typically divided into independent file systems (e.g., GPFS, GFS/HDFS) and hybrid approaches that combine local caches with large‑capacity object storage.
Ant Group's PCache Architecture is a cloud‑native, multi‑cloud AI storage system built on Kubernetes, offering three layers: a performance layer (GPU‑proximate SSD cache), a capacity layer (object storage), and a persistence layer (HexS). It provides POSIX (FUSE) and SDK (Python/Java) interfaces.
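The read path implied by this layering can be sketched as a tiered lookup: a GPU-proximate SSD cache fronts the large-capacity object store, and a miss promotes the object into the cache. This is an illustrative sketch only, not PCache's actual API; the class and method names are assumptions.

```python
# Illustrative sketch (not PCache's actual API): a tiered read path where a
# GPU-proximate SSD cache fronts a large-capacity object store. On a cache
# miss the object is fetched from the capacity layer and promoted.

class TieredStore:
    def __init__(self):
        self.ssd_cache = {}      # performance layer: GPU-proximate SSD
        self.object_store = {}   # capacity layer: object storage

    def put(self, key, data):
        self.object_store[key] = data

    def read(self, key):
        if key in self.ssd_cache:          # hot path: local SSD hit
            return self.ssd_cache[key]
        data = self.object_store[key]      # miss: fetch from capacity layer
        self.ssd_cache[key] = data         # promote for subsequent reads
        return data

store = TieredStore()
store.put("dataset/shard-0001", b"training samples")
print(store.read("dataset/shard-0001"))   # first read misses, then promotes
```

In a real deployment the persistence layer (HexS) would sit behind the capacity layer as a third fallback; it is omitted here for brevity.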
GPU Mixed Deployment places storage services on GPU nodes, reducing network latency and leveraging east‑west bandwidth for linear throughput scaling as GPU clusters grow.
Metadata Service reduces metadata overhead via file folding and sharding, sustaining small‑file access performance at the scale of billions of files.
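One plausible shape for this design, sketched under assumptions (the article does not specify PCache's internals): metadata entries for sibling files are "folded" under their parent-directory key, and directories are hash-routed to a fixed number of shards so lookups stay cheap at scale. All names and the shard count here are illustrative.

```python
# Hedged sketch of metadata folding + hash sharding (not PCache internals):
# entries are grouped ("folded") under their parent directory, and each
# directory is routed to one of NUM_SHARDS metadata shards by a stable hash.

import hashlib

NUM_SHARDS = 256
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_of(path: str) -> int:
    # Stable hash so every client routes a given directory to the same shard.
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def put_meta(path: str, meta: dict) -> None:
    parent, _, name = path.rpartition("/")
    folded = shards[shard_of(parent)].setdefault(parent, {})  # fold siblings
    folded[name] = meta

def get_meta(path: str) -> dict:
    parent, _, name = path.rpartition("/")
    return shards[shard_of(parent)][parent][name]

put_meta("/data/imgs/000001.jpg", {"size": 12345})
print(get_meta("/data/imgs/000001.jpg"))
```

Folding keeps a directory's many small-file entries in one shard record, so a `ls`-style scan touches one shard instead of millions of rows.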
Data‑Link Optimizations include multi‑path writes, RDMA support, client‑side FUSE acceleration, metadata caching, dynamic replica adjustment, and automatic failover, achieving 3‑4× throughput gains and up to 20‑30 GB/s sustained throughput in multi‑client tests.
AI‑Native Features such as dynamic data eviction/loading based on access patterns and checkpoint write optimizations (e.g., DP‑group scattering) further enhance training efficiency.
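DP-group scattering can be sketched as follows: rather than every data-parallel rank writing the full (replicated) checkpoint, each rank writes only its assigned slice, spreading write bandwidth across the group. The partitioning scheme below is illustrative, not PCache's actual one.

```python
# Hedged sketch of DP-group checkpoint scattering (illustrative partitioning):
# each data-parallel rank writes a disjoint subset of the checkpoint tensors,
# so the aggregate write is spread across the group instead of replicated.

def shards_for_rank(keys, rank, world_size):
    # Round-robin assignment of checkpoint tensors to ranks.
    return [k for i, k in enumerate(sorted(keys)) if i % world_size == rank]

state = {"layer0.w": b"...", "layer0.b": b"...",
         "layer1.w": b"...", "layer1.b": b"..."}
world = 2
assigned = [shards_for_rank(state, r, world) for r in range(world)]

# Every tensor is written exactly once across the DP group.
written = sorted(k for part in assigned for k in part)
print(written)
```

On restore, each rank reads back its slice and the group reassembles the full state, so both the write and the read of a checkpoint scale with the DP-group size.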
Multi‑Cloud Synchronization provides cross‑site data sync with high throughput (15 TB/h) and integrity checks, supporting seamless checkpoint migration between AIDC sites.
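The integrity check in such a sync can be sketched as a digest comparison: after copying an object, the destination's checksum is verified against the source before the transfer is acknowledged. This is illustrative; the article does not document PCache's actual check.

```python
# Sketch of checksum-verified cross-site sync (assumed mechanism): the
# destination copy's SHA-256 digest must match the source's before the
# object is considered synchronized.

import hashlib

def sync_object(src: dict, dst: dict, key: str) -> bool:
    data = src[key]
    dst[key] = data                      # transfer (stand-in for the network copy)
    # Verify integrity end to end before acknowledging the sync.
    return (hashlib.sha256(dst[key]).hexdigest()
            == hashlib.sha256(data).hexdigest())

site_a = {"ckpt/step-500": b"tensor bytes"}
site_b = {}
print(sync_object(site_a, site_b, "ckpt/step-500"))
```

Combined with the quoted 15 TB/h throughput, this lets a checkpoint land at a second AIDC site and be trusted immediately for resumed training.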
Monitoring and Management offer end‑to‑end IO link inspection, offline analysis, quota control, read‑only datasets, and recycle‑bin capabilities.
The roadmap targets scaling from petabytes to exabytes of data, new data interfaces beyond files, and continued performance improvements for future large‑model training.
AntData
Ant Data leverages Ant Group's leading technological innovation in big data, databases, and multimedia, with years of industry practice. Through long-term technology planning and continuous innovation, we strive to build world-class data technology and products.