How PCache Supercharges Large‑Scale AI Training Storage Performance
This talk examines the storage challenges of large‑scale AI training and presents PCache, a high‑performance, cloud‑native caching system. PCache optimizes metadata handling, read/write paths, deployment, and high availability, delivering significant throughput gains and cost savings for massive model training workloads.
Introduction
The presentation discusses storage system optimization for large‑scale AI training, covering metadata and I/O performance improvements, disk and network layer tuning, and ecosystem construction.
AI Training Storage Challenges
Training requires massive sequential reads of checkpoint files and index data, often reaching 3–4 PB for language models and up to 10 PB for multimodal data.
Initial loading phases cause traffic bursts with hundreds of GB/s throughput, leading to GPU stalls if checkpoint or index loading is slow.
During training iterations, random reads from the dataset dominate, and checkpoint writes can reach 200+ GB/s.
PCache System Overview
PCache acts as an intermediate caching layer between the training framework and underlying storage (OSS, OBS, HexStore). It provides high‑performance read/write paths and asynchronous persistence.
Engine layer: Master (metadata management via Raft), Worker (data nodes), Job Master (background tasks).
Interface layer: Linux FUSE for POSIX access, S3 protocol proxy, Python/Java SDKs.
Persistence layer: Asynchronous flush to cold‑storage backends.
Function Overview
Preloading of critical data and checkpoints to reduce initialization latency.
Persistence of important checkpoints to cold storage for long‑term reliability.
Data eviction based on access frequency when cache capacity is reached.
Two read paths: cache hit (direct SSD read) and cache miss (fallback to persistent storage).
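The hit/miss read paths and frequency-based eviction described above can be sketched as a toy read-through cache. This is a minimal illustration, not PCache's implementation: the class name `ReadThroughCache`, the `backend_read` callback (standing in for OSS, OBS, or HexStore), and the choice of plain least-frequently-used eviction are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache with least-frequently-used eviction (hypothetical)."""

    def __init__(self, capacity: int, backend_read: Callable[[str], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity          # maximum number of cached objects
        self.backend_read = backend_read  # stand-in for the persistent storage read
        self.data: Dict[str, bytes] = {}
        self.hits: Counter = Counter()    # access count per cached key

    def read(self, key: str) -> bytes:
        if key in self.data:
            # Cache hit: serve from the local copy and record the access.
            self.hits[key] += 1
            return self.data[key]
        # Cache miss: fall back to persistent storage, then populate the cache.
        value = self.backend_read(key)
        if len(self.data) >= self.capacity:
            self._evict()
        self.data[key] = value
        self.hits[key] = 1
        return value

    def _evict(self) -> None:
        # Drop the key with the lowest access count.
        victim = min(self.data, key=lambda k: self.hits[k])
        del self.data[victim]
        del self.hits[victim]
```

A real cache would track bytes rather than object counts and would likely age the access counters so that formerly hot data can eventually be evicted.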
Deployment Architecture
PCache is co‑located with training containers and uses the existing disks on GPU servers (≈10 TB per node) to provide ~6 PB of cache. Because no additional hardware is required, the incremental cost is close to zero.
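As a rough check of the figures above, the short calculation below estimates how many GPU servers are needed to reach the quoted cache size. It assumes decimal units (1 PB = 1000 TB); the `usable_fraction` parameter is a hypothetical knob for space lost to replication or reserved capacity, which the talk summary does not quantify.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def nodes_needed(target_cache_bytes: int, disk_bytes_per_node: int,
                 usable_fraction: float = 1.0) -> int:
    """Number of nodes required to reach the target cache capacity."""
    usable_per_node = disk_bytes_per_node * usable_fraction
    return math.ceil(target_cache_bytes / usable_per_node)


print(nodes_needed(6 * PB, 10 * TB))  # 600
```

Under these assumptions, roughly 600 nodes contributing their full 10 TB would yield 6 PB; any replication or reserved space raises that count proportionally.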
High Availability Design
PCache integrates with Kubernetes (K8s) and relies on Raft/ZooKeeper for leader election, domain‑name‑based node naming for stable IDs, automatic pod recreation, and fast election hooks. It also includes health checks, fault‑tolerant replication, and automatic data migration on node failures.
Performance Optimizations
The I/O pipeline is split into four stages: request initiation, metadata access, data path access, and disk I/O. Optimizations include a custom userspace FUSE implementation that bypasses kernel overhead, shared‑memory data channels, large‑block transfers (1–4 MB), and reduction of context switches.
Metadata cache with LRU and close‑to‑open consistency reduces read overhead.
Client‑side block ID generation lowers master node pressure.
RDMA integration for low‑latency, high‑throughput data transfer, with fallback to TCP.
Asynchronous and parallel read/write pipelines increase throughput 3–4×.
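To illustrate how large-block transfers and parallel reads combine, here is a minimal sketch that reads a file in 4 MB blocks using a thread pool. It uses ordinary file I/O from the Python standard library and says nothing about PCache's own client; the function names `read_block` and `parallel_read` and the default of 8 workers are assumptions for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, the upper end of the block range quoted above


def read_block(path: str, offset: int, length: int) -> bytes:
    """Read one block with its own file handle so reads do not share a seek position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_read(path: str, block_size: int = BLOCK_SIZE, workers: int = 8) -> bytes:
    """Read a whole file as fixed-size blocks fetched concurrently."""
    size = os.path.getsize(path)
    offsets = range(0, size, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so blocks join correctly.
        blocks = pool.map(lambda off: read_block(path, off, block_size), offsets)
        return b"".join(blocks)
```

On a local disk the gain may be small; the pattern matters most when each block read has high latency, as with a networked cache or object store, where overlapping requests hides that latency.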
AI‑Specific Optimizations
For MoE models, checkpoint aggregation is distributed across GPU groups to avoid network hotspots. Decoupling optimizer state from model parameters and parallelizing checkpoint writes reduce save time by two‑thirds. Profiling identified CPU serialization as a bottleneck, leading to client‑side optimizations.
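The idea of decoupling optimizer state from model parameters and writing them in parallel can be sketched as follows. This is an illustration under stated assumptions, not the method used in the talk: `pickle` stands in for a framework serializer, the file names `model.pkl` and `optimizer.pkl` are hypothetical, and plain dictionaries stand in for real state.

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor


def _write(path: str, obj) -> str:
    """Serialize obj to a temporary file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(obj, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never observe a half-written checkpoint part
    return path


def save_checkpoint(directory: str, model_state, optimizer_state) -> list:
    """Write model and optimizer state as separate files, concurrently."""
    os.makedirs(directory, exist_ok=True)
    parts = {"model.pkl": model_state, "optimizer.pkl": optimizer_state}
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        futures = [
            pool.submit(_write, os.path.join(directory, name), obj)
            for name, obj in parts.items()
        ]
        return [f.result() for f in futures]
```

Threads here overlap only the I/O portion of each write. Serialization in CPython still contends for the interpreter lock, which is consistent with the profiling result above that CPU serialization can become the bottleneck; separate processes or a serializer that releases the lock would be needed to parallelize that part.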
Ecosystem Integration
PCache is part of a broader ecosystem that includes multi‑cloud synchronization, dataset management with TTL, checkpoint lifecycle management, and automated data eviction/preloading based on access patterns. These components form a closed loop that improves resource utilization and training efficiency.
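Dataset management with TTL could look like the sketch below, which selects cached datasets whose time-to-live has elapsed. The summary does not say whether the TTL is measured from creation or from last access; this example assumes last access, and the `DatasetEntry` record and `expired_datasets` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetEntry:
    name: str
    last_access: float  # seconds since an arbitrary epoch
    ttl_seconds: float  # how long the dataset may stay cached when idle


def expired_datasets(entries: List[DatasetEntry], now: float) -> List[str]:
    """Return the names of datasets idle for longer than their TTL."""
    return [e.name for e in entries if now - e.last_access > e.ttl_seconds]
```

A background task could call such a function periodically and hand the resulting names to the eviction path, while preloading handles the opposite case of data that is about to be needed.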
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
