How Fluid & JindoCache Accelerate Large‑Scale AI Training in a Cloud‑Native Environment
This article examines the challenges of data‑intensive AI training on heterogeneous cloud‑native infrastructure and explains how the Fluid framework combined with JindoCache and KubeDL provides distributed caching, metadata acceleration, and seamless POSIX access to dramatically improve I/O performance, GPU utilization, and cost efficiency.
Background
In 2024, emerging technologies such as large‑model AI, AIGC, and multimodal learning are being deployed in production, dramatically increasing compute and storage demands. Heterogeneous accelerators (GPU, FPGA) evolve rapidly, and Alibaba Group meets these demands through unified scheduling, resource pools, and elastic provisioning.
Challenges
Compute‑storage separation introduces high latency when AI training jobs repeatedly read data from OSS (object storage): access is one to two orders of magnitude slower than from local disks, and the repeated remote reads inflate bandwidth costs.
Kubernetes’ native scheduler is cache‑agnostic, so repeated accesses to the same dataset across different hyper‑parameter runs or AutoML jobs cannot reuse cached data.
OSS becomes a performance bottleneck under concurrent training workloads, leading to unstable I/O and occasional time‑outs.
Training data files are scattered across many paths; OSS list operations are slow, creating metadata pressure and increasing job start time.
FUSE‑based storage clients can fail silently, breaking I/O stability and causing training interruptions.
Solution Overview
The proposed stack combines three open‑source components:
Fluid: a cloud‑native data orchestration system that abstracts storage back‑ends (OSS, HDFS, Ceph) as datasets and exposes them through standard PersistentVolumeClaim interfaces.
JindoCache: a distributed caching engine, integrated into Fluid as JindoRuntime, that offers both data and metadata caching with configurable cache policies.
KubeDL: a Kubernetes‑based AI workload scheduler that manages the lifecycle of distributed training jobs and integrates with Fluid for data‑locality‑aware scheduling.
Fluid
Fluid creates a data layer that moves, copies, and evicts data between storage back‑ends and Kubernetes workloads transparently. Users interact with data through standard PV/PVC objects, while Fluid handles caching, replication, and POSIX‑compatible access via FUSE.
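As a rough illustration of the PV/PVC interaction described above, the sketch below builds a Fluid Dataset manifest and a Pod that mounts it, as plain Python dictionaries serialized to JSON (which kubectl accepts alongside YAML). The dataset name, bucket URI, endpoint, image, and mount path are placeholders rather than values from the original deployment. It relies on Fluid creating a PVC with the same name as the Dataset once a runtime is bound to it.

```python
import json


def dataset_manifest(name, oss_uri, endpoint):
    """Fluid Dataset pointing at an OSS location (all values are placeholders)."""
    return {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "mounts": [{
                "name": name,
                "mountPoint": oss_uri,
                "options": {"fs.oss.endpoint": endpoint},
            }]
        },
    }


def training_pod_manifest(pod_name, dataset_name, image, mount_path="/data"):
    """Pod that reads the dataset through the PVC named after the Dataset."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "volumeMounts": [
                    {"name": "data", "mountPath": mount_path, "readOnly": True}
                ],
            }],
            "volumes": [{
                "name": "data",
                # The claim name equals the Dataset name.
                "persistentVolumeClaim": {"claimName": dataset_name},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    ds = dataset_manifest("training-data", "oss://example-bucket/train/",
                          "oss-cn-hangzhou-internal.aliyuncs.com")
    pod = training_pod_manifest("trainer-0", "training-data", "example/trainer:latest")
    print(json.dumps([ds, pod], indent=2))
```

The training container then sees ordinary files under /data and needs no OSS SDK or credentials of its own.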
JindoCache
JindoCache (formerly JindoFSx) provides separate data and metadata caches. Within Fluid it is exposed as JindoRuntime, one of several cache runtimes that Fluid supports (alongside Alluxio, JuiceFS, and GooseFS). It implements a Cache‑Aside (lazy‑load) read policy and a Write‑Through write policy to maximize cache hit rates and reduce remote reads.
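The two policies named above can be sketched in a few lines. This is a toy in-memory model for illustration, not JindoCache's implementation: the class names are invented, and the "remote" store is just a dictionary that counts reads.

```python
class BackingStore:
    """Stand-in for remote object storage; counts how often it is read."""

    def __init__(self):
        self.objects = {}
        self.remote_reads = 0

    def get(self, key):
        self.remote_reads += 1
        return self.objects[key]

    def put(self, key, value):
        self.objects[key] = value


class CachedStore:
    """Cache-aside reads and write-through writes over a BackingStore."""

    def __init__(self, backing):
        self.backing = backing
        self.cache = {}

    def read(self, key):
        # Cache-aside (lazy load): serve from cache, and only on a miss
        # fetch from the backing store and populate the cache.
        if key in self.cache:
            return self.cache[key]
        value = self.backing.get(key)
        self.cache[key] = value
        return value

    def write(self, key, value):
        # Write-through: persist to the backing store first, then update
        # the cache so later reads hit locally.
        self.backing.put(key, value)
        self.cache[key] = value


if __name__ == "__main__":
    remote = BackingStore()
    remote.put("shard-0", b"samples")
    store = CachedStore(remote)
    for _ in range(3):          # e.g. three epochs over the same shard
        store.read("shard-0")
    print("remote reads:", remote.remote_reads)
```

Repeated reads of the same key reach the backing store only once, which is the effect the article attributes to caching across epochs and repeated jobs.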
KubeDL
KubeDL orchestrates distributed AI jobs across Alibaba’s unified heterogeneous resource pool, supporting over 10,000 daily training tasks from various business units. It interacts with Fluid to schedule jobs onto nodes that already host the required cached datasets.
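The idea of placing a job where its dataset is already cached can be shown with a small scoring function. This is a conceptual sketch only; it is not the algorithm used by KubeDL, Fluid, or the Kubernetes scheduler, and the node and dataset names are made up.

```python
def pick_node(candidates, cached_on, dataset):
    """Choose a node for a job that reads `dataset`.

    candidates: list of {"name": str, "free_gpus": int}
    cached_on:  mapping of node name -> set of dataset names cached there
    Nodes holding the dataset in cache win; ties go to the node with
    more free GPUs. Returns None when no node has a free GPU.
    """
    eligible = [n for n in candidates if n["free_gpus"] > 0]
    if not eligible:
        return None

    def score(node):
        has_cache = dataset in cached_on.get(node["name"], set())
        return (has_cache, node["free_gpus"])

    return max(eligible, key=score)["name"]


if __name__ == "__main__":
    nodes = [
        {"name": "gpu-a", "free_gpus": 8},
        {"name": "gpu-b", "free_gpus": 2},
    ]
    cache_map = {"gpu-b": {"training-data"}}
    print(pick_node(nodes, cache_map, "training-data"))
```

In this example gpu-b is chosen despite having fewer free GPUs, because it already holds the dataset in cache.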
Architecture Details
The system consists of Fluid’s control plane, JindoRuntime cache workers, and KubeDL job controllers. Cache workers run on high‑performance SSD or memory tiers, while metadata services run on reliable compute nodes. Dataset objects can specify nodeAffinity to steer cache placement, and the runtime’s tieredstore configuration defines the cache quota, high and low watermark thresholds, and storage medium type (MEM/SSD/HDD) for each tier.
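A minimal sketch of the tieredstore and nodeAffinity settings mentioned above, again as Python dictionaries serialized to JSON. The field names follow Fluid's v1alpha1 JindoRuntime and Dataset resources; the path, quota, watermark values, and node label are assumptions chosen for illustration. The runtime shares its name with the Dataset it serves, which is how Fluid binds the two.

```python
import json


def tiered_level(mediumtype, path, quota, high, low):
    """One cache tier; high/low are eviction watermarks as fractions of quota."""
    if mediumtype not in {"MEM", "SSD", "HDD"}:
        raise ValueError(f"unsupported medium type: {mediumtype}")
    if not 0.0 < low < high <= 1.0:
        raise ValueError("watermarks must satisfy 0 < low < high <= 1")
    return {
        "mediumtype": mediumtype,
        "path": path,
        "quota": quota,
        # Watermarks are written as strings in the manifest.
        "high": str(high),
        "low": str(low),
    }


def jindo_runtime_manifest(dataset_name, replicas, levels):
    """JindoRuntime bound to the Dataset of the same name."""
    return {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "JindoRuntime",
        "metadata": {"name": dataset_name},
        "spec": {"replicas": replicas, "tieredstore": {"levels": levels}},
    }


def cache_node_affinity(label_key, label_value="true"):
    """Value for a Dataset's spec.nodeAffinity: cache only on labeled nodes."""
    return {
        "required": {
            "nodeSelectorTerms": [{
                "matchExpressions": [
                    {"key": label_key, "operator": "In", "values": [label_value]}
                ]
            }]
        }
    }


if __name__ == "__main__":
    # Hypothetical values: one SSD tier of 500Gi on three cache workers.
    level = tiered_level("SSD", "/mnt/disk1", "500Gi", high=0.95, low=0.8)
    runtime = jindo_runtime_manifest("training-data", 3, [level])
    print(json.dumps(runtime, indent=2))
    print(json.dumps(cache_node_affinity("example.com/cache-node"), indent=2))
```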
Practical Experience
Choose cache nodes with ample SSD and network bandwidth; Fluid’s dataset nodeAffinity ensures placement on optimal nodes.
Configure cache capacity and tieredstore paths to bound cache size and prevent over‑provisioning.
Store OSS credentials in Kubernetes Secret objects and reference them from the Dataset’s encryptOptions rather than embedding them in the manifest.
Pre‑load hot datasets using Fluid’s dataload feature to avoid unnecessary remote reads.
Enable Fluid’s FUSE self‑healing to automatically recover from OOM‑induced mount failures.
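Two of the practices above, Secret‑backed credentials and dataset pre‑loading, can be sketched as manifests. The resource and field names follow Fluid's v1alpha1 API as documented; the secret name, dataset name, and paths are placeholders, and real credentials should come from the environment or a secret manager, never from source code.

```python
import json
import os


def oss_secret_manifest(secret_name, access_key_id, access_key_secret):
    """Kubernetes Secret holding the OSS access key pair."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "stringData": {
            "fs.oss.accessKeyId": access_key_id,
            "fs.oss.accessKeySecret": access_key_secret,
        },
    }


def oss_encrypt_options(secret_name):
    """encryptOptions list for a Dataset mount, resolved from the Secret."""
    keys = ("fs.oss.accessKeyId", "fs.oss.accessKeySecret")
    return [
        {"name": k, "valueFrom": {"secretKeyRef": {"name": secret_name, "key": k}}}
        for k in keys
    ]


def dataload_manifest(name, dataset_name, paths, namespace="default", replicas=1):
    """DataLoad that warms the cache for selected paths before training."""
    return {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "DataLoad",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "dataset": {"name": dataset_name, "namespace": namespace},
            "loadMetadata": True,
            "target": [{"path": p, "replicas": replicas} for p in paths],
        },
    }


if __name__ == "__main__":
    # Placeholder defaults keep the sketch runnable without real credentials.
    secret = oss_secret_manifest(
        "oss-credentials",
        os.environ.get("OSS_ACCESS_KEY_ID", "<id>"),
        os.environ.get("OSS_ACCESS_KEY_SECRET", "<secret>"),
    )
    load = dataload_manifest("warm-train", "training-data", ["/train/hot"])
    print(json.dumps([secret, oss_encrypt_options("oss-credentials"), load], indent=2))
```

The list returned by oss_encrypt_options belongs under the corresponding entry in the Dataset's spec.mounts, next to its options.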
Results
In a production environment training the LLaMA‑13B model, the cache‑enabled setup achieved the following improvements compared with direct OSS access:
GPU utilization remained near 100% for all cards.
SM (Streaming Multiprocessor) utilization increased from ~60% to ~80% (+33%).
Tensor‑core throughput rose from ~135 TFLOPs to ~160 TFLOPs (+18%).
Effective TFLOPs (amperf) grew from ~60 to ~72 (+20%).
For checkpoint loading, JindoCache reduced model reload time from ~10 minutes to ~30 seconds, cutting idle GPU cost by roughly 80% in Spot‑instance scenarios.
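The percentages quoted above follow from the before and after values, as this short check shows. The numbers are the approximate figures from the article; the function names are only for illustration.

```python
def relative_gain(before, after):
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100.0


def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    reported = {
        "SM utilization (%)": (60, 80),
        "Tensor-core throughput (TFLOPs)": (135, 160),
        "Effective TFLOPs": (60, 72),
    }
    for metric, (before, after) in reported.items():
        print(f"{metric}: +{relative_gain(before, after):.1f}%")

    # Checkpoint reload: about 10 minutes down to about 30 seconds.
    print(f"Reload time: -{relative_reduction(10 * 60, 30):.1f}%")
```

The tensor‑core gain works out to about 18.5%, which the article reports as +18%. The reload time itself drops by about 95%; the roughly 80% figure for idle GPU cost is a separate number that cannot be re‑derived from these two times alone.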
Future Work
Implement reference‑counted automatic dataset reclamation.
Develop intelligent data pre‑heating based on access patterns, with per‑directory priority and parallel pre‑load.
Integrate RDMA to accelerate intra‑cluster worker communication.
Continue extending Fluid’s multi‑JindoCache orchestration and streamline integration with upstream AI platforms.
Alibaba Cloud Native
