Accelerate AI & Big Data on Kubernetes with Elastic File Client & Fluid
This article explains how the Elastic File Client (EFC) and Fluid together provide a cloud‑native, high‑performance storage solution for AI and big‑data workloads on Kubernetes, detailing architecture challenges, core features, performance benchmarks, and a step‑by‑step deployment guide.
Background
Running AI and big‑data workloads on Kubernetes provides elasticity and operational efficiency, but separating compute from storage introduces high network latency, bandwidth costs, and limited throughput. High‑performance scenarios such as AI training, genomics, and industrial simulation require many pods to share the same dataset concurrently.
Fluid – Cloud‑Native Data Abstraction
Fluid is an open‑source distributed data orchestration and acceleration system that introduces a Dataset abstraction. A Dataset aggregates data from multiple back‑ends (NAS, CPFS, OSS, Ceph) and exposes unified CRUD, migration, and observability APIs. Fluid supports two runtime families:
CacheRuntime – provides distributed caching (Alluxio, JuiceFS, EFC, Jindo, GooseFS).
ThinRuntime – offers uniform read‑only access to external stores (s3fs, nfs‑fuse, etc.).
Key capabilities include unified dataset metadata, extensible plugins, CRD‑driven pre‑heat/migration/backup, autoscaling, portability, observability, and runtime‑agnostic deployment via CSI or sidecar.
Requirements for Cloud‑Native Storage
Service stability and automatic recovery across many pods.
Elastic capacity and performance that scale with pod scaling.
Support for rapid massive pod launches (thousands per minute).
Pod‑level observability (PV and dataset metrics).
Near‑local I/O performance despite storage‑compute separation.
Elastic File Client (EFC) Core Features
POSIX protocol – native POSIX interface for NAS/CPFS, enabling containers to mount shared data.
Failover in seconds – the FUSE client recovers automatically within seconds of a crash or upgrade.
Strong consistency – a distributed lease mechanism makes writes immediately visible across pods.
Enhanced client‑side caching – optimized FUSE cache improves small‑file read/write performance (>50% faster than traditional NFS).
Distributed cache pool – aggregates memory from multiple nodes; cache size scales with compute.
Small‑file prefetch – proactively fetches hot files to reduce latency.
Performance Benchmark
Using the InsightFace (ms1m-ibug) dataset on a Kubernetes cluster with Arena, EFCRuntime with local caching reduced training time by 87% compared with open-source NFS. Throughput increased from 648 MiB/s to 1,034.3 MiB/s (≈59.6% improvement).
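The throughput improvement can be checked from the two reported figures. The following is a minimal sketch; the helper name relative_gain is an illustrative choice, not something from Fluid or EFC.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Throughput figures reported in the benchmark above, in MiB/s.
nfs_mib_s = 648.0
efc_mib_s = 1034.3

print(f"{relative_gain(nfs_mib_s, efc_mib_s):.1f}%")  # prints 59.6%
```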
Quick Start Guide
Prerequisites: an Alibaba Cloud Container Service for Kubernetes (ACK) Pro cluster and an Alibaba Cloud NAS file system.
Step 1 – Create a Dataset and EFCRuntime
Save the following manifest as dataset.yaml and replace NAS_URL and NAS_DIR with your NAS endpoint and directory.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: efc-demo
spec:
  placement: Shared
  mounts:
    - mountPoint: "nfs://NAS_URL:NAS_DIR"
      name: efc
      path: "/"
---
apiVersion: data.fluid.io/v1alpha1
kind: EFCRuntime
metadata:
  name: efc-demo
spec:
  replicas: 3
  master:
    networkMode: ContainerNetwork
  worker:
    networkMode: ContainerNetwork
  fuse:
    networkMode: ContainerNetwork
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 15Gi
Key fields:
mountPoint – NAS or CPFS URL, e.g., nfs://NAS_URL:NAS_DIR.
replicas – number of cache workers; choose based on node memory and dataset size.
networkMode – use ContainerNetwork in ACK to avoid extra latency.
mediumtype – cache media (MEM, SSD, HDD); MEM (in‑memory) is recommended.
path – cache directory, typically /dev/shm (tmpfs).
quota – per‑worker cache capacity; total cache (replicas × quota) should exceed dataset size.
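The sizing rule for replicas and quota is simple arithmetic. The sketch below works it through for the manifest values (3 workers at 15 GiB each) and the 10 GiB test file used later in this guide. The function names are illustrative, and the 100 GiB dataset is a hypothetical example, not a figure from the benchmark.

```python
import math


def total_cache_gib(replicas: int, quota_gib: float) -> float:
    """Combined cache capacity across all workers (replicas x quota)."""
    return replicas * quota_gib


def min_replicas(dataset_gib: float, quota_gib: float) -> int:
    """Smallest worker count whose combined quota strictly exceeds the dataset size."""
    if quota_gib <= 0:
        raise ValueError("quota_gib must be positive")
    return math.floor(dataset_gib / quota_gib) + 1


print(total_cache_gib(3, 15))   # 45 GiB with the manifest values
print(min_replicas(10, 15))     # 1 worker is enough for the 10 GiB test file
print(min_replicas(100, 15))    # 7 workers for a hypothetical 100 GiB dataset
```

Node memory still caps the quota each worker can actually hold in /dev/shm, so the result is a lower bound on replicas rather than a recommendation.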
Apply the manifest:
kubectl create -f dataset.yaml
Verify the Dataset status:
kubectl get dataset efc-demo
Step 2 – Deploy an Application that Consumes the Dataset
Example app.yaml (a simple StatefulSet that mounts the dataset):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: efc-app
  labels:
    app: nginx
spec:
  serviceName: nginx
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          command: ["/bin/bash"]
          args: ["-c", "sleep inf"]
          volumeMounts:
            - mountPath: "/data"
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: efc-demo
Create the application:
kubectl create -f app.yaml
Check the file size inside a pod (assumes a 10 GiB test file /data/allzero-demo exists on NAS):
kubectl exec -it efc-app-0 -- du -h /data/allzero-demo
Measure read latency on each pod:
kubectl exec -it efc-app-0 -- bash -c "time cat /data/allzero-demo > /dev/null"
kubectl exec -it efc-app-1 -- bash -c "time cat /data/allzero-demo > /dev/null"
Dividing the file size by the elapsed time that time reports gives the read throughput. Typical results show an increase from ~0.65 GiB/s (NFS) to >1.0 GiB/s with EFCRuntime, confirming the performance gain.
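To turn the timing output into a throughput figure, divide the file size by the wall-clock ("real") time. The sketch below assumes the default bash time format (for example "real 0m10.000s"). The sample output is hypothetical and chosen for round numbers; it is not a measured result.

```python
import re


def parse_real_seconds(time_output: str) -> float:
    """Extract wall-clock seconds from bash `time` output such as 'real 1m2.500s'."""
    match = re.search(r"real\s+(\d+)m([\d.]+)s", time_output)
    if match is None:
        raise ValueError("no 'real' line found in time output")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def throughput_gib_s(file_size_gib: float, elapsed_s: float) -> float:
    """Average read throughput in GiB/s for a file read in `elapsed_s` seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return file_size_gib / elapsed_s


# Hypothetical output from one of the `time cat` commands (not a measurement).
sample_output = "real\t0m10.000s\nuser\t0m0.020s\nsys\t0m3.100s\n"

elapsed = parse_real_seconds(sample_output)
print(f"{throughput_gib_s(10, elapsed):.2f} GiB/s")  # prints 1.00 GiB/s
```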
Conclusion
Combining Fluid with EFC provides a stable, elastic, and high‑performance storage layer for cloud‑native AI and big‑data workloads. The solution offers standardized data pre‑heat, migration, and automated operations, and future work will extend support to serverless environments for distributed file access.
References
Fluid project: https://github.com/fluid-cloudnative/fluid
InsightFace dataset: https://github.com/deepinsight/insightface/tree/master/recognition/_datasets_#ms1m-ibug-85k-ids38m-images-56
Arena documentation: https://help.aliyun.com/document_detail/212117.html
EFC documentation: https://help.aliyun.com/document_detail/600930.html
Alibaba Cloud Native
