Accelerate AI & Big Data on Kubernetes with Elastic File Client & Fluid

This article explains how the Elastic File Client (EFC) and Fluid together provide a cloud‑native, high‑performance storage solution for AI and big‑data workloads on Kubernetes, detailing architecture challenges, core features, performance benchmarks, and a step‑by‑step deployment guide.

Alibaba Cloud Native

Background

Running AI and big‑data workloads on Kubernetes provides elasticity and operational efficiency, but separating compute from storage introduces high network latency, bandwidth costs, and limited throughput. High‑performance scenarios such as AI training, genomics, and industrial simulation require many pods to share the same dataset concurrently.

Fluid – Cloud‑Native Data Abstraction

Fluid is an open‑source distributed data orchestration and acceleration system that introduces a Dataset abstraction. A Dataset aggregates data from multiple back‑ends (NAS, CPFS, OSS, Ceph) and exposes unified CRUD, migration, and observability APIs. Fluid supports two runtime families:

CacheRuntime – provides distributed caching (Alluxio, JuiceFS, EFC, Jindo, GooseFS).

ThinRuntime – offers uniform read‑only access to external stores (s3fs, nfs‑fuse, etc.).

Key capabilities include unified dataset metadata, extensible plugins, CRD‑driven pre‑heat/migration/backup, autoscaling, portability, observability, and runtime‑agnostic deployment via CSI or sidecar.

Requirements for Cloud‑Native Storage

Service stability and automatic recovery across many pods.

Elastic capacity and performance that scale with pod scaling.

Support for rapid massive pod launches (thousands per minute).

Pod‑level observability (PV and dataset metrics).

Near‑local I/O performance despite storage‑compute separation.

Elastic File Client (EFC) Core Features

POSIX protocol – native POSIX interface for NAS/CPFS, enabling containers to mount shared data.

Failover within seconds – the FUSE client recovers automatically within seconds after a crash or upgrade.

Strong consistency – distributed lease ensures immediate visibility of writes across pods.

Enhanced client‑side caching – optimized FUSE cache improves small‑file read/write performance (>50% faster than traditional NFS).

Distributed cache pool – aggregates memory from multiple nodes; cache size scales with compute.

Small‑file prefetch – proactively fetches hot files to reduce latency.

Performance Benchmark

Using the InsightFace (ms1m‑ibug) dataset on a Kubernetes cluster with Arena, EFCRuntime with local caching reduced training time by 87% compared with open‑source NFS. Throughput increased from 648 MiB/s to 1,034.3 MiB/s (≈59.6% improvement).
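The relative throughput gain can be checked with a short calculation. The snippet below is only an illustrative sketch: the helper name pct_gain is hypothetical and not part of Fluid or EFC, and the two inputs are the throughput figures reported above.

```shell
# pct_gain BASELINE NEW
# Prints the relative improvement of NEW over BASELINE, in percent.
# Hypothetical helper for illustration; not part of Fluid or EFC.
pct_gain() {
  LC_ALL=C awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Throughput figures from the benchmark (MiB/s).
pct_gain 648 1034.3   # prints 59.6
```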

Quick Start Guide

Prerequisites: an Alibaba Cloud Container Service (ACK) Pro cluster and an Alibaba Cloud NAS file system.

Step 1 – Create a Dataset and EFCRuntime

Save the following manifest as dataset.yaml and replace NAS_URL and NAS_DIR with your NAS endpoint and directory.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: efc-demo
spec:
  placement: Shared
  mounts:
    - mountPoint: "nfs://NAS_URL:NAS_DIR"
      name: efc
      path: "/"
---
apiVersion: data.fluid.io/v1alpha1
kind: EFCRuntime
metadata:
  name: efc-demo
spec:
  replicas: 3
  master:
    networkMode: ContainerNetwork
  worker:
    networkMode: ContainerNetwork
  fuse:
    networkMode: ContainerNetwork
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 15Gi

Key fields:

mountPoint – NAS or CPFS URL, e.g., nfs://NAS_URL:NAS_DIR.

replicas – number of cache workers; choose based on node memory and dataset size.

networkMode – use ContainerNetwork in ACK to avoid extra latency.

mediumtype – cache media (MEM, SSD, HDD); MEM (in‑memory) is recommended.

path – cache directory, typically /dev/shm (tmpfs).

quota – per‑worker cache capacity; total cache (replicas × quota) should exceed dataset size.
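The sizing rule for replicas and quota can be sanity-checked before applying the manifest. The following is a minimal sketch, assuming sizes are given in whole GiB. The helper name cache_covers_dataset and the 40 GiB dataset size are hypothetical, chosen only for illustration.

```shell
# cache_covers_dataset REPLICAS QUOTA_GIB DATASET_GIB
# Prints the aggregate cache size (replicas x quota) and succeeds only
# if it exceeds the dataset size. Hypothetical helper; whole GiB only.
cache_covers_dataset() {
  total_gib=$(( $1 * $2 ))
  echo "total cache: ${total_gib} GiB"
  [ "$total_gib" -gt "$3" ]
}

# 3 workers x 15 GiB (as in the manifest) against an assumed 40 GiB dataset.
if cache_covers_dataset 3 15 40; then
  echo "cache is large enough"
else
  echo "increase replicas or quota"
fi
```

With the manifest's values the aggregate cache is 3 × 15 GiB = 45 GiB, so any dataset under 45 GiB satisfies the rule.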

Apply the manifest:

kubectl create -f dataset.yaml

Verify the Dataset status:

kubectl get dataset efc-demo
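In a script, the status check can be repeated until the Dataset is ready. The sketch below assumes that the Dataset reports readiness as Bound in its status.phase field. Confirm the field name and value against your Fluid version before relying on them. The helper name wait_for_bound and its defaults are hypothetical.

```shell
# wait_for_bound NAME [ATTEMPTS] [DELAY_SECONDS]
# Polls the Dataset until its phase is "Bound" or the attempts run out.
# Assumption: readiness is exposed as .status.phase == "Bound".
wait_for_bound() {
  name=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Bound" ]; then
      echo "dataset $name is Bound"
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  echo "dataset $name not Bound after $attempts attempts (last phase: ${phase:-unknown})" >&2
  return 1
}

# Usage (requires access to the cluster):
#   wait_for_bound efc-demo
```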

Step 2 – Deploy an Application that Consumes the Dataset

Example app.yaml (a simple StatefulSet that mounts the dataset):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: efc-app
  labels:
    app: nginx
spec:
  serviceName: nginx
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/bash"]
        args: ["-c", "sleep inf"]
        volumeMounts:
        - mountPath: "/data"
          name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: efc-demo

Create the application:

kubectl create -f app.yaml

Check the file size inside a pod (this assumes a 10 GiB test file, /data/allzero-demo, exists on the NAS):

kubectl exec -it efc-app-0 -- du -h /data/allzero-demo

Measure how long each pod takes to read the file:

kubectl exec -it efc-app-0 -- bash -c "time cat /data/allzero-demo > /dev/null"
kubectl exec -it efc-app-1 -- bash -c "time cat /data/allzero-demo > /dev/null"
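The time command reports elapsed seconds rather than throughput, so the two have to be converted by hand. As a rough sketch, the hypothetical helper throughput_mib_s below divides the file size by the elapsed time. The 16-second reading is an invented example value, not a measured result.

```shell
# throughput_mib_s SIZE_GIB ELAPSED_SECONDS
# Prints the average read throughput in MiB/s.
# Hypothetical helper for illustration; fails if the elapsed time is not positive.
throughput_mib_s() {
  LC_ALL=C awk -v gib="$1" -v secs="$2" \
    'BEGIN { if (secs <= 0) exit 1; printf "%.1f\n", gib * 1024 / secs }'
}

# A 10 GiB file read in an assumed 16 s of wall-clock ("real") time.
throughput_mib_s 10 16   # prints 640.0
```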

Dividing the file size by the elapsed time typically shows read throughput rising from ~0.65 GiB/s (NFS) to >1.0 GiB/s with EFCRuntime, confirming the performance gain.

Conclusion

Combining Fluid with EFC provides a stable, elastic, and high‑performance storage layer for cloud‑native AI and big‑data workloads. The solution offers standardized data pre‑heat, migration, and automated operations, and future work will extend support to serverless environments for distributed file access.

References

Fluid project: https://github.com/fluid-cloudnative/fluid

InsightFace dataset: https://github.com/deepinsight/insightface/tree/master/recognition/_datasets_#ms1m-ibug-85k-ids38m-images-56

Arena documentation: https://help.aliyun.com/document_detail/212117.html

EFC documentation: https://help.aliyun.com/document_detail/600930.html
