What Is Fluid? A Cloud‑Native Data Orchestration and Acceleration Platform
Fluid is an open‑source cloud‑native data orchestration and acceleration system that runs on Kubernetes, offering storage‑agnostic datasets, distributed caching, intelligent scheduling, and performance optimizations for data‑intensive AI and big‑data workloads.
The project, hosted at https://github.com/fluid-cloudnative/fluid, was accepted as a CNCF sandbox project in April 2021. It addresses the data‑access latency and data‑management challenges that data‑intensive workloads (big data, AI) face when compute and storage are separated, by providing distributed caching and intelligent scheduling.
Core Architecture
Dataset CRD: A Custom Resource Definition abstracts heterogeneous storage systems (object stores, HDFS, etc.) as a storage‑agnostic data object, enabling observability and elastic scaling.
CacheRuntime: Extends the Kubernetes API to manage distributed cache engines. Native support includes Alluxio and JindoFS.
Intelligent orchestration: Uses Kubernetes container scheduling and auto‑scaling to deploy cache instances close to the consuming pods.
Co‑scheduling: The scheduler is extended to be cache‑aware, allowing pods to be placed on nodes where the required dataset is already cached, reducing data‑access latency.
Standard access: Datasets are exposed to applications via the Persistent Volume Claim (PVC) interface, requiring no code changes in cloud‑native workloads.
Scenario‑driven tuning: Provides mechanisms for dataset pre‑warming, metadata optimization, small‑file I/O improvement, and automatic elastic scaling to boost performance for deep‑learning and batch‑processing jobs.
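The dataset pre-warming mentioned above can be sketched with Fluid's DataLoad resource, which asks the runtime to load a Dataset's contents into the cache before a job starts. This is a minimal sketch rather than a tuned configuration: the Dataset name s3-data, the namespace default, the DataLoad name s3-data-warmup, and the file name dataload.yaml are assumptions for illustration, and the Dataset is assumed to be already bound to a runtime.

```shell
# Write a DataLoad manifest that pre-warms an existing Dataset.
# Assumption: a bound Dataset named "s3-data" exists in "default".
cat > dataload.yaml <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: s3-data-warmup   # hypothetical name
spec:
  dataset:
    name: s3-data
    namespace: default
EOF
```

Applying the file with kubectl apply -f dataload.yaml starts the load, and kubectl get dataload s3-data-warmup reports its progress.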
Usage Example
# Install Fluid CRDs
kubectl apply -f https://github.com/fluid-cloudnative/fluid/releases/download/v0.9.0/crds.yaml
# Create a Dataset that points to an S3 bucket
cat > dataset.yaml <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-data
spec:
  mounts:
    - mountPoint: "s3://my-bucket"
      name: s3
EOF
kubectl apply -f dataset.yaml
# Deploy a workload that consumes the dataset via PVC
cat > job.yaml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-job
spec:
  template:
    spec:
      restartPolicy: Never   # Job pods must use Never or OnFailure
      containers:
        - name: spark
          image: spark:latest
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: s3-data   # Fluid names the PVC after the Dataset
EOF
kubectl apply -f job.yaml
The steps above illustrate how a Dataset is defined and how an application accesses the data through a standard PVC. Fluid provisions the cache once a runtime, such as an AlluxioRuntime, is bound to the Dataset.
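A Dataset on its own does not start any cache; Fluid binds it to a runtime resource with the same name in the same namespace. The following is a minimal sketch assuming Alluxio as the cache engine. The replica count, the in-memory tier at /dev/shm, the 2Gi quota, and the file name runtime.yaml are illustrative assumptions, not recommendations.

```shell
# Write an AlluxioRuntime manifest whose name matches the Dataset
# so that Fluid can bind the two (all values below are assumptions).
cat > runtime.yaml <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-data            # must match the Dataset name
spec:
  replicas: 1              # number of cache workers
  tieredstore:
    levels:
      - mediumtype: MEM    # keep cached data in memory
        path: /dev/shm
        quota: 2Gi         # cache capacity
        high: "0.95"       # eviction watermarks
        low: "0.7"
EOF
```

After kubectl apply -f runtime.yaml, kubectl get dataset s3-data should eventually report the Dataset as Bound, at which point the PVC used by the job becomes available.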
Adoption and Outlook
Since its open‑source release in September 2020, Fluid has been adopted by large enterprises such as Weibo, Qihoo 360, and China Telecom. The core maintainers are from Nanjing University, Alibaba Cloud, and the Alluxio community, with contributions from engineers at several Chinese tech firms. Future development aims to make the architecture more flexible, intelligent, and extensible, further integrating academic research with industrial practice to support a broader range of big‑data and AI workloads on native Kubernetes.
Related Links
Alluxio: https://www.alluxio.io/
JindoFS: https://github.com/aliyun/alibabacloud-jindofs
Alibaba Cloud Native
