Accelerating LLM Inference on Alibaba Cloud with KServe and Fluid
This guide explains how to deploy large language models on Alibaba Cloud's ACK using KServe for serverless inference, and how to integrate Fluid for distributed data caching to cut cold‑start latency. It provides step‑by‑step commands, performance benchmarks, and practical tips for production‑grade AI model serving.
Background
KServe is a standard model inference platform built on Kubernetes, designed for highly scalable scenarios and supporting modern serverless workloads. It abstracts common ML frameworks (TensorFlow, XGBoost, Scikit‑Learn, PyTorch, ONNX) and handles auto‑scaling, networking, health checks, and service configuration, including GPU auto‑scaling and canary releases.
Why use KServe for AIGC/LLM
Distributed processing: LLMs have very large parameter counts and need extensive compute; KServe can distribute tasks across multiple nodes for parallel execution.
Serverless: Automatic scale‑out and scale‑in adapt to traffic changes, making large‑model deployment flexible and fast.
Unified deployment: Users can start training and inference without manually configuring environments.
Monitoring & management: Built‑in metrics let users observe model status and adjust parameters promptly.
Challenges with large language models
Long model startup time: Model files can reach hundreds of gigabytes and must be loaded into GPU memory; the storage initializer first pulls the model from remote storage, which slows down serverless auto‑scaling.
Long container image pull time: GPU‑enabled images are large, delaying pod startup.
Low update efficiency: Updating a model requires a container restart and a full model re‑pull, which prevents hot upgrades.
Fluid integration
Fluid is an open‑source, Kubernetes‑native distributed dataset orchestration and acceleration engine. By pre‑warming model data into a distributed cache, Fluid reduces pod startup time by up to 80 % and enables hot upgrades without container restarts.
Prerequisites
Alibaba Cloud Container Service (ACK) cluster with Kubernetes version ≥ 1.18.
ASM (Alibaba Cloud Service Mesh) instance of Enterprise edition with Istio ≥ 1.17, and the cluster added to the ASM instance.
Three ecs.g7.xlarge nodes and one ecs.g7.2xlarge node (or equivalent) in the ACK cluster.
OSS bucket in the same region as the ACK cluster.
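The version minimums listed above (Kubernetes ≥ 1.18, Istio ≥ 1.17, and later ack‑fluid ≥ 0.9.10) can be checked with a small comparison helper. The following is a minimal sketch rather than part of the original procedure: version_ge is a hypothetical helper name, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Return success (exit 0) when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage against a live cluster (needs kubectl access and jq):
#   server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   version_ge "${server#v}" "1.18" || echo "Kubernetes is older than 1.18"
```

The same helper can compare the installed ack‑fluid or Istio version against its minimum once the version string has been read from the console.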
Step 1: Enable KServe on ASM
Log in to the ASM console, go to Service Mesh → Mesh Management.
Select the target mesh instance, then Ecosystem Integration Center → KServe on ASM.
Click Enable KServe on ASM . If cert‑manager is not installed, enable the automatic installation option.
Step 2: Install ack‑fluid and enable AI model cache
Deploy the ack‑fluid component (version ≥ 0.9.10) on the ACK/ASK cluster.
Upload the AI model files to an OSS bucket and note the oss://{bucket}/{path} location.
Create a namespace for the demo:
kubectl create ns kserve-fluid-demo
kubectl label namespace kserve-fluid-demo alibabacloud.com/eci=true
Create a secret for OSS access:
apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: <your-AccessKeyId>
  fs.oss.accessKeySecret: <your-AccessKeySecret>
kubectl apply -f oss-secret.yaml -n kserve-fluid-demo
Declare the dataset and runtime (JindoFS) in oss-jindo.yaml:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
    - mountPoint: "oss://{bucket}/{path}"
      name: bloom-560m
      path: /bloom-560m
      options:
        fs.oss.endpoint: "{endpoint}"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: emptyDir
        path: /mnt/ssd0/cache
        quota: 50Gi
        high: "0.95"
        low: "0.7"
  fuse:
    args:
      - -ometrics_port=-1
  master:
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.xlarge
  worker:
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.xlarge
kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
Verify deployment:
kubectl get jindoruntime,dataset -n kserve-fluid-demo
Pre‑warm the data with a DataLoad CR (oss-dataload.yaml) and apply it:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-dataload
spec:
  dataset:
    name: oss-data
    namespace: kserve-fluid-demo
  target:
    - path: /bloom-560m
      replicas: 2
kubectl create -f oss-dataload.yaml -n kserve-fluid-demo
Check progress with kubectl get dataload -n kserve-fluid-demo.
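Rather than re-running the progress check by hand, the pre‑warm step can be polled until it finishes. The sketch below is an illustration, not part of the original procedure: wait_for_dataload is a hypothetical helper name, and it assumes the DataLoad resource exposes its state in .status.phase with terminal values Complete and Failed (the values shown in the PHASE column of kubectl get dataload).

```shell
# Poll a Fluid DataLoad until it reaches a terminal phase.
# Arguments: <dataload-name> <namespace> [max-attempts] [sleep-seconds]
# Assumption: the phase is exposed at .status.phase as "Complete" or "Failed".
wait_for_dataload() {
  name=$1; ns=$2; attempts=${3:-60}; interval=${4:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get dataload "$name" -n "$ns" -o jsonpath='{.status.phase}')
    case "$phase" in
      Complete) echo "DataLoad $name finished"; return 0 ;;
      Failed)   echo "DataLoad $name failed" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for DataLoad $name (last phase: $phase)" >&2
  return 2
}
```

For this guide the call would be wait_for_dataload oss-dataload kserve-fluid-demo, which by default checks every 10 seconds for up to 60 attempts.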
Step 3: Deploy the AI model inference service
Create an InferenceService manifest (oss-fluid-isvc.yaml) such as:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fluid-bloom"
spec:
  predictor:
    timeout: 600
    minReplicas: 0
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.2xlarge
    containers:
      - name: kserve-container
        image: cheyang/kserve-fluid:bloom-gpu
        resources:
          limits:
            cpu: "3"
            memory: 8Gi
          requests:
            cpu: "3"
            memory: 8Gi
        env:
          - name: STORAGE_URI
            value: "pvc://oss-data/bloom-560m"
          - name: MODEL_NAME
            value: "bloom"
          - name: GPU_ENABLED
            value: "False"
Apply the manifest:
kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
Verify the service is ready:
kubectl get inferenceservice -n kserve-fluid-demo
Step 4: Access the inference service
Obtain the ASM ingress gateway address from the ASM console and run:
curl -v -H "Content-Type: application/json" \
  -H "Host: fluid-bloom.kserve-fluid-demo.example.com" \
  "http://{ASM_GATEWAY}:80/v1/models/bloom:predict" \
  -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'
The response contains the generated continuation of the prompt, confirming successful inference.
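The request above follows a fixed pattern: the Host header is the service name, the namespace, and the domain joined by dots, and the URL path names the model. The helpers below are a minimal sketch of that pattern for reuse with other services; isvc_host, predict_url, and predict are hypothetical names, and the example.com default simply mirrors the domain used in the command above.

```shell
# Build the Host header value: <service>.<namespace>.<domain>.
# The domain defaults to example.com, as used in this guide.
isvc_host() {
  printf '%s.%s.%s\n' "$1" "$2" "${3:-example.com}"
}

# Build the predict URL for a model behind the gateway.
# Arguments: <gateway-address> <model-name>
predict_url() {
  printf 'http://%s:80/v1/models/%s:predict\n' "$1" "$2"
}

# Send a JSON body to the model.
# Arguments: <gateway-address> <service> <namespace> <model-name> <json-body>
predict() {
  curl -s -H "Content-Type: application/json" \
    -H "Host: $(isvc_host "$2" "$3")" \
    "$(predict_url "$1" "$4")" -d "$5"
}
```

With the gateway address stored in a variable, the request from this step becomes predict "$ASM_GATEWAY" fluid-bloom kserve-fluid-demo bloom '{"prompt": "It was a dark and stormy night", "result_length": 50}'.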
Performance Benchmark
The benchmark compares cold‑start latency of KServe using its native storage initializer versus KServe + Fluid for two models.
bigscience/bloom‑560m (3.14 GB) on ecs.g7.2xlarge:
KServe + Storage Initializer: total ≈ 58.0 s (download 33.9 s, load 5.0 s).
KServe + Fluid: total ≈ 8.5 s (load 2.35 s) with 2 workers.
bigscience/bloom‑7b1 (26.35 GB) on ecs.g7.4xlarge:
KServe + Storage Initializer: total ≈ 329 s (download 228 s, load 72 s).
KServe + Fluid: total ≈ 27.8 s (load 12.1 s) with 3 workers.
Fluid dramatically reduces cold‑start time, especially for larger models, by caching model files and avoiding repeated remote downloads.
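To make the comparison concrete, the totals above can be turned into a speedup factor and a percentage reduction. The snippet below is a small worked calculation using only the figures reported in this section; cold_start_gain is a hypothetical helper name.

```shell
# Print the speedup factor and the percentage reduction for a baseline and an
# accelerated cold-start time, both in seconds.
cold_start_gain() {
  awk -v base="$1" -v fast="$2" 'BEGIN {
    printf "%.1fx faster, %.1f%% less time\n", base / fast, (1 - fast / base) * 100
  }'
}

cold_start_gain 58.0 8.5    # bloom-560m: 6.8x faster, 85.3% less time
cold_start_gain 329 27.8    # bloom-7b1: 11.8x faster, 91.6% less time
```

The larger model benefits more because the remote download (228 s of the 329 s total) dominates its baseline cold start.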
Conclusion and Outlook
Integrating Fluid with KServe on Alibaba Cloud’s serverless Kubernetes platform provides a simple, plug‑in‑compatible solution that cuts model startup latency, improves elasticity, and enables hot upgrades without container restarts. Future work includes cost‑aware elastic scaling of the cache and hot‑update mechanisms for even larger LLMs.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
