Accelerating LLM Inference on Alibaba Cloud with KServe and Fluid

This guide explains how to deploy large language models on Alibaba Cloud ACK using KServe for serverless inference, and how to integrate Fluid's distributed data cache to cut cold-start latency. It provides step-by-step commands, performance benchmarks, and practical tips for production-grade AI model serving.

Alibaba Cloud Native

Jun 23, 2023

Background

KServe is a standard model inference platform built on Kubernetes, designed for highly scalable scenarios and supporting modern serverless workloads. It abstracts common ML frameworks (TensorFlow, XGBoost, Scikit‑Learn, PyTorch, ONNX) and handles auto‑scaling, networking, health checks, and service configuration, including GPU auto‑scaling and canary releases.

Why use KServe for AIGC/LLM

Distributed processing: LLMs have a massive number of parameters and require extensive compute; KServe can distribute work across multiple nodes for parallel execution.

Serverless: Automatic scale-out and scale-in adapts to traffic changes, making large-model deployment flexible and fast.

Unified deployment: Users can start training and inference without manually configuring environments.

Monitoring & management: Built-in metrics let users observe model status and adjust parameters promptly.

Challenges with large language models

Long model startup time: Model weights can reach hundreds of gigabytes and must be loaded into GPU memory. KServe's storage initializer first pulls the model from remote storage, which slows down serverless auto-scaling.

Long container image pull time: GPU-enabled images are large, which delays pod startup.

Low update efficiency: Updating a model requires a container restart and a full re-pull of the model, which prevents hot upgrades.

Fluid integration

Fluid is an open-source, Kubernetes-native distributed dataset orchestration and acceleration engine. By pre-warming model data into a distributed cache, Fluid reduces pod startup time by up to 80% and enables hot upgrades without container restarts.

Prerequisites

Alibaba Cloud Container Service (ACK) cluster with Kubernetes version ≥ 1.18.

ASM (Alibaba Cloud Service Mesh) instance of Enterprise edition with Istio ≥ 1.17, and the cluster added to the ASM instance.

Three ecs.g7.xlarge nodes and one ecs.g7.2xlarge node (or equivalent) in the ACK cluster.

OSS bucket in the same region as the ACK cluster.

Step 1: Enable KServe on ASM

In the ASM console, select the target mesh instance, then choose Ecosystem Integration Center → KServe on ASM.

Click Enable KServe on ASM. If cert-manager is not already installed in the cluster, enable the automatic installation option.

Step 2: Install ack‑fluid and enable AI model cache

Deploy the ack‑fluid component (version ≥ 0.9.10) on the ACK/ASK cluster.

Upload the AI model files to an OSS bucket and note the oss://{bucket}/{path} location.

Create a namespace for the demo:

kubectl create ns kserve-fluid-demo

kubectl label namespace kserve-fluid-demo alibabacloud.com/eci=true

Create a secret for OSS access (oss-secret.yaml) and apply it:

apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: <your‑AccessKeyId>
  fs.oss.accessKeySecret: <your‑AccessKeySecret>

kubectl apply -f oss-secret.yaml -n kserve-fluid-demo

Declare the dataset and runtime (JindoFS) in oss-jindo.yaml:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
  - mountPoint: "oss://{bucket}/{path}"
    name: bloom-560m
    path: /bloom-560m
    options:
      fs.oss.endpoint: "{endpoint}"
    encryptOptions:
    - name: fs.oss.accessKeyId
      valueFrom:
        secretKeyRef:
          name: access-key
          key: fs.oss.accessKeyId
    - name: fs.oss.accessKeySecret
      valueFrom:
        secretKeyRef:
          name: access-key
          key: fs.oss.accessKeySecret
  accessModes:
  - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
    - mediumtype: SSD
      volumeType: emptyDir
      path: /mnt/ssd0/cache
      quota: 50Gi
      high: "0.95"
      low: "0.7"
  fuse:
    args:
    - -ometrics_port=-1
  master:
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.xlarge
  worker:
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.xlarge

kubectl create -f oss-jindo.yaml -n kserve-fluid-demo
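
As a rough sanity check on the cache sizing in the JindoRuntime above, the following sketch compares the declared cache capacity with the data that the pre-warm step is expected to place in it. It assumes that usable capacity is roughly workers × quota × high watermark and that every cached copy occupies the model's full size. Both are simplifications for illustration, not a description of JindoFS internals, and the function name and parameters are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB (the Kubernetes "Gi" suffix)
GB = 1000 ** 3   # bytes per GB (model sizes in this guide are quoted in GB)


def cache_headroom_gib(model_size_gb: float, cached_copies: int,
                       workers: int, quota_gib: float,
                       high_watermark: float) -> float:
    """Return spare cache capacity in GiB (negative: the cache is too small).

    Assumption: usable capacity = workers * quota * high_watermark, and each
    cached copy takes up the full model size.
    """
    usable_bytes = workers * quota_gib * GIB * high_watermark
    needed_bytes = model_size_gb * GB * cached_copies
    return (usable_bytes - needed_bytes) / GIB


# Values from this guide: 2 workers, quota 50Gi, high "0.95",
# and 2 cached copies of the 3.14 GB bloom-560m model.
headroom = cache_headroom_gib(3.14, cached_copies=2, workers=2,
                              quota_gib=50, high_watermark=0.95)
print(f"spare cache capacity: {headroom:.1f} GiB")
```

A negative result would suggest raising quota or replicas before pre-warming a larger model.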

Verify deployment:

kubectl get jindoruntime,dataset -n kserve-fluid-demo

Pre-warm the data with a DataLoad CR (oss-dataload.yaml) and apply it:

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-dataload
spec:
  dataset:
    name: oss-data
    namespace: kserve-fluid-demo
  target:
  - path: /bloom-560m
    replicas: 2

kubectl create -f oss-dataload.yaml -n kserve-fluid-demo

Check progress with kubectl get dataload -n kserve-fluid-demo.
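
For scripted pipelines, the progress check can be automated by reading the same object as JSON (for example via kubectl's -o json output). The sketch below is a minimal, hypothetical helper. It assumes the DataLoad reports its state in status.phase with "Complete" and "Failed" as terminal values, and treats every other value as still in progress because phase names may differ between Fluid versions.

```python
import json


def dataload_state(dataload_json: str) -> str:
    """Classify a DataLoad object as 'complete', 'failed', or 'in-progress'.

    Assumption: the state lives in status.phase, and "Complete" / "Failed"
    are the terminal values. A missing status counts as in progress.
    """
    obj = json.loads(dataload_json)
    phase = (obj.get("status") or {}).get("phase", "")
    if phase == "Complete":
        return "complete"
    if phase == "Failed":
        return "failed"
    return "in-progress"


# Hypothetical sample, trimmed to the fields the helper actually reads.
sample = ('{"kind": "DataLoad", "metadata": {"name": "oss-dataload"}, '
          '"status": {"phase": "Complete"}}')
print(dataload_state(sample))
```

In practice the input would come from kubectl get dataload oss-dataload -n kserve-fluid-demo -o json, polled until the helper stops returning 'in-progress'.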

Step 3: Deploy the AI model inference service

Create an InferenceService manifest (oss-fluid-isvc.yaml) such as:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fluid-bloom"
spec:
  predictor:
    timeout: 600
    minReplicas: 0
    nodeSelector:
      node.kubernetes.io/instance-type: ecs.g7.2xlarge
    containers:
    - name: kserve-container
      image: cheyang/kserve-fluid:bloom-gpu
      resources:
        limits:
          cpu: "3"
          memory: 8Gi
        requests:
          cpu: "3"
          memory: 8Gi
      env:
      - name: STORAGE_URI
        value: "pvc://oss-data/bloom-560m"
      - name: MODEL_NAME
        value: "bloom"
      - name: GPU_ENABLED
        value: "False"

Apply the manifest:

kubectl create -f oss-fluid-isvc.yaml -n kserve-fluid-demo
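
The STORAGE_URI value in the manifest uses the pvc:// scheme, which has the shape pvc://<claim-name>/<path-inside-the-volume>. Here the claim name oss-data matches the Fluid Dataset (Fluid exposes a bound Dataset as a PVC of the same name), and bloom-560m matches the mount path declared in the Dataset. The helper below is a hypothetical illustration of that split, not code taken from KServe.

```python
from typing import Tuple


def parse_pvc_uri(storage_uri: str) -> Tuple[str, str]:
    """Split 'pvc://<claim-name>/<path>' into (claim_name, path)."""
    prefix = "pvc://"
    if not storage_uri.startswith(prefix):
        raise ValueError(f"not a pvc:// URI: {storage_uri!r}")
    # Everything up to the first "/" is the claim name; the rest is the path.
    claim, _, path = storage_uri[len(prefix):].partition("/")
    if not claim:
        raise ValueError("missing PVC name")
    return claim, path


claim, path = parse_pvc_uri("pvc://oss-data/bloom-560m")
print(claim, path)
```

If the Dataset or its mount path is renamed, both halves of the URI need to change with it.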

Verify the service is ready:

kubectl get inferenceservice -n kserve-fluid-demo
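
To gate later steps on readiness in a script, the same check can be made against the object's JSON form. This is a minimal sketch with a hypothetical helper name. It relies on the InferenceService publishing a condition of type Ready under status.conditions, and treats a missing status as not ready.

```python
import json


def isvc_ready(isvc_json: str) -> bool:
    """Return True when the InferenceService has a condition Ready=True."""
    obj = json.loads(isvc_json)
    conditions = (obj.get("status") or {}).get("conditions") or []
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions)


# Hypothetical sample, trimmed to the fields the helper actually reads.
sample_isvc = json.dumps({
    "kind": "InferenceService",
    "metadata": {"name": "fluid-bloom"},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
})
print(isvc_ready(sample_isvc))
```

Where no custom logic is needed, kubectl wait --for=condition=Ready inferenceservice/fluid-bloom -n kserve-fluid-demo --timeout=600s should achieve the same effect from the shell.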

Step 4: Access the inference service

Obtain the ASM ingress gateway address from the ASM console, substitute it for {ASM_GATEWAY}, and run:

curl -v -H "Content-Type: application/json" -H "Host: fluid-bloom.kserve-fluid-demo.example.com" "http://{ASM_GATEWAY}:80/v1/models/bloom:predict" -d '{"prompt": "It was a dark and stormy night", "result_length": 50}'

The response contains the generated continuation of the prompt, confirming successful inference.
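
The same request can be issued from application code. The sketch below mirrors the curl call using only the Python standard library. The function names are hypothetical, the default values are taken from this guide, and 203.0.113.10 is a documentation-only placeholder for {ASM_GATEWAY}.

```python
import json
import urllib.request


def build_predict_request(gateway: str, prompt: str, result_length: int = 50,
                          isvc: str = "fluid-bloom",
                          namespace: str = "kserve-fluid-demo",
                          model: str = "bloom",
                          domain: str = "example.com") -> urllib.request.Request:
    """Build (but do not send) the POST request used in the curl example."""
    body = json.dumps({"prompt": prompt,
                       "result_length": result_length}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{gateway}:80/v1/models/{model}:predict",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The gateway routes on the Host header, not on the IP address.
            "Host": f"{isvc}.{namespace}.{domain}",
        },
    )


def predict(gateway: str, prompt: str, timeout_s: float = 600.0) -> str:
    """Send the request and return the raw response body as text."""
    req = build_predict_request(gateway, prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


# Only builds the request; call predict() against a real gateway to send it.
req = build_predict_request("203.0.113.10", "It was a dark and stormy night")
print(req.get_method(), req.full_url)
print(req.get_header("Host"))
```

The 600-second client timeout matches the predictor timeout in the manifest, which leaves room for a cold start when the service has scaled to zero.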

Performance Benchmark

The benchmark compares cold‑start latency of KServe using its native storage initializer versus KServe + Fluid for two models.

bigscience/bloom‑560m (3.14 GB) on ecs.g7.2xlarge:

KServe + Storage Initializer: total ≈ 58.0 s (download 33.9 s, load 5.0 s).

KServe + Fluid: total ≈ 8.5 s (load 2.35 s) with 2 workers.

bigscience/bloom‑7b1 (26.35 GB) on ecs.g7.4xlarge:

KServe + Storage Initializer: total ≈ 329 s (download 228 s, load 72 s).

KServe + Fluid: total ≈ 27.8 s (load 12.1 s) with 3 workers.

Fluid dramatically reduces cold‑start time, especially for larger models, by caching model files and avoiding repeated remote downloads.
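
The reported totals can be turned into speedup and reduction figures with simple arithmetic. The sketch below uses only the numbers quoted above; the dictionary layout and function names are hypothetical.

```python
# Figures copied from the benchmark: model size, storage-initializer download
# time and total, and the KServe + Fluid total (all times in seconds).
RESULTS = {
    "bloom-560m": {"size_gb": 3.14, "download_s": 33.9,
                   "baseline_s": 58.0, "fluid_s": 8.5},
    "bloom-7b1": {"size_gb": 26.35, "download_s": 228.0,
                  "baseline_s": 329.0, "fluid_s": 27.8},
}


def speedup(baseline_s: float, fluid_s: float) -> float:
    """How many times faster the Fluid path starts."""
    return baseline_s / fluid_s


def reduction_pct(baseline_s: float, fluid_s: float) -> float:
    """Percentage of cold-start time removed by the Fluid path."""
    return (1 - fluid_s / baseline_s) * 100


def download_mb_per_s(size_gb: float, download_s: float) -> float:
    """Effective remote download throughput of the storage initializer."""
    return size_gb * 1000 / download_s


for name, r in RESULTS.items():
    print(f"{name}: {speedup(r['baseline_s'], r['fluid_s']):.1f}x faster, "
          f"{reduction_pct(r['baseline_s'], r['fluid_s']):.1f}% less cold-start time, "
          f"remote download ~{download_mb_per_s(r['size_gb'], r['download_s']):.0f} MB/s")
```

On these figures the Fluid path starts roughly 6.8 times faster for bloom-560m and 11.8 times faster for bloom-7b1, which is consistent with the observation that the gain grows with model size.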

Conclusion and Outlook

Integrating Fluid with KServe on Alibaba Cloud’s serverless Kubernetes platform provides a simple, plug‑in‑compatible solution that cuts model startup latency, improves elasticity, and enables hot upgrades without container restarts. Future work includes cost‑aware elastic scaling of the cache and hot‑update mechanisms for even larger LLMs.
