Elastic Scaling of Large Language Model Inference on Alibaba Cloud ACK with Knative, ResourcePolicy, and Fluid
This article explains how to reduce inference cost and improve performance for large language models on Alibaba Cloud ACK by combining Knative's request-based autoscaling, the ACK Pro scheduler's ResourcePolicy priority scheduling, and Fluid data caching, which together provide elastic scaling, resource pre-emption, and faster model loading.
In the era of AI commercialization, model inference is used far more often than training, making efficient, cost‑effective inference a top priority for infrastructure teams.
The main challenges are high GPU cost and the need to balance cost against performance, especially for consumer-facing services where low latency is critical. Traditional static GPU provisioning leaves resources idle during traffic troughs, while scaling down to zero causes long cold starts.
To address these issues, the article proposes a "reverse elastic scaling" approach that keeps a fixed set of GPU nodes but dynamically reallocates workloads and pre‑empts lower‑priority tasks (such as training or offline inference) during low‑traffic periods. Three prerequisites are required: a scientific autoscaling mechanism, guaranteed compute resources at scaling time, and SLA protection for user experience.
Scientific Autoscaling Mechanism
Knative on ACK supports the Horizontal Pod Autoscaler (HPA) and CronHPA, but HPA reacts slowly and offers limited support for GPU metrics. The solution instead uses Knative's KPA (Knative Pod Autoscaler) with concurrency as the primary metric, which better matches the long-running nature of LLM inference requests.
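The concurrency-based sizing rule can be sketched as below. This is a simplification for illustration: the real KPA averages observed concurrency over stable and panic windows, but the core decision is "enough pods so each handles at most the target concurrency," clamped to the configured bounds.

```python
import math

def desired_pods(observed_concurrency: float, target: float,
                 min_scale: int, max_scale: int) -> int:
    """Approximate KPA stable-mode sizing: enough pods so that each
    serves at most `target` concurrent requests, clamped to bounds."""
    want = math.ceil(observed_concurrency / target)
    return max(min_scale, min(max_scale, want))

# With the annotations used later in this article (target 2, min 1, max 3):
pods_under_load = desired_pods(5, target=2, min_scale=1, max_scale=3)   # 3 pods
pods_when_idle = desired_pods(0, target=2, min_scale=1, max_scale=3)    # 1 pod
```

Because the metric is in-flight requests rather than CPU or GPU utilization, scaling tracks actual demand even when each request occupies a GPU for many seconds.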
Resource Guarantee at Scale‑up
During scale-up, lower-priority pods (e.g., training jobs) are pre-empted via the ACK Pro scheduler's ResourcePolicy, allowing high-priority inference pods to acquire GPU resources immediately.
SLA Protection During Scaling
Because LLM images and models are large (e.g., a 5 GB vLLM image and a 15 GB Qwen‑2.5‑7B model), loading from network storage can dominate latency. The solution integrates Fluid to cache model data locally, reducing cold‑start time dramatically.
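The article's ~90 s versus ~20 s load times (reported in the testing section) are consistent with simple bandwidth arithmetic. The bandwidth figures below are illustrative assumptions chosen to match those numbers, not measurements:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
model_bytes = 15 * GIB  # ~15 GB of model weights, per the article

def load_seconds(size_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to stream the model at a given sustained bandwidth."""
    return size_bytes / bandwidth_bytes_per_s

remote = load_seconds(model_bytes, 170 * MIB)  # assumed network-storage throughput
cached = load_seconds(model_bytes, 750 * MIB)  # assumed Fluid in-memory cache throughput
# remote ≈ 90 s, cached ≈ 20 s
```

The point of the arithmetic: at multi-gigabyte model sizes, storage bandwidth, not container startup, dominates cold-start latency, which is why a memory-tier cache pays off.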
Overall Architecture
The architecture combines Knative (KPA), ResourcePolicy, and Fluid on an ACK cluster. Knative detects load changes and adjusts pod counts, ResourcePolicy defines priority‑based GPU scheduling and pre‑emption, and Fluid provides tiered, in‑memory caching of model files.
Quick Practice Steps
1. Prepare the environment: create an ACK cluster with Knative and Fluid installed; provision GPU nodes (e.g., A10 and T4 instances) and storage volumes.
2. Create a Fluid Dataset and JindoRuntime to cache the model (example YAML shown below).
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen-7b-chat-int8-dataset
spec:
  mounts:
    - mountPoint: pvc://qwen-7b-chat-int8
      name: data
      path: /
  accessModes:
    - ReadOnlyMany
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - "ecs.g8i.24xlarge"
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen-7b-chat-int8-dataset
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.9"
        low: "0.8"
```
3. Warm up the cache with a DataLoad resource (YAML omitted for brevity).
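A minimal DataLoad sketch for this warm-up step might look like the following; the resource name and namespace here are illustrative, while the dataset name matches the Dataset defined above:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: qwen-7b-chat-int8-warmup   # illustrative name
spec:
  dataset:
    name: qwen-7b-chat-int8-dataset
    namespace: default             # adjust to your namespace
  loadMetadata: true
  target:
    - path: /
      replicas: 2                  # prefetch into both JindoRuntime workers
```

Applying this resource prefetches the model files into the memory tier so the first inference pod does not pay the network-storage penalty.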
4. Define a ResourcePolicy to prioritize A10 GPUs and enable pre‑emptive scheduling of training pods.
```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: qwen
spec:
  selector:
    release: qwen
  strategy: prefer
  preemptPolicy: BeforeNextUnit
  units:
    - resource: ecs
      nodeSelector:
        aliyun.accelerator/nvidia_name: NVIDIA-A10
    - resource: ecs
      nodeSelector:
        aliyun.accelerator/nvidia_name: Tesla-T4
```
5. Deploy the inference service as a Knative Service with KPA annotations, setting the concurrency target, min/max scale, and a priority class for pre-emptive scheduling.
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: qwen
  labels:
    release: qwen
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "2"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "3"
    spec:
      containers:
        - image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
          command: ["sh", "-c", "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"]
          resources:
            limits:
              cpu: "32"
              memory: 64Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "16"
              memory: 64Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: qwen-7b-chat-int8
              mountPath: /mnt/models/Qwen-7B-Chat-Int8
      priorityClassName: infrence-with-high-priority
      volumes:
        - name: qwen-7b-chat-int8
          persistentVolumeClaim:
            claimName: qwen-7b-chat-int8-dataset
```
6. Enable Knative's reserve-instance feature to keep a low-spec GPU instance during idle periods, balancing cost and cold-start latency.
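Once deployed, the service exposes vLLM's OpenAI-compatible API. A minimal Python client sketch follows; the hostname is a placeholder for your cluster's Knative gateway route, while the `model` field must match the `--served-model-name qwen` flag from the Service above:

```python
import json
from urllib import request

# Placeholder host: substitute the Knative route for the `qwen` service.
URL = "http://qwen.default.example.com/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat request for the vLLM server."""
    return {
        "model": "qwen",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def ask(prompt: str) -> str:
    """Send one chat request and return the generated text."""
    req = request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Each request to an idle service counts toward observed concurrency, so traffic through this endpoint is exactly what drives KPA's scaling decisions.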
Performance Testing
Tests compare a scale-to-zero baseline with and without Fluid, as well as reserve-instance configurations. Results show Fluid reduces model loading time from ~90 s to ~20 s and improves total time-to-first-token (TTFT) across all scenarios, especially when a reserve instance is kept.
Conclusion
Knative’s request‑based autoscaling aligns well with LLM inference workloads, and its reserve‑instance capability helps cut costs. ResourcePolicy adds fine‑grained priority and pre‑emptive scheduling for heterogeneous GPU clusters, while Fluid dramatically speeds up model loading, making elastic inference both cost‑effective and responsive.