
Elastic Scaling of Large Language Model Inference on Alibaba Cloud ACK with Knative, ResourcePolicy, and Fluid

This article explains how to reduce inference cost and improve performance for large language models on Alibaba Cloud ACK by using Knative's request‑based autoscaling, custom ResourcePolicy priority scheduling, and Fluid data‑caching to achieve elastic scaling, resource pre‑emption, and faster model loading.

Alibaba Cloud Infrastructure

In the era of AI commercialization, model inference is used far more often than training, making efficient, cost‑effective inference a top priority for infrastructure teams.

The main challenges are high GPU cost and the need to balance cost against performance, especially for consumer‑facing services where low latency is critical. Traditional GPU provisioning leaves resources idle during traffic troughs, while scaling down to zero can cause long cold‑start times.

To address these issues, the article proposes a "reverse elastic scaling" approach that keeps a fixed set of GPU nodes but dynamically reallocates workloads and pre‑empts lower‑priority tasks (such as training or offline inference) during low‑traffic periods. Three prerequisites are required: a scientific autoscaling mechanism, guaranteed compute resources at scaling time, and SLA protection for user experience.

Scientific Autoscaling Mechanism

Knative supports the Kubernetes Horizontal Pod Autoscaler (HPA), and ACK adds CronHPA, but HPA reacts slowly and offers limited support for GPU and request‑level metrics. The solution therefore uses Knative's KPA (Knative Pod Autoscaler) with concurrency as the primary metric, which better matches the long‑running nature of LLM inference requests.

Resource Guarantee at Scale‑up

During scale‑up, lower‑priority pods (e.g., training jobs) are pre‑empted using the ACK Pro scheduler's ResourcePolicy, allowing high‑priority inference pods to acquire GPU resources immediately.

SLA Protection During Scaling

Because LLM images and models are large (e.g., a 5 GB vLLM image and a 15 GB Qwen‑2.5‑7B model), loading from network storage can dominate latency. The solution integrates Fluid to cache model data locally, reducing cold‑start time dramatically.

Overall Architecture

The architecture combines Knative (KPA), ResourcePolicy, and Fluid on an ACK cluster. Knative detects load changes and adjusts pod counts, ResourcePolicy defines priority‑based GPU scheduling and pre‑emption, and Fluid provides tiered, in‑memory caching of model files.

Quick Practice Steps

1. Prepare the environment: create an ACK cluster and install the Knative and Fluid components; provision GPU nodes (e.g., A10 and T4 instances) and storage volumes.

2. Create a Fluid Dataset and JindoRuntime to cache the model (example YAML shown below).

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen-7b-chat-int8-dataset
spec:
  mounts:
    - mountPoint: pvc://qwen-7b-chat-int8
      name: data
      path: /
  accessModes:
    - ReadOnlyMany
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - "ecs.g8i.24xlarge"
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen-7b-chat-int8-dataset
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.9"
        low: "0.8"

3. Warm up the cache with a DataLoad resource (YAML omitted for brevity).
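Since the article omits the warm‑up YAML, here is a minimal sketch of what a Fluid DataLoad for this setup could look like; the resource name and replica count are illustrative assumptions, and the dataset reference must match the Dataset created in step 2:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  # Hypothetical name; any unique name in the namespace works
  name: qwen-7b-chat-int8-warmup
spec:
  dataset:
    # Must reference the Dataset created in step 2
    name: qwen-7b-chat-int8-dataset
    namespace: default
  loadMetadata: true
  target:
    # Preload the entire mount into the tiered (in-memory) cache
    - path: /
      replicas: 2
```

Running this once after the JindoRuntime workers are up pulls the model files into the memory tier, so the first inference pod does not pay the network‑storage read penalty.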

4. Define a ResourcePolicy to prioritize A10 GPUs and enable pre‑emptive scheduling of training pods.

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: qwen
spec:
  selector:
    release: qwen
  strategy: prefer
  preemptPolicy: BeforeNextUnit
  units:
    - resource: ecs
      nodeSelector:
        aliyun.accelerator/nvidia_name: NVIDIA-A10
    - resource: ecs
      nodeSelector:
        aliyun.accelerator/nvidia_name: Tesla-T4

5. Deploy the inference service as a Knative Service with KPA annotations, setting concurrency target, min/max scale, and priority class for pre‑emptive scheduling.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: qwen
  labels:
    release: qwen
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "2"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "3"
    spec:
      containers:
        - image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
          command: ["sh", "-c", "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"]
          resources:
            limits:
              cpu: "32"
              memory: 64Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "16"
              memory: 64Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: qwen-7b-chat-int8
              mountPath: /mnt/models/Qwen-7B-Chat-Int8
      priorityClassName: inference-with-high-priority
      volumes:
        - name: qwen-7b-chat-int8
          persistentVolumeClaim:
            claimName: qwen-7b-chat-int8-dataset
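The Service references a PriorityClass via priorityClassName, but the article does not show its definition. A minimal sketch follows; the name must match whatever the Service's pod spec references, and the value of 1000000 is an assumption — it only needs to exceed the priority assigned to the training or offline pods being pre‑empted:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  # Must match the priorityClassName in the Knative Service's pod spec
  name: inference-with-high-priority
value: 1000000          # Higher than the training workloads' priority value
globalDefault: false
description: "High priority for LLM inference pods so the scheduler can preempt training jobs"
```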

6. Enable Knative reserve‑instance feature to keep a low‑spec GPU node during idle periods, balancing cost and cold‑start latency.
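On ACK, the reserve‑instance feature is driven by annotations on the Knative Service's revision template. The fragment below is a sketch only: the annotation keys follow ACK's knative.aliyun.com namespace and the reserve instance spec is an illustrative assumption, so verify both against the ACK documentation for your cluster version:

```yaml
spec:
  template:
    metadata:
      annotations:
        # Keep one low-spec reserve instance alive when traffic drops,
        # instead of scaling fully to zero (keys assumed from ACK Knative docs)
        knative.aliyun.com/reserve-instance: "enable"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.gn5i-c4g1.xlarge"
```

The trade‑off is explicit: the reserve instance costs less than a full A10 node but keeps the image and model warm, shrinking cold‑start latency when traffic returns.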

Performance Testing

Tests compare a scale‑to‑zero baseline with and without Fluid, as well as reserve‑instance configurations. Results show Fluid reduces model loading time from ~90 s to ~20 s and improves total time‑to‑first‑token (TTFT) in all scenarios, especially when a reserve instance is kept.

Conclusion

Knative’s request‑based autoscaling aligns well with LLM inference workloads, and its reserve‑instance capability helps cut costs. ResourcePolicy adds fine‑grained priority and pre‑emptive scheduling for heterogeneous GPU clusters, while Fluid dramatically speeds up model loading, making elastic inference both cost‑effective and responsive.

Tags: LLM, Kubernetes, elastic scaling, Inference, Knative, ResourcePolicy, Fluid