Hybrid Cloud Elastic LLM Inference Solution with ACK Edge and KServe
This article presents a hybrid‑cloud solution that uses ACK Edge and KServe to dynamically allocate on‑premise and cloud GPU resources for large‑language‑model inference, addressing tidal traffic patterns, reducing costs, and ensuring high availability through elastic scaling and custom scheduling policies.
During the Chinese New Year holiday, the DeepSeek LLM attracted massive traffic that exhausted its inference servers' resources, highlighting the need for elastic LLM inference.
LLM inference traffic shows tidal patterns, with peaks and troughs that make GPU resource allocation difficult in on‑premise IDC environments. To address this, a hybrid‑cloud solution built on Alibaba Cloud Container Service for Edge (ACK Edge) was designed.
The overall architecture uses ACK Edge to manage both the cloud and edge GPU pools, and KServe to deploy the LLM service with elastic scaling.
Key technologies
ACK Edge – a cloud‑native platform that unifies management of cloud and edge Kubernetes clusters.
KServe – an open‑source model serving framework that supports auto‑scaling, scale‑to‑zero, and multiple model frameworks.
Elastic node pool – driven by the cluster‑autoscaler to add or remove nodes automatically.
ResourcePolicy – a custom scheduling policy that prioritises IDC resources during low load and falls back to cloud resources when needed.
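The "IDC first, cloud fallback" behaviour of ResourcePolicy can be sketched as a toy placement loop in Python. This is illustrative only: the real scheduling happens inside the ACK scheduler, and the pool names and capacities below are made up.

```python
from dataclasses import dataclass

@dataclass
class NodePool:
    name: str       # stands in for the nodepool-id label value
    capacity: int   # free GPU slots in this pool

def place(pods: int, units: list[NodePool]) -> dict[str, int]:
    """Fill pools in the order listed in the ResourcePolicy units:
    the IDC pool first, then the cloud elastic pool."""
    placement = {u.name: 0 for u in units}
    for _ in range(pods):
        for u in units:
            if placement[u.name] < u.capacity:
                placement[u.name] += 1
                break
    return placement

# 5 pods with 3 free GPUs on-premise: overflow spills to the elastic pool.
pools = [NodePool("idc", 3), NodePool("elastic", 10)]
print(place(5, pools))  # {'idc': 3, 'elastic': 2}
```

With the `prefer` strategy, pods only land on the second unit once the first is full, which is exactly what keeps cloud spend at zero during traffic troughs.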
Quick practice
1. Prepare the cluster: create an ACK Edge cluster and an elastic node pool, install KServe, configure the Arena client, and enable GPU monitoring.
2. Prepare the model data in OSS/NAS.
3. Define a ResourcePolicy CRD to set the scheduling order (IDC first, then cloud elastic pool).
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: qwen-chat
  namespace: default
spec:
  selector:
    app: isvc.qwen-predictor
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npxxxxxx # IDC pool
  - resource: elastic
    nodeSelector:
      alibabacloud.com/nodepool-id: npxxxxxy # elastic pool

4. Deploy the LLM service with a single Arena command that uses KServe, sets GPU‑utilisation‑based scaling, and specifies resource limits.
arena serve kserve \
--name=qwen-chat \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
--scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
--scale-target=50 \
--min-replicas=1 \
--max-replicas=3 \
--gpus=1 \
--cpu=4 \
--memory=12Gi \
--data="llm-model:/mnt/models/Qwen" \
"python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

5. Test the service with a curl request, then simulate a traffic spike with the hey load‑testing tool.
curl -H "Host: qwen-chat-default.example.com" \
-H "Content-Type: application/json" \
http://xx.xx.xx.xx:80/v1/chat/completions \
-X POST \
-d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}],"max_tokens":512,"temperature":0.7,"top_p":0.9,"seed":10,"stop":["<|endoftext|>","<|im_end|>","<|im_start|>"]}'

hey -z 2m -c 5 \
-m POST -host qwen-chat-default.example.com \
-H "Content-Type: application/json" \
-d '{"model":"qwen","messages":[{"role":"user","content":"Just a test"}],"max_tokens":10,"temperature":0.7,"top_p":0.9,"seed":10}' \
http://xx.xx.xx.xx:80/v1/chat/completions

During the load test, GPU utilisation exceeds the 50% threshold, triggering the HPA to scale out the pods; pending pods then cause the elastic node pool to provision additional cloud GPU nodes, demonstrating seamless scaling from edge to cloud.
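The scale-out decision during the load test follows the standard Kubernetes HPA rule, clamped to the replica bounds from the Arena command above. A minimal sketch (the utilisation numbers are hypothetical):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 3) -> int:
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    clamped to the [--min-replicas, --max-replicas] range (1..3 here)."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# SM utilisation spikes to 90% against the 50% --scale-target.
print(desired_replicas(current=1, metric=90, target=50))  # 2
print(desired_replicas(current=2, metric=95, target=50))  # 3 (capped at --max-replicas)
```

Any replica the HPA adds beyond the IDC pool's free GPUs stays Pending, which is the signal the cluster-autoscaler uses to provision cloud nodes in the elastic pool.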
Conclusion
The tidal nature of LLM inference traffic can be addressed with a hybrid‑cloud elastic solution on ACK Edge, which dynamically balances on‑premise and cloud resources, reduces operating costs, and ensures high availability.