Hybrid Cloud Elastic LLM Inference Solution with ACK Edge and KServe
This article presents a hybrid‑cloud solution that uses ACK Edge and KServe to dynamically allocate on‑premise and cloud GPU resources for large‑language‑model inference, addressing tidal traffic patterns, reducing costs, and ensuring high availability through elastic scaling and custom scheduling policies.
During the Chinese New Year holiday, the DeepSeek LLM attracted massive traffic that exhausted its inference servers' resources, highlighting the need for elastic LLM inference.
LLM inference traffic shows tidal patterns, with peaks and troughs that make GPU resource allocation difficult in on‑premise IDC environments. To address this, a hybrid‑cloud solution built on Alibaba Cloud Container Service for Edge (ACK Edge) was designed.
The overall architecture uses ACK Edge to manage both the cloud and edge GPU pools, and KServe to deploy the LLM service with elastic scaling.
Key technologies
ACK Edge – a cloud‑native platform that unifies management of cloud and edge Kubernetes clusters.
KServe – an open‑source model serving framework that supports auto‑scaling, scale‑to‑zero, and multiple model frameworks.
Elastic node pool – driven by the cluster‑autoscaler to add or remove nodes automatically.
ResourcePolicy – a custom scheduling policy that prioritises IDC resources during low load and falls back to cloud resources when needed.
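The "IDC first, cloud fallback" behaviour of ResourcePolicy can be sketched as a toy placement loop in Python. This is illustrative only: the real scheduling happens inside the ACK scheduler, and the pool names and capacities below are made up.

```python
from dataclasses import dataclass

@dataclass
class NodePool:
    name: str       # stands in for the nodepool-id label value
    capacity: int   # free GPU slots in this pool

def place(pods: int, units: list[NodePool]) -> dict[str, int]:
    """Fill pools in the order listed in the ResourcePolicy units:
    the IDC pool first, then the cloud elastic pool."""
    placement = {u.name: 0 for u in units}
    for _ in range(pods):
        for u in units:
            if placement[u.name] < u.capacity:
                placement[u.name] += 1
                break
    return placement

# 5 pods with 3 free GPUs on-premise: overflow spills to the elastic pool.
pools = [NodePool("idc", 3), NodePool("elastic", 10)]
print(place(5, pools))  # {'idc': 3, 'elastic': 2}
```

With the `prefer` strategy, pods only land on the second unit once the first is full, which is exactly what keeps cloud spend at zero during traffic troughs.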
Quick practice
1. Prepare the cluster: create an ACK Edge cluster and an elastic node pool, install KServe, configure the Arena client, and enable GPU monitoring.
2. Prepare the model data in OSS/NAS.
3. Define a ResourcePolicy CRD to set the scheduling order (IDC first, then cloud elastic pool).
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: qwen-chat
  namespace: default
spec:
  selector:
    app: isvc.qwen-predictor
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npxxxxxx # IDC pool
  - resource: elastic
    nodeSelector:
      alibabacloud.com/nodepool-id: npxxxxxy # elastic pool

4. Deploy the LLM service with a single Arena command that uses KServe, sets GPU‑utilisation‑based scaling, and specifies resource limits.
arena serve kserve \
--name=qwen-chat \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
--scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
--scale-target=50 \
--min-replicas=1 \
--max-replicas=3 \
--gpus=1 \
--cpu=4 \
--memory=12Gi \
--data="llm-model:/mnt/models/Qwen" \
"python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

5. Test the service with a curl request, then simulate a traffic spike with the hey load‑testing tool.
curl -H "Host: qwen-chat-default.example.com" \
-H "Content-Type: application/json" \
http://xx.xx.xx.xx:80/v1/chat/completions \
-X POST \
-d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}],"max_tokens":512,"temperature":0.7,"top_p":0.9,"seed":10,"stop":["<|endoftext|>","<|im_end|>","<|im_start|>"]}'

hey -z 2m -c 5 \
-m POST -host qwen-chat-default.example.com \
-H "Content-Type: application/json" \
-d '{"model":"qwen","messages":[{"role":"user","content":"Just a test"}],"max_tokens":10,"temperature":0.7,"top_p":0.9,"seed":10}' \
http://xx.xx.xx.xx:80/v1/chat/completions

During the load test, GPU utilisation exceeds the 50% threshold, triggering the HPA to scale out the pods; pending pods then cause the elastic node pool to provision additional cloud GPU nodes, demonstrating seamless scaling from edge to cloud.
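The scale-out decision during the load test follows the standard Kubernetes HPA rule, clamped to the replica bounds from the Arena command above. A minimal sketch (the utilisation numbers are hypothetical):

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 3) -> int:
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    clamped to the [--min-replicas, --max-replicas] range (1..3 here)."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# SM utilisation spikes to 90% against the 50% --scale-target.
print(desired_replicas(current=1, metric=90, target=50))  # 2
print(desired_replicas(current=2, metric=95, target=50))  # 3 (capped at --max-replicas)
```

Any replica the HPA adds beyond the IDC pool's free GPUs stays Pending, which is the signal the cluster-autoscaler uses to provision cloud nodes in the elastic pool.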
Conclusion
The tidal nature of LLM inference traffic can be addressed with a hybrid‑cloud elastic solution on ACK Edge, which dynamically balances on‑premise and cloud resources, reduces operating costs, and ensures high availability.