Deploying DeepSeek R1 Model Inference on ACK Edge with Virtual Nodes and Serverless GPU
This article explains how to use Alibaba Cloud ACK Edge to manage on‑premise GPU resources and seamlessly fall back to cloud‑based ACS Serverless GPU via virtual nodes for deploying DeepSeek R1 inference, covering environment preparation, model download, storage setup, custom scheduling, and scaling strategies.
Alibaba Cloud ACK Edge clusters adopt a cloud‑edge integrated architecture, hosting the Kubernetes control plane in the cloud while IDC machines act as data‑plane nodes, enabling containerized management of existing on‑premise GPU resources and improving deployment efficiency.
With the rapid growth of AI large-model services, ACK Edge has helped many customers manage IDC GPU machines and quickly deploy inference workloads. The full DeepSeek R1 model, however, uses a Mixture-of-Experts architecture that needs at least eight GPUs, and its FP8 precision calls for newer GPU cards with FP8 support, which is a resource challenge for IDC environments.
This guide demonstrates how to manage IDC GPU machines through ACK Edge and deploy the DeepSeek inference service using the ACK AI suite. The workflow prioritizes running inference Pods on IDC GPUs, and when those resources are insufficient, it automatically creates cloud‑based ACS Serverless GPU virtual nodes to run the Pods, achieving business scalability and cost optimization.
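The "IDC first, cloud as overflow" behavior described above can be illustrated with a small toy model. This is only a sketch: `place_pods` is a hypothetical helper, the virtual node is assumed to have unbounded capacity, and none of this reflects the actual ACK scheduler implementation.

```python
def place_pods(num_pods, idc_free_gpus, gpus_per_pod=1):
    """Toy model of 'IDC first, virtual node as overflow' placement.

    Hypothetical helper for illustration only; it is not the ACK scheduler.
    """
    placements = []
    for _ in range(num_pods):
        if idc_free_gpus >= gpus_per_pod:
            # Prefer on-premises GPUs while any are still free.
            idc_free_gpus -= gpus_per_pod
            placements.append("idc")
        else:
            # Assumption: the virtual node offers unbounded serverless capacity.
            placements.append("virtual-node")
    return placements


if __name__ == "__main__":
    # One free IDC GPU, three replicas requested.
    print(place_pods(num_pods=3, idc_free_gpus=1))
    # prints ['idc', 'virtual-node', 'virtual-node']
```

With a single free IDC GPU, the first replica stays on-premises and later replicas overflow to the cloud, which is the pattern the rest of the walkthrough reproduces with real components.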
Solution Advantages
• Extreme elasticity: scales out large amounts of capacity within seconds to absorb traffic spikes.
• Fine-grained cost control: pay-as-you-go, with no servers to purchase.
• Rich elastic resources: supports CPU, GPU, and other instance types.
Usage Example
Prepare Environment
• Choose a region as the central region and create an ACK Edge cluster.
• Install the virtual-node component (see component management documentation).
• Install KServe (see ack-kserve component guide).
• Install Arena (see Arena client configuration).
• Deploy monitoring components and configure GPU metrics for auto-scaling.
• Create an edge node pool in a dedicated VPC and add IDC resources to the pool.
Step 1: Download DeepSeek‑R1‑Distill‑Qwen‑7B model
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull

Upload the model to OSS (create a bucket directory first):
# Run from the directory that contains the DeepSeek-R1-Distill-Qwen-7B folder
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B

Step 2: Create PV and PVC for the model
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/DeepSeek-R1-Distill-Qwen-7B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Step 3: Create a custom scheduling policy
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np*********
  - resource: eci

Step 4: Deploy the model with Arena/KServe
arena serve kserve \
--name=deepseek \
--annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
--annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
--gpus=1 \
--cpu=4 \
--memory=12Gi \
--scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
--scale-target=50 \
--min-replicas=1 \
--max-replicas=3 \
--data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
"vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"

Check node status:
kubectl get nodes -owide
The expected output shows one IDC node (idc001) with a V100 GPU and one virtual node.
Query the inference service:
arena serve get deepseek
The expected output confirms that the Pod is scheduled on the IDC node.
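As a rough sanity check on why one GPU per replica is plausible for the distilled model when it is served with --dtype=half, the weight footprint can be estimated from the parameter count. The sketch below rests on stated assumptions: about 7.6 billion parameters for the distilled model and a 32 GiB GPU are illustrative values that do not come from this article, and the helper names are hypothetical. It ignores activations and runtime overhead.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate weight size; half precision stores 2 bytes per parameter."""
    return num_params * bytes_per_param / GIB


def remaining_budget_gib(gpu_memory_gib, utilization, weights_gib):
    """Memory left for KV cache and other buffers under a utilization cap."""
    return gpu_memory_gib * utilization - weights_gib


if __name__ == "__main__":
    # Assumptions (not from the article): ~7.6e9 parameters, 32 GiB GPU.
    weights = weight_memory_gib(7.6e9)
    # 0.95 mirrors the --gpu-memory-utilization value in the arena command.
    left = remaining_budget_gib(32, 0.95, weights)
    print(f"weights ~{weights:.1f} GiB, remaining ~{left:.1f} GiB")
```

If the real GPU has less memory than assumed, the remaining budget shrinks accordingly, and a smaller --max-model-len may be needed.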
Step 5: Simulate traffic spikes to trigger cloud‑side scaling
hey -z 5m -c 5 \
-m POST -host deepseek-default.example.com \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions

When GPU utilization exceeds the threshold, the HPA creates additional replicas on the virtual node.
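The scale-out decision can be approximated with the standard Kubernetes HPA rule: desired replicas equal the current replica count multiplied by the ratio of the observed metric to the target, rounded up and clamped to the configured bounds. The sketch below is a simplification under stated assumptions: `desired_replicas` is a hypothetical helper, and it omits the HPA's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Simplified HPA rule: ceil(current * metric / target), then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Target 50 and bounds 1..3 mirror --scale-target, --min-replicas and
    # --max-replicas in the arena command; the observed values are made up.
    print(desired_replicas(1, 80, 50, 1, 3))  # prints 2
    print(desired_replicas(2, 90, 50, 1, 3))  # prints 3 (capped at max)
```

With one replica at 80% utilization against a target of 50, the rule asks for two replicas. If the IDC node has no free GPU left, the extra replica is the one that lands on the virtual node.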
Final summary: ACK Edge provides a cloud‑native, edge‑integrated Kubernetes platform that manages IDC, ENS, and cross‑region ECS resources, reducing operational complexity while seamlessly leveraging cloud elasticity. Combining ACK Edge with virtual nodes enables fine‑grained cost control and reliable scaling for AI inference workloads.
