Deploying DeepSeek R1 Model Inference on ACK Edge with Virtual Nodes and Serverless GPU

This article explains how to use Alibaba Cloud ACK Edge to manage on‑premise GPU resources and seamlessly fall back to cloud‑based ACS Serverless GPU via virtual nodes for deploying DeepSeek R1 inference, covering environment preparation, model download, storage setup, custom scheduling, and scaling strategies.

Alibaba Cloud Infrastructure

Feb 21, 2025

Alibaba Cloud ACK Edge clusters adopt a cloud‑edge integrated architecture, hosting the Kubernetes control plane in the cloud while IDC machines act as data‑plane nodes, enabling containerized management of existing on‑premise GPU resources and improving deployment efficiency.

With the rapid growth of AI large‑model services, ACK Edge has helped many customers manage IDC GPU machines and quickly deploy inference workloads. The full DeepSeek R1 model, however, is a large Mixture‑of‑Experts model: serving it requires at least eight GPUs, and newer GPU cards that support the FP8 format the model was trained in, which creates a resource challenge for IDC environments.

This guide demonstrates how to manage IDC GPU machines through ACK Edge and deploy the DeepSeek inference service using the ACK AI suite. The workflow prioritizes running inference Pods on IDC GPUs, and when those resources are insufficient, it automatically creates cloud‑based ACS Serverless GPU virtual nodes to run the Pods, achieving business scalability and cost optimization.

Solution Advantages

• Extreme elasticity: scales out at large scale within seconds to absorb traffic spikes.
• Fine‑grained cost control: pay‑as‑you‑go, with no servers to purchase.
• Rich elastic resources: supports CPU, GPU, and other instance types.

Usage Example

Prepare Environment

• Choose a region as the central region and create an ACK Edge cluster.
• Install the virtual‑node component (see component management documentation).
• Install KServe (see ack‑kserve component guide).
• Install Arena (see Arena client configuration).
• Deploy monitoring components and configure GPU metrics for auto‑scaling.
• Create an edge node pool in a dedicated VPC and add IDC resources to the pool.

Step 1: Download DeepSeek‑R1‑Distill‑Qwen‑7B model

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull
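Because the clone skips the smudge step, any weight file that `git lfs pull` failed to fetch remains a small text pointer instead of the real binary, and uploading such a stub would only surface later as a model load failure. The following is a minimal sketch of a pre-upload check. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical, and the sketch assumes the weights use the `.safetensors` extension.

```shell
# Hypothetical helper: succeeds if the file is still a Git LFS pointer stub.
# Pointer files are small text files whose first line starts with
# "version https://git-lfs.github.com/spec/v1".
is_lfs_pointer() {
    head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Hypothetical helper: print every unresolved pointer stub under a directory.
# Assumption: weight files use the .safetensors extension.
find_lfs_pointers() {
    find "$1" -type f -name '*.safetensors' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            echo "$f"
        fi
    done
}
```

Running `find_lfs_pointers ./DeepSeek-R1-Distill-Qwen-7B` from the parent directory should print nothing when every weight file was downloaded; any path it prints is a file to re-pull before uploading.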

Upload the model to OSS (create a bucket directory first):

ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B

Step 2: Create PV and PVC for the model

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/DeepSeek-R1-Distill-Qwen-7B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
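Before deploying the service, it can help to confirm that the claim actually bound to the volume, since a Pod that references an unbound PVC stays Pending. The sketch below polls the claim's phase with `kubectl`. The helper name `wait_pvc_bound`, the 2-second poll interval, and the 60-second default timeout are illustrative assumptions, not part of the original procedure.

```shell
# Hypothetical helper: poll until the PVC reports phase "Bound" or a timeout hits.
# Usage: wait_pvc_bound <pvc-name> [namespace] [timeout-seconds]
wait_pvc_bound() {
    pvc=$1
    ns=${2:-default}
    timeout=${3:-60}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # .status.phase is Pending, Bound, or Lost for a PersistentVolumeClaim
        phase=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)
        if [ "$phase" = "Bound" ]; then
            echo "PVC $pvc is Bound"
            return 0
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done
    echo "PVC $pvc not Bound after ${timeout}s (last phase: ${phase:-unknown})" >&2
    return 1
}
```

For the manifests above, the call would be `wait_pvc_bound llm-model default`.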

Step 3: Create a custom scheduling policy

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np*********
  - resource: eci
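The policy lists its units in priority order: the IDC node pool first, then the virtual node. The toy model below illustrates that intended fallback only. It is not the scheduler's real algorithm, the function name `place_replicas` is hypothetical, and it assumes each replica needs one GPU, matching the `--gpus=1` setting used in Step 4.

```shell
# Toy model (not the real scheduler): place N replicas, filling the IDC node
# pool unit first and overflowing to the virtual-node unit.
# Usage: place_replicas <replica-count> <free-idc-gpus>
place_replicas() {
    replicas=$1
    idc_free=$2
    i=1
    while [ "$i" -le "$replicas" ]; do
        if [ "$idc_free" -gt 0 ]; then
            echo "replica-$i -> ecs (IDC node pool)"
            idc_free=$((idc_free - 1))
        else
            echo "replica-$i -> eci (virtual node)"
        fi
        i=$((i + 1))
    done
}
```

With one free IDC GPU and three replicas, `place_replicas 3 1` prints `replica-1 -> ecs (IDC node pool)` followed by two `-> eci (virtual node)` lines.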

Step 4: Deploy the model with Arena/KServe

arena serve kserve \
    --name=deepseek \
    --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
    --annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1 \
    --max-replicas=3 \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"

Check node status:

kubectl get nodes -owide

The expected output shows one IDC node (idc001) with a V100 GPU and one virtual node.

Query the inference service:

arena serve get deepseek

The expected output confirms that the Pod is scheduled on the IDC node.
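Once the Pod is running, a single request can confirm that the endpoint answers before any load is applied. The sketch below sends one chat completion through the ingress, reusing the Host header and address placeholders from the load test in Step 5. The helper names `chat_payload` and `smoke_test` are hypothetical, and the `max_tokens` value of 64 is an arbitrary choice to keep the reply short.

```shell
# Hypothetical helper: build a minimal chat-completions request body for the
# served model name "deepseek-r1".
# $1: prompt text (in this sketch it must not contain double quotes or backslashes)
chat_payload() {
    printf '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 64}' "$1"
}

# Hypothetical helper: send one request through the ingress.
# $1: <idc-node-ip>:<ingress-svc-nodeport>
smoke_test() {
    curl -sS -H "Host: deepseek-default.example.com" \
        -H "Content-Type: application/json" \
        -d "$(chat_payload 'Say this is a test!')" \
        "http://$1/v1/chat/completions"
}
```

Calling `smoke_test <idc-node-ip>:<ingress-svc-nodeport>` should return a JSON body containing a `choices` array if the service is healthy.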

Step 5: Simulate traffic spikes to trigger cloud‑side scaling

hey -z 5m -c 5 \
    -m POST -host deepseek-default.example.com \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
    http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions

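When the GPU metric (DCGM_CUSTOM_PROCESS_SM_UTIL) exceeds the scale target of 50, the HPA adds replicas, up to the configured maximum of 3. Because the GPU on the IDC node is already occupied, the scheduling policy from Step 3 places the new Pods on the virtual node.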
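The replica count the HPA aims for follows the standard Kubernetes rule: desired = ceil(current replicas × current metric / target), clamped to the configured minimum and maximum. The sketch below reproduces that arithmetic for the values used in Step 4 (target 50, min 1, max 3). The function name `desired_replicas` is hypothetical, and the sketch deliberately ignores the controller's tolerance band and stabilization windows.

```shell
# Simplified HPA rule: desired = ceil(current * metric / target), clamped to [min, max].
# Usage: desired_replicas <current-replicas> <avg-metric-value> <target> <min> <max>
desired_replicas() {
    awk -v cur="$1" -v val="$2" -v tgt="$3" -v min="$4" -v max="$5" 'BEGIN {
        min = min + 0; max = max + 0   # force numeric comparison
        raw = cur * val / tgt
        d = int(raw)
        if (d < raw) d++               # ceiling for positive values
        if (d < min) d = min
        if (d > max) d = max
        print d
    }'
}
```

For example, `desired_replicas 1 90 50 1 3` prints `2`: one replica at 90 against a target of 50 gives a ratio of 1.8, which rounds up to two replicas.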

Final summary: ACK Edge provides a cloud‑native, edge‑integrated Kubernetes platform that manages IDC, ENS, and cross‑region ECS resources, reducing operational complexity while seamlessly leveraging cloud elasticity. Combining ACK Edge with virtual nodes enables fine‑grained cost control and reliable scaling for AI inference workloads.
