Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

Background

DeepSeek-R1 is DeepSeek's first-generation reasoning model. It achieves strong results on mathematical reasoning, programming contests, creative writing, and other tasks, and is available in distilled sizes (7B, 14B, 32B, and 70B, among others) that outperform many open-source alternatives.

ACK One Registered Cluster

Alibaba Cloud ACK One lets you register an on-premises or other-cloud Kubernetes cluster with the Alibaba Cloud Container Service platform, enabling seamless scaling into cloud compute resources.

ACS GPU Compute

Container Compute Service (ACS) provides serverless GPU compute that can be attached to the registered cluster. Adding the labels alibabacloud.com/acs="true" and alibabacloud.com/compute-class=gpu (plus a GPU model series label) directs workloads to ACS GPU compute.
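As a sketch (assuming you are writing a plain Deployment yourself rather than using the arena CLI), the scheduling hint is expressed as pod-template labels; <example-model> is a placeholder for the GPU model series you request:

```yaml
# Sketch: pod-template labels that steer a workload onto ACS GPU compute.
# <example-model> is a placeholder for the GPU model series you request.
template:
  metadata:
    labels:
      alibabacloud.com/acs: "true"
      alibabacloud.com/compute-class: gpu
      alibabacloud.com/gpu-model-series: <example-model>
```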

vLLM Inference Framework

vLLM is an efficient large‑language‑model serving framework that supports DeepSeek‑R1 via PagedAttention, dynamic batching and quantization. Repository: https://github.com/vllm-project/vllm
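Independent of Kubernetes, vLLM can be tried directly on a GPU machine; its serve command exposes an OpenAI-compatible HTTP API. A minimal local sketch (assuming a CUDA-capable GPU and that the model is pulled from its default hub):

```shell
# Local sketch (outside Kubernetes): install vLLM and serve the model
# over an OpenAI-compatible HTTP API on port 8000.
pip install vllm
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --port 8000
```

This is the same serve invocation that the Kubernetes deployment below wraps, just pointed at a hub model ID instead of a mounted volume path.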

Step‑by‑Step Deployment

Prepare the model files

Download the DeepSeek-R1-Distill-Qwen-7B repository from ModelScope using git-lfs:

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B
git lfs pull

Upload the model directory to an OSS bucket (install ossutil first):

ossutil mkdir oss://<bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
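After the upload finishes, it is worth confirming the objects landed under the prefix the PV will mount (ossutil ls lists objects under a prefix):

```shell
# List the uploaded model files; the prefix must match the PV's `path` attribute
ossutil ls oss://<bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B/
```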

Create the PersistentVolume (PV) and PersistentVolumeClaim (PVC)

Define a Secret with OSS credentials, a PV that uses the ossplugin.csi.alibabacloud.com driver, and a PVC that binds to the PV. Example YAML:

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
    volumeAttributes:
      bucket: <bucket-name>
      url: <oss-endpoint>
      path: /models/DeepSeek-R1-Distill-Qwen-7B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

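The manifest above can then be applied with kubectl (the filename llm-model.yaml is illustrative). Because the PVC's selector matches the statically provisioned PV by label, the claim should report Bound almost immediately:

```shell
# Apply the Secret, PV, and PVC, then confirm the claim binds
kubectl apply -f llm-model.yaml

# STATUS should read Bound for both the PV and the PVC
kubectl get pv llm-model
kubectl get pvc llm-model
```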
Deploy the model service

Check the cluster nodes (virtual-kubelet nodes host the ACS GPU compute):

kubectl get nodes -o wide

Use the arena CLI to create a custom serving job that runs vLLM on the model stored in the PVC:

arena serve custom \
  --name=deepseek-r1 \
  --version=v1 \
  --gpus=1 \
  --cpu=8 \
  --memory=32Gi \
  --replicas=1 \
  --env-from-secret=akId=oss-secret \
  --env-from-secret=akSecret=oss-secret \
  --label=alibabacloud.com/acs="true" \
  --label=alibabacloud.com/compute-class=gpu \
  --label=alibabacloud.com/gpu-model-series=<example-model> \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=registry-cn-hangzhou-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.6.6 \
  --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
  "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

Expected creation output:

service/deepseek-r1-v1 created
deployment.apps/deepseek-r1-v1-custom-serving created

Verify the service

Check the job status:

arena serve get deepseek-r1

Port-forward the service to the local machine:

kubectl port-forward svc/deepseek-r1-v1 8000:8000

Send a test request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-r1","messages":[{"role":"user","content":"Hello, DeepSeek."}],"max_tokens":100,"temperature":0.7,"top_p":0.9,"seed":10}'

The response is a JSON object containing the model’s answer.
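Because vLLM exposes an OpenAI-compatible API, you can also list the served models to confirm that the name passed via --served-model-name took effect:

```shell
# The response should list "deepseek-r1" among the available model IDs
curl http://localhost:8000/v1/models
```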

Key Parameters

--label : Use the three labels shown above to request ACS GPU compute.

--image : Image containing vLLM (registry-cn-hangzhou-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.6.6).

--data : Mount the PVC to /model/DeepSeek-R1-Distill-Qwen-7B inside the container.

References

DeepSeek AI GitHub: https://github.com/deepseek-ai

vLLM GitHub: https://github.com/vllm-project/vllm

Tags: model deployment, vLLM, DeepSeek, ACK One, ACS GPU

Written by Alibaba Cloud Native
