Deploy DeepSeek‑R1 LLM on Alibaba Cloud ACK One with ACS GPU in Minutes
This guide walks you through deploying the DeepSeek‑R1 large language model as an inference service on Alibaba Cloud ACK One registered clusters using ACS GPU compute. It covers model preparation, OSS storage setup, PersistentVolume configuration, Arena-based service deployment, and verification, with concrete commands and parameters throughout.
Background
DeepSeek‑R1 is a high‑performance large language model (LLM) optimized for mathematical reasoning, coding challenges, and general question answering. Running it on‑premises is often constrained by local GPU capacity; with Alibaba Cloud ACK One registered clusters and ACS GPU compute, you can instead scale inference workloads into the cloud.
Prerequisites
Access to Alibaba Cloud Container Service (ACK) console and ACS GPU resources.
git‑lfs and ossutil installed.
Arena client configured.
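Before starting, it can help to confirm the client tooling is actually on PATH. A minimal sketch (it only checks tool presence, not versions or credentials):

```shell
#!/bin/sh
# Report which of the required client tools are installed.
check_tools() {
  for tool in git git-lfs ossutil arena kubectl; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
    fi
  done
}

check_tools
```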
Step 1 – Prepare the model
Clone the DeepSeek‑R1‑Distill‑Qwen‑7B repository from ModelScope and pull the LFS files.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B
git lfs pull
Upload the model directory to an OSS bucket.
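Before pushing several gigabytes of weights to OSS, it is worth verifying that the LFS pull actually materialized the weight files rather than leaving pointer stubs. A minimal sketch, assuming the usual Hugging Face / ModelScope layout (config.json plus *.safetensors or *.bin shards — the file names and the check_model_dir helper are illustrative, not part of any tool):

```python
from pathlib import Path


def check_model_dir(model_dir: str, min_weight_bytes: int = 1024) -> list[str]:
    """Return a list of problems found in a downloaded model directory.

    Files smaller than min_weight_bytes are likely un-pulled git-lfs
    pointer stubs rather than real weight shards.
    """
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json missing")
    weights = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    if not weights:
        problems.append("no weight shards (*.safetensors / *.bin) found")
    for w in weights:
        if w.stat().st_size < min_weight_bytes:
            problems.append(f"{w.name} looks like an LFS pointer stub")
    return problems
```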
ossutil mkdir oss://my-bucket/models/DeepSeek-R1-Distill-Qwen-7B
cd ..
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://my-bucket/models/DeepSeek-R1-Distill-Qwen-7B
Step 2 – Create PersistentVolume and PersistentVolumeClaim
Define a static OSS‑backed PersistentVolume (PV) using the ossplugin.csi.alibabacloud.com driver and a matching PersistentVolumeClaim (PVC) named llm-model. Example YAML:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: my-bucket
      url: oss-cn-hangzhou-internal.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/DeepSeek-R1-Distill-Qwen-7B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
Step 3 – Deploy the inference service with Arena
Run a custom serving job that uses ACS GPU resources. The command specifies GPU count, CPU, memory, required labels, Docker image, and mounts the PVC.
arena serve custom \
--name=deepseek-r1 \
--version=v1 \
--gpus=1 \
--cpu=8 \
--memory=32Gi \
--replicas=1 \
--env-from-secret=akId=oss-secret \
--env-from-secret=akSecret=oss-secret \
--label=alibabacloud.com/acs="true" \
--label=alibabacloud.com/compute-class=gpu \
--label=alibabacloud.com/gpu-model-series=example-model \
--restful-port=8000 \
--readiness-probe-action="tcpSocket" \
--readiness-probe-action-option="port: 8000" \
--readiness-probe-option="initialDelaySeconds: 30" \
--readiness-probe-option="periodSeconds: 30" \
--image=registry-cn-hangzhou-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.6.6 \
--data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
"vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"
Key labels required for ACS GPU:
--label=alibabacloud.com/acs="true" --label=alibabacloud.com/compute-class=gpu --label=alibabacloud.com/gpu-model-series=example-model
Step 4 – Verify the service
Check the job status:
arena serve get deepseek-r1
Confirm the pod is scheduled on a virtual node:
kubectl get po -owide | grep deepseek-r1-v1
Port‑forward the service to the local machine:
kubectl port-forward svc/deepseek-r1-v1 8000:8000
Send a test request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"deepseek-r1","messages":[{"role":"user","content":"Hello, DeepSeek."}],"max_tokens":100,"temperature":0.7,"top_p":0.9,"seed":10}'
The response is a JSON object containing the model’s answer.
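The same request can be scripted against the OpenAI-compatible endpoint that vLLM exposes. A minimal stdlib-only sketch (the build_request, extract_answer, and ask helpers plus the API_URL constant are ours, not part of vLLM or Arena; the URL assumes the port-forward above is running):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # via kubectl port-forward


def build_request(prompt: str, model: str = "deepseek-r1", max_tokens: int = 100) -> bytes:
    """Build the same JSON body the curl example sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }
    return json.dumps(body).encode("utf-8")


def extract_answer(response_json: dict) -> str:
    """Pull the assistant message out of an OpenAI-style chat completion."""
    return response_json["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """POST the prompt to the port-forwarded service and return the answer."""
    req = urllib.request.Request(
        API_URL,
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return extract_answer(json.load(resp))
```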