Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes
This guide shows how to overcome on-premises compute limits by registering a local Kubernetes cluster with Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek-R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.
Background
DeepSeek-R1 is the first-generation reasoning model from DeepSeek. It achieves strong results on mathematical reasoning, programming contests, creative writing and other tasks, and is available in distilled sizes (1.5B, 7B, 8B, 14B, 32B and 70B) that outperform many open-source alternatives.
ACK One Registered Cluster
Alibaba Cloud ACK One lets you register an on-premises or third-party-cloud Kubernetes cluster with the Alibaba Cloud Container Service platform, enabling seamless scaling of compute resources.
ACS GPU Compute
Container Computing Service (ACS) provides serverless GPU compute that can be attached to the registered cluster. Adding the labels alibabacloud.com/acs="true" and alibabacloud.com/compute-class=gpu (and a GPU model series label) directs workloads to ACS GPU nodes.
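As a sketch of where these labels go, a Deployment's pod template might carry them like this (the workload name, container image, and GPU model series value are placeholders, not values from this guide):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo                 # placeholder workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      labels:
        app: gpu-demo
        # The three labels that direct this pod to ACS GPU compute:
        alibabacloud.com/acs: "true"
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/gpu-model-series: <gpu-series>   # series available in your region
    spec:
      containers:
        - name: main
          image: <your-image>    # placeholder image
```

Because the labels sit on the pod template rather than on the Deployment itself, every replica the Deployment creates is scheduled onto ACS GPU capacity.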
vLLM Inference Framework
vLLM is an efficient large‑language‑model serving framework that supports DeepSeek‑R1 via PagedAttention, dynamic batching and quantization. Repository: https://github.com/vllm-project/vllm
Step‑by‑Step Deployment
Prepare the model files
Download the DeepSeek-R1-Distill-Qwen-7B repository from ModelScope using git-lfs:
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B
git lfs pull
cd ..
Upload the model directory to an OSS bucket (install ossutil first):
ossutil mkdir oss://<bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
Create PersistentVolume (PV) and PersistentVolumeClaim (PVC)
Define a Secret with OSS credentials, a PV that uses the ossplugin.csi.alibabacloud.com driver, and a PVC that binds to the PV. Example YAML:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
    volumeAttributes:
      bucket: <bucket-name>
      url: <oss-endpoint>
      path: /models/DeepSeek-R1-Distill-Qwen-7B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
Note that the PV carries the alicloud-pvname label so that the PVC's selector binds to it. Save the manifest (for example as llm-model.yaml) and apply it:
kubectl apply -f llm-model.yaml
Deploy the model service
Check the cluster nodes; ACS GPU workloads run on virtual-kubelet nodes:
kubectl get nodes -o wide
Use the arena CLI to create a custom serving job that runs vLLM on the model stored in the PVC:
arena serve custom \
--name=deepseek-r1 \
--version=v1 \
--gpus=1 \
--cpu=8 \
--memory=32Gi \
--replicas=1 \
--env-from-secret=akId=oss-secret \
--env-from-secret=akSecret=oss-secret \
--label=alibabacloud.com/acs="true" \
--label=alibabacloud.com/compute-class=gpu \
--label=alibabacloud.com/gpu-model-series=example-model \
--restful-port=8000 \
--readiness-probe-action="tcpSocket" \
--readiness-probe-action-option="port: 8000" \
--readiness-probe-option="initialDelaySeconds: 30" \
--readiness-probe-option="periodSeconds: 30" \
--image=registry-cn-hangzhou-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.6.6 \
--data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
"vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"
Expected creation output:
service/deepseek-r1-v1 created
deployment.apps/deepseek-r1-v1-custom-serving created
Verify the service
Check the job status:
arena serve get deepseek-r1
Port-forward the service to the local machine:
kubectl port-forward svc/deepseek-r1-v1 8000:8000
Send a test request:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"deepseek-r1","messages":[{"role":"user","content":"Hello, DeepSeek."}],"max_tokens":100,"temperature":0.7,"top_p":0.9,"seed":10}'
The response is a JSON object containing the model's answer.
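The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the port-forward above is running on localhost:8000; the payload mirrors the curl example, and build_payload/chat are illustrative helper names, not part of the deployed service:

```python
import json
import urllib.request


def build_payload(prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload for the deployed model."""
    return {
        "model": "deepseek-r1",  # must match --served-model-name in the arena command
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the vLLM server and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # vLLM's OpenAI-compatible endpoint returns choices[].message.content
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello, DeepSeek."))
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at the port-forwarded base URL should also work.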
Key Parameters
--label : Use the three labels shown above to request ACS GPU compute.
--image : Image containing vLLM (registry-cn-hangzhou-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.6.6).
--data : Mount the PVC to /model/DeepSeek-R1-Distill-Qwen-7B inside the container.
References
DeepSeek AI GitHub: https://github.com/deepseek-ai
vLLM GitHub: https://github.com/vllm-project/vllm
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.