Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide
This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.
Background
Alibaba Cloud Container Compute Service (ACS) provides on‑demand GPU compute without low‑level hardware management, making it suitable for large language model (LLM) inference. The QwQ‑32B model has 32 billion parameters yet achieves benchmark scores comparable to DeepSeek‑R1, a model with 671 billion parameters.
Prerequisites
Alibaba Cloud account with real‑name verification.
ACS cluster in a GPU‑enabled region/zone.
kubectl configured to access the cluster.
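A quick way to confirm kubectl is wired to the right cluster before you start (a sanity check only; the output will vary with your ACS setup):
# The current context should point at the ACS cluster
kubectl cluster-info
kubectl get nodes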
GPU Instance Specification
The model weights alone require roughly 60 GiB of VRAM (32 × 10⁹ parameters × 2 bytes per BF16 parameter ≈ 64 GB). To leave headroom for the KV cache and runtime buffers, a GPU with at least 80 GiB of VRAM, 16 vCPUs, and 128 GiB of RAM is recommended.
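For a back‑of‑the‑envelope check of that figure (illustrative arithmetic only; real usage adds KV cache, activations, and CUDA overhead on top):
# BF16 weights use 2 bytes per parameter:
# 32 × 10^9 params × 2 bytes = 64 GB ≈ 59.6 GiB
echo "scale=1; 32 * 10^9 * 2 / 1024^3" | bc   # prints ≈ 59.6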
Step 1 – Prepare Model Data
Download the model from ModelScope using git‑lfs and upload the files to an OSS bucket.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull
cd ..
Create the target directory in OSS and copy the model files:
ossutil mkdir oss://<your-bucket-name>/models/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<your-bucket-name>/models/QwQ-32B
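Before moving on, it is worth listing the bucket path to confirm the upload completed (exact output varies by ossutil version):
ossutil ls oss://<your-bucket-name>/models/QwQ-32B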
Step 2 – Create PV and PVC
Define a Secret with OSS credentials, a PersistentVolume (PV) that uses the OSS CSI driver, and a PersistentVolumeClaim (PVC) that binds to the PV.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/QwQ-32B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
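Save the manifest (for example as model-storage.yaml; the filename is only illustrative) and apply it, then check that the PVC binds to the PV:
kubectl apply -f model-storage.yaml
# Both should report STATUS Bound once the selector matches the PV label
kubectl get pv llm-model
kubectl get pvc llm-model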
Step 3 – Deploy the Model with vLLM
Apply a Deployment that runs the vLLM container and a Service exposing port 8000.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b
  labels:
    app: qwq-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/gpu-model-series: <example-model>
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
      containers:
        - name: vllm
          image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
          command: ["sh", "-c", "vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "16"
              memory: 128Gi
          volumeMounts:
            - name: model
              mountPath: /models/QwQ-32B
            - name: dshm
              mountPath: /dev/shm
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-v1
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwq-32b
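Apply the manifest (saved here as qwq-32b.yaml, an assumed filename) and wait for the pod to pull the image and load the weights, which can take several minutes. You can then smoke‑test the OpenAI‑compatible endpoint from inside the cluster (the example assumes the default namespace):
kubectl apply -f qwq-32b.yaml
kubectl rollout status deployment/qwq-32b
# List served models from a throwaway curl pod
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://qwq-32b-v1:8000/v1/models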
Step 4 – Deploy OpenWebUI
OpenWebUI provides a web front‑end that forwards requests to the vLLM service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
        - name: openwebui
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/open-webui:main
          env:
            - name: ENABLE_OPENAI_API
              value: "True"
            - name: ENABLE_OLLAMA_API
              value: "False"
            - name: OPENAI_API_BASE_URL
              value: http://qwq-32b-v1:8000/v1
            - name: ENABLE_AUTOCOMPLETE_GENERATION
              value: "False"
            - name: ENABLE_TAGS_GENERATION
              value: "False"
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: data-volume
              mountPath: /app/backend/data
      volumes:
        - name: data-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
  labels:
    app: openwebui
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: openwebui
Step 5 – Verify the Inference Service
Port‑forward the OpenWebUI service to a local port and open the UI in a browser:
kubectl port-forward svc/openwebui 8080:8080
Then navigate to http://localhost:8080, log in, and send prompts to the model.
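You can also bypass the UI and query the model directly over the OpenAI‑compatible API; the prompt and parameters below are only examples:
kubectl port-forward svc/qwq-32b-v1 8000:8000
# In a second terminal:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq-32b", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'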
Optional Step 6 – Benchmark the Service
Deploy a benchmark pod that mounts the model volume, download the ShareGPT dataset inside it, and run the vLLM benchmark script.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-benchmark
  labels:
    app: vllm-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-benchmark
  template:
    metadata:
      labels:
        app: vllm-benchmark
    spec:
      volumes:
        - name: llm-model
          persistentVolumeClaim:
            claimName: llm-model
      containers:
        - name: vllm-benchmark
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-benchmark:v1
          command: ["sh", "-c", "sleep inf"]
          volumeMounts:
            - name: llm-model
              mountPath: /models/QwQ-32B
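Apply the manifest (saved here as vllm-benchmark.yaml, an assumed filename) and wait for the pod to become Ready:
kubectl apply -f vllm-benchmark.yaml
kubectl get pod -l app=vllm-benchmark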
Enter the pod, install modelscope, download the ShareGPT dataset, and run the benchmark:
# Enter the benchmark pod
PODNAME=$(kubectl get po -o custom-columns=":metadata.name" | grep vllm-benchmark)
kubectl exec -it $PODNAME -- bash
# Install modelscope and download data
pip3 install modelscope
modelscope download --dataset gliang1001/ShareGPT_V3_unfiltered_cleaned_split ShareGPT_V3_unfiltered_cleaned_split.json --local_dir /root/
# Run benchmark
python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/QwQ-32B \
  --served-model-name qwq-32b \
  --trust-remote-code \
  --dataset-name random \
  --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 4096 \
  --random-output-len 512 \
  --random-range-ratio 1 \
  --num-prompts 80 \
  --max-concurrency 8 \
  --host qwq-32b-v1 \
  --port 8000 \
  --endpoint /v1/completions \
  --save-result | tee benchmark_serving.txt
The benchmark reports metrics such as request throughput (~0.17 req/s), token throughput (~790 tok/s), mean time‑to‑first‑token (~10 s), and per‑token latency.
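Because --save-result is set, the script also writes a JSON results file to the working directory (the filename encodes the backend, model name, and timestamp). Assuming it is the only JSON file there, you can pretty‑print it for a closer look:
ls *.json
python3 -m json.tool *.json | head -n 40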
References
ModelScope repository: https://www.modelscope.cn/Qwen/QwQ-32B.git
vLLM project: https://github.com/vllm-project/vllm