Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

Background

Alibaba Cloud Container Compute Service (ACS) provides on‑demand GPU compute without low‑level hardware management, making it suitable for large language model (LLM) inference. The QwQ‑32B model has 32 billion parameters yet achieves benchmark scores comparable to the far larger 671‑billion‑parameter DeepSeek‑R1.

Prerequisites

Alibaba Cloud account with real‑name verification.

ACS cluster in a GPU‑enabled region/zone.

kubectl configured to access the cluster (a quick connectivity check is shown below).
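
A minimal way to confirm that kubectl is pointed at the intended cluster and that the API server responds (these are generic Kubernetes checks, not ACS‑specific):

# Show the active context and confirm the cluster is reachable
kubectl config current-context
kubectl cluster-info
kubectl get ns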

GPU Instance Specification

The model weights alone require roughly 60 GiB of VRAM (32 × 10⁹ parameters × 2 bytes per parameter in BF16/FP16). To accommodate KV cache and buffers, a GPU with at least 80 GiB VRAM, 16 vCPU and 128 GiB RAM is recommended.
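
As a rough back‑of‑the‑envelope check (assuming 2 bytes per parameter; KV cache and runtime buffers come on top of this figure):

# 32 billion parameters × 2 bytes per parameter, expressed in GiB
python3 -c "print(32e9 * 2 / 2**30)"   # ≈ 59.6, i.e. roughly 60 GiB for the weights alone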

Step 1 – Prepare Model Data

Download the model from ModelScope using git‑lfs and upload the files to an OSS bucket.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull

Create the target directory in OSS and copy the model files:

ossutil mkdir oss://<your-bucket-name>/models/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<your-bucket-name>/models/QwQ-32B
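
To confirm the upload completed, you can list the prefix in the bucket (the bucket name is a placeholder):

ossutil ls oss://<your-bucket-name>/models/QwQ-32B/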

Step 2 – Create PV and PVC

Define a secret with OSS credentials, a PersistentVolume (PV) that uses the OSS CSI driver, and a PersistentVolumeClaim (PVC) that binds to the PV.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/QwQ-32B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
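
Assuming the manifest above is saved as llm-model.yaml (the file name is arbitrary), apply it and confirm that the PVC binds to the PV:

kubectl apply -f llm-model.yaml
# STATUS should show "Bound" once the PVC has matched the PV
kubectl get pv llm-model
kubectl get pvc llm-model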

Step 3 – Deploy the Model with vLLM

Apply a Deployment that runs the vLLM container and a Service exposing port 8000.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b
  labels:
    app: qwq-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/gpu-model-series: <example-model>
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
      containers:
        - name: vllm
          image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
          command: ["sh", "-c", "vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "16"
              memory: 128Gi
          volumeMounts:
            - name: model
              mountPath: /models/QwQ-32B
            - name: dshm
              mountPath: /dev/shm
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-v1
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwq-32b
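
Assuming the Deployment and Service above are saved as qwq-32b.yaml (the file name is arbitrary), apply them and wait for the pod to pull the image and load the model from OSS; loading a 32B model can take several minutes:

kubectl apply -f qwq-32b.yaml
kubectl get pods -l app=qwq-32b -w
# Once the pod is Running, check the logs to confirm vLLM has finished loading the model
kubectl logs deploy/qwq-32b -f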

Step 4 – Deploy OpenWebUI

OpenWebUI provides a web front‑end that forwards requests to the vLLM service.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
        - name: openwebui
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/open-webui:main
          env:
            - name: ENABLE_OPENAI_API
              value: "True"
            - name: ENABLE_OLLAMA_API
              value: "False"
            - name: OPENAI_API_BASE_URL
              value: http://qwq-32b-v1:8000/v1
            - name: ENABLE_AUTOCOMPLETE_GENERATION
              value: "False"
            - name: ENABLE_TAGS_GENERATION
              value: "False"
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: data-volume
              mountPath: /app/backend/data
      volumes:
        - name: data-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
  labels:
    app: openwebui
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: openwebui
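
Assuming the manifest above is saved as openwebui.yaml (the file name is arbitrary), apply it and wait for the rollout to complete:

kubectl apply -f openwebui.yaml
kubectl rollout status deploy/openwebui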

Step 5 – Verify the Inference Service

Port‑forward the OpenWebUI service to a local port and open the UI in a browser:

kubectl port-forward svc/openwebui 8080:8080

Then navigate to http://localhost:8080, log in, and send prompts to the model.
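
You can also call the vLLM OpenAI‑compatible API directly, bypassing OpenWebUI. The example below port‑forwards the qwq-32b-v1 Service from Step 3 and sends a chat completion request; the prompt and max_tokens value are arbitrary:

kubectl port-forward svc/qwq-32b-v1 8000:8000
# In another terminal:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwq-32b",
        "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
        "max_tokens": 128
      }'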

Optional Step 6 – Benchmark the Service

Create a benchmark Deployment whose pod mounts the model volume, download the benchmark dataset, and run the vLLM benchmark script.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-benchmark
  labels:
    app: vllm-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-benchmark
  template:
    metadata:
      labels:
        app: vllm-benchmark
    spec:
      volumes:
        - name: llm-model
          persistentVolumeClaim:
            claimName: llm-model
      containers:
        - name: vllm-benchmark
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-benchmark:v1
          command: ["sh", "-c", "sleep inf"]
          volumeMounts:
            - name: llm-model
              mountPath: /models/QwQ-32B
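
Assuming the manifest above is saved as vllm-benchmark.yaml (the file name is arbitrary), apply it and wait for the pod to start:

kubectl apply -f vllm-benchmark.yaml
kubectl get pods -l app=vllm-benchmark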

Enter the pod, install modelscope, download the ShareGPT dataset, and run the benchmark:

# Enter the benchmark pod
PODNAME=$(kubectl get po -o custom-columns=":metadata.name" | grep vllm-benchmark)
kubectl exec -it $PODNAME -- bash
# Install modelscope and download data
pip3 install modelscope
modelscope download --dataset gliang1001/ShareGPT_V3_unfiltered_cleaned_split ShareGPT_V3_unfiltered_cleaned_split.json --local_dir /root/
# Run benchmark
python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/QwQ-32B \
  --served-model-name qwq-32b \
  --trust-remote-code \
  --dataset-name random \
  --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
  --random-input-len 4096 \
  --random-output-len 512 \
  --random-range-ratio 1 \
  --num-prompts 80 \
  --max-concurrency 8 \
  --host qwq-32b-v1 \
  --port 8000 \
  --endpoint /v1/completions \
  --save-result | tee benchmark_serving.txt

The benchmark reports metrics such as request throughput (~0.17 req/s), token throughput (~790 tok/s), mean time‑to‑first‑token (~10 s), and per‑token latency.

References

ModelScope repository: https://www.modelscope.cn/Qwen/QwQ-32B.git

vLLM project: https://github.com/vllm-project/vllm