Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing
This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.
Background : Alibaba Cloud recently released the QwQ-32B model (320 billion parameters) whose performance rivals DeepSeek‑R1 671B. The vLLM framework provides efficient inference with features such as PagedAttention, dynamic batching, and model quantization.
Prerequisites : A GPU‑enabled ACK Kubernetes cluster (e.g., ecs.gn7i-c32g1.32xlarge with 4 × A10 GPUs) and an OSS bucket for model storage.
Step 1 – Prepare Model Data :
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pullUpload the downloaded model files to OSS:
ossutil mkdir oss://<Your-Bucket-Name>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<Your-Bucket-Name>/QwQ-32BConfigure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that use the OSS static volume (example configurations are shown in the original tables).
Step 2 – Deploy Inference Service (vLLM deployment):
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: qwq-32b
name: qwq-32b
namespace: default
spec:
replicas: 5 # adjust to GPU node count
selector:
matchLabels:
app: qwq-32b
template:
metadata:
labels:
app: qwq-32b
annotations:
prometheus.io/path: /metrics
prometheus.io/port: "8000"
prometheus.io/scrape: "true"
spec:
volumes:
- name: model
persistentVolumeClaim:
claimName: llm-model
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 30Gi
containers:
- command:
- sh
- -c
- vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel=4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
image: registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:v0.7.2
name: vllm
ports:
- containerPort: 8000
readinessProbe:
tcpSocket:
port: 8000
initialDelaySeconds: 30
periodSeconds: 30
resources:
limits:
nvidia.com/gpu: "4"
volumeMounts:
- mountPath: /models/QwQ-32B
name: model
- mountPath: /dev/shm
name: dshm
---
apiVersion: v1
kind: Service
metadata:
name: qwq-32b-v1
spec:
type: ClusterIP
ports:
- port: 8000
protocol: TCP
targetPort: 8000
selector:
app: qwq-32b
EOFStep 3 – Deploy OpenWebUI :
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: openwebui
spec:
replicas: 1
selector:
matchLabels:
app: openwebui
template:
metadata:
labels:
app: openwebui
spec:
containers:
- env:
- name: ENABLE_OPENAI_API
value: "True"
- name: ENABLE_OLLAMA_API
value: "False"
- name: OPENAI_API_BASE_URL
value: http://qwq-32b-v1:8000/v1
- name: ENABLE_AUTOCOMPLETE_GENERATION
value: "False"
- name: ENABLE_TAGS_GENERATION
value: "False"
image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/open-webui:main
name: openwebui
ports:
- containerPort: 8080
volumeMounts:
- mountPath: /app/backend/data
name: data-volume
volumes:
- emptyDir: {}
name: data-volume
---
apiVersion: v1
kind: Service
metadata:
name: openwebui
spec:
type: ClusterIP
ports:
- port: 8080
protocol: TCP
targetPort: 8080
selector:
app: openwebui
EOFStep 4 – Verify Inference Service : Use kubectl port-forward svc/openwebui 8080:8080 and access http://localhost:8080 to log into OpenWebUI and test a prompt (e.g., "0.11和0.9谁大?").
Optional Step 5 – Benchmark Inference Service :
Deploy a benchmark pod, download the ShareGPT_V3 dataset, and run benchmark_serving.py with appropriate parameters. Sample output shows request throughput, token throughput, and latency metrics (e.g., TTFT ≈ 4.9 s, output token throughput ≈ 101.89 tok/s for 8‑concurrency).
Intelligent Routing with ACK Gateway :
Enable the ACK Gateway with AI Extension component, create a GatewayClass and Gateway with listeners on ports 8080 (standard HTTP) and 8081 (inference extension). Define HTTPRoute for the backend service and create InferencePool and InferenceModel CRDs to bind the QwQ‑32B model to the gateway.
Verify routing by sending a POST request to the gateway IP on the appropriate port and model name.
Observing Performance :
Collect vLLM metrics via Prometheus (e.g., gpu_cache_usage_perc, request_queue_time_seconds_sum, num_requests_running, avg_generation_throughput_toks_per_s, time_to_first_token_seconds_bucket). Import the provided Grafana JSON model to visualise these metrics.
Run comparative benchmarks against the default gateway (port 8080) and the inference‑extension gateway (port 8081). Results show the extension reduces mean TTFT by 26.8 % and P99 TTFT by 62.32 % while improving cache utilisation.
Conclusion : The tutorial demonstrates rapid deployment of the QwQ‑32B model on ACK with reduced resource requirements (bf16 precision on 64 GB GPU memory, 4 × A10 GPUs). The ACK Gateway with AI Extension provides superior routing for LLM workloads, yielding lower latency and higher throughput compared to traditional least‑request scheduling.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
