Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing
This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.
Background: Alibaba Cloud recently released the QwQ-32B model (32 billion parameters), whose performance rivals the 671B-parameter DeepSeek-R1. The vLLM framework provides efficient inference through features such as PagedAttention, dynamic batching, and model quantization.
Prerequisites: a GPU-enabled ACK Kubernetes cluster (e.g., ecs.gn7i-c32g1.32xlarge instances with 4 × A10 GPUs) and an OSS bucket for model storage.
Step 1 – Prepare Model Data:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull
Upload the downloaded model files to OSS (replace <bucket-name> with your bucket):
ossutil mkdir oss://<bucket-name>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<bucket-name>/QwQ-32B
Configure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that use the OSS static volume (example configurations are shown in the original tables).
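For orientation, a static OSS volume on ACK can be declared roughly as follows. This is a minimal sketch: the resource names, secret name, bucket placeholder, endpoint URL, and mount options are illustrative assumptions, not the values from the original tables.

```shell
kubectl apply -f- <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-oss-pv            # assumed name
  labels:
    alicloud-pvname: model-oss-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss-pv
    nodePublishSecretRef:        # secret holding the OSS AccessKey (assumed name)
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<bucket-name>"
      url: "oss-cn-hangzhou-internal.aliyuncs.com"   # use your bucket's region endpoint
      path: "/QwQ-32B"
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-oss-pvc            # assumed name, referenced by the inference Deployment
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      alicloud-pvname: model-oss-pv
EOF
```

The PVC binds to the PV via the `alicloud-pvname` label selector, which is the usual pattern for statically provisioned OSS volumes on ACK.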
Step 2 – Deploy Inference Service (vLLM): apply the vLLM Deployment and Service manifests with kubectl apply -f- (the full manifests are given in the original guide).
Step 3 – Deploy OpenWebUI: apply the OpenWebUI Deployment and Service manifests in the same way.
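The vLLM deployment can be sketched along the following lines. The image tag, serve flags, PVC name, and service name are assumptions for illustration; the authoritative manifest is in the original guide.

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b                  # assumed name
  labels:
    app: qwq-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-oss-pvc    # PVC created in Step 1 (assumed name)
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image
          command: ["vllm", "serve", "/models/QwQ-32B",
                    "--served-model-name", "qwq-32b",
                    "--tensor-parallel-size", "4",    # shard across the 4 A10 GPUs
                    "--max-model-len", "8192"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: model
              mountPath: /models        # OSS volume mounted read-only with the weights
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b
spec:
  selector:
    app: qwq-32b
  ports:
    - port: 8000
      targetPort: 8000
EOF
```

Tensor parallelism of 4 splits the bf16 weights across the four A10 cards, which is what makes a 32B model fit on this instance type.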
Step 4 – Verify Inference Service: run kubectl port-forward svc/openwebui 8080:8080, open http://localhost:8080, log into OpenWebUI, and test a prompt (e.g., "Which is larger, 0.11 or 0.9?").
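You can also verify the vLLM OpenAI-compatible endpoint directly, bypassing OpenWebUI. The service name `qwq-32b`, port 8000, and served model name below are assumptions; substitute whatever your deployment manifest uses.

```shell
# Forward the inference Service locally (assumed service name and port)
kubectl port-forward svc/qwq-32b 8000:8000 &

# Standard OpenAI-style chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwq-32b",
        "messages": [{"role": "user", "content": "Which is larger, 0.11 or 0.9?"}],
        "max_tokens": 512
      }'
```

A JSON response with a `choices[0].message.content` field confirms the model is serving correctly.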
Optional Step 5 – Benchmark Inference Service:
Deploy a benchmark pod, download the ShareGPT_V3 dataset, and run benchmark_serving.py with appropriate parameters. Sample output reports request throughput, token throughput, and latency metrics (e.g., TTFT ≈ 4.9 s and output token throughput ≈ 101.89 tok/s at a concurrency of 8).
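A benchmark run can be sketched as follows, using vLLM's benchmark_serving.py from inside the benchmark pod. The base URL, model path, prompt count, and concurrency are illustrative assumptions; flag names follow the upstream vLLM benchmark script.

```shell
# ShareGPT_V3 dataset commonly used with vLLM's serving benchmark
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Run the serving benchmark against the in-cluster endpoint (assumed URL/names)
python3 benchmark_serving.py \
  --backend vllm \
  --base-url http://qwq-32b:8000 \
  --model /models/QwQ-32B \
  --served-model-name qwq-32b \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 \
  --max-concurrency 8
```

The script prints request throughput, input/output token throughput, and TTFT/TPOT latency percentiles, which is where figures like the ≈ 4.9 s TTFT above come from.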
Intelligent Routing with ACK Gateway:
Enable the ACK Gateway with AI Extension component, then create a GatewayClass and a Gateway with listeners on ports 8080 (standard HTTP) and 8081 (inference extension). Define an HTTPRoute for the backend service and create InferencePool and InferenceModel custom resources to bind the QwQ-32B model to the gateway.
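Under stated assumptions, the inference-routing resources look roughly like this. The API versions and fields follow the upstream Kubernetes Gateway API Inference Extension (v1alpha2) and may differ slightly in the ACK component; all names, the gateway class, and the endpoint-picker reference are illustrative.

```shell
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwq-gateway
spec:
  gatewayClassName: <ack-gateway-class>   # class installed by the ACK component
  listeners:
    - name: http
      port: 8080
      protocol: HTTP
    - name: inference
      port: 8081
      protocol: HTTP
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwq-32b-pool
spec:
  targetPortNumber: 8000        # port the vLLM pods listen on
  selector:
    app: qwq-32b                # labels of the inference pods
  extensionRef:
    name: qwq-32b-epp           # endpoint-picker extension service (assumed name)
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwq-32b
spec:
  modelName: qwq-32b            # model name clients send in requests
  criticality: Critical
  poolRef:
    name: qwq-32b-pool
EOF
```

The InferencePool selects the vLLM pods and delegates endpoint selection to the extension, which can route on live signals such as KV-cache usage and queue depth instead of plain least-request.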
Verify routing by sending a POST request to the gateway IP on the appropriate port, specifying the bound model name.
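A routing check can be sketched as follows, assuming a Gateway named qwq-gateway and a bound model name qwq-32b (both hypothetical labels for this example):

```shell
# Read the external address from the Gateway status
GATEWAY_IP=$(kubectl get gateway qwq-gateway -o jsonpath='{.status.addresses[0].value}')

# POST to the inference-extension listener (port 8081) with the bound model name
curl http://${GATEWAY_IP}:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq-32b", "messages": [{"role": "user", "content": "hello"}]}'
```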
Observing Performance:
Collect vLLM metrics via Prometheus (e.g., gpu_cache_usage_perc, request_queue_time_seconds_sum, num_requests_running, avg_generation_throughput_toks_per_s, time_to_first_token_seconds_bucket). Import the provided Grafana dashboard JSON to visualise these metrics.
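Individual metrics can be spot-checked through the Prometheus HTTP API before building dashboards. The Prometheus service address is an assumption, and the `vllm:` prefix reflects how vLLM exposes these metric names on its /metrics endpoint:

```shell
# Current KV-cache utilisation across vLLM pods
curl -s 'http://prometheus-server:9090/api/v1/query' \
  --data-urlencode 'query=vllm:gpu_cache_usage_perc'

# P99 time-to-first-token over the last 5 minutes, from the histogram buckets
curl -s 'http://prometheus-server:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))'
```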
Run comparative benchmarks against the default gateway (port 8080) and the inference‑extension gateway (port 8081). Results show the extension reduces mean TTFT by 26.8 % and P99 TTFT by 62.32 % while improving cache utilisation.
Conclusion: The tutorial demonstrates rapid deployment of the QwQ-32B model on ACK with reduced resource requirements (bf16 precision on 64 GB GPU memory, 4 × A10 GPUs). The ACK Gateway with AI Extension provides superior routing for LLM workloads, yielding lower latency and higher throughput compared to traditional least-request scheduling.