Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide
This guide walks through deploying the QwQ‑32B large language model on an Alibaba Cloud ACK cluster: preparing model storage on OSS, enabling the ACK Gateway with AI Extension, configuring InferencePool and InferenceModel resources, and benchmarking intelligent routing against standard gateway routing to measure the resulting latency and throughput improvements.
Overview
QwQ‑32B is a 32‑billion‑parameter large language model. This guide shows how to serve it on Alibaba Cloud Container Service for Kubernetes (ACK) using the ACK Gateway with AI Extension, which provides production‑grade intelligent routing, model‑level gray release, and load balancing for LLM inference workloads.
Prerequisites
An ACK cluster with GPU nodes (e.g., ecs.gn7i-c32g1.32xlarge equipped with 4 × A10 GPUs). The example uses five such nodes.
An OSS bucket to store the model files.
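Before starting, it helps to confirm that the cluster actually exposes the expected GPU capacity. A minimal check, assuming the standard nvidia.com/gpu resource name advertised by the NVIDIA device plugin on ACK GPU nodes:
# Each of the five nodes should report 4 allocatable GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'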
Step 1 – Prepare Model Data
Clone the model repository without pulling LFS objects:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
Enter the repository and download the LFS files, then return to the parent directory:
cd QwQ-32B
git lfs pull
cd ..
Upload the model directory to OSS (replace my-bucket with your bucket name):
ossutil mkdir oss://my-bucket/QwQ-32B
ossutil cp -r ./QwQ-32B oss://my-bucket/QwQ-32B
Create a PersistentVolume (PV) backed by the OSS bucket and a PersistentVolumeClaim (PVC) named llm-model that binds to it. Example PV/PVC manifest (simplified):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-pv
spec:
  storageClassName: oss
  capacity:
    storage: 500Gi
  accessModes:
    - ReadOnlyMany
  csi:
    driver: oss.csi.aliyun.com
    volumeHandle: oss://my-bucket/QwQ-32B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  storageClassName: oss
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Gi
  volumeName: llm-pv
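Before deploying the inference service, verify that the claim has bound to the volume. This sanity check is not part of the original manifests, but it catches OSS credential or mount problems early:
# STATUS should show Bound for both objects
kubectl get pv llm-pv
kubectl get pvc llm-model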
Step 2 – Deploy the Inference Service
Deploy a vLLM container that mounts the PVC and exposes port 8000. The Deployment runs five replicas (one per GPU node) and allocates four GPUs to each pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b
  labels:
    app: qwq-32b
spec:
  replicas: 5
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        # Shared memory for vLLM's tensor-parallel worker communication
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
      containers:
        - name: vllm
          image: registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:v0.7.2
          command:
            - sh
            - -c
            - vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel-size 4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
          ports:
            - containerPort: 8000
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            limits:
              # Matches --tensor-parallel-size 4
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: model
              mountPath: /models/QwQ-32B
            - name: dshm
              mountPath: /dev/shm
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-v1
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwq-32b
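Loading a 32B model across four GPUs takes several minutes per replica, so wait for the rollout to finish before wiring up the gateway. These are standard kubectl checks; the names match the manifests above:
kubectl rollout status deployment/qwq-32b
kubectl get pods -l app=qwq-32b -o wide
# Optional in-cluster smoke test against the Service before exposing it
kubectl port-forward svc/qwq-32b-v1 8000:8000 &
curl -s http://localhost:8000/v1/models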
Step 3 – Enable ACK Gateway with AI Extension
Enable the ACK Gateway with AI Extension component in the ACK console, then create a Gateway with two listeners: port 8080 for standard HTTP routing and port 8081 for AI‑extension routing.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 8080
    - name: llm-gw
      protocol: HTTP
      port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: backend
spec:
  parentRefs:
    - name: inference-gateway
      sectionName: http
  rules:
    - backendRefs:
        - kind: Service
          name: qwq-32b-v1
          port: 8000
      matches:
        - path:
            type: PathPrefix
            value: /
      timeouts:
        request: "24h"
        backendRequest: "24h"
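Once applied, the Gateway should be programmed and assigned an address, and the HTTPRoute should be accepted by it. A quick status check using the standard Gateway API resources:
kubectl get gatewayclass inference-gateway
# ADDRESS should be populated and PROGRAMMED should be True
kubectl get gateway inference-gateway
kubectl get httproute backend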
Step 4 – Configure InferencePool and InferenceModel
These custom resources bind the gateway listener on port 8081 to the QwQ‑32B deployment and define the traffic‑splitting policy.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: reasoning-pool
  annotations:
    inference.networking.x-k8s.io/attach-to: |
      name: inference-gateway
      port: 8081
spec:
  targetPortNumber: 8000
  selector:
    app: qwq-32b
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  criticality: Critical
  modelName: qwq
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: reasoning-pool
  targetModels:
    - name: qwq-32b
      weight: 100
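You can confirm the custom resources were admitted before sending traffic. The plural resource names below assume the conventional CRD naming for the inference.networking.x-k8s.io group:
kubectl get inferencepools.inference.networking.x-k8s.io reasoning-pool
kubectl get inferencemodels.inference.networking.x-k8s.io inferencemodel-sample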
Step 5 – Verify Intelligent Routing
Obtain the gateway address and send a test request to the /v1/chat/completions endpoint. The request is routed through the AI extension to the QwQ‑32B service.
GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST http://${GATEWAY_IP}:8081/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "qwq", "messages": [{"role": "user", "content": "你是谁?"}]}' -vStep 6 – Performance Benchmark
Step 6 – Performance Benchmark
Run two benchmarks with the benchmark_serving.py script from the vLLM benchmark suite. The only difference between the runs is the target port: 8080 for standard routing and 8081 for AI‑extension routing. Replace the --host value (172.16.12.92 in this example) with your own gateway address.
# Standard gateway routing (port 8080)
python3 /root/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model /models/QwQ-32B \
--served-model-name qwq-32b \
--trust-remote-code \
--dataset-name random \
--dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
--random-prefix-len 1000 \
--random-input-len 4000 \
--random-output-len 3000 \
--random-range-ratio 0.2 \
--num-prompts 3000 \
--max-concurrency 60 \
--host 172.16.12.92 \
--port 8080 \
--endpoint /v1/completions \
--save-result | tee benchmark_standard.txt
# AI‑extension routing (port 8081)
python3 /root/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model /models/QwQ-32B \
--served-model-name qwq-32b \
--trust-remote-code \
--dataset-name random \
--dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
--random-prefix-len 1000 \
--random-input-len 4000 \
--random-output-len 3000 \
--random-range-ratio 0.2 \
--num-prompts 3000 \
--max-concurrency 60 \
--host 172.16.12.92 \
--port 8081 \
--endpoint /v1/completions \
--save-result | tee benchmark_ai_extension.txt
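For a quick side‑by‑side comparison of the two runs, you can pull the relevant lines out of the captured output. This assumes the TTFT and throughput lines that benchmark_serving.py normally prints in its final summary:
grep -E 'TTFT|Output token throughput' benchmark_standard.txt benchmark_ai_extension.txt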
Performance Comparison
Key metrics from the two runs show that the AI‑extension route reduces latency, particularly at the tail, and slightly improves output token throughput.
Average TTFT: 2456 ms (standard) vs. 1797 ms (AI extension) – 26.8 % reduction.
P99 TTFT: 23509 ms vs. 8857 ms – 62.3 % reduction.
Output token throughput: 916 tok/s vs. 927 tok/s.
These results show that ACK Gateway with AI Extension delivers markedly lower time to first token, especially at the P99 tail, along with slightly higher output token throughput than standard gateway routing.
Conclusion
The ACK Gateway with AI Extension enables production‑grade intelligent routing, model‑level gray release, and traffic mirroring for large language model serving on Kubernetes. Benchmarks confirm measurable latency reductions and modest throughput gains, making it a suitable solution for high‑performance LLM inference workloads.