Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide
This guide walks through deploying the QwQ‑32B large language model on an Alibaba Cloud ACK cluster: preparing model storage on OSS, enabling the ACK Gateway with AI Extension, configuring InferencePool and InferenceModel resources, and benchmarking intelligent routing against standard gateway routing to measure the resulting latency and throughput improvements.
Overview
QwQ‑32B is a 32‑billion‑parameter large language model. This guide shows how to serve it on Alibaba Cloud Container Service for Kubernetes (ACK) using the ACK Gateway with AI Extension, which provides production‑grade intelligent routing, model‑level gray release, and load balancing for LLM inference workloads.
Prerequisites
An ACK cluster with GPU nodes (e.g., ecs.gn7i-c32g1.32xlarge equipped with 4 × A10 GPUs). The example uses five such nodes.
An OSS bucket to store the model files.
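Before starting, it helps to confirm that the cluster actually exposes the expected GPU capacity. A minimal check, assuming the standard nvidia.com/gpu resource name advertised by the NVIDIA device plugin on ACK GPU nodes:
# Each of the five nodes should report 4 allocatable GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'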
Step 1 – Prepare Model Data
Clone the model repository without pulling LFS objects:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
Enter the repository and download the LFS files, then return to the parent directory:
cd QwQ-32B
git lfs pull
cd ..
Upload the model directory to OSS (replace my-bucket with your bucket name):
ossutil mkdir oss://my-bucket/QwQ-32B
ossutil cp -r ./QwQ-32B oss://my-bucket/QwQ-32B
Create a PersistentVolume (PV) backed by the OSS bucket and a PersistentVolumeClaim (PVC) named llm-model that binds to it. Example PV/PVC manifest (simplified):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-pv
spec:
  storageClassName: oss
  capacity:
    storage: 500Gi
  accessModes:
    - ReadOnlyMany
  csi:
    driver: oss.csi.aliyun.com
    volumeHandle: oss://my-bucket/QwQ-32B
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  storageClassName: oss
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Gi
  volumeName: llm-pv
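Before deploying the inference service, verify that the claim has bound to the volume. This sanity check is not part of the original manifests, but it catches OSS credential or mount problems early:
# STATUS should show Bound for both objects
kubectl get pv llm-pv
kubectl get pvc llm-model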
Step 2 – Deploy the Inference Service
Deploy a vLLM container that mounts the PVC and exposes port 8000. The Deployment runs five replicas (one per GPU node) and allocates four GPUs to each pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b
  labels:
    app: qwq-32b
spec:
  replicas: 5
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        # Shared memory for vLLM's tensor-parallel worker communication
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
      containers:
        - name: vllm
          image: registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:v0.7.2
          command:
            - sh
            - -c
            - vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel-size 4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
          ports:
            - containerPort: 8000
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            limits:
              # Matches --tensor-parallel-size 4
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: model
              mountPath: /models/QwQ-32B
            - name: dshm
              mountPath: /dev/shm
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-v1
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwq-32b
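Loading a 32B model across four GPUs takes several minutes per replica, so wait for the rollout to finish before wiring up the gateway. These are standard kubectl checks; the names match the manifests above:
kubectl rollout status deployment/qwq-32b
kubectl get pods -l app=qwq-32b -o wide
# Optional in-cluster smoke test against the Service before exposing it
kubectl port-forward svc/qwq-32b-v1 8000:8000 &
curl -s http://localhost:8000/v1/models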
Step 3 – Enable ACK Gateway with AI Extension
Enable the ACK Gateway with AI Extension component in the ACK console, then create a Gateway with two listeners: port 8080 for standard HTTP routing and port 8081 for AI‑extension routing.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 8080
    - name: llm-gw
      protocol: HTTP
      port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: backend
spec:
  parentRefs:
    - name: inference-gateway
      sectionName: http
  rules:
    - backendRefs:
        - kind: Service
          name: qwq-32b-v1
          port: 8000
      matches:
        - path:
            type: PathPrefix
            value: /
      timeouts:
        request: "24h"
        backendRequest: "24h"
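Once applied, the Gateway should be programmed and assigned an address, and the HTTPRoute should be accepted by it. A quick status check using the standard Gateway API resources:
kubectl get gatewayclass inference-gateway
# ADDRESS should be populated and PROGRAMMED should be True
kubectl get gateway inference-gateway
kubectl get httproute backend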
Step 4 – Configure InferencePool and InferenceModel
These custom resources bind the gateway listener on port 8081 to the QwQ‑32B deployment and define the traffic‑splitting policy.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: reasoning-pool
  annotations:
    inference.networking.x-k8s.io/attach-to: |
      name: inference-gateway
      port: 8081
spec:
  targetPortNumber: 8000
  selector:
    app: qwq-32b
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  criticality: Critical
  modelName: qwq
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: reasoning-pool
  targetModels:
    - name: qwq-32b
      weight: 100
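You can confirm the custom resources were admitted before sending traffic. The plural resource names below assume the conventional CRD naming for the inference.networking.x-k8s.io group:
kubectl get inferencepools.inference.networking.x-k8s.io reasoning-pool
kubectl get inferencemodels.inference.networking.x-k8s.io inferencemodel-sample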
Step 5 – Verify Intelligent Routing
Obtain the gateway address and send a test request to the /v1/chat/completions endpoint. The request is routed through the AI extension to the QwQ‑32B service.
GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST http://${GATEWAY_IP}:8081/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "qwq", "messages": [{"role": "user", "content": "你是谁?"}]}' -vStep 6 – Performance Benchmark
Step 6 – Performance Benchmark
Run two benchmarks with the benchmark_serving.py script from the vLLM benchmark suite. The only difference between the runs is the target port: 8080 for standard routing and 8081 for AI‑extension routing. Replace the --host value (172.16.12.92 in this example) with your own gateway address.
# Standard gateway routing (port 8080)
python3 /root/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model /models/QwQ-32B \
--served-model-name qwq-32b \
--trust-remote-code \
--dataset-name random \
--dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
--random-prefix-len 1000 \
--random-input-len 4000 \
--random-output-len 3000 \
--random-range-ratio 0.2 \
--num-prompts 3000 \
--max-concurrency 60 \
--host 172.16.12.92 \
--port 8080 \
--endpoint /v1/completions \
--save-result | tee benchmark_standard.txt
# AI‑extension routing (port 8081)
python3 /root/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model /models/QwQ-32B \
--served-model-name qwq-32b \
--trust-remote-code \
--dataset-name random \
--dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
--random-prefix-len 1000 \
--random-input-len 4000 \
--random-output-len 3000 \
--random-range-ratio 0.2 \
--num-prompts 3000 \
--max-concurrency 60 \
--host 172.16.12.92 \
--port 8081 \
--endpoint /v1/completions \
--save-result | tee benchmark_ai_extension.txt
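For a quick side‑by‑side comparison of the two runs, you can pull the relevant lines out of the captured output. This assumes the TTFT and throughput lines that benchmark_serving.py normally prints in its final summary:
grep -E 'TTFT|Output token throughput' benchmark_standard.txt benchmark_ai_extension.txt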
Performance Comparison
Key metrics from the two runs show that the AI‑extension route reduces latency, particularly at the tail, and slightly improves output token throughput.
Average TTFT: 2456 ms (standard) vs. 1797 ms (AI extension) – 26.8 % reduction.
P99 TTFT: 23509 ms vs. 8857 ms – 62.3 % reduction.
Output token throughput: 916 tok/s vs. 927 tok/s.
These results show that ACK Gateway with AI Extension delivers markedly lower time to first token, especially at the P99 tail, along with slightly higher output token throughput than standard gateway routing.
Conclusion
The ACK Gateway with AI Extension enables production‑grade intelligent routing, model‑level gray release, and traffic mirroring for large language model serving on Kubernetes. Benchmarks confirm measurable latency reductions and modest throughput gains, making it a suitable solution for high‑performance LLM inference workloads.