Boost LLM Inference with KV‑Cache‑Aware Routing on Alibaba Cloud ACK GIE
This article explains why KV‑Cache hit rate is critical for large‑model inference, describes vLLM's automatic prefix caching, outlines the distributed cache challenges, and provides a step‑by‑step guide to deploying Alibaba Cloud ACK Gateway with Inference Extension's precise‑mode prefix‑cache‑aware routing, backed by benchmark results.
Why KV‑Cache matters
Transformer‑based large language models compute attention between each new token and all previous tokens. For long prompts, this "prefill" stage, whose cost grows quadratically with prompt length, dominates time to first token. The KV‑Cache stores the Key and Value vectors of already‑processed tokens in GPU memory, so each subsequent token can attend over them without recomputing the entire prefix.
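The mechanism can be sketched in a few lines of Python. This is a toy model (identity K/V "projections", single head, NumPy), not a real transformer, but it shows why decoding with a cache only processes the newest token:

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for one query vector over t cached tokens.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Append-only store of the Key/Value vectors of processed tokens."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))
    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

rng = np.random.default_rng(0)
d = 8
cache = KVCache(d)
for step in range(4):
    x = rng.standard_normal(d)  # hidden state of the newest token
    cache.append(x, x)          # toy identity K/V projections
    # Only the new token is projected; earlier K/V come from the cache,
    # so each decode step costs O(t) instead of re-running the whole prefix.
    out = attention(x, cache.K, cache.V)
print(out.shape)  # (8,)
```

Without the cache, every decode step would have to recompute K and V for all preceding tokens before attending, which is exactly the redundant work APC and cache-aware routing try to avoid.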
vLLM Automatic Prefix Caching (APC)
vLLM's automatic prefix caching (APC) detects identical request prefixes (e.g., a shared system prompt) and reuses the corresponding KV‑Cache blocks instead of recomputing them. In a single‑node test, first‑token latency dropped from 4.3 s to 0.6 s (≈7× faster) once the prompt prefix was cached.
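The idea behind APC's block matching can be sketched as chained block hashing (illustrative only; this simplified version hashes string-encoded token-ID blocks with SHA‑256, whereas vLLM's actual scheme is selected by flags such as the `--prefix-caching-hash-algo` option used in the deployment below):

```python
import hashlib

BLOCK_SIZE = 4  # small for illustration; production KV blocks are larger

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash full blocks of token IDs: each block's hash covers its
    own tokens plus the hash of the preceding block, so equal hashes
    imply equal prefixes all the way from the start of the prompt."""
    hashes, parent = [], b""
    n_full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, n_full, block_size):
        digest = hashlib.sha256(parent + repr(token_ids[i:i + block_size]).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

system_prompt = list(range(8))  # token IDs of a shared system prompt
a = block_hashes(system_prompt + [100, 101, 102, 103])
b = block_hashes(system_prompt + [200, 201, 202, 203])
assert a[:2] == b[:2]   # shared-prefix blocks hash identically -> cache hit
assert a[2] != b[2]     # divergent blocks get fresh hashes -> recompute
```

Because the hashes chain through the parent block, a matching hash guarantees a matching prefix, which is what lets the cache reuse KV blocks safely.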
Distributed deployment challenge
When multiple vLLM replicas run behind a load balancer, each replica maintains an independent KV‑Cache. Traditional load‑balancing algorithms are unaware of cache state, so identical prefixes end up scattered across pods, prefill work is repeated, and performance degrades severely.
Precise‑mode prefix‑cache‑aware routing
ACK Gateway with Inference Extension (ACK GIE) solves this by collecting KV‑Cache events from each vLLM pod via ZeroMQ (supported from v0.10.0). The gateway builds a global index of KV blocks (hash, location, storage medium) and routes incoming requests to the pod that holds the most matching blocks, while also considering load metrics such as queue length and GPU utilization.
Routing algorithm steps
KV event reporting: each vLLM pod publishes cache‑creation, update, and deletion events.
Global index construction: the gateway records block hash, pod identifier, and storage type (GPU/CPU).
Intelligent routing decision: for a new request, compute the hash sequence of its prefix, query the index for matching pods, and select the optimal pod based on current load.
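The three steps can be sketched as a toy router (hypothetical class and method names, not the gateway's API; the real implementation also weighs metrics such as queue length and GPU utilization):

```python
from collections import defaultdict

class PrefixCacheRouter:
    """Toy model of the three steps above: pods report KV events, the
    gateway keeps a global block index, and requests go to the pod with
    the longest contiguous run of cached prefix blocks."""

    def __init__(self):
        self.index = defaultdict(set)   # block hash -> pods holding it
        self.load = defaultdict(int)    # pod -> queued requests (tie-breaker)

    def on_kv_event(self, pod, block_hash, created=True):
        # Steps 1 + 2: ingest cache create/delete events into the index.
        if created:
            self.index[block_hash].add(pod)
        else:
            self.index[block_hash].discard(pod)

    def route(self, prefix_hashes, pods):
        # Step 3: KV blocks are only reusable as a contiguous prefix,
        # so count matches per pod until that pod's first miss.
        def cached_prefix_len(pod):
            n = 0
            for h in prefix_hashes:
                if pod not in self.index[h]:
                    break
                n += 1
            return n
        # Longest cached prefix wins; ties go to the least-loaded pod.
        return max(pods, key=lambda p: (cached_prefix_len(p), -self.load[p]))

router = PrefixCacheRouter()
router.on_kv_event("pod-a", "h1")
router.on_kv_event("pod-a", "h2")
router.on_kv_event("pod-b", "h1")
router.load["pod-a"] = 3
print(router.route(["h1", "h2"], ["pod-a", "pod-b"]))  # pod-a: more cached blocks beat lower load
```

The key design point is the ordering of the tie: cache affinity dominates, and load only breaks ties, because a cache miss forces a full prefill that is far more expensive than a slightly longer queue.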
Deployment guide (ACK cluster)
Prerequisites
ACK managed cluster with a GPU node pool
Gateway with Inference Extension v1.4.0‑apsara.3 or later
vLLM v0.10.0 or later
Step 1: Prepare model files
# Download model
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
# Upload to OSS
ossutil mkdir oss://YOUR_BUCKET_NAME/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://YOUR_BUCKET_NAME/Qwen3-32B
Step 2: Configure storage volume
# llm-model.yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: YOUR_OSS_AK
  akSecret: YOUR_OSS_SK
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: YOUR_BUCKET_NAME
      url: YOUR_BUCKET_ENDPOINT
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /Qwen3-32B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
Step 3: Deploy vLLM service
# vllm.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
  labels:
    app: qwen3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3
    spec:
      containers:
        - name: vllm
          image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
          command:
            - sh
            - -c
            - |
              vllm serve /models/Qwen3-32B \
                --served-model-name Qwen3-32B \
                --trust-remote-code \
                --port=8000 \
                --max-model-len 8192 \
                --gpu-memory-utilization 0.95 \
                --enforce-eager \
                --kv-events-config '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}' \
                --prefix-caching-hash-algo sha256_cbor_64bit \
                --block-size 64
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: PYTHONHASHSEED
              value: '42'
          ports:
            - containerPort: 8000
              name: restful
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model
              mountPath: /models/Qwen3-32B
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3
  labels:
    app: qwen3
spec:
  ports:
    - name: http-serving
      port: 8000
      targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP
Step 4: Configure precise‑mode routing policy
# inference-policy.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen3
---
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  profile:
    single:
      trafficPolicy:
        prefixCache:
          mode: tracking
          trackingConfig:
            indexerConfig: {}
            tokenProcessorConfig:
              blockSize: 64
              hashSeed: 42
              model: Qwen/Qwen3-32B
Note that blockSize and hashSeed must match the --block-size flag and PYTHONHASHSEED value in the vLLM deployment above, so that the gateway computes the same block hashes as the pods.
Step 5: Deploy gateway and route
# inference-gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway
  listeners:
    - name: http-llm
      protocol: HTTP
      port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
Performance validation
Two request files with identical prefixes were sent through the gateway. Logs confirmed that both requests were handled by the same pod, demonstrating successful cache reuse.
# round1.txt …
# round2.txt …
export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"
Benchmark results
In an 8‑pod cluster simulating a B2B SaaS workload (150 customers, each with a 6 k‑token system prompt, 5 concurrent users per customer), the precise‑mode routing achieved:
TTFT P90 of 0.542 s vs. >92 s for random scheduling (≈170× speedup)
Nearly 2× higher throughput
Almost no request queues, keeping the system healthy
Conclusion
Optimizing KV‑Cache hit rate is essential for cost‑effective LLM inference. By exposing KV‑Cache events, building a global cache index, and routing requests to the pod with the most relevant cached prefixes, ACK GIE transforms a distributed vLLM deployment from a cache‑blind system into a coordinated, high‑performance inference platform.