Boost LLM Inference with KV‑Cache‑Aware Routing on Alibaba Cloud ACK GIE

This article explains why KV‑Cache hit rate is critical for large‑model inference, describes vLLM's automatic prefix caching, outlines the distributed cache challenges, and provides a step‑by‑step guide to deploying Alibaba Cloud ACK Gateway with Inference Extension's precise‑mode prefix‑cache‑aware routing, backed by benchmark results.


Why KV‑Cache matters

Transformer‑based large language models compute attention between each new token and all previous tokens. During the "prefill" stage the entire prompt is processed this way, at a cost that grows quadratically with prompt length and dominates time‑to‑first‑token for long prompts. KV‑Cache stores the Key and Value vectors of already‑processed tokens in GPU memory, so later tokens can reuse them instead of recomputing them.
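
To make the reuse concrete, here is a tiny, framework-free sketch (an illustration only, not vLLM's implementation) of single-head attention in which each step computes Key/Value vectors only for the newest token and reuses everything already sitting in the cache:

# kv_cache_sketch.py (illustrative)
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []                       # the KV-Cache: one entry per processed token
for x in rng.normal(size=(10, d)):              # token embeddings arriving one at a time
    K_cache.append(Wk @ x)                      # only the NEW token's K/V are computed ...
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.asarray(K_cache), np.asarray(V_cache))   # ... earlier ones are reused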

vLLM Automatic Prefix Caching (APC)

vLLM's Automatic Prefix Caching (APC) detects identical request prefixes (for example, a shared system prompt) and reuses the corresponding KV‑Cache blocks instead of recomputing them. In a single‑node test with vLLM v0.10.0, first‑token latency dropped from 4.3 s to 0.6 s (≈7× faster) when the prompt prefix was already cached.
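
To try the effect outside Kubernetes, APC can be exercised through vLLM's offline Python API. A rough sketch (the prompt text is a placeholder, and a 32B model needs correspondingly large GPUs, so substitute a smaller model to run this on a single card):

# apc_demo.py (illustrative)
from vllm import LLM, SamplingParams

# enable_prefix_caching asks vLLM to hash and reuse KV blocks for shared prefixes
# (recent vLLM releases enable it by default).
llm = LLM(model="Qwen/Qwen3-32B", enable_prefix_caching=True)

system_prompt = "You are a support assistant for ExampleCorp. " * 200   # long shared prefix
params = SamplingParams(max_tokens=64)

# The first call pays the full prefill; the second reuses the cached prefix blocks,
# so its time-to-first-token is much lower.
llm.generate(system_prompt + "\nUser: What is my order status?", params)
llm.generate(system_prompt + "\nUser: How do I reset my password?", params)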

Distributed deployment challenge

When multiple vLLM replicas run behind a load balancer, each replica maintains its own independent KV‑Cache. Traditional load‑balancing algorithms are unaware of cache state, so requests with identical prefixes are scattered across pods, prefill work is repeated on every replica, and most of the benefit of APC is lost.
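
A back-of-the-envelope simulation makes the cost visible. Assume a pool of tenants that each reuse one long system prompt, and pods whose GPU memory can only cache a handful of those prompts at a time; the numbers below are illustrative, not measurements:

# toy_hit_rate.py (illustrative)
import random
from collections import OrderedDict

PODS, TENANTS, CAPACITY, REQUESTS = 8, 150, 20, 50_000

def route_random(tenant):
    return random.randrange(PODS)     # cache-blind: any pod is equally likely

def route_sticky(tenant):
    return tenant % PODS              # stand-in for cache-aware routing: pin each tenant to one pod

for name, route in [("cache-blind", route_random), ("prefix-affinity", route_sticky)]:
    caches = [OrderedDict() for _ in range(PODS)]   # LRU of cached tenant prefixes per pod
    hits = 0
    for _ in range(REQUESTS):
        tenant = random.randrange(TENANTS)
        cache = caches[route(tenant)]
        if tenant in cache:
            hits += 1
        cache[tenant] = True
        cache.move_to_end(tenant)
        if len(cache) > CAPACITY:
            cache.popitem(last=False)                # evict the least recently used prefix
    print(f"{name}: prefix hit rate {hits / REQUESTS:.0%}")

With these toy numbers, cache-blind routing hits a cached prefix on roughly one request in eight, while pinning each tenant to a pod hits on almost every request after the first.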

Precise‑mode prefix‑cache‑aware routing

ACK Gateway with Inference Extension (ACK GIE) solves this by collecting KV‑Cache events from each vLLM pod via ZeroMQ (supported from v0.10.0). The gateway builds a global index of KV blocks (hash, location, storage medium) and routes incoming requests to the pod that holds the most matching blocks, while also considering load metrics such as queue length and GPU utilization.
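
The transport is ordinary ZeroMQ publish/subscribe. Purely to illustrate the gateway-side plumbing (ACK GIE's inference extension does this for you), a subscriber could look like the sketch below; the exact framing and payload encoding of vLLM's KV events are not reproduced here, so the payload is treated as opaque and the topic layout is assumed from the Step 3 deployment:

# kv_event_subscriber.py (illustrative plumbing only)
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.bind("tcp://0.0.0.0:5557")                  # vLLM pods publish to this endpoint (see --kv-events-config)
sock.setsockopt_string(zmq.SUBSCRIBE, "kv@")     # topics look like "kv@<pod-ip>@<model>"

while True:
    frames = sock.recv_multipart()
    topic = frames[0].decode()
    pod_ip = topic.split("@")[1]
    payload = frames[-1]                         # vLLM-specific encoding of block stored/removed events
    # hand (pod_ip, payload) to the indexer that maintains the global block index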

Routing algorithm steps

1. KV event reporting: each vLLM pod publishes cache-creation, update, and deletion events.

2. Global index construction: the gateway records each block's hash, pod identifier, and storage type (GPU/CPU).

3. Intelligent routing decision: for a new request, compute the hash sequence of its prefix, query the index for matching pods, and pick the best pod given current load (see the sketch below).
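
A highly simplified sketch of steps 2 and 3, a global block index plus a score that trades cache overlap against load (the data structures and weighting are illustrative, not ACK GIE's actual implementation):

# prefix_router_sketch.py (illustrative)
import hashlib
from collections import defaultdict

BLOCK_SIZE = 64                                   # tokens per KV block; must match vLLM's --block-size

# Step 2: global index, block hash -> pods currently holding that block
block_index: dict[str, set[str]] = defaultdict(set)

def on_kv_event(pod: str, block_hash: str, stored: bool) -> None:
    # Step 1: apply a cache-creation or eviction event reported by a pod.
    (block_index[block_hash].add if stored else block_index[block_hash].discard)(pod)

def block_hashes(token_ids: list[int]) -> list[str]:
    # Chained hashes over full 64-token blocks; a simplified stand-in for
    # vLLM's prefix-caching hash (sha256_cbor_64bit in this guide's setup).
    hashes, parent = [], ""
    for i in range(0, len(token_ids) // BLOCK_SIZE * BLOCK_SIZE, BLOCK_SIZE):
        parent = hashlib.sha256((parent + str(token_ids[i:i + BLOCK_SIZE])).encode()).hexdigest()
        hashes.append(parent)
    return hashes

def pick_pod(token_ids: list[int], load: dict[str, float]) -> str:
    # Step 3: prefer the pod holding the longest cached prefix, penalised by its
    # current load (queue length, GPU utilisation); the weighting here is arbitrary.
    matches: dict[str, int] = defaultdict(int)
    for h in block_hashes(token_ids):
        holders = block_index[h]
        if not holders:
            break                                  # the shared prefix ends at the first unseen block
        for pod in holders:
            matches[pod] += 1
    return max(load, key=lambda p: matches[p] - load[p])

Here on_kv_event keeps the index current as pods report cache changes, while pick_pod runs once per incoming request after the gateway has tokenized and hashed its prefix.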

Deployment guide (ACK cluster)

Prerequisites

ACK managed cluster with a GPU node pool

Gateway with Inference Extension v1.4.0‑apsara.3 or later

vLLM v0.10.0 or later

Step 1: Prepare model files

# Download model
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
cd ..
# Upload to OSS
ossutil mkdir oss://YOUR_BUCKET_NAME/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://YOUR_BUCKET_NAME/Qwen3-32B

Step 2: Configure storage volume

# llm-model.yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: YOUR_OSS_AK
  akSecret: YOUR_OSS_SK
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: YOUR_BUCKET_NAME
      url: YOUR_BUCKET_ENDPOINT
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /Qwen3-32B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Step 3: Deploy vLLM service

# vllm.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
  labels:
    app: qwen3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3
    spec:
      containers:
      - name: vllm
        image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
        command:
        - sh
        - -c
        - |
          vllm serve /models/Qwen3-32B \
            --served-model-name Qwen3-32B \
            --trust-remote-code \
            --port=8000 \
            --max-model-len 8192 \
            --gpu-memory-utilization 0.95 \
            --enforce-eager \
            --kv-events-config '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}' \
            --prefix-caching-hash-algo sha256_cbor_64bit \
            --block-size 64
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: PYTHONHASHSEED
          value: '42'
        ports:
        - containerPort: 8000
          name: restful
        resources:
          limits:
            nvidia.com/gpu: '1'
          requests:
            nvidia.com/gpu: '1'
        volumeMounts:
        - name: model
          mountPath: /models/Qwen3-32B
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3
  labels:
    app: qwen3
spec:
  ports:
  - name: http-serving
    port: 8000
    targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP

Step 4: Configure precise‑mode routing policy

# inference-policy.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen3
---
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  profile:
    single:
      trafficPolicy:
        prefixCache:
          mode: tracking
          trackingConfig:
            indexerConfig: {}
            tokenProcessorConfig:
              blockSize: 64
              hashSeed: 42
              model: Qwen/Qwen3-32B
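
A detail worth double-checking: the tokenProcessorConfig mirrors the vLLM flags from Step 3 (blockSize matches --block-size, hashSeed matches PYTHONHASHSEED), and presumably it must, or the gateway and the pods would hash blocks differently and the index would never match. A small sanity-check sketch, assuming vllm.yaml and inference-policy.yaml from the steps above are in the working directory:

# check_routing_config.py (illustrative sanity check)
import re
import yaml   # pip install pyyaml

policy = next(d for d in yaml.safe_load_all(open("inference-policy.yaml"))
              if d.get("kind") == "InferenceTrafficPolicy")
pc = policy["spec"]["profile"]["single"]["trafficPolicy"]["prefixCache"]
tp = pc["trackingConfig"]["tokenProcessorConfig"]

vllm_yaml = open("vllm.yaml").read()
block_size = int(re.search(r"--block-size (\d+)", vllm_yaml).group(1))
hash_seed = int(re.search(r"name: PYTHONHASHSEED\s+value: '(\d+)'", vllm_yaml).group(1))

assert tp["blockSize"] == block_size, "blockSize must match vLLM --block-size"
assert tp["hashSeed"] == hash_seed, "hashSeed must match PYTHONHASHSEED"
print("routing policy and vLLM deployment agree on block size and hash seed")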

Step 5: Deploy gateway and route

# inference-gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

Performance validation

Two request files with identical prefixes were sent through the gateway. Logs confirmed that both requests were handled by the same pod, demonstrating successful cache reuse.

# round1.txt …
# round2.txt …
export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"
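
If you would rather build the two requests in code than in the round1.txt/round2.txt files, the pattern is simply a long shared system prompt with two different user questions. A rough sketch (the prompt text is a placeholder; model name, port, and path follow the deployment above):

# send_shared_prefix.py (illustrative)
import json
import os
import urllib.request

GATEWAY = os.environ.get("GATEWAY_IP", "127.0.0.1")          # from: kubectl get gateway/inference-gateway ...
SYSTEM_PROMPT = "You are the support bot for ExampleCorp. " * 300   # long shared prefix (placeholder)

def chat(question: str) -> str:
    body = json.dumps({
        "model": "Qwen3-32B",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    }).encode()
    req = urllib.request.Request(
        f"http://{GATEWAY}:8080/v1/chat/completions",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# The second call should be routed to the same pod and hit the cached prefix.
print(chat("What are your support hours?"))
print(chat("How do I file a ticket?"))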

Benchmark results

In an 8‑pod cluster simulating a B2B SaaS workload (150 customers, each with a 6 k‑token system prompt, 5 concurrent users per customer), the precise‑mode routing achieved:

TTFT P90 of 0.542 s vs. >92 s for random scheduling (≈170× speedup)

Nearly 2× higher throughput

Almost no request queueing, keeping the system healthy

Conclusion

Optimizing KV‑Cache hit rate is essential for cost‑effective LLM inference. By exposing KV‑Cache events, building a global cache index, and routing requests to the pod with the most relevant cached prefixes, ACK GIE transforms a distributed vLLM deployment from a cache‑blind system into a coordinated, high‑performance inference platform.

Tags: Performance optimization, LLM, Kubernetes, vLLM, Alibaba Cloud, Inference, KV cache