Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes
This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases for both LoRA fine‑tuned models and base models such as QwQ‑32B and DeepSeek‑R1, including step‑by‑step commands and validation procedures.
ACK Gateway with AI Extension is a component designed for LLM inference scenarios. It supports Layer 4 and Layer 7 traffic routing, plus load balancing based on model server load. Through custom resources (CRDs) such as InferencePool and InferenceModel, it lets you define flexible traffic distribution policies for inference services, including gray releases of LoRA models and of base models.
LoRA Model Gray Release Scenario
Low‑Rank Adaptation (LoRA) is a popular fine‑tuning technique for large language models. Multiple LoRA adapters can be served on top of one base model and share its GPU resources (Multi‑LoRA). By deploying LLM inference services on a Kubernetes cluster and using ACK Gateway with AI Extension, you can define traffic distribution policies for different LoRA models and so gray‑test fine‑tuned adapters.
Prerequisites
GPU‑enabled Kubernetes cluster (see reference [1]).
At least one ecs.gn7i-c8g1.2xlarge (1×A10) node; the example uses two such nodes.
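Before deploying, it can help to confirm that the cluster actually advertises GPUs. The following is a minimal sketch, assuming kubectl points at the target cluster and that the nodes expose the standard nvidia.com/gpu resource (the same resource name the manifests below request). The helper names list_gpu_nodes and count_gpu_nodes are hypothetical and not part of ACK.

```shell
# List each node with its allocatable GPU count.
# Nodes without GPUs show <none> in the second column.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    --no-headers
}

# Count the nodes that report at least one allocatable GPU.
count_gpu_nodes() {
  list_gpu_nodes | awk '$2 ~ /^[0-9]+$/ && $2 > 0 { n++ } END { print n + 0 }'
}

# Example (run against a live cluster):
# count_gpu_nodes
```

For the two-node setup described above, count_gpu_nodes would be expected to print 2.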
Step 1: Deploy Example LLM Inference Service
Deploy a Llama‑2 model with ten LoRA adapters (sql‑lora to sql‑lora‑4 and tweet‑summary to tweet‑summary‑4) using the following manifest:
kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
      {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
      {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
      {% endif %}
      {% if loop.index0 == 0 %}
        {% set content = system_message + message['content'] %}
      {% else %}
        {% set content = message['content'] %}
      {% endif %}
      {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
      {% elif message['role'] == 'assistant' %}
        {{ ' ' + content | trim + ' ' + eos_token }}
      {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
      - name: lora
        image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
        imagePullPolicy: IfNotPresent
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "/model/llama2"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--gpu_memory_utilization"
        - "0.8"
        - "--enable-lora"
        - "--max-loras"
        - "10"
        - "--max-cpu-loras"
        - "12"
        - "--lora-modules"
        - "sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0"
        - "sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1"
        - "sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2"
        - "sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3"
        - "sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4"
        - "tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
        - "tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1"
        - "tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2"
        - "tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3"
        - "tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4"
        - "--chat-template"
        - "/etc/vllm/llama-2-chat.jinja"
        env:
        - name: PORT
          value: "8000"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        livenessProbe:
          failureThreshold: 2400
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6000
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - mountPath: /etc/vllm
          name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: chat-template
        configMap:
          name: chat-template
EOF
Step 2: Configure LoRA Model Gray Release with ACK Gateway
Enable the ACK Gateway with AI Extension component in the cluster’s component management, turn on the “Enable Gateway API inference extension” option, and create a Gateway instance:
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
EOF
Then create InferencePool and InferenceModel resources to define the traffic split (for example, 50% tweet-summary and 50% sql-lora):
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/attach-to: |
      name: inference-gateway
      port: 8081
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: lora-request
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary
    weight: 50
  - name: sql-lora
    weight: 50
EOF
Validate the gray release by sending requests to the gateway and observing roughly equal traffic distribution between the two LoRA models.
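The validation described above could be sketched as follows. This is an illustration under stated assumptions, not a definitive procedure: it assumes the gateway exposes the OpenAI-compatible /v1/completions path on port 8081, and that the serving backend echoes the name of the model that actually handled the request in the response's "model" field. The helper names tally_models and send_requests are hypothetical.

```shell
# Read response bodies on stdin and count how often each value of the
# "model" field appears. Plain grep/sed is used so no JSON tool is required.
tally_models() {
  grep -o '"model": *"[^"]*"' \
    | sed 's/^"model": *"//; s/"$//' \
    | sort | uniq -c | sort -rn
}

# Send <count> completion requests for the logical model name through the gateway.
# Usage: send_requests <gateway-ip> <count> [model-name]
send_requests() {
  gw_ip=$1; req_count=$2; req_model=${3:-lora-request}
  i=0
  while [ "$i" -lt "$req_count" ]; do
    curl -s "http://${gw_ip}:8081/v1/completions" \
      -H 'Content-Type: application/json' \
      -d "{\"model\": \"${req_model}\", \"prompt\": \"Hello\", \"max_tokens\": 16}"
    echo   # one response per line
    i=$((i + 1))
  done
}

# Example (run against a live cluster):
# GATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')
# send_requests "$GATEWAY_IP" 20 | tally_models
```

With a 50/50 split, the two counts printed by tally_models should be close to each other over a reasonably large number of requests, though not necessarily identical.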
Base Model Gray Release Scenario
In a Multi‑LoRA architecture, you can also gray‑release between different base models. This guide demonstrates gray‑releasing between QwQ‑32B and DeepSeek‑R1‑Distill‑Qwen‑7B.
QwQ‑32B Model Overview
QwQ‑32B is a 32‑billion‑parameter LLM from Alibaba Cloud, with performance reported as comparable to DeepSeek‑R1 (671B). It supports BF16 precision and can run on a node with 4×A10 GPUs.
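A rough memory estimate shows why a multi-GPU node is used here. The sketch below takes the 32 billion parameters implied by the model name, 2 bytes per parameter for BF16, and approximately 24 GB of memory per NVIDIA A10. It counts weights only and ignores KV cache and activation overhead, so the numbers are a lower bound rather than a sizing guarantee.

```shell
# Back-of-the-envelope GPU memory check for serving QwQ-32B in BF16.
# All sizes are in GB and rounded to integers.
params_billion=32      # parameter count, in billions
bytes_per_param=2      # BF16 stores each parameter in 2 bytes
gpus=4                 # GPUs on the node (4 x A10)
gb_per_gpu=24          # approximate memory per NVIDIA A10

weights_gb=$((params_billion * bytes_per_param))   # memory for the weights alone
total_gb=$((gpus * gb_per_gpu))                    # aggregate GPU memory on the node
headroom_gb=$((total_gb - weights_gb))             # left for KV cache and activations

echo "weights=${weights_gb}GB total=${total_gb}GB headroom=${headroom_gb}GB"
```

The weights alone (about 64 GB) exceed a single A10, which is consistent with sharding the model across four GPUs using tensor parallelism.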
DeepSeek‑R1‑Distill‑Qwen‑7B Model Overview
DeepSeek‑R1‑Distill‑Qwen‑7B is a 7‑billion‑parameter model distilled from DeepSeek‑R1 (671B) with strong math, code, and reasoning capabilities.
Step 1: Deploy QwQ‑32B and DeepSeek‑R1 Models
Prepare GPU nodes (4×A10 for QwQ‑32B, 1×A10 for DeepSeek‑R1), clone the model repositories, and upload them to OSS. Configure PV/PVC for OSS storage as described in the documentation.
Deploy the models using the following manifests:
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: custom-serving
    release: qwq-32b
  name: qwq-32b
spec:
  replicas: 5
  selector:
    matchLabels:
      app: custom-serving
      release: qwq-32b
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: custom-serving
        release: qwq-32b
    spec:
      containers:
      - command:
        - sh
        - -c
        - vllm serve /model/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel=4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        env:
        - name: ARENA_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.7.2
        imagePullPolicy: IfNotPresent
        name: custom-serving
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "4"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /model/QwQ-32B
          name: llm-model
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 30Gi
        name: dshm
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: custom-serving
    release: qwq-32b
  name: qwq-32b
spec:
  ports:
  - name: http-serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: custom-serving
    release: qwq-32b
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: custom-serving
    release: deepseek-r1
  name: deepseek-r1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: custom-serving
      release: deepseek-r1
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: custom-serving
        release: deepseek-r1
    spec:
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/DeepSeek-R1-1.5B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 8192 --gpu-memory-utilization 0.9 --enforce-eager
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:v0.7.2
        name: vllm
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/DeepSeek-R1-1.5B
          name: model
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model-ds
      - emptyDir:
          medium: Memory
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1
spec:
  type: ClusterIP
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: custom-serving
    release: deepseek-r1
EOF
Step 2: Configure Model Gray Release with ACK Gateway
Enable the ACK Gateway with AI Extension component, create a Gateway instance (port 8081), and define InferencePool and InferenceModel resources that split traffic 50/50 between QwQ‑32B and DeepSeek‑R1:
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
EOF
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/attach-to: |
      name: inference-gateway
      port: 8081
  name: reasoning-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: inference-gateway-ext-proc
  selector:
    app: custom-serving
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  criticality: Critical
  modelName: qwq
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: reasoning-pool
  targetModels:
  - name: qwq-32b
    weight: 50
  - name: deepseek-r1
    weight: 50
EOF
Validate the gray release by sending chat completion requests to the gateway and confirming that both models respond with roughly equal frequency.
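Once the 50/50 split has been confirmed, a gray release usually proceeds by shifting the weights step by step. The sketch below shows one way this might be done with a JSON merge patch against the InferenceModel defined above. It assumes the weights are treated as relative shares and that the gateway extension picks up changes to the resource without further action; neither point is confirmed by this guide. The helper names weights_patch and shift_traffic are hypothetical.

```shell
# Print a JSON merge patch that sets the weights of two target models.
# Usage: weights_patch <model-a> <weight-a> <model-b> <weight-b>
weights_patch() {
  printf '{"spec":{"targetModels":[{"name":"%s","weight":%d},{"name":"%s","weight":%d}]}}\n' \
    "$1" "$2" "$3" "$4"
}

# Apply the patch to the InferenceModel named inferencemodel-sample.
# A merge patch replaces the whole targetModels list and leaves other fields as they are.
shift_traffic() {
  kubectl patch inferencemodel inferencemodel-sample \
    --type merge -p "$(weights_patch "$@")"
}

# Example (run against a live cluster): send 10% to qwq-32b and 90% to deepseek-r1.
# shift_traffic qwq-32b 10 deepseek-r1 90
```

Repeating the validation after each weight change gives a simple check that the observed distribution follows the configured ratio before moving to the next step.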
Summary
Using ACK Gateway with AI Extension, you can achieve intelligent routing, load balancing, and gray release for both LoRA fine‑tuned models and base LLMs in a cloud‑native Kubernetes environment, providing a flexible and efficient solution for large‑scale AI inference workloads.