Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes
This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster and how to configure ACK Gateway with AI Extension for intelligent routing and load balancing. It then walks through gray releases for both LoRA fine-tuned models and base models such as QwQ-32B and DeepSeek-R1, with step-by-step commands and validation procedures.
ACK Gateway with AI Extension is a component designed for LLM inference scenarios. It supports Layer 4/Layer 7 traffic routing and load balancing based on model server load, and enables flexible traffic distribution strategies for inference services through custom resources (CRDs) such as InferencePool and InferenceModel, allowing gray releases of both LoRA models and base models.
LoRA Model Gray Release Scenario
Low‑Rank Adaptation (LoRA) is a popular fine‑tuning technique for large language models that allows multiple LoRA adapters to share GPU resources (Multi‑LoRA). By deploying LLM inference services on a Kubernetes cluster and using ACK Gateway with AI Extension, you can define traffic distribution policies for different LoRA models, enabling gray testing of fine‑tuned adapters.
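In vLLM's OpenAI-compatible server, a request selects a LoRA adapter simply by naming it in the `model` field; the shared base model serves the weights for all adapters. A minimal sketch, assuming the inference Deployment from this guide is port-forwarded to `localhost:8000`:

```shell
# Forward the example Deployment's port locally (assumed Deployment name).
kubectl port-forward deployment/vllm-llama2-7b-pool 8000:8000 &

# Request the tweet-summary adapter; switching the "model" value to
# "sql-lora" would route the same request to a different adapter.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "tweet-summary", "prompt": "Summarize this tweet thread.", "max_tokens": 32}'
```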
Prerequisites
GPU‑enabled Kubernetes cluster (see reference [1]).
At least one ecs.gn7i-c8g1.2xlarge (1×A10) node; the example uses two such nodes.
Step 1: Deploy Example LLM Inference Service
Deploy a Llama‑2 model with ten LoRA adapters (sql‑lora to sql‑lora‑4 and tweet‑summary to tweet‑summary‑4) using the following manifest:
kubectl apply -f- <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
  namespace: default
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
    {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
    {% set messages = messages[1:] %}
    {% else %}
    {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
    {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if loop.index0 == 0 %}
    {% set content = system_message + message['content'] %}
    {% else %}
    {% set content = message['content'] %}
    {% endif %}
    {% if message['role'] == 'user' %}
    {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
    {% elif message['role'] == 'assistant' %}
    {{ ' ' + content | trim + ' ' + eos_token }}
    {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
      - name: lora
        image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
        imagePullPolicy: IfNotPresent
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "/model/llama2"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--gpu_memory_utilization"
        - "0.8"
        - "--enable-lora"
        - "--max-loras"
        - "10"
        - "--max-cpu-loras"
        - "12"
        - "--lora-modules"
        - "sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0"
        - "sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1"
        - "sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2"
        - "sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3"
        - "sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4"
        - "tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
        - "tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1"
        - "tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2"
        - "tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3"
        - "tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4"
        - "--chat-template"
        - "/etc/vllm/llama-2-chat.jinja"
        env:
        - name: PORT
          value: "8000"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        livenessProbe:
          failureThreshold: 2400
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6000
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - mountPath: /etc/vllm
          name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: chat-template
        configMap:
          name: chat-template
EOF
Step 2: Configure LoRA Model Gray Release with ACK Gateway
Enable the ACK Gateway with AI Extension component in the cluster’s component management, turn on the “Enable Gateway API inference extension” option, and create a Gateway instance:
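A minimal Gateway sketch follows, using the standard Gateway API v1 resource. The GatewayClass name `ack-gateway` is an assumption for illustration; substitute the class that the ACK component installs in your cluster.

```shell
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: default
spec:
  gatewayClassName: ack-gateway   # assumed class name; check "kubectl get gatewayclass"
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
EOF
```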
Then create InferencePool and InferenceModel resources to define the traffic split (for example, 50% tweet-summary and 50% sql-lora):
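A sketch of the two resources is shown below. The field names follow the upstream Gateway API Inference Extension v1alpha2 CRDs; the ACK distribution may differ slightly, and the `extensionRef` service name is an assumption.

```shell
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  targetPortNumber: 8000               # the vLLM container port
  selector:
    app: vllm-llama2-7b-pool           # matches the Deployment's pod labels
  extensionRef:
    name: inference-gateway-ext-proc   # assumed endpoint-picker service name
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: lora-gray-release
  namespace: default
spec:
  modelName: tweet-summary             # the model name clients request
  criticality: Standard
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:                        # weighted split across LoRA adapters
  - name: tweet-summary
    weight: 50
  - name: sql-lora
    weight: 50
EOF
```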
Validate the gray release by sending requests to the gateway and observing roughly equal traffic distribution between the two LoRA models.
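One way to check the split, assuming the Gateway is named `inference-gateway` and exposes port 8080, is to repeat a request and tally which adapter the response reports (the `grep` pattern assumes compact JSON output):

```shell
# Resolve the gateway address from the Gateway's status.
GATEWAY_IP=$(kubectl get gateway inference-gateway \
  -o jsonpath='{.status.addresses[0].value}')

# Send 20 identical requests; the extension rewrites the requested model
# to one of the weighted targets, so counts should be roughly 50/50.
for i in $(seq 1 20); do
  curl -s "http://${GATEWAY_IP}:8080/v1/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "tweet-summary", "prompt": "Write one sentence.", "max_tokens": 16}' \
    | grep -o '"model":"[^"]*"'
done | sort | uniq -c
```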
Base Model Gray Release Scenario
In a Multi‑LoRA architecture, you can also gray‑release between different base models. This guide demonstrates gray‑releasing between QwQ‑32B and DeepSeek‑R1‑Distill‑Qwen‑7B.
QwQ‑32B Model Overview
QwQ-32B is a 32-billion-parameter LLM from Alibaba Cloud, reported to be comparable to DeepSeek-R1 671B in performance. It supports BF16 precision and runs on a 4×A10 GPU node.
DeepSeek‑R1‑Distill‑Qwen‑7B Model Overview
DeepSeek‑R1‑Distill‑Qwen‑7B is a 7‑billion‑parameter model distilled from DeepSeek‑R1 (671B) with strong math, code, and reasoning capabilities.
Step 1: Deploy QwQ‑32B and DeepSeek‑R1 Models
Prepare GPU nodes (4×A10 for QwQ‑32B, 1×A10 for DeepSeek‑R1), clone the model repositories, and upload them to OSS. Configure PV/PVC for OSS storage as described in the documentation.
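A static PV/PVC sketch for mounting the OSS bucket via the ACK CSI plugin is shown below. The bucket name, endpoint, and path are placeholders, and the mount options may need adjustment for your environment; the access-key Secret (`oss-secret`) must be created separately.

```shell
kubectl apply -f- <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model-pv
  labels:
    alicloud-pvname: llm-model-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model-pv
    nodePublishSecretRef:
      name: oss-secret                          # Secret holding akId/akSecret
      namespace: default
    volumeAttributes:
      bucket: "example-model-bucket"            # placeholder bucket name
      url: "oss-cn-hangzhou-internal.aliyuncs.com"  # placeholder region endpoint
      path: "/models"                           # prefix where models were uploaded
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model-pvc
  namespace: default
spec:
  accessModes: ["ReadOnlyMany"]
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model-pv
EOF
```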
Deploy the models using the following manifests:
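The deployments can be sketched as below for QwQ-32B; the image, model path, and PVC name are assumptions for illustration. The DeepSeek-R1-Distill-Qwen-7B Deployment is analogous, with `--tensor-parallel-size 1` and `nvidia.com/gpu: 1` on the 1×A10 node.

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest          # placeholder image
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "/models/QwQ-32B"                     # path on the mounted OSS volume
        - "--served-model-name"
        - "qwq-32b"
        - "--tensor-parallel-size"
        - "4"                                   # spans the 4 A10 GPUs on the node
        - "--port"
        - "8000"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 4
        volumeMounts:
        - mountPath: /models
          name: models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: llm-model-pvc              # assumed OSS-backed PVC name
EOF
```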
Step 2: Configure Model Gray Release with ACK Gateway
Enable the ACK Gateway with AI Extension component, create a Gateway instance (port 8081), and define InferencePool and InferenceModel resources that split traffic 50/50 between QwQ‑32B and DeepSeek‑R1:
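One way to express a 50/50 split across two base models is an HTTPRoute whose weighted backendRefs point at one InferencePool per model; this is a sketch, and the pool names, Gateway name, and exact resource shape used by the ACK documentation may differ.

```shell
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-gray-release
  namespace: default
spec:
  parentRefs:
  - name: inference-gateway          # assumed Gateway name, listening on 8081
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwq-32b-pool             # assumed pool for the QwQ-32B backend
      weight: 50
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: deepseek-r1-7b-pool      # assumed pool for the DeepSeek backend
      weight: 50
EOF
```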
Validate the gray release by sending chat completion requests to the gateway and confirming that both models respond with roughly equal frequency.
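As in the LoRA scenario, a simple tally of the `model` field in repeated responses shows the split; this assumes the Gateway is named `inference-gateway`, listens on 8081, and returns compact JSON:

```shell
GATEWAY_IP=$(kubectl get gateway inference-gateway \
  -o jsonpath='{.status.addresses[0].value}')

# Count which served model answered each of 20 chat requests;
# a 50/50 policy should yield roughly even counts.
for i in $(seq 1 20); do
  curl -s "http://${GATEWAY_IP}:8081/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwq-32b", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 8}' \
    | grep -o '"model":"[^"]*"'
done | sort | uniq -c
```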
Summary
Using ACK Gateway with AI Extension, you can achieve intelligent routing, load balancing, and gray release for both LoRA fine‑tuned models and base LLMs in a cloud‑native Kubernetes environment, providing a flexible and efficient solution for large‑scale AI inference workloads.