
Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes

This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases of both LoRA fine-tuned models and base models such as QwQ-32B and DeepSeek-R1, with step-by-step commands and validation procedures.


ACK Gateway with AI Extension is a component designed for LLM inference scenarios. It supports Layer 4/Layer 7 traffic routing and load balancing based on model server load, and it enables flexible traffic distribution strategies for inference services through custom resources (CRDs) such as InferencePool and InferenceModel, making both LoRA model gray releases and base model gray releases possible.

LoRA Model Gray Release Scenario

Low‑Rank Adaptation (LoRA) is a popular fine‑tuning technique for large language models that allows multiple LoRA adapters to share GPU resources (Multi‑LoRA). By deploying LLM inference services on a Kubernetes cluster and using ACK Gateway with AI Extension, you can define traffic distribution policies for different LoRA models, enabling gray testing of fine‑tuned adapters.

Prerequisites

GPU‑enabled Kubernetes cluster (see reference [1]).

ecs.gn7i-c8g1.2xlarge nodes (one A10 GPU each); the example Deployment below runs three single-GPU replicas, so plan for three such nodes.

Step 1: Deploy Example LLM Inference Service

Deploy a Llama‑2 model with ten LoRA adapters (sql‑lora to sql‑lora‑4 and tweet‑summary to tweet‑summary‑4) using the following manifest:

kubectl apply -f- <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
  namespace: default
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
      {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
      {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
      {% endif %}
      {% if loop.index0 == 0 %}
        {% set content = system_message + message['content'] %}
      {% else %}
        {% set content = message['content'] %}
      {% endif %}
      {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
      {% elif message['role'] == 'assistant' %}
        {{ ' ' + content | trim + ' ' + eos_token }}
      {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
      - name: lora
        image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
        imagePullPolicy: IfNotPresent
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "/model/llama2"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--gpu_memory_utilization"
        - "0.8"
        - "--enable-lora"
        - "--max-loras"
        - "10"
        - "--max-cpu-loras"
        - "12"
        - "--lora-modules"
        - "sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0"
        - "sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1"
        - "sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2"
        - "sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3"
        - "sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4"
        - "tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
        - "tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1"
        - "tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2"
        - "tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3"
        - "tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4"
        - "--chat-template"
        - "/etc/vllm/llama-2-chat.jinja"
        env:
        - name: PORT
          value: "8000"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        livenessProbe:
          failureThreshold: 2400
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6000
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - mountPath: /etc/vllm
          name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: chat-template
        configMap:
          name: chat-template
EOF

Step 2: Configure LoRA Model Gray Release with ACK Gateway

Enable the ACK Gateway with AI Extension component in the cluster's component management, turn on the "Enable Gateway API inference extension" option, and then create a Gateway instance with kubectl apply.
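The Gateway manifest itself did not survive extraction. As a minimal sketch using the standard Kubernetes Gateway API (the Gateway name, gatewayClassName, and listener port here are assumptions, not values from the original article):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: default
spec:
  gatewayClassName: inference-gateway   # class installed by the component; name is an assumption
  listeners:
  - name: http
    port: 8080                          # listener port is an assumption
    protocol: HTTP
```

Save this as a file and apply it with kubectl apply -f, or feed it to kubectl apply -f- via a heredoc as in the earlier steps.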

Then create InferencePool and InferenceModel resources with kubectl apply to define the traffic split (for example, 50% tweet-summary and 50% sql-lora).
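The original resource manifests were lost in extraction. A sketch using the upstream Gateway API Inference Extension CRDs (the inference.networking.x-k8s.io/v1alpha2 API group and the extensionRef name are assumptions; ACK's managed CRDs may differ in detail) could look like:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool             # matches the Deployment's pod label
  extensionRef:
    name: vllm-llama2-7b-pool-ext-proc   # endpoint-picker Service name; an assumption
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary   # model name clients put in the request body
  criticality: Standard
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:              # weights steer requests across LoRA adapters
  - name: tweet-summary
    weight: 50
  - name: sql-lora
    weight: 50
```

With this in place, requests for modelName tweet-summary are rewritten by the gateway to one of the two LoRA adapters according to the weights.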

Validate the gray release by sending requests to the gateway and observing roughly equal traffic distribution between the two LoRA models.
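One way to eyeball the split, assuming the Gateway is named inference-gateway and listens on port 8080 (both assumptions from the sketch above, adjust to your setup), is to send a batch of requests and tally the model field that vLLM echoes back in each response:

```shell
# Gateway name and port are assumptions; adjust to your Gateway instance.
GATEWAY_IP=$(kubectl get gateway inference-gateway \
  -o jsonpath='{.status.addresses[0].value}')
for i in $(seq 1 20); do
  curl -s "http://${GATEWAY_IP}:8080/v1/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "tweet-summary", "prompt": "Summarize this tweet thread.", "max_tokens": 32}' \
    | grep -o '"model":[^,]*'
done | sort | uniq -c
```

With a 50/50 weight configuration, the tally should show roughly equal counts for the two adapter names.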

Base Model Gray Release Scenario

Beyond LoRA adapters, ACK Gateway with AI Extension can also gray-release traffic between different base models. This guide demonstrates gray-releasing between QwQ-32B and DeepSeek-R1-Distill-Qwen-7B.

QwQ‑32B Model Overview

QwQ-32B is a 32-billion-parameter LLM from Alibaba Cloud that delivers performance comparable to the 671B-parameter DeepSeek-R1 on many benchmarks. It supports BF16 precision and can run on a single 4×A10 GPU node.

DeepSeek‑R1‑Distill‑Qwen‑7B Model Overview

DeepSeek‑R1‑Distill‑Qwen‑7B is a 7‑billion‑parameter model distilled from DeepSeek‑R1 (671B) with strong math, code, and reasoning capabilities.

Step 1: Deploy QwQ‑32B and DeepSeek‑R1 Models

Prepare GPU nodes (4×A10 for QwQ‑32B, 1×A10 for DeepSeek‑R1), clone the model repositories, and upload them to OSS. Configure PV/PVC for OSS storage as described in the documentation.
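As a sketch of the OSS storage step, a statically provisioned PV/PVC pair using ACK's OSS CSI driver (ossplugin.csi.alibabacloud.com) might look like the following; the bucket name, endpoint, secret name, path, and sizes are all assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:        # Secret holding the OSS AccessKey; name is an assumption
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: my-llm-models                            # bucket name is an assumption
      url: oss-cn-hangzhou-internal.aliyuncs.com       # region endpoint is an assumption
      path: /models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes: ["ReadOnlyMany"]
  resources:
    requests:
      storage: 50Gi
  volumeName: llm-model
```

Mounting the model files read-only from OSS keeps the container images small and lets multiple replicas share one copy of the weights.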

Deploy each model as a vLLM-based Deployment (plus an accompanying Service) with kubectl apply.
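The deployment manifests were not captured in this extraction. As an illustrative sketch for the DeepSeek-R1-Distill-Qwen-7B side (the container image, model path, served model name, and PVC name are assumptions), a vLLM Deployment could look like the following; the QwQ-32B Deployment is analogous but uses --tensor-parallel-size 4 and nvidia.com/gpu: 4 on a 4×A10 node:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-qwen-7b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-distill-qwen-7b
  template:
    metadata:
      labels:
        app: deepseek-r1-distill-qwen-7b
    spec:
      containers:
      - name: vllm
        # Public vLLM image as a stand-in; the original article may use an
        # Alibaba Cloud registry image instead.
        image: vllm/vllm-openai:v0.7.2
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "/models/DeepSeek-R1-Distill-Qwen-7B"   # path inside the OSS mount; an assumption
        - "--served-model-name"
        - "deepseek-r1-distill-qwen-7b"
        - "--port"
        - "8000"
        - "--gpu-memory-utilization"
        - "0.9"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /models
          name: models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: llm-model   # PVC bound to the OSS-backed PV; name is an assumption
```

The --served-model-name flag fixes the name the server advertises, which is what the gateway's traffic-split targets refer to.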

Step 2: Configure Model Gray Release with ACK Gateway

Enable the ACK Gateway with AI Extension component, create a Gateway instance listening on port 8081, and define InferencePool and InferenceModel resources with kubectl apply that split traffic 50/50 between QwQ-32B and DeepSeek-R1-Distill-Qwen-7B.
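The manifests were lost in extraction. A sketch under the upstream Inference Extension CRDs (API group, pool/model names, the shared pod label, and the unified modelName are all assumptions; both base-model Deployments are assumed to carry the shared label) might be:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
spec:
  targetPortNumber: 8000
  selector:
    app: llm-pool             # shared label assumed on both model Deployments
  extensionRef:
    name: llm-pool-ext-proc   # endpoint-picker name is an assumption
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model
spec:
  modelName: chat-model       # unified name clients request; an assumption
  poolRef:
    name: llm-pool
  targetModels:
  - name: qwq-32b                        # served model name of the QwQ-32B Deployment
    weight: 50
  - name: deepseek-r1-distill-qwen-7b    # served model name of the DeepSeek Deployment
    weight: 50
```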

Validate the gray release by sending chat completion requests to the gateway and confirming that both models respond with roughly equal frequency.
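Assuming the Gateway listens on port 8081 and the unified model name configured in the InferenceModel resource is chat-model (both assumptions), a validation request looks like:

```shell
GATEWAY_IP=$(kubectl get gateway inference-gateway \
  -o jsonpath='{.status.addresses[0].value}')
curl -s "http://${GATEWAY_IP}:8081/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model": "chat-model", "messages": [{"role": "user", "content": "Which model are you?"}]}'
```

Repeating the request and inspecting the model field in each response should show the two base models answering with roughly equal frequency.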

Summary

Using ACK Gateway with AI Extension, you can achieve intelligent routing, load balancing, and gray release for both LoRA fine‑tuned models and base LLMs in a cloud‑native Kubernetes environment, providing a flexible and efficient solution for large‑scale AI inference workloads.

Tags: cloud native, LLM, Kubernetes, LoRA, AI inference, ACK Gateway, model gray release
Written by

Alibaba Cloud Infrastructure
