
Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes

This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases for both LoRA fine‑tuned models and base models such as QwQ‑32B and DeepSeek‑R1, including step‑by‑step commands and validation procedures.

Alibaba Cloud Infrastructure

ACK Gateway with AI Extension is a component designed for LLM inference scenarios. It supports Layer 4 and Layer 7 traffic routing, plus load balancing that takes model server load into account. Through custom resources (CRDs) such as InferencePool and InferenceModel, you can define flexible traffic distribution policies for inference services, including gray releases of LoRA models and of base models.

LoRA Model Gray Release Scenario

Low-Rank Adaptation (LoRA) is a popular fine-tuning technique for large language models. Multiple LoRA adapters can be loaded onto the same base model and share its GPU resources (Multi-LoRA). By deploying LLM inference services on a Kubernetes cluster and using ACK Gateway with AI Extension, you can define traffic distribution policies across different LoRA models, which enables gray-release testing of fine-tuned adapters.

Prerequisites

GPU‑enabled Kubernetes cluster (see reference [1]).

At least one ecs.gn7i-c8g1.2xlarge (1×A10) node; the example uses two such nodes.

Step 1: Deploy Example LLM Inference Service

Deploy a Llama-2 7B model served by vLLM with ten LoRA adapters (sql-lora, sql-lora-1 through sql-lora-4, tweet-summary, and tweet-summary-1 through tweet-summary-4) using the following manifest:

kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
      {% set system_message = '' %}
    {% endif %}
    {% for message in messages %}
      {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
      {% endif %}
      {% if loop.index0 == 0 %}
        {% set content = system_message + message['content'] %}
      {% else %}
        {% set content = message['content'] %}
      {% endif %}
      {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
      {% elif message['role'] == 'assistant' %}
        {{ ' ' + content | trim + ' ' + eos_token }}
      {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
      - name: lora
        image: "registry-cn-hangzhou-vpc.ack.aliyuncs.com/dev/llama2-with-lora:v0.2"
        imagePullPolicy: IfNotPresent
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "/model/llama2"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--gpu_memory_utilization"
        - "0.8"
        - "--enable-lora"
        - "--max-loras"
        - "10"
        - "--max-cpu-loras"
        - "12"
        - "--lora-modules"
        - "sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0"
        - "sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1"
        - "sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2"
        - "sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3"
        - "sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4"
        - "tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0"
        - "tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1"
        - "tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2"
        - "tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3"
        - "tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4"
        - "--chat-template"
        - "/etc/vllm/llama-2-chat.jinja"
        env:
        - name: PORT
          value: "8000"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        livenessProbe:
          failureThreshold: 2400
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6000
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - mountPath: /etc/vllm
          name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: chat-template
        configMap:
          name: chat-template
EOF
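Before configuring the gateway, it can help to confirm that the pods are ready and that every adapter passed through --lora-modules is actually being served. The following is a minimal sketch, not part of the original procedure: the helper names expected_adapters and missing_adapters are hypothetical, and the commented usage assumes the vLLM OpenAI-compatible /v1/models endpoint is reachable through a local port-forward.

```shell
# expected_adapters: print the ten adapter names registered by the
# --lora-modules flags in the manifest above, one per line.
expected_adapters() {
  for base in sql-lora tweet-summary; do
    echo "$base"
    for i in 1 2 3 4; do
      echo "$base-$i"
    done
  done
}

# missing_adapters: read the model ids reported by the server on stdin
# (one per line) and print every expected adapter that is absent.
# Extra ids on stdin (for example the base model) are ignored.
missing_adapters() {
  served=$(cat)
  expected_adapters | while read -r name; do
    printf '%s\n' "$served" | grep -qxF -- "$name" || echo "$name"
  done
}

# Usage against a live cluster (assumed workflow, run manually):
#   kubectl rollout status deployment/vllm-llama2-7b-pool --timeout=30m
#   kubectl port-forward svc/vllm-llama2-7b-pool 8000:8000 &
#   curl -s localhost:8000/v1/models \
#     | grep -o '"id": *"[^"]*"' | cut -d'"' -f4 | missing_adapters
```

An empty result from missing_adapters means all ten adapters are listed by the server. Note that the Deployment requests three replicas with one GPU each, so pods can stay Pending if fewer than three GPUs are available.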

Step 2: Configure LoRA Model Gray Release with ACK Gateway

Enable the ACK Gateway with AI Extension component in the cluster’s component management, turn on the “Enable Gateway API inference extension” option, and create a Gateway instance:

kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
EOF

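Then create InferencePool and InferenceModel resources to define the traffic split. In this example, requests that name the model lora-request are distributed by weight across the targetModels: 50% to tweet-summary and 50% to sql-lora.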

kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/attach-to: |
      name: inference-gateway
      port: 8081
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: lora-request
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary
    weight: 50
  - name: sql-lora
    weight: 50
EOF

Validate the gray release by sending requests with the model name lora-request to the gateway and checking that traffic is distributed roughly equally between the two LoRA models.
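One way to carry out this check is to send a batch of completion requests and count which adapter answered each one. The sketch below is illustrative only: the helper names, the prompt, and the sample size are assumptions, and it assumes the "model" field of each response reflects the adapter that actually served the request. Port 8081 and the model name lora-request come from the manifests above.

```shell
# gateway_ip: first address reported in the Gateway status.
gateway_ip() {
  kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'
}

# send_completion HOST: post one completion request under the logical
# model name "lora-request" and print the raw JSON response.
send_completion() {
  curl -s "http://$1:8081/v1/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "lora-request", "prompt": "Write a SQL query that counts users", "max_tokens": 16}'
}

# extract_model: pull the value of the first "model" field out of a
# JSON response read from stdin (no jq required).
extract_model() {
  grep -o '"model":[[:space:]]*"[^"]*"' | head -n 1 | sed 's/.*:[[:space:]]*"\(.*\)"/\1/'
}

# tally_models: read one model name per line, print "name count" pairs.
tally_models() {
  sort | uniq -c | awk '{print $2, $1}'
}

# sample_split HOST N: send N requests and summarize who answered.
sample_split() {
  i=0
  while [ "$i" -lt "$2" ]; do
    send_completion "$1" | extract_model
    i=$((i + 1))
  done | tally_models
}

# Usage against a live cluster (run manually):
#   sample_split "$(gateway_ip)" 20
```

With a 50/50 weighting, the two counts should be close to each other, although a small sample will rarely split exactly evenly.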

Base Model Gray Release Scenario

In addition to LoRA adapters, you can gray-release between different base models. This guide demonstrates a gray release between QwQ-32B and DeepSeek-R1-Distill-Qwen-7B.

QwQ‑32B Model Overview

QwQ-32B is a 32-billion-parameter reasoning LLM from Alibaba Cloud, reported to perform comparably to DeepSeek-R1 (671B). It supports BF16 precision, and this guide runs it on a node with 4×A10 GPUs.

DeepSeek‑R1‑Distill‑Qwen‑7B Model Overview

DeepSeek‑R1‑Distill‑Qwen‑7B is a 7‑billion‑parameter model distilled from DeepSeek‑R1 (671B) with strong math, code, and reasoning capabilities.

Step 1: Deploy QwQ‑32B and DeepSeek‑R1 Models

Prepare GPU nodes (4×A10 for QwQ‑32B, 1×A10 for DeepSeek‑R1), clone the model repositories, and upload them to OSS. Configure PV/PVC for OSS storage as described in the documentation.

Deploy the models using the following manifests:

kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: custom-serving
    release: qwq-32b
  name: qwq-32b
spec:
  replicas: 5
  selector:
    matchLabels:
      app: custom-serving
      release: qwq-32b
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: custom-serving
        release: qwq-32b
    spec:
      containers:
      - command:
        - sh
        - -c
        - vllm serve /model/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel-size 4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        env:
        - name: ARENA_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.7.2
        imagePullPolicy: IfNotPresent
        name: custom-serving
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "4"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /model/QwQ-32B
          name: llm-model
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 30Gi
        name: dshm
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: custom-serving
    release: qwq-32b
  name: qwq-32b
spec:
  ports:
  - name: http-serving
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: custom-serving
    release: qwq-32b
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: custom-serving
    release: deepseek-r1
  name: deepseek-r1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: custom-serving
      release: deepseek-r1
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: custom-serving
        release: deepseek-r1
    spec:
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/DeepSeek-R1-1.5B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 8192 --gpu-memory-utilization 0.9 --enforce-eager
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:v0.7.2
        name: vllm
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/DeepSeek-R1-1.5B
          name: model
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model-ds
      - emptyDir:
          medium: Memory
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1
spec:
  type: ClusterIP
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: custom-serving
    release: deepseek-r1
EOF
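The replica counts and GPU limits in these manifests determine how many GPUs the cluster must provide before every pod can be scheduled. The arithmetic below is a small sketch using only the values from the manifests above; the helper name gpus_needed is hypothetical, and the commented kubectl command is one possible way to compare against what the nodes advertise.

```shell
# gpus_needed REPLICAS GPUS_PER_POD: total GPUs a Deployment requests.
gpus_needed() {
  echo $(( $1 * $2 ))
}

qwq=$(gpus_needed 5 4)   # qwq-32b: replicas 5, nvidia.com/gpu "4"
ds=$(gpus_needed 2 1)    # deepseek-r1: replicas 2, nvidia.com/gpu "1"
total=$((qwq + ds))

echo "qwq-32b: $qwq GPUs, deepseek-r1: $ds GPUs, total: $total"
# prints: qwq-32b: 20 GPUs, deepseek-r1: 2 GPUs, total: 22

# To compare with the allocatable GPUs per node (run manually):
#   kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```

If the cluster offers fewer GPUs than this total, lower the replicas values to fit, since pods that cannot obtain their GPUs remain Pending.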

Step 2: Configure Model Gray Release with ACK Gateway

Enable the ACK Gateway with AI Extension component, create a Gateway instance (port 8081), and define InferencePool and InferenceModel resources that split traffic 50/50 between QwQ‑32B and DeepSeek‑R1:

kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
EOF

kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/attach-to: |
      name: inference-gateway
      port: 8081
  name: reasoning-pool
spec:
  extensionRef:
    group: ""
    kind: Service
    name: inference-gateway-ext-proc
  selector:
    app: custom-serving
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  criticality: Critical
  modelName: qwq
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: reasoning-pool
  targetModels:
  - name: qwq-32b
    weight: 50
  - name: deepseek-r1
    weight: 50
EOF

Validate the gray release by sending chat completion requests with the model name qwq to the gateway and confirming that qwq-32b and deepseek-r1 respond with roughly equal frequency.
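Because weighted routing is probabilistic, "roughly equal" is easier to judge with an explicit tolerance. The following sketch shows one possible request and a simple check. The helper names, the message text, and the 15-point tolerance in the usage comment are assumptions, not values from the original guide; the port and the model name qwq come from the manifests above.

```shell
# send_chat HOST: post one chat completion request under the logical
# model name "qwq" and print the raw JSON response.
send_chat() {
  curl -s "http://$1:8081/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwq", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 32}'
}

# within_tolerance COUNT TOTAL EXPECTED_PCT TOLERANCE_PCT
# Succeeds when COUNT/TOTAL, as an integer percentage, is within
# TOLERANCE_PCT points of EXPECTED_PCT.
within_tolerance() {
  observed=$(( $1 * 100 / $2 ))
  diff=$(( observed - $3 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  [ "$diff" -le "$4" ]
}

# Usage (run manually): after 20 requests, 11 of which were answered
# by qwq-32b, check the share against the configured weight of 50:
#   within_tolerance 11 20 50 15 && echo "split looks balanced"
```

A wider tolerance suits small samples; with more requests the observed share should move closer to the configured weight.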

Summary

Using ACK Gateway with AI Extension, you can achieve intelligent routing, load balancing, and gray release for both LoRA fine‑tuned models and base LLMs in a cloud‑native Kubernetes environment, providing a flexible and efficient solution for large‑scale AI inference workloads.
