Deploying vLLM with llmaz and Higress: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks through deploying vLLM inference services on a GPU‑enabled Kubernetes cluster using llmaz, configuring Higress as an AI gateway for traffic control, observability, and fallback model switching, and demonstrating end‑to‑end request testing.

Deploying production‑grade LLM inference services on‑premises requires handling traffic routing, high availability, resource scheduling, model loading, and service orchestration. This guide shows how to use llmaz to deploy vLLM‑based models on a Kubernetes GPU cluster and expose them through Higress, a cloud‑native API gateway that provides traffic control, observability, and fallback capabilities.

Prerequisites

- A Kubernetes cluster with GPU support (at least two GPUs); a one‑click GPU Kind cluster tutorial can be used to create one.
- kubectl installed locally.
- A HuggingFace token, if the model requires authentication.
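
Before installing anything, it can help to confirm that the cluster nodes actually expose GPU resources to the scheduler (this assumes the NVIDIA device plugin is already installed, e.g. by the Kind GPU setup):

# Nodes should report nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep nvidia.com/gpu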

Install Higress

Install the Higress gateway via Helm:

helm repo add higress.io https://higress.cn/helm-charts
helm install higress -n higress-system higress.io/higress --create-namespace --render-subchart-notes
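
After the chart installs, a quick sanity check (not part of the original steps) is to confirm the Higress pods are running:

kubectl get pods -n higress-system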

Forward the Higress console for initial login:

kubectl port-forward -n higress-system svc/higress-console 8080:8080

Open http://127.0.0.1:8080 in a browser and set the admin credentials.

Install llmaz

Install llmaz, which supports multiple back‑ends (vLLM, SGLang, TensorRT‑LLM, llama.cpp, etc.) and can pull models from HuggingFace, ModelScope, or object storage:

helm install llmaz oci://registry-1.docker.io/inftyai/llmaz \
  --namespace llmaz-system --create-namespace --version 0.0.10
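
To confirm the installation, you can check the llmaz controller pods and verify that the custom resource definitions used below (OpenModel and Playground) are registered:

kubectl get pods -n llmaz-system
kubectl get crds | grep llmaz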

If a model requires a HuggingFace token, create a secret:

kubectl create secret generic modelhub-secret \
  --from-literal=HF_TOKEN=<your_token>

Deploy Two vLLM Models with llmaz

The example deploys Qwen2.5‑1.5B‑Instruct and Google Gemma‑2‑2B‑IT, each allocated one GPU. Two custom resources are defined per model: an OpenModel, which describes the model source (its model hub ID), and a Playground, which defines the runtime configuration, including backendRuntimeConfig.backendName: vllm and the resource requests/limits.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-1-5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2.5-1.5B-Instruct
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-1-5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-1-5b
  backendRuntimeConfig:
    backendName: vllm
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
---
# Similar YAML for gemma-2-2b; a sketch is shown below
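
For reference, the Gemma manifests mirror the Qwen ones above. The following is only a sketch: the resource name gemma-2-2b matches the service used later in this guide, but the familyName value and resource sizing are assumptions to adapt to your environment.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: gemma-2-2b
spec:
  familyName: gemma2   # assumed value; adjust as needed
  source:
    modelHub:
      modelID: google/gemma-2-2b-it
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: gemma-2-2b
spec:
  replicas: 1
  modelClaim:
    modelName: gemma-2-2b
  backendRuntimeConfig:
    backendName: vllm
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"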

Apply the YAML files and verify the pods and services:

kubectl get pods
kubectl get svc

Configure Higress to Proxy vLLM Services

Create an AI Service Provider for each model, pointing at the OpenAI‑compatible endpoints exposed by the vLLM services:

http://qwen2-1-5b-lb.default.svc.cluster.local:8080/v1
http://gemma-2-2b-lb.default.svc.cluster.local:8080/v1
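
Before wiring up the gateway, you can optionally check that a backend answers on its OpenAI‑compatible API directly, for example by port‑forwarding the Qwen service (name and port taken from the endpoint above) and listing the served models:

kubectl port-forward svc/qwen2-1-5b-lb 8080:8080
curl http://localhost:8080/v1/models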

Define an AI Route that matches the model field in the request body and routes to the appropriate provider. The full YAML can be found at https://github.com/cr7258/hands-on-lab/blob/main/gateway/higress/ai-proxy/on-premises/ai-route.yaml.

Test the Inference Endpoints

Forward the Higress gateway to a local port:

kubectl port-forward -n higress-system svc/higress-gateway 18000:80

Send a chat completion request to the Qwen model:

curl http://localhost:18000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2-1-5b",
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'

The response includes the model answer, token usage, and metadata. Replace the model field with gemma-2-2b to query the Gemma model. Adding "stream": true to the JSON payload enables streaming responses.
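
For example, a streaming request to the Gemma model changes only the model and stream fields; the rest of the call is identical:

curl http://localhost:18000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma-2-2b",
    "stream": true,
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'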

Observability

Enable Higress observability (input/output token counts, first‑token latency, total request latency):

helm upgrade --install higress -n higress-system higress.io/higress \
  --create-namespace --render-subchart-notes \
  --set global.o11y.enabled=true \
  --set global.pvc.rwxSupported=false

Metrics become visible in the Higress Console AI Dashboard.

Fallback Model Switching

Configure the AI Route for the Qwen model to fall back to the Gemma model when the primary returns a 5xx error. Deleting the Qwen pod simulates a failure; subsequent requests are automatically served by Gemma, demonstrating seamless high availability. The fallback YAML is available at https://github.com/cr7258/hands-on-lab/blob/main/gateway/higress/ai-proxy/on-premises/fallback.yaml.
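
To reproduce the failover by hand, one approach is sketched below; the exact pod name depends on how llmaz names the Playground workloads, so substitute the name reported by kubectl:

# Find and delete the Qwen serving pod to simulate a backend failure
kubectl get pods | grep qwen2-1-5b
kubectl delete pod <qwen2-1-5b-pod-name>

# The same request as before should now be answered by the gemma-2-2b fallback
curl http://localhost:18000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen2-1-5b", "messages": [{"role": "user", "content": "Who are you?"}]}'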

Conclusion

This workflow demonstrates how to quickly deploy vLLM‑based LLM inference services with llmaz, expose them via Higress for traffic management, observability, and automatic fallback, and verify end‑to‑end functionality using curl commands.
