Deploying vLLM with llmaz and Higress: A Step‑by‑Step Cloud‑Native Guide
This tutorial walks through deploying vLLM inference services on a GPU‑enabled Kubernetes cluster with llmaz, configuring Higress as an AI gateway for traffic control, observability, and fallback model switching, and testing requests end to end.
Deploying production‑grade LLM inference services on premises requires handling traffic routing, high availability, resource scheduling, model loading, and service orchestration. This guide shows how to use llmaz to deploy vLLM‑based models on a Kubernetes GPU cluster and expose them through Higress, a cloud‑native API gateway that provides traffic control, observability, and fallback capabilities.
Prerequisites
A Kubernetes cluster with GPU support (at least two GPUs); a one‑click GPU Kind cluster tutorial can be used to create one.
kubectl installed locally.
A HuggingFace token, if the model requires authentication.
Install Higress
Install the Higress gateway via Helm:
helm repo add higress.io https://higress.cn/helm-charts
helm install higress -n higress-system higress.io/higress --create-namespace --render-subchart-notes

Forward the Higress console for initial login:

kubectl port-forward -n higress-system svc/higress-console 8080:8080

Open http://127.0.0.1:8080 in a browser and set the admin credentials.
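You can also confirm the installation succeeded before proceeding by checking that the gateway and controller pods are running (standard kubectl; pod names will vary by cluster):

kubectl get pods -n higress-system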
Install llmaz
Install llmaz, which supports multiple back‑ends (vLLM, SGLang, TensorRT‑LLM, llama.cpp, etc.) and can pull models from HuggingFace, ModelScope, or object storage:
helm install llmaz oci://registry-1.docker.io/inftyai/llmaz \
  --namespace llmaz-system --create-namespace --version 0.0.10

If a model requires a HuggingFace token, create a secret:

kubectl create secret generic modelhub-secret \
  --from-literal=HF_TOKEN=<your_token>
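As with Higress, you can verify that the llmaz components are up before deploying any models:

kubectl get pods -n llmaz-system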
Deploy Two vLLM Models with llmaz

The example deploys Qwen2.5‑1.5B‑Instruct and Google Gemma‑2‑2B‑IT, each allocated one GPU. Two custom resources are defined for each model:

OpenModel – describes the model source (model hub ID).
Playground – defines the runtime configuration, including backendRuntimeConfig.backendName: vllm and resource limits/requests.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-1-5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2.5-1.5B-Instruct
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-1-5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-1-5b
  backendRuntimeConfig:
    backendName: vllm
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
---
# Similar YAML for gemma-2-2b omitted for brevity; a sketch follows below
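For reference, a minimal sketch of the Gemma manifests, assuming the same structure as the Qwen example above. google/gemma-2-2b-it is the model's public HuggingFace ID; familyName: gemma2 is an assumption, and since this is a gated model it relies on the HF token secret created earlier:

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: gemma-2-2b
spec:
  familyName: gemma2
  source:
    modelHub:
      modelID: google/gemma-2-2b-it
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: gemma-2-2b
spec:
  replicas: 1
  modelClaim:
    modelName: gemma-2-2b
  backendRuntimeConfig:
    backendName: vllm
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"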
Apply the YAML files and verify the pods and services:

kubectl get pods
kubectl get svc

Configure Higress to Proxy vLLM Services
Create an AI Service Provider for each model, using the OpenAI‑compatible endpoints exposed by the vLLM services:

http://qwen2-1-5b-lb.default.svc.cluster.local:8080/v1
http://gemma-2-2b-lb.default.svc.cluster.local:8080/v1

Define an AI Route that matches the model field in the request body and routes to the appropriate provider. The full YAML can be found at https://github.com/cr7258/hands-on-lab/blob/main/gateway/higress/ai-proxy/on-premises/ai-route.yaml.
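Before routing traffic through the gateway, you can sanity‑check an upstream directly. A quick probe, assuming the Service name and port shown above (vLLM's OpenAI‑compatible server also serves GET /v1/models):

kubectl port-forward svc/qwen2-1-5b-lb 8081:8080
curl http://localhost:8081/v1/models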
Test the Inference Endpoints
Forward the Higress gateway to a local port:
kubectl port-forward -n higress-system svc/higress-gateway 18000:80

Send a chat completion request to the Qwen model:
curl http://localhost:18000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen2-1-5b",
"messages": [{"role": "user", "content": "Who are you?"}]
}'

The response includes the model's answer, token usage, and metadata. Replace the model field with gemma-2-2b to query the Gemma model, and add "stream": true to the JSON payload to enable streaming responses, as shown below.
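For example, a streaming request against the Gemma model (the OpenAI‑compatible API returns the reply as a sequence of server‑sent event chunks):

curl http://localhost:18000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma-2-2b",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "stream": true
  }'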
Observability
Enable Higress observability (input/output token counts, first‑token latency, total request latency):
helm upgrade --install higress -n higress-system higress.io/higress \
--create-namespace --render-subchart-notes \
--set global.o11y.enabled=true \
--set global.pvc.rwxSupported=false

Metrics become visible in the Higress Console AI Dashboard.
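To populate the dashboard, generate a little traffic through the gateway, for example by repeating the chat request from the previous section (this assumes the port-forward on port 18000 is still active):

for i in $(seq 1 5); do
  curl -s http://localhost:18000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen2-1-5b", "messages": [{"role": "user", "content": "Hello"}]}'
done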
Fallback Model Switching
Configure the AI Route for the Qwen model to fall back to the Gemma model when the primary returns a 5xx error. Deleting the Qwen pod simulates a failure; subsequent requests are automatically served by Gemma, demonstrating seamless failover (a reproduction sketch follows below). The fallback YAML is available at https://github.com/cr7258/hands-on-lab/blob/main/gateway/higress/ai-proxy/on-premises/fallback.yaml.
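A minimal way to reproduce the failover; the pod name is generated by llmaz and will differ in your cluster:

# Delete the Qwen pod to simulate a backend failure
kubectl get pods
kubectl delete pod <qwen2-1-5b-pod-name>

# The same request is now answered by the Gemma fallback
curl http://localhost:18000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen2-1-5b", "messages": [{"role": "user", "content": "Who are you?"}]}'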
Conclusion
This workflow demonstrates how to quickly deploy vLLM‑based LLM inference services with llmaz, expose them via Higress for traffic management, observability, and automatic fallback, and verify end‑to‑end functionality using curl commands.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.