Deploy NVIDIA NIM LLM Inference on Alibaba Cloud ACK with Auto‑Scaling and Monitoring
This guide walks you through deploying NVIDIA NIM for LLM inference on Alibaba Cloud ACK, integrating the Cloud Native AI Suite, configuring KServe, setting up Prometheus and Grafana monitoring, and implementing custom autoscaling based on request queue metrics.
Introduction
Large Language Models (LLMs) are rapidly becoming essential across industries, and open‑source models now enable enterprises to embed AI inference directly into their infrastructure. Deploying such models in production, however, can be complex and time‑consuming.
Solution Overview
The article presents a step‑by‑step solution that combines NVIDIA NIM—a pre‑built, containerized service for high‑performance AI inference—with Alibaba Cloud Container Service for Kubernetes (ACK) and the Cloud Native AI Suite. The result is a high‑performance, observable, and elastically scalable LLM inference service.
Prerequisites
Create an ACK cluster with GPU nodes.
Install the Cloud Native AI Suite and the ack‑kserve component.
Obtain an NVIDIA NGC API key to pull the NIM container image.
Deployment Steps
Create the ACK cluster and install the AI suite.
Submit a KServe inference service using Arena, referencing the NVIDIA NIM container for the Llama‑3‑8B‑instruct model.
Configure monitoring to expose Prometheus metrics.
Set up a custom autoscaling policy based on the num_requests_waiting metric.
Step‑by‑Step Commands
Generate the NGC API key and create a Docker secret:
export NGC_API_KEY=<your-ngc-api-key>
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"
Create a second secret that passes the NGC API key to the NIM container:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: nvidia-nim-secrets
stringData:
  NGC_API_KEY: <your-ngc-api-key>
EOF
Define a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) named nim-model for model storage (NAS type, ReadWriteMany).
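The storage definition itself is not shown in the source, so the following is only a sketch of what a statically provisioned NAS volume might look like on ACK. It assumes the ACK CSI NAS driver (nasplugin.csi.alibabacloud.com). The file name nim-model-storage.yaml, the 100Gi size, the mount options, and the <your-nas-mount-target> placeholder are illustrative assumptions, not values from the original. The PVC is named nim-model because the arena command mounts nim-model at /mnt/models.

```shell
# Write an illustrative static NAS PV and a matching PVC to a file for review.
# The NAS mount target, path, and size are placeholders (assumptions); replace them before applying.
cat > nim-model-storage.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nim-model
  labels:
    alicloud-pvname: nim-model
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: nim-model
    volumeAttributes:
      server: "<your-nas-mount-target>"
      path: "/"
  mountOptions:
    - nolock,tcp,noresvport
    - vers=3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model
spec:
  # Empty class: bind to the static PV above instead of a default StorageClass.
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      alicloud-pvname: nim-model
EOF
```

After filling in the placeholders, the file can be applied with kubectl apply -f nim-model-storage.yaml, and kubectl get pvc nim-model should eventually report the claim as Bound.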
Deploy the Inference Service
arena serve kserve \
  --name=llama3-8b-instruct \
  --image=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 \
  --image-pull-secret=ngc-secret \
  --gpus=1 \
  --cpu=8 \
  --memory=32Gi \
  --share-memory=32Gi \
  --port=8000 \
  --security-context runAsUser=0 \
  --annotation=serving.kserve.io/autoscalerClass=external \
  --env NIM_CACHE_PATH=/mnt/models \
  --env-from-secret NGC_API_KEY=nvidia-nim-secrets \
  --enable-prometheus=true \
  --metrics-port=8000 \
  --data=nim-model:/mnt/models
Expected output confirms successful submission:
INFO[0004] The Job llama3-8b-instruct has been submitted successfully
Check the service status:
arena serve get llama3-8b-instruct
The service becomes reachable at http://llama3-8b-instruct-default.example.com.
Testing the Service
# Obtain Ingress IP and service hostname
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama3-8b-instruct -o jsonpath='{.status.url}' | cut -d '/' -f 3)
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
  http://$NGINX_INGRESS_IP:80/v1/chat/completions \
  -d '{"model":"meta/llama3-8b-instruct","messages":[{"role":"user","content":"Once upon a time"}],"max_tokens":64,"temperature":0.7,"top_p":0.9,"seed":10}'
The response contains a generated completion, confirming the model is serving correctly.
Monitoring with Prometheus & Grafana
Enable Alibaba Cloud Prometheus, create a Grafana workspace, and import the NVIDIA NIM dashboard JSON (provided by NVIDIA). The dashboard displays metrics such as token latency, current request count, and generated token count.
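Before building dashboards, it can help to confirm that the queue metric used later for autoscaling is actually being exported. The sketch below defines a hypothetical helper, sum_requests_waiting, that reads Prometheus text-format output on stdin and totals every num_requests_waiting sample. The sample lines and their model_name labels are invented for illustration, and the helper assumes samples carry no trailing timestamp.

```shell
# Sum every num_requests_waiting sample in Prometheus text-format input read from stdin.
# Assumes no trailing timestamp, so the sample value is the last field on the line.
sum_requests_waiting() {
  awk '$1 ~ /^num_requests_waiting([{]|$)/ { total += $NF } END { print total + 0 }'
}

# Example with made-up samples; comment lines and other metrics are ignored.
printf '%s\n' \
  '# TYPE num_requests_waiting gauge' \
  'num_requests_waiting{model_name="meta/llama3-8b-instruct"} 3' \
  'num_requests_running{model_name="meta/llama3-8b-instruct"} 1' \
  'num_requests_waiting{model_name="other"} 4' \
  | sum_requests_waiting
# prints 7
```

In practice the input would come from a curl against the predictor pod's metrics endpoint on port 8000; the exact path depends on the NIM version, so check the NIM documentation rather than relying on this sketch.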
Custom Autoscaling Based on Queue Length
Install the ack‑alibaba‑cloud‑metrics‑adapter to expose custom metrics to the Kubernetes HPA. Add the following rule to the adapter configuration:
- seriesQuery: num_requests_waiting{namespace!="",pod!=""}
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
Create an HPA manifest that scales out when the average num_requests_waiting per pod exceeds 10:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama3-8b-instruct-hpa
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: num_requests_waiting
      target:
        type: AverageValue
        averageValue: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama3-8b-instruct-predictor
Save the manifest as hpa.yaml, then apply it and verify scaling:
kubectl apply -f hpa.yaml
kubectl get hpa llama3-8b-instruct-hpa
During a load test with hey, the pod count expands to the maximum defined, then contracts after traffic subsides.
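To see why the pod count reaches the maximum under load, it helps to work through the replica calculation. The sketch below is a simplified version of the standard Kubernetes HPA rule for a Pods metric with an AverageValue target: the desired replica count is the total metric value divided by the target, rounded up, then clamped to minReplicas and maxReplicas. The function name desired_replicas is hypothetical, and the sketch deliberately ignores the controller's tolerance band, stabilization window, and pod readiness handling.

```shell
# Simplified HPA calculation for a Pods metric with an AverageValue target.
# Arguments: total queued requests across pods, target averageValue, minReplicas, maxReplicas.
desired_replicas() {
  total_waiting=$1
  target=$2
  min=$3
  max=$4
  # ceil(total_waiting / target) using integer arithmetic
  desired=$(( (total_waiting + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# With the HPA above (target 10, min 1, max 3):
desired_replicas 12 10 1 3   # prints 2
desired_replicas 25 10 1 3   # prints 3
desired_replicas 45 10 1 3   # prints 3 (capped at maxReplicas)
desired_replicas 0 10 1 3    # prints 1 (floored at minReplicas)
```

With a target of 10, any sustained backlog above 20 queued requests in total is enough to push the deployment to three replicas, and an empty queue lets it fall back to one.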
Conclusion
By combining NVIDIA NIM, ACK, the Cloud Native AI Suite, Prometheus, and Grafana, you can rapidly build a high‑performance, observable, and elastically scalable LLM inference service that adapts to dynamic workloads through custom metric‑driven autoscaling.
Alibaba Cloud Native
