
Deploying NVIDIA NIM on Alibaba Cloud ACK with Cloud‑Native AI Suite: A Step‑by‑Step Guide

This guide explains how to quickly build a high‑performance, observable, and elastically scalable LLM inference service by deploying NVIDIA NIM on an Alibaba Cloud ACK cluster using the Cloud‑Native AI Suite, KServe, Prometheus, Grafana, and custom autoscaling based on request‑queue metrics.

Alibaba Cloud Infrastructure

Large Language Models (LLMs) are increasingly used in production, and deploying them efficiently requires a streamlined workflow. This article shows how to deploy NVIDIA NIM on Alibaba Cloud Container Service for Kubernetes (ACK) together with the Cloud‑Native AI Suite to create a high‑performance, observable inference service.

First, create an ACK cluster with GPU nodes and install the Cloud‑Native AI Suite and the ack‑kserve component. Then generate an NGC API key and create the necessary image pull secret:

export NGC_API_KEY=<your-ngc-api-key>
kubectl create secret docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="${NGC_API_KEY}"

Create a secret for the NIM container to access the private NGC repository:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: nvidia-nim-secrets
stringData:
  NGC_API_KEY: <your-ngc-api-key>
EOF

Provision a NAS Persistent Volume (PV) and Persistent Volume Claim (PVC) to store model files, then deploy the Llama‑3‑8B model with NVIDIA NIM using Arena:

arena serve kserve \
    --name=llama3-8b-instruct \
    --image=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 \
    --image-pull-secret=ngc-secret \
    --gpus=1 \
    --cpu=8 \
    --memory=32Gi \
    --share-memory=32Gi \
    --port=8000 \
    --security-context runAsUser=0 \
    --annotation=serving.kserve.io/autoscalerClass=external \
    --env NIM_CACHE_PATH=/mnt/models \
    --env-from-secret NGC_API_KEY=nvidia-nim-secrets \
    --enable-prometheus=true \
    --metrics-port=8000 \
    --data=nim-model:/mnt/models
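The `--data=nim-model:/mnt/models` flag above expects a PVC named `nim-model` to exist before the command runs. As a rough sketch of the NAS provisioning step, the function below renders a statically provisioned PV and a matching PVC. The PV name `nim-model-pv`, the 100Gi size, the export path `/nim-model`, and the mount-target argument are illustrative assumptions that do not come from the original article, and the CSI driver name and volume attributes should be checked against the current ACK NAS documentation.

```shell
#!/usr/bin/env bash
# Render a static NAS PV plus the PVC that the arena command mounts.
# Assumptions: PV name, size, and export path are illustrative only.
render_nim_storage() {
  local nas_server="$1"   # NAS mount target address (hypothetical input)
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nim-model-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: nim-model-pv
    volumeAttributes:
      server: "${nas_server}"
      path: "/nim-model"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: nim-model-pv
  resources:
    requests:
      storage: 100Gi
EOF
}
```

With a real mount target in hand, the output can be piped straight into the cluster: `render_nim_storage "<your-nas-mount-target>" | kubectl apply -f -`. Binding by `volumeName` keeps the claim tied to this specific volume rather than to whichever PV happens to match.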

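Enable Alibaba Cloud Prometheus and Grafana, import the NVIDIA NIM dashboard, and create a custom metric rule for num_requests_waiting, the number of requests queued in the inference engine. Because the service was deployed with the autoscalerClass=external annotation, KServe does not create its own autoscaler, so you define the HPA yourself. The following HPA scales the deployment out when the average number of waiting requests per pod exceeds 10: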

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama3-8b-instruct-hpa
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - pods:
      metric:
        name: num_requests_waiting
      target:
        averageValue: 10
        type: AverageValue
    type: Pods
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama3-8b-instruct-predictor
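To get a feel for how this HPA reacts, it helps to work through the standard Kubernetes scaling rule, which sets the desired replica count to the ceiling of `currentReplicas * currentAverage / target` and then clamps it to the configured bounds. The helper below is a simplified sketch of that arithmetic. The function name `desired_replicas` is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are ignored here.

```shell
#!/usr/bin/env bash
# desired_replicas CURRENT AVG_WAITING TARGET MIN MAX
# Integer-only version of ceil(CURRENT * AVG_WAITING / TARGET),
# clamped to [MIN, MAX]. Ignores HPA tolerance and stabilization.
desired_replicas() {
  local current="$1" avg="$2" target="$3" min="$4" max="$5"
  # Ceiling division using integer arithmetic.
  local desired=$(( (current * avg + target - 1) / target ))
  if (( desired < min )); then
    desired="$min"
  fi
  if (( desired > max )); then
    desired="$max"
  fi
  echo "$desired"
}
```

With the manifest's values (target 10, 1 to 3 replicas), `desired_replicas 1 25 10 1 3` prints 3, because one pod with 25 queued requests needs three pods to bring the average under the target. `desired_replicas 3 2 10 1 3` prints 1, which is the scale-in case once the queue drains.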

Save the manifest as hpa.yaml, apply it with kubectl apply -f hpa.yaml, and verify scaling with kubectl describe hpa llama3-8b-instruct-hpa. Then run a load test with the hey tool:

hey -z 5m -c 400 -m POST -host "$SERVICE_HOSTNAME" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "Once upon a time"}], "max_tokens": 64}' \
    "http://$NGINX_INGRESS_IP:80/v1/chat/completions"
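The load test assumes that `SERVICE_HOSTNAME` and `NGINX_INGRESS_IP` are already set in the shell. One possible way to derive them is sketched below. The helper `host_from_url` is a hypothetical name, and the Service name `nginx-ingress-lb` in the `kube-system` namespace is an assumption about a default ACK Nginx Ingress installation that should be verified on your cluster.

```shell
#!/usr/bin/env bash
# Strip the scheme and any path from a URL, leaving only the host.
host_from_url() {
  local url="$1"
  url="${url#*://}"   # drop "http://" or "https://" if present
  echo "${url%%/*}"   # drop everything from the first "/" onward
}

# Usage against a live cluster (not executed here):
# SERVICE_HOSTNAME="$(host_from_url "$(kubectl get inferenceservice \
#     llama3-8b-instruct -o jsonpath='{.status.url}')")"
# NGINX_INGRESS_IP="$(kubectl -n kube-system get svc nginx-ingress-lb \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
```

The Host header matters because the ingress routes on hostname, so sending requests to the bare IP without it would likely not reach the inference service.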

After the test, observe the pod count scaling up to the configured maximum and then scaling back down when traffic subsides. Finally, access the inference service via the Nginx Ingress address and confirm that it returns valid LLM completions.
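For the final check, a single request is easier to read than load-test output. The sketch below builds the same request body used in the load test. The function name `chat_payload` is hypothetical, and the commented `curl` call assumes `SERVICE_HOSTNAME` and `NGINX_INGRESS_IP` are set as in the load test.

```shell
#!/usr/bin/env bash
# Build an OpenAI-style chat completion body for the NIM endpoint.
# The prompt is inserted verbatim, so it must not contain double quotes
# or backslashes; use a JSON tool such as jq for arbitrary input.
chat_payload() {
  local prompt="$1" max_tokens="${2:-64}"
  printf '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}' \
    "$prompt" "$max_tokens"
}

# Usage against a live cluster (not executed here):
# curl -s -H "Host: ${SERVICE_HOSTNAME}" \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload "Once upon a time")" \
#     "http://${NGINX_INGRESS_IP}:80/v1/chat/completions"
```

A healthy service should answer with a JSON document containing the generated text. An HTTP error or an empty reply usually points at the ingress host header or at a pod that is still loading the model.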

By combining NVIDIA NIM, ACK, the Cloud‑Native AI Suite, and Alibaba Cloud monitoring tools, you can rapidly build a high‑performance, observable, and elastically scalable LLM inference service suitable for production workloads.

Written by

Alibaba Cloud Infrastructure
