Deploying NVIDIA NIM on Alibaba Cloud ACK with Cloud‑Native AI Suite: A Step‑by‑Step Guide
This guide explains how to quickly build a high‑performance, observable, and elastically scalable LLM inference service by deploying NVIDIA NIM on an Alibaba Cloud ACK cluster using the Cloud‑Native AI Suite, KServe, Prometheus, Grafana, and custom autoscaling based on request‑queue metrics.
Large Language Models (LLMs) are increasingly used in production, and deploying them efficiently requires a streamlined workflow. This article shows how to deploy NVIDIA NIM on Alibaba Cloud Container Service for Kubernetes (ACK) together with the Cloud‑Native AI Suite to create a high‑performance, observable inference service.
First, create an ACK cluster with GPU nodes and install the Cloud‑Native AI Suite and the ack‑kserve component. Then generate an NGC API key and create the necessary image pull secret:
    export NGC_API_KEY=<your-ngc-api-key>
    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password=${NGC_API_KEY}

Create a secret for the NIM container to access the private NGC repository:
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: nvidia-nim-secrets
    stringData:
      NGC_API_KEY: <your-ngc-api-key>
    EOF

Provision a NAS Persistent Volume (PV) and Persistent Volume Claim (PVC) to store model files, then deploy the Llama‑3‑8B model with NVIDIA NIM using Arena:
    arena serve kserve \
      --name=llama3-8b-instruct \
      --image=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 \
      --image-pull-secret=ngc-secret \
      --gpus=1 \
      --cpu=8 \
      --memory=32Gi \
      --share-memory=32Gi \
      --port=8000 \
      --security-context runAsUser=0 \
      --annotation=serving.kserve.io/autoscalerClass=external \
      --env NIM_CACHE_PATH=/mnt/models \
      --env-from-secret NGC_API_KEY=nvidia-nim-secrets \
      --enable-prometheus=true \
      --metrics-port=8000 \
      --data=nim-model:/mnt/models

Enable Alibaba Cloud Prometheus and Grafana, import the NVIDIA NIM dashboard, and create a custom metric rule for num_requests_waiting. Define an HPA that scales the deployment when the average number of waiting requests exceeds 10:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llama3-8b-instruct-hpa
      namespace: default
    spec:
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - pods:
          metric:
            name: num_requests_waiting
          target:
            averageValue: 10
            type: AverageValue
        type: Pods
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: llama3-8b-instruct-predictor

Apply the HPA with kubectl apply -f hpa.yaml and verify scaling using kubectl describe hpa llama3-8b-instruct-hpa. Perform a load test with the hey tool:
    hey -z 5m -c 400 -m POST -host $SERVICE_HOSTNAME \
      -H "Content-Type: application/json" \
      -d '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "Once upon a time"}], "max_tokens": 64}' \
      http://$NGINX_INGRESS_IP:80/v1/chat/completions

After the test, observe the pod count scaling up to the configured maximum and then scaling back down when traffic subsides. Finally, access the inference service via the Nginx Ingress address and confirm that it returns valid LLM completions.
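The two checks described above can be sketched as small shell helpers. The function names hpa_desired_replicas and nim_smoke_test are hypothetical and not part of NIM, KServe, or ACK. The first assumes the standard Kubernetes HPA rule for a Pods metric with an AverageValue target: roughly the total number of waiting requests divided by the target, rounded up and clamped to the replica bounds. It ignores the controller's tolerance and stabilization windows, so treat it as an estimate. The second reuses the payload and the SERVICE_HOSTNAME and NGINX_INGRESS_IP variables from the load test, and assumes curl is installed and the cluster is reachable.

```shell
# Sketch only: helper names are hypothetical, chosen for illustration.

# Estimate the replica count the HPA should settle on:
#   ceil(total_waiting / target), clamped to [min, max].
# Runs in a subshell "( ... )" so its variables stay local.
hpa_desired_replicas() (
  total_waiting=$1
  target=$2
  min=$3
  max=$4
  # Integer ceiling division.
  desired=$(( (total_waiting + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

# Send a single chat completion request through the Nginx Ingress.
# The Host header routes the request to the KServe inference service.
nim_smoke_test() {
  curl -sS -X POST "http://${NGINX_INGRESS_IP}:80/v1/chat/completions" \
    -H "Host: ${SERVICE_HOSTNAME}" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "Once upon a time"}], "max_tokens": 64}'
}
```

For example, hpa_desired_replicas 25 10 1 3 prints 3, because 25 waiting requests against a target of 10 per pod need three pods. With hundreds of requests queued during the load test, the estimate is capped at maxReplicas, which matches the scale-up to the configured maximum. Running nim_smoke_test after the test should return a JSON chat completion if the service is healthy.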
By combining NVIDIA NIM, ACK, the Cloud‑Native AI Suite, and Alibaba Cloud monitoring tools, you can rapidly build a high‑performance, observable, and elastically scalable LLM inference service suitable for production workloads.
