
Deploying NVIDIA NIM on Alibaba Cloud ACK with Cloud‑Native AI Suite: A Step‑by‑Step Guide

This guide explains how to quickly build a high‑performance, observable, and elastically scalable LLM inference service by deploying NVIDIA NIM on an Alibaba Cloud ACK cluster using the Cloud‑Native AI Suite, KServe, Prometheus, Grafana, and custom autoscaling based on request‑queue metrics.

Alibaba Cloud Infrastructure

Sep 5, 2024


Large Language Models (LLMs) are increasingly used in production, and deploying them efficiently requires a streamlined workflow. This article shows how to deploy NVIDIA NIM on Alibaba Cloud Container Service for Kubernetes (ACK) together with the Cloud‑Native AI Suite to create a high‑performance, observable inference service.

First, create an ACK cluster with GPU nodes and install the Cloud‑Native AI Suite and the ack‑kserve component. Then generate an NGC API key and create the necessary image pull secret:

export NGC_API_KEY=<your-ngc-api-key>
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}

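Create a second secret that holds the NGC API key. It is passed to the NIM container as an environment variable so the container can authenticate to the private NGC repository: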

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: nvidia-nim-secrets
stringData:
  NGC_API_KEY: <your-ngc-api-key>
EOF

Provision a NAS Persistent Volume (PV) and Persistent Volume Claim (PVC) to store model files, then deploy the Llama‑3‑8B model with NVIDIA NIM using Arena:

arena serve kserve \
    --name=llama3-8b-instruct \
    --image=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 \
    --image-pull-secret=ngc-secret \
    --gpus=1 \
    --cpu=8 \
    --memory=32Gi \
    --share-memory=32Gi \
    --port=8000 \
    --security-context runAsUser=0 \
    --annotation=serving.kserve.io/autoscalerClass=external \
    --env NIM_CACHE_PATH=/mnt/models \
    --env-from-secret NGC_API_KEY=nvidia-nim-secrets \
    --enable-prometheus=true \
    --metrics-port=8000 \
    --data=nim-model:/mnt/models
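Because the deployment enables Prometheus scraping on port 8000, it can help to spot-check the metric that later drives autoscaling. The following is a minimal sketch, not part of the original guide: the helper name waiting_requests is hypothetical, the sample text stands in for a real scrape, and the /metrics path mentioned in the comment is an assumption about the NIM container. It also assumes each sample line ends with the value (no trailing timestamp).

```shell
# waiting_requests: read Prometheus text-format metrics on stdin and print the
# sum of all num_requests_waiting samples. Exits non-zero if the metric is absent.
waiting_requests() {
  awk '$1 ~ /^num_requests_waiting([{]|$)/ { sum += $NF; found = 1 }
       END { if (found) print sum; else exit 1 }'
}

# Demonstration with a hand-written sample (not real NIM output).
waiting_requests <<'SAMPLE'
# TYPE num_requests_waiting gauge
num_requests_waiting{model_name="meta/llama3-8b-instruct"} 12.0
num_requests_running{model_name="meta/llama3-8b-instruct"} 4.0
SAMPLE
# prints 12

# Against a live pod (assumed path, after a kubectl port-forward to port 8000):
#   curl -s http://localhost:8000/metrics | waiting_requests
```

If the function exits non-zero against a live pod, the metric is not being exported under that name, and the custom metric rule will have nothing to read.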

Enable Alibaba Cloud Prometheus and Grafana, import the NVIDIA NIM dashboard, and create a custom metric rule that exposes num_requests_waiting through the Kubernetes custom metrics API. Then define an HPA that scales out the deployment when the average number of waiting requests per pod exceeds 10:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama3-8b-instruct-hpa
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - pods:
      metric:
        name: num_requests_waiting
      target:
        averageValue: 10
        type: AverageValue
    type: Pods
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama3-8b-instruct-predictor
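To make the scaling rule concrete: for a Pods metric with an AverageValue target, the HPA computes roughly ceil(currentReplicas × currentAverage / target) and clamps the result to minReplicas and maxReplicas. The sketch below reproduces only that arithmetic with integer inputs. The function name desired_replicas is hypothetical, and the sketch deliberately ignores the controller's tolerance band and stabilization windows, so treat it as an illustration rather than the controller's exact behavior.

```shell
# desired_replicas CURRENT AVG_WAITING TARGET MIN MAX
# Integer sketch of ceil(CURRENT * AVG_WAITING / TARGET), clamped to [MIN, MAX].
desired_replicas() {
  # Ceiling division with integer arithmetic.
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}

# Values from the manifest above: target 10, minReplicas 1, maxReplicas 3.
desired_replicas 1 25 10 1 3   # 1 pod, 25 waiting on average: prints 3
desired_replicas 3 2 10 1 3    # 3 pods, 2 waiting on average: prints 1
```

With a target of 10 and a maximum of 3 replicas, a single pod reaches the ceiling as soon as its queue averages more than 20 waiting requests.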

Save the manifest as hpa.yaml, apply it with kubectl apply -f hpa.yaml, and check its status with kubectl describe hpa llama3-8b-instruct-hpa. Then run a load test with the hey tool:

hey -z 5m -c 400 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "Once upon a time"}], "max_tokens": 64}' http://$NGINX_INGRESS_IP:80/v1/chat/completions
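The load-test command relies on two variables that the text does not define. One possible way to populate them is sketched below. It is an assumption rather than the guide's stated method: it presumes the ingress controller is exposed by a Service named nginx-ingress-lb in kube-system and that the InferenceService is named llama3-8b-instruct in the current namespace. The helper names url_host and resolve_endpoints are hypothetical.

```shell
# Print the host portion of a URL, e.g. http://a.example.com/x -> a.example.com
url_host() {
  printf '%s\n' "$1" | cut -d/ -f3
}

# Populate the variables used by the load test. Requires kubectl access to the
# cluster. Assumed names: Service kube-system/nginx-ingress-lb and
# InferenceService llama3-8b-instruct.
resolve_endpoints() {
  NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  SERVICE_HOSTNAME=$(url_host "$(kubectl get inferenceservice llama3-8b-instruct \
    -o jsonpath='{.status.url}')")
  export NGINX_INGRESS_IP SERVICE_HOSTNAME
}

# Usage: call resolve_endpoints once, then run the hey command.
```

Deriving the hostname from the InferenceService status keeps the Host header in sync with whatever domain KServe assigned, instead of hard-coding it.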

During the test, the pod count should scale up toward the configured maximum of 3 replicas, then scale back down after traffic subsides. Finally, access the inference service via the Nginx Ingress address and confirm that it returns valid LLM completions.
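A single request is usually enough for that final check. The sketch below reuses the endpoint, Host header, and request body from the load-test command, and assumes NGINX_INGRESS_IP and SERVICE_HOSTNAME are already set in the shell. The helper names json_escape, chat_payload, and ask_nim are hypothetical, and the escaping handles only backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes so a prompt can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload PROMPT [MAX_TOKENS]: print the chat-completions request body.
chat_payload() {
  printf '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}\n' \
    "$(json_escape "$1")" "${2:-64}"
}

# ask_nim PROMPT [MAX_TOKENS]: send one request through the Nginx Ingress.
ask_nim() {
  curl -sS -X POST "http://${NGINX_INGRESS_IP}:80/v1/chat/completions" \
    -H "Host: ${SERVICE_HOSTNAME}" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "${2:-64}")"
}

# Usage: ask_nim "Once upon a time" 64
```

A JSON response containing a choices array indicates that the request reached the model. An HTTP error from the ingress more likely points at the Host header or the ingress address.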

By combining NVIDIA NIM, ACK, the Cloud‑Native AI Suite, and Alibaba Cloud monitoring tools, you can rapidly build a high‑performance, observable, and elastically scalable LLM inference service suitable for production workloads.
