
Deploy NVIDIA NIM LLM Inference on Alibaba Cloud ACK with Auto‑Scaling and Monitoring

This guide walks you through deploying NVIDIA NIM for LLM inference on Alibaba Cloud ACK, integrating the Cloud Native AI Suite, configuring KServe, setting up Prometheus and Grafana monitoring, and implementing custom autoscaling based on request queue metrics.

Alibaba Cloud Native

Sep 4, 2024


Introduction

Large Language Models (LLMs) are rapidly becoming essential across industries, and open‑source models now enable enterprises to embed AI inference directly into their infrastructure. Deploying such models in production, however, can be complex and time‑consuming.

Solution Overview

This article presents a step‑by‑step solution that combines NVIDIA NIM, a pre‑built, containerized service for high‑performance AI inference, with Alibaba Cloud Container Service for Kubernetes (ACK) and the Cloud Native AI Suite. The result is a high‑performance, observable, and elastically scalable LLM inference service.

Prerequisites

Create an ACK cluster with GPU nodes.

Install the Cloud Native AI Suite and the ack‑kserve component.

Obtain an NVIDIA NGC API key to pull the NIM container image.

Deployment Steps

Create the ACK cluster and install the AI suite.

Submit a KServe inference service using Arena, referencing the NVIDIA NIM container for the Llama‑3‑8B‑instruct model.

Configure monitoring to expose Prometheus metrics.

Set up a custom autoscaling policy based on the num_requests_waiting metric.

Step‑by‑Step Commands

Export your NGC API key and create an image pull secret for the nvcr.io registry:

export NGC_API_KEY=<your-ngc-api-key>
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}

Create a second secret that holds the NGC API key for the NIM container itself, which uses it to download model artifacts at startup:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: nvidia-nim-secrets
stringData:
  NGC_API_KEY: <your-ngc-api-key>
EOF

Define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) for model storage (NAS type, ReadWriteMany).
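
The PV and PVC described above might look like the following sketch, which follows the static NAS volume pattern of the ACK CSI plugin. The claim name nim-model matches the --data=nim-model:/mnt/models flag used in the deployment step. The PV name, the 20Gi capacity, the NAS path, and the mount target placeholder are assumptions to replace with your own values.

```shell
# Write a static NAS-backed PV and a matching PVC to a manifest file.
# Assumptions: PV name, capacity, NAS path, and the mount target placeholder.
cat > nim-model-storage.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nim-model
  labels:
    alicloud-pvname: nim-model
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: nim-model
    volumeAttributes:
      server: "<your-nas-mount-target>"
      path: "/nim-model"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: nim-model
EOF
```

Apply the file with kubectl apply -f nim-model-storage.yaml, then confirm that the claim reports Bound with kubectl get pvc nim-model before submitting the inference service.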

Deploy the Inference Service

arena serve kserve \
    --name=llama3-8b-instruct \
    --image=nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 \
    --image-pull-secret=ngc-secret \
    --gpus=1 \
    --cpu=8 \
    --memory=32Gi \
    --share-memory=32Gi \
    --port=8000 \
    --security-context runAsUser=0 \
    --annotation=serving.kserve.io/autoscalerClass=external \
    --env NIM_CACHE_PATH=/mnt/models \
    --env-from-secret NGC_API_KEY=nvidia-nim-secrets \
    --enable-prometheus=true \
    --metrics-port=8000 \
    --data=nim-model:/mnt/models

The expected output confirms that the submission succeeded:

INFO[0004] The Job llama3-8b-instruct has been submitted successfully

Check the service status:

arena serve get llama3-8b-instruct

The service becomes reachable at http://llama3-8b-instruct-default.example.com.

Testing the Service

# Obtain Ingress IP and service hostname
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama3-8b-instruct -o jsonpath='{.status.url}' | cut -d '/' -f 3)

curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
  http://$NGINX_INGRESS_IP:80/v1/chat/completions \
  -d '{"model":"meta/llama3-8b-instruct","messages":[{"role":"user","content":"Once upon a time"}],"max_tokens":64,"temperature":0.7,"top_p":0.9,"seed":10}'

The response contains a generated completion, confirming the model is serving correctly.

Monitoring with Prometheus & Grafana

Enable Alibaba Cloud Prometheus, create a Grafana workspace, and import the NVIDIA NIM dashboard JSON (provided by NVIDIA). The dashboard displays metrics such as token latency, current request count, and generated token count.
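
Before building dashboards, it can help to confirm that the metric you plan to scale on is actually exported. The sketch below sums the samples of one metric from Prometheus text-format output. The metric_value helper name and the sample payload are hypothetical, and it assumes samples carry no trailing timestamps.

```shell
# metric_value NAME: read Prometheus text exposition on stdin and print the
# sum of all samples whose metric name is NAME (labels are ignored).
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # strip the {label="..."} part
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { print total + 0 }
  '
}

# Hypothetical sample of what a metrics endpoint might return.
sample='# HELP num_requests_waiting Number of requests waiting to be processed.
# TYPE num_requests_waiting gauge
num_requests_waiting{model_name="meta/llama3-8b-instruct"} 4
num_requests_running{model_name="meta/llama3-8b-instruct"} 2'

printf '%s\n' "$sample" | metric_value num_requests_waiting
```

Against a live deployment, the same helper could read the output of a curl request to the pod's metrics endpoint on port 8000 (the port set by --metrics-port above) instead of the sample string.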

Custom Autoscaling Based on Queue Length

Install the ack‑alibaba‑cloud‑metrics‑adapter to expose custom metrics to the Kubernetes HPA. Add the following rule to the adapter configuration:

- seriesQuery: num_requests_waiting{namespace!="",pod!=""}
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
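
Once the adapter has picked up the rule, the metric should be visible through the Kubernetes custom metrics API. The sketch below builds the API path for a per-pod metric and queries it only when kubectl is installed. The custom_metric_path helper is hypothetical, and the default namespace is an assumption that matches the HPA manifest in this guide.

```shell
# custom_metric_path NAMESPACE METRIC: print the custom metrics API path that
# lists METRIC for every pod in NAMESPACE.
custom_metric_path() {
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/pods/*/%s\n' "$1" "$2"
}

path=$(custom_metric_path default num_requests_waiting)
echo "$path"

# Query the API server when kubectl is installed and a cluster is reachable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get --raw "$path" || echo "metric not available yet" >&2
fi
```

An empty items list in the response usually means the seriesQuery matched no series carrying both a namespace and a pod label, which is worth checking in Prometheus before debugging the HPA.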

Create an HPA manifest that adds replicas when the average num_requests_waiting per pod exceeds 10:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama3-8b-instruct-hpa
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: num_requests_waiting
      target:
        type: AverageValue
        averageValue: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama3-8b-instruct-predictor
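
To get a feel for how this target behaves: for a Pods metric with an AverageValue target, the desired replica count is roughly the total metric value across pods divided by the target, rounded up and clamped to the minReplicas and maxReplicas bounds. A minimal sketch of that arithmetic follows. The desired_replicas helper is hypothetical and ignores the controller's tolerance band and stabilization windows.

```shell
# desired_replicas TOTAL_WAITING TARGET MIN MAX
# Prints ceil(TOTAL_WAITING / TARGET), clamped to the range [MIN, MAX].
desired_replicas() {
  total=$1; target=$2; min=$3; max=$4
  n=$(( (total + target - 1) / target ))   # integer ceiling division
  [ "$n" -lt "$min" ] && n=$min
  [ "$n" -gt "$max" ] && n=$max
  echo "$n"
}

desired_replicas 25 10 1 3   # 25 queued requests in total -> 3
desired_replicas 45 10 1 3   # would need 5, capped at maxReplicas -> 3
desired_replicas 0 10 1 3    # idle, floored at minReplicas -> 1
```

With the values in the manifest above, a sustained backlog of more than 20 waiting requests in total is enough to reach the three-replica ceiling.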

Apply the HPA and verify scaling:

kubectl apply -f hpa.yaml
kubectl get hpa llama3-8b-instruct-hpa

During a load test with hey, the pod count expands to the maximum defined, then contracts after traffic subsides.
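
A load test along these lines could be driven as follows. This is a sketch: the two-minute duration, the concurrency of 20, and the payload.json file name are assumptions, and it reuses the NGINX_INGRESS_IP and SERVICE_HOSTNAME variables from the testing step.

```shell
# Request body reused from the earlier curl test.
cat > payload.json <<'EOF'
{"model":"meta/llama3-8b-instruct","messages":[{"role":"user","content":"Once upon a time"}],"max_tokens":64,"temperature":0.7,"top_p":0.9,"seed":10}
EOF

# Run the load test only when hey is installed and the endpoint is known.
if command -v hey >/dev/null 2>&1 && [ -n "${NGINX_INGRESS_IP:-}" ]; then
  hey -z 2m -c 20 -m POST \
    -host "$SERVICE_HOSTNAME" \
    -T "application/json" \
    -D payload.json \
    "http://$NGINX_INGRESS_IP:80/v1/chat/completions"
else
  echo "skipping load test: hey or NGINX_INGRESS_IP is not available"
fi
```

While it runs, kubectl get hpa llama3-8b-instruct-hpa --watch in a second terminal shows the replica count changing as the queue grows and drains.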

Conclusion

By combining NVIDIA NIM, ACK, the Cloud Native AI Suite, Prometheus, and Grafana, you can rapidly build a high‑performance, observable, and elastically scalable LLM inference service that adapts to dynamic workloads through custom metric‑driven autoscaling.
