
How to Deploy NVIDIA NIM AI Models on Volcengine VKE in Minutes

This guide walks you through deploying large language models with NVIDIA NIM on Volcengine's Kubernetes Engine (VKE), covering environment setup, model optimization, Helm chart deployment, monitoring integration, and the key advantages of using NIM as a cloud‑native AI micro‑service.

ByteDance Cloud Native

Aug 12, 2024


Deploying NVIDIA NIM on Volcengine VKE

As large language model (LLM) deployments move into production, they must deliver low latency, high throughput, and observability. This guide outlines a practical workflow using Volcengine Kubernetes Engine (VKE) and the NVIDIA NIM micro‑service.

Typical deployment steps

Environment setup : install CUDA, Python, PyTorch and other dependencies.

Model optimization & packaging : use NVIDIA TensorRT or TensorRT‑LLM to optimize inference.

Model deployment : deploy the optimized container to VKE; in non‑container environments, resources must be managed manually.

Volcengine’s cloud‑native team provides a best‑practice solution that combines NIM’s one‑stop model service with VKE’s cost‑effective, low‑ops Kubernetes clusters.

NVIDIA NIM overview

NVIDIA NIM delivers enterprise‑grade generative AI micro‑services built on Triton Inference Server, TensorRT, TensorRT‑LLM and PyTorch. It supports LLM, VLM, speech, image, video, 3D, drug discovery and medical imaging workloads.

To decouple model and runtime, NIM is packaged as a container image that can be deployed in Kubernetes.

Deployment workflow on VKE

Prerequisites:

VKE cluster with the csi-nas, prometheus-agent, vci-virtual-kubelet, and cr-credential-controller components installed.

GPU‑compatible VCI instance.

NAS storage class for model files.

Image registry (CR) for NIM images.

VMP Prometheus service enabled.

NGC API key for pulling NIM images.

1. Pull the official NIM image and push it to your CR:

$ export NGC_API_KEY=<value> # ngc api key
$ echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
$ docker pull nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
$ docker tag nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct:1.0.0
$ echo "<your-cr-password>" | docker login --username=<account-name>@<account-id> --password-stdin <your-cr-host>
$ docker push <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct:1.0.0
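
The pull, tag, and push sequence above can also be generated by a small script, so the same steps are repeatable for other models or tags. The following is a minimal Python sketch and not part of the original guide: `target_ref`, `mirror_commands`, the registry host `cr.example.com`, and the namespace `ai-models` are hypothetical names. It only builds and prints the docker commands; it does not run them.

```python
import shlex


def target_ref(source_ref: str, cr_host: str, cr_namespace: str) -> str:
    """Map an nvcr.io image reference to the equivalent reference in your CR."""
    # Keep only the final path component, e.g. "llama3-8b-instruct:1.0.0".
    name_and_tag = source_ref.rsplit("/", 1)[-1]
    return f"{cr_host}/{cr_namespace}/{name_and_tag}"


def mirror_commands(source_ref: str, cr_host: str, cr_namespace: str) -> list[list[str]]:
    """Return the docker commands that copy source_ref into the CR namespace."""
    dest = target_ref(source_ref, cr_host, cr_namespace)
    return [
        ["docker", "pull", source_ref],
        ["docker", "tag", source_ref, dest],
        ["docker", "push", dest],
    ]


# Hypothetical registry host and namespace, for illustration only.
commands = mirror_commands(
    "nvcr.io/nim/meta/llama3-8b-instruct:1.0.0", "cr.example.com", "ai-models"
)
for argv in commands:
    print(shlex.join(argv))
```

Logging in to both registries, as in the commands above, is still required before running the printed commands.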

2. Clone the Helm chart and push it to the OCI registry:

$ git clone https://github.com/NVIDIA/nim-deploy.git
$ cd nim-deploy/helm/nim-llm
$ helm registry login --username=<account-name>@<account-id> <your-cr-host>
$ helm package ./ --version 0.2.1
$ helm push nim-llm-0.2.1.tgz oci://<your-cr-host>/<your-cr-namespace>
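
Helm stores a pushed chart under the chart name, with the chart version as the tag, so the chart above ends up at `oci://<your-cr-host>/<your-cr-namespace>/nim-llm` with version 0.2.1. A small Python sketch of that naming rule follows; the function names, registry host, and namespace are hypothetical.

```python
def chart_archive(name: str, version: str) -> str:
    """File name that `helm package` writes for a chart name and version."""
    return f"{name}-{version}.tgz"


def pushed_chart_ref(cr_host: str, cr_namespace: str, name: str) -> str:
    """OCI reference of the chart after `helm push <archive> oci://<host>/<namespace>`."""
    return f"oci://{cr_host}/{cr_namespace}/{name}"


# Hypothetical registry host and namespace, for illustration only.
archive = chart_archive("nim-llm", "0.2.1")
ref = pushed_chart_ref("cr.example.com", "ai-models", "nim-llm")
print(archive)
print(f"{ref} (version 0.2.1)")
```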

3. In the VKE console create a Helm application, select the chart, and edit values.yaml with the following configuration (adjust storage class, image repository, GPU limits, etc.):

image:
  repository: <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct
  tag: 1.0.0
model:
  name: meta/llama3-8b-instruct
  ngcAPISecret: ngc-api
  ngcAPIKey: "<your-ngc-api-key>"
persistence:
  enabled: true
  storageClass: "<your-nas-storage-class>"
  annotations:
    helm.sh/resource-policy: keep
statefulSet:
  enabled: false
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      volcengine.vmp: "true"
service:
  type: LoadBalancer
podAnnotations:
  vci.vke.volcengine.com/preferred-instance-family: vci.gni2
  vke.volcengine.com/burst-to-vci: enforce
resources:
  limits:
    nvidia.com/gpu: 1
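
The configuration above contains several `<your-...>` placeholders that must be replaced before the release is created. As a minimal sketch that is not part of the original workflow, the following Python snippet scans the text of a values file for leftover placeholders. The function name `find_placeholders` and the sample text are illustrative assumptions.

```python
import re

# Matches angle-bracket placeholders such as <your-cr-host>.
PLACEHOLDER = re.compile(r"<[A-Za-z][A-Za-z0-9-]*>")


def find_placeholders(values_text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs for every unreplaced placeholder."""
    found = []
    for lineno, line in enumerate(values_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((lineno, match.group(0)))
    return found


# Excerpt of the values shown above, used here as sample input.
sample = """\
image:
  repository: <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct
  tag: 1.0.0
persistence:
  storageClass: "<your-nas-storage-class>"
"""

for lineno, placeholder in find_placeholders(sample):
    print(f"line {lineno}: {placeholder} still needs a real value")
```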

4. After the Helm release becomes Ready, obtain the LoadBalancer service address from the VKE console and test the endpoint:

$ curl -X POST http://<lb-ip>:8000/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
  "messages": [
    {"content":"You are a polite chatbot...", "role":"system"},
    {"content":"What should I do for a 4‑day vacation in Spain?", "role":"user"}
  ],
  "model":"meta/llama3-8b-instruct",
  "max_tokens":16,
  "top_p":1,
  "n":1,
  "stream":false,
  "stop":"\n",
  "frequency_penalty":0.0
}'

The response contains the generated text from the model.
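
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the response follows the OpenAI-style chat completion shape, with the generated text at `choices[0].message.content`. The helper names and the sample response are illustrative, not output captured from a real deployment.

```python
import json
import urllib.request


def build_payload(user_message: str, max_tokens: int = 16) -> dict:
    """Build the same request body as the curl example above."""
    return {
        "model": "meta/llama3-8b-instruct",
        "messages": [
            {"role": "system", "content": "You are a polite chatbot..."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "top_p": 1,
        "n": 1,
        "stream": False,
        "stop": "\n",
        "frequency_penalty": 0.0,
    }


def extract_text(response: dict) -> str:
    """Return the generated text of the first choice."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, user_message: str, timeout: float = 60.0) -> str:
    """POST a chat completion request to the NIM service and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_text(json.load(reply))


# Illustrative response only; real responses carry more fields.
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Spain is lovely!"}}
    ]
}
print(extract_text(sample_response))
```

Calling `chat("http://<lb-ip>:8000", "What should I do for a 4-day vacation in Spain?")` sends the request to the LoadBalancer address. With `max_tokens` set to 16 and a newline as the stop sequence, expect only a short, single-line reply.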

Observability

NIM exposes Prometheus metrics that can be visualized in Grafana. Follow the Volcengine documentation to install Grafana, import the NIM dashboard, and enable the VMP monitoring service.
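
Before wiring up dashboards, it can help to confirm what the service exposes. The sketch below parses text in the Prometheus exposition format into name, labels, and value samples. The metric names in `sample` are hypothetical placeholders rather than actual NIM metric names, and the parser is deliberately simplified: it assumes label values contain no commas, spaces, or escaped quotes, and that lines carry no timestamp.

```python
def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse simple Prometheus exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        name, _, label_part = series.partition("{")
        labels = {}
        if label_part:
            for pair in label_part.rstrip("}").split(","):
                if pair:
                    key, _, raw = pair.partition("=")
                    labels[key.strip()] = raw.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical metric names, for illustration only.
sample = """\
# HELP example_requests_total Total requests handled.
# TYPE example_requests_total counter
example_requests_total{model="meta/llama3-8b-instruct"} 42
example_gpu_cache_usage 0.25
"""

for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

In practice the input would be the text scraped from the service's metrics endpoint rather than a hard-coded string.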

Advantages of using NIM on VKE

Ease of use : pre‑built container images eliminate manual environment setup.

Performance : optimized for NVIDIA GPUs, leveraging VCI hardware.

Model selection : multiple LLMs available; switch by editing values.yaml.

Automatic updates : NGC handles model version upgrades.

Observability : built‑in metrics integrate with VKE and VMP.


