How to Deploy NVIDIA NIM AI Models on Volcengine VKE in Minutes
This guide walks you through deploying large language models with NVIDIA NIM on Volcengine's Kubernetes Engine (VKE), covering environment setup, model optimization, Helm chart deployment, monitoring integration, and the key advantages of using NIM as a cloud‑native AI micro‑service.
Deploying NVIDIA NIM on Volcengine VKE
As large language model (LLM) deployments move into production, they need low latency, high throughput, and observability. This guide outlines a practical workflow using Volcengine Kubernetes Engine (VKE) and NVIDIA NIM micro‑services.
Typical deployment steps
Environment setup: install CUDA, Python, PyTorch, and other dependencies.
Model optimization & packaging: use NVIDIA TensorRT or TensorRT‑LLM to optimize inference.
Model deployment: push the optimized container to VKE; in non‑container environments, manage resources manually.
Volcengine’s cloud‑native team provides a best‑practice solution that combines NIM’s one‑stop model service with VKE’s cost‑effective, low‑ops Kubernetes clusters.
NVIDIA NIM overview
NVIDIA NIM delivers enterprise‑grade generative AI micro‑services built on Triton Inference Server, TensorRT, TensorRT‑LLM and PyTorch. It supports LLM, VLM, speech, image, video, 3D, drug discovery and medical imaging workloads.
To decouple the model from the runtime, NIM is packaged as a container image that can be deployed on Kubernetes.
Deployment workflow on VKE
Prerequisites:
A VKE cluster with the csi‑nas, prometheus‑agent, vci‑virtual‑kubelet, and cr‑credential‑controller components installed.
GPU‑compatible VCI instance.
NAS storage class for model files.
Image registry (CR) for NIM images.
VMP Prometheus service enabled.
NGC API key for pulling NIM images.
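Before starting, it can help to confirm that the local workstation has what the steps below invoke. The following is a minimal sketch, assuming the commands are run from a machine with `docker`, `git`, `helm`, and `curl` on the PATH and the NGC key exported as `NGC_API_KEY`. The helper names `missing_tools` and `missing_env` are hypothetical, and the check does not cover the cluster-side components listed above.

```python
import os
import shutil

# CLI tools that the steps in this guide invoke; adjust if your workflow differs.
REQUIRED_TOOLS = ("docker", "git", "helm", "curl")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def missing_env(names=("NGC_API_KEY",), environ=None):
    """Return the environment variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    problems = missing_tools() + missing_env()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("Local prerequisites look fine.")
```

The `which` and `environ` parameters exist only so the checks can be exercised without touching the real system.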
1. Pull the official NIM image and push it to your CR:
<code>$ export NGC_API_KEY=<value> # NGC API key
$ echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
$ docker pull nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
$ docker tag nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct:1.0.0
$ echo "<your-cr-password>" | docker login --username=<account-name>@<account-id> --password-stdin <your-cr-host>
$ docker push <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct:1.0.0</code>
2. Clone the Helm chart and push it to the OCI registry:
<code>$ git clone https://github.com/NVIDIA/nim-deploy.git
$ cd nim-deploy/helm/nim-llm
$ helm registry login --username=<account-name>@<account-id> <your-cr-host>
$ helm package ./ --version 0.2.1
$ helm push nim-llm-0.2.1.tgz oci://<your-cr-host>/<your-cr-namespace></code>
3. In the VKE console, create a Helm application, select the chart, and edit values.yaml with the following configuration (adjust the storage class, image repository, GPU limits, and so on):
<code>image:
  repository: <your-cr-host>/<your-cr-namespace>/llama3-8b-instruct
  tag: 1.0.0
model:
  name: meta/llama3-8b-instruct
  ngcAPISecret: ngc-api
  ngcAPIKey: "<your-ngc-api-key>"
persistence:
  enabled: true
  storageClass: "<your-nas-storage-class>"
  annotations:
    helm.sh/resource-policy: keep
statefulSet:
  enabled: false
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      volcengine.vmp: "true"
service:
  type: LoadBalancer
podAnnotations:
  vci.vke.volcengine.com/preferred-instance-family: vci.gni2
  vke.volcengine.com/burst-to-vci: enforce
resources:
  limits:
    nvidia.com/gpu: 1</code>
4. After the Helm release becomes Ready, obtain the LoadBalancer service address from the VKE console and test the endpoint:
<code>$ curl -X POST http://<lb-ip>:8000/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"content":"You are a polite chatbot...", "role":"system"},
      {"content":"What should I do for a 4-day vacation in Spain?", "role":"user"}
    ],
    "model":"meta/llama3-8b-instruct",
    "max_tokens":16,
    "top_p":1,
    "n":1,
    "stream":false,
    "stop":"\n",
    "frequency_penalty":0.0
  }'</code>
The response contains the generated text from the model.
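The same request can be issued from application code. The sketch below uses only the Python standard library and assumes the endpoint returns an OpenAI-style body in which the generated text sits at `choices[0].message.content`. The function names, the default system prompt, and the `max_tokens` default are illustrative choices, not part of NIM.

```python
import json
import urllib.request


def build_chat_request(base_url, user_message,
                       system_prompt="You are a helpful assistant.",  # hypothetical default
                       model="meta/llama3-8b-instruct", max_tokens=64):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the first choice's text, assuming an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"]


def ask(base_url, user_message, timeout=60):
    """Send one chat request and return the generated text."""
    request = build_chat_request(base_url, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the LoadBalancer address substituted for the placeholder, a call such as `ask("http://<lb-ip>:8000", "What should I do for a 4-day vacation in Spain?")` would return the reply as a string.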
Observability
NIM exposes Prometheus metrics that can be visualized in Grafana. Follow the Volcengine documentation to install Grafana, import the NIM dashboard, and enable the VMP monitoring service.
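For a quick look at the raw metrics before a dashboard is in place, the Prometheus text format can be read directly. The sketch below assumes the metrics are served at `/metrics` on the same port as the API, so verify the path for your NIM version. The parser is deliberately simplified: it assumes label values contain no commas or closing braces, and it ignores optional timestamps.

```python
import urllib.request


def fetch_metrics(base_url, path="/metrics", timeout=10):
    """Download the metrics page as text; the default path is an assumption."""
    with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        labels = {}
        if "{" in line:
            name, rest = line.split("{", 1)
            label_text, rest = rest.split("}", 1)
            for pair in label_text.split(","):
                if pair.strip():
                    key, value = pair.split("=", 1)
                    labels[key.strip()] = value.strip().strip('"')
        else:
            name, rest = line.split(None, 1)
        # The first token after the name/labels is the sample value.
        samples.append((name.strip(), labels, float(rest.split()[0])))
    return samples
```

Calling `parse_metrics(fetch_metrics("http://<lb-ip>:8000"))` yields a list that can be filtered by metric name. The actual metric names depend on the NIM release and are not assumed here.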
Advantages of using NIM on VKE
Ease of use: pre‑built container images eliminate manual environment setup.
Performance: optimized for NVIDIA GPUs, leveraging VCI hardware.
Model selection: multiple LLMs are available; switch by editing values.yaml.
Automatic updates: NGC handles model version upgrades.
Observability: built‑in metrics integrate with VKE and VMP.
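To illustrate the model selection point, switching models comes down to changing three fields in the chart values: `image.repository`, `image.tag`, and `model.name`. The sketch below models the values as a plain Python dictionary. The `switch_model` helper is hypothetical, and the arguments passed to it must correspond to an image you have already pushed to your registry.

```python
import copy


def switch_model(values, repository, tag, model_name):
    """Return a copy of a values mapping that points at a different NIM image and model."""
    updated = copy.deepcopy(values)  # leave the caller's mapping untouched
    updated.setdefault("image", {}).update({"repository": repository, "tag": tag})
    updated.setdefault("model", {})["name"] = model_name
    return updated
```

All other settings, such as persistence, metrics, and GPU limits, carry over unchanged, although a larger model may need a higher `nvidia.com/gpu` limit.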
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.