Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Container Service for Kubernetes (ACK) using KServe and Triton Inference Server with the TensorRT‑LLM backend. It covers prerequisites, model preparation and the engine compilation script, PV/PVC setup, ClusterServingRuntime creation, InferenceService deployment, and troubleshooting.

Background

KServe is an open‑source cloud‑native model serving platform that simplifies deploying machine‑learning models on Kubernetes. Triton Inference Server (by NVIDIA) provides a high‑performance inference runtime supporting multiple frameworks, and its TensorRT‑LLM backend accelerates large language model (LLM) inference on GPUs.

Prerequisites

Kubernetes cluster with GPU nodes (GPU memory ≥ 24 GB).

KServe installed on the cluster.

A PersistentVolumeClaim named llm-model, backed by shared storage, for holding the model weights and the compilation script.
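
The exact PV/PVC definition depends on your storage backend; the snippet below is only a minimal sketch that creates the llm-model claim from a pre-existing shared storage class (the class name and the 200Gi size are placeholders, not values from this guide):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
  - ReadWriteMany                               # shared access so a loader pod and the predictor can both mount it
  storageClassName: your-shared-storage-class   # placeholder: e.g. a NAS-backed storage class on ACK
  resources:
    requests:
      storage: 200Gi                            # illustrative size; must be large enough for the Llama-2-7b weights
EOF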

1. Prepare model data and compilation script

Download the Llama-2-7b-hf model from Hugging Face or ModelScope and place it, together with the compilation script, at the root of the llm-model PVC; KServe makes the PVC contents available at /mnt/models inside the predictor pod (a staging example follows the script). The script, trtllm-llama-2-7b.sh, clones the tensorrtllm_backend repository (v0.9.0), converts the checkpoint, builds the TensorRT‑LLM engine, configures the model repository for Triton, and then starts the Triton server.

#!/bin/sh
set -e
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend

# Clone tensorrtllm_backend
if [ -d "$TRT_BACKEND_DIR" ]; then
  echo "directory $TRT_BACKEND_DIR exists, skip clone"
else
  cd /root
  git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  cd $TRT_BACKEND_DIR
  git submodule update --init --recursive
  git lfs install
  git lfs pull
fi

# Convert checkpoint
if [ -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
  echo "checkpoint already exists, skip conversion"
else
  python3 $TRT_BACKEND_DIR/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir $MODEL_MOUNT_PATH/Llama-2-7b-hf \
    --output_dir $OUTPUT_DIR/llama-2-7b-ckpt \
    --dtype float16
fi

# Build TensorRT‑LLM engine
if [ -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
  echo "engine already exists, skip build"
else
  trtllm-build --checkpoint_dir $OUTPUT_DIR/llama-2-7b-ckpt \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir $OUTPUT_DIR/llama-2-7b-engine \
    --paged_kv_cache enable \
    --max_batch_size 8
fi

# Configure model repository
cd $TRT_BACKEND_DIR
cp -r all_models/inflight_batcher_llm/ llama_ifb
export HF_LLAMA_MODEL=$MODEL_MOUNT_PATH/Llama-2-7b-hf
export ENGINE_PATH=$OUTPUT_DIR/llama-2-7b-engine
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

# Start Triton server
pip install SentencePiece
tritonserver --model-repository=$TRT_BACKEND_DIR/llama_ifb --http-port=8080 --grpc-port=9000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_
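
The rest of the guide assumes the weights and this script already sit on the llm-model PVC. One way to stage them is sketched below; the Hugging Face repository is gated (license acceptance and an access token are required), and "model-loader" is a placeholder for any helper pod that mounts the PVC at /mnt/models:

# Fetch the weights locally (requires git-lfs and access to the gated repo, or download from ModelScope)
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf

# Copy the model directory and the compilation script onto the PVC through the helper pod
kubectl cp ./Llama-2-7b-hf model-loader:/mnt/models/Llama-2-7b-hf
kubectl cp ./trtllm-llama-2-7b.sh model-loader:/mnt/models/trtllm-llama-2-7b.sh
kubectl exec model-loader -- chmod +x /mnt/models/trtllm-llama-2-7b.sh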

2. Create ClusterServingRuntime

Define a ClusterServingRuntime named triton-trtllm that uses the Triton image with TensorRT‑LLM support and requests CPU and memory for the serving container (the GPU is requested by the InferenceService in the next step).

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
  - name: kserve-container
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    args:
    - tritonserver
    - --model-store=/mnt/models
    - --grpc-port=9000
    - --http-port=8080
    - --allow-grpc=true
    - --allow-http=true
    resources:
      requests:
        cpu: "4"
        memory: 12Gi
  protocolVersions:
  - v2
  - grpc-v2
  supportedModelFormats:
  - name: triton
    version: "2"

3. Deploy InferenceService

Create an InferenceService that points to the PVC llm-model, uses the runtime defined above, and requests one GPU. The container command is overridden so that the pod first runs the compilation script (which converts the checkpoint and builds the engine) and then starts Triton.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
      - sh
      - -c
      - /mnt/models/trtllm-llama-2-7b.sh

Check readiness with kubectl get isvc llama-2-7b; the READY column should show True.
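
For example (the manifest file name is illustrative, and the first start can take a while because the engine is built inside the pod):

kubectl apply -f llama-2-7b-isvc.yaml
kubectl get isvc llama-2-7b     # wait until the READY column shows True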

4. Access the service

Three ways to send inference requests:

Inside the container

curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input":"What is machine learning?","max_tokens":20}'
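
If you are not already inside the predictor pod, you can exec into it first; the label selector below follows KServe's usual pod labeling and should be treated as an assumption:

POD=$(kubectl get pods -l serving.kserve.io/inferenceservice=llama-2-7b -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -c kserve-container -- \
  curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input":"What is machine learning?","max_tokens":20}'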

From another pod in the cluster

NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.spec.clusterIP}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d '/' -f3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${NGINX_INGRESS_IP}:80/v2/models/ensemble/generate -d '{"text_input":"What is machine learning?","max_tokens":20}'

From outside the cluster: replace the cluster IP with the external (load-balancer) IP of the ingress Service; the curl command is otherwise identical.
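
On ACK the nginx-ingress-lb Service is typically of type LoadBalancer, so the external IP can be read as follows (a sketch, assuming that Service type and reusing SERVICE_HOSTNAME from above):

NGINX_INGRESS_EXTERNAL_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${NGINX_INGRESS_EXTERNAL_IP}:80/v2/models/ensemble/generate -d '{"text_input":"What is machine learning?","max_tokens":20}'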

5. Common issue

If the Triton image fails to pull with a 401 Unauthorized error, authentication to the NVIDIA registry (nvcr.io) has failed. Pull the image manually on a machine with working credentials, push it to a private registry (for example, Alibaba Cloud Container Registry), and update the image field in the ClusterServingRuntime to reference the private copy, as shown below.
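
A minimal sketch of that workaround, where registry.example.com/llm is a placeholder for your own private registry path:

docker login nvcr.io   # username: $oauthtoken, password: your NGC API key
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
docker tag nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 registry.example.com/llm/tritonserver:24.04-trtllm-python-py3
docker push registry.example.com/llm/tritonserver:24.04-trtllm-python-py3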

References

https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md

https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md

https://github.com/kserve/kserve

https://github.com/triton-inference-server/server
