Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide
This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Container Service for Kubernetes (ACK) with KServe and Triton Inference Server using the TensorRT‑LLM backend. It covers prerequisites, model preparation and engine compilation, PV/PVC setup, serving-runtime and InferenceService configuration, sending requests, and a common troubleshooting step.
Background
KServe is an open‑source cloud‑native model serving platform that simplifies deploying machine‑learning models on Kubernetes. Triton Inference Server (by NVIDIA) provides a high‑performance inference runtime supporting multiple frameworks, and its TensorRT‑LLM backend accelerates large language model (LLM) inference on GPUs.
Prerequisites
Kubernetes cluster with GPU nodes (GPU memory ≥ 24 GB).
KServe installed on the cluster.
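The InferenceService in step 3 reads its model storage from a PVC named llm-model. If the cluster does not already have one, a minimal sketch follows; the storage class, access mode, and size are assumptions to adjust for the storage add-on available in your ACK cluster.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadWriteMany                       # assumes shared (NAS-style) storage; use ReadWriteOnce for block disks
  storageClassName: your-storage-class    # placeholder: substitute a class available in your ACK cluster
  resources:
    requests:
      storage: 100Gi                      # room for the HF weights plus the built engine; adjust as needed
EOF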
1. Prepare model data and compilation script
Download the Llama-2-7b-hf model from HuggingFace/ModelScope. Create a shell script trtllm-llama-2-7b.sh that clones the tensorrtllm_backend repository (v0.9.0), converts the checkpoint, builds the TensorRT‑LLM engine, and configures the model repository for Triton.
#!/bin/sh
set -e
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend
# Clone tensorrtllm_backend
if [ -d "$TRT_BACKEND_DIR" ]; then
echo "directory $TRT_BACKEND_DIR exists, skip clone"
else
cd /root
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd $TRT_BACKEND_DIR
git submodule update --init --recursive
git lfs install
git lfs pull
fi
# Convert checkpoint
if [ -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
echo "checkpoint already exists, skip conversion"
else
python3 $TRT_BACKEND_DIR/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir $MODEL_MOUNT_PATH/Llama-2-7b-hf \
--output_dir $OUTPUT_DIR/llama-2-7b-ckpt \
--dtype float16
fi
# Build TensorRT‑LLM engine
if [ -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
echo "engine already exists, skip build"
else
trtllm-build --checkpoint_dir $OUTPUT_DIR/llama-2-7b-ckpt \
--remove_input_padding enable \
--gpt_attention_plugin float16 \
--context_fmha enable \
--gemm_plugin float16 \
--output_dir $OUTPUT_DIR/llama-2-7b-engine \
--paged_kv_cache enable \
--max_batch_size 8
fi
# Configure model repository
cd $TRT_BACKEND_DIR
cp -r all_models/inflight_batcher_llm/ llama_ifb
export HF_LLAMA_MODEL=$MODEL_MOUNT_PATH/Llama-2-7b-hf
export ENGINE_PATH=$OUTPUT_DIR/llama-2-7b-engine
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
# Start Triton server
pip install SentencePiece
tritonserver --model-repository=$TRT_BACKEND_DIR/llama_ifb \
  --http-port=8080 --grpc-port=9000 --metrics-port=8002 \
  --disable-auto-complete-config \
  --backend-config=python,shm-region-prefix-name=prefix0_
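The downloaded Llama-2-7b-hf directory and trtllm-llama-2-7b.sh must end up on the storage mounted at /mnt/models. One way to stage them onto the llm-model PVC is a throwaway helper pod; the pod name and local paths here are illustrative.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: model-prep
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "86400"]
      volumeMounts:
        - name: model
          mountPath: /mnt/models
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: llm-model
EOF
# Copy the weights and the build script onto the shared volume, then clean up.
kubectl cp Llama-2-7b-hf default/model-prep:/mnt/models/Llama-2-7b-hf
kubectl cp trtllm-llama-2-7b.sh default/model-prep:/mnt/models/trtllm-llama-2-7b.sh
kubectl exec model-prep -- chmod +x /mnt/models/trtllm-llama-2-7b.sh
kubectl delete pod model-prep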
2. Create ClusterServingRuntime
Define a ClusterServingRuntime named triton-trtllm that uses the Triton image with TensorRT‑LLM support and requests CPU and memory for the serving container; the GPU itself is requested by the InferenceService in the next step.
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - name: kserve-container
      image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
      args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      resources:
        requests:
          cpu: "4"
          memory: 12Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: triton
      version: "2"
3. Deploy InferenceService
Create an InferenceService that points to the PVC llm-model, uses the runtime defined above, and requests one GPU.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
        - sh
        - -c
        - /mnt/models/trtllm-llama-2-7b.sh
Check readiness with kubectl get isvc llama-2-7b; the READY column should show True.
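The first startup is slow because trtllm-llama-2-7b.sh converts the checkpoint and builds the engine before launching Triton. One way to watch progress, assuming the standard KServe pod label and the kserve-container name used above:
# Find the predictor pod via the label KServe puts on it.
kubectl get pods -l serving.kserve.io/inferenceservice=llama-2-7b
# Follow the checkpoint conversion, engine build, and Triton startup logs.
kubectl logs -f -l serving.kserve.io/inferenceservice=llama-2-7b -c kserve-container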
4. Access the service
Three ways to send inference requests:
Inside the container
curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input":"What is machine learning?","max_tokens":20}'
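If the request fails, two standard Triton HTTP endpoints help confirm from inside the container that the server and model repository are healthy:
# Returns HTTP 200 once the server and all models have loaded.
curl -sf localhost:8080/v2/health/ready && echo "server ready"
# List the models Triton loaded from llama_ifb (ensemble, preprocessing, postprocessing, tensorrt_llm, tensorrt_llm_bls).
curl -s -X POST localhost:8080/v2/repository/index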
From another pod in the cluster
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.spec.clusterIP}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d '/' -f3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${NGINX_INGRESS_IP}:80/v2/models/ensemble/generate -d '{"text_input":"What is machine learning?","max_tokens":20}'From outside the cluster (replace the ingress IP with the load‑balancer IP) – same curl command.
5. Common issue
If the Triton image fails to pull with a 401 error, authentication to the NVIDIA registry has failed. Pull the image manually on a machine with proper credentials, push it to a private registry, and update the image field in the ClusterServingRuntime to reference the private registry.
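A sketch of the workaround, using Alibaba Cloud Container Registry as the private registry; the registry address and namespace below are placeholders:
# Mirror the image from nvcr.io to a registry the cluster can authenticate against.
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
docker tag nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 registry.cn-hangzhou.aliyuncs.com/<your-namespace>/tritonserver:24.04-trtllm-python-py3
docker push registry.cn-hangzhou.aliyuncs.com/<your-namespace>/tritonserver:24.04-trtllm-python-py3
# Then update spec.containers[0].image in the triton-trtllm ClusterServingRuntime to the mirrored image.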
References
https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md
https://github.com/kserve/kserve
https://github.com/triton-inference-server/server