Deploy TensorRT‑LLM Optimized Llama‑2 on KServe with Alibaba Cloud ASM
This guide walks through enabling KServe on Alibaba Cloud ASM, preparing the Llama‑2‑7B model with TensorRT‑LLM, creating the necessary Kubernetes resources, and deploying a serverless AI inference service that can be queried via a simple curl request.
Background
KServe (formerly KFServing) is a cloud-native model-serving platform that supports autoscaling (including scale-to-zero) and canary deployments. It can run models through serving runtimes such as MLServer, TensorFlow Serving, Triton Inference Server, and TorchServe. TensorRT-LLM provides a Python API for converting large language models into TensorRT engines optimized for NVIDIA GPUs; the resulting engines can be served by the TensorRT-LLM backend for Triton, which is the runtime this guide uses with KServe.
Enable ACK Knative and KServe on ASM
Create an ACK Serverless cluster.
In the Alibaba Cloud Service Mesh (ASM) console, create a mesh instance and associate it with a GPU‑enabled Kubernetes cluster.
Create an ASM ingress gateway with default settings.
Enable Knative integration in the ASM console.
In the ACK console, go to the Knative page, select ASM and click the one‑click Knative deployment button.
Enable KServe on ASM (skip cert‑manager installation if already present). Do not enable the “Install Model Mesh” option for this example.
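After completing these steps, you can optionally confirm from the command line that the Knative and KServe components are up. A minimal check, assuming they were installed into the default knative-serving and kserve namespaces:
# Knative serving components should all be Running
kubectl get pods -n knative-serving
# KServe controller should be Running and its CRDs registered
kubectl get pods -n kserve
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io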
Prepare Model Data and Compilation Script
Download the Llama‑2‑7B‑hf model from ModelScope and ensure git‑lfs is installed.
# Install git‑lfs if needed
git lfs install
# Clone the model repository without downloading LFS objects
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/shakechen/Llama-2-7b-hf.git
cd Llama-2-7b-hf/
git lfs pull

Create a shell script trtllm-llama-2-7b.sh that performs the following steps:
#!/bin/sh
set -e
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend
# Clone TensorRT‑LLM backend (v0.9.0)
if [ ! -d "$TRT_BACKEND_DIR" ]; then
cd /root
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd "$TRT_BACKEND_DIR"
git submodule update --init --recursive
git lfs install
git lfs pull
fi
# Convert the HuggingFace checkpoint to TensorRT‑LLM format
if [ ! -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
python3 "$TRT_BACKEND_DIR"/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir "$MODEL_MOUNT_PATH"/Llama-2-7b-hf \
--output_dir "$OUTPUT_DIR"/llama-2-7b-ckpt \
--dtype float16
fi
# Build the TensorRT engine
if [ ! -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
trtllm-build --checkpoint_dir "$OUTPUT_DIR"/llama-2-7b-ckpt \
--remove_input_padding enable \
--gpt_attention_plugin float16 \
--context_fmha enable \
--gemm_plugin float16 \
--output_dir "$OUTPUT_DIR"/llama-2-7b-engine \
--paged_kv_cache enable \
--max_batch_size 8
fi
# Prepare Triton model repository
cd "$TRT_BACKEND_DIR"
cp -r all_models/inflight_batcher_llm/ llama_ifb
export HF_LLAMA_MODEL="$MODEL_MOUNT_PATH"/Llama-2-7b-hf
export ENGINE_PATH="$OUTPUT_DIR"/llama-2-7b-engine
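# Fill in the config.pbtxt templates for each model in the ensemble:
# preprocessing, postprocessing, the BLS wrapper, the ensemble, and the TensorRT-LLM engine itself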
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt \
tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt \
tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt \
triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt \
triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt \
triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,\
max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,\
max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,\
exclude_input_in_output:True,enable_kv_cache_reuse:False,\
batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
# Start Triton server
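# SentencePiece is needed by the Llama tokenizer used in the preprocessing/postprocessing models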
pip install SentencePiece
tritonserver --model-repository="$TRT_BACKEND_DIR"/llama_ifb \
--http-port=8080 --grpc-port=9000 --metrics-port=8002 \
--disable-auto-complete-config \
--backend-config=python,shm-region-prefix-name=prefix0_

Upload Model and Script to OSS, Create PV/PVC
# Create OSS directory (replace YOUR_BUCKET_NAME with your bucket)
ossutil mkdir oss://YOUR_BUCKET_NAME/Llama-2-7b-hf
# Upload model files
ossutil cp -r ./Llama-2-7b-hf oss://YOUR_BUCKET_NAME/Llama-2-7b-hf
# Upload the compilation script
chmod +x trtllm-llama-2-7b.sh
ossutil cp -r ./trtllm-llama-2-7b.sh oss://YOUR_BUCKET_NAME/trtllm-llama-2-7b.sh

Create a Kubernetes Secret for OSS credentials and define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that mount the OSS bucket.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${YOUR_ACCESSKEY_ID}
  akSecret: ${YOUR_ACCESSKEY_SECRET}
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${YOUR_BUCKET_NAME}
      url: ${YOUR_BUCKET_ENDPOINT}
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
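Apply the Secret, PV, and PVC, then make sure the claim binds. The manifest file name below is only an example; use whatever name you saved the YAML under:
kubectl apply -f oss-model-storage.yaml
# The PVC should report STATUS Bound
kubectl get pvc llm-model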
Create ClusterServingRuntime
Define a custom runtime that uses the Triton-based TensorRT-LLM image.
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
    k8s.aliyun.com/eci-auto-imc: 'true'
    k8s.aliyun.com/eci-use-specs: "ecs.gn7i-c8g1.2xlarge,ecs.gn7i-c16g1.4xlarge,ecs.gn7i-c32g1.8xlarge,ecs.gn7i-c48g1.12xlarge"
    k8s.aliyun.com/eci-extra-ephemeral-storage: 100Gi
  containers:
    - name: kserve-container
      image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
      args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      resources:
        requests:
          cpu: "4"
          memory: 12Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: triton
      version: "2"
Deploy the InferenceService
Bind the runtime, reference the PVC, and request one GPU.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
        - sh
        - -c
        - /mnt/models/trtllm-llama-2-7b.sh

Apply the manifest; KServe will create the necessary pods and expose the model.
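For example, assuming the InferenceService manifest above was saved as llama-2-7b.yaml (the file name is illustrative), and using the pod label KServe normally sets on predictor pods:
kubectl apply -f llama-2-7b.yaml
# The first start can take a while because the script compiles the TensorRT engine in the pod
kubectl get pods -l serving.kserve.io/inferenceservice=llama-2-7b -w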
Verify Deployment
kubectl get isvc llama-2-7b

The service should show READY and provide a URL.
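If you only need the service URL (whose host part is also the Host header used when calling through the ASM gateway below), you can read it from the status field, for example:
kubectl get isvc llama-2-7b -o jsonpath='{.status.url}'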
Invoke the LLM Service
# Get the ASM ingress gateway IP
ASM_GATEWAY_IP=$(kubectl -n istio-system get svc istio-ingressgateway -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
# Send a generation request
curl -H "Host: llama-2-7b.default.example.com" -H "Content-Type: application/json" \
http://$ASM_GATEWAY_IP:80/v2/models/ensemble/generate \
-d '{"text_input":"What is machine learning?","max_tokens":20,"bad_words":"","stop_words":"","pad_id":2,"end_id":2}'Sample response (truncated):
{"text_output":"
Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate", ...}