Deploy TensorRT‑LLM Optimized Llama‑2 on KServe with Alibaba Cloud ASM

This guide walks through enabling KServe on Alibaba Cloud ASM, preparing the Llama‑2‑7B model with TensorRT‑LLM, creating the necessary Kubernetes resources, and deploying a serverless AI inference service that can be queried via a simple curl request.


Background

KServe (formerly KFServing) is a cloud-native model-serving platform that supports autoscaling, scale-to-zero, and canary rollouts. It can serve models through runtimes such as MLServer, TensorFlow Serving, Triton Inference Server, and TorchServe. TensorRT‑LLM provides a Python API for converting large language models into optimized TensorRT engines for NVIDIA GPUs, and it can run as a Triton backend within KServe.

Enable ACK Knative and KServe on ASM

1. Create an ACK Serverless cluster.

2. In the Alibaba Cloud Service Mesh (ASM) console, create a mesh instance and associate it with a GPU‑enabled Kubernetes cluster.

3. Create an ASM ingress gateway with default settings.

4. Enable Knative integration in the ASM console.

5. In the ACK console, go to the Knative page, select ASM, and click the one‑click Knative deployment button.

6. Enable KServe on ASM (skip cert‑manager installation if it is already present). Do not enable the “Install Model Mesh” option for this example.

Prepare Model Data and Compilation Script

Download the Llama‑2‑7B‑hf model from ModelScope and ensure git‑lfs is installed.

# Install git‑lfs if needed
git lfs install

# Clone the model repository without downloading LFS objects
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/shakechen/Llama-2-7b-hf.git
cd Llama-2-7b-hf/
git lfs pull
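
After the pull completes, it is worth confirming that key files were actually downloaded rather than left as LFS pointer stubs. A minimal sketch; the expected file list is an assumption based on the usual Hugging Face Llama‑2 layout, so adjust it to the repository's actual contents:

```shell
# check_model_files DIR: print the name of any expected file missing from DIR
# (the file list below is an assumption; extend it as needed)
check_model_files() {
  dir=$1
  for f in config.json tokenizer.model tokenizer_config.json; do
    [ -f "$dir/$f" ] || echo "missing: $f"
  done
}

check_model_files ./Llama-2-7b-hf
```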

Create a shell script trtllm-llama-2-7b.sh that performs the following steps:

#!/bin/sh
set -e

MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend

# Clone TensorRT‑LLM backend (v0.9.0)
if [ ! -d "$TRT_BACKEND_DIR" ]; then
  cd /root
  git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  cd "$TRT_BACKEND_DIR"
  git submodule update --init --recursive
  git lfs install
  git lfs pull
fi

# Convert the HuggingFace checkpoint to TensorRT‑LLM format
if [ ! -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
  python3 "$TRT_BACKEND_DIR"/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir "$MODEL_MOUNT_PATH"/Llama-2-7b-hf \
    --output_dir "$OUTPUT_DIR"/llama-2-7b-ckpt \
    --dtype float16
fi

# Build the TensorRT engine
if [ ! -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
  trtllm-build --checkpoint_dir "$OUTPUT_DIR"/llama-2-7b-ckpt \
               --remove_input_padding enable \
               --gpt_attention_plugin float16 \
               --context_fmha enable \
               --gemm_plugin float16 \
               --output_dir "$OUTPUT_DIR"/llama-2-7b-engine \
               --paged_kv_cache enable \
               --max_batch_size 8
fi

# Prepare Triton model repository
cd "$TRT_BACKEND_DIR"
cp -r all_models/inflight_batcher_llm/ llama_ifb

export HF_LLAMA_MODEL="$MODEL_MOUNT_PATH"/Llama-2-7b-hf
export ENGINE_PATH="$OUTPUT_DIR"/llama-2-7b-engine

python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt \
  tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt \
  tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt \
  triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt \
  triton_max_batch_size:8
# Note: the comma-joined parameter list must be passed as a single argument,
# so it is quoted here rather than split across continuation lines.
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt \
  "triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0"

# Start Triton server
pip install SentencePiece
tritonserver --model-repository="$TRT_BACKEND_DIR"/llama_ifb \
            --http-port=8080 --grpc-port=9000 --metrics-port=8002 \
            --disable-auto-complete-config \
            --backend-config=python,shm-region-prefix-name=prefix0_

Upload Model and Script to OSS, Create PV/PVC

# Create OSS directory (replace YOUR_BUCKET_NAME with your bucket)
ossutil mkdir oss://YOUR_BUCKET_NAME/Llama-2-7b-hf

# Upload model files
ossutil cp -r ./Llama-2-7b-hf oss://YOUR_BUCKET_NAME/Llama-2-7b-hf

# Upload the compilation script
chmod +x trtllm-llama-2-7b.sh
ossutil cp -r ./trtllm-llama-2-7b.sh oss://YOUR_BUCKET_NAME/trtllm-llama-2-7b.sh

Create a Kubernetes Secret for OSS credentials and define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that mount the OSS bucket.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${YOUR_ACCESSKEY_ID}
  akSecret: ${YOUR_ACCESSKEY_SECRET}
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${YOUR_BUCKET_NAME}
      url: ${YOUR_BUCKET_ENDPOINT}
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Create ClusterServingRuntime

Define a custom runtime that uses the Triton‑based TensorRT‑LLM image.

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
    k8s.aliyun.com/eci-auto-imc: 'true'
    k8s.aliyun.com/eci-use-specs: "ecs.gn7i-c8g1.2xlarge,ecs.gn7i-c16g1.4xlarge,ecs.gn7i-c32g1.8xlarge,ecs.gn7i-c48g1.12xlarge"
    k8s.aliyun.com/eci-extra-ephemeral-storage: 100Gi
  containers:
  - name: kserve-container
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    args:
      - tritonserver
      - --model-store=/mnt/models
      - --grpc-port=9000
      - --http-port=8080
      - --allow-grpc=true
      - --allow-http=true
    resources:
      requests:
        cpu: "4"
        memory: 12Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: triton
      version: "2"

Deploy the InferenceService

Bind the runtime, reference the PVC, and request one GPU.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
        - sh
        - -c
        - /mnt/models/trtllm-llama-2-7b.sh

Apply the manifest; KServe will create the necessary pods and expose the model.

Verify Deployment

kubectl get isvc llama-2-7b

The service should show READY and provide a URL.

Invoke the LLM Service

# Get the ASM ingress gateway IP
ASM_GATEWAY_IP=$(kubectl -n istio-system get svc istio-ingressgateway -ojsonpath='{.status.loadBalancer.ingress[0].ip}')

# Send a generation request
curl -H "Host: llama-2-7b.default.example.com" -H "Content-Type: application/json" \
  http://$ASM_GATEWAY_IP:80/v2/models/ensemble/generate \
  -d '{"text_input":"What is machine learning?","max_tokens":20,"bad_words":"","stop_words":"","pad_id":2,"end_id":2}'

Sample response (truncated):

{"text_output":"
Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate", ...}
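
The request body can also be built programmatically instead of hand-written. A small helper using the same field names as the curl example above (the function name is illustrative; `pad_id`/`end_id` of 2 correspond to the Llama‑2 tokenizer's `</s>` token):

```shell
# build_generate_payload PROMPT MAX_TOKENS: emit the JSON body for the
# /v2/models/ensemble/generate endpoint. Note: the prompt is inserted
# without JSON escaping, so keep it free of quotes and backslashes.
build_generate_payload() {
  printf '{"text_input":"%s","max_tokens":%s,"bad_words":"","stop_words":"","pad_id":2,"end_id":2}' "$1" "$2"
}

# Usage (gateway IP obtained as shown above):
#   curl -H "Host: llama-2-7b.default.example.com" -H "Content-Type: application/json" \
#     "http://$ASM_GATEWAY_IP:80/v2/models/ensemble/generate" \
#     -d "$(build_generate_payload "What is machine learning?" 20)"
```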
Tags: serverless, LLM, Kubernetes, AI inference, TensorRT-LLM, KServe
Written by Alibaba Cloud Native