Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide
This article explains how TensorRT‑LLM accelerates large language model inference by applying quantization, in‑flight batching, advanced attention variants, and graph rewriting, and walks through a complete deployment on Alibaba Cloud Container Service (ACK) with environment setup, model compilation, benchmarking, and performance comparison.
Background
Large language models (LLMs) are deep‑learning models pretrained on massive text corpora. During inference they are often limited by GPU memory, so most acceleration frameworks aim to reduce peak memory usage and improve GPU utilization.
TensorRT‑LLM Overview
TensorRT‑LLM is NVIDIA's inference‑optimization framework for LLMs. It provides Python APIs to define models and compiles them into TensorRT engines that incorporate a set of advanced optimizations.
Core Optimizations
Quantization – Lowers model precision to cut memory consumption (a small numerical sketch of weight-only quantization follows the list below). Supported precisions include:
INT8 (W8A8) with SmoothQuant.
INT4/INT8 weight-only quantization with FP16 activations (W4A16, W8A16), including the AWQ and GPTQ variants (W4A16‑AWQ, W4A16‑GPTQ).
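To make the memory saving concrete, the sketch below (plain NumPy, not TensorRT‑LLM code) quantizes an FP16 weight matrix to INT8 with per-output-channel scales and dequantizes it again, which is the basic idea behind weight-only modes such as W8A16; the INT8 copy needs half the bytes of the FP16 original:
import numpy as np
# Toy FP16 weight matrix (out_features x in_features).
w_fp16 = np.random.randn(4096, 4096).astype(np.float16)
# Per-output-channel symmetric INT8 quantization.
scale = np.abs(w_fp16.astype(np.float32)).max(axis=1, keepdims=True) / 127.0
w_int8 = np.clip(np.round(w_fp16.astype(np.float32) / scale), -127, 127).astype(np.int8)
# Weights are stored in INT8 and dequantized back to FP16 just before the matmul.
w_deq = (w_int8.astype(np.float32) * scale).astype(np.float16)
print("fp16 weight bytes:", w_fp16.nbytes)   # 32 MiB
print("int8 weight bytes:", w_int8.nbytes)   # 16 MiB
print("max abs error:", np.abs(w_fp16.astype(np.float32) - w_deq.astype(np.float32)).max())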
In‑Flight (Continuous) Batching – Unlike static batching, which waits for all sequences in a batch to finish, continuous batching inserts new sequences as soon as a slot becomes free, increasing throughput and reducing latency.
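The throughput gain is easy to reproduce with a toy scheduler in plain Python (an illustration only; it has nothing to do with TensorRT‑LLM's actual batch manager). Requests need different numbers of decode steps; static batching holds finished slots until the longest request in the batch completes, while in-flight batching refills a slot immediately:
import heapq

def static_batching(reqs, slots):
    # Fixed batches: each batch runs until its longest request finishes,
    # so short requests leave their slots idle.
    return sum(max(reqs[i:i + slots]) for i in range(0, len(reqs), slots))

def inflight_batching(reqs, slots):
    # Greedy refill: whenever a slot frees up, it immediately takes the
    # next queued request (list scheduling over `slots` identical slots).
    finish_times = [0] * slots
    heapq.heapify(finish_times)
    for steps in reqs:
        t = heapq.heappop(finish_times)
        heapq.heappush(finish_times, t + steps)
    return max(finish_times)

# Remaining decode steps for 8 queued requests, 4 batch slots.
requests = [50, 10, 200, 30, 80, 15, 120, 60]
print("static batching steps:   ", static_batching(requests, slots=4))    # 320
print("in-flight batching steps:", inflight_batching(requests, slots=4))  # 200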
Attention Variants – Standard Multi‑Head Attention (MHA) keeps a separate key/value (KV) cache entry for every attention head, which consumes the most memory. Multi‑Query Attention (MQA) shares a single KV head across all query heads, while Group‑Query Attention (GQA) lets groups of query heads share KV heads, balancing memory use and accuracy. Implementations are available in tensorrt_llm.functional.gpt_attention.
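How much memory MQA and GQA save depends only on the number of KV heads that must be cached. The quick calculation below uses generic, made-up dimensions (32 layers, head size 128, 4K context, batch 8) rather than any particular model's configuration:
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Keys and values (factor 2), stored per layer, per KV head, per token, in FP16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

layers, head_dim, seq_len, batch = 32, 128, 4096, 8
print("MHA, 32 KV heads:", kv_cache_bytes(layers, 32, head_dim, seq_len, batch) / 2**30, "GiB")  # 16.0
print("GQA,  8 KV heads:", kv_cache_bytes(layers,  8, head_dim, seq_len, batch) / 2**30, "GiB")  #  4.0
print("MQA,  1 KV head :", kv_cache_bytes(layers,  1, head_dim, seq_len, batch) / 2**30, "GiB")  #  0.5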
Graph Rewriting – During engine compilation TensorRT‑LLM rewrites the computational graph to fuse operations and eliminate redundancies, further boosting execution efficiency.
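Graph rewriting can be pictured as pattern matching over the computation graph: find a subgraph such as a bias-add followed by an activation and replace it with a single fused node, so the runtime launches one kernel instead of two and skips the intermediate tensor. The toy rewriter below works on a simple list-of-nodes representation purely for illustration; it does not reflect TensorRT‑LLM's internal graph format:
# Each node is (op, inputs). A real engine works on a typed graph IR; this toy
# version just scans for an "add_bias" node immediately followed by "gelu".
graph = [
    ("matmul",   ["x", "w"]),
    ("add_bias", ["matmul_out", "b"]),
    ("gelu",     ["bias_out"]),
]

def fuse_bias_gelu(nodes):
    fused, i = [], 0
    while i < len(nodes):
        if (i + 1 < len(nodes)
                and nodes[i][0] == "add_bias"
                and nodes[i + 1][0] == "gelu"):
            # Replace the two nodes with one fused op: a single kernel launch,
            # no intermediate tensor written back to GPU memory.
            fused.append(("fused_bias_gelu", nodes[i][1]))
            i += 2
        else:
            fused.append(nodes[i])
            i += 1
    return fused

print(fuse_bias_gelu(graph))
# [('matmul', ['x', 'w']), ('fused_bias_gelu', ['matmul_out', 'b'])]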
Practical Deployment on Alibaba Cloud ACK
The following workflow shows how to run TensorRT‑LLM on an Alibaba Cloud Container Service (ACK) notebook using the Cloud‑Native AI Suite.
1. Environment Configuration
Install the Cloud‑Native AI Suite according to the official documentation.
In the ACK console, create a Notebook with the specifications:
CPU: 12 cores
Memory: 40 GB
GPU: 1 × 24 GB VRAM (instance type ecs.gn7i-c16g1.4xlarge)
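Before building anything, it is worth confirming from Python that the notebook actually sees the 24 GB GPU. A minimal check with the pynvml package (installed in the image built below, or via pip3 install pynvml) could look like this:
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older pynvml versions return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {name}, {mem.total / 2**30:.1f} GiB total, {mem.free / 2**30:.1f} GiB free")
pynvml.nvmlShutdown()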
2. Build a TensorRT‑LLM Docker Image
FROM docker.io/nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get upgrade -y && \
apt-get install -y --no-install-recommends \
libgl1 libglib2.0 wget git curl vim \
python3.10 python3-pip python3-dev build-essential \
openmpi-bin libopenmpi-dev jupyter-notebook jupyter
RUN pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
RUN pip3 install --upgrade jinja2==3.0.3 "pynvml>=11.5.0"
RUN rm -rf /var/cache/apt/* && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
rm -rf /root/.cache/pip/ && rm -rf /*.whl
WORKDIR /root
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git --branch v0.7.1
ENTRYPOINT ["sh","-c","jupyter notebook --allow-root --notebook-dir=/root --port=8888 --ip=0.0.0.0 --ServerApp.token=''"]
3. Model Setup (Baichuan2‑7B‑Chat example)
Verify the TensorRT‑LLM installation:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Expected output: 0.7.1
Install example dependencies and fetch the model:
cd /root/TensorRT-LLM/examples/baichuan
pip3 install -r requirements.txt
apt-get update && apt-get install -y git-lfs  # the image is Ubuntu-based, so use apt-get
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/baichuan-inc/Baichuan2-7B-Chat.git
cd Baichuan2-7B-Chat
git lfs pull
Compile the model to TensorRT engines with INT8 weight‑only quantization (≈5 min):
python3 build.py \
--model_version v2_7b \
--model_dir ./Baichuan2-7B-Chat \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/
Run inference with the generated engine (the example prompt asks, in Chinese, which is the world's second-highest mountain):
python3 ../run.py \
--input_text "世界上第二高的山峰是哪座?" \
--max_output_len 50 \
--tokenizer_dir ./Baichuan2-7B-Chat \
--engine_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/
Expected output:
世界上第二高的山峰是喀喇昆仑山脉的乔戈里峰(K2),海拔高度为8611米。
(Translation: "The world's second-highest peak is K2 in the Karakoram range, at an elevation of 8,611 meters.")
4. Performance Testing
Add a configuration for the Baichuan2 model in benchmarks/python/allowed_configs.py (excerpt):
"baichuan2_7b_chat": ModelConfig(
name="baichuan2_7b_chat",
family="baichuan_7b",
benchmark_type="gpt",
build_config=BuildConfig(
num_layers=32,
num_heads=32,
hidden_size=4096,
vocab_size=125696,
hidden_act='silu',
n_positions=4096,
inter_size=11008,
max_batch_size=128,
max_input_len=512,
max_output_len=200,
),
),
Run the built‑in benchmark (single‑GPU, batch size 1):
python3 benchmark.py \
-m baichuan2_7b_chat \
--mode plugin \
--engine_dir /root/TensorRT-LLM/examples/baichuan/tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu \
--batch_size 1 \
--input_output_len "32,50;128,50"
Sample results: roughly 60 tokens/sec, with about 820 ms latency for the 32‑token input case.
Compared with a baseline PyTorch inference script (a sketch of such a measurement is shown after this paragraph), the INT8 TensorRT‑LLM engine reduces peak GPU memory by 43.8 % and latency by 61.1 % relative to the original Baichuan2‑7B‑Chat model.
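The baseline script itself is not included in the original write-up; the following is one plausible way to measure latency and peak memory for the unquantized model with PyTorch and Hugging Face Transformers (the model path and generation settings here are assumptions, and Baichuan2 requires trust_remote_code=True):
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./Baichuan2-7B-Chat"  # path cloned earlier (assumed)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

prompt = "世界上第二高的山峰是哪座?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
torch.cuda.synchronize()

print("latency (s):       ", time.time() - start)
print("peak GPU mem (GiB):", torch.cuda.max_memory_allocated() / 2**30)
print(tokenizer.decode(output[0], skip_special_tokens=True))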
Key Takeaways
TensorRT‑LLM combines quantization, continuous batching, and memory‑efficient attention mechanisms to dramatically lower GPU memory consumption and accelerate LLM inference. The workflow described above demonstrates a reproducible end‑to‑end pipeline on Alibaba Cloud ACK.
References
TensorRT‑LLM architecture: https://nvidia.github.io/TensorRT-LLM/architecture.html
Continuous batching for LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-inference
SmoothQuant paper: https://arxiv.org/abs/2211.10438
AWQ paper: https://arxiv.org/abs/2306.00978
GPTQ paper: https://arxiv.org/abs/2210.17323
Multi‑Query Attention paper (Fast Transformer Decoding): https://arxiv.org/abs/1911.02150
Llama 2 paper (a large‑scale application of Grouped‑Query Attention): https://arxiv.org/abs/2307.09288
TensorRT‑LLM GitHub repository: https://github.com/NVIDIA/TensorRT-LLM