Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide
This article explains how TensorRT‑LLM accelerates large language model inference by applying quantization, in‑flight batching, advanced attention variants, and graph rewriting, and walks through a complete deployment on Alibaba Cloud Container Service (ACK) with environment setup, model compilation, benchmarking, and performance comparison.
Background
Large language models (LLMs) are deep‑learning models pretrained on massive text corpora. During inference they are often limited by GPU memory, so most acceleration frameworks aim to reduce peak memory usage and improve GPU utilization.
TensorRT‑LLM Overview
TensorRT‑LLM is NVIDIA's inference‑optimization framework for LLMs. It provides Python APIs to define models and compiles them into TensorRT engines that incorporate a set of advanced optimizations.
Core Optimizations
Quantization – Lowers model precision to cut memory consumption (a small numerical sketch of weight-only quantization follows the list below). Supported precisions include:
INT8 (W8A8) with SmoothQuant.
INT4/INT8 weight-only quantization with FP16 activations (W4A16, W8A16), including the AWQ and GPTQ variants (W4A16‑AWQ, W4A16‑GPTQ).
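To make the memory saving concrete, the sketch below (plain NumPy, not TensorRT‑LLM code) quantizes an FP16 weight matrix to INT8 with per-output-channel scales and dequantizes it again, which is the basic idea behind weight-only modes such as W8A16; the INT8 copy needs half the bytes of the FP16 original:
import numpy as np
# Toy FP16 weight matrix (out_features x in_features).
w_fp16 = np.random.randn(4096, 4096).astype(np.float16)
# Per-output-channel symmetric INT8 quantization.
scale = np.abs(w_fp16.astype(np.float32)).max(axis=1, keepdims=True) / 127.0
w_int8 = np.clip(np.round(w_fp16.astype(np.float32) / scale), -127, 127).astype(np.int8)
# Weights are stored in INT8 and dequantized back to FP16 just before the matmul.
w_deq = (w_int8.astype(np.float32) * scale).astype(np.float16)
print("fp16 weight bytes:", w_fp16.nbytes)   # 32 MiB
print("int8 weight bytes:", w_int8.nbytes)   # 16 MiB
print("max abs error:", np.abs(w_fp16.astype(np.float32) - w_deq.astype(np.float32)).max())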
In‑Flight (Continuous) Batching – Unlike static batching, which waits for all sequences in a batch to finish, continuous batching inserts new sequences as soon as a slot becomes free, increasing throughput and reducing latency.
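The throughput gain is easy to reproduce with a toy scheduler in plain Python (an illustration only; it has nothing to do with TensorRT‑LLM's actual batch manager). Requests need different numbers of decode steps; static batching holds finished slots until the longest request in the batch completes, while in-flight batching refills a slot immediately:
import heapq

def static_batching(reqs, slots):
    # Fixed batches: each batch runs until its longest request finishes,
    # so short requests leave their slots idle.
    return sum(max(reqs[i:i + slots]) for i in range(0, len(reqs), slots))

def inflight_batching(reqs, slots):
    # Greedy refill: whenever a slot frees up, it immediately takes the
    # next queued request (list scheduling over `slots` identical slots).
    finish_times = [0] * slots
    heapq.heapify(finish_times)
    for steps in reqs:
        t = heapq.heappop(finish_times)
        heapq.heappush(finish_times, t + steps)
    return max(finish_times)

# Remaining decode steps for 8 queued requests, 4 batch slots.
requests = [50, 10, 200, 30, 80, 15, 120, 60]
print("static batching steps:   ", static_batching(requests, slots=4))    # 320
print("in-flight batching steps:", inflight_batching(requests, slots=4))  # 200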
Attention Variants – Standard Multi‑Head Attention (MHA) keeps a separate key/value (KV) cache entry for every attention head, which consumes the most memory. Multi‑Query Attention (MQA) shares a single KV head across all query heads, while Group‑Query Attention (GQA) lets groups of query heads share KV heads, balancing memory use and accuracy. Implementations are available in tensorrt_llm.functional.gpt_attention.
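How much memory MQA and GQA save depends only on the number of KV heads that must be cached. The quick calculation below uses generic, made-up dimensions (32 layers, head size 128, 4K context, batch 8) rather than any particular model's configuration:
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Keys and values (factor 2), stored per layer, per KV head, per token, in FP16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

layers, head_dim, seq_len, batch = 32, 128, 4096, 8
print("MHA, 32 KV heads:", kv_cache_bytes(layers, 32, head_dim, seq_len, batch) / 2**30, "GiB")  # 16.0
print("GQA,  8 KV heads:", kv_cache_bytes(layers,  8, head_dim, seq_len, batch) / 2**30, "GiB")  #  4.0
print("MQA,  1 KV head :", kv_cache_bytes(layers,  1, head_dim, seq_len, batch) / 2**30, "GiB")  #  0.5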
Graph Rewriting – During engine compilation TensorRT‑LLM rewrites the computational graph to fuse operations and eliminate redundancies, further boosting execution efficiency.
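Graph rewriting can be pictured as pattern matching over the computation graph: find a subgraph such as a bias-add followed by an activation and replace it with a single fused node, so the runtime launches one kernel instead of two and skips the intermediate tensor. The toy rewriter below works on a simple list-of-nodes representation purely for illustration; it does not reflect TensorRT‑LLM's internal graph format:
# Each node is (op, inputs). A real engine works on a typed graph IR; this toy
# version just scans for an "add_bias" node immediately followed by "gelu".
graph = [
    ("matmul",   ["x", "w"]),
    ("add_bias", ["matmul_out", "b"]),
    ("gelu",     ["bias_out"]),
]

def fuse_bias_gelu(nodes):
    fused, i = [], 0
    while i < len(nodes):
        if (i + 1 < len(nodes)
                and nodes[i][0] == "add_bias"
                and nodes[i + 1][0] == "gelu"):
            # Replace the two nodes with one fused op: a single kernel launch,
            # no intermediate tensor written back to GPU memory.
            fused.append(("fused_bias_gelu", nodes[i][1]))
            i += 2
        else:
            fused.append(nodes[i])
            i += 1
    return fused

print(fuse_bias_gelu(graph))
# [('matmul', ['x', 'w']), ('fused_bias_gelu', ['matmul_out', 'b'])]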
Practical Deployment on Alibaba Cloud ACK
The following workflow shows how to run TensorRT‑LLM on an Alibaba Cloud Container Service (ACK) notebook using the Cloud‑Native AI Suite.
1. Environment Configuration
Install the Cloud‑Native AI Suite according to the official documentation.
In the ACK console, create a Notebook with the specifications:
CPU: 12 cores
Memory: 40 GB
GPU: 1 × 24 GB VRAM (instance type ecs.gn7i-c16g1.4xlarge)
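Before building anything, it is worth confirming from Python that the notebook actually sees the 24 GB GPU. A minimal check with the pynvml package (installed in the image built below, or via pip3 install pynvml) could look like this:
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older pynvml versions return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {name}, {mem.total / 2**30:.1f} GiB total, {mem.free / 2**30:.1f} GiB free")
pynvml.nvmlShutdown()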
2. Build a TensorRT‑LLM Docker Image
FROM docker.io/nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get upgrade -y && \
apt-get install -y --no-install-recommends \
libgl1 libglib2.0 wget git curl vim \
python3.10 python3-pip python3-dev build-essential \
openmpi-bin libopenmpi-dev jupyter-notebook jupyter
RUN pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
RUN pip3 install --upgrade jinja2==3.0.3 "pynvml>=11.5.0"
RUN rm -rf /var/cache/apt/* && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
rm -rf /root/.cache/pip/ && rm -rf /*.whl
WORKDIR /root
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git --branch v0.7.1
ENTRYPOINT ["sh","-c","jupyter notebook --allow-root --notebook-dir=/root --port=8888 --ip=0.0.0.0 --ServerApp.token=''"]
3. Model Setup (Baichuan2‑7B‑Chat example)
Verify the TensorRT‑LLM installation:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Expected output: 0.7.1
Install example dependencies and fetch the model:
cd /root/TensorRT-LLM/examples/baichuan
pip3 install -r requirements.txt
apt-get update && apt-get install -y git-lfs  # the image is Ubuntu-based, so use apt-get
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/baichuan-inc/Baichuan2-7B-Chat.git
cd Baichuan2-7B-Chat
git lfs pull
Compile the model to TensorRT engines with INT8 weight‑only quantization (≈5 min):
python3 build.py \
--model_version v2_7b \
--model_dir ./Baichuan2-7B-Chat \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/
Run inference with the generated engine (the example prompt asks, in Chinese, which is the world's second-highest mountain):
python3 ../run.py \
--input_text "世界上第二高的山峰是哪座?" \
--max_output_len 50 \
--tokenizer_dir ./Baichuan2-7B-Chat \
--engine_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/
Expected output:
世界上第二高的山峰是喀喇昆仑山脉的乔戈里峰(K2),海拔高度为8611米。
(Translation: "The world's second-highest peak is K2 in the Karakoram range, at an elevation of 8,611 meters.")
4. Performance Testing
Add a configuration for the Baichuan2 model in benchmarks/python/allowed_configs.py (excerpt):
"baichuan2_7b_chat": ModelConfig(
name="baichuan2_7b_chat",
family="baichuan_7b",
benchmark_type="gpt",
build_config=BuildConfig(
num_layers=32,
num_heads=32,
hidden_size=4096,
vocab_size=125696,
hidden_act='silu',
n_positions=4096,
inter_size=11008,
max_batch_size=128,
max_input_len=512,
max_output_len=200,
),
),
Run the built‑in benchmark (single‑GPU, batch size 1):
python3 benchmark.py \
-m baichuan2_7b_chat \
--mode plugin \
--engine_dir /root/TensorRT-LLM/examples/baichuan/tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu \
--batch_size 1 \
--input_output_len "32,50;128,50"
Sample results: roughly 60 tokens/sec, with about 820 ms latency for the 32‑token input case.
Compared with a baseline PyTorch inference script (a sketch of such a measurement is shown after this paragraph), the INT8 TensorRT‑LLM engine reduces peak GPU memory by 43.8 % and latency by 61.1 % relative to the original Baichuan2‑7B‑Chat model.
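The baseline script itself is not included in the original write-up; the following is one plausible way to measure latency and peak memory for the unquantized model with PyTorch and Hugging Face Transformers (the model path and generation settings here are assumptions, and Baichuan2 requires trust_remote_code=True):
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./Baichuan2-7B-Chat"  # path cloned earlier (assumed)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

prompt = "世界上第二高的山峰是哪座?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
torch.cuda.synchronize()

print("latency (s):       ", time.time() - start)
print("peak GPU mem (GiB):", torch.cuda.max_memory_allocated() / 2**30)
print(tokenizer.decode(output[0], skip_special_tokens=True))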
Key Takeaways
TensorRT‑LLM combines quantization, continuous batching, and memory‑efficient attention mechanisms to dramatically lower GPU memory consumption and accelerate LLM inference. The workflow described above demonstrates a reproducible end‑to‑end pipeline on Alibaba Cloud ACK.
References
TensorRT‑LLM architecture: https://nvidia.github.io/TensorRT-LLM/architecture.html
Continuous batching for LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-inference
SmoothQuant paper: https://arxiv.org/abs/2211.10438
AWQ paper: https://arxiv.org/abs/2306.00978
GPTQ paper: https://arxiv.org/abs/2210.17323
Multi‑Query Attention paper (Fast Transformer Decoding): https://arxiv.org/abs/1911.02150
Llama 2 paper (a large‑scale application of Grouped‑Query Attention): https://arxiv.org/abs/2307.09288
TensorRT‑LLM GitHub repository: https://github.com/NVIDIA/TensorRT-LLM