How to Accelerate Stable Diffusion with TensorRT on Alibaba Cloud ACK
This guide explains how to set up Alibaba Cloud's ACK environment, install the Cloud Native AI Suite, configure TensorRT, and run Stable Diffusion with dramatically reduced latency and memory usage, including detailed commands, performance metrics, and reproducible code snippets.
Overview
Stable Diffusion generates images from text in three stages: the CLIP text encoder embeds the prompt, a UNet driven by a scheduler iteratively denoises a latent, and the autoencoder's decoder turns the final latent into an image. The diffusion process is computationally heavy, typically taking ~4 seconds per image on a modern GPU. TensorRT, NVIDIA's high-performance inference framework, can accelerate the entire pipeline (encoder, UNet, decoder) through mixed precision, kernel auto-tuning, dynamic tensor memory, multi-stream execution, and time fusion for recurrent steps.
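To get a feel for why the UNet stage dominates, it helps to count how often it runs per image. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes 50 denoising steps and classifier-free guidance (one conditional and one unconditional prediction per step), neither of which is stated in this article. The function names are illustrative.

```python
def unet_evaluations(num_steps: int, use_cfg: bool = True) -> int:
    """Number of UNet noise predictions needed to produce one image.

    With classifier-free guidance each step needs two predictions
    (prompt-conditioned and unconditional), usually batched together.
    """
    per_step = 2 if use_cfg else 1
    return num_steps * per_step


def per_step_budget_ms(total_seconds: float, num_steps: int) -> float:
    """Upper bound on the time available per denoising step, in milliseconds."""
    return total_seconds * 1000.0 / num_steps


if __name__ == "__main__":
    steps = 50  # assumed step count, not taken from the article
    print("UNet predictions per image:", unet_evaluations(steps))
    print("ms per step at ~4 s/image:", per_step_budget_ms(4.0, steps))
```

The text encoder and the decoder each run once per image, so any per-step saving in the UNet is multiplied by the step count.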
Prerequisites
A Linux notebook with a GPU that has at least 16 GB VRAM (e.g., NVIDIA A10) and Python 3.9+ is required. The steps below assume a Jupyter‑style notebook environment.
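A quick way to check these prerequisites from inside the notebook is sketched below. It assumes `nvidia-smi` is on the PATH and uses its CSV query mode; the 15000 MiB threshold is an assumption chosen because cards sold as 16 GB usually report slightly less than 16384 MiB. The helper names are illustrative.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 9)
MIN_VRAM_MIB = 15000  # assumed threshold for a nominal 16 GB card


def parse_gpu_memory(csv_text: str) -> list:
    """Parse 'name, memory.total' rows from nvidia-smi (values in MiB, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus() -> list:
    """Return [(name, total_mib), ...], or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


def meets_requirements(gpus, python_version=sys.version_info[:2]) -> bool:
    """True when Python is new enough and at least one GPU has enough memory."""
    has_gpu = any(mem >= MIN_VRAM_MIB for _, mem in gpus)
    return tuple(python_version) >= MIN_PYTHON and has_gpu


if __name__ == "__main__":
    found = query_gpus()
    print("GPUs:", found or "none detected")
    print("Prerequisites met:", meets_requirements(found))
```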
Install Required Packages
!pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
!pip install --upgrade "torch<2.0.0"
!pip install --upgrade "tensorrt>=8.6"
!pip install --upgrade "accelerate" "diffusers==0.21.4" "transformers"
!pip install --extra-index-url https://pypi.ngc.nvidia.com --upgrade "onnx-graphsurgeon" "onnxruntime" "polygraphy"
!pip install polygraphy==0.47.1 -i https://pypi.ngc.nvidia.com
Prepare the Stable Diffusion Model
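Before loading the model, it can be worth confirming which versions the install commands actually resolved, since several of them use `--upgrade` without a pin. The following is a small sketch using only the standard library; the package list mirrors the install commands in this guide, and the helper name is illustrative.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands of this guide.
PACKAGES = [
    "torch", "tensorrt", "accelerate", "diffusers", "transformers",
    "onnx-graphsurgeon", "onnxruntime", "polygraphy",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:20s} {version or 'NOT INSTALLED'}")
```

Recording this output alongside any timing results makes it easier to reproduce them later.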
import diffusers, torch, tensorrt
from diffusers.pipelines.stable_diffusion import StableDiffusionPipeline
from diffusers import DDIMScheduler
# Model identifier on HuggingFace Hub (replace with a local path if needed)
model_path = "runwayml/stable-diffusion-v1-5"
# Load the scheduler used for inference
scheduler = DDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
Build the TensorRT Engine
# Create a pipeline that uses the TensorRT custom implementation
pipe_trt = StableDiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="stable_diffusion_tensorrt_txt2img",
    revision="fp16",
    torch_dtype=torch.float16,
    scheduler=scheduler,
)
# Cache the compiled TensorRT engines (first build may take ~35 min on A10)
pipe_trt.set_cached_folder(model_path, revision="fp16")
# Move pipeline to GPU
pipe_trt = pipe_trt.to("cuda")
Run Inference and Measure Latency
prompt = "A beautiful ship is floating in the clouds, unreal engine, cozy indoor lighting, artstation, detailed, digital painting, cinematic"
neg_prompt = "ugly"
import time
# perf_counter is a monotonic, high-resolution clock suited to timing short intervals
start_time = time.perf_counter()
image = pipe_trt(prompt, negative_prompt=neg_prompt).images[0]
end_time = time.perf_counter()
print(f"time: {end_time - start_time:.2f}s")
# display(image)  # In a notebook this renders the result
On an NVIDIA A10 GPU a single image is generated in approximately 2.31 seconds, compared with ~4 seconds for the unoptimized pipeline.
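A single timed call can be noisy, and the first call often includes one-off setup cost. The helper below is a minimal sketch of a more stable measurement: it runs a few warm-up calls, then averages several timed runs. The `benchmark` function and its parameters are illustrative, not part of diffusers or TensorRT; the optional `sync` hook is intended for something like `torch.cuda.synchronize`, so that queued GPU work is finished before the clock is read.

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=5, sync=None):
    """Time fn() over several runs after a warm-up phase.

    sync, if given, is called before starting and before stopping the
    clock so that asynchronous work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()  # drain any pending work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for this run's work to finish
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; with the pipeline from this guide one might pass
    # lambda: pipe_trt(prompt, negative_prompt=neg_prompt)
    # together with sync=torch.cuda.synchronize.
    stats = benchmark(lambda: sum(range(100_000)), warmup=2, runs=5)
    print(f"mean {stats['mean'] * 1000:.2f} ms over {stats['runs']} runs")
```

Reporting the minimum and maximum alongside the mean shows how much run-to-run variation sits behind a figure such as 2.31 seconds.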
Performance Benchmark
The benchmark uses the lambda‑diffusers repository (https://github.com/LambdaLabsML/lambda-diffusers). A single prompt is evaluated with batch size = 50, repeated 100 times on an A10 instance (ecs.gn7i-c8g1.2xlarge). Results show:
Average inference time reduced by 44.7% when TensorRT and xformers are enabled.
GPU memory consumption reduced by 37.6%.
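To relate these percentages to the single-image timings quoted earlier, the arithmetic can be sketched as follows. The function names are illustrative; the inputs are the figures reported in this article (~4 s and 2.31 s per image, 44.7% and 37.6% reductions). Note that the single-image run and the benchmark are different setups, so the two reductions are not expected to match exactly.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrank, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


def speedup_from_reduction(reduction: float) -> float:
    """Throughput multiplier implied by a fractional time reduction."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Single-image timings from the inference section: ~4 s -> 2.31 s.
    single = relative_reduction(4.0, 2.31)
    print(f"single-image reduction: {single:.1%}")            # about 42%

    # Benchmark figure: 44.7% less time per inference.
    print(f"implied speedup: {speedup_from_reduction(0.447):.2f}x")  # about 1.81x

    # Benchmark figure: 37.6% less GPU memory.
    print(f"memory still used: {1.0 - 0.376:.1%} of baseline")
```

A 44.7% time reduction therefore corresponds to roughly 1.8 times the throughput, not 1.447 times.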
References
TensorRT repository: https://github.com/NVIDIA/TensorRT
Stable Diffusion WebUI: https://github.com/AUTOMATIC1111/stable-diffusion-webui
NVIDIA AI Acceleration Webinar: https://www.nvidia.cn/webinars/sessions/?session_id=230919-29256
lambda‑diffusers repository: https://github.com/LambdaLabsML/lambda-diffusers
Alibaba Cloud Native