How to Accelerate Stable Diffusion with TensorRT on Alibaba Cloud ACK

This guide explains how to set up Alibaba Cloud's ACK environment, install the Cloud Native AI Suite, configure TensorRT, and run Stable Diffusion with dramatically reduced latency and memory usage, including detailed commands, performance metrics, and reproducible code snippets.

Alibaba Cloud Native

Dec 30, 2023

Overview

Stable Diffusion generates images from text by encoding the prompt with CLIP‑Text, applying a UNet + Scheduler in latent space, and decoding with an auto‑encoder. The diffusion process is computationally heavy, typically taking ~4 seconds per image on a modern GPU. TensorRT, NVIDIA’s high‑performance inference framework, can accelerate the entire pipeline (encoder, UNet, decoder) through mixed‑precision, kernel auto‑selection, dynamic tensor memory, multi‑stream execution, and time‑fusion for recurrent steps.
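
To make the three stages concrete, the sketch below traces tensor shapes through the pipeline without running any model. The constants (77 tokens, 768-dimensional text embeddings, 4 latent channels, 8x spatial downsampling) are the commonly documented Stable Diffusion v1.5 defaults and are assumptions here, not values stated in this guide; the function names are hypothetical.

```python
def stable_diffusion_shapes(height=512, width=512, batch_size=1):
    """Return the tensor shape at each stage of a text-to-image request."""
    # Assumed Stable Diffusion v1.5 defaults (not stated in this guide).
    max_tokens, embed_dim = 77, 768       # CLIP text encoder output
    latent_channels, vae_scale = 4, 8     # autoencoder latent space

    return {
        "text_embeddings": (batch_size, max_tokens, embed_dim),
        "latents": (batch_size, latent_channels, height // vae_scale, width // vae_scale),
        "image": (batch_size, 3, height, width),
    }


def element_count(shape):
    """Multiply all dimensions of a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total


shapes = stable_diffusion_shapes()
ratio = element_count(shapes["image"]) / element_count(shapes["latents"])
print(shapes)
print(f"The UNet works on {ratio:.0f}x fewer elements than the decoded image")
```

Under these assumptions the UNet operates on a 4 x 64 x 64 latent rather than a 3 x 512 x 512 image, which is why the repeated denoising steps, not the one-off encode and decode, are the main target for acceleration.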

Prerequisites

A Linux notebook with a GPU that has at least 16 GB VRAM (e.g., NVIDIA A10) and Python 3.9+ is required. The steps below assume a Jupyter‑style notebook environment.
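
A quick way to confirm the environment before installing anything is sketched below. The helper names and the 10% allowance are assumptions: drivers typically report slightly less than a card's nominal capacity, so a strict 16 GiB comparison could reject a card sold as 16 GB. The GPU query uses PyTorch if it is already present and returns None otherwise.

```python
import sys

MIN_PYTHON = (3, 9)
MIN_VRAM_GIB = 16
VRAM_ALLOWANCE = 0.9  # assumed tolerance for reported vs. nominal capacity


def meets_requirements(python_version, vram_gib):
    """Check a (major, minor) Python version and a VRAM size in GiB."""
    python_ok = tuple(python_version[:2]) >= MIN_PYTHON
    vram_ok = vram_gib >= MIN_VRAM_GIB * VRAM_ALLOWANCE
    return python_ok and vram_ok


def detect_vram_gib():
    """Return total memory of GPU 0 in GiB, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


vram = detect_vram_gib()
if vram is None:
    print("No CUDA GPU visible to PyTorch (or PyTorch is not installed yet)")
else:
    print("Requirements met:", meets_requirements(sys.version_info, vram))
```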

Install Required Packages

# Optional: use the Tsinghua PyPI mirror
!pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
!pip install --upgrade "torch<2.0.0"
!pip install --upgrade "tensorrt>=8.6"
!pip install --upgrade "accelerate" "diffusers==0.21.4" "transformers"
!pip install --extra-index-url https://pypi.ngc.nvidia.com --upgrade "onnx-graphsurgeon" "onnxruntime"
# Pin polygraphy to the version used in this guide
!pip install polygraphy==0.47.1 -i https://pypi.ngc.nvidia.com
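
After installation, it can help to confirm which versions were actually resolved, since the pins above interact. The following is a minimal sketch using the standard library's importlib.metadata; the helper name is hypothetical and the package list simply mirrors the install commands.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["torch", "tensorrt", "diffusers", "transformers", "polygraphy"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```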

Prepare the Stable Diffusion Model

import diffusers, torch, tensorrt
from diffusers.pipelines.stable_diffusion import StableDiffusionPipeline
from diffusers import DDIMScheduler

# Model identifier on HuggingFace Hub (replace with a local path if needed)
model_path = "runwayml/stable-diffusion-v1-5"
# Load the scheduler used for inference
scheduler = DDIMScheduler.from_pretrained(model_path, subfolder="scheduler")

Build the TensorRT Engine

# Create a pipeline that uses the TensorRT custom implementation
pipe_trt = StableDiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="stable_diffusion_tensorrt_txt2img",
    revision="fp16",
    torch_dtype=torch.float16,
    scheduler=scheduler,
)
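# Reuse the cached model folder to store the exported ONNX models and TensorRT engines
pipe_trt.set_cached_folder(model_path, revision="fp16")
# Move the pipeline to the GPU; the TensorRT engines are built here on first run
# (the initial build may take ~35 min on an A10)
pipe_trt = pipe_trt.to("cuda")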

Run Inference and Measure Latency

import time

prompt = "A beautiful ship is floating in the clouds, unreal engine, cozy indoor lighting, artstation, detailed, digital painting, cinematic"
neg_prompt = "ugly"

# Time one end-to-end generation (text encoding, denoising, decoding)
start_time = time.perf_counter()
image = pipe_trt(prompt, negative_prompt=neg_prompt).images[0]
end_time = time.perf_counter()
print(f"time: {end_time - start_time:.2f}s")
# display(image)  # In a notebook this renders the result

On an NVIDIA A10 GPU, a single image is generated in approximately 2.31 seconds, compared with ~4 seconds for the unoptimized pipeline.
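
A single timed call can be skewed by one-time initialization, so a steadier figure usually comes from discarding a warm-up run and averaging several more. The helper below is a minimal sketch of that approach, not the measurement method used for the numbers in this guide; the function name and its parameters are hypothetical.

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=5, sync=None):
    """Time fn() over several runs after discarding warm-up calls.

    sync is an optional zero-argument callable invoked around each timed call
    (for example torch.cuda.synchronize) so pending GPU work is included.
    """
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)

    return {
        "runs": runs,
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }
```

In the notebook this could be called as benchmark(lambda: pipe_trt(prompt, negative_prompt=neg_prompt), sync=torch.cuda.synchronize), assuming pipe_trt, prompt, and neg_prompt are defined as in the steps above.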

Performance Benchmark

The benchmark uses the lambda‑diffusers repository (https://github.com/LambdaLabsML/lambda-diffusers). A single prompt is evaluated with batch size = 50, repeated 100 times on an A10 instance (ecs.gn7i-c8g1.2xlarge). Results show:

Average inference time reduced by 44.7% when TensorRT and xformers are enabled.

GPU memory consumption reduced by 37.6%.
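
Percentages like these are relative reductions against the unoptimized baseline. As an illustration of the arithmetic only, the sketch below applies the formula to the single-image figures quoted earlier in this guide (~4 s baseline, 2.31 s with TensorRT); the baseline is approximate, and that single run is a different setup from the benchmark, so the result is not expected to match 44.7% exactly. The function names are hypothetical.

```python
def percent_reduction(baseline, optimized):
    """Relative reduction of optimized versus baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


def speedup(baseline, optimized):
    """How many times faster the optimized run is than the baseline."""
    if optimized <= 0:
        raise ValueError("optimized must be positive")
    return baseline / optimized


# Single-image latencies quoted in this guide (the baseline is approximate).
baseline_s, tensorrt_s = 4.0, 2.31
print(f"latency reduction: {percent_reduction(baseline_s, tensorrt_s):.1f}%")
print(f"speedup: {speedup(baseline_s, tensorrt_s):.2f}x")
```

The same percent_reduction function applies to memory figures: pass the peak memory of the baseline run and of the optimized run in the same unit.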

References

TensorRT repository: https://github.com/NVIDIA/TensorRT

Stable Diffusion WebUI: https://github.com/AUTOMATIC1111/stable-diffusion-webui

NVIDIA AI Acceleration Webinar: https://www.nvidia.cn/webinars/sessions/?session_id=230919-29256

lambda‑diffusers repository: https://github.com/LambdaLabsML/lambda-diffusers
