How PAI‑Blade Supercharges Stable Diffusion Inference on GPUs
This article explains how PAI‑Blade, built on the BladeDISC compiler and BlaDNN library, dramatically reduces latency and memory usage for Stable Diffusion inference, provides step‑by‑step usage examples with code, shows performance gains on A100 and A10 GPUs, and outlines future optimization directions.
Background
AIGC is a rapidly growing area of AI computing, and Stable Diffusion, the most popular open‑source model in this space, has attracted wide attention. As its application scenarios expand, inference latency and computational cost have become critical challenges.
Introduction
PAI‑Blade is a universal inference‑optimization tool released by PAI. It jointly optimizes models through system‑level techniques to achieve optimal inference performance. PAI‑Blade relies on the fully dynamic‑size AI compiler BladeDISC and the high‑performance computing library BlaDNN, which is based on deep‑learning‑driven automatic scheduling. It provides automatic high‑performance inference optimization for many models, including Stable Diffusion, large language models (LLM), large‑scale sparse recommendation models (CTR), and speech recognition models (ASR).
BladeDISC
BladeDISC is an AI compiler that supports fully dynamic shapes. Its front‑end accepts PyTorch and TensorFlow models; for PyTorch it supports both TorchScript and TorchDynamo input modes. The back‑end uses the AStitch large‑scale operator‑fusion technique and efficient code generation to improve execution efficiency of memory‑intensive operators.
BladeDISC is open‑source on GitHub: https://github.com/alibaba/BladeDISC
BlaDNN
BlaDNN is a high‑performance computing library based on deep‑learning automatic scheduling. As an upgraded version of Ansor, it generates kernels that outperform Ansor's and can rely entirely on DNN‑based automatic scheduling, with no manual tuning. In dynamic‑shape scenarios, compute‑intensive GPU operators reach on average 99.39% of the best tuned performance. The scheduling model's own inference latency is as low as 2 µs on a single CPU core, so it introduces no jitter for the model running on the GPU.
Advantages of Using PAI‑Blade for Stable Diffusion
High performance: Blade reduces the end‑to‑end latency of pipelines such as Text2Img and Img2Img by 2.42× to 3.05× and cuts memory usage by up to 5.27×, surpassing SOTA solutions such as TensorRT‑8.5.
Full dynamic‑shape support: a single optimization works for any input shape or batch size.
Ease of use and extensibility: only a few lines of code are needed to enable Blade optimization across multiple pipelines, and it also supports LoRA‑based inference.
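To make the multipliers above concrete, the following sketch converts an N× improvement into the fraction of the baseline that remains. The function name is hypothetical, and the only inputs are the factors quoted in this article; no baseline latency or memory figures are assumed.

```python
def remaining_fraction(factor: float) -> float:
    """Fraction of the baseline left after an N-times improvement."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / factor


# Factors quoted in the article: latency 2.42x to 3.05x, memory up to 5.27x.
quoted = [("latency (low)", 2.42), ("latency (high)", 3.05), ("memory (max)", 5.27)]
for label, factor in quoted:
    left = remaining_fraction(factor)
    print(f"{label}: {factor}x -> {left:.1%} of baseline, {1 - left:.1%} saved")
```

Read this way, a 2.42× latency improvement leaves roughly 41% of the original latency, and a 5.27× memory improvement leaves roughly 19% of the original footprint.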
Usage Example
The following example uses the popular runwayml/stable-diffusion-v1-5 Text2Img pipeline to demonstrate PAI‑Blade optimization.
Environment Installation
The complete script and environment are packaged in the Docker image registry.cn-beijing.aliyuncs.com/blade_demo/blade_diffusion. Run the inference example with:
python /blade/blade_diffusion.py
Model Optimization Steps
1. Load the pretrained model:
from diffusers import StableDiffusionPipeline
import torch
device = torch.device("cuda:0")
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
).to(device)
2. Optimize with PAI‑Blade (supports any shape after optimization):
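The optimization call takes example inputs (`encoder_inputs`, `unet_inputs`, `decoder_inputs`) that this article does not define. The sketch below is one plausible way to build them for the v1-5 Text2Img pipeline at 512×512, assuming the commonly documented Stable Diffusion v1 dimensions (77 tokens, 768-wide text embeddings, 4 latent channels, 8× VAE downsampling) and a batch of 2 for the UNet to cover classifier-free guidance. The helper names are hypothetical, and the shapes used by PAI-Blade's own script may differ.

```python
from typing import Dict, Tuple

# Assumed Stable Diffusion v1 constants (not stated in the article).
MAX_TOKENS = 77        # CLIP text encoder sequence length
TEXT_HIDDEN = 768      # CLIP text embedding width
LATENT_CHANNELS = 4    # VAE latent channels
VAE_SCALE = 8          # image-to-latent downsampling factor
VOCAB_SIZE = 49408     # CLIP tokenizer vocabulary size


def example_input_shapes(height: int = 512, width: int = 512,
                         batch: int = 1) -> Dict[str, Tuple[Tuple[int, ...], ...]]:
    """Return the tensor shapes for each component's example inputs."""
    if height % VAE_SCALE or width % VAE_SCALE:
        raise ValueError("height and width must be multiples of 8")
    latent = (LATENT_CHANNELS, height // VAE_SCALE, width // VAE_SCALE)
    guided = 2 * batch  # unconditional + conditional halves of the batch
    return {
        "encoder": ((batch, MAX_TOKENS),),
        "unet": ((guided, *latent), (), (guided, MAX_TOKENS, TEXT_HIDDEN)),
        "decoder": ((batch, *latent),),
    }


def build_example_inputs(device, height: int = 512, width: int = 512):
    """Materialize the shapes as tensors; needs PyTorch and a CUDA device."""
    import torch  # imported lazily so the shape helper works without PyTorch

    shapes = example_input_shapes(height, width)
    encoder_inputs = (torch.randint(0, VOCAB_SIZE, shapes["encoder"][0], device=device),)
    latents, _, hidden = shapes["unet"]
    unet_inputs = (
        torch.randn(latents, device=device).half(),
        torch.tensor(999.0, device=device).half(),  # scalar timestep
        torch.randn(hidden, device=device).half(),
    )
    decoder_inputs = (torch.randn(shapes["decoder"][0], device=device).half(),)
    return encoder_inputs, unet_inputs, decoder_inputs


print(example_input_shapes())
```

Calling `build_example_inputs(device)` would then supply the three tuples that the optimization call expects. Since the optimized model supports any shape, the example resolution only needs to be representative.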
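import torch_blade
# Assumption: the source calls blade_optimize without showing its import;
# it is taken here to be torch_blade.optimize.
from torch_blade import optimize as blade_optimize
# encoder_inputs, unet_inputs and decoder_inputs are example inputs for each
# component; the source does not show how they are built.
opt_cfg = torch_blade.Config()
opt_cfg.enable_fp16 = True
with opt_cfg, torch.no_grad():
    encoder = blade_optimize(pipe.text_encoder, model_inputs=encoder_inputs, allow_tracing=True)
    unet = blade_optimize(pipe.unet, model_inputs=unet_inputs, allow_tracing=True)
    decoder = blade_optimize(pipe.vae.decoder, model_inputs=decoder_inputs, allow_tracing=True)
3. Replace the original components with the optimized ones and run inference as usual: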
from dataclasses import dataclass
# Minimal stand-in for the output object the pipeline expects from the UNet.
@dataclass
class UNet2DConditionOutput:
    sample: torch.FloatTensor
# Each wrapper exposes the attributes the pipeline reads and forwards the
# call to the corresponding optimized module.
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = pipe.unet.config
        self.in_channels = pipe.unet.in_channels
        self.device = pipe.unet.device
    def forward(self, latent_model_input, t, encoder_hidden_states, **kwargs):
        sample = unet(latent_model_input.half(), t.half(), encoder_hidden_states.half())["sample"]
        return UNet2DConditionOutput(sample=sample)
class TracedEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = pipe.text_encoder.config
        self.device = pipe.text_encoder.device
        self.dtype = torch.half
    def forward(self, input_ids, **kwargs):
        embeddings = encoder(input_ids.long())
        return [embeddings["last_hidden_state"]]
class TracedDecoder(torch.nn.Module):
    def forward(self, input):
        return decoder(input.half())
# Swap the optimized components into the existing pipeline.
pipe.text_encoder = TracedEncoder()
pipe.unet = TracedUNet()
pipe.vae.decoder = TracedDecoder()
Performance Comparison
GPU A100:
GPU A10:
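Latency comparisons of this kind are typically collected with a warm-up phase followed by timed iterations. The harness below is a generic sketch of that procedure, not the script used for the article's measurements; `benchmark` and `speedup` are hypothetical helpers. When timing GPU work, the callable should block until the device finishes (for example by calling `torch.cuda.synchronize()` inside it), otherwise only the kernel launch is measured.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Return the mean wall-clock latency of fn in milliseconds."""
    for _ in range(warmup):      # warm-up runs are executed but not timed
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


if __name__ == "__main__":
    # Stand-in workloads; replace with calls to the eager and optimized pipelines.
    slow = benchmark(lambda: sum(range(200_000)))
    fast = benchmark(lambda: sum(range(50_000)))
    print(f"speedup: {speedup(slow, fast):.2f}x")
```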
Inference Result Verification
After PAI‑Blade optimization, the generated images match the original PyTorch eager outputs. The left image shows the PyTorch result, and the right image shows the PAI‑Blade optimized result.
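The comparison above is visual. A numerical check can complement it, since fp16 execution may introduce small pixel-level differences even when two images look identical. The sketch below computes the maximum absolute difference and the PSNR over flattened 8-bit pixel values; it is an assumed verification approach, not one described in the source, and both function names are hypothetical.

```python
import math
from typing import Sequence


def max_abs_diff(a: Sequence[int], b: Sequence[int]) -> int:
    """Largest per-pixel difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixel values")
    return max((abs(x - y) for x, y in zip(a, b)), default=0)


def psnr(a: Sequence[int], b: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixel values")
    if not a:
        raise ValueError("images must not be empty")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy pixel values standing in for the eager and optimized outputs.
eager = [10, 20, 30, 40]
optimized = [10, 21, 30, 39]
print(max_abs_diff(eager, optimized), round(psnr(eager, optimized), 2))
```

For two PIL images generated from the same seed, the bytes returned by `img.tobytes()` can be passed directly as the pixel sequences.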
Supported Pipelines
StableDiffusionPipeline
StableDiffusionImg2ImgPipeline
StableDiffusionInpaintPipeline
AltDiffusionPipeline
LoRA Optimization
LoRA adds low‑rank matrices to a pretrained model, fine‑tuning only the new weights and greatly reducing fine‑tuning cost. PAI‑Blade already supports LoRA in the HuggingFace diffusers library; the same Blade‑optimized pipeline can run with any LoRA weights without re‑optimizing.
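The low-rank update can be written as W' = W + (alpha / r) · B·A, where B is d×r and A is r×k, with r much smaller than d and k. The pure-Python sketch below illustrates that formula and the resulting parameter savings. It follows the standard LoRA formulation rather than anything specific to PAI-Blade, whose internal handling of LoRA weights the source does not describe, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of x (m x n) and y (n x p)."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def apply_lora(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * (B @ A); the result has the same shape as W."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def lora_param_count(d: int, k: int, rank: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return rank * (d + k)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2 x 2)
b = [[1.0], [2.0]]             # 2 x 1
a = [[3.0, 4.0]]               # 1 x 2, so rank r = 1
print(apply_lora(w, b, a, alpha=1.0))
print(lora_param_count(768, 768, rank=4), "vs", 768 * 768)
```

For a 768×768 layer at rank 4, the adapter holds 6,144 parameters against 589,824 in the full weight. The merged weight also keeps the shape of W, so swapping LoRA weights changes values but not tensor shapes.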
Outlook
Stable Diffusion technology continues to evolve. The PAI‑Blade team plans to integrate its optimizations into stable‑diffusion‑webui and to speed up fine‑tuning.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.