How PAI‑Blade Supercharges Stable Diffusion Inference on GPUs

This article explains how PAI‑Blade, built on the BladeDISC compiler and BlaDNN library, dramatically reduces latency and memory usage for Stable Diffusion inference, provides step‑by‑step usage examples with code, shows performance gains on A100 and A10 GPUs, and outlines future optimization directions.

Alibaba Cloud Big Data AI Platform

Background

AIGC (AI‑generated content) is a rapidly growing area of AI computing, and Stable Diffusion is the most popular open‑source model in the field. As its application scenarios expand, inference latency and computational cost have become pressing problems.

Introduction

PAI‑Blade is a general‑purpose inference‑optimization tool released by PAI. It jointly optimizes models through system‑level techniques to achieve the best possible inference performance. It builds on two components: BladeDISC, an AI compiler with full dynamic‑shape support, and BlaDNN, a high‑performance computing library based on deep‑learning‑driven automatic scheduling. Together they provide automatic inference optimization for many model types, including Stable Diffusion, large language models (LLM), large‑scale sparse recommendation models (CTR), and speech recognition models (ASR).

BladeDISC

BladeDISC is an AI compiler that supports fully dynamic shapes. Its front‑end accepts PyTorch and TensorFlow models; for PyTorch it supports both TorchScript and TorchDynamo input modes. The back‑end uses the AStitch large‑scale operator‑fusion technique and efficient code generation to improve execution efficiency of memory‑intensive operators.
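The benefit of operator fusion for memory‑intensive operators can be illustrated with a toy model. The sketch below is not AStitch or any BladeDISC code; the function names and the traffic formula are assumptions made for illustration only. It compares a chain of three elementwise operations run as separate kernels (each one reads its input and writes a full intermediate) with a single fused pass that produces the same values.

```python
import math
from typing import List


def unfused(xs: List[float]) -> List[float]:
    """Three separate 'kernels'; each materializes a full intermediate list."""
    a = [x + 1.0 for x in xs]          # kernel 1: read xs, write a
    b = [v * 2.0 for v in a]           # kernel 2: read a, write b
    return [math.tanh(v) for v in b]   # kernel 3: read b, write result


def fused(xs: List[float]) -> List[float]:
    """One 'kernel': intermediates stay in local variables, never in memory."""
    return [math.tanh((x + 1.0) * 2.0) for x in xs]


def memory_traffic(num_elements: int, num_kernels: int) -> int:
    """Toy cost model: every kernel reads and writes each element once."""
    return 2 * num_elements * num_kernels


data = [0.0, 0.5, -1.25]
assert unfused(data) == fused(data)  # same arithmetic, same results
# Ratio of element transfers: three kernels versus one fused kernel.
print(memory_traffic(len(data), 3) / memory_traffic(len(data), 1))  # prints 3.0
```

In this simplified model the fused version moves a third of the data for the same result, which is why fusion matters most for operators whose cost is dominated by memory access rather than arithmetic.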

BladeDISC is open‑source on GitHub: https://github.com/alibaba/BladeDISC

BlaDNN

BlaDNN is a high‑performance computing library based on deep‑learning‑driven automatic scheduling. An upgraded version of Ansor, it generates kernels that outperform Ansor's and can rely entirely on DNN automatic scheduling without manual tuning. In dynamic‑shape scenarios, compute‑intensive GPU operators reach on average 99.39% of the best tuned performance. The scheduling model's own inference latency is as low as 2 µs and it uses only one CPU core, so it causes no performance jitter in the GPU model.

Advantages of Using PAI‑Blade for Stable Diffusion

High performance: Blade reduces the end‑to‑end latency of Text2Img, Img2Img, and similar pipelines by 2.42‑3.05× and reduces memory usage by up to 5.27×, outperforming state‑of‑the‑art solutions such as TensorRT‑8.5.

Full dynamic‑shape support: a single optimization works for any input shape or batch size.

Ease of use and extensibility: only a few lines of code are needed to enable Blade optimization across multiple pipelines, and it also supports LoRA‑based inference.

Usage Example

The following example uses the popular runwayml/stable-diffusion-v1-5 Text2Img pipeline to demonstrate PAI‑Blade optimization.

Environment Installation

The complete script and environment are packaged in the Docker image registry.cn-beijing.aliyuncs.com/blade_demo/blade_diffusion. Run the inference example with:

python /blade/blade_diffusion.py

Model Optimization Steps

1. Load the pretrained model:

from diffusers import StableDiffusionPipeline
import torch
device = torch.device("cuda:0")
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16
).to(device)

2. Optimize with PAI‑Blade (supports any shape after optimization):

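import torch_blade
from torch_blade import optimize as blade_optimize

# encoder_inputs, unet_inputs and decoder_inputs are example inputs for each
# sub-model (a tensor or a tuple of tensors); they must be prepared beforehand.
opt_cfg = torch_blade.Config()
opt_cfg.enable_fp16 = True  # run the optimized modules in half precision
with opt_cfg, torch.no_grad():
    # Optimize the text encoder, UNet and VAE decoder separately.
    encoder = blade_optimize(pipe.text_encoder, model_inputs=encoder_inputs, allow_tracing=True)
    unet = blade_optimize(pipe.unet, model_inputs=unet_inputs, allow_tracing=True)
    decoder = blade_optimize(pipe.vae.decoder, model_inputs=decoder_inputs, allow_tracing=True)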

3. Replace the original components with the optimized ones and run inference as usual:

from dataclasses import dataclass

# Thin wrappers that give the optimized modules the attributes and output
# formats the diffusers pipeline expects from the original components.

@dataclass
class UNet2DConditionOutput:
    sample: torch.FloatTensor

class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Copy the attributes the pipeline reads from the original UNet.
        self.config = pipe.unet.config
        self.in_channels = pipe.unet.in_channels
        self.device = pipe.unet.device
    def forward(self, latent_model_input, t, encoder_hidden_states, **kwargs):
        # Cast inputs to FP16 to match the optimized module.
        sample = unet(latent_model_input.half(), t.half(), encoder_hidden_states.half())["sample"]
        return UNet2DConditionOutput(sample=sample)

class TracedEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = pipe.text_encoder.config
        self.device = pipe.text_encoder.device
        self.dtype = torch.half
    def forward(self, input_ids, **kwargs):
        embeddings = encoder(input_ids.long())
        # The pipeline indexes the result with [0] to get the hidden states.
        return [embeddings["last_hidden_state"]]

class TracedDecoder(torch.nn.Module):
    def forward(self, input):
        return decoder(input.half())

# Swap the optimized components into the pipeline.
pipe.text_encoder = TracedEncoder()
pipe.unet = TracedUNet()
pipe.vae.decoder = TracedDecoder()
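Step 2 above passes encoder_inputs, unet_inputs, and decoder_inputs without showing how they are built. The sketch below only computes plausible tensor shapes for them; it is an assumption based on the Stable Diffusion v1.x architecture (8× VAE downsampling, 4 latent channels, 77 CLIP tokens, 768‑wide text embeddings), not code from the PAI‑Blade script. The helper name sample_input_shapes and the doubled UNet batch for classifier‑free guidance are illustrative choices.

```python
from typing import Dict, Tuple

# Architecture constants for Stable Diffusion v1.x (assumed, see lead-in).
VAE_SCALE = 8         # the VAE downsamples height and width by 8
LATENT_CHANNELS = 4   # channels of the latent image
MAX_TOKENS = 77       # CLIP text encoder sequence length
HIDDEN_SIZE = 768     # width of the CLIP text embeddings


def sample_input_shapes(
    height: int = 512, width: int = 512, batch: int = 1, guidance: bool = True
) -> Dict[str, Tuple[int, ...]]:
    """Return example input shapes for the encoder, UNet and VAE decoder."""
    if height % VAE_SCALE or width % VAE_SCALE:
        raise ValueError("height and width must be multiples of 8")
    # With classifier-free guidance the UNet sees conditional and
    # unconditional inputs stacked along the batch dimension.
    unet_batch = batch * 2 if guidance else batch
    latent_hw = (height // VAE_SCALE, width // VAE_SCALE)
    return {
        "encoder_input_ids": (batch, MAX_TOKENS),
        "unet_latents": (unet_batch, LATENT_CHANNELS, *latent_hw),
        "unet_encoder_hidden_states": (unet_batch, MAX_TOKENS, HIDDEN_SIZE),
        "decoder_latents": (batch, LATENT_CHANNELS, *latent_hw),
    }


for name, shape in sample_input_shapes().items():
    print(name, shape)
```

Random FP16 tensors of these shapes on the GPU (plus an int64 tensor for the token IDs and a scalar timestep for the UNet) would be one reasonable way to build the example inputs. Because the optimization supports dynamic shapes, the example inputs do not need to match the sizes used later at inference time.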

Performance Comparison

[Figure: performance comparison on GPU A100]

[Figure: performance comparison on GPU A10]
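Latency comparisons of this kind can be reproduced with a small timing harness. The following is a minimal sketch, not the benchmark used for the results above; the benchmark and speedup helpers, the warm‑up count, and the optional sync hook are all assumptions introduced here.

```python
import time
from statistics import mean
from typing import Callable, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 3,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock seconds per call of fn."""
    for _ in range(warmup):      # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()                   # make sure warm-up work has finished
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()               # wait for asynchronous work before stopping the clock
        timings.append(time.perf_counter() - start)
    return mean(timings)


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Latency ratio: values above 1 mean the optimized run is faster."""
    return baseline_s / optimized_s


# Stand-in workload so the sketch runs without a GPU.
latency = benchmark(lambda: sum(range(10_000)))
print(f"mean latency: {latency * 1e3:.3f} ms")
```

When timing a real pipeline on a GPU, passing sync=torch.cuda.synchronize matters because CUDA kernels are launched asynchronously; without it the timer can stop before the GPU has finished. Running the same harness on the original and the optimized pipeline, then calling speedup on the two results, gives a ratio comparable to the figures quoted earlier.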

Inference Result Verification

After PAI‑Blade optimization, the generated images match the original PyTorch eager outputs. The left image shows the PyTorch result, and the right image shows the PAI‑Blade optimized result.
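A visual check can be backed by a numeric one. The sketch below compares two images given as flat sequences of 8‑bit pixel values; it is an illustrative assumption about how such a check could be done, not the verification procedure used in the article. The helper names are hypothetical, and since half‑precision kernels may not be bit‑identical to the eager run, a tolerance‑style metric is used rather than strict equality.

```python
import math
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest per-pixel difference between two equally sized images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    return max(abs(x - y) for x, y in zip(a, b))


def psnr(a: Sequence[float], b: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [12, 200, 37, 255]   # stand-in for the PyTorch eager image
optimized = [12, 201, 36, 255]   # stand-in for the optimized image
print(max_abs_diff(reference, optimized), round(psnr(reference, optimized), 2))
```

For a meaningful comparison both pipelines would need the same prompt, the same number of steps, and the same random seed, so that any difference comes from the optimization and not from sampling.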

Supported Pipelines

StableDiffusionPipeline

StableDiffusionImg2ImgPipeline

StableDiffusionInpaintPipeline

AltDiffusionPipeline

LoRA Optimization

LoRA adds low‑rank matrices to a pretrained model, fine‑tuning only the new weights and greatly reducing fine‑tuning cost. PAI‑Blade already supports LoRA in the HuggingFace diffusers library; the same Blade‑optimized pipeline can run with any LoRA weights without re‑optimizing.
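The low‑rank update can be written as W' = W + s · (B A), where W is a pretrained d_out × d_in weight, B is d_out × r, A is r × d_in, and s is a scaling factor. The sketch below, in plain Python with hypothetical helper names, shows the parameter savings and the merge step. It illustrates LoRA in general; the source does not describe how PAI‑Blade handles LoRA weights internally.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters: full fine-tuning versus a rank-r LoRA update."""
    return d_out * d_in, rank * (d_out + d_in)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, b: Matrix, a: Matrix, scale: float) -> Matrix:
    """Return W' = W + scale * (B @ A); W' has the same shape as W."""
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


full, lora = lora_param_counts(768, 768, rank=4)
print(full, lora)  # prints 589824 6144

w = [[1.0, 0.0], [0.0, 1.0]]
merged = merge_lora(w, b=[[1.0], [2.0]], a=[[3.0, 4.0]], scale=0.5)
print(merged)  # prints [[2.5, 2.0], [3.0, 5.0]]
```

Because the merged weight has exactly the shape of the original, swapping LoRA weights changes values but not the structure of the computation, which is one plausible reason the same optimized pipeline can be reused.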

Outlook

Stable Diffusion technology continues to evolve. The PAI‑Blade team plans to integrate its optimizations into stable‑diffusion‑webui and to speed up fine‑tuning.
