Boost Stable Diffusion Inference with PAI-Blade: LoRA & ControlNet Optimization

This article explains how to use PAI-Blade to accelerate Stable Diffusion inference by optimizing LoRA and ControlNet components, detailing configuration steps, code modifications, benchmark results on A100/A10 GPUs, and integration with both Diffusers and the popular Stable-Diffusion-webui, highlighting performance gains and memory savings.

Alibaba Cloud Big Data AI Platform

May 30, 2023

Background

In the previous article we optimized Stable Diffusion models in diffusers using PAI-Blade. This article continues with inference optimization for LoRA and ControlNet, and shows how to integrate PAI-Blade into Stable-Diffusion-webui.

LoRA Optimization

PAI-Blade optimizes LoRA in much the same way as in the previous article: load the model, load the LoRA weights, and replace the original model with the optimized one. The only difference is that freeze_module = False is set so that the optimizer does not compile the weights into the graph, which lets pipe.unet.load_attn_procs() load LoRA weights at runtime.

Because the weights are not compiled, some constant-folding optimizations are lost. To mitigate this, PAI-Blade provides torch_blade.monkey_patch, which patches parts of the UNet and VAE at the Python level before optimization.

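import torch
import torch_blade
from torch_blade.monkey_patch import patch_utils

# `pipe` is the diffusers pipeline loaded earlier.
# Patch conv2d in the VAE decoder and the UNet before optimization.
patch_utils.patch_conv2d(pipe.vae.decoder)
patch_utils.patch_conv2d(pipe.unet)

# Keep the weights out of the compiled graph so LoRA weights can be loaded later.
opt_cfg = torch_blade.Config()
opt_cfg.freeze_module = False
with opt_cfg, torch.no_grad():
    ...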

If LoRA weight switching is not required, the steps above (the conv2d patch and freeze_module = False) can be omitted for faster inference.
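
The trade-off behind freeze_module can be illustrated with a toy example in plain Python. This is only a conceptual sketch, not how PAI-Blade compiles models: ToyLinear, compile_frozen, and compile_unfrozen are hypothetical names invented for the illustration. The "frozen" version snapshots the weights as constants, so later weight loads are invisible to it, while the "unfrozen" version reads the weights on every call.

```python
class ToyLinear:
    """Hypothetical stand-in for a layer whose weights can be replaced."""

    def __init__(self, weight):
        self.weight = list(weight)

    def load_weights(self, new_weight):
        # Plays the role of loading new (for example LoRA) weights at runtime.
        self.weight = list(new_weight)


def dot(weight, x):
    return sum(w * x_i for w, x_i in zip(weight, x))


def compile_frozen(layer):
    # Weights are captured once and treated as constants from here on.
    baked = tuple(layer.weight)

    def run(x):
        return dot(baked, x)

    return run


def compile_unfrozen(layer):
    # Weights are looked up on the layer each time the function runs.
    def run(x):
        return dot(layer.weight, x)

    return run


layer = ToyLinear([1.0, 2.0])
frozen = compile_frozen(layer)
unfrozen = compile_unfrozen(layer)
x = [3.0, 4.0]

before = (frozen(x), unfrozen(x))
layer.load_weights([0.5, 0.5])
after = (frozen(x), unfrozen(x))
```

After load_weights, only the unfrozen function reflects the new weights. The frozen one keeps returning the old result, which is why a frozen compile is faster but cannot switch weights.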

ControlNet Adaptation

ControlNet inference can be split into two parts: the input and mid blocks share the same structure as the first half of the Stable Diffusion UNet, while the remaining part consists of convolutions. All ControlNet outputs are fed into the Stable Diffusion UNet as additional inputs.

Optimization steps:

controlnet = torch_blade.optimize(pipe.controlnet, model_inputs=tuple(controlnet_inputs), allow_tracing=True)

Because earlier versions of torch.jit.trace do not support dict inputs, the UNet is wrapped in a class whose forward takes positional arguments, so that it can be traced and optimized.

class UnetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet
    def forward(self, sample, timestep, encoder_hidden_states,
                down_block_additional_residuals, mid_block_additional_residual):
        return self.unet(sample, timestep,
                         encoder_hidden_states=encoder_hidden_states,
                         down_block_additional_residuals=down_block_additional_residuals,
                         mid_block_additional_residual=mid_block_additional_residual)

down_block_res_samples, mid_block_res_sample = controlnet(*controlnet_inputs)
unet_inputs += [tuple(down_block_res_samples), mid_block_res_sample]
unet = torch_blade.optimize(UnetWrapper(pipe.unet).eval(),
                            model_inputs=tuple(unet_inputs), allow_tracing=True)

With these optimizations, LoRA weights and ControlNet weights can both be replaced at runtime.
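
The wrapper pattern used for the UNet can be tried in isolation with plain torch.jit.trace. The sketch below is not PAI-Blade code: TinyUNet and PositionalWrapper are hypothetical toy modules, and the way the residuals are added here is only a simplification for the demo. It shows a module with keyword-only inputs being traced through a wrapper that accepts a tuple of residual tensors positionally.

```python
import torch


class TinyUNet(torch.nn.Module):
    """Hypothetical stand-in for a UNet whose extra inputs are keyword-only."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

    def forward(self, sample, *, down_block_additional_residuals,
                mid_block_additional_residual):
        hidden = self.proj(sample)
        # Toy behaviour: add every residual to the hidden state.
        for residual in down_block_additional_residuals:
            hidden = hidden + residual
        return hidden + mid_block_additional_residual


class PositionalWrapper(torch.nn.Module):
    """Exposes the keyword inputs as positional ones so the model can be traced."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, sample, down_block_additional_residuals,
                mid_block_additional_residual):
        return self.model(
            sample,
            down_block_additional_residuals=down_block_additional_residuals,
            mid_block_additional_residual=mid_block_additional_residual,
        )


torch.manual_seed(0)
model = TinyUNet().eval()
sample = torch.randn(2, 4)
residuals = (torch.randn(2, 4), torch.randn(2, 4))  # tuple of tensors
mid = torch.randn(2, 4)

with torch.no_grad():
    # Example inputs are passed as one tuple; the nested tuple is allowed.
    traced = torch.jit.trace(PositionalWrapper(model), (sample, residuals, mid))
    expected = model(
        sample,
        down_block_additional_residuals=residuals,
        mid_block_additional_residual=mid,
    )
    actual = traced(sample, residuals, mid)

outputs_match = torch.allclose(actual, expected)
```

The residuals are passed as a tuple rather than a list, matching the tuple(down_block_res_samples) conversion in the snippet above.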

Benchmark Results

Tests on A100/A10 GPUs using the runwayml/stable-diffusion-v1-5 model with 50 sampling steps show the performance impact of LoRA and ControlNet optimizations.

WebUI Integration

PAI-Blade also supports the popular stable-diffusion-webui. The optimization is applied per sub‑module of UNet and ControlNet rather than the whole model, preserving inference speed while allowing weight switching. LoRA weights are fused with their scales during loading, eliminating runtime overhead.

Benchmarks on an A10 GPU (batch size 1, 512×512 resolution) demonstrate that PAI-Blade inference time is independent of the number of LoRA adapters, unlike eager or xformers modes.
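
Why fusing LoRA weights at load time makes inference cost independent of the number of adapters can be seen in a small plain-Python sketch. It assumes the standard LoRA formulation, in which each adapter contributes scale * (up @ down) to a base weight; the source does not describe PAI-Blade's actual fusion code, and fuse_lora and forward_unfused are hypothetical helper names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def fuse_lora(base_weight, adapters):
    """Fold every adapter into the base weight once, at load time."""
    fused = [row[:] for row in base_weight]
    for down, up, scale in adapters:
        delta = matmul(up, down)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                fused[i][j] += scale * value
    return fused


def forward_unfused(base_weight, adapters, x):
    """Apply the adapters at runtime: extra work per adapter on every call."""
    y = matvec(base_weight, x)
    for down, up, scale in adapters:
        update = matvec(up, matvec(down, x))
        y = [y_i + scale * u_i for y_i, u_i in zip(y, update)]
    return y


base_weight = [[1, 0, 2], [0, 1, -1]]          # 2 outputs, 3 inputs
adapters = [
    ([[1, 1, 0]], [[2], [0]], 0.5),            # (down, up, scale), rank 1
    ([[0, 1, 1]], [[1], [-1]], 2.0),
]
x = [1, 2, 3]

fused_weight = fuse_lora(base_weight, adapters)
fused_output = matvec(fused_weight, x)         # one matvec, any adapter count
unfused_output = forward_unfused(base_weight, adapters, x)
```

Both paths produce the same output, but the fused path performs a single matrix-vector product per call regardless of how many adapters were loaded.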

Summary

PAI-Blade significantly reduces inference latency and memory consumption for Stable Diffusion by optimizing the encoder, UNet, and decoder, while fully supporting LoRA, ControlNet, and webui workflows. Comparative tables show PAI-Blade offers the most comprehensive feature support and best performance among existing solutions.
