How PAI‑Blade Supercharges PyTorch Training with Up to 41% Speedup
This article explains how PAI‑Blade uses compiler optimizations, TorchDynamo, MHLO conversion, and aggressive kernel fusion to accelerate PyTorch training, provides simple two‑line integration code, showcases benchmark results on A10 and A100 GPUs, and details deployment steps on PAI‑DSW.
Background
Since its release, Stable Diffusion has spread rapidly online. It generates images from text prompts and can be fine‑tuned with user‑provided style photos. For example, after fine‑tuning, the prompt "A photo of sks dog in a bucket" produces a corresponding image.
PAI‑Blade Accelerates PyTorch Training
PAI‑Blade applies compile‑time optimizations to improve the execution efficiency of PyTorch programs. Its source code is available at https://github.com/alibaba/BladeDISC.
PAI‑Blade API
Accelerating a PyTorch program with PAI‑Blade is straightforward: just add two lines to the original script.
# 'model', 'data_loader', 'compute_loss' and 'optimizer' come from the original script
import torch
# 1. import the PAI‑Blade Python package
import torch_blade
# 2. compile 'model' with BladeDISC as the backend
model = torch.compile(backend='aot_disc')(model)
for batch, label in data_loader():
    # clear gradients left over from the previous step
    optimizer.zero_grad()
    output = model(**batch)
    loss = compute_loss(output, label)
    loss.backward()
    optimizer.step()
The call torch.compile(backend='aot_disc')(model) uses BladeDISC as the TorchDynamo backend, speeding up both the forward and backward passes. The model can also be a plain Python function implemented with PyTorch.
PAI‑Blade Compilation Pipeline
TorchDynamo records the PyTorch program into one or more FX graphs; PAI‑Blade then applies a series of passes to optimize graph execution. More details are available in the PyTorch compiler deep dive: https://pytorch.org/docs/2.1/torch.compiler_deepdive.html
MHLO Conversion
PAI‑Blade integrates the Torch‑MLIR project to convert PyTorch IR into the MHLO dialect of MLIR, enabling further optimizations by BladeDISC. The conversion code has been contributed back to the community (https://github.com/llvm/torch-mlir).
BlaDNN Library
The BlaDNN library provides high‑performance, compute‑intensive operators. PAI‑Blade automatically replaces sub‑graphs that match typical patterns with equivalent BlaDNN kernels for maximal speed.
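The replacement step can be pictured with a toy, pure‑Python sketch. It treats a graph as a flat list of operator names and swaps a matching run for one fused operator. The operator names, the pattern, and the name fused_matmul_bias_gelu are hypothetical, and a real matcher works on dataflow graphs rather than flat lists, so this only illustrates the idea and is not PAI‑Blade's implementation.

```python
def replace_pattern(ops, pattern, fused_op):
    """Replace each non-overlapping run of `pattern` in `ops` with `fused_op`."""
    out = []
    i = 0
    n = len(pattern)
    while i < len(ops):
        if ops[i:i + n] == pattern:
            # The run matches: emit one fused operator and skip the matched ops.
            out.append(fused_op)
            i += n
        else:
            out.append(ops[i])
            i += 1
    return out


# Hypothetical operator sequence; the names are for illustration only.
graph = ["matmul", "add", "gelu", "layer_norm", "matmul", "add", "gelu"]
optimized = replace_pattern(
    graph, ["matmul", "add", "gelu"], "fused_matmul_bias_gelu"
)
print(optimized)
```

Operators that do not match any pattern (layer_norm here) are left untouched, so they remain available to the later compiler passes.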
Memory‑Intensive Kernel Fusion
Operator fusion is a major source of performance gain. A typical workload may contain element‑wise operators, dynamic‑shape broadcast/reshape/reduce operators, and compute‑heavy kernels (e.g., GEMM). In eager‑mode PyTorch, each operator launches a separate kernel, causing cache thrashing and kernel‑launch overhead. BladeDISC adopts an aggressive fusion strategy that merges multiple kernels into a single kernel using shared‑memory stitching (AStitch) and index/value caching.
For the illustrated memory‑bound workload, BladeDISC reduces kernel count from seven to one, approaching hardware peak performance.
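The effect of fusion on kernel count can be sketched in plain Python. Here each call to the hypothetical helper kernel stands for one GPU kernel launch that reads its inputs and materializes a full output buffer. The three‑operator chain (add, scale, ReLU) is an assumed example, not the workload measured above, and the sketch says nothing about how AStitch generates code.

```python
launches = 0


def kernel(fn, *inputs):
    """Model one elementwise kernel launch: a full pass that writes a new buffer."""
    global launches
    launches += 1
    return [fn(*vals) for vals in zip(*inputs)]


def unfused(x, y):
    # Three launches; t0 and t1 are intermediate buffers written and read back.
    t0 = kernel(lambda a, b: a + b, x, y)
    t1 = kernel(lambda a: a * 2.0, t0)
    return kernel(lambda a: max(a, 0.0), t1)


def fused(x, y):
    # One launch; intermediates live only inside the per-element expression.
    return kernel(lambda a, b: max((a + b) * 2.0, 0.0), x, y)


x = [1.0, -3.0, 0.5]
y = [2.0, 1.0, -0.5]

launches = 0
out_unfused = unfused(x, y)
unfused_launches = launches

launches = 0
out_fused = fused(x, y)
fused_launches = launches

print(out_unfused == out_fused, unfused_launches, fused_launches)
```

Both versions compute the same values; the fused one avoids two intermediate buffers and two launches, which is where the saving comes from for memory‑bound operators.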
Inplace Mutation Optimization
In eager mode, PyTorch can update a tensor in‑place with operators like aten.add_, avoiding an extra output tensor. However, MLIR requires SSA form, so a naive translation inserts a D2D memcpy, incurring an extra memory copy. BladeDISC marks the input and output buffers as the same in MHLO IR, allowing the generated gpu.store to write directly back to the original buffer, eliminating the copy.
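A toy Python sketch of the two translations is shown here. Lists stand in for device buffers, and the hypothetical helper copy_back stands in for the D2D memcpy. This only models the buffer traffic described above; it is not how BladeDISC represents buffers in MHLO.

```python
copies = 0


def add_ssa(buf, value):
    """SSA-style add: returns a fresh buffer and leaves the input untouched."""
    return [v + value for v in buf]


def copy_back(dst, src):
    """Stand-in for the device-to-device memcpy of a naive translation."""
    global copies
    copies += 1
    dst[:] = src


def add_inplace_naive(buf, value):
    # Compute into a new buffer, then copy the result over the original.
    result = add_ssa(buf, value)
    copy_back(buf, result)


def add_inplace_aliased(buf, value):
    # Input and output are the same buffer, so each store writes back directly.
    for i in range(len(buf)):
        buf[i] = buf[i] + value


a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0]

add_inplace_naive(a, 10.0)
naive_copies = copies

copies = 0
add_inplace_aliased(b, 10.0)
aliased_copies = copies

print(a == b, naive_copies, aliased_copies)
```

Both paths leave the same values in the original buffer; only the naive path pays for an extra buffer and a copy.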
Benchmark
On NVIDIA A10 and A100 GPUs, PAI‑Blade achieves up to 41.6% and 28.4% performance gains respectively (batch size = 1).
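To relate these percentages to wall‑clock time, the short calculation below assumes that "performance gain" means higher training throughput (steps per second). That reading is an assumption; if the figures were measured differently, the conversion does not apply.

```python
def step_time_ratio(gain_percent):
    """Optimized step time as a fraction of baseline, assuming the gain is in throughput."""
    return 1.0 / (1.0 + gain_percent / 100.0)


# Gains reported above for batch size 1.
reported_gains = [("A10", 41.6), ("A100", 28.4)]

ratios = {gpu: step_time_ratio(gain) for gpu, gain in reported_gains}
for gpu, ratio in ratios.items():
    saved = (1.0 - ratio) * 100.0
    print(f"{gpu}: step time is {ratio:.3f}x baseline ({saved:.1f}% less)")
```

Under that assumption, a 41.6% throughput gain corresponds to roughly 29% less time per step, not 41.6% less.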
Using PAI‑Blade on DSW
Create a DSW instance on the PAI platform and use the custom Docker image pai-blade-registry.cn-hangzhou.cr.aliyuncs.com/pai-blade/aicompiler:latest-stablediffusion-torch-2.0.1-cu118. See the official documentation for details.
Launch a Jupyter Notebook and start the fine‑tuning task:
!cd /opt/StableDiffusion && bash launch_dreambooth_train.sh
When the log indicates completion, run the inference task and view the generated image:
!cd /opt/StableDiffusion && python inference.py && cp dog-bucket.png /mnt/workspace
References
BladeDISC: https://github.com/alibaba/BladeDISC
TorchDynamo: https://pytorch.org/docs/2.1/torch.compiler_deepdive.html
Torch‑MLIR Project: https://github.com/llvm/torch-mlir
PAI DSW documentation: https://help.aliyun.com/zh/pai/user-guide/overview-5
