How PAI‑Blade Supercharges PyTorch Training with Up to 41% Speedup

This article explains how PAI‑Blade uses compiler optimizations, TorchDynamo, MHLO conversion, and aggressive kernel fusion to accelerate PyTorch training, provides simple two‑line integration code, showcases benchmark results on A10 and A100 GPUs, and details deployment steps on PAI‑DSW.

Alibaba Cloud Big Data AI Platform

Background

Since its release, Stable Diffusion has spread rapidly online. It generates images from text prompts and can be fine‑tuned with user‑provided style photos. For example, after fine‑tuning on photos of a particular dog, the prompt "A photo of sks dog in a bucket" generates an image of that dog in a bucket.

PAI‑Blade Accelerates PyTorch Training

PAI‑Blade applies compile‑time optimizations to make PyTorch programs run more efficiently. Its compiler, BladeDISC, is open source at https://github.com/alibaba/BladeDISC.

PAI‑Blade API

Accelerating a PyTorch program with PAI‑Blade is straightforward: just add two lines to the original script.

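import torch
# 1. Import the PAI-Blade Python package
import torch_blade
# 2. Compile 'model' so that BladeDISC accelerates it
model = torch.compile(backend='aot_disc')(model)
# 'model', 'optimizer', 'data_loader', and 'compute_loss' come from the original script
for batch, label in data_loader():
    optimizer.zero_grad()  # clear gradients left over from the previous step
    output = model(**batch)
    loss = compute_loss(output, label)
    loss.backward()
    optimizer.step()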

The call torch.compile(backend='aot_disc')(model) uses BladeDISC as the TorchDynamo backend, speeding up both forward and backward passes. The model can also be a plain Python function implemented with PyTorch.

PAI‑Blade Compilation Pipeline

TorchDynamo captures the PyTorch program as one or more FX graphs; PAI‑Blade then runs a series of optimization passes over each graph. For more detail, see the PyTorch compiler deep dive listed under TorchDynamo in the References.
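To make the capture step concrete, here is a minimal sketch of a custom TorchDynamo backend that only records the FX graphs it receives and then runs them unchanged. It assumes PyTorch 2.x; the names `inspecting_backend`, `captured_graphs`, and `scaled_relu` are made up for this illustration, and the backend is a stand‑in for `aot_disc`, not PAI‑Blade's implementation.

```python
import torch
import torch.fx

# Every FX graph that TorchDynamo hands to the backend is stored here.
captured_graphs = []

def inspecting_backend(gm: torch.fx.GraphModule, example_inputs):
    """Hypothetical backend: record the graph, then run it without optimizing."""
    captured_graphs.append(gm)
    return gm.forward  # a real backend would return an optimized callable

def scaled_relu(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # A plain Python function written with PyTorch operators.
    return torch.relu(x @ w) * 0.5

compiled = torch.compile(scaled_relu, backend=inspecting_backend)

x = torch.randn(4, 8)
w = torch.randn(8, 3)
out = compiled(x, w)  # the first call triggers graph capture

for gm in captured_graphs:
    print(gm.graph)  # lists the captured operators as FX nodes
```

A compiler backend such as BladeDISC plugs in at the same point, but returns a callable built from the optimized graph instead of `gm.forward`.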

MHLO Conversion

PAI‑Blade integrates the Torch‑MLIR project to convert PyTorch IR into the MHLO dialect of MLIR, enabling further optimizations by BladeDISC. The conversion code has been contributed back to the community (https://github.com/llvm/torch-mlir).

BlaDNN Library

The BlaDNN library provides high‑performance, compute‑intensive operators. PAI‑Blade automatically replaces sub‑graphs that match typical patterns with equivalent BlaDNN kernels for maximal speed.
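The idea of pattern‑based replacement can be sketched on a toy graph represented as a flat list of operator names. The pattern table and the `lib.*` kernel names below are hypothetical and chosen only for illustration; they are not BlaDNN's actual kernels or matching rules, and real graphs are DAGs rather than lists.

```python
# Toy sketch of pattern-based subgraph replacement. The op names and the
# pattern table are hypothetical; they are not BlaDNN's real kernels or API.
from typing import List, Tuple

# Each entry maps a sequence of primitive ops to one library kernel name.
PATTERNS: List[Tuple[Tuple[str, ...], str]] = [
    (("matmul", "add", "relu"), "lib.gemm_bias_relu"),
    (("matmul", "add"), "lib.gemm_bias"),
]

def replace_patterns(ops: List[str]) -> List[str]:
    """Greedily rewrite the op list, trying longer patterns first."""
    ordered = sorted(PATTERNS, key=lambda p: len(p[0]), reverse=True)
    result: List[str] = []
    i = 0
    while i < len(ops):
        for pattern, kernel in ordered:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                result.append(kernel)  # one library call replaces the match
                i += len(pattern)
                break
        else:
            # No pattern starts at this position: keep the op unchanged.
            result.append(ops[i])
            i += 1
    return result

graph = ["matmul", "add", "relu", "softmax", "matmul", "add"]
rewritten = replace_patterns(graph)
print(rewritten)
```

Trying longer patterns first keeps the shorter `matmul` + `add` rule from consuming the front of a sequence that the three‑operator rule could cover.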

Memory‑Intensive Kernel Fusion

Operator fusion is a major source of performance gain. A typical workload mixes element‑wise operators, dynamic‑shape broadcast, reshape, and reduce operators, and compute‑heavy kernels such as GEMM. In eager PyTorch, each operator launches its own kernel, which causes cache thrashing and kernel‑launch overhead. BladeDISC fuses aggressively: it merges multiple kernels into a single kernel using shared‑memory stitching (AStitch) and index/value caching.

For the memory‑bound workload illustrated in the original article, BladeDISC reduces the kernel count from seven to one, approaching hardware peak performance.
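The effect of fusion on launch count and intermediate buffers can be illustrated with a small pure‑Python model. Here each "kernel" is simply one full pass over the data, and the three‑operator chain (scale, tanh, sum) is an assumed example rather than the workload from the article; real GPU code generation, including AStitch, is far more involved.

```python
# Toy model of fusing memory-bound operators. Each "kernel" here is one full
# pass over the data; this is an illustration, not BladeDISC's code generator.
import math
from typing import Dict, List

launches: Dict[str, int] = {"unfused": 0, "fused": 0}

def unfused(x: List[float]) -> float:
    """Three kernels; each one materializes a full intermediate buffer."""
    scaled = [v * 2.0 for v in x]               # kernel 1: element-wise multiply
    launches["unfused"] += 1
    activated = [math.tanh(v) for v in scaled]  # kernel 2: element-wise tanh
    launches["unfused"] += 1
    total = sum(activated)                      # kernel 3: reduction
    launches["unfused"] += 1
    return total

def fused(x: List[float]) -> float:
    """One kernel: intermediates stay in local variables, not in buffers."""
    launches["fused"] += 1
    total = 0.0
    for v in x:
        total += math.tanh(v * 2.0)
    return total

data = [0.1 * i for i in range(8)]
unfused_result = unfused(data)
fused_result = fused(data)
print(launches)
```

Both versions compute the same value, but the fused one reads the input once and never writes the `scaled` or `activated` buffers, which is where a memory‑bound workload spends most of its time.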

Inplace Mutation Optimization

In eager mode, PyTorch can update a tensor in‑place with operators like aten.add_, avoiding an extra output tensor. However, MLIR requires SSA form, so a naive translation inserts a D2D memcpy, incurring an extra memory copy. BladeDISC marks the input and output buffers as the same in MHLO IR, allowing the generated gpu.store to write directly back to the original buffer, eliminating the copy.
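The difference between the two lowerings can be sketched with Python lists standing in for device buffers. The function names and the `copy_count` counter are illustrative assumptions; the counter only stands in for D2D memcpy calls and nothing here reflects BladeDISC's actual IR.

```python
# Toy model of lowering an in-place add. Lists stand in for device buffers and
# copy_count stands in for device-to-device memcpy calls (illustrative only).
from typing import List

copy_count = 0

def add_ssa_then_copy(dst: List[float], src: List[float]) -> None:
    """Naive lowering: compute a fresh SSA value, then copy it back into dst."""
    global copy_count
    result = [d + s for d, s in zip(dst, src)]  # new output buffer
    dst[:] = result                             # extra copy into the original
    copy_count += 1

def add_aliased(dst: List[float], src: List[float]) -> None:
    """Input and output share one buffer: every store lands directly in dst."""
    for i, s in enumerate(src):
        dst[i] += s

a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0]
delta = [0.5, 0.5, 0.5]

add_ssa_then_copy(a, delta)
add_aliased(b, delta)
print(a, b, copy_count)
```

Both paths leave the same values in the original buffer; the aliased version simply skips the temporary and the copy.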

Benchmark

On NVIDIA A10 and A100 GPUs, PAI‑Blade achieves performance gains of up to 41.6% and 28.4%, respectively, at batch size 1.
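The article does not say whether these gains refer to throughput or to time per step. Assuming they are throughput gains, the short calculation below converts them into the corresponding reduction in time per training step; the helper name `time_saved_percent` is made up for this example.

```python
def time_saved_percent(throughput_gain_percent: float) -> float:
    """Convert a throughput gain into the reduction in time per step.

    Assumes the reported figure is a throughput gain: a gain of g percent
    means each step takes 1 / (1 + g/100) of the original time.
    """
    speedup = 1.0 + throughput_gain_percent / 100.0
    return (1.0 - 1.0 / speedup) * 100.0

for gpu, gain in (("A10", 41.6), ("A100", 28.4)):
    saved = time_saved_percent(gain)
    print(f"{gpu}: +{gain}% throughput -> {saved:.1f}% less time per step")
```

Under that assumption, a 41.6% gain shortens each step by roughly 29%, and a 28.4% gain by roughly 22%.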

Using PAI‑Blade on DSW

Create a DSW instance on the PAI platform with the following custom Docker image:

pai-blade-registry.cn-hangzhou.cr.aliyuncs.com/pai-blade/aicompiler:latest-stablediffusion-torch-2.0.1-cu118

See the PAI DSW documentation listed in the References for details.

Launch a Jupyter Notebook and start the fine‑tuning task:

!cd /opt/StableDiffusion && bash launch_dreambooth_train.sh

When the log indicates completion, run the inference task and view the generated image:

!cd /opt/StableDiffusion && python inference.py && cp dog-bucket.png /mnt/workspace

References

BladeDISC: https://github.com/alibaba/BladeDISC

TorchDynamo: https://pytorch.org/docs/2.1/torch.compiler_deepdive.html

Torch‑MLIR Project: https://github.com/llvm/torch-mlir

PAI DSW documentation: https://help.aliyun.com/zh/pai/user-guide/overview-5
