What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization

BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive GPU memory‑intensive compute enhancements, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.

Alibaba Cloud Big Data AI Platform

Dec 9, 2022

BladeDISC 0.3.0 has been released. It adds comprehensive support for PyTorch 2.0 compilation and deepens collaboration with the Torch‑MLIR community. The update also brings CPU quantization, support for new hardware (Yitian, AArch64), and a host of compiler optimizations.

1 ► PyTorch 2.0 and Dynamic Compilation Support

The team adjusted the TorchBlade compilation architecture to better support PyTorch dynamic compilation and training.

1. TorchDynamo Optimization

With a nightly PyTorch build, BladeDISC can accelerate a model with only two extra lines of code:

import torch
import torch_blade  # extra line 1
model = ...
compiled_model = torch.compile(model, backend='disc')  # extra line 2: compile with the BladeDISC backend
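
The two‑line pattern above can be wrapped so that a script still runs on machines without BladeDISC. The following is a minimal sketch, not part of the release: the helper names `disc_available` and `compile_with_disc` are hypothetical, and the sketch assumes that importing `torch_blade` is what makes the `disc` backend usable, as the snippet above suggests.

```python
import importlib.util


def disc_available() -> bool:
    """Return True only if both torch and torch_blade can be located."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("torch", "torch_blade")
    )


def compile_with_disc(model):
    """Compile `model` with the 'disc' backend if possible, else return it unchanged."""
    if not disc_available():
        return model  # fallback: keep running the uncompiled model

    import torch
    import torch_blade  # noqa: F401  (assumption: the import enables the 'disc' backend)

    return torch.compile(model, backend="disc")
```

Calling `compile_with_disc(model)` in place of `torch.compile(...)` keeps the rest of the script identical whether or not BladeDISC is installed.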

2. TorchBenchmark

BladeDISC uses TorchBenchmark as a guide to evaluate and continuously improve robustness and optimization across diverse models.

3. TorchMLIR (MHLO) and Dynamic Shape Contributions

BladeDISC contributed a Torch‑to‑MHLO conversion module to the Torch‑MLIR project, enhancing its dynamic‑shape support. In the new pipeline, Torch‑MLIR serves as a foundational module of BladeDISC.

4. PyTorch Training Support

BladeDISC now supports training‑time optimizations for models such as BERT, leveraging TorchDynamo and Lazy Tensor Core (the latter pending final API decisions).

5. EasyCV/NLP Inference Acceleration

BEVFormer: this vision‑only perception algorithm for autonomous driving gains a 1.42× end‑to‑end speed‑up.

PAI‑Diffusion Model: AIGC diffusion models receive up to 3× speed‑up.

More details are available in the EasyCV and EasyNLP repositories.

2 ► BladeDISC Quantization (Experimental)

Initial experiments combine compilation and int8 quantization on X86 and ARM platforms, showing significant latency reductions for bert‑mini models.
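
For readers unfamiliar with int8 quantization, the arithmetic can be sketched in a few lines. This is a generic symmetric per‑tensor scheme written in plain Python for illustration; the source does not say which quantization scheme BladeDISC uses, and the function names here are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored value differs from the original by at most about half a step.
```

The latency benefit comes from running GEMM and convolution on the 8‑bit integers; the price is the rounding error bounded by half of `scale`.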

3 ► BladeDISC Compiler Optimizations

1. New Hardware Support: AArch64 (Yitian)

Added BF16/int8 GEMM/Conv support to exploit Yitian capabilities.

Customized Arm Compute Library for dynamic shape and high‑concurrency scenarios.

Improved code generation for memory‑intensive operators (Stitch‑CPU, reshape support, op duplication).

2. GPU Memory‑Intensive Compute Codegen Enhancements

Fused independent control‑flow blocks to reduce redundant work and increase ILP.

Optimized row‑reduce schedule selection based on shape.

Vectorized element‑wise fusion via instruction interleaving.

Loop unroll, instruction interleave, and loop‑invariant code motion.

Removed unused kernel arguments to lower launch overhead.

Enable these experiments with DISC_MEM_INTENSIVE_OPT_EXPERIMENTAL=true.
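
From Python, the flag can be set in the process environment before any compilation happens. This is a minimal sketch under one assumption that the source does not state: that BladeDISC reads the variable from the environment of the running process, so it is set here before `torch_blade` would be imported.

```python
import os

# Set the experimental flag first, so it is visible by the time BladeDISC
# starts compiling (assumption: it is read from the process environment).
os.environ["DISC_MEM_INTENSIVE_OPT_EXPERIMENTAL"] = "true"

# Only after this point:
#   import torch_blade
#   compiled_model = torch.compile(model, backend='disc')
```

Exporting the variable in the shell before launching the script has the same effect.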

3. Shape Constraint IR

A new IR treats shape constraints as first‑class citizens, enabling richer optimizations for dynamic‑shape graphs.
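
To make the idea concrete, here is a toy illustration of one kind of shape constraint: facts of the form "these two symbolic dimensions are equal", stored in a union‑find structure. It is only a sketch of the concept in Python, not BladeDISC's IR; the class and dimension names are hypothetical.

```python
class ShapeConstraints:
    """Toy store of 'these symbolic dimensions are equal' facts (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, dim):
        self._parent.setdefault(dim, dim)
        while self._parent[dim] != dim:
            # Path halving keeps later lookups short.
            self._parent[dim] = self._parent[self._parent[dim]]
            dim = self._parent[dim]
        return dim

    def add_equal(self, a, b):
        """Record that dimensions `a` and `b` always have the same size."""
        self._parent[self._find(a)] = self._find(b)

    def same_dim(self, a, b):
        return self._find(a) == self._find(b)

    def same_shape(self, shape_a, shape_b):
        return len(shape_a) == len(shape_b) and all(
            self.same_dim(a, b) for a, b in zip(shape_a, shape_b)
        )


constraints = ShapeConstraints()
constraints.add_equal("x.d0", "y.d0")
constraints.add_equal("x.d1", "y.d1")
constraints.add_equal("y.d0", "z.d0")

# Sizes are unknown until run time, yet x and y are provably the same shape.
can_fuse = constraints.same_shape(("x.d0", "x.d1"), ("y.d0", "y.d1"))
```

With such facts available at compile time, a compiler can decide, for example, that two element‑wise operators may share one loop nest even though no dimension has a concrete value.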

4. Enhanced Custom Library Call Support

Leveraging MLIR PDL, developers can provide a pattern description and a runtime‑compatible kernel to replace patterns without recompiling BladeDISC.
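
The underlying idea, matching a described pattern and swapping in a library call, can be sketched with a toy rewriter over a flat list of operator names. This is purely illustrative: the real mechanism works on MLIR with PDL patterns, and the op names and the kernel name `my_fused_gemm_gelu` below are hypothetical.

```python
def replace_patterns(ops, patterns):
    """Replace each matching run of op names with a single library call."""
    result, i = [], 0
    while i < len(ops):
        for pattern, kernel in patterns.items():
            # Skip empty patterns; compare the next len(pattern) ops otherwise.
            if pattern and tuple(ops[i:i + len(pattern)]) == pattern:
                result.append(kernel)
                i += len(pattern)
                break
        else:
            # No pattern starts at this position: keep the op as it is.
            result.append(ops[i])
            i += 1
    return result


patterns = {("matmul", "add", "gelu"): "call:my_fused_gemm_gelu"}
graph = ["reshape", "matmul", "add", "gelu", "softmax"]
rewritten = replace_patterns(graph, patterns)
```

Because the pattern table is data rather than compiler code, adding a new entry does not require rebuilding the rewriter, which mirrors the "no recompilation" property described above.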

5. Runtime Abstraction Layer Improvements

Redesigned constant storage to support large‑model weights.

Improved performance under concurrency by more than 20% in high‑throughput scenarios.

4 ► Ongoing Work

1. CUTLASS GEMM CodeGen

Integrated CUTLASS for compute‑intensive operator fusion and code generation, achieving acceleration on BERT models (enable with DISC_ENABLE_COMPUTE_INTENSIVE_FUSE=true).
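
Since this feature is still in progress, it can be convenient to enable the flag for a single run rather than for the whole shell session. The sketch below launches a child Python process with the variable added to its environment; the helper name `run_with_flags` is hypothetical, and the one‑line child script only echoes the flag where a real benchmark script would go.

```python
import os
import subprocess
import sys


def run_with_flags(python_args, flags):
    """Run a child Python process with extra environment variables set."""
    env = {**os.environ, **flags}  # the parent's environment is left untouched
    return subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


result = run_with_flags(
    ["-c", "import os; print(os.environ['DISC_ENABLE_COMPUTE_INTENSIVE_FUSE'])"],
    {"DISC_ENABLE_COMPUTE_INTENSIVE_FUSE": "true"},
)
```

Running the same script once with and once without the flag gives a like‑for‑like comparison, because each child process starts from a clean copy of the environment.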

2. MLIR Transform Dialect Based CodeGen

Developing pattern‑based code generation for dynamic‑shape workloads, targeting performance parity with ACL on AArch64.

3. Sparse Model Support for Recommendation Systems

Initial CPU codegen and fusion for sparse operators in TensorFlow recommendation models, with plans for broader coverage and AVX optimizations.

For more information, visit the open‑source repository: https://github.com/alibaba/BladeDISC
