What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization
BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive GPU memory‑intensive compute enhancements, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.
BladeDISC released version 0.3.0, adding comprehensive support for PyTorch 2.0 compilation and deepening collaboration with the Torch‑MLIR community. The update brings CPU quantization, new hardware (AArch64 – Yitian) support, and a host of compiler optimizations.
1 ► PyTorch 2.0 and Dynamic Compilation Support
The team adjusted the TorchBlade compilation architecture to better support PyTorch dynamic compilation and training.
1. TorchDynamo Optimization
With a nightly PyTorch build installed, BladeDISC can accelerate a model with only two extra lines of code:
import torch
import torch_blade  # extra line 1: makes the 'disc' backend available
model = ...
compiled_model = torch.compile(model, backend='disc')  # extra line 2
2. TorchBenchmark
BladeDISC uses TorchBenchmark as a guide to evaluate and continuously improve robustness and optimization across diverse models.
3. TorchMLIR (MHLO) and Dynamic Shape Contributions
BladeDISC contributed a Torch‑to‑MHLO conversion module to the Torch‑MLIR project, enhancing dynamic‑shape support. The new pipeline places Torch‑MLIR as the foundational module within BladeDISC.
4. PyTorch Training Support
BladeDISC now supports training‑time optimizations for models such as BERT, leveraging TorchDynamo and Lazy Tensor Core (the latter pending final API decisions).
5. EasyCV/NLP Inference Acceleration
BEVFormer: this vision‑only perception model for autonomous driving runs 1.42× faster end to end.
PAI‑Diffusion Model: AIGC diffusion models receive up to 3× speed‑up.
More details are available in the EasyCV and EasyNLP repositories.
2 ► BladeDISC Quantization (Experimental)
Initial experiments combine compilation and int8 quantization on X86 and ARM platforms, showing significant latency reductions for bert‑mini models.
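For readers unfamiliar with int8 quantization, the following is a minimal sketch of symmetric per‑tensor quantization in plain Python. It is only an illustration of the general idea; the release notes do not say which quantization scheme BladeDISC uses, and the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Illustrative values only, not taken from any real model.
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# The round trip loses at most half a quantization step per value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The latency benefit comes from running GEMM and convolution on the 8‑bit codes; the scale is applied afterwards to map results back to floating point.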
3 ► BladeDISC Compiler Optimizations
1. New Hardware Support: AArch64 (Yitian)
Added BF16/int8 GEMM/Conv support to exploit Yitian capabilities.
Customized Arm Compute Library for dynamic shape and high‑concurrency scenarios.
Improved code generation for memory‑intensive operators (Stitch‑CPU, reshape support, op duplication).
2. GPU Memory‑Intensive Compute Codegen Enhancements
Fused independent control‑flow blocks to reduce redundant work and increase ILP.
Optimized row‑reduce schedule selection based on shape.
Vectorized element‑wise fusion via instruction interleaving.
Loop unroll, instruction interleave, and loop‑invariant code motion.
Removed unused kernel arguments to lower launch overhead.
Enable these experiments with DISC_MEM_INTENSIVE_OPT_EXPERIMENTAL=true.
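One way to toggle the flag from a Python script is sketched below. The variable name comes from the text above; the `disc_flag` helper is hypothetical, and the sketch assumes BladeDISC reads the variable when it compiles, which the release notes do not state. If in doubt, export the variable in the shell before starting the process instead.

```python
import os
from contextlib import contextmanager


@contextmanager
def disc_flag(name: str, value: str = "true"):
    """Temporarily set an environment variable and restore it afterwards."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with disc_flag("DISC_MEM_INTENSIVE_OPT_EXPERIMENTAL"):
    flag_inside = os.environ["DISC_MEM_INTENSIVE_OPT_EXPERIMENTAL"]
    # Compile and benchmark the model here (assumption: the flag is
    # picked up at compilation time).
```

Scoping the flag this way makes it easy to compare runs with and without the experimental optimizations in the same script.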
3. Shape Constraint IR
A new IR treats shape constraints as first‑class citizens, enabling richer optimizations for dynamic‑shape graphs.
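To make the idea concrete, here is a toy sketch of what tracking shape constraints explicitly can buy a compiler. It is not BladeDISC's IR: the `ShapeConstraints` class and the symbol names are hypothetical, and only dimension‑equality constraints are modeled, using a small union‑find structure.

```python
from typing import Dict, Tuple

Shape = Tuple[str, ...]  # each dimension is a symbol such as "s0"


class ShapeConstraints:
    """Toy store of dimension-equality constraints (union-find over symbols)."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, dim: str) -> str:
        self._parent.setdefault(dim, dim)
        while self._parent[dim] != dim:
            # Path halving keeps later lookups cheap.
            self._parent[dim] = self._parent[self._parent[dim]]
            dim = self._parent[dim]
        return dim

    def add_equal(self, a: str, b: str) -> None:
        """Record that two symbolic dimensions always have the same size."""
        self._parent[self._find(a)] = self._find(b)

    def same_shape(self, lhs: Shape, rhs: Shape) -> bool:
        """True if the two shapes are provably equal under the constraints."""
        return len(lhs) == len(rhs) and all(
            self._find(a) == self._find(b) for a, b in zip(lhs, rhs)
        )


constraints = ShapeConstraints()
x: Shape = ("s0", "s1")  # sizes unknown until run time
y: Shape = ("s2", "s3")
before = constraints.same_shape(x, y)  # False: nothing is known yet
constraints.add_equal("s0", "s2")
constraints.add_equal("s1", "s3")
after = constraints.same_shape(x, y)  # True: shapes are provably equal
```

Once two tensors are known to share a shape, even though the concrete sizes are unknown, a compiler can, for example, fuse element‑wise operations on them into one loop nest without run‑time checks.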
4. Enhanced Custom Library Call Support
Leveraging MLIR PDL, developers can provide a pattern description and a runtime‑compatible kernel to replace patterns without recompiling BladeDISC.
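The workflow can be pictured with the toy sketch below. It is not PDL and does not use any BladeDISC API: real PDL patterns match on the dataflow graph (operands and results), whereas this sketch matches a linear sequence of op names. The function `replace_pattern`, the op names, and the kernel name are all hypothetical.

```python
from typing import List, Sequence


def replace_pattern(ops: Sequence[str], pattern: Sequence[str], kernel: str) -> List[str]:
    """Replace each non-overlapping occurrence of `pattern` with one `kernel` call."""
    result: List[str] = []
    i = 0
    while i < len(ops):
        if pattern and tuple(ops[i:i + len(pattern)]) == tuple(pattern):
            result.append(kernel)  # the whole match becomes one library call
            i += len(pattern)
        else:
            result.append(ops[i])
            i += 1
    return result


graph = ["matmul", "add", "gelu", "matmul", "add", "softmax"]
rewritten = replace_pattern(graph, ["matmul", "add", "gelu"], "custom_call@fused_ffn")
# rewritten == ["custom_call@fused_ffn", "matmul", "add", "softmax"]
```

The point of the design is that the pattern and the kernel are supplied as data at run time, so the compiler itself does not need to be rebuilt to pick up a new fused kernel.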
5. Runtime Abstraction Layer Improvements
Redesigned constant storage to support large‑model weights.
Improved performance under concurrent execution by over 20% in high‑throughput scenarios.
4 ► Ongoing Work
1. CUTLASS GEMM CodeGen
Integrated CUTLASS for compute‑intensive operator fusion and code generation, achieving acceleration on BERT models (enable with DISC_ENABLE_COMPUTE_INTENSIVE_FUSE=true).
2. MLIR Transform Dialect Based CodeGen
Developing pattern‑based code generation for dynamic‑shape workloads, targeting performance parity with ACL on AArch64.
3. Sparse Model Support for Recommendation Systems
Initial CPU codegen and fusion for sparse operators in TensorFlow recommendation models, with plans for broader coverage and AVX optimizations.
For more information, visit the open‑source repository: https://github.com/alibaba/BladeDISC
Alibaba Cloud Big Data AI Platform
