
BladeDISC++: A Dynamic‑Shape AI Compiler for Memory‑Peak Optimization in Deep Learning Training

This article introduces BladeDISC++, a dynamic‑shape AI compiler from Alibaba Cloud PAI. It explains the memory‑peak challenges of dynamic‑shape deep‑learning workloads, describes the compiler's symbolic‑shape graph and its joint compile‑time/runtime optimizations (operation fusion, operation scheduling, and just‑in‑time rematerialization), and presents Llama2 experiments showing significant GPU memory savings and throughput gains.


Recent advances in deep learning have produced models with highly dynamic shapes, creating new challenges for AI compilers, which must manage uncertain tensor dimensions, dynamic memory allocation, and increased optimization complexity. Existing compilers such as TVM and OpenXLA focus on static shapes and lack effective memory optimizations for dynamic workloads.

To address these issues, Alibaba Cloud PAI released BladeDISC++, a dynamic‑shape AI compiler built on MLIR that serves as the backend for TorchAcc. BladeDISC++ introduces three core innovations: (1) operation fusion to reduce kernel launch overhead, (2) operation scheduling that reorders operators based on symbolic shape analysis to lower memory peaks, and (3) just‑in‑time (JIT) auto‑rematerialization that combines compile‑time and runtime decisions to offload or recompute tensors.
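To give a feel for the scheduling idea, here is a minimal, illustrative sketch (not the actual BladeDISC++ algorithm): among operators whose inputs are ready, greedily pick the one that keeps resident tensor memory lowest, freeing inputs once their last consumer has run. All names and the cost model are assumptions for the example.

```python
# Hypothetical memory-aware operator scheduling sketch. Inputs:
#   ops       - list of op names
#   deps[op]  - set of ops whose outputs op consumes
#   out_bytes - estimated output size of each op in bytes
def schedule(ops, deps, out_bytes):
    # Track which consumers of each op's output have not yet run.
    remaining_users = {op: set() for op in ops}
    for op in ops:
        for d in deps[op]:
            remaining_users[d].add(op)

    done, order = set(), []
    live, peak = 0, 0  # bytes currently resident / worst case seen

    while len(done) < len(ops):
        ready = [op for op in ops if op not in done and deps[op] <= done]

        def after(op):
            # Memory after running op: add its output, free inputs
            # whose only remaining user is op itself.
            freed = sum(out_bytes[d] for d in deps[op]
                        if remaining_users[d] == {op})
            return live + out_bytes[op] - freed

        op = min(ready, key=after)           # greedy: lowest resulting memory
        peak = max(peak, live + out_bytes[op])  # transient peak while op runs
        live = after(op)
        for d in deps[op]:
            remaining_users[d].discard(op)
        done.add(op)
        order.append(op)
    return order, peak
```

A real compiler works over a full dataflow graph with symbolic (not concrete) sizes, but the same peak-versus-schedule trade-off is what the symbolic analysis below is comparing.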

The compiler constructs a Symbolic Shape Graph to represent unknown dimensions and their relationships, enabling the creation of Symbolic Expressions that estimate tensor element counts. By simplifying these expressions, BladeDISC++ can compare memory usage of alternative schedules even without concrete shapes.
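As a rough illustration of how such comparisons can work without concrete shapes, the sketch below uses SymPy in place of the compiler's own symbolic-expression machinery; the symbol names and the comparison helper are assumptions for the example.

```python
# Sketch: comparing symbolic tensor element counts, standing in for
# BladeDISC++'s Symbolic Expressions (names here are illustrative).
import sympy as sp

def larger(expr_a, expr_b):
    """Return which element count is larger, if it can be decided symbolically."""
    diff = sp.simplify(expr_a - expr_b)
    if diff.is_positive:
        return "a"
    if diff.is_negative:
        return "b"
    if diff == 0:
        return "equal"
    return "unknown"  # depends on runtime values

# Unknown batch, sequence length, and hidden size, known to be positive.
b, s, h = sp.symbols("b s h", positive=True, integer=True)

print(larger(2 * b * s * h, b * s * h))  # "a": holds for any b, s, h
print(larger(b * s * h, b * s * s))      # "unknown": depends on h vs. s
```

The second call shows why simplification matters: some schedule comparisons resolve for all possible shapes, while others genuinely need runtime information, which is where the JIT rematerialization below takes over.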

In the dynamic‑shape environment, BladeDISC++ uses the symbolic graph to guide both compile‑time insertion of potential rematerialization points (EvictOp, RegenerateOp) and runtime monitoring of actual tensor shapes to decide whether to offload tensors to CPU memory or recompute them, thereby keeping memory usage below a predefined threshold.
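The runtime side of this decision might look like the following sketch. EvictOp and RegenerateOp are the article's terms; the cost model and function names here are made up for illustration.

```python
# Hypothetical runtime eviction policy: once projected memory exceeds the
# threshold, evict tensors until under budget, choosing for each whether
# recomputation or CPU offload is the cheaper way to bring it back.
def plan_evictions(live, threshold, recompute_cost, offload_cost):
    """live: {tensor: bytes}. Returns [(tensor, 'recompute' | 'offload')]."""
    over = sum(live.values()) - threshold
    if over <= 0:
        return []  # actual shapes fit; no rematerialization needed

    plan = []
    # Evict the largest tensors first to free the budget quickly.
    for name, size in sorted(live.items(), key=lambda kv: -kv[1]):
        if over <= 0:
            break
        action = ("recompute"
                  if recompute_cost.get(name, float("inf"))
                  <= offload_cost.get(name, float("inf"))
                  else "offload")
        plan.append((name, action))
        over -= size
    return plan
```

Because the decision runs with concrete shapes in hand, batches that happen to be small pay no rematerialization cost at all, which is the advantage over committing to a static plan at compile time.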

Performance evaluation on a trimmed Llama2‑7B model (1B parameters) shows that BladeDISC++ achieves memory‑peak reductions comparable to static‑shape optimizations while maintaining or improving throughput. At batch size 14, memory usage matches static‑shape results; at batch size 18, BladeDISC++ avoids out‑of‑memory failures and improves throughput by about 11% compared to static‑shape padding strategies.

The work demonstrates that symbolic‑shape‑driven scheduling and joint compile‑time/runtime rematerialization can effectively mitigate the core challenges of dynamic‑shape training, offering a practical solution for large‑scale LLM training with limited GPU memory.

Tags: Memory Optimization · Llama2 · AI Compiler · BladeDISC++ · Dynamic Shape · Operation Scheduling · Symbolic Shape Graph
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
