How PAI‑TorchAcc Supercharges Large‑Model Training on Alibaba Cloud

PAI‑TorchAcc, a training accelerator from Alibaba Cloud's AI platform, offers a seamless PyTorch interface that integrates HuggingFace models. It combines LazyTensor‑based static graph conversion, multi‑strategy distributed training, and extensive GPU optimizations to boost throughput for 1B‑175B parameter models, surpassing native PyTorch and Megatron‑LM.

Alibaba Cloud Big Data AI Platform

Feb 23, 2024

01 Introduction

PAI‑TorchAcc (Torch Accelerator) is a large‑model training acceleration framework built on PyTorch by PAI, Alibaba Cloud's AI platform.

It provides a concise, easy‑to‑use interface that can directly import models from HuggingFace without conversion and accelerate training with multiple distributed strategies.

Leveraging the community PyTorch/XLA project and LazyTensor technology, PAI‑TorchAcc converts PyTorch code into a static execution graph and then applies extensive GPU‑level distributed and compute optimizations tailored to the underlying Alibaba Cloud resources.

Thanks to the simple model integration and graph‑based optimizations, it flexibly supports large models ranging from 1B to 175B parameters and runs on a range of hardware. Compared with native PyTorch and Megatron‑LM, it improves throughput (for example, the LLaMA series achieves 140% higher throughput than PyTorch and 5% higher than Megatron‑LM), reaches 70% MFU on A100, and scales near‑linearly, with a 15.6× speedup when going from 8 to 128 GPUs.

02 Background and Requirements

Large language models and multimodal models have grown to billions or trillions of parameters, delivering unprecedented performance but incurring huge training costs. Training an OPT‑175B model with Megatron‑LM on thousands of A100 GPUs can take two months with low hardware utilization, and fine‑tuning LLaMA‑2‑70B with PyTorch FSDP still requires many high‑end GPUs.
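To make the scale concrete, here is a rough back‑of‑the‑envelope sketch. It assumes mixed‑precision Adam with about 16 bytes of model state per parameter (2 for half‑precision weights, 2 for gradients, 12 for FP32 master weights and the two Adam moments) and an 80 GB GPU, and it ignores activations entirely. These are common rules of thumb, not figures from the article, and the helper names are made up for illustration.

```python
import math

# Assumed bytes of model state per parameter under mixed-precision Adam:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights, momentum, variance (12).
BYTES_PER_PARAM = 2 + 2 + 12
GPU_MEMORY_GB = 80  # assumed 80 GB device


def model_state_gb(num_params: float) -> float:
    """Memory for weights, gradients, and optimizer state, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus_for_state(num_params: float, gpu_gb: float = GPU_MEMORY_GB) -> int:
    """Lower bound on GPUs needed just to hold fully sharded model state."""
    return math.ceil(model_state_gb(num_params) / gpu_gb)


for name, n in [("LLaMA-2-70B", 70e9), ("OPT-175B", 175e9)]:
    print(f"{name}: {model_state_gb(n):.0f} GB of state, >= {min_gpus_for_state(n)} GPUs")
```

Under these assumptions a 70B model needs at least 14 such GPUs and a 175B model at least 35 before a single activation is stored, which is why sharding and memory optimizations matter so much at this scale.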

Accelerating pre‑training, continued training, and fine‑tuning across diverse hardware while improving resource utilization is an effective way to reduce costs.

Existing frameworks such as Megatron‑LM, DeepSpeed, and PyTorch/XLA each have limitations: inflexible model conversion, manual operator optimizations, lack of automatic adaptation to different compute patterns, and limited GPU‑specific support.

03 Core Technical Features of PAI‑TorchAcc

Flexible model integration: supports common large models (LLaMA, Qwen, BaiChuan, ChatGLM, OLMo, Bloom) from 1B to 175B parameters, with seamless HuggingFace import and one‑click acceleration.

Parameter‑scale support: already enables training of models with up to 175B parameters.

Comprehensive training modes: mixed precision (FP32, FP16, BFloat16) across pre‑training, fine‑tuning, and continued training.

Combined distributed strategies: Data Parallel, Tensor Parallel, Sequence Parallel, Fully Sharded Data Parallel, Pipeline Parallel, and their combinations.

Automatic compute and memory optimizations: gradient checkpointing, automatic rematerialization, memory planning, kernel compilation, and integration of state‑of‑the‑art kernels.

Hardware compatibility: supports NVIDIA A100/A800, H100/H800, and V100 GPUs as well as Alibaba Cloud Lingjun clusters.
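Combining distributed strategies amounts to factoring the GPU count into per‑strategy degrees. The sketch below shows one way that bookkeeping could look. `ParallelLayout` and its rank ordering are hypothetical illustrations, not part of the TorchAcc API; the ordering (tensor‑parallel index varying fastest) is an assumed convention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: degrees of data, pipeline, and tensor parallelism."""
    dp: int
    pp: int
    tp: int

    @property
    def world_size(self) -> int:
        # The three degrees must multiply to the total number of GPUs.
        return self.dp * self.pp * self.tp

    def coords(self, rank: int) -> Tuple[int, int, int]:
        """Map a global rank to (dp, pp, tp) indices, tensor-parallel index varying fastest."""
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside world of size {self.world_size}")
        tp_idx = rank % self.tp
        pp_idx = (rank // self.tp) % self.pp
        dp_idx = rank // (self.tp * self.pp)
        return dp_idx, pp_idx, tp_idx

    def tp_group(self, rank: int) -> List[int]:
        """Global ranks that share this rank's tensor-parallel group."""
        base = rank - rank % self.tp
        return list(range(base, base + self.tp))


layout = ParallelLayout(dp=4, pp=2, tp=4)  # 4 * 2 * 4 = 32 GPUs
print(layout.world_size)    # 32
print(layout.coords(13))    # (1, 1, 1)
print(layout.tp_group(13))  # [12, 13, 14, 15]
```

Keeping tensor‑parallel ranks contiguous is a common choice because those groups communicate most often and so benefit from sitting on the same node; other orderings are equally valid.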

04 PAI‑TorchAcc Architecture

The architecture is layered from top to bottom:

Model layer: accelerates vision, NLP, and speech‑synthesis models.

Algorithm libraries: HuggingFace Transformers, PAI‑EasyNLP, TIMM, etc.

Front‑end: PyTorch‑based model definition.

Lowering: transforms front‑end code into static graphs via LazyTensor and symbolic tracing.

IR: high‑level device‑agnostic IR and low‑level device‑specific IR for graph and backend optimizations.

Compilation engine: TorchAcc Compiler, BladeDISC, and OpenXLA, which apply distributed, memory, communication, and compute optimizations.

Hardware: generates device‑specific code executed on various GPU configurations.
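The lowering step is easier to picture with a toy model. The sketch below is a minimal pure‑Python illustration of the lazy‑tensor idea: operators record nodes into a graph instead of computing, and the recorded graph is executed later. The `Graph` and `Lazy` classes are invented for this example and are not how PyTorch/XLA or TorchAcc is implemented.

```python
from typing import Callable, Dict, List, Tuple

# Toy op table; a real backend would lower these ops to device kernels.
OPS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


class Graph:
    """A recorded static graph: a flat list of nodes in execution order."""

    def __init__(self) -> None:
        # Each node is ("input", name) or (op_name, lhs_id, rhs_id).
        self.nodes: List[Tuple] = []

    def add(self, node: Tuple) -> "Lazy":
        self.nodes.append(node)
        return Lazy(self, len(self.nodes) - 1)

    def input(self, name: str) -> "Lazy":
        return self.add(("input", name))

    def run(self, output: "Lazy", feeds: Dict[str, float]) -> float:
        """Execute the recorded graph up to `output` with concrete inputs."""
        values: List[float] = []
        for node in self.nodes[: output.node_id + 1]:
            if node[0] == "input":
                values.append(feeds[node[1]])
            else:
                op, lhs, rhs = node
                values.append(OPS[op](values[lhs], values[rhs]))
        return values[output.node_id]


class Lazy:
    """Placeholder value whose operators record graph nodes instead of computing."""

    def __init__(self, graph: Graph, node_id: int) -> None:
        self.graph, self.node_id = graph, node_id

    def __add__(self, other: "Lazy") -> "Lazy":
        return self.graph.add(("add", self.node_id, other.node_id))

    def __mul__(self, other: "Lazy") -> "Lazy":
        return self.graph.add(("mul", self.node_id, other.node_id))


def model(x, w, b):
    return x * w + b  # ordinary-looking code; with Lazy inputs nothing is computed here


g = Graph()
y = model(g.input("x"), g.input("w"), g.input("b"))
print(g.nodes)
print(g.run(y, {"x": 3.0, "w": 2.0, "b": 1.0}))  # 7.0
print(g.run(y, {"x": 5.0, "w": 2.0, "b": 1.0}))  # 11.0
```

Because the whole computation is visible as data before anything runs, a compiler can rewrite, fuse, or partition it, and the same recorded graph is reused across iterations.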

05 Interface and Usage

PAI‑TorchAcc provides a concise API that accelerates a PyTorch model with only a few added lines and no changes to the model definition.

Typical workflow (three steps):

Define torchacc.Config with desired acceleration options.

Call torchacc.accelerate(model, config) to prepare the model.

Wrap the data loader with torchacc.AsyncLoader for asynchronous loading.

import torchacc

model = ...    # any PyTorch model, e.g. one loaded from HuggingFace
dloader = ...  # an existing PyTorch data loader

# One-line acceleration (a torchacc.Config can be passed for richer options)
model = torchacc.accelerate(model)

# Asynchronous data loading
dloader = torchacc.AsyncLoader(dloader, model.device)

model.train()
for source, labels in dloader:
    ...

The compiler converts the PyTorch code into a static graph, then performs graph‑level optimizations (distributed, memory, compute, communication) and backend compilation (OpenXLA, BladeDISC) to generate efficient device code.

06 Performance and Practice Cases

On A100 GPUs, PAI‑TorchAcc reaches 70% MFU and scales almost linearly from 8 to 128 GPUs (a 15.6× speedup). Compared with native PyTorch and Megatron‑LM, it delivers higher throughput on common open‑source large models; for example, the LLaMA series achieves 140% higher throughput than PyTorch and 5% higher than Megatron‑LM.
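The arithmetic behind these figures can be checked with a short sketch. It assumes the A100's dense BF16/FP16 peak of 312 TFLOPS and the common approximation of about 6 FLOPs per parameter per training token (which ignores attention FLOPs). The 7B model size and the 5,200 tokens/s per GPU are illustrative inputs chosen to land on 70%, not measurements from the article.

```python
A100_PEAK_FLOPS = 312e12  # assumed dense BF16/FP16 tensor-core peak for one A100


def scaling_efficiency(speedup: float, gpus_before: int, gpus_after: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / (gpus_after / gpus_before)


def relative_throughput(gain_percent: float) -> float:
    """Convert an 'X% higher throughput' claim into a multiplier."""
    return 1.0 + gain_percent / 100.0


def mfu(num_params: float, tokens_per_sec_per_gpu: float,
        peak_flops: float = A100_PEAK_FLOPS) -> float:
    """Model FLOPs utilization using the ~6 * N FLOPs-per-token approximation."""
    return 6.0 * num_params * tokens_per_sec_per_gpu / peak_flops


print(f"{scaling_efficiency(15.6, 8, 128):.1%}")  # 97.5% of ideal 16x
print(f"{relative_throughput(140):.1f}x")         # 2.4x native PyTorch
print(f"{relative_throughput(5):.2f}x")           # 1.05x Megatron-LM
print(f"{mfu(7e9, 5200):.0%}")                    # hypothetical inputs giving 70%
```

Read this way, a 15.6× speedup on 16× the GPUs corresponds to 97.5% scaling efficiency, and a 140% gain means 2.4 times the baseline throughput.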

Future articles will present a concrete case study of accelerating OLMo model training with detailed performance analysis.

07 Summary and Future Directions

PAI‑TorchAcc enables flexible integration of PyTorch models and accelerates training through parallelism, memory, compute, and scheduling optimizations. It has demonstrated strong results on models such as LLaMA, LLaMA‑2, BaiChuan, ChatGLM, Qwen, OLMo, and Bloom.

Planned improvements include:

Graph Capture and sub‑graph compilation to handle unsupported operators.

Automatic distributed strategy selection based on static graphs and hardware characteristics.

AutoGC for automatic checkpoint placement.

Dynamic‑shape performance optimization to reduce recompilation overhead.

Further enhancements to the in‑house BladeDISC compiler.
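To illustrate why dynamic shapes cause recompilation overhead, the sketch below counts compilations in a toy cache keyed by sequence length, with and without padding lengths up to power‑of‑two buckets. Bucketing is one commonly used mitigation in static‑graph backends; it is shown here as an assumption for illustration, not as the mechanism PAI‑TorchAcc plans to use. `CompileCache` and `next_power_of_two` are hypothetical names.

```python
from typing import Callable, Dict, Optional


class CompileCache:
    """Tracks which input lengths would trigger a (re)compilation."""

    def __init__(self, bucket: Optional[Callable[[int], int]] = None) -> None:
        # With no bucketing function, every distinct length is its own cache key.
        self.bucket = bucket or (lambda n: n)
        self.compiled: Dict[int, str] = {}

    def run(self, seq_len: int) -> int:
        padded = self.bucket(seq_len)
        if padded not in self.compiled:
            # Stand-in for an expensive static-graph compilation.
            self.compiled[padded] = f"executable_for_len_{padded}"
        return padded


def next_power_of_two(n: int) -> int:
    """Round n (>= 1) up to the nearest power of two."""
    return 1 << (n - 1).bit_length()


lengths = [37, 41, 64, 100, 120, 128, 130, 200]

exact = CompileCache()
bucketed = CompileCache(bucket=next_power_of_two)
for n in lengths:
    exact.run(n)
    bucketed.run(n)

print(len(exact.compiled))       # 8 compilations, one per distinct length
print(len(bucketed.compiled))    # 3 compilations
print(sorted(bucketed.compiled))  # [64, 128, 256]
```

The trade‑off is that padding wastes some compute on the filler tokens, so bucket sizes are usually tuned against the cost of an extra compilation.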
