How PAI‑TorchAcc Supercharges OLMo LLM Training with Up to 1.64× Speedup
PAI‑TorchAcc, Alibaba Cloud’s PyTorch accelerator, now supports the open‑source OLMo large language model and delivers up to 1.64× faster training on OLMo‑1B and up to 1.52× on OLMo‑7B. The gains come from graph capture combined with distributed, compute, communication, and memory optimizations. This article covers the usage steps and a performance analysis.
01 PAI‑TorchAcc Overview
PAI‑TorchAcc (Torch Accelerator) is an Alibaba Cloud Machine Learning Platform framework that accelerates large‑model training on PyTorch. It uses PyTorch/XLA’s GraphCapture to convert dynamic graphs into static computation graphs, enabling distributed, compute, and memory optimizations for models including large language models.
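To make the dynamic-to-static idea concrete, here is a toy sketch in plain Python. It is not PyTorch/XLA's implementation: the `LazyValue` class and the `const`, `count_nodes`, and `run_graph` helpers are hypothetical names invented for illustration. Operations are recorded as graph nodes instead of executing immediately, and the graph is evaluated only when a result is requested, which is what gives a compiler the chance to optimize it as a whole.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyValue:
    """A node in a recorded computation graph (hypothetical, for illustration)."""
    op: str                                  # "const", "add", or "mul"
    inputs: Tuple["LazyValue", ...] = ()
    value: float = 0.0                       # only used by "const" nodes

    def __add__(self, other: "LazyValue") -> "LazyValue":
        return LazyValue("add", (self, other))   # record the op, do not compute

    def __mul__(self, other: "LazyValue") -> "LazyValue":
        return LazyValue("mul", (self, other))   # record the op, do not compute


def const(x: float) -> LazyValue:
    return LazyValue("const", value=x)


def count_nodes(node: LazyValue) -> int:
    """Number of nodes in the captured expression tree."""
    return 1 + sum(count_nodes(i) for i in node.inputs)


def run_graph(node: LazyValue) -> float:
    """Evaluate the whole captured graph in one pass."""
    if node.op == "const":
        return node.value
    args = [run_graph(i) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError(f"unknown op: {node.op}")


# Ordinary-looking Python builds a graph instead of computing eagerly.
y = (const(2.0) + const(3.0)) * const(4.0)
captured_ops = count_nodes(y)   # the graph exists before anything is computed
result = run_graph(y)           # execution happens only here
```

A real graph-capture system works on tensors and hands the recorded graph to a compiler. The sketch only shows the separation between recording and executing.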
02 Fully Open‑Source OLMo Model
OLMo (Open Language Model) is a fully open‑source LLM released by the Allen Institute for AI and partners. It provides the full training data, code, and checkpoints, and matches or exceeds LLaMA‑2 on several core metrics.
03 How to Use PAI‑TorchAcc to Accelerate OLMo Training
Accelerating training with PAI‑TorchAcc involves three steps:
Define torchacc.Config and set acceleration options.
Call torchacc.accelerate with the model and config to prepare accelerated training.
Wrap the data loader with torchacc.AsyncLoader to speed up data loading.
# Imports needed by this snippet
import torchacc
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define model and dataloader
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B", use_fast=False, trust_remote_code=True)
train_loader = get_dataloader(tokenizer)  # get_dataloader is a user-defined helper, not shown here
# Define TorchAcc Config
config = torchacc.Config()
config.compute.bf16 = True  # enable bf16
config.compute.acc_scaled_dot_attn = True  # replace scaled dot-product attention with flash attention
config.dist.fsdp.size = torchacc.dist.world_size()  # enable FSDP across all ranks
config.dist.fsdp.wrap_layer_cls = {"OlmoSequentialBlock"}  # wrap OLMo decoder layers
# One‑line model acceleration
model = torchacc.accelerate(model, config)
# Asynchronous data loading
train_loader = torchacc.AsyncLoader(train_loader, model.device)
# training loop
...
More complete OLMo acceleration examples are available in the Alibaba Cloud DSW Gallery.
04 Performance of PAI‑TorchAcc
On a single node with 8 × A100 GPUs, PAI‑TorchAcc achieves a 1.64× speedup over PyTorch FSDP for OLMo‑1B and a 1.52× speedup for OLMo‑7B.
05 Why Is PAI‑TorchAcc So Fast?
Both PAI‑TorchAcc and PyTorch use the same FSDP (ZeRO‑3) distributed strategy. PAI‑TorchAcc gains additional speed through compute optimization, communication overlap, and memory optimization.
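The FSDP (ZeRO‑3) strategy that both frameworks share can be illustrated with a toy, single-process sketch: each rank stores only a shard of every parameter and rebuilds the full parameter (an all-gather) just before it is needed. The helper names `shard` and `all_gather` are hypothetical, and plain Python lists stand in for tensors. Real FSDP does this with collective communication across GPUs.

```python
from typing import List


def shard(params: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter list into world_size equal shards, zero-padding the tail."""
    per_rank = -(-len(params) // world_size)              # ceiling division
    padded = params + [0.0] * (per_rank * world_size - len(params))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Rebuild the full parameter from every rank's shard and drop the padding."""
    full = [x for s in shards for x in s]
    return full[:numel]


world_size = 8                                            # e.g. one node with 8 GPUs
layer_params = [float(i) for i in range(10)]              # a tiny stand-in "layer"

shards = shard(layer_params, world_size)
per_rank_numel = len(shards[0])                           # what each rank keeps resident
restored = all_gather(shards, len(layer_params))          # materialized only when needed
```

Because each rank holds roughly 1/`world_size` of the parameters between uses, the all-gather traffic becomes the price paid for the memory savings. This is why overlapping that communication with compute matters.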
With micro‑batch size = 2, compute optimizations reduce the time of memory‑intensive operators to 45.56% of PyTorch’s, yielding an overall 1.25× speedup. Communication overlap lowers non‑overlapped communication from 8.19% to 2.43% of total time, resulting in a 1.32× overall speedup.
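One way to sanity-check how these figures relate is the short calculation below. It rests on two assumptions that the text does not state explicitly: both communication percentages are measured against the same step time, and the 1.32× figure is cumulative (compute optimization plus communication overlap).

```python
compute_speedup = 1.25          # overall speedup after compute optimizations (from the text)
exposed_comm_before = 0.0819    # non-overlapped communication share before overlap (from the text)
exposed_comm_after = 0.0243     # non-overlapped communication share after overlap (from the text)

# Assumption: both shares are fractions of the same step time.
time_saved = exposed_comm_before - exposed_comm_after
overlap_speedup = 1.0 / (1.0 - time_saved)

# Assumption: the two gains compose multiplicatively.
combined_speedup = compute_speedup * overlap_speedup

print(f"overlap alone: {overlap_speedup:.3f}x, combined: {combined_speedup:.3f}x")
```

Under these assumptions the combined figure comes out at roughly 1.33×, close to the reported 1.32×, which is consistent with reading the second number as cumulative.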
Static graph conversion enables aggressive memory optimizations: reordering operators, using better memory‑allocation algorithms, and replacing traditional attention with flash attention all reduce peak memory usage. Consequently, PAI‑TorchAcc can use a micro‑batch size of 4 (vs. 2 for PyTorch), further increasing throughput.
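A back-of-the-envelope calculation suggests why swapping out standard attention helps with peak memory. Standard attention materializes a score matrix of shape (batch, heads, seq_len, seq_len) in each layer, whereas flash attention avoids storing it. The head count and sequence length below are illustrative assumptions, not figures from this article, and `attention_scores_bytes` is a hypothetical helper.

```python
def attention_scores_bytes(batch: int, heads: int, seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize one layer's attention score matrix (bf16 = 2 bytes)."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


HEADS = 16        # assumed for illustration
SEQ_LEN = 2048    # assumed for illustration
MIB = 1024 ** 2

scores_mbs2 = attention_scores_bytes(2, HEADS, SEQ_LEN) / MIB   # micro-batch size 2
scores_mbs4 = attention_scores_bytes(4, HEADS, SEQ_LEN) / MIB   # micro-batch size 4

print(f"micro-batch 2: {scores_mbs2:.0f} MiB per layer, micro-batch 4: {scores_mbs4:.0f} MiB per layer")
```

The score matrix grows linearly with micro-batch size and quadratically with sequence length, so not materializing it frees memory that can instead go toward a larger micro-batch.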
06 Summary
This article demonstrates how to use PAI‑TorchAcc to accelerate OLMo model training, analyzes the sources of its performance gains, and notes that the framework also supports other open‑source LLMs such as LLaMA, LLaMA‑2, BaiChuan, ChatGLM, and QWen, as well as vision and speech models.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
