How PAI‑TorchAcc Supercharges OLMo LLM Training with Up to 1.64× Speedup

PAI‑TorchAcc, Alibaba Cloud’s PyTorch accelerator, now supports the open‑source OLMo large language model and delivers up to 1.64× faster training on OLMo‑1B and 1.52× on OLMo‑7B. The gains come from graph capture combined with distributed, compute, communication, and memory optimizations. This article covers the usage steps and a performance analysis.

Alibaba Cloud Big Data AI Platform

Feb 28, 2024

01 PAI‑TorchAcc Overview

PAI‑TorchAcc (Torch Accelerator) is a framework from Alibaba Cloud’s Machine Learning Platform (PAI) that accelerates large‑model training on PyTorch. It uses PyTorch/XLA’s GraphCapture to convert dynamic graphs into static computation graphs. Working on a static graph is what enables the distributed, compute, and memory optimizations it applies to large language models and other models.
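To make the idea of graph capture concrete, here is a toy sketch in plain Python. It is not how PyTorch/XLA or PAI‑TorchAcc is implemented; the `Graph`, `Lazy`, `capture`, and `run` names are hypothetical and exist only for this illustration. It shows the general pattern: run the dynamic Python code once on placeholder values, record each operation as a node, and end up with a static graph that can be inspected, optimized, and replayed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Graph:
    """A static graph: an ordered list of (op_name, input_node_ids)."""
    nodes: List[Tuple[str, Tuple[int, ...]]] = field(default_factory=list)

    def add(self, op: str, inputs: Tuple[int, ...]) -> int:
        self.nodes.append((op, inputs))
        return len(self.nodes) - 1  # id of the node just added


class Lazy:
    """Placeholder value: arithmetic records a node instead of computing."""

    def __init__(self, graph: Graph, node_id: int):
        self.graph, self.node_id = graph, node_id

    def __add__(self, other: "Lazy") -> "Lazy":
        return Lazy(self.graph, self.graph.add("add", (self.node_id, other.node_id)))

    def __mul__(self, other: "Lazy") -> "Lazy":
        return Lazy(self.graph, self.graph.add("mul", (self.node_id, other.node_id)))


def capture(fn: Callable, num_inputs: int) -> Tuple[Graph, int]:
    """Trace fn once with placeholders and return (graph, output_node_id)."""
    graph = Graph()
    args = [Lazy(graph, graph.add("input", ())) for _ in range(num_inputs)]
    return graph, fn(*args).node_id


OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def run(graph: Graph, output_id: int, inputs: Sequence[float]) -> float:
    """Replay the recorded graph on concrete inputs."""
    values: List[float] = []
    remaining = iter(inputs)
    for op, in_ids in graph.nodes:
        if op == "input":
            values.append(next(remaining))
        else:
            values.append(OPS[op](*(values[i] for i in in_ids)))
    return values[output_id]


def model(x, w, b):
    # Ordinary "dynamic" Python code; it is never modified for tracing.
    return x * w + b


graph, out_id = capture(model, num_inputs=3)
result = run(graph, out_id, [2.0, 3.0, 1.0])
print([op for op, _ in graph.nodes], result)
```

Once the whole computation exists as a list of nodes, a compiler can reorder or fuse operations before anything runs, which eager, operation‑by‑operation execution does not allow.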

02 Fully Open‑Source OLMo Model

OLMo (Open Language Model) is a completely open‑source LLM released by the Allen Institute for AI and partners. It provides the full training data, code, and checkpoints, and matches or exceeds LLaMA‑2 on several core metrics.

03 How to Use PAI‑TorchAcc to Accelerate OLMo Training

Accelerating training with PAI‑TorchAcc involves three steps:

1. Define torchacc.Config and set acceleration options.

2. Call torchacc.accelerate with the model and config to prepare accelerated training.

3. Wrap the data loader with torchacc.AsyncLoader to speed up data loading.

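import torchacc
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define model and dataloader
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B", use_fast=False, trust_remote_code=True)
train_loader = get_dataloader(tokenizer)  # get_dataloader: user-defined helper, not shown here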

# Define TorchAcc Config
config = torchacc.Config()
config.compute.bf16 = True  # enable bf16
config.compute.acc_scaled_dot_attn = True  # replace scaled dot-product attention with flash attention
config.dist.fsdp.size = torchacc.dist.world_size()  # enable FSDP
config.dist.fsdp.wrap_layer_cls = {"OlmoSequentialBlock"}  # wrap OLMo decoder layers

# One‑line model acceleration
model = torchacc.accelerate(model, config)

# Asynchronous data loading
train_loader = torchacc.AsyncLoader(train_loader, model.device)

# training loop
...

More complete OLMo acceleration examples are available in the Alibaba Cloud DSW Gallery.

04 Performance of PAI‑TorchAcc

On a single node with 8 × A100 GPUs, PAI‑TorchAcc achieves a 1.64× speedup over PyTorch FSDP for OLMo‑1B and a 1.52× speedup for OLMo‑7B.

Performance comparison of PAI‑TorchAcc vs PyTorch FSDP on OLMo models

05 Why Is PAI‑TorchAcc So Fast?

Both PAI‑TorchAcc and PyTorch use the same FSDP (ZeRO‑3) distributed strategy. PAI‑TorchAcc gains additional speed through compute optimization, communication overlap, and memory optimization.
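As background on what the shared FSDP (ZeRO‑3) strategy buys, the following back‑of‑the‑envelope sketch estimates per‑GPU model‑state memory when parameters, gradients, and optimizer states are all sharded. The 16 bytes per parameter figure is an assumption taken from the usual mixed‑precision Adam accounting (2 bytes for half‑precision weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments). The article does not state the optimizer or its precision, the parameter counts are nominal (1e9 and 7e9), and activations are ignored.

```python
def sharded_state_gb_per_gpu(num_params: float, world_size: int,
                             bytes_per_param: int = 16) -> float:
    """Model-state memory per GPU (in GB) under full ZeRO-3 style sharding.

    bytes_per_param=16 is an assumed mixed-precision Adam footprint,
    not a number reported in the article.
    """
    return num_params * bytes_per_param / world_size / 1e9


# Nominal parameter counts on the 8-GPU node used in the benchmark.
per_gpu_1b = sharded_state_gb_per_gpu(1e9, world_size=8)
per_gpu_7b = sharded_state_gb_per_gpu(7e9, world_size=8)
unsharded_7b = sharded_state_gb_per_gpu(7e9, world_size=1)

print(f"1B sharded over 8 GPUs: {per_gpu_1b:.1f} GB per GPU")
print(f"7B sharded over 8 GPUs: {per_gpu_7b:.1f} GB per GPU")
print(f"7B on a single GPU:     {unsharded_7b:.1f} GB")
```

Under these assumptions the sharded model states fit comfortably on each GPU, so the remaining memory pressure comes mostly from activations, which is where the memory optimizations discussed in this section apply.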

With a micro‑batch size of 2, compute optimizations cut the time spent in memory‑intensive operators to 45.56% of the PyTorch baseline, for an overall 1.25× speedup. Communication overlap then reduces non‑overlapped communication from 8.19% to 2.43% of total time, raising the overall speedup to 1.32×.
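The relationship between a per‑operator improvement and the overall speedup can be sketched with Amdahl's law. The calculation below rests on two assumptions that the article does not confirm: that the 1.25× figure comes entirely from shrinking memory‑intensive operators to 45.56% of their original time, and that the 1.32× figure is cumulative (compute optimizations plus communication overlap). Under those assumptions it backs out the share of baseline step time spent in memory‑intensive operators. Treat the result as an illustrative estimate, not a measured value.

```python
def overall_speedup(fraction: float, remaining_ratio: float) -> float:
    """Amdahl's law: `fraction` of the time shrinks to `remaining_ratio` of itself."""
    return 1.0 / ((1.0 - fraction) + fraction * remaining_ratio)


def fraction_from_speedup(speedup: float, remaining_ratio: float) -> float:
    """Invert Amdahl's law to recover the optimized fraction of baseline time."""
    return (1.0 - 1.0 / speedup) / (1.0 - remaining_ratio)


# Figures quoted in the article.
REMAINING_RATIO = 0.4556      # memory-intensive operator time vs. the baseline
COMPUTE_SPEEDUP = 1.25        # overall speedup after compute optimizations
WITH_COMM_SPEEDUP = 1.32      # overall speedup after communication overlap

# Assumption: the 1.25x is explained by the memory-intensive operators alone.
mem_fraction = fraction_from_speedup(COMPUTE_SPEEDUP, REMAINING_RATIO)
check = overall_speedup(mem_fraction, REMAINING_RATIO)

# Assumption: 1.32x is cumulative, so overlap alone contributes the ratio.
comm_step_gain = WITH_COMM_SPEEDUP / COMPUTE_SPEEDUP

print(f"implied memory-intensive share of baseline time: {mem_fraction:.1%}")
print(f"round-trip check of overall speedup: {check:.2f}x")
print(f"incremental gain from communication overlap: {comm_step_gain:.3f}x")
```

Read this way, roughly a third of the baseline step time would sit in memory‑intensive operators, and communication overlap would add about a further 5 to 6 percent on top of the compute optimizations.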

Static graph conversion enables aggressive memory optimizations: operator reordering, better allocation algorithms, and replacing traditional attention with flash attention all reduce peak memory usage. Consequently, PAI‑TorchAcc can use a micro‑batch size of 4 (vs. 2 for PyTorch), further increasing throughput.
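One reason flash attention lowers peak memory is that traditional attention materializes a full sequence‑by‑sequence score matrix per head, while flash attention computes attention in tiles and never stores that matrix in full. The rough estimate below shows how large that matrix gets. The head count (16) and sequence length (2048) are illustrative values chosen for this sketch, not figures from the article, and the estimate covers a single copy of the scores for a single layer.

```python
MIB = 1024 ** 2


def attention_scores_bytes(micro_batch: int, num_heads: int, seq_len: int,
                           bytes_per_element: int = 2) -> int:
    """Bytes needed for one full (seq_len x seq_len) score matrix per head.

    bytes_per_element=2 corresponds to bf16, which the example config enables.
    """
    return micro_batch * num_heads * seq_len * seq_len * bytes_per_element


# Illustrative dimensions (assumed, not taken from the article).
NUM_HEADS = 16
SEQ_LEN = 2048

scores_mb2 = attention_scores_bytes(2, NUM_HEADS, SEQ_LEN)
scores_mb4 = attention_scores_bytes(4, NUM_HEADS, SEQ_LEN)

print(f"micro-batch 2: {scores_mb2 / MIB:.0f} MiB per layer")
print(f"micro-batch 4: {scores_mb4 / MIB:.0f} MiB per layer")
```

Because this cost grows linearly with the micro‑batch size and quadratically with sequence length, avoiding it frees memory that can instead go toward a larger micro‑batch.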

06 Summary

This article showed how to use PAI‑TorchAcc to accelerate OLMo model training and analyzed where its performance gains come from. The framework also supports other open‑source LLMs such as LLaMA, LLaMA‑2, Baichuan, ChatGLM, and Qwen, as well as vision and speech models.
