How TorchAcc Accelerates Large‑Model Training with TorchXLA

This article examines Alibaba Cloud's TorchAcc framework, a TorchXLA‑based distributed training solution that automates parallel strategies, optimizes memory, computation, and communication, and delivers up to three‑fold speedups for large models such as Llama 2‑7B.

Alibaba Cloud Big Data AI Platform

Mar 25, 2024

Speaker: Lin Wei, Alibaba Cloud researcher and technical lead of the AI Platform PAI.

The article introduces TorchAcc, a distributed training framework for large models built on PyTorch/XLA, addressing the growing gap between model size and GPU compute/memory capabilities.

Recent AI advances have been driven by larger models, but training them demands massive compute resources. The memory and performance of a single GPU cannot keep pace with model growth, so training now requires model‑parallel strategies that go beyond simple data parallelism.

Data parallelism relies on AllReduce to synchronize gradients across workers, but as models exceed a single GPU's memory, model parallelism becomes necessary, requiring careful model partitioning, selection of communication primitives, and scheduling to maximize compute‑communication overlap.
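As a rough illustration of the gradient synchronization described above, the following pure-Python sketch simulates what an averaging AllReduce computes: every worker contributes its local gradient and every worker receives the same mean. The function name `all_reduce_mean` and the toy gradient values are hypothetical; a real framework would issue a collective over the interconnect rather than loop over lists.

```python
def all_reduce_mean(worker_grads):
    """Simulate an averaging AllReduce: each worker ends with the same mean gradient."""
    num_workers = len(worker_grads)
    length = len(worker_grads[0])
    # Element-wise sum across workers, divided by the number of workers.
    averaged = [sum(g[i] for g in worker_grads) / num_workers for i in range(length)]
    # Every worker receives its own copy of the reduced result.
    return [list(averaged) for _ in range(num_workers)]


# Four data-parallel workers, each holding a local gradient for two parameters.
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce_mean(worker_grads)
print(synced[0])  # identical on every worker
```

Because each worker holds a full copy of the model in this scheme, it stops working once the model no longer fits on one GPU, which is the point at which the partitioning strategies come in.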

TorchAcc automates the exploration and integration of various parallel strategies, including data parallelism, Fully Sharded Data Parallel (FSDP/ZeRO), tensor parallelism, and pipeline parallelism, while also offering semi‑automatic controls for advanced users.
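To see why sharding strategies such as FSDP/ZeRO matter, a back-of-the-envelope estimate helps. The sketch below assumes mixed-precision Adam training at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and the two Adam moments), a nominal 7 billion parameters, and full sharding of all three kinds of state. Activations and temporary buffers are ignored. These numbers are illustrative assumptions, not figures from the talk, and `training_state_gib` is a hypothetical helper.

```python
def training_state_gib(num_params, num_gpus=1, bytes_per_param=16):
    """Per-GPU memory (GiB) for weights, gradients, and optimizer state.

    Assumes the state is evenly sharded across num_gpus (full sharding);
    activations are not counted.
    """
    return num_params * bytes_per_param / num_gpus / 2**30


nominal_params = 7_000_000_000  # nominal size, an assumption for illustration
for gpus in (1, 8, 128):
    print(gpus, round(training_state_gib(nominal_params, gpus), 2))
```

Under these assumptions the unsharded state alone exceeds the memory of any single current GPU, while sharding across 8 devices brings it down to roughly 13 GiB per device.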

It features a memory‑aware allocator that schedules tensor placement to stay within GPU memory limits, and it applies multiple optimization passes over an intermediate representation (the IR graph) to improve computation, memory, communication, and distributed‑strategy efficiency.

Performance results show up to 3× speedup on several models. For Llama 2‑7B on 128 GPUs, communication optimizations raise the acceleration factor (speedup) from 88 to 116, and memory optimizations reduce memory usage by up to 30%.
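The 88 and 116 figures are easier to interpret as scaling efficiency. The short calculation below assumes the acceleration factor is measured relative to a single GPU, so that ideal linear scaling on 128 GPUs would be 128; that reading is an assumption, and `scaling_efficiency` is a hypothetical helper.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / num_gpus


before = scaling_efficiency(88, 128)   # before communication optimizations
after = scaling_efficiency(116, 128)   # after communication optimizations
print(f"{before:.1%} -> {after:.1%}")
```

Under that assumption, the communication optimizations move the run from roughly 69% to roughly 91% of linear scaling.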

Key techniques include FlashAttention integration via XLA custom calls, collective communication merging and asynchronous execution on separate CUDA streams, and a latency‑hiding scheduler from OpenXLA.
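Collective communication merging can be pictured as bucketing: instead of launching one collective per gradient tensor, consecutive tensors are grouped until a size cap is reached, and one collective is issued per group. The sketch below is a generic greedy bucketing routine, not TorchAcc's actual pass. The function name, the tensor names, and the cap of 512 elements are all hypothetical.

```python
def merge_into_buckets(tensor_sizes, bucket_cap):
    """Greedily group consecutive tensors so a bucket holds at most bucket_cap elements.

    A single tensor larger than the cap gets a bucket of its own.
    """
    buckets, current, current_size = [], [], 0
    for name, size in tensor_sizes:
        # Close the current bucket if adding this tensor would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Gradients in the order they become ready during the backward pass.
grads = [("layer3.w", 400), ("layer3.b", 10), ("layer2.w", 300),
         ("layer2.b", 10), ("layer1.w", 500)]
buckets = merge_into_buckets(grads, bucket_cap=512)
print(len(grads), "collectives merged into", len(buckets))
```

Fewer, larger collectives amortize per-launch latency, and running them asynchronously on a separate stream lets the remaining backward computation proceed while a bucket is in flight.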

The framework converts front‑end models from PyTorch or TensorFlow into a unified Model IR, using symbolic tracing and LazyTensor for PyTorch and direct graph conversion for TensorFlow, then applies optimization passes to generate an optimal execution plan for the backend.
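The lazy-tensor idea behind this conversion can be sketched in a few lines: operations on tensors are recorded into a graph rather than executed, and the recorded graph is only run (or, in a real system, optimized and compiled) later. The toy below is not PyTorch/XLA's implementation. The class `LazyTensor`, the helpers `make_input` and `execute`, and the tuple-based IR are hypothetical and work on plain Python floats.

```python
class LazyTensor:
    """Records operations into a shared IR list instead of executing them."""

    def __init__(self, graph, node_id):
        self.graph = graph
        self.node_id = node_id

    def _record(self, op, other):
        # Append an IR node that refers to its operands by node index.
        self.graph.append((op, self.node_id, other.node_id))
        return LazyTensor(self.graph, len(self.graph) - 1)

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def make_input(graph, name):
    """Add a named placeholder node to the graph."""
    graph.append(("input", name))
    return LazyTensor(graph, len(graph) - 1)


def execute(graph, feeds):
    """Run the recorded graph in order and return the last node's value."""
    values = []
    for node in graph:
        if node[0] == "input":
            values.append(feeds[node[1]])
        elif node[0] == "add":
            values.append(values[node[1]] + values[node[2]])
        elif node[0] == "mul":
            values.append(values[node[1]] * values[node[2]])
    return values[-1]


graph = []
x, w, b = (make_input(graph, n) for n in ("x", "w", "b"))
y = x * w + b          # nothing is computed here, only recorded
result = execute(graph, {"x": 2.0, "w": 3.0, "b": 1.0})
```

Because the whole computation exists as data before it runs, optimization passes have a chance to rewrite it, for example by fusing operations or reordering them.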

Memory optimization (ROAM) splits the computation graph into memory‑independent sub‑graphs, applies a memory‑aware weight‑update scheduler, and solves local ordering problems to reduce peak memory consumption, outperforming baseline methods by 13‑27% in memory savings.
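The effect of operator ordering on peak memory can be shown on a tiny graph. The sketch below is a brute-force search over valid execution orders. It is not ROAM's algorithm, which the text describes as splitting the graph into sub-graphs and solving local ordering problems. The memory model (each op allocates one output, freed after its last consumer runs), the function names, and the example graph are all assumptions made for illustration.

```python
from itertools import permutations


def peak_memory(order, sizes, inputs):
    """Peak size of live outputs when ops run in `order`."""
    # Number of ops that consume each op's output.
    remaining = {op: sum(op in inputs[c] for c in order) for op in order}
    live = peak = 0
    for op in order:
        live += sizes[op]              # allocate this op's output
        peak = max(peak, live)
        for src in inputs[op]:
            remaining[src] -= 1
            if remaining[src] == 0:    # last consumer has run: free the input
                live -= sizes[src]
    return peak


def is_valid(order, inputs):
    """True if every op runs after all of its inputs."""
    seen = set()
    for op in order:
        if any(src not in seen for src in inputs[op]):
            return False
        seen.add(op)
    return True


def best_order(sizes, inputs):
    """Exhaustively pick a valid order with the lowest peak (tiny graphs only)."""
    valid = (o for o in permutations(sizes) if is_valid(o, inputs))
    return min(valid, key=lambda o: peak_memory(o, sizes, inputs))


# Two branches: a large tensor (a, b) is reduced to a small one (a2, b2).
sizes = {"a": 8, "b": 8, "a2": 1, "b2": 1, "out": 1}
inputs = {"a": [], "b": [], "a2": ["a"], "b2": ["b"], "out": ["a2", "b2"]}

naive = ("a", "b", "a2", "b2", "out")
naive_peak = peak_memory(naive, sizes, inputs)
best_peak = peak_memory(best_order(sizes, inputs), sizes, inputs)
print(naive_peak, best_peak)
```

Finishing one branch before starting the other keeps only one large tensor alive at a time. Exhaustive search is infeasible for real training graphs, which is why a practical method has to decompose the problem.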

Overall, TorchAcc delivers comprehensive improvements in memory, compute, communication, and parallel strategy optimization, significantly enhancing the efficiency of large‑model distributed training.
