
ATorch: Ant Group’s Open‑Source Distributed Training Acceleration Library for Large‑Scale AI Models

Ant Group's newly open-sourced ATorch library extends PyTorch with a layered architecture and automated, resource-aware optimization strategies, pushing large-model training to up to 60% compute utilization, improving stability, and delivering significant throughput gains across multi-node, multi-GPU deployments.


Ant Group has open‑sourced ATorch, a distributed training acceleration library built on PyTorch, designed to improve the efficiency and stability of large‑scale AI model training.

ATorch can reach up to 60% compute utilization, making it well suited to trillion-parameter model training; the resulting performance boost is akin to installing a more powerful engine in a sports car.
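"Compute utilization" here is the fraction of the hardware's peak FLOPs that training actually uses (often called Model FLOPs Utilization, MFU). A minimal sketch of the arithmetic, with all numbers purely illustrative (they are not ATorch benchmark figures):

```python
# Model FLOPs Utilization: achieved FLOPs/s divided by peak FLOPs/s.
# All inputs below are illustrative placeholders, not ATorch results.

def mfu(tokens_per_sec: float, flops_per_token: float, peak_flops: float) -> float:
    """Fraction of peak hardware compute actually used by training."""
    return tokens_per_sec * flops_per_token / peak_flops

# Rule of thumb: a dense transformer needs roughly 6 * N FLOPs per token
# for one forward+backward pass, where N is the parameter count.
n_params = 70e9                         # hypothetical 70B-parameter model
flops_per_token = 6 * n_params
peak = 1536 * 312e12                    # 1536 A100s at 312 TFLOPS BF16 peak
u = mfu(tokens_per_sec=680_000, flops_per_token=flops_per_token, peak_flops=peak)
print(f"utilization = {u:.1%}")         # ~60% with these illustrative numbers
```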

The library follows a layered architecture:

- Interface layer: a concise one-API or trainer mode for users.
- IO & preprocessing layer: optimizes storage and data handling.
- Elastic fault-tolerance layer: ensures training stability when integrated with DLRover.
- Core layer: unifies optimization strategies, automatic strategy search, and dynamic memory management.
- Bottom layer: connects to the PyTorch ecosystem and the underlying AIDC hardware.

Key functionalities include a unified distributed optimizer configuration interface, automatic distributed strategy search, automatic elastic fault tolerance, the GLake high‑efficiency dynamic memory management library, hardware‑specific adaptations, and self‑developed optimizers such as AGD (accelerated convergence) and WSAM (enhanced generalization, accepted at KDD ’23).
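To give a feel for what "automatic distributed strategy search" means, here is a deliberately tiny brute-force sketch: enumerate (data, tensor, pipeline) parallel degrees that factor the world size, discard configurations that would not fit per-GPU memory, and keep the one with the best estimated throughput. The memory and cost models are toy placeholders, not ATorch's actual search algorithm:

```python
# Toy automatic strategy search: pick (dp, tp, pp) parallel degrees for a
# given cluster. Memory/throughput models are illustrative placeholders.
from itertools import product

def search_strategy(world_size: int, mem_per_gpu_gb: float, model_mem_gb: float):
    best, best_score = None, -1.0
    for dp, tp, pp in product(range(1, world_size + 1), repeat=3):
        if dp * tp * pp != world_size:
            continue                               # degrees must use all GPUs
        shard_mem = model_mem_gb / (tp * pp)       # model states sharded over tp*pp
        if shard_mem > mem_per_gpu_gb:
            continue                               # config does not fit in memory
        # Toy cost model: dp scales throughput; tp/pp add communication overhead.
        score = dp / (1 + 0.1 * (tp - 1) + 0.3 * (pp - 1))
        if score > best_score:
            best, best_score = (dp, tp, pp), score
    return best

# 8 GPUs with 80 GB each, 240 GB of model states to place.
best = search_strategy(world_size=8, mem_per_gpu_gb=80, model_mem_gb=240)
print(best)
```

A real search, as described for ATorch, would additionally account for interconnect topology, activation memory, and measured step times rather than a closed-form score.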

Benchmark results show substantial throughput improvements on models such as Tsinghua's GLM-65B (A100 × 1536), Meta's LLaMA2-70B (H800 × 1536), Stable Diffusion (A100 × 256), and a 2B-parameter vision generation model (A100 × 128). Training stability metrics indicate the share of each day spent on effective training rose from 40.7% to 95%, checkpoint save time dropped from 10 minutes to 1 minute, and restart time fell from 90 minutes to 5 minutes.
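A common way to shrink checkpoint save time, which may be part of how such a 10x reduction is achieved (the article does not detail ATorch's mechanism), is to snapshot state into host memory quickly and push the slow disk write onto a background thread so training is not blocked. A minimal sketch of that pattern:

```python
# Async checkpointing sketch: take a fast in-memory snapshot on the training
# thread, persist it to disk on a background thread. Illustrative only; this
# is not ATorch's actual implementation.
import copy
import os
import pickle
import tempfile
import threading

def async_save(state: dict, path: str) -> threading.Thread:
    snapshot = copy.deepcopy(state)        # fast: memory copy, no disk I/O
    def _persist():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)       # slow disk write happens off-thread
    t = threading.Thread(target=_persist)
    t.start()
    return t                               # training resumes immediately

state = {"step": 1000, "weights": [0.1, 0.2, 0.3]}
path = os.path.join(tempfile.gettempdir(), "ckpt_sketch.pkl")
thread = async_save(state, path)
state["step"] += 1                         # training keeps mutating live state
thread.join()                              # background write finishes
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])                    # snapshot predates the mutation
```

The deep copy guarantees the background writer sees a consistent snapshot even while the live state keeps changing.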

Future plans for ATorch include RLHF support for trillion-parameter models, further speed-ups for checkpoint handling, additional hardware optimizations, Lynx full-graph compilation, and more distributed optimization strategies. The source code is available at https://github.com/intelligent-machine-learning/dlrover.

Tags: deep learning, open-source, large models, PyTorch, distributed training, AI acceleration
Written by AntTech

Technology is the core driver of Ant's future creation.
