
ATorch: Ant Group’s Open‑Source Distributed Training Acceleration Library for Large‑Scale AI Models

Ant Group's newly open-sourced ATorch library extends PyTorch with a layered architecture and automated, resource-aware optimization strategies, pushing large-model training to up to 60% compute utilization, improving stability, and delivering significant throughput gains across multi-node, multi-GPU deployments.


Ant Group has open‑sourced ATorch, a distributed training acceleration library built on PyTorch, designed to improve the efficiency and stability of large‑scale AI model training.

ATorch can reach up to 60% compute utilization, making it well suited to trillion-parameter model training; the resulting performance boost is akin to installing a more powerful engine in a sports car.
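"Compute utilization" here is the fraction of the hardware's peak FLOPs that training actually uses (often called Model FLOPs Utilization, MFU). A minimal sketch of the arithmetic, with all numbers purely illustrative (they are not ATorch benchmark figures):

```python
# Model FLOPs Utilization: achieved FLOPs/s divided by peak FLOPs/s.
# All inputs below are illustrative placeholders, not ATorch results.

def mfu(tokens_per_sec: float, flops_per_token: float, peak_flops: float) -> float:
    """Fraction of peak hardware compute actually used by training."""
    return tokens_per_sec * flops_per_token / peak_flops

# Rule of thumb: a dense transformer needs roughly 6 * N FLOPs per token
# for one forward+backward pass, where N is the parameter count.
n_params = 70e9                         # hypothetical 70B-parameter model
flops_per_token = 6 * n_params
peak = 1536 * 312e12                    # 1536 A100s at 312 TFLOPS BF16 peak
u = mfu(tokens_per_sec=680_000, flops_per_token=flops_per_token, peak_flops=peak)
print(f"utilization = {u:.1%}")         # ~60% with these illustrative numbers
```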

The library follows a layered architecture:

- Interface layer: a concise one-API or trainer mode for users.
- IO & preprocessing layer: optimizes storage and data handling.
- Elastic fault-tolerance layer: ensures training stability when integrated with DLRover.
- Core layer: unifies optimization strategies, automatic strategy search, and dynamic memory management.
- Bottom layer: connects to the PyTorch ecosystem and the underlying AIDC hardware.

Key functionalities include a unified distributed optimizer configuration interface, automatic distributed strategy search, automatic elastic fault tolerance, the GLake high‑efficiency dynamic memory management library, hardware‑specific adaptations, and self‑developed optimizers such as AGD (accelerated convergence) and WSAM (enhanced generalization, accepted at KDD ’23).
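To give a feel for what "automatic distributed strategy search" means, here is a deliberately tiny brute-force sketch: enumerate (data, tensor, pipeline) parallel degrees that factor the world size, discard configurations that would not fit per-GPU memory, and keep the one with the best estimated throughput. The memory and cost models are toy placeholders, not ATorch's actual search algorithm:

```python
# Toy automatic strategy search: pick (dp, tp, pp) parallel degrees for a
# given cluster. Memory/throughput models are illustrative placeholders.
from itertools import product

def search_strategy(world_size: int, mem_per_gpu_gb: float, model_mem_gb: float):
    best, best_score = None, -1.0
    for dp, tp, pp in product(range(1, world_size + 1), repeat=3):
        if dp * tp * pp != world_size:
            continue                               # degrees must use all GPUs
        shard_mem = model_mem_gb / (tp * pp)       # model states sharded over tp*pp
        if shard_mem > mem_per_gpu_gb:
            continue                               # config does not fit in memory
        # Toy cost model: dp scales throughput; tp/pp add communication overhead.
        score = dp / (1 + 0.1 * (tp - 1) + 0.3 * (pp - 1))
        if score > best_score:
            best, best_score = (dp, tp, pp), score
    return best

# 8 GPUs with 80 GB each, 240 GB of model states to place.
best = search_strategy(world_size=8, mem_per_gpu_gb=80, model_mem_gb=240)
print(best)
```

A real search, as described for ATorch, would additionally account for interconnect topology, activation memory, and measured step times rather than a closed-form score.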

Benchmark results show substantial throughput improvements on models such as Tsinghua's GLM-65B (A100 × 1536), Meta's LLaMA2-70B (H800 × 1536), Stable Diffusion (A100 × 256), and a 2B-parameter vision generation model (A100 × 128). Training stability metrics indicate the share of each day spent on effective training rose from 40.7% to 95%, checkpoint save time dropped from 10 minutes to 1 minute, and restart time fell from 90 minutes to 5 minutes.
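A common way to shrink checkpoint save time, which may be part of how such a 10x reduction is achieved (the article does not detail ATorch's mechanism), is to snapshot state into host memory quickly and push the slow disk write onto a background thread so training is not blocked. A minimal sketch of that pattern:

```python
# Async checkpointing sketch: take a fast in-memory snapshot on the training
# thread, persist it to disk on a background thread. Illustrative only; this
# is not ATorch's actual implementation.
import copy
import os
import pickle
import tempfile
import threading

def async_save(state: dict, path: str) -> threading.Thread:
    snapshot = copy.deepcopy(state)        # fast: memory copy, no disk I/O
    def _persist():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)       # slow disk write happens off-thread
    t = threading.Thread(target=_persist)
    t.start()
    return t                               # training resumes immediately

state = {"step": 1000, "weights": [0.1, 0.2, 0.3]}
path = os.path.join(tempfile.gettempdir(), "ckpt_sketch.pkl")
thread = async_save(state, path)
state["step"] += 1                         # training keeps mutating live state
thread.join()                              # background write finishes
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])                    # snapshot predates the mutation
```

The deep copy guarantees the background writer sees a consistent snapshot even while the live state keeps changing.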

Future plans for ATorch include RLHF support for trillion-parameter models, further speed-ups for checkpoint handling, additional hardware optimizations, Lynx full-graph compilation, and more distributed optimization strategies. The source code is available at https://github.com/intelligent-machine-learning/dlrover.

Tags: deep learning, open-source, large models, PyTorch, distributed training, AI acceleration
Written by AntTech

Technology is the core driver of Ant's future creation.
