How AngelPTM Cuts Large Model Training Costs with ZeRO-Cache Optimizations
This article analyzes Tencent's AngelPTM framework: its ZeRO‑Cache strategy, unified storage management, multi‑stream asynchronous execution, SSD tiered storage, and benchmarks showing roughly 95% more model capacity and a speedup of over 44% compared with community solutions.
Background
With the rise of ChatGPT and other large‑scale language models, the AI community has been pushing model parameters from hundreds of millions to trillions, dramatically increasing compute and storage demands. Training a trillion‑parameter model can require more than 1.7 TB of parameter and optimizer state storage and hundreds of A100‑40G GPUs.
Challenges of Scaling Large Models
GPU memory growth lags far behind parameter growth (a 100 000× increase in parameters versus only a 4× increase in GPU memory). Redundant copies of model state in CPU and GPU memory, excessive pin‑memory usage, and fragmented allocations further limit feasible model sizes.
Existing Solutions
Open‑source projects such as Microsoft DeepSpeed (with ZeRO optimizer) and NVIDIA Megatron‑LM (3‑D parallelism) address memory bottlenecks by partitioning parameters across devices, but they still face limits when scaling to the largest models.
AngelPTM Overview
AngelPTM is Tencent’s internally developed training framework. It builds on DeepSpeed and Megatron‑LM and adds deep customizations to reduce cost and improve performance for very large models. It powers the Hunyuan (混元) AI large model released in April 2022.
ZeRO‑Cache Optimization Strategy
ZeRO‑Cache presents CPU memory and GPU memory as a single storage space, so each piece of model state is kept only once. A Contiguous Memory manager allocates and releases GPU memory in large blocks to reduce fragmentation, and multi‑stream pipelines overlap compute, transfers, and communication to keep the hardware busy. An SSD tier adds a third storage level for optimizer states, while fp16 parameters and gradients stay in DRAM so that forward and backward computation does not depend on SSD bandwidth.
Unified Storage Management
AngelPTM chunks model tensors and stores them either in CPU memory or GPU memory, ensuring each piece of state exists only once. This heterogeneous storage breaks the traditional CPU‑GPU barrier and dramatically expands the usable storage per node.
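The single-copy idea can be illustrated with a small sketch. This is not AngelPTM's code: `Chunk`, `UnifiedStore`, the byte budget, and the "prefer GPU, spill to CPU" policy are all hypothetical names and assumptions chosen for illustration, and plain integers stand in for real tensors.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: int
    num_bytes: int
    device: str  # "gpu" or "cpu"; a chunk lives in exactly one place

@dataclass
class UnifiedStore:
    gpu_budget: int                      # bytes of GPU memory set aside for model state
    chunks: dict = field(default_factory=dict)
    gpu_used: int = 0

    def add(self, chunk_id, num_bytes):
        # Assumed policy: prefer GPU memory, spill to CPU memory when full.
        device = "gpu" if self.gpu_used + num_bytes <= self.gpu_budget else "cpu"
        if device == "gpu":
            self.gpu_used += num_bytes
        self.chunks[chunk_id] = Chunk(chunk_id, num_bytes, device)
        return device

    def move(self, chunk_id, device):
        # Moving relocates the chunk; no second copy is left behind.
        chunk = self.chunks[chunk_id]
        if chunk.device == device:
            return
        if device == "gpu":
            if self.gpu_used + chunk.num_bytes > self.gpu_budget:
                raise MemoryError("GPU budget exceeded")
            self.gpu_used += chunk.num_bytes
        else:
            self.gpu_used -= chunk.num_bytes
        chunk.device = device

    def total_bytes(self):
        # Every chunk is counted once, wherever it currently lives.
        return sum(c.num_bytes for c in self.chunks.values())

store = UnifiedStore(gpu_budget=3 * 64)
placements = [store.add(i, 64) for i in range(5)]
```

Because a chunk has a single `device` field, the combined footprint is always the sum of the chunk sizes, which is the property the unified view is meant to guarantee.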
ZeRO‑Cache Memory Manager
Standard PyTorch allocators struggle with ultra‑large models, leading to high fragmentation and performance drops. ZeRO‑Cache’s Contiguous Memory layer sits atop the PyTorch allocator, handling allocation and release in a unified pool, which yields smoother scaling and higher throughput.
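To show why a unified pool helps with fragmentation, here is a minimal sketch of a pool allocator over one pre-reserved region. The class name `ContiguousPool`, the first-fit policy, and the hole-merging step are assumptions for illustration; the source does not describe how the Contiguous Memory layer is implemented.

```python
class ContiguousPool:
    """First-fit allocator over one pre-reserved region, tracked as offsets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free = [(0, capacity)]   # sorted list of (offset, size) holes
        self.used = {}                # offset -> size of live allocations

    def alloc(self, size):
        for i, (off, hole) in enumerate(self.free):
            if hole >= size:
                if hole == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, hole - size)
                self.used[off] = size
                return off
        raise MemoryError(f"no contiguous hole of {size} bytes")

    def release(self, off):
        size = self.used.pop(off)
        self.free.append((off, size))
        self.free.sort()
        # Merge adjacent holes so that large requests can be served again.
        merged = [self.free[0]]
        for o, s in self.free[1:]:
            last_o, last_s = merged[-1]
            if last_o + last_s == o:
                merged[-1] = (last_o, last_s + s)
            else:
                merged.append((o, s))
        self.free = merged

pool = ContiguousPool(capacity=1024)
a = pool.alloc(256)
b = pool.alloc(256)
c = pool.alloc(512)
pool.release(a)
pool.release(b)
d = pool.alloc(512)   # succeeds only because the two freed holes were merged
```

Without the merge step, the two released 256-byte holes would stay separate and the final 512-byte request would fail even though 512 bytes are free in total.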
Pipeline Optimizer
ZeRO‑Infinity updates parameters on the CPU, which leaves the GPU idle during the update. AngelPTM pipelines the H2D/D2H transfers, parameter updates, and optimizer state copies so that the CPU and GPU can update parameters at the same time and neither sits idle waiting for the other.
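The benefit of pipelining can be estimated with a simple timing model. The sketch below assumes three stages per chunk (H2D copy, update, D2H copy) that each run on their own resource with a constant cost; the stage times are made-up numbers, not measurements from AngelPTM.

```python
def serial_time(n_chunks, h2d, update, d2h):
    # Each chunk finishes all three stages before the next chunk starts.
    return n_chunks * (h2d + update + d2h)

def pipelined_time(n_chunks, h2d, update, d2h):
    # Standard pipeline bound: fill the pipe once, then the slowest stage
    # paces every remaining chunk.
    stages = (h2d, update, d2h)
    return sum(stages) + (n_chunks - 1) * max(stages)

n = 8
serial = serial_time(n, h2d=2.0, update=3.0, d2h=2.0)
piped = pipelined_time(n, h2d=2.0, update=3.0, d2h=2.0)
```

With these illustrative costs the serial schedule takes 56 time units and the pipelined one 28, so the gain grows as the stage costs become more balanced.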
Heterogeneous Adafactor Optimizer
AngelPTM’s custom Adafactor runs on both CPU and GPU, cutting optimizer‑state storage by roughly 33 % while improving training accuracy.
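One accounting that produces a saving of about one third is sketched below. It assumes the optimizer keeps an fp32 master weight, an fp32 momentum, and a second-moment estimate per parameter, and that Adafactor replaces the full second-moment matrix with one row vector and one column vector. The source does not say how AngelPTM arrives at its figure, so treat the layout and the 8192 × 8192 weight shape as assumptions.

```python
def adam_second_moment_elems(rows, cols):
    # Adam keeps one second-moment value per parameter.
    return rows * cols

def adafactor_second_moment_elems(rows, cols):
    # Adafactor keeps one row vector and one column vector of statistics.
    return rows + cols

def optimizer_bytes_per_param(second_moment_ratio):
    # Assumed layout: fp32 master weight + fp32 momentum + fp32 second moment.
    return 4 + 4 + 4 * second_moment_ratio

rows, cols = 8192, 8192
ratio = adafactor_second_moment_elems(rows, cols) / adam_second_moment_elems(rows, cols)
adam_bytes = optimizer_bytes_per_param(1.0)
adafactor_bytes = optimizer_bytes_per_param(ratio)
saving = 1 - adafactor_bytes / adam_bytes
```

Under these assumptions the per-parameter state drops from 12 bytes to just over 8 bytes, which is a reduction of close to one third.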
Multi‑Stream Asynchrony
Training large models involves heavy compute and communication (GPU kernels, H2D/D2H transfers, NCCL multi‑node traffic). AngelPTM employs multiple streams to overlap these operations, uses a time‑synchronised prefetch mechanism for parameters, and adopts a multi‑buffer strategy for gradient post‑processing, dramatically reducing idle time.
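The prefetch pattern can be sketched without a GPU. In the example below a single worker thread stands in for a dedicated copy stream, and `fetch_params` and `compute` are hypothetical placeholders for an H2D copy and a layer's kernels. It shows only the ordering (issue the next copy before computing the current layer), not AngelPTM's actual scheduling.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_params(layer):
    # Stand-in for an H2D copy issued on a dedicated copy stream.
    return [float(layer)] * 4

def compute(layer, params):
    # Stand-in for the layer's GPU kernels.
    return layer, sum(params)

def run_with_prefetch(num_layers):
    results = []
    with ThreadPoolExecutor(max_workers=1) as copy_stream:
        pending = copy_stream.submit(fetch_params, 0)
        for layer in range(num_layers):
            params = pending.result()          # blocks only if the copy is late
            if layer + 1 < num_layers:
                # Start the next copy before computing, so the two overlap.
                pending = copy_stream.submit(fetch_params, layer + 1)
            results.append(compute(layer, params))
    return results

outputs = run_with_prefetch(3)
```

In a PyTorch setting the worker thread's role would be played by separate CUDA streams, with events used to make the compute stream wait for a copy only when it has not finished yet.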
SSD Tiered Storage
To further lower cost, ZeRO‑Cache adds an SSD tier as a third storage level. fp16 parameters and gradients reside in DRAM, while the SSD holds compressed optimizer states. This design mitigates the bandwidth gap between GPU compute and SSD I/O, preserving training speed even for trillion‑parameter models.
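The source says only that the optimizer states on SSD are compressed. One simple form of compression, assumed here for illustration, is storing them in half precision, which halves the bytes that cross the SSD link. The helper names `spill_fp16` and `load_fp16` and the file name are hypothetical.

```python
import os
import struct
import tempfile

def spill_fp16(values, path):
    # Pack as IEEE half precision ("e"): 2 bytes per value instead of 4.
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}e", *values))

def load_fp16(path, count):
    # Read the values back; they come back as Python floats.
    with open(path, "rb") as f:
        return list(struct.unpack(f"<{count}e", f.read()))

state = [0.5, -1.25, 3.0, 0.001]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "optimizer_state.bin")
    spill_fp16(state, path)
    size_on_disk = os.path.getsize(path)
    restored = load_fp16(path, len(state))
```

The round trip is lossy: values that are not exactly representable in half precision, such as 0.001 here, come back slightly rounded, which is the trade-off for halving the I/O volume.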
Performance Evaluation
Benchmarks were run on identical environments (OS, Python, CUDA, cuDNN, PyTorch). The community baseline used DeepSpeed 0.8.1 and Megatron‑DeepSpeed 7212b58; AngelPTM used version 0.6.1. Results:
Single‑node A100‑40G model capacity increased by 94.71 %.
Overall training speed improved by 44.42 % compared with the community stack.
For a 100‑billion‑parameter model, multi‑node scaling approached linear speedup; a 4‑node 32‑GPU run achieved a 26.8 % speedup, limited by 100 Gbps RDMA network bandwidth.
Future hardware upgrades (higher‑bandwidth RDMA clusters) are expected to close the remaining gap.
Conclusion and Outlook
AngelPTM has been integrated into the TACO Train suite, delivering substantial gains in model size limits and training throughput. Upcoming features include the TCCL collective communication library and dynamic compilation support, promising even higher efficiency for large‑scale AI workloads.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
