
Optimizing Large-Scale Model Training with Tencent's AngelPTM and ZeRO-Cache

This article presents Tencent's latest advancements in large‑scale model training, detailing the AngelPTM framework and its ZeRO‑Cache optimization techniques that reduce memory and storage costs, improve hardware utilization, and achieve high‑performance training for trillion‑parameter AI models across various applications.

Tencent Advertising Technology

The paper introduces Tencent's "one platform, two models" strategy, highlighting the release of the Hunyuan trillion‑parameter AI model and the AngelPTM training framework built on the Taiji machine‑learning platform. AngelPTM can fit a 55B‑parameter model on a single node and scale to trillion‑parameter models across 20 A100‑40G nodes, saving 45% of training resources while doubling training speed.

It reviews the rapid growth of Transformer‑based models—from BERT (340M) to GPT‑3 (175B) and beyond—while noting that GPU memory growth has lagged far behind model size, creating severe storage and bandwidth bottlenecks for large‑model training.
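The scale of this gap can be illustrated with the common accounting for mixed‑precision Adam training, which keeps roughly 16 bytes of model state per parameter (fp16 parameters and gradients, plus fp32 master weights, momentum, and variance). A back‑of‑the‑envelope sketch, not taken from the paper:

```python
# Rough model-state memory for mixed-precision Adam training:
# fp16 params (2 B) + fp16 grads (2 B) + fp32 master params (4 B)
# + fp32 momentum (4 B) + fp32 variance (4 B) = 16 bytes per parameter.
BYTES_PER_PARAM = 16

def model_state_gib(num_params: float) -> float:
    """Model-state memory in GiB (excludes activations and temp buffers)."""
    return num_params * BYTES_PER_PARAM / 2**30

for name, params in [("BERT-Large", 340e6), ("GPT-3", 175e9), ("1T model", 1e12)]:
    print(f"{name:10s} ~{model_state_gib(params):,.0f} GiB of model state")
```

By this estimate GPT‑3 alone needs on the order of 2,600 GiB of model state, versus 40 GiB on a single A100‑40G, which is why partitioning and offloading become unavoidable.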

To address these challenges, the paper describes the ZeRO‑Cache system, which partitions model parameters, gradients, and optimizer states across GPUs, offloads them to CPU memory, and optionally spills to SSD. By presenting a unified view of host and GPU memory, ZeRO‑Cache maximizes usable storage, eliminates redundant copies, and mitigates fragmentation through a contiguous‑memory manager.
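The core partitioning idea can be sketched in a few lines. This is a toy illustration of ZeRO‑style sharding, not AngelPTM's actual implementation; the `partition` helper and its padding scheme are assumptions for the example:

```python
import numpy as np

def partition(flat_states: np.ndarray, world_size: int) -> list:
    """ZeRO-style partitioning: each rank owns one contiguous shard of the
    flattened model states instead of holding a full replica, so aggregate
    memory cost per rank drops roughly by a factor of world_size."""
    padded_len = -(-len(flat_states) // world_size) * world_size  # ceil-pad
    buf = np.zeros(padded_len, dtype=flat_states.dtype)
    buf[: len(flat_states)] = flat_states
    return np.split(buf, world_size)

# Optimizer state for 10 parameters, sharded across 4 ranks:
states = np.arange(10, dtype=np.float32)
shards = partition(states, 4)
print([len(s) for s in shards])  # [3, 3, 3, 3]
```

In a real system each shard would then be pinned in host memory (or staged on SSD) and moved to the GPU only when its owning rank needs it.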

Key optimization techniques include multi‑level storage bandwidth balancing, elimination of redundant model‑state copies, a custom contiguous‑memory allocator, chunk‑based communication, pipeline optimization for overlapping computation and data transfer, a heterogeneous Adafactor optimizer, and multi‑stream asynchronous execution that keeps GPUs, CPUs, and PCIe links fully utilized.
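Chunk‑based communication, one of the techniques listed above, amortizes per‑transfer overhead by batching many small tensors into a few fixed‑size buffers. A minimal sketch of the packing step, with the function name and padding choice being illustrative assumptions rather than AngelPTM's API:

```python
def pack_into_chunks(tensors, chunk_size):
    """Pack many small tensors into fixed-size chunks so that PCIe transfers
    and collectives operate on a few large buffers instead of many tiny ones,
    which keeps per-call launch/latency overhead from dominating bandwidth."""
    flat = [x for t in tensors for x in t]
    chunks = []
    for i in range(0, len(flat), chunk_size):
        chunk = flat[i : i + chunk_size]
        chunk += [0.0] * (chunk_size - len(chunk))  # zero-pad the last chunk
        chunks.append(chunk)
    return chunks

small_tensors = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0, 7.0]]
print(pack_into_chunks(small_tensors, 4))  # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 0.0]]
```

The same chunked layout also makes it straightforward to overlap transfer of one chunk with computation on another, which is the essence of the pipeline optimization described above.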

Experimental results on a 4‑node A100‑40G cluster with a 1.6 Tb/s RDMA network show that ZeRO‑Cache achieves higher TFLOPS than competing frameworks across model sizes, enabling training of a 175B GPT‑3‑scale model on only 32 GPUs, where other frameworks fail.

Business applications demonstrate that the Hunyuan AI models have been deployed in WeChat, QQ, gaming, advertising, and cloud services, delivering significant GMV growth and validating the commercial viability of large‑model deployment.

The authors conclude with a roadmap to further lower training costs, incorporate model‑parallel and pipeline‑parallel strategies, and apply lossless compression for SSD‑based storage, continuing the push toward ever larger and more efficient AI models.

Tags: Memory Optimization · Distributed Training · AI Models · AngelPTM · Large-Scale Training · ZeRO-Cache
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
