
Four‑Minute ImageNet Training: Tencent’s AI Platform Sets a New World Record

Tencent’s intelligent machine‑learning platform set a world record by training AlexNet in 4 minutes and ResNet‑50 in 6.6 minutes on ImageNet, using large batch sizes, mixed‑precision training, LARS optimization, hierarchical synchronization, gradient fusion, and pipelined I/O to overcome accuracy and scalability challenges.

Tencent Architect

Background: Recent advances in deep learning have dramatically reduced ImageNet error rates, but training large models such as AlexNet and ResNet‑50 still requires hours to days on conventional hardware. Tencent’s intelligent machine‑learning platform team (referred to as the “Smart Team”) collaborated with Hong Kong Baptist University to address the challenges of large‑batch convergence, multi‑node scalability, and hyper‑parameter tuning.

Machine‑Learning Training Landscape: Training speed is limited by massive data volumes, increasingly complex network architectures, and huge parameter counts. Traditional training on a single NVIDIA M40 GPU can take 14 days for ResNet‑50, highlighting the need for faster, more efficient methods.

Key Challenges: (1) Large batch sizes cause accuracy loss because reduced stochasticity makes gradient descent behave like full‑gradient descent. (2) Data‑parallel distributed training suffers from parameter‑server bottlenecks and inefficient All‑Reduce communication at scale. (3) Hyper‑parameter search is costly, especially for massive datasets like ImageNet.

Critical Technologies Implemented:

Mixed‑precision training combined with the Layer‑wise Adaptive Rate Scaling (LARS) algorithm to maintain convergence while using 16‑bit floats, supplemented by loss‑scaling to avoid underflow.
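The per‑layer scaling at the heart of LARS, and the loss‑scaling trick that keeps FP16 gradients from underflowing, can be sketched in a few lines. This is a minimal illustration, not Tencent’s implementation; the function names and default hyper‑parameters are assumptions for the example.

```python
import math

def lars_update(weights, grads, lr, eta=0.001, weight_decay=5e-4, eps=1e-9):
    """One LARS step for a single layer: the global learning rate is
    rescaled per layer by a trust ratio eta * ||w|| / (||g|| + wd * ||w||),
    so layers with small gradients are not starved and layers with large
    gradients do not diverge at huge batch sizes."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    trust_ratio = eta * w_norm / (g_norm + weight_decay * w_norm + eps)
    return [w - lr * trust_ratio * (g + weight_decay * w)
            for w, g in zip(weights, grads)]

def unscale_grads(grads, loss_scale=1024.0):
    """Loss scaling: the loss is multiplied by loss_scale before the FP16
    backward pass so tiny gradients stay representable; the factor is
    divided back out here before the weight update."""
    return [g / loss_scale for g in grads]
```

In practice the trust ratio is computed once per layer per step, which is cheap relative to the forward/backward passes.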

Model and parameter improvements, including selective weight regularization and adding BatchNorm after Pool5 to stabilize feature distributions.
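The stabilizing effect of a BatchNorm layer is just per‑batch normalization to zero mean and unit variance. A minimal, framework‑free sketch (the learnable scale/shift parameters of a real BatchNorm layer are omitted here):

```python
def batch_norm(activations, eps=1e-5):
    """Normalize a batch of scalar activations to zero mean and unit
    variance -- the distribution-stabilising step applied after Pool5."""
    n = len(activations)
    mean = sum(activations) / n
    var = sum((a - mean) ** 2 for a in activations) / n
    return [(a - mean) / (var + eps) ** 0.5 for a in activations]
```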

Hierarchical synchronization and optimized Ring All‑Reduce (layered grouping, gradient fusion, and GPU Direct RDMA) to achieve near‑linear scaling on clusters of 1024+ GPUs.
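Gradient fusion packs many small per‑layer gradients into one flat buffer so that a single large All‑Reduce replaces many latency‑bound small ones. The sketch below simulates the idea in plain Python (the real system would operate on GPU tensors over NCCL/RDMA; the helper names here are assumptions):

```python
def fuse(tensors):
    """Concatenate per-layer gradient lists into one flat buffer,
    remembering the layer sizes so the result can be split back out."""
    flat, shapes = [], []
    for t in tensors:
        shapes.append(len(t))
        flat.extend(t)
    return flat, shapes

def unfuse(flat, shapes):
    """Split a fused buffer back into per-layer gradients."""
    out, i = [], 0
    for n in shapes:
        out.append(flat[i:i + n])
        i += n
    return out

def allreduce_mean(worker_buffers):
    """Stand-in for ring All-Reduce: element-wise mean across workers.
    A real ring implementation would move each element around the ring
    in 2*(N-1) steps with no central bottleneck."""
    n = len(worker_buffers)
    return [sum(vals) / n for vals in zip(*worker_buffers)]
```

Fusing before the All‑Reduce amortizes the fixed per‑message cost, which is exactly what hurts small tensors at 1024+ GPU scale.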

Pipeline I/O mechanism with lock‑free queues and prefetching to hide data loading latency.
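The pipelined I/O idea is to decouple data loading from computation with a producer thread and a bounded queue, so the consumer (the GPU step) rarely blocks on disk. A minimal sketch, using a stdlib locking queue where the article describes lock‑free queues:

```python
import threading
import queue

def prefetch(batch_source, capacity=4):
    """Run data loading in a background thread; a bounded queue hides
    I/O latency and provides back-pressure when the consumer is slow."""
    q = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batch_source:
            q.put(batch)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is done:
            return
        yield batch
```

The same pattern generalizes to multi‑stage pipelines (read, decode, augment, copy to device), each stage overlapping with the next.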

Systematic hyper‑parameter tuning strategies such as coarse‑to‑fine step sizes, low‑precision tuning, and progressive initialization.
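Coarse‑to‑fine search evaluates a sparse grid first, then narrows the range around the best candidate, using far fewer trials than one dense grid. A minimal one‑dimensional sketch (the function name and round counts are illustrative, not the team’s actual procedure):

```python
def coarse_to_fine(objective, lo, hi, points=5, rounds=3):
    """Evaluate `points` evenly spaced candidates in [lo, hi], then
    shrink the interval around the best one and repeat."""
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        candidates = [lo + i * step for i in range(points)]
        best = min(candidates, key=objective)
        lo, hi = max(lo, best - step), min(hi, best + step)
    return best
```

With 3 rounds of 5 points this costs 15 evaluations, versus 125 for a dense grid of comparable final resolution.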

Results: The platform trained AlexNet to baseline accuracy in 4 minutes and ResNet‑50 in 6.6 minutes on ImageNet with batch size 64K, surpassing previous best records (15 min for ResNet‑50, 11 min for AlexNet). Scaling experiments showed ~99 % efficiency on 1024 GPUs and ~97 % on 2048 GPUs.

Platform Value: By dramatically reducing training time, the system enables rapid model iteration for AI services such as game AI, computer vision, and large‑scale data analytics. Future work includes extending acceleration to inference, integrating AutoML for automated hyper‑parameter search, and providing end‑to‑end managed services for training, deployment, and model hosting.

Acknowledgements: The authors thank collaborators from TEG’s Architecture and Operations teams, as well as Prof. Chu Xiaowen’s group at Hong Kong Baptist University, for their contributions to this breakthrough.

Tags: deep learning, distributed training, AI acceleration, Mixed Precision, ImageNet, Large Batch Training
Written by Tencent Architect

We share insights on storage, computing, networking and explore leading industry technologies together.