Tag: Large-Scale Training


DataFunSummit
Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability, from early manual operations on small GPU clusters to the fault-tolerant infrastructure required for thousand-GPU and ten-thousand-GPU training clusters, detailing Baidu Baige's stability metrics, monitoring, eBPF-based diagnostics, and checkpoint strategies that reduce wasted training time and accelerate fault recovery.

AI training · Large-Scale Training · checkpointing
22 min read
DataFunTalk
Apr 3, 2023 · Artificial Intelligence

Large‑Scale Recommendation System Training with TorchRec and Dynamic Embedding

This article explains how Tencent’s AI team leverages the PyTorch‑based TorchRec library and a custom dynamic embedding solution to train billion‑scale recommendation models efficiently, detailing the benefits of TorchRec, GPU embedding, optimized kernels, embedding partition strategies, experimental results, and practical deployment guidance.

Dynamic Embedding · GPU Embedding · Large-Scale Training
15 min read
DataFunSummit
Apr 2, 2023 · Artificial Intelligence

Efficient Training of Large Models with the Open‑Source Distributed Framework Easy Parallel Library (EPL)

This article introduces the challenges of scaling deep‑learning model training, explains the design and components of the open‑source Easy Parallel Library (EPL) that unifies data, pipeline, and operator‑split parallelism, and demonstrates its best‑practice results on large‑scale classification, BERT‑large, and massive multimodal models.

EPL · Large-Scale Training · Parallelism
15 min read
Tencent Advertising Technology
Mar 10, 2023 · Artificial Intelligence

Optimizing Large-Scale Model Training with Tencent's AngelPTM and ZeRO-Cache

This article presents Tencent's latest advancements in large‑scale model training, detailing the AngelPTM framework and its ZeRO‑Cache optimization techniques that reduce memory and storage costs, improve hardware utilization, and achieve high‑performance training for trillion‑parameter AI models across various applications.

AI models · AngelPTM · Large-Scale Training
14 min read
DataFunSummit
Sep 9, 2022 · Artificial Intelligence

Wuliang: Tencent's Deep Learning Framework for Real‑Time Large‑Scale Recommendation

The presentation by Tencent expert Yuan Yi details the Wuliang deep learning system for recommendation, covering its background, technical challenges such as massive data and real‑time requirements, the parameter‑server based solutions for training and inference, model compression techniques, and continuous online deployment strategies.

Large-Scale Training · Recommendation systems · deep learning
14 min read
DataFunSummit
Feb 10, 2022 · Artificial Intelligence

Baidu's PGL2.2: A Graph Neural Network Framework, Techniques, and Real‑World Applications

This article introduces Baidu's PGL2.2 graph learning platform, explains graph modeling and message‑passing GNN techniques, details training strategies for small, medium and large graphs, showcases node classification and link‑prediction methods, and describes how the framework is applied in search, recommendation, risk control, and knowledge‑graph competitions.

Graph Neural Networks · Knowledge Graphs · Large-Scale Training
15 min read
Ctrip Technology
Apr 9, 2021 · Artificial Intelligence

Algorithm Optimization for Hotel Recommendation and Large‑Scale Discrete DNN Training at Ctrip

This article describes how Ctrip improved hotel recommendation by iterating from logistic regression to GBDT and deep neural networks, designing continuous and discrete features, adopting multi‑task learning with click and conversion signals, and building a large‑scale distributed DNN training and unified feature‑processing framework to boost model accuracy and engineering efficiency.

Ctrip · DNN · Feature Engineering
15 min read