Tagged articles
164 articles
Baidu Geek Talk
Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Cutting‑Edge Acceleration Techniques

This article systematically examines the major performance bottlenecks in AI model training, explains the underlying hardware and software causes, and presents a comprehensive set of acceleration strategies—including data‑loading optimizations, compute‑side enhancements, communication tricks, and the AIAK‑Training suite—backed by real‑world case studies and quantitative results.

AI training · AIAK-Training · Distributed Training
0 likes · 33 min read
Baidu Intelligent Cloud Tech Hub
Dec 22, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Acceleration Techniques

This article systematically analyzes the main performance bottlenecks in AI model training, explains why acceleration is essential, and presents current hardware‑ and software‑based solutions—including data‑loading optimizations, operator fusion, mixed‑precision and Tensor Core usage, as well as distributed communication strategies—followed by real‑world case studies of Baidu's AIAK‑Training suite that demonstrate significant speed‑ups.

AI training · Distributed Training · GPU Acceleration
0 likes · 31 min read
vivo Internet Technology
Oct 9, 2022 · Artificial Intelligence

vivo Machine Learning Platform: Architecture Design and Practice

vivo’s machine‑learning platform, built for its massive app‑store and e‑commerce ecosystem, streamlines data processing, model training, and deployment through quota‑based resource management, a custom ultra‑large‑scale TensorFlow‑vlps framework, OpenAPI‑driven training, and Jupyter‑integrated interactive development, boosting efficiency for billions of samples and features.

Distributed Training · MLOps · Machine Learning Platform
0 likes · 12 min read
Bilibili Tech
Aug 30, 2022 · Artificial Intelligence

Reinforcement Learning in Neural MMO: Background, Environment, Competition Solution, and Insights

The article reviews reinforcement learning applied to Neural MMO—a large‑scale, multi‑agent MMO environment—detailing its competitive IJCAI 2022 track, the winning LastOrder solution with transformer‑CNN‑LSTM architecture, reward shaping, a Fictitious Self‑Play meta‑solver, and Bilibili’s scalable Newton training framework.

AI in Games · Distributed Training · Meta Solver
0 likes · 9 min read
DataFunTalk
Jul 29, 2022 · Artificial Intelligence

Tencent Music Cloud‑Native One‑Stop Machine Learning Platform: Features and Future Roadmap

This article introduces Tencent Music's cloud‑native, one‑stop machine learning platform, detailing its engineering workflow, distributed acceleration, inference closed‑loop, edge computing capabilities, and future plans, while highlighting challenges of traditional ML pipelines and the platform's solutions for resource orchestration, storage, scheduling, and GPU utilization.

AI Platform · Distributed Training · Pipeline
0 likes · 17 min read
DataFunTalk
Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Distributed Training · Spark
0 likes · 13 min read
Laiye Technology Team
Jul 22, 2022 · Cloud Native

Distributed Training Orchestration and Scheduling on Kubernetes: Architecture, Challenges, and Solutions

This article examines the pain points of distributed training orchestration and scheduling, presents a layered cloud‑native architecture built on Kubernetes, explains key components such as pipeline orchestrators, training job operators, schedulers, and topology managers, and discusses practical solutions using Argo, Kubeflow Pipelines, and the Volcano scheduler.

Distributed Training · Kubernetes · ML Platform
0 likes · 38 min read
Youzan Coder
Jul 11, 2022 · Artificial Intelligence

How Contrastive Learning Revolutionizes Product Term Prediction in E‑commerce

By leveraging contrastive learning and large‑scale click‑through data, the article details a dual‑tower model that encodes product titles and queries, explains loss functions, batch‑negative sampling, distributed training tricks, and demonstrates how this approach outperforms traditional NER for product term and category prediction.

Distributed Training · E-commerce AI · InfoNCE
0 likes · 16 min read
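
As a rough illustration of the batch-negative contrastive objective mentioned in the summary above, the sketch below computes an InfoNCE loss in plain Python. It assumes that row i of the query embeddings matches row i of the title embeddings and that every other title in the batch acts as a negative. The function name `info_nce_loss`, the dot-product similarity, and the `temperature` default are illustrative assumptions, not details taken from the article.

```python
import math


def info_nce_loss(query_vecs, title_vecs, temperature=0.1):
    """Mean InfoNCE loss with in-batch negatives (hypothetical helper)."""
    if len(query_vecs) != len(title_vecs):
        raise ValueError("expected exactly one title per query")
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every title in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in title_vecs]
        # Numerically stable log-sum-exp over the whole batch (the softmax denominator).
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching title (index i).
        total += log_denom - logits[i]
    return total / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each query is most similar to its own title and grows when the pairs are mismatched, which is the signal a dual-tower model is trained on.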
DataFunTalk
Jul 8, 2022 · Artificial Intelligence

Tencent's Wuliang Deep Learning System for Large‑Scale Recommendation: Architecture, Challenges, and Solutions

This article presents an in‑depth overview of Tencent's Wuliang deep learning platform for recommendation systems, detailing the real‑time data challenges, high‑throughput requirements, parameter‑server architecture, model compression techniques, multi‑level caching, and answers to common technical questions.

Distributed Training · Inference Service · Parameter Server
0 likes · 14 min read
Baidu Geek Talk
Jul 6, 2022 · Artificial Intelligence

Why Training Massive AI Models Demands New Cluster Architectures and Parallelism Strategies

The article examines the industry trend toward ever‑larger AI models, compares their parameter scale to the human brain, outlines the computational and memory challenges of training such models, and details advanced parallelism techniques and Baidu's high‑performance cluster solutions that enable efficient, stable large‑scale model training.

AI Infrastructure · Baidu · Cluster Computing
0 likes · 17 min read
Alimama Tech
Jun 22, 2022 · Artificial Intelligence

Graph Deep Learning: Methods, Frameworks, and Industrial Applications

Graph deep learning, extending deep models to irregular graph data via spatial and spectral GNNs such as GCN, GAT, and GraphSAGE, has matured into frameworks like Alibaba’s open‑source Euler, which scales to billions of nodes, powers a heterogeneous query‑item‑ad graph for search advertising, and demonstrably boosts click‑through rates by over 1.5%.

Distributed Training · Euler framework · graph embeddings
0 likes · 17 min read
Tencent Cloud Developer
May 12, 2022 · Backend Development

Practical Guide to PyTorch Distributed Training: DP, DDP, Groups, and IO Considerations

This guide explains PyTorch’s distributed training, contrasting single‑node DataParallel with multi‑node DistributedDataParallel, detailing essential parameters, group communication setup, proper use of DistributedSampler for data loading, handling IO bottlenecks, and avoiding common pitfalls such as memory imbalance, unsynchronized buffers, and unused‑parameter errors.

DDP · DataParallel · Distributed Training
0 likes · 15 min read
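
To make the data-loading point in the summary above concrete, here is a minimal pure-Python sketch of the pad-then-stride sharding strategy that a distributed sampler uses so that each rank reads a disjoint slice of the dataset. The helper name `shard_indices` and its parameters are hypothetical, and the shuffle uses Python's `random` module, so it does not reproduce the exact permutation that PyTorch's `DistributedSampler` would generate.

```python
import math
import random


def shard_indices(dataset_len, world_size, rank, epoch=0, seed=0, shuffle=True):
    """Return the sample indices one rank reads in one epoch (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    indices = list(range(dataset_len))
    if shuffle:
        # All ranks seed identically, so they agree on one global permutation;
        # folding in the epoch changes the order from epoch to epoch.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by wrapping around so every rank receives the same number of samples.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    if dataset_len > 0:
        repeats = math.ceil(total / dataset_len)
        indices = (indices * repeats)[:total]
    # Strided slice: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank::world_size]


shards = [shard_indices(10, world_size=4, rank=r, shuffle=False) for r in range(4)]
print(shards)
```

In PyTorch itself this role is played by `torch.utils.data.distributed.DistributedSampler`, whose `set_epoch` method should be called at the start of every epoch so that the shuffle order changes between epochs.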
DataFunSummit
May 7, 2022 · Artificial Intelligence

Advances in Click‑Through Rate Prediction: Model Evolution, Feature Interaction, Continuous Feature Embedding, and Distributed Training

This article reviews the development of CTR prediction models from early collaborative‑filtering methods to modern deep‑learning approaches, discusses core challenges such as feature interaction and continuous‑feature embedding, introduces recent Huawei solutions like AutoDis and ScaleFreeCTR for efficient large‑embedding training, and outlines future research directions.

Distributed Training · Embedding · Recommendation Systems
0 likes · 21 min read
DataFunSummit
Apr 26, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This talk describes Huya’s elastic distributed training system, covering the motivation behind elasticity, its design using Kubernetes and ETCD for dynamic node registration and scaling, implementation details of the EFDL framework, performance evaluations on ResNet‑50, and the resulting benefits and future directions.

AI Platform · Distributed Training · GPU scheduling
0 likes · 11 min read
DataFunTalk
Apr 14, 2022 · Artificial Intelligence

PaddlePaddle Deep Learning Platform: Architecture, Core Technologies, and Real‑World Applications

The article presents a comprehensive overview of Baidu's open‑source deep learning platform PaddlePaddle, detailing its full‑stack architecture, core technologies such as unified dynamic‑static graph, large‑scale distributed training, multi‑platform inference, an extensive model zoo, hardware adaptation, and showcases a real‑world deployment case in power‑grid monitoring.

AI Framework · Distributed Training · Inference Engine
0 likes · 15 min read
DataFunSummit
Apr 7, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Cloud Vertex AI: Fast Socket and Reduction Server

This article explains how Google Cloud Vertex AI improves large‑scale distributed machine learning training performance by addressing the memory‑wall challenge with Fast Socket network stack enhancements for NCCL and a Reduction Server that accelerates gradient aggregation, delivering higher throughput and lower TCO for AI workloads.

Cloud AI · Distributed Training · Fast Socket
0 likes · 19 min read
Volcano Engine Developer Services
Mar 16, 2022 · Artificial Intelligence

How veGiantModel Boosts Large Language Model Training Up to 6.9× Faster

The article introduces Volcano Engine's veGiantModel, a high‑performance large‑model training framework built on PyTorch, Megatron and DeepSpeed, details its distributed parallel strategies, hardware setups, benchmark results showing up to 6.9× speedup over Megatron and DeepSpeed, and provides open‑source links for further use.

ByteCCL · Distributed Training · Large Language Models
0 likes · 6 min read
JD Retail Technology
Jan 24, 2022 · Artificial Intelligence

Galileo: An Open‑Source Scalable Graph Deep Learning Framework for Industrial‑Scale Applications

Galileo is an open‑source, distributed graph deep‑learning framework that supports ultra‑large heterogeneous graphs, dual TensorFlow/PyTorch back‑ends, and a flexible API, enabling fast prototyping of graph neural networks such as HeteSAGE for real‑world recommendation and other AI scenarios.

AI Framework · Distributed Training · Galileo
0 likes · 11 min read
Code DAO
Dec 31, 2021 · Cloud Computing

How to Run Distributed PyTorch Training on AzureML with CLI v2

This article walks through the complete workflow for building, testing, and launching a distributed PyTorch training job on AzureML using the CLI v2, covering local script preparation, Accelerate configuration, Docker environment setup, dataset registration, compute target definition, job YAML creation, and job submission with monitoring.

CLI · Distributed Training · Docker
0 likes · 15 min read
DataFunTalk
Dec 23, 2021 · Artificial Intelligence

Deep Customization and Optimization of TensorFlow for Large-Scale Sparse Training at Meituan

This article details Meituan's internal, heavily customized TensorFlow 1.x implementation that addresses large‑scale sparse parameter support, distributed training challenges, communication bottlenecks, and pipeline optimizations, achieving over ten‑fold scalability improvements and significant per‑node performance gains in recommendation system workloads.

Distributed Training · Sparse Parameters · TensorFlow
0 likes · 32 min read
Code DAO
Dec 17, 2021 · Artificial Intelligence

How to Accelerate XGBoost Training with Tree Methods, Cloud Computing, and Ray

The article explains why XGBoost training can be slow despite its speed focus and presents three acceleration techniques—choosing an optimal tree_method, leveraging cloud resources for larger memory, and using Ray for distributed training—complete with code examples and benchmark results.

Distributed Training · Ray · XGBoost
0 likes · 5 min read
Code DAO
Dec 17, 2021 · Artificial Intelligence

How to Scale XGBoost with Ray for Distributed Multi‑GPU Training

XGBoost‑Ray provides a fault‑tolerant, multi‑node, multi‑GPU backend for XGBoost that integrates seamlessly with Ray Tune, supports distributed data loading, and can be enabled with only three code changes, enabling scalable training and inference on large clusters.

Distributed Training · GPU · Ray
0 likes · 8 min read
Alimama Tech
Dec 15, 2021 · Artificial Intelligence

Scalable Multi-View Ad Retrieval (SMAD): A Graph-Based Framework for E-commerce Advertising

SMAD is a scalable graph‑based ad retrieval framework for e‑commerce search that builds a heterogeneous Query‑Item‑Ad graph, learns multi‑view embeddings with a parallel deep neural network and attention, employs category‑aware sampling for efficient distributed training, and delivers significant gains in offline relevance and online CTR, RPM, and PVR.

Distributed Training · ad retrieval · attention
0 likes · 17 min read
Meituan Technology Team
Dec 9, 2021 · Artificial Intelligence

Deep Customization of TensorFlow for Large-Scale Sparse Training at Meituan

Meituan heavily customized TensorFlow 1.x for large‑scale sparse training, replacing variable embeddings with hash tables, improving load balancing, using RDMA communication, pipeline‑embedding graphs, high‑performance hash tables, and operator merges, achieving over ten‑fold scalability, up to 51% operator speedups, and enabling billions‑parameter models on CPU clusters with future GPU expansion.

Distributed Training · Recommendation Systems · Sparse Parameters
0 likes · 31 min read
DataFunSummit
Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

Data Parallel · Deep Learning · Distributed Training
0 likes · 8 min read
Alibaba Cloud Developer
Aug 17, 2021 · Artificial Intelligence

How Alibaba’s Whale Framework Cuts Large‑Model Training Costs by 80%

Alibaba Cloud’s PAI team and the DAMO Academy introduced the low‑carbon M6 trillion‑parameter multimodal model, demonstrating that their self‑developed Whale framework can train such massive models on just 480 V100 GPUs, reducing energy consumption by over 80% and boosting training efficiency nearly eleven‑fold.

AI · Distributed Training · GPU Optimization
0 likes · 12 min read
Tencent Architect
Jul 29, 2021 · Artificial Intelligence

Performance Optimization of Advertising Coarse‑Ranking Training on the Light Framework

This article analyzes the bottlenecks of advertising coarse‑ranking training on the Light framework and presents a series of optimizations—including parallel data download, thread‑queue buffering, integer‑to‑string conversion with fmt, and zlib replacement with czlib—that together achieve up to 58% QPS improvement and notable CPU efficiency gains.

Advertising · CPU/GPU efficiency · Data Parallelism
0 likes · 11 min read
Kuaishou Tech
Jul 16, 2021 · Artificial Intelligence

Bagua: An Open‑Source Distributed Training Framework for Deep Learning

Bagua is a distributed training framework co‑developed by Kuaishou and ETH Zürich that combines algorithmic and system‑level optimizations—such as decentralized, asynchronous, and compressed communication—to achieve up to 60% higher performance than existing frameworks like PyTorch‑DDP, Horovod, and BytePS across various AI workloads.

Bagua · Deep Learning · Distributed Training
0 likes · 15 min read
360 Tech Engineering
Jul 2, 2021 · Artificial Intelligence

DGL Operator: A Kubernetes‑Native Solution for Distributed Graph Neural Network Training

The article introduces DGL Operator, an open‑source Kubernetes‑based controller that automates the lifecycle of distributed graph neural network training with DGL, explains its terminology, challenges of native DGL distribution, and provides detailed architecture, workflow, and YAML/CLI examples for easy deployment.

AI · DGL · Distributed Training
0 likes · 18 min read
Alibaba Cloud Native
Jun 3, 2021 · Artificial Intelligence

How Weibo Boosted Deep Learning Training Speed 18× with Fluid and JindoRuntime

Weibo's deep learning platform suffered severe latency and stability problems when reading massive small-file datasets through a compute-storage-separated architecture. The team adopted the CNCF Fluid project with JindoRuntime to build a distributed cache exposed through POSIX interfaces, which improved data locality, reduced HDFS load, and delivered up to 18-fold training speedups while raising success rates from 37% to 98%.

Data Caching · Deep Learning · Distributed Training
0 likes · 15 min read
JD Tech
Mar 30, 2021 · Artificial Intelligence

JD Retail's Jiushu Business Analytics Platform: AI‑Driven Solutions for Retail

The article introduces JD Retail's Jiushu Business Analytics Platform, detailing how AI, big‑data, and distributed‑training technologies address fragmented retail scenarios, high deployment barriers, large‑scale application difficulties, and cost concerns through specialized frameworks, fault‑tolerant training, and advanced cluster optimization.

AI · Cluster Management · Distributed Training
0 likes · 12 min read
DataFunTalk
Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · Distributed Training · GPU clustering
0 likes · 17 min read
ITFLY8 Architecture Home
Mar 16, 2021 · Artificial Intelligence

How NetEase Cloud Music Solved Cold‑Start with Large‑Scale Graph Neural Networks

This article explains how NetEase Cloud Music tackled cold‑start recommendation challenges in live streaming by leveraging Baidu's PGL distributed graph learning framework to train massive graph neural networks that transfer user behavior from music domains to live content, achieving significant performance gains.

AI · Distributed Training · Large-Scale Graph
0 likes · 7 min read
DataFunSummit
Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

Distributed Training · GPU cluster · Model Serving
0 likes · 18 min read
JD Retail Technology
Oct 21, 2020 · Artificial Intelligence

Galileo: A Distributed Graph Deep Learning Framework for Large‑Scale Industrial Scenarios

The article introduces Galileo, JD Retail's distributed graph deep‑learning platform that supports heterogeneous and dynamic graphs, ultra‑large scale training, flexible model customization, and seamless integration with TensorFlow and PyTorch, highlighting its architecture, core challenges, built‑in algorithms, and upcoming open‑source release.

AI Platform · Distributed Training · graph embedding
0 likes · 11 min read
360 Tech Engineering
Sep 14, 2020 · Artificial Intelligence

TensorNet: A Distributed Training Framework Optimized for Large-Scale Sparse Feature Models on TensorFlow

TensorNet is a TensorFlow‑based distributed training framework that tackles the challenges of massive data and billions of sparse parameters in advertising and recommendation systems by enabling near‑infinite sparse feature dimensions, drastically reducing synchronization overhead, and delivering up to 35% inference speed improvements.

AI Infrastructure · Distributed Training · TensorFlow
0 likes · 8 min read
DataFunTalk
Jul 1, 2020 · Artificial Intelligence

Architecture and Implementation of Autohome's Machine Learning Platform

The article presents a comprehensive overview of Autohome's one‑stop machine learning platform, detailing its background, architecture, resource scheduling, data processing, model training (including distributed deep learning), deployment, real‑world applications such as purchase‑intent and recommendation models, and future development directions.

AutoML · Deep Learning · Distributed Training
0 likes · 19 min read
Architect
May 29, 2020 · Artificial Intelligence

Integrating Flink with TensorFlow for End-to-End Machine Learning Pipelines

This article explains how to combine the Flink data‑processing engine with TensorFlow to create a unified, end‑to‑end machine‑learning workflow, covering background, challenges, the Flink‑AI‑extended architecture, ML framework and operator abstractions, and both batch and streaming training and prediction modes.

AI integration · Distributed Training · Flink
0 likes · 9 min read
Tencent Cloud Developer
May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s data‑parallel, synchronous Ring All‑Reduce architecture built on MPI and NCCL, cutting training time from several days to under one day while maintaining accuracy, and future work will target I/O and further scaling.

AI · Distributed Training · Horovod
0 likes · 12 min read
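
The Ring All-Reduce pattern named in the summary above can be sketched as a single-process simulation. This is only an illustration of the algorithm's two phases (scatter-reduce, then all-gather) over plain Python lists. The function name `ring_allreduce` and the chunking scheme are assumptions made for this sketch, and real systems such as Horovod delegate the actual transfers to MPI or NCCL rather than running a loop like this.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum all-reduce over a ring of workers (hypothetical helper)."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold vectors of the same length")
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(g) for g in worker_grads]

    def run_phase(chunk_to_send, combine):
        for step in range(n - 1):
            # Snapshot every outgoing message first so all sends happen "simultaneously".
            messages = []
            for rank in range(n):
                lo, hi = bounds[chunk_to_send(rank, step)]
                messages.append(((rank + 1) % n, lo, data[rank][lo:hi]))
            # Deliver each message to the right-hand neighbour on the ring.
            for dst, lo, payload in messages:
                for offset, value in enumerate(payload):
                    data[dst][lo + offset] = combine(data[dst][lo + offset], value)

    # Phase 1, scatter-reduce: afterwards rank r holds the full sum of chunk (r + 1) % n.
    run_phase(lambda rank, step: (rank - step) % n, lambda old, new: old + new)
    # Phase 2, all-gather: circulate the finished chunks until every rank has all of them.
    run_phase(lambda rank, step: (rank + 1 - step) % n, lambda old, new: new)
    return data


result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])
print(result)
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays roughly constant as workers are added, which is the usual reason this pattern is chosen for synchronous data-parallel training.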
360 Quality & Efficiency
Apr 17, 2020 · Artificial Intelligence

Extending APEX for Real Distributed Reinforcement Learning with tf2rl

The article examines the limitations of the single‑machine APEX framework in the tf2rl reinforcement‑learning library, proposes a cross‑machine distributed architecture using middleware such as Redis, compares alternative frameworks like EasyRL, and outlines expected performance gains and future development plans.

APEX · Distributed Training · Reinforcement Learning
0 likes · 5 min read
Tencent Tech
Feb 27, 2020 · Artificial Intelligence

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Deep Learning · Distributed Training · knowledge distillation
0 likes · 14 min read
58 Tech
Dec 20, 2019 · Artificial Intelligence

Deep Learning Platform on Kubernetes: Architecture, Resource Management, Offline Training and Online Inference

The article presents a comprehensive overview of 58.com’s AI platform built on Kubernetes, detailing its layered architecture, resource scheduling, offline training pipelines, debugging environment, distributed TensorFlow/PyTorch training, performance benchmarks, and online inference services, highlighting how the system empowers various business units with scalable AI capabilities.

Distributed Training · Kubernetes · PyTorch
0 likes · 11 min read
Tencent Cloud Developer
Oct 11, 2019 · Cloud Computing

Large-Scale Distributed Reinforcement Learning Solution Based on TKE

The project replaces cumbersome manual management of thousands of heterogeneous CPU and GPU nodes for large‑scale reinforcement learning with a TKE‑based, containerized actor‑learner architecture that automates batch start/stop, provides elastic autoscaling, fault‑tolerant processes, shared model storage, and CI‑driven image deployment, cutting costs by up to two‑thirds while dramatically speeding experiment cycles.

CI/CD · Cloud Native · Distributed Training
0 likes · 14 min read
Alibaba Cloud Developer
Jun 12, 2019 · Artificial Intelligence

How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, Ring AllReduce algorithms, and user‑friendly APIs boost deep‑learning models—like the GreenNet CNN—to 101‑fold speedups on 128 GPUs, dramatically reducing training time from days to under a day.

AI Infrastructure · Deep Learning · Distributed Training
0 likes · 17 min read
360 Tech Engineering
May 10, 2019 · Artificial Intelligence

Distributed Training with MXNet: Data Parallel on Single and Multi‑Node GPUs and Integration with Kubeflow

This article explains how MXNet supports data‑parallel training on single‑machine multi‑GPU and multi‑machine multi‑GPU setups, describes KVStore modes, outlines the worker‑server‑scheduler architecture, and shows how to launch large‑scale distributed training using Kubeflow and the mxnet‑operator.

Data Parallel · Deep Learning · Distributed Training
0 likes · 11 min read
360 Zhihui Cloud Developer
May 9, 2019 · Artificial Intelligence

Master Distributed MXNet Training with Kubeflow: A Step‑by‑Step Guide

Learn how to perform single‑machine multi‑GPU and multi‑node multi‑GPU training with MXNet, understand KVStore modes, configure workers, servers, and schedulers, and deploy large‑scale distributed training on Kubernetes using Kubeflow, including operator installation, task creation, and performance considerations.

Distributed Training · GPU · Kubeflow
0 likes · 11 min read
iQIYI Technical Product Team
Apr 4, 2019 · Artificial Intelligence

Principles, Methodology, and Tools for Machine Learning Performance Optimization

The article presents a systematic, top‑down methodology for machine‑learning performance optimization—covering principles, benchmark‑driven loops, foundational hardware and software checks, profiling tools, throughput and latency metrics, and practical techniques for IO, compute, mixed‑precision, and distributed training to maximize resource utilization.

Compute · Distributed Training · Profiling
0 likes · 22 min read
Alibaba Cloud Developer
Dec 21, 2018 · Artificial Intelligence

X-DeepLearning: Alibaba’s Open‑Source Framework for Large‑Scale Sparse Deep Learning

Alibaba's X‑DeepLearning (XDL) is an open‑source deep‑learning framework optimized for high‑dimensional sparse data, offering industrial‑grade distributed training, built‑in CTR/recommendation algorithms, structured compression, and online learning capabilities, with benchmark results demonstrating superior scalability and performance.

CTR prediction · Deep Learning · Distributed Training
0 likes · 18 min read
MaGe Linux Operations
Nov 22, 2018 · Artificial Intelligence

Accelerating TensorFlow Deep Learning: GPU & Distributed Training Techniques

This article explains how to speed up TensorFlow deep‑learning model training using single‑GPU acceleration, multi‑GPU parallelism, and distributed TensorFlow on Kubernetes, covering device placement, session parameters, synchronous vs asynchronous training modes, and practical code examples to improve performance and scalability.

Deep Learning · Distributed Training · GPU Acceleration
0 likes · 10 min read
Meituan Technology Team
Oct 11, 2018 · Artificial Intelligence

Deploying and Optimizing TensorFlow Serving for High‑Performance CTR Prediction

Meituan’s user‑growth team built a Wide‑Deep CTR prediction model, trained offline with Spark‑generated TFRecords, and deployed it via TensorFlow Serving on YARN, then applied request‑side multithreading, offline one‑hot preprocessing, XLA JIT compilation, and dedicated loading threads to cut end‑to‑end latency from ~18 ms to ~6 ms and eliminate model‑switch spikes.

Distributed Training · Model Deployment · TensorFlow Serving
0 likes · 15 min read
Didi Tech
Jun 8, 2018 · Artificial Intelligence

DiDi PS: High-Performance RDMA-Based Parameter Server for Distributed Deep Learning

DiDi PS is a custom RDMA‑based parameter server that uses a ring topology and optimized ibverbs communication to dramatically accelerate distributed deep‑learning training, consistently outperforming OpenMPI, NCCL2, TensorFlow’s built‑in RDMA, and Horovod while providing more stable and scalable synchronization for massive data workloads.

Allreduce · Distributed Training · Parameter Server
0 likes · 10 min read
Alibaba Cloud Developer
Apr 26, 2018 · Artificial Intelligence

How TensorFlowRS Supercharges Large‑Scale Search & Recommendation with 10×‑100× Speedups

This article describes TensorFlowRS, an Alibaba‑built extension of TensorFlow that tackles the massive compute and sparse‑feature challenges of search, advertising and recommendation by redesigning the parameter server, adding fail‑over, gradient‑compensation, online‑learning support, advanced training modes and visualisation, achieving up to 100× training speedup and improved model quality.

Distributed Training · Online Learning · Parameter Server
0 likes · 16 min read
Meituan Technology Team
Apr 4, 2018 · Artificial Intelligence

Performance Optimization of Distributed TensorFlow for WDL Models at Meituan

Meituan‑Dianping identified data‑pipeline, network, and memory‑arena bottlenecks in distributed TensorFlow training of Wide & Deep recommendation models and resolved them by switching to tf.data pipelines, batching TFRecord reads, increasing MALLOC_ARENA_MAX, and moving embedding lookups to parameter servers, achieving 2–3× speedup and near‑linear scaling on up to 32 GPUs.

AFO · Distributed Training · TensorFlow
0 likes · 12 min read
Architecture Digest
Oct 17, 2017 · Artificial Intelligence

Design and Architecture of the Weibo Deep Learning Platform

This article presents the design, architecture, and operational experience of Weibo's deep learning platform, covering its machine‑learning workflow, control center, distributed training cluster, and online prediction service, and explains how the platform accelerates development and improves business outcomes.

AI · Deep Learning · Distributed Training
0 likes · 17 min read
Alibaba Cloud Developer
Jun 5, 2017 · Artificial Intelligence

Alibaba’s Distributed Training Boosts Neural Machine Translation Speed

Since its debut in 2013, neural machine translation (NMT) has approached human quality, but training remains expensive. In 2017, Alibaba’s team built a distributed NMT system that employs data parallelism, model averaging, BMUF, Downpour SGD, and Ring‑allReduce to cut training time from more than 20 days to a few days while maintaining translation quality.

BMUF · Distributed Training · Downpour SGD
0 likes · 18 min read
MaGe Linux Operations
Apr 19, 2017 · Artificial Intelligence

Accelerate TensorFlow Deep Learning with GPU, Multi‑GPU, and Distributed Training

This article explains how to speed up TensorFlow deep‑learning model training by using a single GPU, configuring session parameters, assigning operations to specific devices, employing multi‑GPU parallelism, and leveraging distributed TensorFlow on Kubernetes, while also discussing synchronous versus asynchronous training modes and practical best practices.

Deep Learning · Distributed Training · GPU Acceleration
0 likes · 11 min read
ITPUB
Sep 6, 2016 · Artificial Intelligence

Deep Learning Platforms: From Google’s DistBelief to Open‑Source MXNet and TensorFlow

The article reviews the evolution, challenges, and commercial and open‑source deep learning platforms—including DistBelief, COTS, Adam, MXNet, TensorFlow, and Petuum—while highlighting real‑world applications such as image recognition, recommendation, sentiment analysis, and crowd monitoring.

AI applications · Distributed Training · GPU Acceleration
0 likes · 10 min read