Tagged articles

distributed training

170 articles · Page 2 of 2

Mar 30, 2023 · Artificial Intelligence

Tencent's Taiji Machine Learning Platform: End-to-End MLOps for Advertising

Tencent’s Taiji machine learning platform, a cloud‑native, distributed parameter‑server system, provides end‑to‑end MLOps for advertising by integrating data ingestion, feature engineering, model training, evaluation, deployment, and monitoring, supporting massive models up to billions of parameters while improving efficiency, scalability, and resource management.

MLOps · Machine Learning Platform · Model Deployment

0 likes · 18 min read

NetEase Smart Enterprise Tech+

Mar 27, 2023 · Artificial Intelligence

How Reinforcement Learning Powers AI Bots in ‘Barbarian Battle 2’

This article details NetEase Zhiji and Dianhun Network's use of reinforcement learning, a distributed training framework, and middleware to create, train, deploy, and iterate AI robots for the game "Barbarian Battle 2", highlighting technical challenges, solutions, and the impact on player experience.

AI bots · Game Development · distributed training

0 likes · 13 min read

Baidu Geek Talk

Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI Infrastructure · Cost Model · GPU clusters

0 likes · 27 min read

Alibaba Cloud Big Data AI Platform

Mar 20, 2023 · Artificial Intelligence

How HybridBackend Supercharged Ximalaya’s Recommendation Engine with GPU Acceleration

This article details how Ximalaya’s AI Cloud adopted the open‑source HybridBackend framework to overcome sparse data access and distributed training bottlenecks, achieving multi‑GPU utilization gains, faster model training, and significant cost reductions across its recommendation services.

GPU Acceleration · HybridBackend · Sparse Data

0 likes · 9 min read

Hulu Beijing

Mar 16, 2023 · Artificial Intelligence

Inside Hulu’s Distributed Training Platform: Architecture, Challenges, and Solutions

This article explores Hulu’s five‑year‑old machine‑learning training platform, detailing its three‑layer architecture, the shift from single‑node to distributed training, and the technical solutions—including parameter servers, Ring AllReduce, Kubernetes, Volcano, and Horovod—that enable scalable AI workloads across GPU, CPU, and storage resources.

AI Infrastructure · Hulu · Kubernetes

0 likes · 13 min read

DataFunSummit

Jan 14, 2023 · Artificial Intelligence

Deep Graph Library (DGL): Technical Features, Community Progress, and Challenges in Graph Deep Learning

This article provides a comprehensive overview of the Deep Graph Library (DGL), covering its technical characteristics, open‑source community developments, various graph learning tasks, message‑passing mechanisms, system design challenges, training strategies on single and multiple GPUs, inference optimization, and a Q&A comparing DGL with other frameworks.

AI · Deep Graph Library · GNN Training

0 likes · 15 min read

Baidu Geek Talk

Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Cutting‑Edge Acceleration Techniques

This article systematically examines the major performance bottlenecks in AI model training, explains the underlying hardware and software causes, and presents a comprehensive set of acceleration strategies—including data‑loading optimizations, compute‑side enhancements, communication tricks, and the AIAK‑Training suite—backed by real‑world case studies and quantitative results.

AI training · AIAK-Training · GPU Acceleration

0 likes · 33 min read

Baidu Intelligent Cloud Tech Hub

Dec 22, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Acceleration Techniques

This article systematically analyzes the main performance bottlenecks in AI model training, explains why acceleration is essential, and presents current hardware‑ and software‑based solutions—including data‑loading optimizations, operator fusion, mixed‑precision and Tensor Core usage, as well as distributed communication strategies—followed by real‑world case studies of Baidu's AIAK‑Training suite that demonstrate significant speed‑ups.

AI training · GPU Acceleration · Performance Optimization

0 likes · 31 min read

Dada Group Technology

Nov 18, 2022 · Artificial Intelligence

JD Daojia Machine Learning Platform: Architecture and Implementation

This article introduces JD Daojia's machine learning platform, detailing its architecture, implementation principles, and practical applications in various business scenarios, achieving significant improvements in recommendation and search systems.

Deep Learning · Graph Neural Networks · Kubernetes

0 likes · 28 min read

vivo Internet Technology

Oct 9, 2022 · Artificial Intelligence

vivo Machine Learning Platform: Architecture Design and Practice

vivo’s machine‑learning platform, built for its massive app‑store and e‑commerce ecosystem, streamlines data processing, model training, and deployment through quota‑based resource management, a custom ultra‑large‑scale TensorFlow‑vlps framework, OpenAPI‑driven training, and Jupyter‑integrated interactive development, boosting efficiency for billions of samples and features.

MLOps · Machine Learning Platform · Model Deployment

0 likes · 12 min read

Bilibili Tech

Aug 30, 2022 · Artificial Intelligence

Reinforcement Learning in Neural MMO: Background, Environment, Competition Solution, and Insights

The article reviews reinforcement learning applied to Neural MMO—a large‑scale, multi‑agent MMO environment—detailing its competitive IJCAI 2022 track, the winning LastOrder solution with transformer‑CNN‑LSTM architecture, reward shaping, a Fictitious Self‑Play meta‑solver, and Bilibili’s scalable Newton training framework.

AI in Games · Meta Solver · Multi-Agent Systems

0 likes · 9 min read

DataFunTalk

Jul 29, 2022 · Artificial Intelligence

Tencent Music Cloud‑Native One‑Stop Machine Learning Platform: Features and Future Roadmap

This article introduces Tencent Music's cloud‑native, one‑stop machine learning platform, detailing its engineering workflow, distributed acceleration, inference closed‑loop, edge computing capabilities, and future plans, while highlighting challenges of traditional ML pipelines and the platform's solutions for resource orchestration, storage, scheduling, and GPU utilization.

AI platform · Resource Management · cloud-native

0 likes · 17 min read

DataFunTalk

Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Spark · distributed training

0 likes · 13 min read

Laiye Technology Team

Jul 22, 2022 · Cloud Native

Distributed Training Orchestration and Scheduling on Kubernetes: Architecture, Challenges, and Solutions

This article examines the pain points of distributed training orchestration and scheduling, presents a layered cloud‑native architecture built on Kubernetes, explains key components such as pipeline orchestrators, training job operators, schedulers, and topology managers, and discusses practical solutions using Argo, Kubeflow Pipelines, and the Volcano scheduler.

Kubernetes · ML Platform · Scheduling

0 likes · 38 min read

Youzan Coder

Jul 11, 2022 · Artificial Intelligence

How Contrastive Learning Revolutionizes Product Term Prediction in E‑commerce

By leveraging contrastive learning and large‑scale click‑through data, the article details a dual‑tower model that encodes product titles and queries, explains loss functions, batch‑negative sampling, distributed training tricks, and demonstrates how this approach outperforms traditional NER for product term and category prediction.

InfoNCE · contrastive learning · distributed training

0 likes · 16 min read

DataFunTalk

Jul 8, 2022 · Artificial Intelligence

Tencent's Wuliang Deep Learning System for Large‑Scale Recommendation: Architecture, Challenges, and Solutions

This article presents an in‑depth overview of Tencent's Wuliang deep learning platform for recommendation systems, detailing the real‑time data challenges, high‑throughput requirements, parameter‑server architecture, model compression techniques, multi‑level caching, and answers to common technical questions.

Inference Service · Parameter Server · Recommendation Systems

0 likes · 14 min read

Baidu Geek Talk

Jul 6, 2022 · Artificial Intelligence

Why Training Massive AI Models Demands New Cluster Architectures and Parallelism Strategies

The article examines the industry trend toward ever‑larger AI models, compares their parameter scale to the human brain, outlines the computational and memory challenges of training such models, and details advanced parallelism techniques and Baidu's high‑performance cluster solutions that enable efficient, stable large‑scale model training.

AI Infrastructure · Baidu · Cluster Computing

0 likes · 17 min read

Alimama Tech

Jun 22, 2022 · Artificial Intelligence

Graph Deep Learning: Methods, Frameworks, and Industrial Applications

Graph deep learning, extending deep models to irregular graph data via spatial and spectral GNNs such as GCN, GAT, and GraphSAGE, has matured into frameworks like Alibaba’s open‑source Euler, which scales to billions of nodes, powers a heterogeneous query‑item‑ad graph for search advertising, and demonstrably boosts click‑through rates by over 1.5%.

Euler framework · distributed training · graph embeddings

0 likes · 17 min read

Tencent Cloud Developer

May 12, 2022 · Backend Development

Practical Guide to PyTorch Distributed Training: DP, DDP, Groups, and IO Considerations

This guide explains PyTorch’s distributed training. It contrasts single‑process, single‑node DataParallel with multi‑process DistributedDataParallel (which also scales across nodes) and covers essential parameters, group communication setup, proper use of DistributedSampler for data loading, IO bottlenecks, and common pitfalls such as memory imbalance, unsynchronized buffers, and unused‑parameter errors.

DDP · DataParallel · GPU

0 likes · 15 min read

DataFunSummit

May 7, 2022 · Artificial Intelligence

Advances in Click‑Through Rate Prediction: Model Evolution, Feature Interaction, Continuous Feature Embedding, and Distributed Training

This article reviews the development of CTR prediction models from early collaborative‑filtering methods to modern deep‑learning approaches, discusses core challenges such as feature interaction and continuous‑feature embedding, introduces recent Huawei solutions like AutoDis and ScaleFreeCTR for efficient large‑embedding training, and outlines future research directions.

Embedding · Recommendation Systems · continuous features

0 likes · 21 min read

DataFunSummit

Apr 26, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This talk describes Huya’s elastic distributed training system, covering the motivation behind elasticity, its design using Kubernetes and ETCD for dynamic node registration and scaling, implementation details of the EFDL framework, performance evaluations on ResNet‑50, and the resulting benefits and future directions.

AI platform · GPU scheduling · Huya

0 likes · 11 min read

DataFunTalk

Apr 14, 2022 · Artificial Intelligence

PaddlePaddle Deep Learning Platform: Architecture, Core Technologies, and Real‑World Applications

The article presents a comprehensive overview of Baidu's open‑source deep learning platform PaddlePaddle, detailing its full‑stack architecture, core technologies such as unified dynamic‑static graph, large‑scale distributed training, multi‑platform inference, an extensive model zoo, hardware adaptation, and showcases a real‑world deployment case in power‑grid monitoring.

AI Framework · PaddlePaddle · distributed training

0 likes · 15 min read

DataFunSummit

Apr 7, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Cloud Vertex AI: Fast Socket and Reduction Server

This article explains how Google Cloud Vertex AI improves large‑scale distributed machine learning training performance by addressing the memory‑wall challenge with Fast Socket network stack enhancements for NCCL and a Reduction Server that accelerates gradient aggregation, delivering higher throughput and lower TCO for AI workloads.

Fast Socket · GPU · NCCL

0 likes · 19 min read

Volcano Engine Developer Services

Mar 16, 2022 · Artificial Intelligence

How veGiantModel Boosts Large Language Model Training Up to 6.9× Faster

The article introduces Volcano Engine's veGiantModel, a high‑performance large‑model training framework built on PyTorch, Megatron and DeepSpeed, details its distributed parallel strategies, hardware setups, benchmark results showing up to 6.9× speedup over Megatron and DeepSpeed, and provides open‑source links for further use.

ByteCCL · Large Language Models · distributed training

0 likes · 6 min read

HomeTech

Feb 15, 2022 · Artificial Intelligence

Horovod Distributed Deep Learning Training: Architecture, Performance, and Kubernetes Deployment

This article provides a comprehensive overview of Horovod, Uber's open-source distributed deep learning framework, covering its architecture, communication mechanisms, performance benchmarks, and deployment on Kubernetes and Spark for accelerated multi-GPU training.

Deep Learning · GPU Acceleration · Horovod

0 likes · 17 min read

JD Retail Technology

Jan 24, 2022 · Artificial Intelligence

Galileo: An Open‑Source Scalable Graph Deep Learning Framework for Industrial‑Scale Applications

Galileo is an open‑source, distributed graph deep‑learning framework that supports ultra‑large heterogeneous graphs, dual TensorFlow/PyTorch back‑ends, and a flexible API, enabling fast prototyping of graph neural networks such as HeteSAGE for real‑world recommendation and other AI scenarios.

AI Framework · Galileo · Graph Neural Networks

0 likes · 11 min read

Alibaba Cloud Native

Jan 17, 2022 · Cloud Native

Boost Distributed AI Training with KubeDL HostNetwork: Overcoming Overlay Limits

This article explains how KubeDL, Alibaba’s open-source Kubernetes-based AI workload framework, extends standard container networking with HostNetwork support to eliminate overlay overhead, detailing the benefits, challenges, configuration steps, and performance gains for large-scale distributed training.

AI · Cloud Native · HostNetwork

0 likes · 11 min read

Code DAO

Dec 31, 2021 · Cloud Computing

How to Run Distributed PyTorch Training on AzureML with CLI v2

This article walks through the complete workflow for building, testing, and launching a distributed PyTorch training job on AzureML using the CLI v2, covering local script preparation, Accelerate configuration, Docker environment setup, dataset registration, compute target definition, job YAML creation, and job submission with monitoring.

CLI · Docker · MLflow

0 likes · 15 min read

DataFunTalk

Dec 23, 2021 · Artificial Intelligence

Deep Customization and Optimization of TensorFlow for Large-Scale Sparse Training at Meituan

This article details Meituan's internal, heavily customized TensorFlow 1.x implementation that addresses large‑scale sparse parameter support, distributed training challenges, communication bottlenecks, and pipeline optimizations, achieving over ten‑fold scalability improvements and significant per‑node performance gains in recommendation system workloads.

Performance Optimization · Sparse Parameters · TensorFlow

0 likes · 32 min read

Code DAO

Dec 17, 2021 · Artificial Intelligence

How to Accelerate XGBoost Training with Tree Methods, Cloud Computing, and Ray

The article explains why XGBoost training can still be slow despite the library's focus on speed, and presents three acceleration techniques: choosing an optimal tree_method, leveraging cloud resources for larger memory, and using Ray for distributed training. It includes code examples and benchmark results.

Cloud Computing · Ray · XGBoost

0 likes · 5 min read

Code DAO

Dec 17, 2021 · Artificial Intelligence

How to Scale XGBoost with Ray for Distributed Multi‑GPU Training

XGBoost‑Ray provides a fault‑tolerant, multi‑node, multi‑GPU backend for XGBoost that integrates seamlessly with Ray Tune, supports distributed data loading, and can be enabled with only three code changes, enabling scalable training and inference on large clusters.

GPU · Ray · Ray Tune

0 likes · 8 min read

Alimama Tech

Dec 15, 2021 · Artificial Intelligence

Scalable Multi-View Ad Retrieval (SMAD): A Graph-Based Framework for E-commerce Advertising

SMAD is a scalable graph‑based ad retrieval framework for e‑commerce search that builds a heterogeneous Query‑Item‑Ad graph, learns multi‑view embeddings with a parallel deep neural network and attention, employs category‑aware sampling for efficient distributed training, and delivers significant gains in offline relevance and online CTR, RPM, and PVR.

ad retrieval · attention · distributed training

0 likes · 17 min read

Meituan Technology Team

Dec 9, 2021 · Artificial Intelligence

Deep Customization of TensorFlow for Large-Scale Sparse Training at Meituan

Meituan heavily customized TensorFlow 1.x for large‑scale sparse training, replacing variable embeddings with hash tables, improving load balancing, using RDMA communication, pipeline‑embedding graphs, high‑performance hash tables, and operator merges, achieving over ten‑fold scalability, up to 51% operator speedups, and enabling billions‑parameter models on CPU clusters with future GPU expansion.

Performance Optimization · Recommendation Systems · Sparse Parameters

0 likes · 31 min read

DataFunSummit

Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

Data Parallel · Deep Learning · Gloo

0 likes · 8 min read

Alibaba Cloud Developer

Aug 17, 2021 · Artificial Intelligence

How Alibaba’s Whale Framework Cuts Large‑Model Training Costs by 80%

Alibaba Cloud’s PAI team and the DAMO Academy introduced the low‑carbon M6 trillion‑parameter multimodal model, demonstrating that their self‑developed Whale framework can train such massive models on just 480 V100 GPUs, reducing energy consumption by over 80% and boosting training efficiency nearly eleven‑fold.

AI · GPU Optimization · Whale framework

0 likes · 12 min read

Tencent Architect

Jul 29, 2021 · Artificial Intelligence

Performance Optimization of Advertising Coarse‑Ranking Training on the Light Framework

This article analyzes the bottlenecks of advertising coarse‑ranking training on the Light framework and presents a series of optimizations—including parallel data download, thread‑queue buffering, integer‑to‑string conversion with fmt, and zlib replacement with czlib—that together achieve up to 58% QPS improvement and notable CPU efficiency gains.

Advertising · CPU/GPU efficiency · Performance Optimization

0 likes · 11 min read

Kuaishou Tech

Jul 16, 2021 · Artificial Intelligence

Bagua: An Open‑Source Distributed Training Framework for Deep Learning

Bagua is a distributed training framework co‑developed by Kuaishou and ETH Zürich that combines algorithmic and system‑level optimizations—such as decentralized, asynchronous, and compressed communication—to achieve up to 60% higher performance than existing frameworks like PyTorch‑DDP, Horovod, and BytePS across various AI workloads.

Bagua · Deep Learning · GPU scaling

0 likes · 15 min read

360 Tech Engineering

Jul 2, 2021 · Artificial Intelligence

DGL Operator: A Kubernetes‑Native Solution for Distributed Graph Neural Network Training

The article introduces DGL Operator, an open‑source Kubernetes‑based controller that automates the lifecycle of distributed graph neural network training with DGL, explains its terminology, challenges of native DGL distribution, and provides detailed architecture, workflow, and YAML/CLI examples for easy deployment.

AI · DGL · Graph Neural Networks

0 likes · 18 min read

360 Smart Cloud

Jul 1, 2021 · Cloud Native

DGL Operator: A Kubernetes Native Controller for Distributed Graph Neural Network Training

DGL Operator is an open‑source Kubernetes controller that automates the lifecycle of distributed graph neural network training by handling configuration generation, graph partitioning, training execution, and resource cleanup, providing a cloud‑native solution for large‑scale GNN workloads.

AI · Cloud Native · DGL

0 likes · 20 min read

Alibaba Cloud Native

Jun 3, 2021 · Artificial Intelligence

How Weibo Boosted Deep Learning Training Speed 18× with Fluid and JindoRuntime

Weibo’s deep learning platform suffered severe latency and stability problems when reading massive small‑file datasets through a compute‑storage‑separated architecture. The team adopted the CNCF Fluid project with JindoRuntime to add a distributed cache exposed through POSIX interfaces, which improved data locality, reduced HDFS load, and delivered up to 18‑fold training speedups while raising success rates from 37% to 98%.

Data Caching · Deep Learning · Fluid

0 likes · 15 min read

JD Tech

Mar 30, 2021 · Artificial Intelligence

JD Retail's Jiushu Business Analytics Platform: AI‑Driven Solutions for Retail

The article introduces JD Retail's Jiushu Business Analytics Platform, detailing how AI, big‑data, and distributed‑training technologies address fragmented retail scenarios, high deployment barriers, large‑scale application difficulties, and cost concerns through specialized frameworks, fault‑tolerant training, and advanced cluster optimization.

AI · Retail · cluster management

0 likes · 12 min read

DataFunTalk

Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · GPU clustering · Multimodal AI

0 likes · 17 min read

ITFLY8 Architecture Home

Mar 16, 2021 · Artificial Intelligence

How NetEase Cloud Music Solved Cold‑Start with Large‑Scale Graph Neural Networks

This article explains how NetEase Cloud Music tackled cold‑start recommendation challenges in live streaming by leveraging Baidu's PGL distributed graph learning framework to train massive graph neural networks that transfer user behavior from music domains to live content, achieving significant performance gains.

AI · Graph Neural Networks · Large-Scale Graph

0 likes · 7 min read

DataFunSummit

Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

GPU Cluster · Multimodal AI · Real-time inference

0 likes · 18 min read

Alibaba Cloud Developer

Oct 22, 2020 · Artificial Intelligence

EasyTransfer: Alibaba’s Open‑Source Framework Boosting NLP Transfer Learning

Alibaba Cloud open‑sources EasyTransfer, a high‑performance deep transfer‑learning framework for NLP that unifies pre‑training, knowledge distillation, meta‑learning and distributed deployment, offering a rich ModelZoo, AppZoo and seamless integration with the PAI ecosystem.

EasyTransfer · NLP · Open-source

0 likes · 15 min read

JD Retail Technology

Oct 21, 2020 · Artificial Intelligence

Galileo: A Distributed Graph Deep Learning Framework for Large‑Scale Industrial Scenarios

The article introduces Galileo, JD Retail's distributed graph deep‑learning platform that supports heterogeneous and dynamic graphs, ultra‑large scale training, flexible model customization, and seamless integration with TensorFlow and PyTorch, highlighting its architecture, core challenges, built‑in algorithms, and upcoming open‑source release.

AI platform · Graph Neural Networks · distributed training

0 likes · 11 min read

360 Tech Engineering

Sep 14, 2020 · Artificial Intelligence

TensorNet: A Distributed Training Framework Optimized for Large-Scale Sparse Feature Models on TensorFlow

TensorNet is a TensorFlow‑based distributed training framework that tackles the challenges of massive data and billions of sparse parameters in advertising and recommendation systems by enabling near‑infinite sparse feature dimensions, drastically reducing synchronization overhead, and delivering up to 35% inference speed improvements.

AI Infrastructure · TensorFlow · distributed training

0 likes · 8 min read

DataFunTalk

Jul 1, 2020 · Artificial Intelligence

Architecture and Implementation of Autohome's Machine Learning Platform

The article presents a comprehensive overview of Autohome's one‑stop machine learning platform, detailing its background, architecture, resource scheduling, data processing, model training (including distributed deep learning), deployment, real‑world applications such as purchase‑intent and recommendation models, and future development directions.

AutoML · Deep Learning · Kubernetes

0 likes · 19 min read

Architect

May 29, 2020 · Artificial Intelligence

Integrating Flink with TensorFlow for End-to-End Machine Learning Pipelines

This article explains how to combine the Flink data‑processing engine with TensorFlow to create a unified, end‑to‑end machine‑learning workflow, covering background, challenges, the Flink‑AI‑extended architecture, ML framework and operator abstractions, and both batch and streaming training and prediction modes.

AI integration · Flink · TensorFlow

0 likes · 9 min read

Tencent Cloud Developer

May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s data‑parallel, synchronous Ring All‑Reduce architecture built on MPI and NCCL, cutting training time from several days to under one day while maintaining accuracy, and future work will target I/O and further scaling.

AI · Horovod · MPI

0 likes · 12 min read

360 Quality & Efficiency

Apr 17, 2020 · Artificial Intelligence

Extending APEX for Real Distributed Reinforcement Learning with tf2rl

The article examines the limitations of the single‑machine APEX framework in the tf2rl reinforcement‑learning library, proposes a cross‑machine distributed architecture using middleware such as Redis, compares alternative frameworks like EasyRL, and outlines expected performance gains and future development plans.

APEX · Off-Policy · TensorFlow

0 likes · 5 min read

Tencent Tech

Feb 27, 2020 · Artificial Intelligence

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Deep Learning · Knowledge Distillation · Quantization

0 likes · 14 min read


58 Tech

Dec 20, 2019 · Artificial Intelligence

Deep Learning Platform on Kubernetes: Architecture, Resource Management, Offline Training and Online Inference

The article presents a comprehensive overview of 58.com’s AI platform built on Kubernetes, detailing its layered architecture, resource scheduling, offline training pipelines, debugging environment, distributed TensorFlow/PyTorch training, performance benchmarks, and online inference services, highlighting how the system empowers various business units with scalable AI capabilities.

Kubernetes · PyTorch · TensorFlow

0 likes · 11 min read


Tencent Cloud Developer

Oct 11, 2019 · Cloud Computing

Large-Scale Distributed Reinforcement Learning Solution Based on TKE

The project replaces cumbersome manual management of thousands of heterogeneous CPU and GPU nodes for large‑scale reinforcement learning with a TKE‑based, containerized actor‑learner architecture that automates batch start/stop, provides elastic autoscaling, fault‑tolerant processes, shared model storage, and CI‑driven image deployment, cutting costs by up to two‑thirds while dramatically speeding experiment cycles.

CI/CD · Cloud Native · Kubernetes

0 likes · 14 min read


Alibaba Cloud Developer

Jun 12, 2019 · Artificial Intelligence

How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, Ring AllReduce algorithms, and user‑friendly APIs boost deep‑learning models—like the GreenNet CNN—to 101‑fold speedups on 128 GPUs, dramatically reducing training time from days to under a day.

AI Infrastructure · Deep Learning · GPU Acceleration

0 likes · 17 min read


360 Tech Engineering

May 10, 2019 · Artificial Intelligence

Distributed Training with MXNet: Data Parallel on Single and Multi‑Node GPUs and Integration with Kubeflow

This article explains how MXNet supports data‑parallel training on single‑machine multi‑GPU and multi‑machine multi‑GPU setups, describes KVStore modes, outlines the worker‑server‑scheduler architecture, and shows how to launch large‑scale distributed training using Kubeflow and the mxnet‑operator.

Data Parallel · Deep Learning · GPU

0 likes · 11 min read


360 Zhihui Cloud Developer

May 9, 2019 · Artificial Intelligence

Master Distributed MXNet Training with Kubeflow: A Step‑by‑Step Guide

Learn how to perform single‑machine multi‑GPU and multi‑node multi‑GPU training with MXNet, understand KVStore modes, configure workers, servers, and schedulers, and deploy large‑scale distributed training on Kubernetes using Kubeflow, including operator installation, task creation, and performance considerations.

GPU · Kubeflow · Kubernetes

0 likes · 11 min read


iQIYI Technical Product Team

Apr 4, 2019 · Artificial Intelligence

Principles, Methodology, and Tools for Machine Learning Performance Optimization

The article presents a systematic, top‑down methodology for machine‑learning performance optimization—covering principles, benchmark‑driven loops, foundational hardware and software checks, profiling tools, throughput and latency metrics, and practical techniques for IO, compute, mixed‑precision, and distributed training to maximize resource utilization.

Compute · IO · Performance Optimization

0 likes · 22 min read


Alibaba Cloud Developer

Dec 21, 2018 · Artificial Intelligence

X-DeepLearning: Alibaba’s Open‑Source Framework for Large‑Scale Sparse Deep Learning

Alibaba's X‑DeepLearning (XDL) is an open‑source deep‑learning framework optimized for high‑dimensional sparse data, offering industrial‑grade distributed training, built‑in CTR/recommendation algorithms, structured compression, and online learning capabilities, with benchmark results demonstrating superior scalability and performance.

CTR Prediction · Deep Learning · Sparse Data

0 likes · 18 min read


MaGe Linux Operations

Nov 22, 2018 · Artificial Intelligence

Accelerating TensorFlow Deep Learning: GPU & Distributed Training Techniques

This article explains how to speed up TensorFlow deep‑learning model training using single‑GPU acceleration, multi‑GPU parallelism, and distributed TensorFlow on Kubernetes, covering device placement, session parameters, synchronous vs asynchronous training modes, and practical code examples to improve performance and scalability.

Deep Learning · GPU Acceleration · TensorFlow

0 likes · 10 min read


Meituan Technology Team

Oct 11, 2018 · Artificial Intelligence

Deploying and Optimizing TensorFlow Serving for High‑Performance CTR Prediction

Meituan’s user‑growth team built a Wide‑Deep CTR prediction model, trained offline with Spark‑generated TFRecords, and deployed it via TensorFlow Serving on YARN, then applied request‑side multithreading, offline one‑hot preprocessing, XLA JIT compilation, and dedicated loading threads to cut end‑to‑end latency from ~18 ms to ~6 ms and eliminate model‑switch spikes.

Model Deployment · TensorFlow Serving · distributed training

0 likes · 15 min read


Didi Tech

Jun 8, 2018 · Artificial Intelligence

DiDi PS: High-Performance RDMA-Based Parameter Server for Distributed Deep Learning

DiDi PS is a custom RDMA‑based parameter server that uses a ring topology and optimized ibverbs communication to dramatically accelerate distributed deep‑learning training, consistently outperforming OpenMPI, NCCL2, TensorFlow’s built‑in RDMA, and Horovod while providing more stable and scalable synchronization for massive data workloads.

Allreduce · Parameter Server · Performance

0 likes · 10 min read


Alibaba Cloud Developer

Apr 26, 2018 · Artificial Intelligence

How TensorFlowRS Supercharges Large‑Scale Search & Recommendation with 10×‑100× Speedups

This article describes TensorFlowRS, an Alibaba‑built extension of TensorFlow that tackles the massive compute and sparse‑feature challenges of search, advertising and recommendation by redesigning the parameter server, adding fail‑over, gradient‑compensation, online‑learning support, advanced training modes and visualisation, achieving up to 100× training speedup and improved model quality.

Parameter Server · Recommendation Systems · TensorFlow

0 likes · 16 min read


Meituan Technology Team

Apr 4, 2018 · Artificial Intelligence

Performance Optimization of Distributed TensorFlow for WDL Models at Meituan

Meituan‑Dianping identified data‑pipeline, network, and memory‑arena bottlenecks in distributed TensorFlow training of Wide & Deep recommendation models and resolved them by switching to tf.data pipelines, batching TFRecord reads, increasing MALLOC_ARENA_MAX, and moving embedding lookups to parameter servers, achieving 2–3× speedup and near‑linear scaling on up to 32 GPUs.

AFO · Performance Optimization · TensorFlow

0 likes · 12 min read


Architecture Digest

Oct 17, 2017 · Artificial Intelligence

Design and Architecture of the Weibo Deep Learning Platform

This article presents the design, architecture, and operational experience of Weibo's deep learning platform, covering its machine‑learning workflow, control center, distributed training cluster, and online prediction service, and explains how the platform accelerates development and improves business outcomes.

AI · Deep Learning · distributed training

0 likes · 17 min read


360 Zhihui Cloud Developer

Jun 8, 2017 · Artificial Intelligence

Getting Started with TensorFlow: Build and Run Your First Computation Graph

This article introduces TensorFlow, covering its origins, architecture, programming model, and how to construct and execute a simple computation graph with code examples, while also explaining its deployment options across devices and distributed setups.

Computation Graph · TensorFlow · distributed training

0 likes · 7 min read


Alibaba Cloud Developer

Jun 5, 2017 · Artificial Intelligence

Alibaba’s Distributed Training Boosts Neural Machine Translation Speed

Neural Machine Translation (NMT) has approached human quality since its 2013 debut, but training remains expensive. In 2017, Alibaba’s team built a distributed NMT system that combines data parallelism, model averaging, BMUF, Downpour SGD, and Ring‑allReduce to cut training time from over 20 days to a few days while maintaining translation quality.

BMUF · Downpour SGD · Model Averaging

0 likes · 18 min read


MaGe Linux Operations

Apr 19, 2017 · Artificial Intelligence

Accelerate TensorFlow Deep Learning with GPU, Multi‑GPU, and Distributed Training

This article explains how to speed up TensorFlow deep‑learning model training by using a single GPU, configuring session parameters, assigning operations to specific devices, employing multi‑GPU parallelism, and leveraging distributed TensorFlow on Kubernetes, while also discussing synchronous versus asynchronous training modes and practical best practices.

Deep Learning · GPU Acceleration · TensorFlow

0 likes · 11 min read


ITPUB

Sep 6, 2016 · Artificial Intelligence

Deep Learning Platforms: From Google’s DistBelief to Open‑Source MXNet and TensorFlow

The article reviews the evolution and challenges of deep learning platforms, both commercial and open‑source, including DistBelief, COTS, Adam, MXNet, TensorFlow, and Petuum, and highlights real‑world applications such as image recognition, recommendation, sentiment analysis, and crowd monitoring.

AI Applications · GPU Acceleration · MXNet

0 likes · 10 min read


Art of Distributed System Architecture Design

Oct 10, 2015 · Artificial Intelligence

Integrating Deep Learning with Apache Hadoop: Caffe-on-Spark on GPU‑Enhanced Clusters

This article describes how Yahoo integrated deep learning into its massive Hadoop ecosystem by adding GPU nodes, using YARN and Spark to run Caffe at scale, and presents performance results on AlexNet and GoogLeNet alongside open‑source contributions.

Big Data · Caffe · GPU

0 likes · 9 min read
