Tagged articles
164 articles
Page 1 of 2
Machine Heart
Apr 30, 2026 · Artificial Intelligence

How LWD Redefines Embodied AI Training with Fleet‑Scale Reinforcement Learning

LWD (Learning While Deploying) introduces a distributed multi‑robot reinforcement‑learning framework that continuously improves VLA policies during real‑world deployment, leveraging DIVL, QAM, dynamic n‑step TD and an asynchronous actor‑learner architecture to achieve over 90% success on five‑minute tasks and outperform traditional behavior‑cloning, HG‑Dagger and RECAP baselines.

Distributed Training · Embodied AI · LWD
0 likes · 13 min read
CodeTrend
Apr 26, 2026 · Artificial Intelligence

Why DeepSeek V4 Can Run on Huawei Ascend: A Deep Technical Breakdown

The article analyzes why most open‑source large models cannot run on Huawei Ascend NPU, detailing the CUDA‑centric ecosystem, Ascend's CANN stack, three core technical hurdles, and the deep collaboration and tooling that enabled DeepSeek V4’s successful adaptation.

AI model porting · CANN · DeepSeek-V4
0 likes · 10 min read
Machine Heart
Apr 25, 2026 · Artificial Intelligence

Jeff Dean’s New Paper Shows Elastic Large‑Scale Distributed Pre‑Training Is Now Feasible

Decoupled DiLoCo, a new distributed training framework introduced by Jeff Dean and colleagues, enables resilient large‑scale AI pre‑training across heterogeneous hardware by decoupling learners, using lightweight syncers, adaptive quorum, and balanced tensor fragmentation, dramatically improving goodput and reducing bandwidth while preserving model quality.

Bandwidth Reduction · Decoupled DiLoCo · Distributed Training
0 likes · 10 min read
Xiaohongshu Tech REDtech
Apr 15, 2026 · Artificial Intelligence

How Relax Powers Scalable Multi‑Modal RL Training with Full‑Async Pipelines

Relax, an open‑source reinforcement‑learning engine from Xiaohongshu AI Platform, combines service‑oriented fault‑tolerant architecture, a distributed checkpoint service, and an asynchronous training pipeline to achieve up to 76% speed‑up and near‑zero overhead for multi‑modal RL workloads.

Asynchronous Pipeline · Distributed Training · Ray Serve
0 likes · 10 min read
Huawei Cloud Developer Alliance
Apr 13, 2026 · Artificial Intelligence

How AReaL v1.0 Enables Scalable Agentic RL on Ascend NPU with AWEX Weight Sync

The new AReaL v1.0 release brings full Ascend NPU support, detailed installation guides, and a best‑practice example for training a 30B MoE model across four nodes, while the integrated AWEX weight‑sync mechanism dramatically reduces synchronization time, improving efficiency and stability for large‑scale Agentic RL workloads.

AWEX · Ascend NPU · Distributed Training
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Apr 8, 2026 · Artificial Intelligence

Running Distributed Reinforcement Learning with Isaac Lab’s Newton Engine and Rerun Visualizer on PAI

This guide explains how to use the Newton physics engine and the lightweight Rerun visualizer with Isaac Lab on the PAI platform, covering environment setup, visualizer selection, single‑ and multi‑GPU reinforcement‑learning training, and performance analysis via TensorBoard.

Distributed Training · Isaac Lab · Newton engine
0 likes · 9 min read
Alibaba Cloud Big Data AI Platform
Mar 25, 2026 · Artificial Intelligence

Scaling Multimodal Reinforcement Learning with NVIDIA Isaac Lab and TiledCamera

This article explains how to use NVIDIA Isaac Lab and the TiledCamera component to run large‑scale, multimodal reinforcement learning on GPU clusters, covering environment setup, noVNC visualization, command‑line execution, distributed training with torchrun, and performance analysis across multiple GPU configurations.

Distributed Training · GPU scaling · NVIDIA Isaac Lab
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

630‑Line Autoresearch Generates 81 Agents, 2,300 Experiments and Ten Pre‑training Insights

A 630‑line Python Autoresearch project sparked a community‑run distributed system that created over 80 autonomous AI agents, executed more than 2,300 experiments in four days, self‑organized roles and peer‑review, and uncovered ten concrete pre‑training findings.

AI agents · AutoResearch · Distributed Training
0 likes · 9 min read
HyperAI Super Neural
Feb 11, 2026 · Artificial Intelligence

Reduce Memory by 75% Using D‑CHAG’s Cross‑Channel Hierarchical Aggregation

Researchers at Oak Ridge National Laboratory introduced D‑CHAG, a distributed cross‑channel hierarchical aggregation method that cuts memory consumption by up to 75% and more than doubles throughput when training massive multi‑channel foundation models on up to 1,024 AMD GPUs, as demonstrated on hyperspectral imaging and weather‑forecasting workloads.

D-CHAG · Distributed Training · Memory Optimization
0 likes · 14 min read
JD Retail Technology
Jan 30, 2026 · Artificial Intelligence

How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Industrial Scale

The article details JD Retail’s 9N‑LLM unified training engine—supporting TensorFlow and PyTorch, GPU and NPU, and both traditional and generative recommendation scenarios—explaining its architecture, high‑throughput sample engine, distributed sparse embedding system, five‑stage pipeline, UniAttention accelerator, and reinforcement‑learning capabilities that together enable TB‑scale data, B‑scale dense parameters, and efficient RL training for real‑world recommendation services.

Distributed Training · GPU/NPU · Reinforcement Learning
0 likes · 26 min read
PaperAgent
Jan 8, 2026 · Artificial Intelligence

How SOP Enables Scalable Online Post-Training for Real‑World Robots

The SOP (Scalable Online Post‑training) framework redesigns VLA post‑training from offline, single‑machine, sequential processing to a distributed, parallel online system, allowing robot fleets to continuously learn, share experiences, and scale intelligence while maintaining stability and generalization in complex real‑world environments.

Distributed Training · Online Learning · Robotics
0 likes · 11 min read
DataFunSummit
Dec 20, 2025 · Artificial Intelligence

How AutoHome Built the Cangjie Large Model: From Training Architecture to Real-World AI Applications

This article details AutoHome's end‑to‑end development of the Cangjie large model, covering the training infrastructure with distributed data, pipeline and tensor parallelism, core business use cases such as video script generation and multi‑tool Agent capabilities, inference optimizations through quantization and fast serving frameworks, and future directions for personalized automotive AI services.

Agent AI · Distributed Training · Video Generation
0 likes · 19 min read
AntTech
Dec 11, 2025 · Artificial Intelligence

Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework boosts large‑scale reinforcement learning by separating agent execution from training logic, introducing a decoupled Agentic RL service and a Single‑Controller architecture that improves data flow, fault tolerance, and GPU utilization.

Agentic AI · Distributed Training · Open-source
0 likes · 14 min read
AntTech
Nov 27, 2025 · Artificial Intelligence

How AMem NCCL‑Plugin Cuts GPU Memory Overhead for Trillion‑Parameter RL Models

The article explains the design, implementation, and performance of the AMem NCCL‑Plugin, a lightweight extension to NVIDIA's NCCL that enables transparent offloading and rapid recovery of GPU memory during reinforcement‑learning training of trillion‑parameter models, detailing its architecture, APIs, benchmarks, installation steps, and integration guidelines.

ASystem · Distributed Training · GPU
0 likes · 18 min read
AntTech
Nov 21, 2025 · Artificial Intelligence

How Awex Enables Sub‑Second TB‑Scale Weight Sync for Trillion‑Parameter RL Models

Awex is a high‑performance Python framework that synchronizes training and inference weights for trillion‑parameter reinforcement‑learning models in seconds, using unified conversion, metadata management, and NCCL/RDMA transfer plans, dramatically reducing RL training latency and supporting diverse parallel strategies.

Distributed Training · High‑performance computing · Python
0 likes · 17 min read
AI2ML AI to Machine Learning
Oct 19, 2025 · Artificial Intelligence

Deep Dive into nanochat: Source Code, Model Size Calculations, and Optimization Techniques

This article provides a thorough analysis of nanochat’s source code, detailing transformer component differences, precise parameter‑size formulas, FlashNorm and ReLU² innovations, scaling‑law insights, memory‑usage estimations, and the distributed optimizer and training pipelines used to build the model.

Distributed Training · LLM · Transformer
0 likes · 20 min read
Architects' Tech Alliance
Sep 28, 2025 · Artificial Intelligence

How AI Workloads Are Redefining Network Architecture: Key Requirements and Topologies

The article examines how the rapid growth of AI models and workloads is reshaping network design, highlighting the need for ultra‑high bandwidth, sub‑millisecond latency, reliability, scalable topologies like Fat‑Tree and Dragonfly, and robust security and QoS mechanisms across data‑center, cloud, and edge environments.

AI networking · Distributed Training · High Bandwidth
0 likes · 11 min read
IT Architects Alliance
Sep 17, 2025 · Artificial Intelligence

How Distributed Scheduling Redefines AI Large-Model Training Architecture

The article examines how the explosive compute, storage, network, and fault‑tolerance demands of AI large‑model training force a fundamental redesign of system architecture, covering layered storage, optimized All‑Reduce communication, elastic resource orchestration, observability, and cost‑saving strategies.

AI Architecture · Compute Scheduling · Cost Optimization
0 likes · 9 min read
Fun with Large Models
Aug 30, 2025 · Artificial Intelligence

How to Fine‑Tune Large Models on Multiple Nodes and GPUs – A Must‑Know Interview Answer

This article explains how to fine‑tune large models across multiple machines and GPUs by covering data, model, tensor, and pipeline parallelism, hybrid 3D parallel strategies, engineering details such as NCCL, PyTorch Distributed, DeepSpeed, fault‑tolerance, checkpointing, and the ZeRO optimizer stages that dramatically reduce memory usage.

Data Parallel · DeepSpeed · Distributed Training
0 likes · 8 min read
Kuaishou Tech
Aug 21, 2025 · Artificial Intelligence

How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by Kuaipilot, decouples the trainer from agents via a novel data plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving a token‑throughput boost of up to 100% and a 62% reduction in overall training time across large‑model RL workloads.

Distributed Training · Reinforcement Learning · pipeline optimization
0 likes · 13 min read
Alibaba Cloud Developer
Jul 24, 2025 · Artificial Intelligence

Optimizing Small Perception Models on Different Compute Cards for Autonomous Driving

This article shares practical experience training perception‑detection mini‑models on two different compute cards, covering environment setup, technical architecture, common dependency issues, performance‑boosting tricks such as CPU process pools, torch dataloader tuning, NCCL P2P handling, and CPFS storage optimization.

Distributed Training · Model Training · autonomous driving
0 likes · 17 min read
Tech Freedom Circle
Jul 17, 2025 · Artificial Intelligence

DeepSeek V3 Architecture Deep Dive: MoE, MLA, DualPipe, FP8 Mixed Precision & Multi‑Token Prediction

This article provides a detailed technical analysis of DeepSeek‑V3, covering its MoE architecture, the novel Multi‑head Latent Attention (MLA) mechanism, the DualPipe pipeline‑parallel algorithm, mixed‑precision FP8 training, and the Multi‑Token Prediction (MTP) inference improvements that together boost performance and efficiency.

DeepSeek · Distributed Training · DualPipe
0 likes · 44 min read
Alibaba Cloud Big Data AI Platform
Jul 16, 2025 · Artificial Intelligence

ChunkFlow: Accelerating Long‑Context Model Fine‑Tuning Up to 4.5× Faster

The paper introduces ChunkFlow, an efficient training framework for variable‑length and ultra‑long sequence datasets that powers Qwen models, achieving up to 4.53× speedup over Megatron‑LM and more than 2× overall performance gains by reorganizing data into fixed‑size chunks and employing a state‑aware scheduler.

AI Performance · ChunkFlow · Distributed Training
0 likes · 7 min read
Network Intelligence Research Center (NIRC)
Jul 13, 2025 · Artificial Intelligence

Getting Started with Hugging Face Transformers Trainer

This guide walks through the Trainer API of the Hugging Face Transformers library, explaining its core features such as configurable training loops, mixed‑precision and gradient‑accumulation support, and seamless distributed training via Accelerate and DeepSpeed, and provides a step‑by‑step example of converting a simple PyTorch CNN model to use Trainer.

Accelerate · DeepSpeed · Distributed Training
0 likes · 7 min read
Alibaba Cloud Big Data AI Platform
Jun 25, 2025 · Artificial Intelligence

Boost Post‑Training Efficiency with Cosmos‑RL, Ray, and VeRL on Alibaba PAI

This article introduces Alibaba Cloud's PAI platform and demonstrates how open‑source reinforcement‑learning frameworks such as Cosmos‑RL, Ray, and VeRL accelerate post‑training for large language models, offering higher throughput, fault‑tolerance, and seamless integration for AI developers.

AI Platform · Distributed Training · Open Source Frameworks
0 likes · 9 min read
Architects' Tech Alliance
May 26, 2025 · Fundamentals

Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training

The article explains how distributed AI training performance depends on reducing inter‑card communication latency, introduces RDMA technology and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability against traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.

Distributed Training · InfiniBand · RDMA
0 likes · 12 min read
AI Cyberspace
May 20, 2025 · Artificial Intelligence

Why SuperNode and SuperPOD Are Critical for Scaling AI Models

This article explains the scaling laws behind large language models, the explosive growth of model sizes and compute demands, and why modern AI infrastructure must adopt SuperNode and SuperPOD architectures that combine high‑bandwidth Scale‑Up networks with flexible Scale‑Out networking to overcome bandwidth, latency, and power challenges.

AI scaling · Distributed Training · SuperPoD
0 likes · 42 min read
Baidu Geek Talk
May 19, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations

To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built a 4 µs end‑to‑end low‑latency HPN cluster, optimized traffic management, adaptive routing, and custom Alltoall operators, resulting in up to 20 % throughput gains and reduced latency for both Prefill and Decode stages.

AI inference · Alltoall optimization · Distributed Training
0 likes · 14 min read
Baidu Geek Talk
Apr 14, 2025 · Artificial Intelligence

PaddlePaddle Framework 3.0: Five Core Breakthroughs Reshaping Large Model Development

PaddlePaddle Framework 3.0 delivers five breakthroughs—dynamic‑static unified automatic parallelism, integrated training‑inference pipelines, high‑order scientific differentiation, a neural‑network compiler with automatic operator fusion, and streamlined heterogeneous chip adaptation—drastically reducing development effort, boosting training speed, and expanding compatibility for large‑scale AI models.

AI Infrastructure · Distributed Training · Large Language Models
0 likes · 23 min read
Architects' Tech Alliance
Apr 3, 2025 · Artificial Intelligence

Why NVLink and NVSwitch Are Essential for Training Massive AI Models

Training today's massive AI foundation models demands extensive GPU resources and sophisticated multi‑GPU communication, making technologies like NVLink and NVSwitch crucial for efficient distributed training, while data‑parallel and model‑parallel strategies together optimize performance across large‑scale hardware clusters.

Distributed Training · GPU · NVLink
0 likes · 8 min read
AI Algorithm Path
Mar 16, 2025 · Artificial Intelligence

Speed Up Your PyTorch Model Training: Practical Tips and Tricks

This article walks through concrete techniques to accelerate PyTorch training, covering mixed‑precision with torch.cuda.amp, profiling with torch.profiler, DataLoader tuning, torch.compile, distributed strategies like DataParallel and DDP, gradient accumulation, and advanced libraries such as Lightning, Apex, and DeepSpeed, plus model‑level optimizations and monitoring tips.

DataLoader · Distributed Training · Profiling
0 likes · 12 min read
Architects' Tech Alliance
Mar 5, 2025 · Industry Insights

How DeepSeek’s Open‑Source Tools Are Supercharging AI Model Performance

DeepSeek’s Open‑Source Week unveiled five high‑performance projects—FlashMLA, DeepEP, DeepGEMM, DualPipe/EPLB, and 3FS—each delivering novel GPU optimizations, communication kernels, matrix‑multiplication libraries, parallelism strategies, and a distributed file system that together dramatically accelerate large‑scale AI training and inference workloads.

AI acceleration · DeepSeek · Distributed Training
0 likes · 9 min read
JD Retail Technology
Mar 4, 2025 · Artificial Intelligence

JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

Distributed Training · GPU · Inference
0 likes · 19 min read
JD Tech Talk
Mar 3, 2025 · Artificial Intelligence

AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

Distributed Training · GPU · Inference
0 likes · 20 min read
Data Thinking Notes
Mar 2, 2025 · Artificial Intelligence

How DeepSeek’s Open‑Source Week Accelerates AI with Cutting‑Edge GPU and Storage Innovations

During DeepSeek’s Open‑Source Week (Feb 24‑28), five production‑tested projects were released, spanning GPU‑optimized MLA kernels, MoE communication libraries, high‑performance FP8 GEMM, dual‑pipeline parallelism, and an AI‑focused distributed file system, each delivering significant performance and efficiency gains for large‑scale AI workloads.

Distributed Training · GPU Optimization · ai
0 likes · 13 min read
DataFunTalk
Mar 2, 2025 · Artificial Intelligence

Implementing GRPO from Scratch with Distributed Reinforcement Learning on Qwen2.5-1.5B-Instruct

This tutorial explains how to build a distributed reinforcement‑learning pipeline using the GRPO algorithm, covering data preparation, evaluation and reward functions, multi‑GPU DataParallel implementation, and full fine‑tuning of the Qwen2.5‑1.5B‑Instruct model with PyTorch, FlashAttention2 and Weights & Biases.

Distributed Training · GRPO · PyTorch
0 likes · 10 min read
AI Product Manager Community
Feb 28, 2025 · Artificial Intelligence

What’s Inside DeepSeek’s Open‑Source Week? DualPipe, EPLB, 3FS and More Explained

DeepSeek’s recent Open‑Source Week unveiled a suite of AI‑focused tools—including the DualPipe pipeline parallelism algorithm, the EPLB expert load balancer, detailed training‑inference framework data, the high‑performance 3FS parallel file system, and the Smallpond data‑processing framework—each with GitHub links and performance highlights.

Distributed Training · ai · file system
0 likes · 7 min read
AIWalker
Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · Distributed Training · GPU Optimization
0 likes · 39 min read
DataFunSummit
Feb 14, 2025 · Artificial Intelligence

Building Large‑Scale Recommendation Systems with Big Data and Large Language Models on Alibaba Cloud AI Platform

This presentation details how Alibaba Cloud's AI platform integrates big‑data pipelines, feature‑store services, and large language model capabilities to construct high‑performance search‑recommendation architectures, covering system design, training and inference optimizations, LLM‑driven use cases, and open‑source RAG tooling.

AI Platform · Big Data · Distributed Training
0 likes · 17 min read
AI Algorithm Path
Feb 10, 2025 · Artificial Intelligence

Understanding DualPipe: DeepDive into DeepSeek‑R1 Architecture (Part 5)

This article explains how the DualPipe scheduling mechanism in DeepSeek‑R1 improves GPU cluster compute‑communication efficiency by using fine‑grained pipeline stages and bidirectional data flow, comparing it with Zero Bubble pipeline parallelism and discussing the challenges of large‑scale distributed training.

DeepSeek · Distributed Training · DualPipe
0 likes · 10 min read
DataFunSummit
Jan 21, 2025 · Artificial Intelligence

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

This article presents NVIDIA's NeMo technology stack for end‑to‑end large language model (LLM) training, covering the full software pipeline, model alignment with reinforcement learning from human feedback (RLHF), performance optimizations such as model parallelism, FP8, TensorRT‑LLM inference, dynamic load balancing, and future research directions.

Distributed Training · GPU Optimization · LLM
0 likes · 24 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Distributed Training · PPO · PRM
0 likes · 21 min read
Kuaishou Large Model
Nov 22, 2024 · Artificial Intelligence

Boost LLM Training on Massive Clusters with DP/TP Overlap and Context Parallelism

This article details a comprehensive set of techniques—including data‑ and tensor‑parallel overlap, context‑parallelism, activation rematerialization, and a performance‑driven cost model—that dramatically improve large‑language‑model training efficiency on ultra‑large GPU clusters while preserving model quality.

Distributed Training · Large Language Models · Parallelism
0 likes · 28 min read
Kuaishou Tech
Nov 21, 2024 · Artificial Intelligence

Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters

This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.

Distributed Training · GPU clusters · Performance Modeling
0 likes · 27 min read
Tencent Tech
Nov 19, 2024 · Artificial Intelligence

How Tencent’s Angel Platform Secured the 2024 World Internet Conference Leading Technology Award

Tencent’s Angel machine learning platform, recognized for breakthroughs in trillion‑scale model training, inference, and deployment, won the 2024 World Internet Conference Leading Technology Award, highlighting its self‑developed hardware‑software stack, high‑performance networking, and extensive real‑world AI applications.

AI Platform · Angel · Distributed Training
0 likes · 6 min read
AsiaInfo Technology: New Tech Exploration
Oct 23, 2024 · Artificial Intelligence

How to Optimize Distributed Training for Massive AI Models: Strategies & Performance Insights

This article examines the challenges of scaling large AI models across multiple GPUs, explores data, pipeline, and tensor parallelism, analyzes collective communication patterns and data‑channel technologies such as PCIe, NVLink and RDMA, and offers concrete optimization recommendations to boost training efficiency.

Distributed Training · GPU communication · collective communication
0 likes · 21 min read
Baidu Tech Salon
Oct 17, 2024 · Artificial Intelligence

How to Deploy Yuan 2.0 LLM with PaddleNLP: A Step‑by‑Step Guide

This article explains how the open‑source Yuan 2.0 large language model is fully integrated with Baidu’s PaddleNLP, covering its capabilities, fine‑tuning optimizations, step‑by‑step deployment instructions, interaction examples, and training/finetuning results with loss‑curve visualizations.

Distributed Training · Fine-tuning · PaddleNLP
0 likes · 10 min read
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI Infrastructure · Distributed Training · GPU cluster
0 likes · 22 min read
DataFunSummit
Oct 5, 2024 · Artificial Intelligence

Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch

This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.

CUDA Graph · Distributed Training · GPU Optimization
0 likes · 11 min read
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

Data Parallelism · DeepSpeed · Distributed Training
0 likes · 27 min read
Alibaba Cloud Big Data AI Platform
Sep 26, 2024 · Artificial Intelligence

How Alibaba Cloud’s PAI Tackles Large‑Model Training and Inference Challenges in 2024

At the 2024 Yunqi Conference, Alibaba Cloud’s AI Infra experts detailed the latest challenges of large‑model deployment—such as hardware costs, resource management, and software‑hardware coordination—and introduced PAI’s new capabilities, including stability tools, automated distributed training, reinforcement‑learning frameworks, inference optimizations, and integrated big‑data AI solutions.

AI Infra · Big Data Integration · Distributed Training
0 likes · 14 min read
Baobao Algorithm Notes
Sep 18, 2024 · Artificial Intelligence

Why Training on 1,000 GPUs Is Harder Than You Think—and How to Tame It

Training deep learning models on a thousand GPUs faces steep communication overhead, higher failure probability, and scaling inefficiencies, but by profiling each step, overlapping compute and communication, using gradient bucketing and accumulation, and employing elastic training techniques, practitioners can approach near‑linear performance while mitigating common pitfalls.

Distributed Training · GPU scaling · PyTorch
0 likes · 13 min read
Baidu Geek Talk
Aug 28, 2024 · Artificial Intelligence

How PaddlePaddle 3.0 Simplifies Large‑Model Distributed Training with Automatic Parallelism

This article explains the challenges of scaling large AI models, introduces PaddlePaddle 3.0's four‑dimensional hybrid parallelism and its unified automatic parallel framework, details core concepts such as ProcessMesh and Placements, provides step‑by‑step code examples, and outlines performance‑optimizing strategies like operator fusion and pipeline scheduling.

Distributed Training · Hybrid Parallel · PaddlePaddle
0 likes · 17 min read
Baidu Geek Talk
Aug 26, 2024 · Artificial Intelligence

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

The article presents three RLHF‑PPO acceleration techniques: TRT‑LLM‑based text generation speedups, selective activation recomputation with sequence parallelism for dynamic memory reduction, and overlapping pipeline stages for system‑level parallelism. Together they deliver a 350% throughput boost on a 10B‑parameter model using 16 A100 GPUs.

Distributed Training · GPU Optimization · Large Language Models
0 likes · 16 min read
Baobao Algorithm Notes
Jul 24, 2024 · Artificial Intelligence

What Powers Meta’s Llama 3 405B? Inside the Architecture, Scaling Laws, and Massive Training Infrastructure

This article dissects Meta’s Llama 3 405‑billion‑parameter model, covering its dense Transformer design, data‑mixing strategy, two‑stage scaling‑law prediction, 4‑D parallelism, custom hardware clusters, training schedules, post‑training alignment methods, and the extensive evaluation results that benchmark it against leading LLMs.

AI Infrastructure · Distributed Training · Llama 3
0 likes · 56 min read
360 Smart Cloud
Jul 4, 2024 · Artificial Intelligence

Optimizing Mixture-of-Experts (MoE) Training with the QLM Framework

This article introduces the background and challenges of large language model training, explains the Mixture-of-Experts (MoE) architecture, and details several optimization techniques implemented in the QLM framework (fine-grained and shared experts, top‑k gating, token distribution, expert parallelism, and grouped GEMM) that improve training efficiency and performance.

Distributed Training · Large Language Models · Mixture of Experts
0 likes · 10 min read
21CTO
Jun 7, 2024 · Artificial Intelligence

10 Essential Tools for Building a Modern AI Data Lake Architecture

This article outlines ten critical components of a modern data lake reference architecture for AI/ML, detailing each function, the supporting vendor tools and open‑source libraries, and how they enable scalable storage, MLOps, distributed training, model hubs, vector search, and data visualization.

Data Lake · Distributed Training · MLOps
0 likes · 14 min read
iQIYI Technical Product Team
May 31, 2024 · Artificial Intelligence

How Opal Turns iQIYI’s ML Workflow into a Unified AI Platform

Opal is iQIYI's end‑to‑end machine‑learning platform that integrates feature production, sample construction, model training, and deployment with big‑data services, addressing duplicated effort, weak data processing, and fragmented pipelines to boost efficiency across recommendation, advertising, and risk‑control scenarios.

AI Operations · Big Data Integration · Distributed Training
0 likes · 19 min read
Bilibili Tech
May 24, 2024 · Cloud Computing

Understanding and Optimizing NCCL Collective Communication Libraries for Large‑Scale Model Training

The article explains how NCCL’s collective communication libraries enable efficient large‑scale model training by parsing GPU‑to‑NIC topology, forming flat‑ring and tree topologies, improving logging and bandwidth metrics, detailing Ring AllReduce primitives, and proposing solutions to missing topology, metric, and mapping information for future optimization.

Distributed Training · GPU · NCCL
0 likes · 23 min read
Architects' Tech Alliance
May 19, 2024 · Industry Insights

InfiniBand vs RoCEv2: Which High‑Performance Network Wins AI Compute?

With AI models growing to billions of parameters, the choice of high‑performance interconnect—InfiniBand or RoCEv2—directly impacts training speed, scalability, latency, and operational complexity, and this article analyzes their architectures, performance metrics, vendor ecosystems, and suitability for large‑scale AI clusters.

Distributed Training · High‑performance computing · InfiniBand
0 likes · 13 min read
DataFunTalk
May 10, 2024 · Artificial Intelligence

GPU Performance Optimization Practices for Tencent PCG Recommendation Model Training Framework

This article presents a comprehensive overview of Tencent PCG's GPU‑based recommendation model training framework, detailing why GPU adoption is essential, the hardware and software challenges faced, the multi‑level data architecture, pipeline design, and a series of network, storage, and compute optimizations, followed by future directions.

Distributed Training · GPU · Model Training
0 likes · 13 min read
Rare Earth Juejin Tech Community
May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

Distributed Training · GPU Memory · Pipeline Parallel
0 likes · 9 min read
Baidu Intelligent Cloud Tech Hub
Apr 24, 2024 · Artificial Intelligence

How to Build and Accelerate Multi‑Chip AI Clusters for Large‑Model Training

With AI training demands outgrowing single‑chip GPU clusters, this article explains how to construct and speed up heterogeneous AI clusters—combining GPUs, Kunlun, and Ascend chips—by addressing interconnect, distributed parallel strategies, and specialized acceleration suites to achieve high MFU and efficient large‑model training.

AI clustering · Distributed Training · GPU Acceleration
0 likes · 15 min read
Cloud Native Technology Community
Apr 11, 2024 · Cloud Native

Why Kubernetes Is the Ideal Platform for Deploying Large Language Models

Deploying large language models demands massive compute, flexible scaling, and robust resource management, and this article explains how Kubernetes’s auto‑scaling, portability, cloud‑native features, observability tools, and multi‑tenant isolation make it the optimal platform for training, serving, and iterating LLM workloads.

Cloud Native · Distributed Training · Kubernetes
0 likes · 17 min read
DataFunSummit
Mar 31, 2024 · Artificial Intelligence

Challenges and Techniques in Distributed Training of Large Language Models

This article reviews the rapid development of large language models since 2019, outlines the historical background, identifies key challenges such as massive compute demand, memory constraints, and system complexity, and then details distributed training technologies—including data parallelism, pipeline parallelism, and advanced optimization strategies—while also discussing future research directions and answering common questions.

AI Infrastructure · Data Parallelism · DeepSpeed
0 likes · 23 min read
Tencent Tech
Mar 26, 2024 · Artificial Intelligence

How Tencent Angel’s AI Platform Won the 2023 CIE Science & Tech Award

Tencent’s Angel machine‑learning platform, recognized with the 2023 China Institute of Electronics Science & Technology Award, showcases breakthrough distributed training, high‑efficiency caching, adaptive sampling, multimodal fusion, and graph‑model search technologies that dramatically improve the performance and reduce the cost of large‑scale AI models.

Distributed Training · Tencent · ai
0 likes · 8 min read
NewBeeNLP
Mar 21, 2024 · Artificial Intelligence

Mastering Large Language Model Training: Key Challenges and Optimization Strategies

This article examines the resource and efficiency challenges of scaling large language model training, explains data, model, pipeline, and tensor parallelism, and provides practical I/O, communication, and stability optimization techniques—including high‑availability storage, RDMA networking, NCCL tuning, and fault‑tolerant recovery—to improve throughput and reliability.

AI Engineering · Distributed Training · I/O optimization
0 likes · 15 min read
Baidu Geek Talk
Mar 6, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Large‑Model Training with Real‑Time Observability and Fault Diagnosis

The article explains why collective communication is critical for distributed large‑model training, outlines the new requirements for system reliability, and introduces Baidu’s Collective Communication Library (BCCL), detailing its enhanced observability, fault‑diagnosis, stability, and performance optimizations that raise effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis
0 likes · 11 min read
DataFunSummit
Jan 22, 2024 · Artificial Intelligence

Improving Efficiency of Large‑Scale AI Model Training, Fine‑tuning, and Deployment with Colossal‑AI

This article introduces Colossal‑AI, an open‑source platform that tackles the challenges of training, fine‑tuning, and deploying massive AI models by leveraging efficient memory management, N‑dimensional parallelism, and high‑performance inference to dramatically reduce cost and improve scalability across thousands of GPUs.

AI Infrastructure · Colossal-AI · Distributed Training
0 likes · 21 min read
AntTech
Jan 9, 2024 · Artificial Intelligence

ATorch: Ant Group’s Open‑Source Distributed Training Acceleration Library for Large‑Scale AI Models

Ant Group’s newly open‑sourced ATorch library extends PyTorch with a layered architecture and automated resource‑aware strategies, raising utilization during large‑model training to as much as 60%, enhancing stability, and delivering significant throughput gains across multi‑node, multi‑GPU deployments.

AI acceleration · Distributed Training · PyTorch
0 likes · 6 min read
DataFunTalk
Dec 6, 2023 · Artificial Intelligence

Distributed Training Techniques and Quantitative Analysis for Large Language Models (GPT‑175B)

This article presents a comprehensive overview of state‑of‑the‑art distributed training methods for large language models, using GPT‑175B as a case study to analyze memory, communication, and compute overheads, and to recommend practical optimization strategies such as tensor, pipeline, and sequence parallelism, ZeRO‑1 optimizer, and selective activation checkpointing.

Distributed Training · GPU memory optimization · LLM
0 likes · 22 min read
DataFunTalk
Nov 21, 2023 · Artificial Intelligence

Improving Efficiency of Large-Scale Distributed Training for Large Language Models

Recent advances in large language models have dramatically increased model size and training data, leading to soaring computational costs; this article examines the scaling trends, hardware utilization challenges, distributed training techniques, and ethical considerations, highlighting methods to improve efficiency, reduce costs, and mitigate environmental impact.

AI ethics · Distributed Training · Large Language Models
0 likes · 29 min read
Ximalaya Technology Team
Oct 23, 2023 · Artificial Intelligence

HybridBackend Accelerates GPU-Based Recommendation Model Training for Ximalaya AI Cloud

Ximalaya AI Cloud adopted the open‑source HybridBackend framework to overcome sparse‑data bottlenecks, enabling columnar Parquet reads and hybrid parallel GPU training that boost GPU utilization more than threefold, cut recommendation model training time by more than half, and now power all TensorFlow and DeepRec production models.

AI cloud · Distributed Training · GPU training
0 likes · 8 min read
NetEase Smart Enterprise Tech+
Oct 19, 2023 · Artificial Intelligence

Unleashing Game AI: Inside NetEase’s Bray Distributed RL Framework

NetEase’s AI team reveals how their self‑developed distributed reinforcement‑learning platform, Bray, enables high‑level AI agents for the MOBA game Dream of Kingdom 2, covering GameCore integration, weighted random initialization, modular APIs, difficulty scaling, and cost‑effective training for realistic player experiences.

AI Framework · Distributed Training · MoBA
0 likes · 9 min read
Baidu Intelligent Cloud Tech Hub
Oct 18, 2023 · Cloud Computing

How AI Is Redefining Cloud Computing: From Scale‑Up to Serverless

The talk explores how the rise of large AI models is transforming cloud computing architecture, workloads, and services—shifting from traditional virtualization to heterogeneous compute, massive scaling, serverless infrastructures, and new networking designs that together enable agile AI‑native applications.

AI-native · Distributed Training · Hardware acceleration
0 likes · 23 min read
Alimama Tech
Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeed · Distributed Training · GPU Optimization
0 likes · 10 min read
iQIYI Technical Product Team
Aug 11, 2023 · Artificial Intelligence

Debugging Random OOM Issues in PyTorch Distributed Training on A100 Clusters

The iQIYI backend team traced random OOM crashes in PyTorch Distributed Data Parallel on an A100 cluster to a malformed DDP message injected by a security scan, which forced a near‑terabyte allocation; using jemalloc for diagnostics, they mitigated the issue by adjusting scan policies and collaborating with PyTorch to harden the protocol.

Distributed Training · Memory Debugging · OOM
0 likes · 9 min read
Architects' Tech Alliance
Aug 10, 2023 · Industry Insights

InfiniBand vs RoCEv2: Which Network Powers AI Model Training?

This article examines the architecture of AI compute clusters, explaining offline training and inference pipelines, the role of RDMA, and the technical differences between InfiniBand and RoCEv2—including latency, bandwidth, scalability, cost, and vendor considerations—to help engineers choose the optimal high‑performance network for large‑model training.

AI compute · Distributed Training · High‑Performance Networking
0 likes · 13 min read
Baidu Intelligent Cloud Tech Hub
Jul 24, 2023 · Artificial Intelligence

How PaddlePaddle Powers Large‑Model Distributed Training: Techniques & Optimizations

This article explains the challenges of training massive AI models and details PaddlePaddle's 4D hybrid parallelism, MoE acceleration, long‑sequence strategies, end‑to‑end performance optimizations, and practical code examples for building and scaling large models efficiently.

Distributed Training · PaddlePaddle · Parallelism
0 likes · 12 min read
Huawei Cloud Developer Alliance
Jul 17, 2023 · Artificial Intelligence

How MindSpore’s Auto Parallel Tech Simplifies Large-Model Training

During a livestream titled “Solving the ‘Development Difficulty’ of Large Models with MindSpore Auto Parallel”, Huawei’s MindSpore experts explained how the framework’s distributed training techniques—including data, model, and pipeline parallelism as well as memory‑saving strategies—enable efficient pre‑training of trillion‑parameter models across diverse AI domains.

Data Parallel · Distributed Training · Memory Optimization
0 likes · 6 min read
Alibaba Cloud Native
Jun 25, 2023 · Artificial Intelligence

Accelerate Large‑Scale LLM Training on Alibaba Cloud ACK with DeepSpeed and Arena

This guide explains how to leverage Alibaba Cloud Container Service ACK's AI suite and DeepSpeed to efficiently run distributed large‑language‑model training on Kubernetes, covering prerequisites, configuration, command‑line deployment, monitoring with TensorBoard, and performance‑optimizing techniques.

Alibaba Cloud · Arena · DeepSpeed
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Jun 21, 2023 · Artificial Intelligence

How Baidu’s AIPod Network Powers Massive AI Model Training

This article explains the design and engineering of Baidu's AIPod high‑performance network, detailing the massive bandwidth, scalability, stability, and low‑latency requirements of large‑scale AI model training and the practical tools used to monitor and troubleshoot such workloads.

AIPod · Distributed Training · High‑Performance Networking
0 likes · 22 min read
Baidu Tech Salon
May 11, 2023 · Artificial Intelligence

Inside Baidu’s High‑Performance GPU Cluster: Powering the Next‑Gen AI Models

The article details Baidu's development of a massive high‑performance GPU/IB cluster, its architectural design, the challenges of training trillion‑parameter models, and how the integrated AI stack—spanning hardware, framework, and resource management—overcomes compute, memory, and communication bottlenecks to accelerate large‑model training.

AI Infrastructure · Baidu AI Base · Distributed Training
0 likes · 17 min read
Amap Tech
May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data
0 likes · 29 min read
Baidu Intelligent Cloud Tech Hub
May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI Infrastructure · Distributed Training · GPU clusters
0 likes · 19 min read
DataFunTalk
May 2, 2023 · Artificial Intelligence

Automatic Parallelism in PaddlePaddle: Architecture, Implementation, and Application Practice

This article presents a comprehensive overview of PaddlePaddle's automatic parallel design for heterogeneous scenarios, covering background motivation, architectural principles, key implementation details, practical usage interfaces, and future outlook, while illustrating concepts with detailed diagrams and examples.

AI frameworks · Distributed Training · PaddlePaddle
0 likes · 19 min read
21CTO
Apr 21, 2023 · Artificial Intelligence

Essential AI Reading List: LLMs, AutoGPT, Distributed Training & More

This curated collection highlights the latest open‑source LLM breakthroughs, comprehensive surveys, AutoGPT developments, distributed training pitfalls, and practical tools for AI engineers, providing concise descriptions and direct links to each resource for deeper exploration.

AI research · AutoGPT · Distributed Training
0 likes · 10 min read
Baidu Geek Talk
Apr 19, 2023 · Artificial Intelligence

Why Does Recompute Crash Distributed Training? A Deep Dive into Checkpoint Issues and Fixes

When training large‑batch deep learning models, developers often use recompute to trade computation for memory, but in dynamic graph frameworks this can trigger synchronization errors in distributed data parallel training; the article explains the underlying DDP mechanics, illustrates the error, and offers a practical no_sync workaround with code examples.

Checkpoint · Distributed Training · PyTorch
0 likes · 14 min read
Alibaba Cloud Big Data AI Platform
Apr 11, 2023 · Artificial Intelligence

How DeepRec Boosted Sparse Model Training and Inference for Large‑Scale Recommendations

This article details how the MetaApp advertising team adopted Alibaba Cloud's open‑source DeepRec to overcome parameter‑server bottlenecks, compress terabyte‑scale embeddings, leverage GPU‑accelerated distributed training, and build a low‑maintenance, high‑performance inference service using DeepRec's Processor and oneDNN optimizations.

DeepRec · Distributed Training · EmbeddingVariable
0 likes · 13 min read
DataFunSummit
Apr 2, 2023 · Artificial Intelligence

Efficient Training of Large Models with the Open‑Source Distributed Framework Easy Parallel Library (EPL)

This article introduces the challenges of scaling deep‑learning model training, explains the design and components of the open‑source Easy Parallel Library (EPL) that unifies data, pipeline, and operator‑split parallelism, and demonstrates its best‑practice results on large‑scale classification, BERT‑large, and massive multimodal models.

Distributed Training · EPL · Large-Scale Training
0 likes · 15 min read
Tencent Advertising Technology
Mar 30, 2023 · Artificial Intelligence

Tencent's Taiji Machine Learning Platform: End-to-End MLOps for Advertising

Tencent’s Taiji machine learning platform, a cloud‑native, distributed parameter‑server system, provides end‑to‑end MLOps for advertising by integrating data ingestion, feature engineering, model training, evaluation, deployment, and monitoring, supporting massive models up to billions of parameters while improving efficiency, scalability, and resource management.

Distributed Training · MLOps · Machine Learning Platform
0 likes · 18 min read
NetEase Smart Enterprise Tech+
Mar 27, 2023 · Artificial Intelligence

How Reinforcement Learning Powers AI Bots in ‘Barbarian Battle 2’

This article details NetEase Zhiji and Dianhun Network's use of reinforcement learning, a distributed training framework, and middleware to create, train, deploy, and iterate AI robots for the game "Barbarian Battle 2", highlighting technical challenges, solutions, and the impact on player experience.

AI bots · Distributed Training · Game Development
0 likes · 13 min read
Baidu Geek Talk
Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI Infrastructure · Cost Model · Distributed Training
0 likes · 27 min read
Alibaba Cloud Big Data AI Platform
Mar 20, 2023 · Artificial Intelligence

How HybridBackend Supercharged Ximalaya’s Recommendation Engine with GPU Acceleration

This article details how Ximalaya’s AI Cloud adopted the open‑source HybridBackend framework to overcome sparse data access and distributed training bottlenecks, achieving multi‑GPU utilization gains, faster model training, and significant cost reductions across its recommendation services.

Distributed Training · GPU Acceleration · HybridBackend
0 likes · 9 min read
Hulu Beijing
Mar 16, 2023 · Artificial Intelligence

Inside Hulu’s Distributed Training Platform: Architecture, Challenges, and Solutions

This article explores Hulu’s five‑year‑old machine‑learning training platform, detailing its three‑layer architecture, the shift from single‑node to distributed training, and the technical solutions—including parameter servers, Ring AllReduce, Kubernetes, Volcano, and Horovod—that enable scalable AI workloads across GPU, CPU, and storage resources.

AI Infrastructure · Distributed Training · Hulu
0 likes · 13 min read
DataFunSummit
Jan 14, 2023 · Artificial Intelligence

Deep Graph Library (DGL): Technical Features, Community Progress, and Challenges in Graph Deep Learning

This article provides a comprehensive overview of the Deep Graph Library (DGL), covering its technical characteristics, open‑source community developments, various graph learning tasks, message‑passing mechanisms, system design challenges, training strategies on single and multiple GPUs, inference optimization, and a Q&A comparing DGL with other frameworks.

Deep Graph Library · Distributed Training · GNN Training
0 likes · 15 min read