Tag: distributed training


Architects' Tech Alliance
May 26, 2025 · Fundamentals

Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training

The article explains how distributed AI training performance depends on reducing inter‑card communication latency, introduces RDMA technology and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability against traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.

High Performance Computing · InfiniBand · RDMA
12 min read
Baidu Geek Talk
Apr 14, 2025 · Artificial Intelligence

PaddlePaddle Framework 3.0: Five Core Breakthroughs Reshaping Large Model Development

PaddlePaddle Framework 3.0 delivers five breakthroughs—dynamic‑static unified automatic parallelism, integrated training‑inference pipelines, high‑order scientific differentiation, a neural‑network compiler with automatic operator fusion, and streamlined heterogeneous chip adaptation—drastically reducing development effort, boosting training speed, and expanding compatibility for large‑scale AI models.

AI infrastructure · Large Language Models · Model Inference Optimization
23 min read
Python Programming Learning Circle
Apr 3, 2025 · Artificial Intelligence

Accelerating PyTorch Model Training: Techniques, Benchmarks, and Code

This article explains how to dramatically speed up PyTorch model training using code optimizations, mixed‑precision, torch.compile, distributed data parallelism, and DeepSpeed, presenting benchmark results that show up to 11.5× acceleration on multiple GPUs while maintaining high accuracy.

DeepSpeed · GPU · Mixed Precision
6 min read
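As a taste of what that article benchmarks, here is a minimal sketch (assumptions mine, not the article's benchmark code) of mixed-precision training on a toy model; on GPU you would typically use float16 with a GradScaler, while bfloat16 needs no loss scaling:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# model = torch.compile(model)  # optional: kernel fusion via TorchDynamo/Inductor

x = torch.randn(32, 64, device=device)
y = torch.randint(0, 10, (32,), device=device)

# Autocast runs matmuls in a lower-precision dtype while keeping
# numerically sensitive ops (e.g., the loss reduction) in float32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    logits = model(x)
    loss = nn.functional.cross_entropy(logits, y)

loss.backward()
optimizer.step()
```

Distributed data parallelism and DeepSpeed layer on top of exactly this loop by sharding the batch (and, for DeepSpeed ZeRO, the optimizer state) across workers.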
JD Retail Technology
Mar 4, 2025 · Artificial Intelligence

JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

AI · GPU · Inference
19 min read
DataFunSummit
Mar 3, 2025 · Artificial Intelligence

DeepSeek Open Source Week: Seven Core Technologies Reshaping Large‑Model Training

The DeepSeek open‑source week introduced seven breakthrough technologies—FlashMLA, DeepGEMM, DeepEP, DualPipe, EPLB, 3FS, and Smallpond—that together overhaul data flow, algorithmic complexity, hardware utilization, MoE communication, and resource balancing, dramatically improving large‑model training efficiency and lowering entry barriers for the AI industry.

AI hardware · DeepSeek · Large Models
17 min read
JD Tech Talk
Mar 3, 2025 · Artificial Intelligence

AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

AI · GPU · Inference
20 min read
DataFunTalk
Mar 2, 2025 · Artificial Intelligence

Implementing GRPO from Scratch with Distributed Reinforcement Learning on Qwen2.5-1.5B-Instruct

This tutorial explains how to build a distributed reinforcement‑learning pipeline using the GRPO algorithm, covering data preparation, evaluation and reward functions, multi‑GPU DataParallel implementation, and full fine‑tuning of the Qwen2.5‑1.5B‑Instruct model with PyTorch, FlashAttention2 and Weights & Biases.

AI · GRPO · PyTorch
10 min read
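As a hypothetical sketch of GRPO's core idea (names and numbers are mine, not the tutorial's): each prompt gets a group of sampled completions, and each completion's advantage is its reward standardized within that group, so no separate value network is needed:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: per-completion rewards for one prompt's sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions scored by a reward function.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Advantages sum to ~0: above-average completions are reinforced,
# below-average ones are penalized.
```

In the distributed setting, each GPU can score its own group locally, which is what makes the algorithm pleasant to parallelize.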
DataFunSummit
Feb 14, 2025 · Artificial Intelligence

Building Large‑Scale Recommendation Systems with Big Data and Large Language Models on Alibaba Cloud AI Platform

This presentation details how Alibaba Cloud's AI platform integrates big‑data pipelines, feature‑store services, and large language model capabilities to construct high‑performance search‑recommendation architectures, covering system design, training and inference optimizations, LLM‑driven use cases, and open‑source RAG tooling.

AI Platform · Big Data · Feature Store
17 min read
DataFunSummit
Jan 28, 2025 · Artificial Intelligence

Few-Shot Learning for Multi-New-Class Scenarios: Challenges, Methodology, and Experimental Evaluation

This article introduces a novel few‑shot learning approach tailored for multi‑new‑class scenarios, discusses its background, problem definition, proposed parallel training framework, hierarchical fine‑tuning method, and presents extensive experiments demonstrating superior performance and computational efficiency.

computer vision · distributed training · few-shot learning
10 min read
DataFunSummit
Jan 21, 2025 · Artificial Intelligence

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

This article presents NVIDIA's NeMo technology stack for end‑to‑end large language model (LLM) training, covering the full software pipeline, model alignment with reinforcement learning from human feedback (RLHF), performance optimizations such as model parallelism, FP8, TensorRT‑LLM inference, dynamic load balancing, and future research directions.

GPU optimization · LLM · NeMo
24 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Medusa · PPO · PRM
21 min read
DataFunSummit
Dec 30, 2024 · Artificial Intelligence

Colossal-AI: A Scalable Framework for Distributed Training of Large Models

This presentation introduces the challenges of the large‑model era, describes the Colossal‑AI architecture—including N‑dimensional parallelism, heterogeneous storage, and zero‑code experience—shows benchmark results and real‑world use cases, and answers audience questions about its integration with PyTorch and advanced parallel strategies.

AI infrastructure · Colossal-AI · Large Models
11 min read
Kuaishou Large Model
Nov 22, 2024 · Artificial Intelligence

Boost LLM Training on Massive Clusters with DP/TP Overlap and Context Parallelism

This article details a comprehensive set of techniques—including data‑ and tensor‑parallel overlap, context‑parallelism, activation rematerialization, and a performance‑driven cost model—that dramatically improve large‑language‑model training efficiency on ultra‑large GPU clusters while preserving model quality.

Large Language Models · Parallelism · activation recomputation
28 min read
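One of the techniques named above, activation rematerialization, can be illustrated in a few lines (a sketch under my own toy setup, not Kuaishou's code) with `torch.utils.checkpoint`: activations inside the wrapped block are discarded during the forward pass and recomputed during backward, trading extra FLOPs for memory:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

x = torch.randn(8, 256, requires_grad=True)
# use_reentrant=False selects the recommended checkpointing implementation.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # the block's intermediate activations are recomputed here
```

At cluster scale, the art is in deciding *which* blocks to checkpoint, which is where a performance-driven cost model comes in.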
Kuaishou Tech
Nov 21, 2024 · Artificial Intelligence

Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters

This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.

GPU clusters · Large Language Models · activation rematerialization
27 min read
Tencent Tech
Nov 19, 2024 · Artificial Intelligence

How Tencent’s Angel Platform Secured the 2024 World Internet Conference Leading Technology Award

Tencent’s Angel machine learning platform, recognized for breakthroughs in trillion‑scale model training, inference, and deployment, won the 2024 World Internet Conference Leading Technology Award, highlighting its self‑developed hardware‑software stack, high‑performance networking, and extensive real‑world AI applications.

AI Platform · Angel · Large Models
6 min read
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI infrastructure · GPU Cluster · Kubernetes
22 min read
DataFunSummit
Oct 5, 2024 · Artificial Intelligence

Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch

This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.

CUDA Graph · GPU optimization · PyTorch
11 min read
Baidu Geek Talk
Aug 26, 2024 · Artificial Intelligence

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

The article presents three RLHF‑PPO acceleration techniques—TRT‑LLM‑based text generation speedups, selective activation recomputation with sequence parallelism for dynamic memory reduction, and overlapping pipeline stages for system‑level parallelism—demonstrating a 350% throughput boost on a 10B model using 16 A100 GPUs.

GPU optimization · Large Language Models · PPO optimization
16 min read
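For context on what the accelerated training loop is computing, here is the generic PPO clipped-surrogate loss (the standard algorithm, not the article's optimized TRT‑LLM pipeline; tensors and values are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic bound: take the smaller objective, then negate for descent.
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(
    logp_new=torch.tensor([-1.0, -0.5]),
    logp_old=torch.tensor([-1.1, -0.9]),
    advantages=torch.tensor([1.0, -1.0]),
)
```

The generation phase (sampling rollouts to score) dominates wall-clock time in RLHF, which is why the article's first lever is faster text generation rather than a faster loss.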
DataFunSummit
Aug 8, 2024 · Artificial Intelligence

GPU Throughput and Low‑Latency Optimization Practices in JD Advertising

This article presents JD Advertising's technical practices for improving GPU throughput and reducing latency in large‑scale recommendation scenarios, covering system challenges, storage and compute optimizations for training, low‑latency inference techniques, and compiler extensions to handle massive sparse models.

AI · GPU optimization · Recommendation systems
13 min read
360 Smart Cloud
Jul 17, 2024 · Artificial Intelligence

Parallelism and Memory‑Optimization Techniques for Distributed Large‑Scale Transformer Training

This article reviews the principles and practical implementations of data, pipeline, tensor, sequence, and context parallelism together with memory‑saving strategies such as recomputation and ZeRO, and demonstrates how the QLM framework leverages these techniques to accelerate large‑model training and fine‑tuning on multi‑GPU clusters.

GPU · Large Language Models · Megatron-LM
18 min read
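The tensor-parallelism idea that article reviews can be shown in a toy single-process sketch (mine, not the QLM framework's code): a linear layer's weight is split column-wise across two "ranks", each rank computes a partial output, and the shards are concatenated—the same scheme Megatron-LM applies across GPUs, with an all-gather in place of `torch.cat`:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)       # activations, replicated on every rank
W = torch.randn(8, 16)      # full weight of a Linear(8, 16)

W0, W1 = W.chunk(2, dim=1)  # each rank holds half of the output columns
partial0 = x @ W0           # computed on "rank 0"
partial1 = x @ W1           # computed on "rank 1"
y_parallel = torch.cat([partial0, partial1], dim=1)  # the "all-gather"

y_full = x @ W              # single-device reference
assert torch.allclose(y_parallel, y_full)
```

Splitting columns keeps the forward pass communication-free until the gather; a row-wise split instead requires an all-reduce of partial sums, and real frameworks alternate the two to minimize traffic.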