Tagged articles

12 articles

May 17, 2026 · Artificial Intelligence

How DeepSeek Leverages MoE Parallelism: GPU Compute and Communication Optimizations

The article dissects DeepSeek's MoE model‑parallel strategy, explaining how GPU compute and communication are overlapped through expert, pipeline, and ZeRO‑1 parallelism, and introduces DualPipe and Waved‑EP kernels that enable efficient training on large‑scale hardware.

DeepSeek · GPU Communication Overlap · Mixture of Experts

0 likes · 18 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Megatron · Model Parallelism

0 likes · 16 min read

Architect

May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI training · Data Parallelism · Expert Parallelism

0 likes · 14 min read

Bilibili Tech

Mar 4, 2025 · Artificial Intelligence

Engineering Practices and Optimizations for Text‑to‑Video Generation Models (OpenSora, CogVideoX) by the Bilibili TTV Team

The Bilibili TTV team optimized the OpenSora and CogVideoX text‑to‑video models by redesigning data storage with Alluxio, parallelizing VAE encoding, applying dynamic sequence parallelism and DeepSpeed‑Ulysses attention, and adapting GPU code for NPU execution. Profiling‑driven kernel fusion, FlashAttention, and expandable memory further raised training efficiency and frame throughput, and the article outlines future plans for pipeline parallelism and ZeRO‑3 scaling.

Diffusion Transformer · FlashAttention · Model Parallelism

0 likes · 26 min read

AI Algorithm Path

Feb 10, 2025 · Artificial Intelligence

Understanding DualPipe: Deep Dive into DeepSeek‑R1 Architecture (Part 5)

This article explains how the DualPipe scheduling mechanism in DeepSeek‑R1 improves GPU cluster compute‑communication efficiency by using fine‑grained pipeline stages and bidirectional data flow, comparing it with Zero Bubble pipeline parallelism and discussing the challenges of large‑scale distributed training.

DeepSeek · Distributed Training · DualPipe

0 likes · 10 min read

NewBeeNLP

Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines techniques for compressing and accelerating the KV cache in transformer models, including MQA, GQA, MLA, sliding‑window and linear attention, FlashAttention, PagedAttention, and Ring Attention, as well as mixed‑precision training and ZeRO parallelism, and provides code snippets, implementation details, and practical trade‑offs.

FlashAttention · KV cache · Model Parallelism

0 likes · 17 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

Data Parallelism · DeepSpeed · Distributed Training

0 likes · 27 min read

Alibaba Cloud Big Data AI Platform

Jan 29, 2024 · Artificial Intelligence

Unlocking Sparse MoE Large Model Training with Megatron-Core on Alibaba Cloud

This article explains how Alibaba Cloud's PAI platform and NVIDIA's Megatron-Core enable efficient training of sparse Mixture-of-Experts (MoE) large language models, covering algorithm basics, the Megatron-Core MoE framework, weight conversion pipelines, and performance results on Mixtral‑8x7B.

Megatron-Core · Mixture of Experts · Model Parallelism

0 likes · 18 min read

360 Smart Cloud

Jan 26, 2024 · Artificial Intelligence

Parallel Strategies for Distributed Deep Learning Training

This article reviews distributed training techniques for large deep‑learning models, covering data parallelism, model parallelism (including pipeline and tensor parallelism), gradient bucketing and accumulation, 3D parallelism, and practical implementations such as Megatron‑LM and 360AI platform optimizations.

AI · Data Parallelism · Deep Learning

0 likes · 22 min read

DataFunSummit

Apr 2, 2023 · Artificial Intelligence

Efficient Training of Large Models with the Open‑Source Distributed Framework Easy Parallel Library (EPL)

This article introduces the challenges of scaling deep‑learning model training, explains the design and components of the open‑source Easy Parallel Library (EPL) that unifies data, pipeline, and operator‑split parallelism, and demonstrates its best‑practice results on large‑scale classification, BERT‑large, and massive multimodal models.

Distributed Training · EPL · Large-Scale Training

0 likes · 15 min read

Baidu Geek Talk

Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI Infrastructure · Cost Model · Distributed Training

0 likes · 27 min read

DataFunSummit

Aug 16, 2021 · Artificial Intelligence

Scaling Deep Learning Models: From Depth to Width and Parallelism Strategies

The article reviews how deep learning models have grown deeper and wider, discusses the memory and bandwidth limits of single GPUs, and explains pipeline and sharding techniques—including GPU clusters and TPU pods—to efficiently train large‑scale models in industrial settings.

GPU · Mixture of Experts · Model Parallelism

0 likes · 6 min read
