Tagged articles
17 articles
Baidu Intelligent Cloud Tech Hub
Apr 24, 2026 · Artificial Intelligence

LoongForge: Open‑Source Multimodal Training Framework Runs on GPU and Kunlun XPU with 45% Speedup

LoongForge is an open‑source, Megatron‑based multimodal training framework that unifies LLM, VLM, VLA and diffusion models, runs seamlessly on NVIDIA GPUs and Baidu Kunlun XPU, and delivers 15%‑45% end‑to‑end training acceleration while scaling linearly on thousands of cards.

GPU · Kunlun XPU · LoongForge
23 min read
Instant Consumer Technology Team
Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA
9 min read
Xiaohongshu Tech REDtech
Dec 11, 2025 · Artificial Intelligence

Fine‑Grained Activation Offloading: Cutting Memory Use While Preserving LLM Throughput

The article introduces a fine‑grained activation offloading technique implemented in Megatron‑Core that offloads module‑level activations to CPU, overlaps transfer with computation, and remains compatible with pipeline and virtual pipeline parallelism, dramatically reducing peak GPU memory for large language models while incurring minimal throughput loss.

LLM · Megatron · Memory Optimization
18 min read
AI2ML AI to Machine Learning
Nov 4, 2025 · Artificial Intelligence

Common Debugging Signals for Large Language Models

This article outlines the end‑to‑end workflow for large‑model training, highlights typical debugging challenges such as out‑of‑memory (OOM) errors, performance bottlenecks, and gradient issues, and provides concrete strategies, tools (DeepSpeed, Megatron, Torchtitan, veScale), and best‑practice checklists to help engineers diagnose and resolve problems efficiently.

Debugging · DeepSpeed · LLM
12 min read
DataFunSummit
Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the verl training system, integrating Megatron and SGLang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI Performance · Megatron · Training Optimization
4 min read
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Megatron · Model Parallelism
16 min read
Alibaba Cloud Big Data AI Platform
Mar 7, 2025 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts Qwen2‑VL Multimodal Training Efficiency

This article explains how the Pai‑Megatron‑Patch toolkit enhances the usability and training performance of the Qwen2‑VL multimodal large model by introducing model‑parallel weight conversion, user‑friendly data loading, visual feature processing optimizations, optimizer offloading, and pipeline parallelism techniques, supported by extensive experimental analysis.

Megatron · Pipeline Parallelism · Qwen2-VL
25 min read
Baobao Algorithm Notes
Oct 30, 2024 · Artificial Intelligence

How Sequence Parallelism Slashes Activation Memory in Megatron Training

This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.

Megatron · Tensor Parallelism · activation memory
20 min read
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

Data Parallelism · DeepSpeed · Distributed Training
27 min read
Alibaba Cloud Big Data AI Platform
Sep 12, 2024 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts LLM Training with Offloading, FlashAttention‑3, and Communication Overlap

This article introduces Pai‑Megatron‑Patch, a suite of tools built on NVIDIA Megatron‑LM that accelerates large language model training through dense and MoE model support, high‑precision HuggingFace↔MCore weight conversion, CPU offloading for optimizers and activations, FlashAttention‑3, and communication‑compute overlapping. It also provides detailed experimental results and command‑line usage examples.

CPU offloading · Communication Overlap · Distributed Optimizer
22 min read
DataFunTalk
Dec 6, 2023 · Artificial Intelligence

Distributed Training Techniques and Quantitative Analysis for Large Language Models (GPT‑175B)

This article presents a comprehensive overview of state‑of‑the‑art distributed training methods for large language models, using GPT‑175B as a case study to analyze memory, communication, and compute overheads, and to recommend practical optimization strategies such as tensor, pipeline, and sequence parallelism, ZeRO‑1 optimizer, and selective activation checkpointing.

Distributed Training · GPU memory optimization · LLM
22 min read
Alibaba Cloud Infrastructure
Sep 13, 2023 · Artificial Intelligence

Pai‑Megatron‑Patch: Design Principles, Key Features, and End‑to‑End Usage for Large Language Model Training

This article introduces the open‑source Pai‑Megatron‑Patch tool from Alibaba Cloud, explains its non‑intrusive patch architecture, enumerates supported models and features such as weight conversion, Flash‑Attention 2.0, and FP8 training with Transformer Engine, and provides detailed command‑line examples for model conversion, pre‑training, supervised fine‑tuning, inference, and RLHF pipelines.

Deep Learning · FP8 · LLM
19 min read
Alibaba Cloud Big Data AI Platform
Sep 13, 2023 · Artificial Intelligence

How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

This article introduces Pai‑Megatron‑Patch, an open‑source tool from Alibaba Cloud that streamlines large language model (LLM) training, weight conversion, FP8 mixed‑precision acceleration, and reinforcement‑learning workflows, providing detailed architecture, key features, code examples, and step‑by‑step usage instructions.

FP8 · LLM training · Megatron
19 min read
DataFunSummit
Jan 5, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

These notes explain how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—illustrated with Megatron-LM, MoE models, and practical compression techniques such as quantization, distillation, and pruning.

AI · GPU · Megatron
16 min read
DataFunTalk
Jan 4, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

This article explains how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—detailing methods such as model, pipeline, and tensor parallelism, Megatron framework, MoE models, and various model compression techniques.

AI · GPU · Megatron
17 min read