Tagged articles

17 articles

Apr 24, 2026 · Artificial Intelligence

LoongForge: Open‑Source Multimodal Training Framework Runs on GPU and Kunlun XPU with 45% Speedup

LoongForge is an open‑source, Megatron‑based multimodal training framework that unifies LLM, VLM, VLA and diffusion models, runs seamlessly on NVIDIA GPUs and Baidu Kunlun XPU, and delivers 15%‑45% end‑to‑end training acceleration while scaling linearly on thousands of cards.

GPU · Kunlun XPU · LoongForge

0 likes · 23 min read

Instant Consumer Technology Team

Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA

0 likes · 9 min read

Xiaohongshu Tech REDtech

Dec 11, 2025 · Artificial Intelligence

Fine‑Grained Activation Offloading: Cutting Memory Use While Preserving LLM Throughput

The article introduces a fine‑grained activation offloading technique implemented in Megatron‑Core that offloads module‑level activations to CPU, overlaps transfer with computation, and remains compatible with pipeline and virtual pipeline parallelism, dramatically reducing peak GPU memory for large language models while incurring minimal throughput loss.

LLM · Megatron · Memory Optimization

0 likes · 18 min read

AI2ML AI to Machine Learning

Nov 4, 2025 · Artificial Intelligence

Common Debugging Signals for Large Language Models

This article outlines the end‑to‑end workflow for large‑model training, highlights typical debugging challenges such as out‑of‑memory (OOM) errors, performance bottlenecks, and gradient issues, and provides concrete strategies, tools (DeepSpeed, Megatron, Torchtitan, veScale), and best‑practice checklists to help engineers diagnose and resolve problems efficiently.

Debugging · DeepSpeed · LLM

0 likes · 12 min read

DataFunSummit

Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the Verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the Verl training system, integrating Megatron and SGLang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI Performance · Megatron · Training Optimization

0 likes · 4 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Megatron · Model Parallelism

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Mar 7, 2025 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts Qwen2‑VL Multimodal Training Efficiency

This article explains how the Pai‑Megatron‑Patch toolkit enhances the usability and training performance of the Qwen2‑VL multimodal large model by introducing model‑parallel weight conversion, user‑friendly data loading, visual feature processing optimizations, optimizer offloading, and pipeline parallelism techniques, supported by extensive experimental analysis.

Megatron · Pipeline Parallelism · Qwen2-VL

0 likes · 25 min read

Baobao Algorithm Notes

Nov 4, 2024 · Artificial Intelligence

How DeepSpeed Ulysses Cuts Communication Overhead Compared to Megatron

This article provides a detailed technical analysis of DeepSpeed Ulysses, explaining its sequence‑parallel workflow, comparing its communication volume with Megatron, and examining how All2All operations and ZeRO‑3 integration affect scalability and efficiency.

All2All · DeepSpeed · Megatron

0 likes · 15 min read

Baobao Algorithm Notes

Oct 30, 2024 · Artificial Intelligence

How Sequence Parallelism Slashes Activation Memory in Megatron Training

This article provides a detailed technical walkthrough of sequence parallelism (SP) for Megatron models, covering tensor parallelism basics, precise activation memory calculations for MLP and attention layers, the SP implementation that splits activations across GPUs, and selective activation recomputation strategies that further reduce memory while preserving training speed.

Megatron · Tensor Parallelism · activation memory

0 likes · 20 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Master Distributed Training for Massive AI Models on Multi‑GPU Clusters

This guide walks you through the fundamentals of distributed training for large AI models, explaining data, model, and pipeline parallelism, GPU communication primitives, and advanced techniques like Megatron 3‑D parallelism and DeepSpeed ZeRO stages, with practical examples and visual illustrations to help you design efficient multi‑GPU training pipelines.

Data Parallelism · DeepSpeed · Distributed Training

0 likes · 27 min read

Alibaba Cloud Big Data AI Platform

Sep 12, 2024 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts LLM Training with Offloading, FlashAttention‑3, and Communication Overlap

This article introduces Pai‑Megatron‑Patch, a suite of tools built on NVIDIA Megatron‑LM that accelerates large language model training through dense and MoE model support, high‑precision HuggingFace↔MCore weight conversion, CPU offloading for optimizers and activations, FlashAttention‑3, and communication‑compute overlapping, and provides detailed experimental results and command‑line usage examples.

CPU offloading · Communication Overlap · Distributed Optimizer

0 likes · 22 min read

DataFunTalk

Dec 6, 2023 · Artificial Intelligence

Distributed Training Techniques and Quantitative Analysis for Large Language Models (GPT‑175B)

This article presents a comprehensive overview of state‑of‑the‑art distributed training methods for large language models, using GPT‑175B as a case study to analyze memory, communication, and compute overheads, and to recommend practical optimization strategies such as tensor, pipeline, and sequence parallelism, ZeRO‑1 optimizer, and selective activation checkpointing.

Distributed Training · GPU memory optimization · LLM

0 likes · 22 min read

Alibaba Cloud Infrastructure

Sep 13, 2023 · Artificial Intelligence

Pai‑Megatron‑Patch: Design Principles, Key Features, and End‑to‑End Usage for Large Language Model Training

This article introduces the open‑source Pai‑Megatron‑Patch tool from Alibaba Cloud, explains its non‑intrusive patch architecture, enumerates supported models and features such as weight conversion, Flash‑Attention 2.0, and FP8 training with Transformer Engine, and provides detailed command‑line examples for model conversion, pre‑training, supervised fine‑tuning, inference, and RLHF pipelines.

Deep Learning · FP8 · LLM

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Sep 13, 2023 · Artificial Intelligence

How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

This article introduces Pai‑Megatron‑Patch, an open‑source tool from Alibaba Cloud that streamlines large language model (LLM) training, weight conversion, FP8 mixed‑precision acceleration, and reinforcement‑learning workflows, providing detailed architecture, key features, code examples, and step‑by‑step usage instructions.

FP8 · LLM training · Megatron

0 likes · 19 min read

DataFunSummit

May 25, 2023 · Artificial Intelligence

Intel Announces Aurora genAI: A Trillion-Parameter Generative AI Model Powered by the Aurora Supercomputer

Intel revealed its Aurora genAI project, a generative AI model with up to one trillion parameters that will run on the Aurora supercomputer, which delivers over 2 exaflops of performance. The model builds on the NVIDIA Megatron and Microsoft DeepSpeed frameworks and targets scientific as well as broader AI applications.

Aurora · HPC · Intel

0 likes · 9 min read

DataFunSummit

Jan 5, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

These notes explain how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—illustrated with Megatron-LM, MoE models, and practical compression techniques such as quantization, distillation, and pruning.

AI · GPU · Megatron

0 likes · 16 min read

DataFunTalk

Jan 4, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

This article explains how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—detailing methods such as model, pipeline, and tensor parallelism, Megatron framework, MoE models, and various model compression techniques.

AI · GPU · Megatron

0 likes · 17 min read