How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

This article introduces Pai‑Megatron‑Patch, an open‑source tool from Alibaba Cloud that streamlines large language model (LLM) training, weight conversion, FP8 mixed‑precision acceleration, and reinforcement‑learning workflows, providing detailed architecture, key features, code examples, and step‑by‑step usage instructions.

Alibaba Cloud Big Data AI Platform

Sep 13, 2023

Introduction

With the rapid evolution of large language models (LLMs) and their growing scale, developers need tools that simplify training, reduce development effort, and handle continuous model iteration. Pai‑Megatron‑Patch, released by Alibaba Cloud's PAI platform, addresses these needs.

What is Pai‑Megatron‑Patch?

Pai‑Megatron‑Patch is a toolkit built by the PAI algorithm team on top of the Alibaba Cloud intelligent computing service (PAI‑Lingjun). It enables efficient distributed training, supervised instruction fine‑tuning, and offline inference verification for LLMs, offering a complete development pipeline compatible with Megatron‑LM.

Main Features

Supports many popular LLMs, including LLaMA, LLaMA‑2, CodeLLaMA, Baichuan, Qwen, Falcon, GLM, StarCoder, Bloom, and ChatGLM.

Provides weight conversion between HuggingFace, Megatron, and Transformer Engine formats.

Accelerates training with FlashAttention 2.0 and Transformer Engine FP8 support while ensuring convergence.

Ships rich, easy‑to‑use examples covering pre‑training, fine‑tuning, evaluation, inference, and reinforcement‑learning best practices.

Technical Architecture

Pai‑Megatron‑Patch follows a non‑intrusive design: it does not modify Megatron‑LM source code but adds functionality via patch files. The patch builds the LLM training pipeline by depending on Megatron‑LM core libraries, keeping the two projects decoupled so Megatron upgrades do not affect user experience.

Key Technologies

1. Model Weight Conversion

The toolkit converts HuggingFace checkpoints to Megatron format (and vice versa) by mapping operator namespaces and merging the separate Q, K, and V projections into one fused weight. It also handles LLaMA‑2 weights, including the token‑embedding and LM‑head adjustments, and provides scripts for batch conversion. On the Megatron side, a LLaMA‑style architecture is selected with switches such as:

--swiglu \
--use-rotary-position-embeddings \
--no-position-embedding \
--untie-embeddings-and-output-weights \
--disable-bias-linear

For Baichuan‑style models that use ALiBi attention biases, replace the rotary‑embedding switches with --use-alibi-mask and set the position‑embedding type to none:

--swiglu \
--use-alibi-mask \
--position-embedding-type none \
--untie-embeddings-and-output-weights \
--disable-bias-linear
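
The Q‑K‑V merging described in this section can be illustrated with a minimal, dependency‑free sketch. It assumes a per‑head interleaved layout for the fused matrix (the Q, K, and V rows of head 0, then those of head 1, and so on) and plain multi‑head attention; the layout the toolkit actually writes depends on the Megatron‑LM version, and the entries in NAME_MAP are illustrative rather than the toolkit's real mapping table.

```python
# Sketch only: weights are plain lists of rows; real converters use tensors.
# NAME_MAP entries and the per-head interleaved layout are assumptions.

NAME_MAP = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
    "lm_head.weight": "output_layer.weight",
}


def rename(hf_name):
    """Map a HuggingFace-style parameter name to a Megatron-style one."""
    return NAME_MAP.get(hf_name, hf_name)


def merge_qkv(q, k, v, num_heads):
    """Fuse separate Q, K, V projection rows into one matrix, head by head."""
    if not (len(q) == len(k) == len(v)):
        raise ValueError("Q, K and V must have the same number of rows")
    if len(q) % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    head_dim = len(q) // num_heads
    fused = []
    for h in range(num_heads):
        rows = slice(h * head_dim, (h + 1) * head_dim)
        fused.extend(q[rows] + k[rows] + v[rows])
    return fused


def split_qkv(fused, num_heads):
    """Inverse of merge_qkv, as needed when converting back to HuggingFace."""
    head_dim = len(fused) // (3 * num_heads)
    q, k, v = [], [], []
    for h in range(num_heads):
        base = 3 * h * head_dim
        q.extend(fused[base:base + head_dim])
        k.extend(fused[base + head_dim:base + 2 * head_dim])
        v.extend(fused[base + 2 * head_dim:base + 3 * head_dim])
    return q, k, v


# Two heads with head_dim 1; each row is a one-element list for readability.
q, k, v = [["q0"], ["q1"]], [["k0"], ["k1"]], [["v0"], ["v1"]]
fused = merge_qkv(q, k, v, num_heads=2)
print(fused)
```

Because split_qkv exactly undoes merge_qkv, a round trip through both functions returns the original rows, which is the property a lossless two‑way converter needs.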

2. FP8 Training with Transformer Engine (TE)

TE provides FP8 mixed‑precision training on NVIDIA Hopper GPUs and integrates FlashAttention along with fused operators such as LayerNormLinear and LayerNormMLP. Converting HuggingFace weights to TE format follows the same mapping logic as the Megatron conversion. TE modules additionally carry an _extra_state entry that stores FP8 scaling information, so loading the checkpoint with strict=False avoids key‑mismatch errors on that entry.

if [ "$PR" = "fp8" ]; then
    pr_options=" \
        --bf16 \
        --fp8-hybrid \
        --fp8-amax-compute-algo max \
        --fp8-amax-history-len 1024 \
        --transformer-impl transformer_engine"
fi
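
To make the two amax flags above more concrete, here is a simplified sketch of delayed scaling: each step records a tensor's absolute maximum (amax), and the scaling factor is derived from the largest value in a sliding history window. The DelayedScaler class is hypothetical and omits details of the real Transformer Engine recipe, such as the safety margin; only the two format limits are fixed facts (448 for E4M3 and 57344 for E5M2).

```python
from collections import deque

# Largest finite values of the two FP8 formats.
E4M3_MAX = 448.0     # hybrid mode: forward activations and weights
E5M2_MAX = 57344.0   # hybrid mode: gradients in the backward pass


class DelayedScaler:
    """Track recent amax values and derive a scaling factor from them."""

    def __init__(self, fp8_max, history_len=1024):
        self.fp8_max = fp8_max
        self.history = deque(maxlen=history_len)  # oldest entries drop out

    def update(self, tensor_values):
        """Record the absolute maximum observed in this step."""
        self.history.append(max(abs(x) for x in tensor_values))

    def scale(self):
        """'max' algorithm: use the largest amax in the window."""
        amax = max(self.history, default=0.0)
        return self.fp8_max / amax if amax > 0 else 1.0


fwd = DelayedScaler(E4M3_MAX, history_len=3)
for step_values in ([0.5, -2.0], [1.0, 7.0], [0.25, -0.5]):
    fwd.update(step_values)
print(fwd.scale())  # 448 / 7 = 64.0
```

A longer history makes the scale more robust to occasional outliers leaving the window too early, at the cost of reacting more slowly when the value range shrinks.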

Loss curves for LLaMA‑7B and LLaMA‑2‑70B trained with FP8 match those of the FP16/BF16 runs, indicating that FP8 training does not compromise convergence.
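
The strict=False loading mentioned in this section can be pictured with a pure‑Python stand‑in for what PyTorch's load_state_dict does in non‑strict mode: matching entries are copied, while missing and unexpected keys are reported instead of raising an error. The key names below are hypothetical, and which side holds the _extra_state entry depends on the checkpoint being loaded.

```python
# Pure-Python stand-in for torch.nn.Module.load_state_dict(strict=False).
# Key names are hypothetical; "_extra_state" marks FP8 metadata entries.

def load_non_strict(model_state, checkpoint):
    """Copy matching entries; report, rather than fail on, mismatched keys."""
    missing = [name for name in model_state if name not in checkpoint]
    unexpected = [name for name in checkpoint if name not in model_state]
    for name in model_state:
        if name in checkpoint:
            model_state[name] = checkpoint[name]
    return missing, unexpected


# A TE-style module expects an FP8 state entry that a checkpoint converted
# from HuggingFace does not contain.
model_state = {
    "layers.0.attention.weight": None,
    "layers.0.attention._extra_state": None,
}
checkpoint = {"layers.0.attention.weight": [0.1, 0.2]}

missing, unexpected = load_non_strict(model_state, checkpoint)
print("missing:", missing)        # the FP8 entry is reported, not fatal
print("unexpected:", unexpected)
```

In strict mode the same mismatch would abort loading, which is why the relaxed setting is convenient when FP8 metadata is present on only one side.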

3. Large‑Model Training & Inference

Typical workflow:

Clone the repository and prepare model checkpoints.

Convert HuggingFace weights to Megatron format using provided scripts.

Run pre‑training scripts (e.g., run_pretrain_megatron_llama.sh) with configurable parameters such as batch size, learning rate, sequence length, parallelism, and precision.

Perform supervised fine‑tuning, then offline inference using Megatron‑compatible scripts.
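
When choosing the batch‑size and parallelism parameters for the pre‑training step, it helps to check that they are mutually consistent. The sketch below uses the usual Megatron‑style relations (world size = tensor parallel × pipeline parallel × data parallel; global batch = micro batch × accumulation steps × data parallel). The function names are illustrative and not part of the toolkit.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Number of model replicas that fit on the available GPUs."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP={model_parallel}"
        )
    return world_size // model_parallel


def global_batch_size(micro_batch, grad_accum_steps, dp_size):
    """Samples consumed per optimizer step across all replicas."""
    return micro_batch * grad_accum_steps * dp_size


dp = data_parallel_size(world_size=8, tensor_parallel=2, pipeline_parallel=2)
print(dp)  # 2
print(global_batch_size(micro_batch=1, grad_accum_steps=4, dp_size=dp))  # 8
```

Running such a check before launching a job catches configurations that would otherwise fail only after the cluster has been allocated.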

Example command for model conversion:

cd /mnt/workspace/PAI-Megatron-Patch/toolkits/model_checkpoints_convertor/llama
sh model_convertor.sh \
/root/Megatron-LM-23.04 \
/mnt/workspace/llama2-ckpts/llama2-7b-hf \
/mnt/workspace/llama2-ckpts/llama2-7b-hf-to-megatron-tp1-pp1 \
1 \
1 \
llama-7b \
0 \
false

4. Reinforcement Learning (RLHF)

After supervised fine‑tuning, the toolkit supports reward‑model (RM) training and PPO reinforcement learning using popular open‑source frameworks such as DeepSpeed‑Chat and trlx. Conversion scripts allow Megatron checkpoints to be transformed back to HuggingFace format for RLHF pipelines.

Typical steps:

Clone DeepSpeed‑Chat, copy necessary scripts, install dependencies.

Train reward model on LLaMA‑2 or BLOOM.

Run PPO training with trlx or DeepSpeed‑Chat.
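
For readers new to RLHF, the two training stages above optimize objectives that are easy to state in a few lines. The sketch below shows the pairwise ranking loss commonly used for reward models and the clipped surrogate objective from PPO, written for single scalar values. These are the standard textbook formulations, not code taken from the toolkit, DeepSpeed‑Chat, or trlx, and the function names are illustrative.

```python
import math


def pairwise_reward_loss(chosen_score, rejected_score):
    """-log(sigmoid(chosen - rejected)): small when chosen is ranked higher."""
    diff = chosen_score - rejected_score
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample; PPO maximizes its average."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The reward model is penalized more when it prefers the rejected answer.
print(pairwise_reward_loss(2.0, 0.0) < pairwise_reward_loss(0.0, 2.0))  # True

# A large policy change with positive advantage is capped by the clip range.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

The clipping keeps each policy update close to the policy that generated the samples, which is the main reason PPO tends to be stable enough for fine‑tuning large models.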

Open‑Source Ecosystem and Future Directions

Pai‑Megatron‑Patch contributes to the open‑source community by providing lossless weight conversion, FP8 training on H800 clusters, best‑practice guides for LLM training on PAI‑Lingjun, and reinforcement‑learning workflows. Future work includes expanding LoRA support for Megatron and further enhancements to Transformer Engine integration.
