Reproducing the GSPO Reinforcement Learning Algorithm on Alibaba PAI: A Step‑by‑Step Guide

This article introduces the GSPO (Group Sequence Policy Optimization) reinforcement learning algorithm, explains its advantages over GRPO, and provides a detailed, end‑to‑end tutorial for reproducing GSPO training on Alibaba Cloud's PAI platform using the PAI‑ChatLearn framework.


GSPO Algorithm Introduction

Reinforcement Learning (RL) is a key paradigm for extending language models with deeper reasoning and problem‑solving abilities. Existing RL methods such as GRPO suffer from instability and can cause irreversible model collapse during long training runs. To address this, the Tongyi team proposed GSPO (Group Sequence Policy Optimization), which defines importance ratios at the sequence level and performs clipping, rewarding, and optimization on whole sequences rather than on individual tokens.

Compared with GRPO, GSPO offers three major advantages:

High efficiency: Significantly higher training efficiency, with continued performance gains as compute increases.

Excellent stability: Maintains stable training and fundamentally resolves RL stability issues for Mixture‑of‑Experts (MoE) models.

Infrastructure friendliness: Sequence‑level ratios are more tolerant of precision errors, simplifying RL infrastructure requirements.
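To make the sequence‑level idea concrete, here is a minimal sketch of the length‑normalized importance ratio and PPO‑style clipping that GSPO applies once per sequence. Function names and the clipping range `eps` are illustrative choices, not the framework's actual API; the precise objective is defined in the GSPO paper.

```python
import math

def gspo_sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio:
    s(theta) = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|).
    Takes per-token log-probabilities under the new and old policies;
    computed in log space for numerical stability."""
    assert len(logp_new) == len(logp_old) and len(logp_new) > 0
    # mean per-token log-ratio == (1/|y|) * log of the sequence-level ratio
    mean_log_ratio = sum(a - b for a, b in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)

def gspo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate, applied once per sequence
    (GRPO instead clips a separate ratio at every token)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

For identical policies the ratio is exactly 1.0, and because each token contributes only a 1/|y| share of the log‑ratio, a single noisy token perturbs the sequence ratio far less than it would a per‑token ratio, which is the intuition behind GSPO's stability and precision tolerance.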

PAI‑ChatLearn Reinforcement Learning Framework

PAI‑ChatLearn (https://github.com/alibaba/ChatLearn) is Alibaba Cloud's high‑performance, integrated RL framework, which provides ready support for reproducing the GSPO training process.

Ease of use: Users only need to implement a few functions to train various RL algorithms via a computation‑graph based approach, with flexible resource scheduling for exclusive or shared model usage.

High performance: Supports acceleration techniques such as Sequence Packing, Sequence Parallel, and Group GEMM, greatly improving GPU utilization.

Broad engine support: Compatible with vLLM and SGLang for inference, and with FSDP and Megatron for stable, efficient training.

End‑to‑End GSPO Reproduction on PAI

Step 1: Environment Setup

On PAI‑DLC or PAI‑DSW, start an instance with the following Docker image:

dsw-registry.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312

For faster image pulls in the Shanghai region, use the VPC‑accelerated address:

dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312

Step 2: Data & Model Preparation

Use the MATH‑lighteval dataset as an example. Download the dataset, preprocess it, and obtain the Qwen3‑30B‑A3B model weights.

cd ChatLearn
# download dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# preprocess dataset
python chatlearn/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
# download model weights
modelscope download --model Qwen/Qwen3-30B-A3B --local_dir pretrained_models/Qwen3-30B-A3B

Step 3: Model Conversion

Convert the HuggingFace‑format Qwen3‑30B‑A3B model to MCore format using the provided script.

CHATLEARN_ROOT=$(pwd)
cd ../Pai-Megatron-Patch/toolkits/distributed_checkpoints_convertor
bash scripts/qwen3/run_8xH20.sh \
A3B \
${CHATLEARN_ROOT}/pretrained_models/Qwen3-30B-A3B \
${CHATLEARN_ROOT}/pretrained_models/Qwen3-30B-A3B-to-mcore \
false \
true \
bf16

Step 4: Training

Run the training script to start GSPO reinforcement learning training.

cd ${CHATLEARN_ROOT}
bash scripts/train_mcore_vllm_qwen3_30b_gspo.sh

Experimental Results

On the MATH‑lighteval benchmark, GSPO consistently outperforms GRPO, showing faster convergence and higher final performance.

GSPO vs GRPO convergence curves

Conclusion

As large models continue to evolve rapidly, Alibaba Cloud's PAI platform offers full‑lifecycle support for AI development. The team will continue to publish best practices and technical insights on reinforcement learning, model distillation, data preprocessing, and other AI engineering scenarios, and invites practitioners to explore enterprise‑grade AI solutions on PAI.

Tags: reinforcement learning, algorithm implementation, PAI, GSPO, ChatLearn
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
