Master Post-Training: Fine-Tune LLMs with SFT, DPO, and GRPO on Alibaba PAI
This article explains post‑training concepts, compares SFT, DPO, and GRPO fine‑tuning methods, and provides step‑by‑step guidance for using Alibaba Cloud's PAI platform—including Model Gallery and DSW—to fine‑tune large language models with code examples and practical tips.
Introduction
Post‑training is a crucial stage in deploying large models: it adapts a pretrained model to its target use and can improve performance substantially while needing far less compute and data than pre‑training.
Common Fine‑tuning Methods
Model fine‑tuning adapts a pretrained LLM to specific tasks. Typical approaches include Supervised Fine‑Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO).
SFT
SFT continues training a pretrained model on labeled task data. It can update all parameters (Full Fine‑tuning, FFT) or only a subset of them (Parameter‑Efficient Fine‑tuning, PEFT), using methods such as LoRA or QLoRA.
Full Fine‑tuning updates every parameter and is resource‑intensive.
PEFT updates only a small share of the parameters, which speeds up training and lowers resource consumption. LoRA keeps the pretrained weights frozen and learns a low‑rank update to selected weight matrices, typically the self‑attention projections; QLoRA combines LoRA with 4‑bit/8‑bit quantization of the base model to further reduce memory usage.
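The low‑rank idea behind LoRA can be sketched in a few lines of NumPy. This is a minimal illustration, not PAI's or any library's implementation: the layer sizes, the rank `r`, and the scaling factor `alpha` are assumed values chosen for the example, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the article).
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: update starts at 0


def lora_forward(x):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 4096 512
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the 512 adapter values would be trained instead of all 4096 weights.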
DPO
DPO aligns model outputs with human preferences without a separate reward model or reinforcement‑learning loop. It trains directly on pairs of preferred and rejected responses using a simple classification loss, and it offers stability, strong performance, and lower computational cost compared with RLHF.
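As a rough sketch of that classification loss, the per‑pair DPO objective can be written as the negative log‑sigmoid of a scaled log‑probability margin. The function below assumes the four sequence log‑probabilities have already been computed by the policy and by a frozen reference model; the function name, argument names, and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    pi_*  : log-probabilities under the policy being trained
    ref_* : log-probabilities under the frozen reference model
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy matches the reference, so the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# The policy favors the chosen answer more than the reference does, so the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is why no explicit reward model is needed.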
GRPO
GRPO samples a group of candidate answers for each prompt, scores them, and optimizes the policy on each answer's reward relative to the rest of the group. Using the group as the baseline eliminates the need for a value model. It also incorporates the KL divergence from a reference model directly into the loss, improving efficiency for tasks like mathematical reasoning.
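The group‑based baseline can be illustrated with a short sketch that turns the rewards of one prompt's candidate answers into relative advantages. It covers only this normalization step, not the full GRPO objective (policy ratio, clipping, KL term). Normalizing by the population standard deviation and adding a small `eps` are assumptions of this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 = correct, 0.0 = incorrect (toy rewards).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the others a negative one, so the group itself plays the role a learned value model would otherwise fill.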
Fine‑tuning Algorithm Comparison
The three algorithms differ in difficulty and suitable scenarios; a typical workflow is SFT followed by DPO to combine domain capability with preference alignment.
PAI Model Fine‑tuning Practice
Alibaba Cloud's AI platform PAI offers a full suite of fine‑tuning capabilities across its product lines, two of which are covered here:
PAI‑Model Gallery
Provides zero‑code fine‑tuning, model compression, evaluation, and deployment. Users select a base model, configure training parameters, and submit a task.
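An SFT training set takes the form of a JSON list of instruction/output records, for example:
[
{"instruction":"You are a cardiologist. Please give advice based on the patient's question: I have had high blood pressure for five or six years and I am tired of taking medicine every day. What can cure hypertension for good, and what is its nemesis?","output":"Patients with high blood pressure can eat plenty of fresh fruits and vegetables..."},
{"instruction":"You are a pulmonologist. Please give advice based on the patient's question: How should a wind-cold type cold with a cough producing white phlegm be treated?","output":"For patients with a wind-cold type cold who cough up white phlegm..."}
]
Training hyper‑parameters can be referenced in the documentation.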
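Before uploading a dataset, it can help to check that every record follows the instruction/output shape shown above. The sketch below is one way to do that with the standard library; the function name `validate_sft_records` and the placeholder record are hypothetical, and the checks reflect only the two fields visible in the sample, not an official PAI schema.

```python
import json


def validate_sft_records(text):
    """Parse a JSON list and check each record has non-empty string fields."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of records")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            raise ValueError(f"record {i}: expected a JSON object")
        for key in ("instruction", "output"):
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"record {i}: missing or empty '{key}'")
    return records


# Hypothetical placeholder record, used only to exercise the check.
sample = json.dumps([{"instruction": "Example question", "output": "Example answer"}])
records = validate_sft_records(sample)
```

Running this locally catches empty or malformed records before a training task is submitted.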
PAI‑DSW
Interactive cloud IDE for developers familiar with Python and notebooks. Example workflow for fine‑tuning Qwen2.5‑7B using Pai‑Megatron‑Patch:
cd /mnt/data/yy/qwen25_sft
mkdir qwen-ckpts
cd qwen-ckpts
git clone --recurse-submodules https://github.com/alibaba/Pai-Megatron-Patch.git
...
sh run_mcore_qwen.sh dsw 7B 1 8 1e-5 1e-6 128 128 bf16 1 1 1 true true true true false false 100 /mnt/data/.../mmap_qwen2_sft_datasets_text_document /mnt/data/.../mmap_qwen2_sft_datasets_text_document /mnt/data/.../Qwen2.5-7B-to-mcore 1000 100 /mnt/data/.../output_mcore_qwen2.5_finetune
Training logs and resource monitoring can be viewed in the DSW interface. After training, the fine‑tuned model is stored in OSS and can be deployed as an online service with a single click.
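The run_mcore_qwen.sh command takes 25 positional arguments, which are hard to read in one line. The sketch below pairs each value from the command with a name. The names are an assumption based on the variable names used in the Pai‑Megatron‑Patch Qwen2.5 example script and may differ between versions, so they should be checked against the script in your own checkout. The elided paths are kept exactly as they appear in the article.

```python
# Assumed argument names, in positional order (verify against your script version).
ARG_NAMES = [
    "ENV", "MODEL_SIZE", "BATCH_SIZE", "GLOBAL_BATCH_SIZE", "LR", "MIN_LR",
    "SEQ_LEN", "PAD_LEN", "PR", "TP", "PP", "CP", "SP", "DO", "FL", "SFT",
    "AC", "OPTIMIZER_OFFLOAD", "SAVE_INTERVAL", "DATASET_PATH",
    "VALID_DATASET_PATH", "PRETRAIN_CHECKPOINT_PATH",
    "TRAIN_TOKENS_OR_ITERS", "WARMUP_TOKENS_OR_ITERS", "OUTPUT_BASEPATH",
]

# Values copied from the command in the article; "..." marks paths elided there.
ARG_VALUES = (
    "dsw 7B 1 8 1e-5 1e-6 128 128 bf16 1 1 1 true true true true false false 100 "
    "/mnt/data/.../mmap_qwen2_sft_datasets_text_document "
    "/mnt/data/.../mmap_qwen2_sft_datasets_text_document "
    "/mnt/data/.../Qwen2.5-7B-to-mcore 1000 100 "
    "/mnt/data/.../output_mcore_qwen2.5_finetune"
).split()

params = dict(zip(ARG_NAMES, ARG_VALUES))

for name, value in params.items():
    print(f"{name:28s} {value}")
```

Under this assumed mapping, the command runs a single‑node (`dsw`) bf16 fine‑tune of the 7B model with a micro batch size of 1, a global batch size of 8, and a sequence length of 128.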
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
