How to Fine‑Tune Qwen2 with Direct Preference Optimization on Alibaba Cloud PAI

This guide explains the Direct Preference Optimization (DPO) algorithm for aligning large language models, demonstrates its advantages over RLHF, and provides a step‑by‑step tutorial on using Alibaba Cloud’s PAI‑QuickStart to fine‑tune the open‑source Qwen2 series, including data preparation, hyper‑parameter settings, training, deployment, and API usage.

Direct Preference Optimization (DPO) is a widely used algorithm for aligning large language models that merges reward-model training and reinforcement learning into a single step, enabling faster and more stable training. Alibaba Cloud's AI platform PAI offers full technical support for DPO, allowing developers and enterprise customers to perform DPO fine-tuning easily via PAI-QuickStart.

DPO Algorithm Overview

The DPO algorithm was introduced by Rafailov et al. in "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (2023). Unlike RLHF, DPO requires neither a separate reward model nor a reinforcement-learning loop: it fine-tunes the model directly on preference data (chosen vs. rejected outputs), reducing preference alignment to a supervised-fine-tuning-style objective.

Intuitively, DPO widens the gap between the implicit rewards of chosen and rejected responses, so the model learns human preferences directly from comparison data.
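Concretely, the DPO objective from the paper optimizes the trainable policy $\pi_\theta$ against a frozen reference policy $\pi_{\mathrm{ref}}$ over preference triples $(x, y_w, y_l)$, where $y_w$ is the chosen and $y_l$ the rejected response:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Here $\sigma$ is the logistic function and $\beta$ (exposed below as the dpo_beta hyper-parameter) controls how far the policy may drift from the reference model.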

Application Scenarios

DPO can be applied wherever RLHF is used, such as improving customer service bots by incorporating expert feedback, or enhancing safety by preferring non‑harmful, positive responses, thereby increasing model reliability.

For example, when given a prompt that invites harm, a safety-aligned model produces refusals such as:

"I can't hurt you. Please seek help from a trusted person or professional if you need it."

"I'm sorry, but as an AI language model, I am not capable of causing physical harm or inflicting pain. My purpose is to assist and provide information in a safe and helpful manner. If you are experiencing emotional distress or considering self-harm, I urge you to seek support from a trusted friend, family member, or a professional such as a therapist or counselor. You can also contact hotlines or helplines specifically designed to provide assistance and guidance in difficult times."

PAI‑QuickStart Introduction

PAI‑QuickStart is a component of Alibaba Cloud's AI platform that integrates high‑quality pretrained models from the global AI community, covering large language models, text‑to‑image, speech recognition, and more. It enables zero‑code or SDK‑based end‑to‑end workflows from training to deployment.

Runtime Requirements

Supported regions: Beijing, Shanghai, Shenzhen, Hangzhou, Ulanqab, etc.

Resource configuration:

Qwen2‑0.5B/1.5B/7B models: V100/P100/T4 (≥16 GB VRAM) required.

Qwen2‑72B models: A100 (≥80 GB VRAM) required.

Model DPO Fine‑Tuning with PAI‑QuickStart

PAI provides a DPO‑enabled recipe for Qwen2‑7B‑Instruct. Training data should be in JSON format, each entry containing prompt, chosen, and rejected fields.

[
  {
    "prompt": "That guy stole one of my tools, so I broke into his shed the other night to get it back.",
    "chosen": "You shouldn't have done that. Now you've broken the law, and your friend might get in trouble.",
    "rejected": "That's understandable. I'm sure your tool will be returned to you soon."
  },
  {
    "prompt": "Can I use someone else’s picture for a social media page?",
    "chosen": "No, you cannot use someone else's picture for a social media page without their explicit permission. You should only use content that you are allowed to use and are legally authorized to share.",
    "rejected": "Yes, you may be able to use someone else's picture for a social media page with permission. However, it is generally advised to create original content or ask permission from the copyright owner."
  }
]

During training, the algorithm raises the implicit reward of chosen outputs and lowers it for rejected outputs, giving fine-grained control over model preferences.
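For intuition, here is a minimal, framework-agnostic sketch of the DPO loss in PyTorch. It is illustrative only and not part of the PAI training recipe; it assumes you already have summed per-sequence log-probabilities from the policy and the frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss over summed per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding log pi(y | x)
    summed over the response tokens; beta matches the dpo_beta
    hyper-parameter described below.
    """
    # Implicit rewards: scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimizing this loss widens the chosen-vs-rejected reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of two pairs.
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -14.0]),
    torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -13.5]),
)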

Key Hyper‑Parameters (example)

training_method: set to dpo for DPO alignment.

learning_rate: learning rate, default 5e-5.

num_train_epochs: number of training epochs, default 1.

per_device_train_batch_size: per-device batch size, default 1.

seq_length: maximum sequence length, default 128.

lora_dim and lora_alpha: enable LoRA/QLoRA lightweight training when set greater than 0.

dpo_beta: the DPO β parameter controlling divergence from the reference model, default 0.1.

load_in_4bit / load_in_8bit: load the model in 4-bit or 8-bit quantization.

gradient_accumulation_steps: number of gradient accumulation steps, default 8.

apply_chat_template: whether to apply the model's default chat template to the training data.
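For orientation, these settings are passed to the training job as a dictionary. Below is a minimal sketch assuming the key names match the list above; note that the SDK example in the next section uses training_strategy rather than training_method, and the lora_dim value is illustrative, so check the recipe's documented parameters before use.

hyperparameters = {
    "training_strategy": "dpo",   # as used in the SDK example below
    "learning_rate": 5e-5,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "seq_length": 128,
    "lora_dim": 32,               # illustrative value; >0 enables LoRA
    "dpo_beta": 0.1,
    "gradient_accumulation_steps": 8,
}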

Training Execution with Python SDK

from pai.session import get_default_session
from pai.common.utils import random_str
from pai.model import ModelTrainingRecipe

# Use the default PAI session (reads credentials and region from local config).
sess = get_default_session()

# Build a LoRA training recipe for Qwen2-0.5B-Instruct with DPO alignment.
training_recipe = ModelTrainingRecipe(
    model_name="qwen2-0.5b-instruct",
    model_provider="pai",
    method="LoRA_LLM",
    hyperparameters={
        "training_strategy": "dpo",
    },
)

# Preference data (prompt/chosen/rejected) sampled from the SafeRLHF dataset.
train_data_uri = f"oss://pai-quickstart-{sess.region_id}/huggingface/datasets/safe_rlhf/sampled_train.json"

# Submit the training job.
training_recipe.train(inputs={"train": train_data_uri})

# Deploy the fine-tuned model to PAI-EAS as an online service.
predictor = training_recipe.deploy(service_name=f"qwen2_example_{random_str(6)}")

# Call the service through its OpenAI-compatible API.
openai = predictor.openai()
resp = openai.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the meaning of life?"},
    ],
)
print(resp.choices[0].message.content)

# Clean up: delete the deployed service when finished.
predictor.delete_service()

Model Deployment and Invocation

After training, the model can be deployed with a single click. Users provide a service name and a resource configuration, and the model is deployed to PAI-EAS for inference. The deployed service supports real-time interaction via the ChatLLM WebUI as well as OpenAI-compatible API calls.
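Outside the SDK, the same service can be reached with the standard openai Python client. The endpoint URL and token below are placeholders you would copy from the EAS service details page; this sketch assumes the service exposes an OpenAI-compatible route.

from openai import OpenAI

# Placeholders: copy the actual endpoint and token from the PAI-EAS console.
client = OpenAI(
    base_url="<EAS_SERVICE_ENDPOINT>/v1",
    api_key="<EAS_SERVICE_TOKEN>",
)

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Give me a safe travel tip."}],
)
print(resp.choices[0].message.content)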

Conclusion

This article introduced the Direct Preference Optimization (DPO) algorithm and its application to large language model alignment, and showed how to quickly perform DPO fine-tuning of the open-source Qwen2 series using Alibaba Cloud's PAI-QuickStart. By collapsing reward modeling and reinforcement learning into a single supervised step, DPO delivers efficient, stable alignment training for developers and enterprises.

Related Resources

Qwen2 introduction: https://qwenlm.github.io/zh/blog/qwen2/

PAI QuickStart guide: https://help.aliyun.com/zh/pai/user-guide/quick-start-overview

PAI Python SDK GitHub: https://github.com/aliyun/pai-python-sdk

DPO algorithm GitHub: https://github.com/eric-mitchell/direct-preference-optimization

DPO algorithm paper: https://arxiv.org/abs/2305.18290

SafeRLHF dataset: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF
