Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B
This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.
GRPO Core Principle
GRPO treats multiple sampled answers to the same question as a single group. For each answer an advantage score is computed as reward - average_reward_of_group. Answers with a higher advantage are reinforced: the policy update raises their probability, which sets the direction for the next training iteration.
Example ("farm has 10 chickens, 5 roosters, 3 egg‑laying hens; how many chickens do not lay eggs?"):
(1) 10-5-3+5 = 7 # correct reasoning, score 5, advantage 2.7
(2) 10-5-3 = 2 # near‑correct, score 2, advantage -0.3
(3) 5 roosters # wrong, score 0, advantage -2.3The highest‑advantage answer (1) becomes the evolution direction.
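A minimal sketch of this group-relative advantage computation (illustrative only; TRL's GRPO implementation additionally normalises each advantage by the group's standard deviation):
# scores assigned to the three sampled answers above
rewards = [5.0, 2.0, 0.0]
mean_reward = sum(rewards) / len(rewards)
# advantage = reward - group mean; positive values are reinforced
advantages = [r - mean_reward for r in rewards]
print([round(a, 1) for a in advantages])   # [2.7, -0.3, -2.3]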
Environment Preparation
Install modelscope and create a Conda environment with Python 3.11. Install torch, transformers, trl, and wandb.
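For example (a sketch; the article does not pin package versions, so adjust for your CUDA setup):
conda create -n grpo python=3.11 -y
conda activate grpo
pip install modelscope torch transformers trl wandb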
Download the Qwen2.5‑0.5B‑Instruct checkpoint from ModelScope (e.g.,
modelscope download --model Qwen/Qwen2.5-0.5B-Instruct --local_dir ./Qwen2.5-0.5B-Instruct).
Load the OpenAI/GSM8K dataset (≈7 473 training and 1 319 test examples) and reshape each entry to contain a prompt field (system prompt + user question) and an answer field.
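A minimal sketch of this reshaping, assuming the Hugging Face datasets library and the SYSTEM_PROMPT defined just below (the field names follow the prompt/answer layout described above):
from datasets import load_dataset

def extract_hash_answer(text):
    # GSM8K stores the ground-truth answer after a "####" marker
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def get_gsm8k_questions(split="train"):
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": extract_hash_answer(x["answer"]),
    })

dataset = get_gsm8k_questions()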
System prompt defining the required XML‑style output:
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """
<reasoning>{reasoning}</reasoning>
<answer>{answer}</answer>
"""Reward Functions
Five reward functions are combined during training:
- correctness_reward_func: rewards correct final answers.
- int_reward_func: rewards outputs whose extracted answer is a pure integer.
- strict_format_reward_func: rewards the exact XML format, with <reasoning> and <answer> tags on separate lines.
- soft_format_reward_func: rewards a relaxed format where whitespace between tags is allowed.
- count_xml / xmlcount_reward_func: counts XML tag occurrences and penalises malformed structures.
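All of these rely on an extract_xml_answer helper that is not shown in the original listing; a minimal sketch consistent with the XML format defined above:
def extract_xml_answer(text):
    # keep only what sits between the <answer> ... </answer> tags
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()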
Example implementations:
# correctness reward
def correctness_reward_func(prompts, completions, answer, **kwargs):
    responses = [c[0]["content"] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]
# integer reward
def int_reward_func(completions, **kwargs):
    responses = [c[0]["content"] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]
# strict XML format reward
import re
def strict_format_reward_func(completions, **kwargs):
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>$"
    responses = [c[0]["content"] for c in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if m else 0.0 for m in matches]
# soft XML format reward
def soft_format_reward_func(completions, **kwargs):
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [c[0]["content"] for c in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if m else 0.0 for m in matches]
# XML count reward: partial credit for each well-formed tag, small penalty for trailing text
def count_xml(text):
    score = 0.0
    if text.count("<reasoning>\n") == 1:
        score += 0.125
    if text.count("\n</reasoning>\n") == 1:
        score += 0.125
    if text.count("\n<answer>\n") == 1:
        score += 0.125
        score -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        score += 0.125
        score -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return score

def xmlcount_reward_func(completions, **kwargs):
    contents = [c[0]["content"] for c in completions]
    return [count_xml(c) for c in contents]
Training Configuration
GRPO training is orchestrated with GRPOConfig from the TRL library:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

model_name = "./Qwen2.5-0.5B-Instruct"    # local checkpoint downloaded earlier
output_dir = "outputs/Qwen-0.5B-GRPO"

training_args = GRPOConfig(
    output_dir="outputs/Qwen-0.5B-GRPO",
    run_name="Qwen-0.5B-GRPO-gsm8k",
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=False,
    report_to="wandb"
)
Model and tokenizer are loaded, the trainer is instantiated with the reward functions, and training is started:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func
    ],
    args=training_args,
    train_dataset=dataset
)
trainer.train()
trainer.save_model(output_dir)
Pre‑training Baseline Test
Before GRPO, the model was queried with a simple arithmetic problem:
prompt = "Joy can read 8 pages of a book in 20 minutes. How many hours will it take her to read 120 pages?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
The baseline model returned a plain answer without the required reasoning block or XML tags.
Post‑training Evaluation
After GRPO training, the same prompt yields a structured response that includes a reasoning segment and an answer wrapped in the defined XML tags.
Quantitative evaluation on the GSM8K test set shows accuracy improving from 22.4 % (baseline) to 48.6 % after GRPO training, indicating both format compliance and genuine reasoning improvement.
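A minimal accuracy-evaluation sketch over the GSM8K test split (illustrative only; it reuses the helpers defined above and assumes greedy decoding, which the article does not specify):
test_data = get_gsm8k_questions(split="test")

correct = 0
for example in test_data:
    text = tokenizer.apply_chat_template(
        example["prompt"], tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    # decode only the newly generated tokens, then compare the extracted answer
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    if extract_xml_answer(completion) == example["answer"]:
        correct += 1

print(f"accuracy: {correct / len(test_data):.1%}")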
Reproducibility
The entire pipeline—GRPO theory, dataset preparation, reward‑function engineering, training configuration, and evaluation—can be reproduced using the code snippets above. The approach is applicable to other small LLMs and tasks that require step‑by‑step reasoning and structured output.
Fun with Large Models
A Master's graduate of Beijing Institute of Technology with four papers in top journals; previously a developer at ByteDance and Alibaba, now researching large models at a major state-owned enterprise. Committed to sharing concise, practical experience in AI large-model development, in the belief that large models will become as essential as the PC. Let's start experimenting now!