Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B

This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.

Fun with Large Models

GRPO Core Principle

GRPO samples multiple answers to the same question and treats them as one group. For each answer an advantage score is computed as reward - average_reward_of_group. Answers with a higher advantage are reinforced, steering the policy in the next training iteration; no separate critic or evaluator model is needed.

Example ("farm has 10 chickens, 5 roosters, 3 egg‑laying hens; how many chickens do not lay eggs?"):

(1) 10 - 5 - 3 + 5 = 7   # correct: 2 non-laying hens + 5 roosters; score 5, advantage ≈ 2.7
(2) 10 - 5 - 3 = 2       # near-correct: forgets the roosters; score 2, advantage ≈ -0.3
(3) 5 roosters           # wrong; score 0, advantage ≈ -2.3

Answer (1), with the highest advantage, sets the direction the policy is pushed in the next update.
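The group-relative advantage described above can be sketched in a few lines (TRL's implementation additionally divides by the group's standard deviation to normalise the scale):

```python
def group_advantages(rewards):
    """Advantage of each answer = its reward minus the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# The three sample answers above scored 5, 2 and 0:
advantages = group_advantages([5.0, 2.0, 0.0])
print([round(a, 1) for a in advantages])  # → [2.7, -0.3, -2.3]
```

Note that the advantages of a group always sum to zero: the group competes against its own average, not an external baseline.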

Environment Preparation

Create a Conda environment with Python 3.11, then install modelscope, torch, transformers, trl, and wandb.
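A minimal setup sketch (package versions unpinned; adjust torch to your CUDA stack):

```shell
conda create -n grpo python=3.11 -y
conda activate grpo
pip install modelscope torch transformers trl wandb
```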

Download the Qwen2.5‑0.5B‑Instruct checkpoint from ModelScope:

modelscope download --model Qwen/Qwen2.5-0.5B-Instruct --local_dir ./Qwen2.5-0.5B-Instruct

Load the OpenAI/GSM8K dataset (≈7 473 training and 1 319 test examples) and reshape each entry to contain a prompt field (system prompt + user question) and an answer field.
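One way to do the reshaping, assuming the Hugging Face datasets library (`to_grpo_example` and `extract_hash_answer` are illustrative names; SYSTEM_PROMPT is the format prompt defined below):

```python
def extract_hash_answer(text):
    # GSM8K reference solutions end with "#### <final answer>"
    return text.split("####")[-1].strip() if "####" in text else None

def to_grpo_example(example, system_prompt):
    # Reshape one GSM8K record into the prompt/answer form the trainer expects
    return {
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_hash_answer(example["answer"]),
    }

# With the datasets library this becomes, e.g.:
# dataset = load_dataset("openai/gsm8k", "main")["train"]
# dataset = dataset.map(lambda x: to_grpo_example(x, SYSTEM_PROMPT))
```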

System prompt defining the required XML‑style output:

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """
<reasoning>{reasoning}</reasoning>
<answer>{answer}</answer>
"""
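XML_COT_FORMAT is a plain template string; rendering a demonstration in the required shape is just a .format call:

```python
XML_COT_FORMAT = """
<reasoning>{reasoning}</reasoning>
<answer>{answer}</answer>
"""

demo = XML_COT_FORMAT.format(reasoning="10 - 3 = 7", answer="7")
print(demo)
```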

Reward Functions

Five reward functions are combined:

- correctness_reward_func: rewards answers whose extracted value matches the reference.
- int_reward_func: rewards outputs whose extracted answer is a pure integer.
- strict_format_reward_func: rewards the exact XML format (<reasoning> and <answer> tags on separate lines).
- soft_format_reward_func: rewards a relaxed format where whitespace between tags is allowed.
- count_xml / xmlcount_reward_func: counts XML tag occurrences and penalises malformed structures and trailing text.
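Several of these functions rely on an extract_xml_answer helper that pulls the text between the <answer> tags; a minimal sketch:

```python
def extract_xml_answer(text: str) -> str:
    # Take everything after the last <answer> and before the next </answer>
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

print(extract_xml_answer("<reasoning>\n10 - 3 = 7\n</reasoning>\n<answer>\n7\n</answer>"))  # → 7
```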

Example implementations:

# correctness reward
def correctness_reward_func(prompts, completions, answer, **kwargs):
    responses = [c[0]["content"] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]

# integer reward
def int_reward_func(completions, **kwargs):
    responses = [c[0]["content"] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]

# strict XML format reward
import re

def strict_format_reward_func(completions, **kwargs):
    # Require each tag on its own line, with the bodies in between
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [c[0]["content"] for c in completions]
    # re.DOTALL lets .*? span multi-line reasoning
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if m else 0.0 for m in matches]

# soft XML format reward
def soft_format_reward_func(completions, **kwargs):
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [c[0]["content"] for c in completions]
    # re.DOTALL lets .*? span newlines inside the tags
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if m else 0.0 for m in matches]

# XML count reward
def count_xml(text):
    score = 0.0
    if text.count("<reasoning>\n") == 1:
        score += 0.125
    if text.count("\n</reasoning>\n") == 1:
        score += 0.125
    if text.count("\n<answer>\n") == 1:
        score += 0.125
        score -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        score += 0.125
        score -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return score

def xmlcount_reward_func(completions, **kwargs):
    contents = [c[0]["content"] for c in completions]
    return [count_xml(c) for c in contents]
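As a sanity check of the scoring (count_xml repeated here so the snippet is self-contained): a completion in exactly the required shape earns the full 0.5, while each character of trailing text after </answer> costs 0.001.

```python
def count_xml(text):
    score = 0.0
    if text.count("<reasoning>\n") == 1:
        score += 0.125
    if text.count("\n</reasoning>\n") == 1:
        score += 0.125
    if text.count("\n<answer>\n") == 1:
        score += 0.125
        score -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        score += 0.125
        score -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return score

good = "<reasoning>\n10 - 3 = 7\n</reasoning>\n<answer>\n7\n</answer>\n"
print(count_xml(good))  # → 0.5
```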

Training Configuration

GRPO training is orchestrated with GRPOConfig from the TRL library:

training_args = GRPOConfig(
    output_dir="outputs/Qwen-0.5B-GRPO",
    run_name="Qwen-0.5B-GRPO-gsm8k",
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=False,
    report_to="wandb"
)

Model and tokenizer are loaded, the trainer is instantiated with the reward functions, and training is started:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func
    ],
    args=training_args,
    train_dataset=dataset
)

trainer.train()
trainer.save_model(training_args.output_dir)

Pre‑training Baseline Test

Before GRPO, the model was queried with a simple arithmetic problem:

prompt = "Joy can read 8 pages of a book in 20 minutes. How many hours will it take her to read 120 pages?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

The baseline model returned a plain answer without the required reasoning block or XML tags.

Post‑training Evaluation

After GRPO training, the same prompt yields a structured response that includes a reasoning segment and an answer wrapped in the defined XML tags.

Quantitative evaluation on the GSM8K test set shows accuracy improving from 22.4 % (baseline) to 48.6 % after GRPO training, indicating both format compliance and genuine reasoning improvement.
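The accuracy numbers come from a simple scoring loop over the test split; a sketch, assuming the model's raw generations have been collected into a list alongside the gold answers (`gsm8k_accuracy` is an illustrative name):

```python
def extract_xml_answer(text):
    # Same extraction used by the reward functions
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def gsm8k_accuracy(generations, references):
    # Fraction of generations whose extracted answer matches the reference
    correct = sum(
        extract_xml_answer(g) == ref for g, ref in zip(generations, references)
    )
    return correct / len(references)

print(gsm8k_accuracy(
    ["<answer>\n5\n</answer>", "<answer>\n9\n</answer>"],
    ["5", "7"],
))  # → 0.5
```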

Reproducibility

The entire pipeline—GRPO theory, dataset preparation, reward‑function engineering, training configuration, and evaluation—can be reproduced using the code snippets above. The approach is applicable to other small LLMs and tasks that require step‑by‑step reasoning and structured output.

Tags: reinforcement learning, reward function, GRPO, Qwen2.5, TRL, GSM8K
Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
