Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO

This article critically examines R1‑Zero‑style training by analyzing foundation models and reinforcement learning, uncovering pre‑training and optimization biases, proposing an unbiased Dr. GRPO method, and demonstrating a minimalist 7B‑model recipe that achieves new state‑of‑the‑art performance on AIME 2024.

Baobao Algorithm Notes

Abstract

We dissect the two pillars of R1‑Zero‑style training: foundation models and reinforcement‑learning (RL) optimisation. Experiments on a 500‑question sample from the MATH benchmark reveal that Qwen‑2.5 models answer correctly without any prompt template, indicating a strong pre‑training bias toward concatenated question–answer pairs. DeepSeek‑V3‑Base exhibits self‑reflection tokens (e.g., “Aha”, “wait”, “verify the problem”), confirming that “Aha moments” emerge during pure RL fine‑tuning. We identify two systematic biases in the GRPO optimiser that inflate response length for incorrect answers and overweight easy problems. By removing the problematic normalisation terms and replacing the masked‑mean length normalisation with a constant token budget, we obtain an unbiased optimiser, Dr. GRPO, which improves token efficiency while preserving reasoning performance. Using Dr. GRPO with the Qwen‑Math template on Qwen‑2.5‑Math‑7B, we achieve 43.3% accuracy on AIME 2024 after 27 hours on eight A100 GPUs, establishing a new state‑of‑the‑art result for a 7B model.

1 Introduction

The study isolates two components of the R1‑Zero pipeline:

Foundation models: we evaluate Qwen‑2.5 series, Llama‑3.1, and DeepSeek models on MATH questions.

RL optimiser: we analyse the GRPO algorithm, expose its length and difficulty biases, and propose a corrected version.

Our minimalist pipeline combines an unbiased Dr. GRPO optimiser with the Qwen‑Math prompt template and fine‑tunes Qwen‑2.5‑Math‑7B on MATH difficulty levels 3‑5. The resulting model reaches state‑of‑the‑art performance on AIME 2024 within a modest compute budget.

2 Foundation Model Analysis

2.1 Prompt‑Template Trainability

We compare three prompting strategies on six models: (1) the R1 template (Guo et al., 2025), (2) the Qwen‑Math template (Zeng et al., 2025), and (3) no template. For each model we first query without a template, use GPT‑4o to classify responses as “answer” vs. “sentence completion”, and record the answer rate. We then apply the two templates and compute pass@8 accuracy.
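The two metrics in this setup can be sketched as follows. This is a minimal sketch, not the paper's evaluation code: the judge labels and per‑question correctness flags are assumed to be produced upstream (by the classifier model and an answer checker, respectively).

```python
def answer_rate(judge_labels):
    """Fraction of no-template responses the judge classifies as a direct
    'answer' rather than a sentence completion."""
    return sum(1 for lab in judge_labels if lab == "answer") / len(judge_labels)

def pass_at_8(correct_flags_per_question):
    """pass@8 with 8 samples per question: a question counts as solved
    if any of its sampled responses is correct."""
    solved = sum(1 for flags in correct_flags_per_question if any(flags))
    return solved / len(correct_flags_per_question)
```

For example, a question set where one of two questions has at least one correct sample among its eight yields pass@8 = 0.5.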

Figure: Experiment setup

2.2 Qwen‑2.5 Models: Dropping Templates Unlocks Peak Performance

All Qwen‑2.5 variants achieve a 100% answer rate with no prompt, confirming that the question–answer mapping is already internalised during pre‑training. Removing the template yields an average ~60% boost in pass@8 compared with template‑based prompting (Table 1). This suggests that Qwen‑2.5‑Math was pre‑trained on concatenated QA pairs.

Figure: Performance gain without template

2.3 “Aha Moments” in DeepSeek‑V3‑Base

DeepSeek‑V3‑Base generates self‑reflection keywords such as “Aha”, “wait”, and “verify the problem” during RL fine‑tuning, demonstrating reflective behaviour. However, the frequency of these tokens does not correlate with higher MATH accuracy.
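The keyword analysis can be reproduced with a simple frequency counter. This is a sketch; the keyword list below is illustrative, drawn only from the examples mentioned above.

```python
# Illustrative list of self-reflection markers from the examples above.
REFLECTION_KEYWORDS = ("aha", "wait", "verify the problem")

def count_reflections(response: str) -> int:
    """Count occurrences of self-reflection keywords in a model response."""
    text = response.lower()
    return sum(text.count(kw) for kw in REFLECTION_KEYWORDS)
```

Note that such a counter only measures keyword frequency; as stated above, frequency alone does not track MATH accuracy.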

Figure: Aha-moment examples

3 Reinforcement Learning Analysis

LLM generation is formalised as a token‑level Markov Decision Process (MDP). The state at step t consists of the input question concatenated with the generated output so far, and the policy selects the next token from the vocabulary.
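In this formalisation a transition simply appends the sampled token to the state, and the episode ends when the end-of-sequence token is emitted. A minimal sketch (the `State` container and the EOS convention are illustrative assumptions, not part of the original formulation):

```python
from dataclasses import dataclass

@dataclass
class State:
    tokens: list  # question tokens followed by the tokens generated so far

def step(state: State, action_token: int, eos_id: int):
    """One token-level MDP transition: the action is the next token;
    the episode terminates when EOS is emitted."""
    next_state = State(state.tokens + [action_token])
    done = action_token == eos_id
    return next_state, done
```

The policy is the LLM itself: at each step it defines a distribution over the vocabulary conditioned on the current state.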

3.1 Biases Introduced by GRPO

GRPO normalises the objective by response length and advantages by per‑problem reward statistics, which creates two biases:

Response‑length bias: dividing the per‑response loss by its token count gives larger updates to short correct answers and smaller penalties to long incorrect answers.

Problem‑level difficulty bias: dividing advantages by the per‑problem reward standard deviation up‑weights low‑variance (easy) problems during updates.
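The difficulty bias can be seen numerically: dividing by the per‑group reward standard deviation inflates advantages for low‑variance groups. A minimal sketch contrasting the two advantage computations (the 0/1 group rewards are illustrative correctness scores):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO: mean-centred rewards divided by the group std (difficulty bias)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard all-equal groups
    return [(r - mu) / sigma for r in rewards]

def unbiased_advantages(rewards):
    """Dr. GRPO-style: mean-centred rewards only, no std normalisation."""
    mu = statistics.mean(rewards)
    return [r - mu for r in rewards]
```

For an easy group [1, 1, 1, 0] the failing sample gets a GRPO advantage of about −1.73, versus −1.0 for a mixed group [1, 1, 0, 0]: the easier problem receives the larger update. Without the std division, the mean‑centred advantages stay bounded by the raw reward scale.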

Figure: GRPO bias diagram

3.2 Dr. GRPO: Unbiased Optimiser

We remove both normalisation terms and replace the masked‑mean operation with a constant token‑budget divisor, restoring the original PPO objective with an unbiased baseline. The resulting algorithm is implemented in the Oat RL framework and evaluated on Qwen‑2.5‑1.5B with the R1 template and the Math‑Verify reward function.

# Pseudo-code for Dr. GRPO (simplified)
for batch in problems:
    # Sample a group of responses for each problem
    responses = sample(policy, batch, group_size=G)
    rewards = compute_rewards(responses)
    # Advantage = reward minus the group-mean baseline;
    # no division by reward std (difficulty bias) or token count (length bias)
    advantages = rewards - rewards.mean()
    # PPO-style clipped surrogate, summed over tokens and divided by a
    # constant generation budget instead of the response length
    loss = clipped_surrogate(log_probs, advantages).sum() / MAX_TOKENS
    loss.backward()
    optimizer.step()
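The length‑bias fix comes down to how per‑token losses are aggregated. A minimal sketch of the two aggregation rules (`token_losses` and the `max_tokens` budget are illustrative names, not the paper's code):

```python
def masked_mean(token_losses):
    """GRPO-style aggregation: dividing by response length shrinks the
    per-token penalty of long incorrect responses."""
    return sum(token_losses) / len(token_losses)

def constant_budget(token_losses, max_tokens=512):
    """Dr. GRPO aggregation: a fixed token budget keeps the total penalty
    proportional to response length."""
    return sum(token_losses) / max_tokens
```

Under the masked mean, a 100‑token wrong answer incurs the same loss as a 10‑token one; under the constant budget it incurs ten times more, so generating longer incorrect responses no longer dilutes the penalty.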
Figure: Dr. GRPO loss diagram

3.3 Template‑Problem‑Set Interaction

Using the matched Qwen‑Math template, training on even the simple GSM‑8K problem set nearly doubles final accuracy on harder benchmarks. Conversely, a mismatch between model and template (e.g., applying the R1 template to Qwen‑2.5‑Math‑1.5B) forces RL to first rebuild reasoning that the template had destroyed, in which case the breadth of the problem set matters far more.

Figure: Template–problem-set interaction

3.4 Domain‑Specific Pre‑Training Raises the RL Ceiling

We fine‑tune Llama‑3.2‑3B on a math‑focused dataset (FineMath) and on a concatenated QA dataset (NuminaMath‑1.5). After RL fine‑tuning with Dr. GRPO, both models show modest but consistent gains over the vanilla Llama baseline, confirming that domain‑specific pre‑training lifts the performance ceiling of RL‑based methods.

Figure: Pre-training impact

4 Conclusion

Our analysis shows that (1) pre‑training bias in Qwen‑2.5 models can render prompt templates unnecessary, and (2) GRPO’s normalisation introduces length and difficulty biases that degrade token efficiency. The proposed Dr. GRPO optimiser removes these biases, yielding shorter, more efficient responses without sacrificing reasoning ability. A minimalist R1‑Zero pipeline, Dr. GRPO + Qwen‑Math template + Qwen‑2.5‑Math‑7B, achieves 43.3% accuracy on AIME 2024 in 27 hours on eight A100 GPUs, demonstrating that smaller models can reach state‑of‑the‑art performance when optimisation bias is eliminated.

Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.