Can Small 7B Models Beat the State‑of‑the‑Art? A Critical Analysis of R1‑Zero Training and Unbiased GRPO
This article critically examines R1‑Zero‑style training by analysing foundation models and reinforcement learning, uncovering pre‑training and optimisation biases, proposing the unbiased Dr. GRPO method, and demonstrating a minimalist 7B‑model recipe that achieves new state‑of‑the‑art performance on AIME 2024.
Abstract
We dissect the two pillars of R1‑Zero‑style training: foundation models and reinforcement‑learning (RL) optimisation. Experiments on a 500‑question sample from the MATH benchmark reveal that Qwen‑2.5 models answer correctly without any prompt template, indicating a strong pre‑training bias toward concatenated question‑answer pairs. DeepSeek‑V3‑Base already exhibits self‑reflection keywords (e.g., “Aha”, “wait”, “verify the problem”) before any RL fine‑tuning, suggesting that “Aha moments” are partly inherited from pre‑training rather than emerging purely from RL. We identify two systematic biases in the GRPO optimiser: one that inflates response length for incorrect answers, and one that over‑weights questions whose sampled rewards have low variance. By removing the problematic normalisation terms and replacing the per‑response masked mean with a token sum normalised by a constant generation budget, we obtain an unbiased optimiser, Dr. GRPO, which improves token efficiency while preserving reasoning performance. Using Dr. GRPO with the Qwen‑Math template on Qwen‑2.5‑Math‑7B, we achieve 43.3% accuracy on AIME 2024 after 27 hours on eight A100 GPUs, a new state‑of‑the‑art result for a 7B model.
1 Introduction
The study isolates two components of the R1‑Zero pipeline:
Foundation models: we evaluate the Qwen‑2.5 series, Llama‑3.1, and DeepSeek models on MATH questions.
RL optimiser: we analyse the GRPO algorithm, expose its length and difficulty biases, and propose a corrected version.
Our minimalist pipeline combines an unbiased Dr. GRPO optimiser with the Qwen‑Math prompt template and fine‑tunes Qwen‑2.5‑Math‑7B on MATH difficulty levels 3‑5. The resulting model reaches state‑of‑the‑art performance on AIME 2024 within a modest compute budget.
2 Foundation Model Analysis
2.1 Prompt‑Template Trainability
We compare three prompting strategies on six models: (1) the R1 template (Guo et al., 2025), (2) the Qwen‑Math template (Zeng et al., 2025), and (3) no template. For each model we first query without any template and use GPT‑4o to classify each response as an “answer” versus a “sentence completion”, recording the answer rate; we then apply the two templates and compute pass@8 accuracy.
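For reference, pass@k is typically computed with the unbiased estimator of Chen et al. (2021) from n ≥ k samples per question. The sketch below is our own illustration under that assumption, not the paper's evaluation code, and the per‑question correct counts are hypothetical:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct generation
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-question correct counts out of n = 8 samples each:
n_correct = [3, 0, 8, 5]
print(sum(pass_at_k(8, c, 8) for c in n_correct) / len(n_correct))  # 0.75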
2.2 Qwen‑2.5 Models: Dropping Templates Unlocks Peak Performance
All Qwen‑2.5 variants achieve a 100% answer rate with no prompt template, confirming that the question‑to‑answer mapping is already internalised during pre‑training. Removing the template yields an average boost of roughly 60% in pass@8 over template‑based prompting (Table 1), suggesting that Qwen‑2.5‑Math was pre‑trained on concatenated question‑answer pairs. The sketch below illustrates how the prompting conditions differ.
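For concreteness, a minimal sketch of how the two conditions can be constructed; the chat markup follows Qwen's standard ChatML format, and the system prompt is a paraphrase of the published Qwen‑Math template, not an exact reproduction:

QWEN_MATH_SYSTEM = ("Please reason step by step, and put your final answer "
                    "within \\boxed{}.")  # paraphrased Qwen-Math system prompt

def build_prompt(question: str, mode: str) -> str:
    if mode == "none":       # raw question; relies on the pre-trained QA bias
        return question
    if mode == "qwen_math":  # chat-style template in Qwen's ChatML markup
        return (f"<|im_start|>system\n{QWEN_MATH_SYSTEM}<|im_end|>\n"
                f"<|im_start|>user\n{question}<|im_end|>\n"
                f"<|im_start|>assistant\n")
    raise ValueError(mode)   # the R1 template is analogous (omitted here)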
2.3 “Aha Moments” in DeepSeek‑V3‑Base
DeepSeek‑V3‑Base already generates self‑reflection keywords such as “Aha”, “wait”, and “verify the problem” before any RL fine‑tuning, showing that reflective behaviour is present in the base model rather than created by RL. Moreover, the frequency of these keywords does not correlate with higher MATH accuracy.
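A minimal sketch of how such a keyword‑frequency measurement can be run; the keyword list is assumed from the examples quoted above, and the paper's exact matching rules may differ:

import re

# Assumed keyword list, taken from the examples quoted above
REFLECTION_KEYWORDS = ("aha", "wait", "verify the problem")

def reflection_count(response: str) -> int:
    """Count case-insensitive occurrences of self-reflection keywords."""
    text = response.lower()
    return sum(len(re.findall(re.escape(kw), text)) for kw in REFLECTION_KEYWORDS)

# Correlating reflection_count with per-question correctness is then a simple
# statistic; the finding above is that no reliable relationship appears.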
3 Reinforcement Learning Analysis
LLM generation is formalised as a token‑level Markov Decision Process (MDP). The state at step t consists of the input question concatenated with the generated output so far, and the policy selects the next token from the vocabulary.
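In symbols (a standard formalisation consistent with the description above; the binary terminal reward is our assumption, matching the verifier‑based reward used in Section 3.2):

\[
s_t = q \oplus o_{<t}, \qquad a_t = o_t \in \mathcal{V}, \qquad s_{t+1} = s_t \oplus a_t,
\]
\[
o \sim \pi_\theta(\cdot \mid q), \qquad R(q, o) \in \{0, 1\} \ \text{awarded only when generation terminates.}
\]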
3.1 Biases Introduced by GRPO
GRPO normalises advantages by response length and by per‑question reward statistics, which creates two biases (both terms are written out below):
Response‑length bias: dividing each response's summed token loss by its own length shrinks the penalty on long incorrect answers while amplifying updates from short correct ones, so incorrect responses drift longer during training.
Problem‑level difficulty bias: dividing the advantage by the standard deviation of each question's sampled rewards up‑weights questions with low reward variance, i.e., those that are nearly always solved or nearly always failed.
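Concretely, for a question q with G sampled responses o_1, …, o_G and rewards R_1, …, R_G, GRPO's objective (clipping omitted for readability) is

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}\,\hat{A}_i,
\qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}.
\]

The 1/|o_i| factor is the source of the length bias, and the std divisor the source of the difficulty bias; Dr. GRPO, introduced next, simply deletes these two terms.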
3.2 Dr. GRPO: Unbiased Optimiser
We remove both normalisation terms and replace the per‑response masked mean with a token sum divided by a constant generation budget, which recovers the standard PPO policy‑gradient objective with an unbiased baseline. The resulting algorithm is implemented in the Oat RL framework and evaluated on Qwen‑2.5‑1.5B with the R1 template and a Math‑Verify‑based rule reward.
# Pseudo-code for Dr. GRPO (simplified, PyTorch-style; PPO clipping omitted).
# Two fixes versus GRPO: (1) the advantage is not divided by the per-question
# reward std; (2) token losses are summed and divided by a constant budget,
# not by each response's own length.
for batch in dataloader:
    # Sample G responses per question and score them with the rule-based verifier
    responses, log_probs = policy.sample(batch.questions, num_samples=G)
    rewards = compute_rewards(responses)                     # shape [B, G]
    advantage = rewards - rewards.mean(dim=1, keepdim=True)  # unbiased baseline, no std division
    per_token_loss = -log_probs * advantage.unsqueeze(-1)    # [B, G, T], padding masked
    loss = per_token_loss.sum(dim=-1).mean() / MAX_TOKENS    # constant token budget
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
3.3 Template‑Problem‑Set Interaction
With the matched Qwen‑Math template, RL training on even the simple GSM‑8K problem set nearly doubles final accuracy on harder benchmarks. Conversely, when model and template are mismatched (e.g., applying the R1 template to Qwen‑2.5‑Math‑1.5B), the template first destroys the model's pre‑trained reasoning behaviour and RL must rebuild it, which makes the coverage of the training problem set matter far more.
3.4 Domain‑Specific Pre‑Training Raises the RL Ceiling
We fine‑tune Llama‑3.2‑3B on a math‑focused dataset (FineMath) and on a concatenated QA dataset (NuminaMath‑1.5). After RL fine‑tuning with Dr. GRPO, both models show modest but consistent gains over the vanilla Llama baseline, confirming that domain‑specific pre‑training lifts the performance ceiling of RL‑based methods.
4 Conclusion
Our analysis shows that (1) pre‑training bias in Qwen‑2.5 models can render prompt templates unnecessary, and (2) GRPO's normalisation introduces length and difficulty biases that degrade token efficiency. The proposed Dr. GRPO optimiser removes these biases, yielding shorter, more efficient responses without sacrificing reasoning ability. A minimalist R1‑Zero pipeline (Dr. GRPO + Qwen‑Math template + Qwen‑2.5‑Math‑7B) achieves 43.3% accuracy on AIME 2024 in 27 hours on eight A100 GPUs, demonstrating that smaller models can reach state‑of‑the‑art performance when optimisation bias is eliminated.