Do Recent LLM‑RL Papers Overstate Their Gains? A Critical Review
This article critically examines seven high‑profile reinforcement‑learning papers for large language models, showing that flawed baseline evaluations and unrealistic settings account for most of the claimed gains, leaving genuine improvements modest at best.
Overview
The article Incorrect Baseline Evaluations Call into Question Recent LLM‑RL Claims critically examines seven recent reinforcement‑learning (RL) papers that report large performance gains for large language models (LLMs) on reasoning tasks. Across all seven studies, the critique identifies a common pattern: modest changes to evaluation settings, such as temperature, output formatting, few‑shot prompting, or token‑length limits, inflate the apparent improvement, while the underlying RL methods contribute little or no genuine gain.
Paper 1: Spurious Rewards – Rethinking Training Signals in RLVR
The paper claims that RLVR (reinforcement learning with verifiable rewards) improves mathematical reasoning even when the reward signal is unrelated to, or negatively correlated with, the correct answer. The original authors report up to a +26.5% gain for Qwen2.5, close to the +28.8% gain obtained with true rewards. The critique shows that the baseline score used in the paper is far below Qwen2.5's actual performance; re‑evaluated against the proper baseline, the net improvement shrinks to roughly 5%.
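To make the setup concrete, here is a minimal sketch, not the paper's code, contrasting a verifiable reward with a spurious one whose signal is independent of correctness (function names are illustrative):

```python
import random

def true_reward(model_answer: str, gold_answer: str) -> float:
    """Verifiable reward: 1.0 iff the extracted answer matches the gold label."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def spurious_reward(model_answer: str, gold_answer: str) -> float:
    """Correctness-independent reward: a coin flip, unrelated to the answer."""
    return float(random.random() < 0.5)
```

If training against spurious_reward produces nearly the same measured gain as true_reward, the measurement itself, not the reward, is the first thing to audit.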
Paper 2: Maximizing Confidence Alone Improves Reasoning
This work proposes an unsupervised RL method that uses the model's own prediction entropy as an intrinsic reward, encouraging higher confidence in chain‑of‑thought outputs. The reported improvement stems largely from a low baseline caused by the model's inability to follow the original GSM8K output format. When the community‑standard \boxed{} format is used, the baseline score exceeds the RL‑enhanced score, indicating that the RL step provides no real benefit.
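For illustration, a minimal answer‑extraction sketch in the \boxed{} convention (real graders also normalize LaTeX and handle nested braces, which this sketch does not):

```python
import re

def extract_boxed(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

# A grader that only greps for "The answer is 42" scores this completion 0,
# even though the model is right; the boxed convention recovers it.
assert extract_boxed(r"... so the result is \boxed{42}.") == "42"
```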
Paper 3: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Model: Qwen2.5‑Math‑1.5B
Baseline MATH‑500 accuracy: 36.0%
After RL with a single example: 73.6%
Average accuracy across six math benchmarks: 17.6% → 35.7%
The authors attribute the jump to the RL algorithm, but the critique identifies two methodological flaws: (1) an unreasonably low sampling temperature (e.g., temperature=0.1) that artificially depresses the baseline, and (2) evaluation procedures that differ between the baseline and RL runs, which invalidates the comparison.
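The fix is mechanical: score both checkpoints under one shared decoding configuration. A minimal sketch with Hugging Face transformers (the RL checkpoint path and the specific sampling values are placeholders; what matters is using the same ones twice):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One decoding config, reused verbatim for baseline and RL evaluation.
GEN_KWARGS = dict(do_sample=True, temperature=0.6, top_p=0.95, max_new_tokens=3072)

def generate(model_path: str, prompt: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, **GEN_KWARGS)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

prompt = "Solve: 3x + 5 = 20. Put the final answer in \\boxed{}."
baseline_out = generate("Qwen/Qwen2.5-Math-1.5B", prompt)
rl_out = generate("path/to/rl-checkpoint", prompt)  # hypothetical RL checkpoint
```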
Paper 4: Learning to Reason without External Rewards (INTUITOR)
INTUITOR replaces external rewards with self‑certainty scores, achieving performance comparable to GRPO on math benchmarks and better cross‑domain generalization on code generation. However, the RL‑enhanced model does not surpass the official few‑shot accuracy of the base model. The critique argues that the reported baseline is underestimated because the evaluation was zero‑shot and used a suboptimal output format; with the proper few‑shot setup the base model already matches or exceeds the RL results.
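One plausible reading of a self‑certainty score (an assumed formulation for illustration, not INTUITOR's verbatim code) is the average divergence of each next‑token distribution from uniform:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) over the generated tokens.

    Mean KL(U || p) per token: how far each predictive distribution sits
    from uniform. Higher values mean a more 'self-certain' generation.
    """
    logp = F.log_softmax(logits, dim=-1)
    kl_from_uniform = -logp.mean(dim=-1) - math.log(logits.size(-1))
    return kl_from_uniform.mean()
```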
Paper 5: VeriFree – Reinforcing General Reasoners without Verifiers
VeriFree removes answer verifiers and instead directly maximizes the probability the policy assigns to reference answers via RL. Experiments show results comparable or superior to verifier‑based methods while reducing computational cost. The critique notes that the comparison uses a temperature of 0, a setting the Qwen3 documentation explicitly warns degrades model capability, so the reported advantage may be an artifact of mismatched decoding settings rather than the method itself.
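In spirit, the objective can be sketched as follows (a hedged reconstruction, not the paper's implementation): sample a rationale, then reward the policy with the length‑normalized log‑probability it assigns to the reference answer.

```python
import torch

def answer_logprob(model, tokenizer, context: str, ref_answer: str) -> torch.Tensor:
    """Length-normalized log p(reference answer | prompt + sampled rationale).

    Sketch only: assumes tokenizing context + answer splits cleanly at the
    boundary, which real implementations must check.
    """
    ids = tokenizer(context + ref_answer, return_tensors="pt").input_ids
    n_ctx = tokenizer(context, return_tensors="pt").input_ids.size(1)
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0], dim=-1)
    answer_ids = ids[0, n_ctx:]                           # reference-answer tokens
    positions = torch.arange(n_ctx - 1, ids.size(1) - 1)  # logits predicting them
    return logp[positions, answer_ids].mean()
```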
Paper 6: Unreasonable Effectiveness of Entropy Minimization
The authors minimize output entropy, concentrating probability mass on the model's most confident predictions. Without any labeled data, this simple objective yields significant gains on mathematics, physics, and coding tasks. The critique again points out that the baseline scores were depressed by an excessively low temperature (e.g., temperature=0.1), inflating the apparent improvement.
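The objective itself fits in a few lines; a minimal sketch of a per‑token entropy loss (an assumed form consistent with the description, with no labels involved):

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size). Mean per-token entropy.

    Minimizing this sharpens each predictive distribution onto the
    model's already-preferred tokens; no labels are consulted.
    """
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()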
Paper 7: Can Large Reasoning Models Self‑Train?
This paper introduces an online self‑training RL algorithm that treats model self‑consistency as a correctness signal, eliminating the need for external labels. Applied to complex mathematical reasoning, the method matches or exceeds gold‑standard RL trained with true answers. Nevertheless, the reported pre‑RL baseline scores are much lower than those observed in practice, suggesting that evaluation parameters (temperature, prompt style) differed between the baseline and the self‑training runs.
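The core signal is easy to sketch: sample k completions per problem, take the majority answer as a pseudo‑label, and reward agreement with it (a minimal illustration, assuming answers are already extracted from the completions):

```python
from collections import Counter

def self_consistency_rewards(answers: list[str]) -> list[float]:
    """Reward 1.0 for samples that agree with the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# 4 samples, 3 agree on "42": the dissenting sample earns zero reward.
assert self_consistency_rewards(["42", "42", "7", "42"]) == [1.0, 1.0, 0.0, 1.0]
```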
Key Takeaways
Many claimed RL gains are attributable to inconsistent or suboptimal evaluation settings rather than algorithmic advances.
Temperature selection (often set to 0 or 0.1) frequently depresses baseline performance, making RL‑augmented results appear larger.
Output formatting (e.g., using \boxed{} for GSM8K) can dramatically affect scores; baselines that ignore community‑standard formats underestimate true performance.
Rigorous, transparent benchmarking—identical decoding parameters, prompt styles, and evaluation scripts for both baseline and RL models—is essential to avoid misleading conclusions.
