Why Reinforcement Learning Fails to Boost Small LLM Reasoning: A Deep Dive

This article examines a recent study on language‑model reasoning. The study finds that reinforcement learning often brings little or no improvement, that evaluation variance from seeds, hardware, and decoding settings can swing benchmark results substantially, and that supervised fine‑tuning is the more reliable path.

AI Frontier Lectures

Apr 17, 2025

Background

Reasoning ability is a rapidly advancing frontier for language models. The paper "A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility" (arXiv:2504.07086) investigates whether reinforcement learning (RL) truly improves small distilled LLMs and why reported gains are often unstable.

Methodology

The authors built a standardized evaluation pipeline using the LightEval framework. Six mathematical reasoning benchmarks were used:

AIME‑24

AIME‑25

AMC‑23

MATH‑500

Minerva

OlympiadBench

The evaluation protocol for each benchmark was:

10 random seeds for AIME/AMC, 3 seeds for the other datasets.

Identical prompt templates, context length (4096 tokens for math models, 32768 for others), and decoding parameters (temperature, top_p, max_new_tokens).

Evaluations on five GPU clusters with different hardware (e.g., V100, A100) and software stacks (PyTorch, vLLM, etc.).
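To make the protocol above concrete, here is a minimal Python sketch of how such settings could be pinned in one place and written out next to the results. The `EvalConfig` class and the helper names are hypothetical and are not LightEval's API. The seed counts and context lengths follow the text; the decoding values in the usage lines are illustrative placeholders, not the paper's settings.

```python
import json
from dataclasses import asdict, dataclass

BENCHMARKS = ["AIME-24", "AIME-25", "AMC-23", "MATH-500", "Minerva", "OlympiadBench"]


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings used for one benchmark."""
    benchmark: str
    num_seeds: int
    max_context_tokens: int
    temperature: float
    top_p: float
    max_new_tokens: int


def seeds_for(benchmark: str) -> int:
    """10 seeds for the small AIME/AMC sets, 3 for the rest (per the text)."""
    return 10 if benchmark.startswith(("AIME", "AMC")) else 3


def context_for(is_math_model: bool) -> int:
    """4096 tokens for math models, 32768 for others (per the text)."""
    return 4096 if is_math_model else 32768


def build_configs(is_math_model, temperature, top_p, max_new_tokens):
    """Build one frozen config per benchmark with identical decoding settings."""
    return [
        EvalConfig(
            benchmark=name,
            num_seeds=seeds_for(name),
            max_context_tokens=context_for(is_math_model),
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens,
        )
        for name in BENCHMARKS
    ]


# Placeholder decoding values, chosen only for illustration.
configs = build_configs(
    is_math_model=False, temperature=0.8, top_p=0.9, max_new_tokens=32768
)
print(json.dumps([asdict(c) for c in configs], indent=2))
```

Freezing the dataclass and dumping it as JSON means the exact settings travel with the scores, so a later run can be checked against them.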

Model families compared:

DeepSeek‑R1‑Distill (1.5B and 7B): base, instruction‑tuned, and RL‑fine‑tuned variants (e.g., OpenRS‑1.5B, DeepScaleR).

SFT (supervised fine‑tuning) on reasoning trajectories for the same base models.

Key Findings

Seed variance: Pass@1 standard deviation ranged from 5 to 15 percentage points (pp) across seeds, especially on small benchmarks (AIME‑24, AMC‑23), where a single question can shift accuracy by 2.5–3.3 pp.

Hardware & framework effects: Changing GPU type or evaluation library (LightEval vs. Evalchemy) shifted scores by 1–2 pp, enough to reorder closely matched models; on AIME‑24 the same model showed up to an 8 pp spread across clusters.

Decoding settings: Reducing max_new_tokens lowered performance, particularly on problems requiring long solutions. Prompt format also mattered: math‑specific prompts combined with the model’s native chat template yielded the best results.

RL vs. SFT: Most RL‑trained variants did not achieve statistically significant improvements over their base models. When gains existed, they were modest and less robust than SFT, which consistently outperformed RL across all benchmarks.

Length‑error correlation: Longer generated responses correlated with higher error rates for both RL and SFT models; the effect was stronger for RL‑trained models.

Diversity collapse: No systematic drop in diversity was observed. Improvements in Pass@1 were usually accompanied by gains in Pass@k (k = 5, 10), contradicting the “diversity collapse” hypothesis.
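The 2.5–3.3 pp figure in the seed-variance finding lines up with the weight of a single problem on a small benchmark. The short Python sketch below works through that arithmetic, assuming the commonly cited sizes of 30 problems for AIME‑24 and 40 for AMC‑23 (the article itself does not state them); `one_question_shift_pp` is a hypothetical helper name.

```python
def one_question_shift_pp(num_problems: int) -> float:
    """Change in Pass@1, in percentage points, when one answer flips."""
    if num_problems <= 0:
        raise ValueError("num_problems must be positive")
    return 100.0 / num_problems


# Assumed benchmark sizes; MATH-500 takes its size from its name.
sizes = {"AIME-24": 30, "AMC-23": 40, "MATH-500": 500}

for name, n in sizes.items():
    print(f"{name}: {one_question_shift_pp(n):.1f} pp per problem")
# AIME-24: 3.3 pp per problem
# AMC-23: 2.5 pp per problem
# MATH-500: 0.2 pp per problem
```

On a 30-problem set, a reported 3 pp gain can therefore come down to one extra correct answer, which is why single-seed comparisons on these benchmarks are fragile.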

Reproducibility Recommendations

Report results over multiple random seeds and include error bars.

Fix and disclose all decoding hyper‑parameters (temperature, top_p, max_new_tokens) and prompt templates.

Document hardware (GPU model, memory) and software stack (framework version, inference engine).

Prefer a single, open‑source evaluation framework (e.g., LightEval) and share configuration files.

Consider supervised fine‑tuning on reasoning trajectories as a more reliable way to improve performance.
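As a rough illustration of the first recommendation, the Python sketch below turns per-seed Pass@1 scores into a mean with an error bar. The function name `summarize_seeds` and the scores are made up for the example; they are not results from the paper.

```python
import statistics


def summarize_seeds(per_seed_pass1):
    """Return (mean, sample standard deviation) of per-seed Pass@1 scores."""
    if len(per_seed_pass1) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(per_seed_pass1)
    std = statistics.stdev(per_seed_pass1)  # sample std (n - 1 denominator)
    return mean, std


# Made-up Pass@1 scores (in percent), one per random seed.
runs = [30.0, 36.7, 26.7, 33.3, 40.0]
mean, std = summarize_seeds(runs)
print(f"Pass@1 = {mean:.1f} ± {std:.1f} pp over {len(runs)} seeds")
```

Reporting the spread alongside the mean lets a reader judge whether a gap between two models is larger than the seed-to-seed noise.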

Experimental Results (selected)

Table 3 (referenced in the paper) reports Pass@1 ± σ for each model‑benchmark pair. Highlights include:

OpenRS‑1.5B showed up to an 8 pp performance swing on AIME‑24 across hardware clusters.

DeepSeek‑R1‑Distill‑7B RL variants improved Pass@1 by ≤ 2 pp, whereas SFT on the same model improved by 4–6 pp.

Pass@k (k = 5, 10) improvements tracked Pass@1 gains, indicating no loss of answer diversity.
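For readers unfamiliar with Pass@k, the Python sketch below shows one common way to estimate it from n sampled answers per problem, of which c are correct. This article does not say which estimator the paper uses; the sketch assumes the widely used unbiased form 1 - C(n-c, k) / C(n, k) from the code-generation literature, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n samples for a problem of which c were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 10 samples, 3 of them correct.
for k in (1, 5, 10):
    print(f"Pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

With k = 1 the estimate reduces to c / n, which is why Pass@1 and Pass@k can be compared on the same set of samples.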

Conclusion

When evaluated under a rigorously controlled, reproducible setup, reinforcement learning provides only marginal benefits for small distilled LLMs, while supervised fine‑tuning on reasoning trajectories yields consistent and larger gains. The study underscores the importance of standardized evaluation practices (multiple seeds, fixed hyper‑parameters, and transparent hardware/software reporting) for obtaining trustworthy progress signals in language‑model reasoning research.
