ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power-law scaling formula for reinforcement-learning (RL) post-training of large language models, demonstrating accurate inter- and intra-model performance prediction, saturation of learning efficiency, the benefits of data reuse, and cross-architecture validity.

Machine Heart

Background

Reinforcement learning (RL) post-training has become a core technique for boosting the reasoning ability of large models such as DeepSeek-R1 and Kimi K2.5. Unlike pre-training, RL optimizes a reward-maximizing policy rather than next-token prediction, so the familiar pre-training scaling laws cannot be applied directly.

Experimental Design

To uncover the scaling behavior of RL post-training, the authors selected mathematical reasoning as the benchmark because its answers are verifiable and provide precise reward signals. They conducted large-scale controlled experiments on two model families:

Qwen2.5 dense models ranging from 0.5 B to 72 B parameters (single variable: model size).

Llama 3 models from 1 B to 70 B parameters for cross‑architecture validation.

All experiments used the VeRL distributed RL platform with the GRPO algorithm, repeated three times per configuration, and covered both Base and Instruct variants. Training data came from the guru‑RL‑92k subset (≈54 k math problems) organized as a curriculum. Test loss L = 1 − Pass@1 served as the primary metric, evaluated on 500 in‑domain math questions and on ~3 000 cross‑domain problems (code, logic, science, etc.).

Key Findings

Finding 1 – Predictive RL scaling law : Test loss L decreases as a power law in training resources X (compute C or data D); equivalently, log L is linear in log X. The learning-efficiency coefficient k(N) grows monotonically with model size N. The fit achieves R² > 0.99 and enables two practical predictions:

Inter‑model extrapolation : Parameters fitted on 0.5 B–32 B models accurately predict the full training curve of a 72 B model.

Intra-model prediction : Using only the first 20–30% of training steps, the formula forecasts the final converged performance of the same model.
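The intra-model prediction above can be sketched numerically. Assuming a simple power-law form L(X) = A · X^(−k) (an illustrative parameterization; the paper's exact formula may include additional terms), fitting in log-log space on early checkpoints lets us extrapolate the final loss:

```python
import math

# Synthetic training curve following an assumed power law L(X) = A * X**(-k).
# (Illustrative form and constants; not the paper's fitted values.)
A_true, k_true = 0.9, 0.15
X = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]      # training resource in arbitrary units
L = [A_true * x ** -k_true for x in X]

# Intra-model prediction: estimate k and A from the first two checkpoints only
# (the early ~30% of training), then extrapolate to the final checkpoint.
k = -(math.log(L[1]) - math.log(L[0])) / (math.log(X[1]) - math.log(X[0]))
A = L[0] * X[0] ** k
L_final_pred = A * X[-1] ** -k

print(round(k, 3), round(L_final_pred, 4), round(L[-1], 4))
```

On clean synthetic data the early fit recovers k exactly; on real, noisy curves one would fit over all early checkpoints (e.g. least squares in log-log space) rather than two points.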

Finding 2 – Saturating learning efficiency : While larger models learn faster (k(N) increases from 0.5 B to 72 B), the growth is sub-linear and plateaus beyond roughly 32 B parameters, approaching a theoretical ceiling. This saturation explains a "performance crossover": under an equal compute budget, a 32 B model can initially outperform a 72 B model because it completes more training steps.
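The crossover logic above can be made concrete with a toy calculation. All functional forms and constants below are invented for illustration: per-step cost is assumed proportional to model size, and k(N) is given a saturating shape so that the 72 B model's efficiency edge is small:

```python
# Illustrative crossover under an equal compute budget (all numbers invented).
def k_of_N(N_billion, k_max=0.25, N_half=10.0):
    # Assumed saturating form: k grows with N but plateaus for large N.
    return k_max * N_billion / (N_billion + N_half)

def loss_at_budget(N_billion, budget, A=1.0):
    steps = budget / N_billion          # bigger model => fewer steps per budget
    return A * steps ** -k_of_N(N_billion)

for budget in (1e3, 1e6):               # small vs large compute budget
    l32 = loss_at_budget(32, budget)
    l72 = loss_at_budget(72, budget)
    print(f"budget={budget:.0e}  L(32B)={l32:.3f}  L(72B)={l72:.3f}")
```

With these toy numbers the 32 B model has lower loss at the small budget (more steps completed), while the 72 B model wins once the budget is large enough for its higher k(N) to dominate, mirroring the crossover the paper reports.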

Finding 3 – Data reuse is effective : Re-using the same data multiple times (repeat factor r) shows that for r ≤ 25 the final performance is virtually identical to using the full dataset once. Only at extreme reuse (r = 100) does over-fitting become noticeable. Moderate data reuse is therefore a low-cost strategy when high-quality reasoning data are scarce.
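One way to read the reuse experiment: a subset of size D/r repeated r times supplies the same number of training samples as one pass over the full set. A minimal sketch of building such a stream (the helper and seed handling are assumptions, not the paper's pipeline):

```python
import random

# Data-reuse sketch: train on a subset of size D // r repeated r times,
# so the total sample count matches one pass over the full dataset.
def build_stream(dataset, r, seed=0):
    rng = random.Random(seed)
    subset = rng.sample(dataset, len(dataset) // r)
    stream = subset * r                 # each retained example is seen r times
    rng.shuffle(stream)
    return stream

full = list(range(54_000))              # ~54k math problems, as in the paper
stream = build_stream(full, r=25)
print(len(stream), len(set(stream)))    # same sample count, 1/25 unique data
```

Per the paper's finding, training on this r = 25 stream ends at virtually the same loss as one pass over `full`, while r = 100 would begin to over-fit.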

Cross‑Architecture Validation

The same scaling law and saturation behavior were reproduced on Llama 3 models (Llama-3.2-1B/3B-Instruct, Llama-3.1-8B/70B-Instruct). Despite lower absolute performance than Qwen, the functional form of the scaling relationship and the k(N) saturation trend remain identical, again with R² > 0.99, confirming that the law captures an intrinsic property of RL post-training rather than an architecture-specific artifact.

Conclusion

The study delivers a quantitative, predictive framework for RL post-training of large language models, offering both inter-model extrapolation and intra-model trajectory forecasting. It also highlights the diminishing returns of scaling beyond 32 B parameters and validates data reuse as a practical training shortcut, providing actionable guidance for researchers and engineers seeking to improve model reasoning via RL.

Tags: large language models, Scaling Law, reinforcement learning, Model Efficiency, Qwen2.5, Data Reuse, Llama 3