Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study
Meta’s recent paper reveals a sigmoid-shaped scaling law for LLM reinforcement learning, presents a systematic study spanning roughly 400,000 GPU-hours, compares RL design choices such as PPO-off-policy-k and Pipeline-RL-k, and distills the findings into a practical “ScaleRL” recipe that improves both performance and efficiency.
Paper Overview
The Art of Scaling Reinforcement Learning Compute for LLMs
Link: https://arxiv.org/abs/2510.13786
The work introduces a scaling law for reinforcement learning (RL) applied to large language models (LLMs) and evaluates many RL design choices, culminating in a recommended “ScaleRL” recipe.
Scaling Law of RL
Experiments were run on an 8-billion-parameter dense model and a 17-billion-parameter × 16-expert mixture-of-experts (MoE) model, consuming roughly 400,000 GPU-hours in total. Performance on an i.i.d. validation set follows a sigmoid-shaped curve that extrapolates accurately to larger compute budgets and tracks downstream AIME-24 results.
The scaling law can be written as a sigmoid in log-compute:

\hat{R}_C = A \cdot \sigma\big(B (\log C - \log C_{\text{mid}})\big)

where \sigma is the sigmoid function, A is the performance ceiling (the maximum achievable RL reward), B is the learning efficiency (the steepness of the curve), and C_{\text{mid}} is the compute required to reach half of the ceiling.
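To make the three parameters concrete, here is a minimal fitting sketch (not the paper's code; the data points below are made up purely for illustration) using SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

def rl_scaling_curve(log_c, a, b, log_c_mid):
    # Sigmoid in log-compute: A * sigma(B * (log C - log C_mid)).
    return a / (1.0 + np.exp(-b * (log_c - log_c_mid)))

# Hypothetical measurements: GPU-hours and i.i.d. validation pass rate.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
reward = np.array([0.05, 0.12, 0.28, 0.45, 0.55, 0.60])

(a, b, log_c_mid), _ = curve_fit(
    rl_scaling_curve, np.log(compute), reward, p0=[0.7, 1.0, np.log(1e3)])
print(f"ceiling A={a:.2f}, efficiency B={b:.2f}, "
      f"midpoint C_mid={np.exp(log_c_mid):.0f} GPU-hours")

# The fitted curve can then be extrapolated to larger budgets.
print("predicted reward at 1e5 GPU-hours:",
      rl_scaling_curve(np.log(1e5), a, b, log_c_mid))
```

Fitting at small budgets and extrapolating is exactly how the paper compares recipes before committing large amounts of compute.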
Empirical Study of RL Scaling
Asynchronous RL Settings
Background: Generation and training are often implemented in separate frameworks, resulting in two distinct model instances whose weights must be kept in sync. Two asynchronous schemes are compared (see the sketch after this list).
PPO-off-policy-k: A frozen (old) policy generates rollouts for a batch of B prompts, which are split into k mini-batches of size \hat{B} = B / k; a gradient update is taken on each mini-batch before the next generation pass.
Pipeline-RL-k (Magistral): After each trainer update, the new parameters are loaded into the generator while previously generated tokens and the KV-cache are preserved. The trainer is limited to at most k steps ahead of the generator.
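A minimal, self-contained sketch of the two schemes (all names are hypothetical stand-ins, not the paper's infrastructure); a version counter stands in for the weights so the off-policy gap is visible:

```python
class Policy:
    def __init__(self):
        self.version = 0            # stands in for the model weights

def generate(policy, prompts):
    # Tag each rollout with the version of the policy that produced it.
    return [(p, policy.version) for p in prompts]

def update(policy, mini_batch):
    policy.version += 1             # one gradient step

def ppo_off_policy_k(prompts, k):
    """One frozen generation pass, then k updates on its mini-batches:
    the last mini-batch is trained by a policy k-1 steps newer than
    the one that generated it."""
    policy = Policy()
    rollouts = generate(policy, prompts)
    size = len(rollouts) // k
    for i in range(k):
        update(policy, rollouts[i * size:(i + 1) * size])
    return policy.version

def pipeline_rl_k(prompts, k):
    """Weights are pushed to the generator continuously (in-flight tokens
    and KV-cache are kept), and the trainer never runs more than k steps
    ahead of the weights the generator is using."""
    trainer, generator = Policy(), Policy()
    for prompt in prompts:
        if trainer.version - generator.version >= k:
            generator.version = trainer.version   # push fresh weights
        rollout = generate(generator, [prompt])
        update(trainer, rollout)
    return trainer.version

print(ppo_off_policy_k(list(range(64)), k=8))   # 8 updates from one batch
print(pipeline_rl_k(list(range(64)), k=8))      # 64 updates, bounded staleness
```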
Conclusions
Pipeline-RL achieves the same performance ceiling A as PPO-off-policy but with higher learning efficiency B.
The optimal off-policy horizon is k = 8.
Loss Types
DAPO (ByteDance Seed): Token-level clipping; clipped tokens contribute no gradient.
GSPO (Qwen): Sequence-level clipping.
CISPO (MiniMax): Starts from vanilla REINFORCE and applies a clipped, stop-gradient importance-sampling weight to each token, so all tokens retain gradients.
Conclusion: CISPO slightly outperforms GSPO, and both surpass DAPO in performance ceiling A.
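A minimal sketch of a CISPO-style token loss, reflecting the description above rather than any official implementation (the clipping thresholds are placeholders):

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.5, eps_high=4.0):
    """REINFORCE weighted by a clipped, stop-gradient importance ratio.

    Unlike DAPO-style token clipping, no token is dropped from the
    gradient: the clip only caps the weight on each token's update,
    and the gradient still flows through logp_new for every token.
    """
    ratio = torch.exp(logp_new - logp_old)                   # importance ratio
    weight = ratio.clamp(1.0 - eps_low, 1.0 + eps_high).detach()
    return -(weight * advantages * logp_new).mean()
```

Note that the final .mean() is itself a design choice; the aggregation comparison below favors prompt-level averaging instead.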
FP32 Precision for LLM Logits
Using full-precision (FP32) logits in both the generator and trainer heads yields a noticeable performance boost (a fix proposed in MiniMax-M1).
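A minimal sketch (PyTorch-style; shapes and names are illustrative) of keeping the backbone in BF16 while computing the LM-head matmul and softmax in FP32, which narrows the numerical gap between generator and trainer:

```python
import torch
import torch.nn.functional as F

hidden = torch.randn(2, 16, 4096, dtype=torch.bfloat16)   # backbone output
lm_head = torch.nn.Linear(4096, 32000, bias=False, dtype=torch.bfloat16)

# Upcast before the head matmul so the logits (and the softmax over them)
# are computed entirely in FP32.
logits = F.linear(hidden.float(), lm_head.weight.float())
logprobs = F.log_softmax(logits, dim=-1)
```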
Loss Aggregation Strategies
Sample average (GRPO): average the token losses within each trajectory, then average across trajectories.
Prompt average (DAPO): average within each prompt's group of trajectories, then average across prompts.
Token average: a single mean over all tokens in the batch.
Conclusion: Prompt-level averaging delivers the best results.
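One plausible reading of the three strategies on a toy batch (a sketch, not the paper's code):

```python
import torch

# Per-token losses for three trajectories; the first two share prompt 0.
token_loss = [torch.tensor([0.5, 0.3]), torch.tensor([0.2, 0.4, 0.6]),
              torch.tensor([0.1])]
prompt_ids = [0, 0, 1]

# Token average: one flat mean over every token in the batch.
token_avg = torch.cat(token_loss).mean()

# Sample average (GRPO-style): mean per trajectory, then mean of those.
per_traj = torch.stack([t.mean() for t in token_loss])
sample_avg = per_traj.mean()

# Prompt average (DAPO-style, the recipe's choice): mean per trajectory,
# then per prompt, then across prompts.
ids = torch.tensor(prompt_ids)
prompt_avg = torch.stack([per_traj[ids == p].mean()
                          for p in sorted(set(prompt_ids))]).mean()

print(token_avg.item(), sample_avg.item(), prompt_avg.item())
```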
Advantage Normalization
Normalize by the reward standard deviation across the rollouts of a single prompt (GRPO).
Normalize by the reward standard deviation across the entire batch.
Do not normalize by a standard deviation at all (Dr.GRPO).
Conclusion: All three variants perform similarly; the batch-wide standard deviation is theoretically sounder and is the one adopted.
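A minimal sketch of the three choices applied to group-relative advantages (rewards centered by the per-prompt mean):

```python
import torch

rewards = torch.tensor([[1., 0., 1., 1.],     # rollouts of prompt 0
                        [0., 0., 1., 0.]])    # rollouts of prompt 1
centered = rewards - rewards.mean(dim=1, keepdim=True)

adv_grpo = centered / (rewards.std(dim=1, keepdim=True) + 1e-6)  # per-prompt std
adv_batch = centered / (rewards.std() + 1e-6)                    # batch-wide std
adv_none = centered                                              # Dr.GRPO: no std
```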
Zero‑Variance Filtering
Filtering out prompts whose rollouts have zero variance (identical rewards) improves training performance.
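The rationale is that identical rewards make every group-relative advantage exactly zero, so such prompts consume generation compute without producing any gradient. A minimal sketch of the filter:

```python
import torch

def keep_nonzero_variance(rewards_per_prompt):
    """rewards_per_prompt: (num_prompts, num_rollouts) reward tensor."""
    return rewards_per_prompt.std(dim=1) > 0

rewards = torch.tensor([[1., 1., 1., 1.],   # all identical: filtered out
                        [1., 0., 1., 0.]])  # mixed rewards: kept
print(keep_nonzero_variance(rewards))       # tensor([False,  True])
```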
Adaptive Prompt Filtering (Polaris)
Prompts with average accuracy above 0.9 are permanently removed, raising the performance ceiling A.
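A minimal sketch of such a filter, with a hypothetical running pass-rate tracker (the 0.9 threshold comes from the text above):

```python
from collections import defaultdict

THRESHOLD = 0.9
pass_counts = defaultdict(lambda: [0, 0])    # prompt_id -> [passes, tries]
retired = set()

def record(prompt_id, rewards):
    passes, tries = pass_counts[prompt_id]
    pass_counts[prompt_id] = [passes + sum(rewards), tries + len(rewards)]
    p, t = pass_counts[prompt_id]
    if p / t > THRESHOLD:
        retired.add(prompt_id)               # never sampled again

record("p1", [1, 1, 1, 1])                   # 100% pass rate -> retired
print("p1" in retired)                       # True
```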
ScaleRL: A Mature RL Recipe
Pipeline‑RL with an 8‑step off‑policy horizon.
Interruption-based length control (when the length limit is reached, generation is interrupted and the model is prompted to give its final answer directly).
FP32 precision for logits.
CISPO loss.
Prompt‑level loss aggregation.
Zero‑variance filtering.
No-positive resampling (prompts with persistently high pass rates are dropped from later epochs, as in the adaptive filtering above).
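Pulling the pieces together, a hypothetical configuration summary of the recipe (keys and values are illustrative, not an actual API):

```python
scale_rl = {
    "async_scheme": "pipeline_rl",
    "off_policy_horizon_k": 8,
    "length_control": "interruption",   # truncate, then force a final answer
    "logits_dtype": "fp32",
    "loss": "cispo",
    "loss_aggregation": "prompt_average",
    "advantage_norm": "batch_std",
    "zero_variance_filtering": True,
    "no_positive_resampling": True,     # retire prompts with pass rate > 0.9
}
```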
Leave-one-out ablations, each removing a single component from the full configuration, confirm that every part contributes.
Factors Influencing RL Scaling
Model size: The scaling law holds for both the 8B dense model and the 17B × 16 MoE model.
Context length: Extending generation from 14k to 32k tokens reduces early learning efficiency (lower B) but raises the final performance ceiling (higher A).
Batch size : Smaller batches perform better early on; larger batches achieve higher ceilings as compute grows.
Rollout count : Varying the number of rollouts per prompt (8, 16, 24, 32) has negligible impact when total batch size is fixed.
Takeaways
RL performance improvement follows a sigmoid function of log-scaled compute, parameterized by a ceiling A, an efficiency B, and a compute midpoint C_mid.
Algorithmic design largely determines the achievable ceiling A; scaling-law curves fitted at small budgets can predict which methods will remain effective at larger scales.
Many tricks previously believed to be effective (loss aggregation, curriculum learning, length penalties, advantage normalization) mainly boost efficiency B without raising the ceiling A.
The ScaleRL recipe provides a practical, empirically‑validated set of guidelines for future LLM RL training.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
