Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study
Meta’s recent paper reveals a sigmoid-shaped scaling law for LLM reinforcement learning, presents a systematic study spanning roughly 400,000 GPU-hours, compares RL design choices such as PPO-off-policy-k and Pipeline-RL-k, and distills the findings into a practical “ScaleRL” recipe that improves both performance and efficiency.
Paper Overview
The Art of Scaling Reinforcement Learning Compute for LLMs
Link: https://arxiv.org/abs/2510.13786
The work introduces a scaling law for reinforcement learning (RL) applied to large language models (LLMs) and evaluates many RL design choices, culminating in a recommended “ScaleRL” recipe.
Scaling Law of RL
Experiments were run on an 8-billion-parameter dense model and a 17-billion-parameter × 16-expert mixture-of-experts (MoE) model, consuming roughly 400,000 GPU-hours in total. Performance on an i.i.d. validation set follows a sigmoid-shaped curve that extrapolates accurately to larger compute budgets and tracks downstream AIME-24 results.
The scaling law can be written as a sigmoid in log-compute:

\hat{R}_C = A \cdot \sigma\big(B (\log C - \log C_{\text{mid}})\big)

where \sigma is the sigmoid function, A is the performance ceiling (the maximum achievable RL reward), B is the learning efficiency (the steepness of the curve), and C_{\text{mid}} is the compute required to reach half of the ceiling.
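To make the three parameters concrete, here is a minimal fitting sketch (not the paper's code; the data points below are made up purely for illustration) using SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

def rl_scaling_curve(log_c, a, b, log_c_mid):
    # Sigmoid in log-compute: A * sigma(B * (log C - log C_mid)).
    return a / (1.0 + np.exp(-b * (log_c - log_c_mid)))

# Hypothetical measurements: GPU-hours and i.i.d. validation pass rate.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
reward = np.array([0.05, 0.12, 0.28, 0.45, 0.55, 0.60])

(a, b, log_c_mid), _ = curve_fit(
    rl_scaling_curve, np.log(compute), reward, p0=[0.7, 1.0, np.log(1e3)])
print(f"ceiling A={a:.2f}, efficiency B={b:.2f}, "
      f"midpoint C_mid={np.exp(log_c_mid):.0f} GPU-hours")

# The fitted curve can then be extrapolated to larger budgets.
print("predicted reward at 1e5 GPU-hours:",
      rl_scaling_curve(np.log(1e5), a, b, log_c_mid))
```

Fitting at small budgets and extrapolating is exactly how the paper compares recipes before committing large amounts of compute.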
Empirical Study of RL Scaling
Asynchronous RL Settings
Background: Generation and training are often implemented in separate frameworks, resulting in two distinct model instances whose weights must be kept in sync. Two asynchronous schemes are compared (see the sketch after this list).
PPO-off-policy-k: A frozen (old) policy generates rollouts for a batch of B prompts, which are split into k mini-batches of size \hat{B} = B / k; a gradient update is taken on each mini-batch before the next generation pass.
Pipeline-RL-k (Magistral): After each trainer update, the new parameters are loaded into the generator while previously generated tokens and the KV-cache are preserved. The trainer is limited to at most k steps ahead of the generator.
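A minimal, self-contained sketch of the two schemes (all names are hypothetical stand-ins, not the paper's infrastructure); a version counter stands in for the weights so the off-policy gap is visible:

```python
class Policy:
    def __init__(self):
        self.version = 0            # stands in for the model weights

def generate(policy, prompts):
    # Tag each rollout with the version of the policy that produced it.
    return [(p, policy.version) for p in prompts]

def update(policy, mini_batch):
    policy.version += 1             # one gradient step

def ppo_off_policy_k(prompts, k):
    """One frozen generation pass, then k updates on its mini-batches:
    the last mini-batch is trained by a policy k-1 steps newer than
    the one that generated it."""
    policy = Policy()
    rollouts = generate(policy, prompts)
    size = len(rollouts) // k
    for i in range(k):
        update(policy, rollouts[i * size:(i + 1) * size])
    return policy.version

def pipeline_rl_k(prompts, k):
    """Weights are pushed to the generator continuously (in-flight tokens
    and KV-cache are kept), and the trainer never runs more than k steps
    ahead of the weights the generator is using."""
    trainer, generator = Policy(), Policy()
    for prompt in prompts:
        if trainer.version - generator.version >= k:
            generator.version = trainer.version   # push fresh weights
        rollout = generate(generator, [prompt])
        update(trainer, rollout)
    return trainer.version

print(ppo_off_policy_k(list(range(64)), k=8))   # 8 updates from one batch
print(pipeline_rl_k(list(range(64)), k=8))      # 64 updates, bounded staleness
```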
Conclusions
Pipeline-RL achieves the same performance ceiling A as PPO-off-policy but with higher learning efficiency B.
The optimal off-policy horizon is k = 8.
Loss Types
DAPO (ByteDance Seed): Token-level clipping; clipped tokens contribute no gradient.
GSPO (Qwen): Sequence-level clipping.
CISPO (MiniMax): Starts from vanilla REINFORCE and applies a clipped, stop-gradient importance-sampling weight to each token, so all tokens retain gradients.
Conclusion: CISPO slightly outperforms GSPO, and both surpass DAPO in performance ceiling A.
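A minimal sketch of a CISPO-style token loss, reflecting the description above rather than any official implementation (the clipping thresholds are placeholders):

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.5, eps_high=4.0):
    """REINFORCE weighted by a clipped, stop-gradient importance ratio.

    Unlike DAPO-style token clipping, no token is dropped from the
    gradient: the clip only caps the weight on each token's update,
    and the gradient still flows through logp_new for every token.
    """
    ratio = torch.exp(logp_new - logp_old)                   # importance ratio
    weight = ratio.clamp(1.0 - eps_low, 1.0 + eps_high).detach()
    return -(weight * advantages * logp_new).mean()
```

Note that the final .mean() is itself a design choice; the aggregation comparison below favors prompt-level averaging instead.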
FP32 Precision for LLM Logits
Using full-precision (FP32) logits in both the generator and trainer heads yields a noticeable performance boost (a fix proposed in MiniMax-M1).
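A minimal sketch (PyTorch-style; shapes and names are illustrative) of keeping the backbone in BF16 while computing the LM-head matmul and softmax in FP32, which narrows the numerical gap between generator and trainer:

```python
import torch
import torch.nn.functional as F

hidden = torch.randn(2, 16, 4096, dtype=torch.bfloat16)   # backbone output
lm_head = torch.nn.Linear(4096, 32000, bias=False, dtype=torch.bfloat16)

# Upcast before the head matmul so the logits (and the softmax over them)
# are computed entirely in FP32.
logits = F.linear(hidden.float(), lm_head.weight.float())
logprobs = F.log_softmax(logits, dim=-1)
```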
Loss Aggregation Strategies
Sample average (GRPO): average the token losses within each trajectory, then average across trajectories.
Prompt average (DAPO): average within each prompt's group of trajectories, then average across prompts.
Token average: a single mean over all tokens in the batch.
Conclusion: Prompt-level averaging delivers the best results.
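One plausible reading of the three strategies on a toy batch (a sketch, not the paper's code):

```python
import torch

# Per-token losses for three trajectories; the first two share prompt 0.
token_loss = [torch.tensor([0.5, 0.3]), torch.tensor([0.2, 0.4, 0.6]),
              torch.tensor([0.1])]
prompt_ids = [0, 0, 1]

# Token average: one flat mean over every token in the batch.
token_avg = torch.cat(token_loss).mean()

# Sample average (GRPO-style): mean per trajectory, then mean of those.
per_traj = torch.stack([t.mean() for t in token_loss])
sample_avg = per_traj.mean()

# Prompt average (DAPO-style, the recipe's choice): mean per trajectory,
# then per prompt, then across prompts.
ids = torch.tensor(prompt_ids)
prompt_avg = torch.stack([per_traj[ids == p].mean()
                          for p in sorted(set(prompt_ids))]).mean()

print(token_avg.item(), sample_avg.item(), prompt_avg.item())
```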
Advantage Normalization
Normalize by the reward standard deviation across the rollouts of a single prompt (GRPO).
Normalize by the reward standard deviation across the entire batch.
Do not normalize by a standard deviation at all (Dr.GRPO).
Conclusion: All three variants perform similarly; the batch-wide standard deviation is theoretically sounder and is the one adopted.
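A minimal sketch of the three choices applied to group-relative advantages (rewards centered by the per-prompt mean):

```python
import torch

rewards = torch.tensor([[1., 0., 1., 1.],     # rollouts of prompt 0
                        [0., 0., 1., 0.]])    # rollouts of prompt 1
centered = rewards - rewards.mean(dim=1, keepdim=True)

adv_grpo = centered / (rewards.std(dim=1, keepdim=True) + 1e-6)  # per-prompt std
adv_batch = centered / (rewards.std() + 1e-6)                    # batch-wide std
adv_none = centered                                              # Dr.GRPO: no std
```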
Zero‑Variance Filtering
Filtering out prompts whose rollouts have zero variance (identical rewards) improves training performance.
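The rationale is that identical rewards make every group-relative advantage exactly zero, so such prompts consume generation compute without producing any gradient. A minimal sketch of the filter:

```python
import torch

def keep_nonzero_variance(rewards_per_prompt):
    """rewards_per_prompt: (num_prompts, num_rollouts) reward tensor."""
    return rewards_per_prompt.std(dim=1) > 0

rewards = torch.tensor([[1., 1., 1., 1.],   # all identical: filtered out
                        [1., 0., 1., 0.]])  # mixed rewards: kept
print(keep_nonzero_variance(rewards))       # tensor([False,  True])
```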
Adaptive Prompt Filtering (Polaris)
Prompts with average accuracy above 0.9 are permanently removed, raising the performance ceiling A.
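A minimal sketch of such a filter, with a hypothetical running pass-rate tracker (the 0.9 threshold comes from the text above):

```python
from collections import defaultdict

THRESHOLD = 0.9
pass_counts = defaultdict(lambda: [0, 0])    # prompt_id -> [passes, tries]
retired = set()

def record(prompt_id, rewards):
    passes, tries = pass_counts[prompt_id]
    pass_counts[prompt_id] = [passes + sum(rewards), tries + len(rewards)]
    p, t = pass_counts[prompt_id]
    if p / t > THRESHOLD:
        retired.add(prompt_id)               # never sampled again

record("p1", [1, 1, 1, 1])                   # 100% pass rate -> retired
print("p1" in retired)                       # True
```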
ScaleRL: A Mature RL Recipe
Pipeline‑RL with an 8‑step off‑policy horizon.
Interruption-based length control (when the length limit is reached, generation is interrupted and the model is prompted to give its final answer directly).
FP32 precision for logits.
CISPO loss.
Prompt‑level loss aggregation.
Zero‑variance filtering.
No-positive resampling (prompts with persistently high pass rates are dropped from later epochs, as in the adaptive filtering above).
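Pulling the pieces together, a hypothetical configuration summary of the recipe (keys and values are illustrative, not an actual API):

```python
scale_rl = {
    "async_scheme": "pipeline_rl",
    "off_policy_horizon_k": 8,
    "length_control": "interruption",   # truncate, then force a final answer
    "logits_dtype": "fp32",
    "loss": "cispo",
    "loss_aggregation": "prompt_average",
    "advantage_norm": "batch_std",
    "zero_variance_filtering": True,
    "no_positive_resampling": True,     # retire prompts with pass rate > 0.9
}
```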
Leave-one-out ablations, each removing a single component from the full configuration, confirm that every part contributes.
Factors Influencing RL Scaling
Model size: The scaling law holds for both the 8B dense model and the 17B × 16 MoE model.
Context length: Extending generation from 14k to 32k tokens reduces early learning efficiency (lower B) but raises the final performance ceiling (higher A).
Batch size : Smaller batches perform better early on; larger batches achieve higher ceilings as compute grows.
Rollout count : Varying the number of rollouts per prompt (8, 16, 24, 32) has negligible impact when total batch size is fixed.
Takeaways
RL performance improvement follows a sigmoid function of log-scaled compute, parameterized by a ceiling A, an efficiency B, and a compute midpoint C_mid.
Algorithmic design largely determines the achievable ceiling A; scaling-law curves fitted at small budgets can predict which methods will remain effective at larger scales.
Many tricks previously believed to be effective (loss aggregation, curriculum learning, length penalties, advantage normalization) mainly boost efficiency B without raising the ceiling A.
The ScaleRL recipe provides a practical, empirically‑validated set of guidelines for future LLM RL training.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
