How SPEC‑RL Boosts On‑Policy Reinforcement Learning Speed by Up to 3×

SPEC‑RL introduces speculative rollouts that reuse verified historical rollouts as prefixes, cutting rollout time by 2–3× while maintaining or improving performance across math and reasoning benchmarks. It integrates seamlessly with PPO, GRPO, DAPO, and other on‑policy algorithms.

Shopee Tech Team

Introduction

Large models are becoming smarter, but eliciting genuine reasoning often relies on reinforcement learning with verifiable rewards (RLVR), an on‑policy paradigm with a costly rollout phase: each training iteration regenerates the entire reasoning trace, even when large portions overlap with previous rounds.

The Xiamen University‑Shopee‑Tsinghua team proposes SPEC‑RL, a method that accelerates training severalfold without sacrificing the model's reasoning ability.

Key Advantages

2–3× training speedup: SPEC‑RL dramatically shortens rollout time while preserving performance.

Avoids duplicate generation: Reuses verified historical rollouts as “speculative prefixes”, generating only the new portion.

Seamless integration: Naturally compatible with PPO, GRPO, DAPO, and other mainstream algorithms.

Stable performance: Improves or maintains accuracy on GSM8K, MATH‑500, OlympiadBench, MMLU‑STEM, and other reasoning tasks.

Research Motivation: Wasteful Duplicate Generation

Traditional RLVR training forces the model to regenerate the full reasoning trace each round, guaranteeing data‑model alignment but causing massive redundant computation. The team observed that 50%–70% of generated content overlaps across training rounds, indicating room for optimization.

Method Design: Speculative Decoding Framework

Inspired by speculative decoding, SPEC‑RL treats the cached historical rollout as a draft token sequence. The current policy validates this speculative prefix: if the probability gap at each token is small, the token is accepted; otherwise generation resumes from the first point of divergence. This yields a new sequence composed of a speculative prefix plus a freshly generated suffix.
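The following is a minimal sketch of that verification step, assuming per‑token log‑probabilities of the cached rollout are available under both the old (data‑generating) policy and the current one. The function name, tensor layout, and the speculative‑decoding‑style stochastic acceptance rule are our illustration of the idea, not necessarily the paper's exact formulation:

```python
import torch

def accept_prefix_len(new_logp: torch.Tensor,
                      old_logp: torch.Tensor,
                      lenience: float = 1.0) -> int:
    """How many cached draft tokens does the current policy accept?

    new_logp / old_logp: per-token log-probs of the cached rollout under
    the current and the old policy, respectively (1-D float tensors).
    """
    # Accept token t with probability min(1, lenience * p_new / p_old),
    # as in speculative decoding; lenience > 1 tolerates a larger
    # probability gap, so longer prefixes survive verification.
    ratio = (new_logp - old_logp).exp() * lenience
    accept = torch.rand_like(ratio) < ratio.clamp(max=1.0)
    # The accepted prefix ends at the first rejected token.
    rejected = (~accept).nonzero(as_tuple=True)[0]
    return int(rejected[0]) if rejected.numel() > 0 else new_logp.numel()
```

Generation then resumes from the returned position, so only the suffix incurs fresh decoding cost.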

Controlled "Lenience" Parameter

SPEC‑RL introduces a lenience parameter that balances efficiency and exploration. Higher lenience allows longer prefix reuse and faster training, while lower lenience enforces stricter validation for better exploration.
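A toy experiment on synthetic log‑probabilities (reusing the accept_prefix_len sketch above; the numbers are purely illustrative, not the paper's) shows how lenience moves the accepted‑prefix length:

```python
import torch

torch.manual_seed(0)
old_logp = torch.randn(256) - 1.0             # fake per-token log-probs
new_logp = old_logp + 0.1 * torch.randn(256)  # current policy drifts slightly

for lenience in (1.0, 1.65, 5.0):             # ~e^0, ~e^0.5, well past e^0.5
    mean_len = sum(accept_prefix_len(new_logp, old_logp, lenience)
                   for _ in range(500)) / 500
    print(f"lenience={lenience:.2f}: mean accepted prefix ~ {mean_len:.0f} tokens")
```

Larger lenience accepts more of the cached rollout per round; the ablation below suggests where that trade‑off stops paying off.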

Seamless Embedding into Existing Training

SPEC‑RL acts as a plug‑in for the rollout phase, requiring no changes to reward functions or policy updates; it can be inserted directly into PPO, GRPO, and DAPO pipelines.
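A schematic of where this sits in the training loop; policy.generate, policy.logprobs_of, and the cache interface below are placeholders we invented for illustration, not verl's actual API:

```python
def spec_rl_rollout(policy, prompts, cache, lenience=1.65):
    """Hypothetical drop-in replacement for the rollout phase."""
    batch = []
    for prompt in prompts:
        prefix = []
        old = cache.get(prompt)  # tokens + log-probs from the previous round
        if old is not None:
            # Verify the cached rollout under the current policy and keep
            # only the accepted prefix (accept_prefix_len as sketched above).
            new_logp = policy.logprobs_of(prompt, old.tokens)
            prefix = old.tokens[:accept_prefix_len(new_logp, old.logp, lenience)]
        # Only the suffix is generated fresh; rewards and the policy
        # update downstream see an ordinary full trajectory.
        tokens = prefix + policy.generate(prompt, prefix=prefix)
        cache.put(prompt, tokens, policy.logprobs_of(prompt, tokens))
        batch.append(tokens)
    return batch
```

Because the output is an ordinary full trajectory, everything after the rollout phase runs unmodified.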

Experimental Results: Speed and Performance Gains

The team implemented SPEC‑RL on the verl framework and evaluated it on math reasoning datasets (GSM8K, MATH‑500, Minerva, OlympiadBench, AMC23) and cross‑domain benchmarks (MMLU‑STEM, IFEval). The main experiment shows an average 2.31× speedup and a 66% reduction in generated tokens, while accuracy remains stable or improves.

Average speedup 2.31×, peak 2.88× (Qwen‑3‑8B + DAPO).

Performance stable on GSM8K, MATH‑500; notable gains on OlympiadBench.

The method is algorithm‑agnostic and model‑agnostic, benefiting both small (1.7B) and large (8B) models across PPO, GRPO, and DAPO.

Small model (1.7B) under PPO gains nearly 2× speedup.

Large model (8B) under DAPO gains up to 2.88×.

Ablation Study: Effect of Lenience

Low lenience (≈1) enforces strict verification, yielding limited reuse and modest speedup; moderate lenience (≈e^0.5) achieves the best trade‑off, with 2–3× speedup and unchanged accuracy; excessive lenience collapses performance despite a massive speedup.

Future Outlook

Adaptive lenience that automatically balances efficiency and exploration.

System‑level optimizations combining KV‑cache and state reuse.

Extending to multi‑turn dialogue, off‑policy RL, and broader LLM applications.

Overall, SPEC‑RL addresses the rollout bottleneck in reinforcement learning training, delivering 2–3× efficiency gains without altering reward functions, and represents a key step toward scalable, low‑cost large‑model RL.

Tags: large language models, reinforcement learning, training acceleration, AI efficiency, speculative rollout
Written by

Shopee Tech Team

How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.
