How Agentic RL Enables a 14B LLM to Outperform Giant Models – Inside rStar2‑Agent

This article analyzes the rStar2‑Agent paper, revealing how Agentic Reinforcement Learning, the GRPO‑RoC algorithm, a high‑throughput code‑execution service, and a three‑stage training recipe let a modest 14‑billion‑parameter model surpass much larger LLMs on challenging math benchmarks.


Recent large‑language‑model (LLM) advances rely on test‑time scaling, i.e., generating longer chain‑of‑thought (CoT) sequences, but longer reasoning does not always mean smarter reasoning, especially when tool‑use errors and noisy feedback dominate.

The paper introduces Agentic Reinforcement Learning (Agentic RL), where the model becomes an active agent that interacts with external tools (e.g., a Python interpreter) and adapts its reasoning strategy based on the environment’s feedback.

Core Innovation 1: GRPO‑RoC Algorithm – Learning Efficiently in Noisy Environments

When a model calls a tool, the generated code often contains syntax or logic errors, producing a Traceback instead of a useful result. This “environment noise” misleads traditional outcome-only RL: because only the final answer is rewarded, a trajectory riddled with failed tool calls can still receive full credit as long as its final answer happens to be correct.
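To make that failure mode concrete, here is a minimal sketch (not the paper's code; the Trajectory fields are illustrative assumptions) of an outcome-only reward that ignores intermediate tool errors:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    final_answer: str
    tool_results: list = field(default_factory=list)  # raw interpreter outputs

def outcome_only_reward(traj: Trajectory, ground_truth: str) -> float:
    # Only the final answer is checked; intermediate tool errors are invisible.
    return 1.0 if traj.final_answer.strip() == ground_truth.strip() else 0.0

# A trajectory whose every tool call failed can still receive full credit:
noisy = Trajectory(final_answer="42",
                   tool_results=["Traceback: SyntaxError", "Traceback: NameError"])
print(outcome_only_reward(noisy, "42"))  # -> 1.0, despite the broken tool calls
```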

GRPO‑RoC solves this without altering the simple “correct‑answer‑only” reward. It adds a data‑level filter called Resample‑on‑Correct (RoC) that works as follows:

Oversample: for each question, generate 2G rollout trajectories instead of the usual G.

Asymmetric downsampling: split the 2G trajectories into positive samples (correct final answer) and negative samples (incorrect answer). For each trajectory, compute the tool error rate p_err (the proportion of tool calls that fail), the format violation rate p_format (the proportion of outputs that break the required <answer> / <reason> markup), and the total penalty p_total = p_err + p_format. Negatives are randomly downsampled to half their count; positives are prioritized by sampling them with probability inversely proportional to p_total, so cleaner trajectories (lower penalty) are more likely to survive (a code sketch of this filter follows below).

Policy update: use the filtered G trajectories to compute the advantage function and update the policy.

This approach keeps the reward simple while feeding the model more high‑quality positive examples and diverse negative examples, dramatically reducing tool‑error rates during training.
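A minimal Python sketch of the RoC filter as described above; the field names, the exact negative split, and the weighted-sampling scheme are illustrative assumptions, not the paper's implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    reward: float                 # 1.0 if the final answer is correct, else 0.0
    tool_error_rate: float        # p_err: fraction of tool calls that failed
    format_violation_rate: float  # p_format: fraction of format-breaking outputs

def resample_on_correct(rollouts: list[Rollout], G: int) -> list[Rollout]:
    """Downsample 2G oversampled rollouts to G: keep diverse negatives,
    prefer low-penalty positives (p_total = p_err + p_format)."""
    positives = [r for r in rollouts if r.reward > 0]
    negatives = [r for r in rollouts if r.reward == 0]

    # Negatives: uniform random downsampling (here: keep half, as described above).
    kept_neg = random.sample(negatives, len(negatives) // 2)

    # Positives: weighted sampling without replacement with weight ~ 1 / p_total,
    # so trajectories with fewer tool errors and format violations survive more often.
    n_pos = max(0, G - len(kept_neg))
    weights = [1.0 / (r.tool_error_rate + r.format_violation_rate + 1e-6) for r in positives]
    keys = [random.expovariate(w) for w in weights]  # smaller key = picked earlier
    order = sorted(range(len(positives)), key=lambda i: keys[i])
    kept_pos = [positives[i] for i in order[:n_pos]]

    return kept_pos + kept_neg
```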

Core Innovation 2: Large‑Scale Agent RL Infrastructure

Agentic RL requires massive, frequent model‑environment interactions, creating two engineering challenges:

Massive concurrent tool calls: a single training step may issue tens of thousands of Python executions, which would overwhelm a local CPU and leave GPUs idle.

Highly imbalanced multi-round rollouts: different questions generate vastly different numbers of tokens and tool calls, causing GPU load imbalance and synchronization delays.

To address these, rStar2‑Agent builds an isolated code‑execution service deployed on CPU clusters. A central task queue batches incoming requests, distributes them to many “execution workers,” and returns results to the RL process. The service processes >45 000 calls per second with an average latency of 0.3 s.
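A minimal sketch of the batching idea behind such a service, using an asyncio queue that fans code snippets out to subprocess workers; the sandboxing, scale, and API of the real service are beyond this illustration, and all names here are assumptions:

```python
import asyncio
import sys

async def execution_worker(queue: asyncio.Queue):
    """Pull (code, future) pairs from the central queue, run each snippet in a
    subprocess with a timeout, and return its combined output."""
    while True:
        code, future = await queue.get()
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-c", code,
            stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE)
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout=10)
            future.set_result((out + err).decode())
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            future.set_result("Timeout")
        queue.task_done()

async def run_service(requests: list[str], n_workers: int = 64) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(execution_worker(queue)) for _ in range(n_workers)]
    loop = asyncio.get_running_loop()
    futures = []
    for code in requests:
        fut = loop.create_future()
        futures.append(fut)
        queue.put_nowait((code, fut))
    results = await asyncio.gather(*futures)
    for w in workers:
        w.cancel()
    return list(results)

# Example: asyncio.run(run_service(["print(1 + 1)", "import math; print(math.pi)"]))
```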

For the second challenge, a dynamic rollout scheduler monitors each GPU’s KV‑cache capacity and assigns new rollout tasks to GPUs that still have headroom, ensuring all GPUs stay busy without exceeding memory limits.
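A toy illustration of the load-aware assignment rule; the numbers and function names are ours, not the paper's:

```python
# Assign each new rollout to the GPU whose KV cache has the most free headroom,
# and refuse assignment if no GPU can fit the estimated rollout length.
def pick_gpu(kv_used_tokens: list[int], kv_capacity_tokens: int, est_new_tokens: int) -> int | None:
    headroom = [kv_capacity_tokens - used for used in kv_used_tokens]
    best = max(range(len(headroom)), key=lambda i: headroom[i])
    return best if headroom[best] >= est_new_tokens else None

# Example: 4 GPUs, 128K-token KV budget each, new rollout estimated at 12K tokens.
print(pick_gpu([90_000, 40_000, 120_000, 70_000], 128_000, 12_000))  # -> 1
```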

Core Innovation 3: Efficient Training – Low‑Cost, High‑Performance

The training recipe consists of three stages:

Stage 1 (concise RL, 8K max length): train on 42K math problems with a strict 8K-token limit. Early truncations force the model to use tools efficiently; response length quickly stabilizes around 4K tokens.

Stage 2 (12K max length): once performance plateaus, increase the length limit to 12K, allowing the model to handle more complex problems; average response length grows to ~6K tokens.

Stage 3 (hard-sample focus, 12K max length): select the ~17.3K hardest problems (those still unsolved) and continue training, pushing average response length to ~8K tokens and reaching peak performance.
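For reference, the schedule above can be summarized in a small configuration sketch; the values come from the text, while the structure and key names are purely illustrative:

```python
STAGES = [
    {"stage": 1, "max_response_tokens": 8_000,  "data": "42K math problems"},
    {"stage": 2, "max_response_tokens": 12_000, "data": "42K math problems"},
    {"stage": 3, "max_response_tokens": 12_000, "data": "~17.3K hardest, still-unsolved problems"},
]
```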

Before RL, a non‑reasoning SFT step teaches the model instruction following, JSON‑style tool calling, and answer formatting without any math‑reasoning data. This prevents over‑fitting to a specific reasoning pattern and yields short initial responses (~1 K tokens), which are ideal for the subsequent RL phases.

Results are striking: the 14B rStar2‑Agent achieves 80.6% pass@1 on AIME 2024 and 69.8% on AIME 2025, surpassing OpenAI o3‑mini, DeepSeek‑R1 (671B), and Claude Opus 4.0, while using only 510 RL steps and a week of training on 64 GPUs.

Token‑Entropy Analysis: How the Model Thinks Smarter

By examining token entropy, the authors identify two high‑entropy patterns:

Forking Tokens: tokens that appear when the model self‑reflects, asks a question, or plans verification (e.g., “But before…”, “let me double‑check…”). These drive exploration.

Reflection Tokens: tokens generated after receiving tool feedback, used to analyse the error, propose a fix, and produce corrected code. Example snippets include “To verify …” after a successful call and “The error occurred because …” followed by a workaround after a failure.

These patterns show that Agentic RL equips the model with a genuine feedback‑driven reasoning loop, not just longer CoT chains.
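A minimal sketch of how per-token entropy can be computed from model logits; tensor shapes are assumed and this is not the authors' analysis code:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: [seq_len, vocab_size] -> entropy in nats for each generated token.
    High-entropy positions tend to be the forking and reflection tokens above."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

# Example with random logits for a 5-token sequence over a 32K-token vocabulary:
print(token_entropy(torch.randn(5, 32_000)).shape)  # -> torch.Size([5])
```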

Discussion and Lessons Learned

Over‑long filtering (dropping truncated trajectories) unintentionally encouraged the model to produce even longer, repetitive text because it removed negative feedback.

N‑gram repetition filtering harmed useful verification calls, demonstrating that overly complex rule‑based rewards can be detrimental.

The study confirms that a simple final‑answer reward combined with data‑level RoC sampling yields robust learning, reduces bias, and preserves exploration.

RL mainly unlocks capabilities already present in the pretrained model; it cannot create capacity beyond the model’s inherent limits, highlighting the importance of efficient RL to reach the performance ceiling.

Conclusion

Agentic RL proves that making a model “think smarter” through tool interaction and feedback‑driven reflection is far more effective than merely extending reasoning time. With the GRPO‑RoC algorithm, a high‑throughput execution service, and a three‑stage low‑cost training recipe, a modest 14 B model attains state‑of‑the‑art math performance and strong generalization across scientific and alignment tasks.

Tags: Artificial Intelligence, LLM, AI Research, Model Efficiency, Tool Use, Agentic Reinforcement Learning, RL Optimization
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
