Optimizing Structured Processes in the Large‑Model Era: From Reasoning to Agentic RL

The article analyzes how large‑model development has moved from reasoning to the agentic stage, compares open‑source and closed‑source capabilities, details Reasoning RL versus Agentic RL designs, and proposes skill‑centric data and verification mechanisms to close the performance gap.


Shift to the Agentic Stage

Large language models are described as progressing through five stages: dialogue, reasoning, agents, research, and management. The current competitive focus has moved from reasoning directly to the agentic stage. The open-source model DeepSeek-R1 matches or exceeds closed-source models on reasoning benchmarks, but no open-source model has yet reached closed-source performance on agentic tasks.

Growing Gap Between Open‑Source and Closed‑Source in the Agentic Era

The closed-source models Claude-Opus-4.6 and Gemini-3.1 dominate agentic benchmarks. The strongest open-source effort is Zhipu's GLM-5, which explicitly lists "Agentic RL" in its post-training roadmap.

Reasoning RL vs Agentic RL

Optimization goal: Reasoning RL improves step-wise logical correctness; Agentic RL targets long-horizon autonomy, tool use, environment interaction, self-correction, and dynamic planning.

Typical tasks: Reasoning RL handles math, scientific reasoning, algorithmic puzzles, and other structured problems with known answers; Agentic RL handles real software engineering, terminal operations, multi-step search, and long-duration agent tasks.

Trajectory length: Reasoning RL uses sequences of a few hundred to a few thousand tokens; Agentic RL requires extremely long rollouts.

Reward signal: Reasoning RL relies on outcome or process rewards; Agentic RL adds process-quality signals that emphasize planning robustness, error correction, and tool reliability.

Training style: Reasoning RL is mainly synchronous on-policy RL (GRPO + IcePop); Agentic RL uses a fully asynchronous, decoupled RL framework built on the slime infrastructure (minimal sketches of both regimes follow this list).

Infrastructure: Reasoning RL uses conventional RL training (group = 32, batch = 32, KL term removed for speed); Agentic RL employs a Multi-Task Rollout Orchestrator, separate inference and training engines, TITO, and dual-side importance sampling.

Rationale for separation: first let the model "think correctly", then let it "act steadily over long horizons and self-debug".

Benchmark improvements: Reasoning RL shows gains on Humanity's Last Exam, HMMT, MATH, and similar suites; Agentic RL shows gains on CC-Bench-V2 long-task, BrowseComp, software-engineering suites, τ²-Bench, GDPval, and related benchmarks.
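To make the reasoning-side recipe concrete, here is a minimal sketch of a GRPO-style group-relative advantage computation with a binary outcome reward, assuming a group of 32 samples per prompt as described above; the function names and the exact-match checker are illustrative, not the authors' code.

```python
import numpy as np

def outcome_reward(answer: str, reference: str) -> float:
    # Binary outcome reward: 1.0 if the final answer matches the reference.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    # GRPO normalizes each rollout's reward against its own group's mean and
    # standard deviation, so no learned value critic is needed.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, a group of 32 sampled completions (group = 32 per the article).
group = [f"candidate answer {i}" for i in range(32)]
rewards = [outcome_reward(a, "candidate answer 7") for a in group]
advantages = grpo_advantages(rewards)
# Every token of rollout i is then reinforced with advantages[i]; per the
# article, the KL term against the reference policy is dropped for speed.
```

Because the baseline comes from the group itself, a prompt where all 32 samples fail (or all succeed) yields zero advantage everywhere and contributes no gradient signal.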

Agentic RL addresses two weak behaviors: executing blindly without incorporating feedback, and re-thinking from scratch after every interaction.
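On the agentic side, the article names "dual-side importance sampling" without further detail. One plausible reading, sketched below purely as an assumption, is a token-level importance ratio between the current training policy and the stale rollout policy, clipped on both the upper and lower side to keep fully asynchronous, off-policy updates stable.

```python
import torch

def dual_side_is_weights(logp_train: torch.Tensor,
                         logp_rollout: torch.Tensor,
                         low: float = 0.8, high: float = 1.25) -> torch.Tensor:
    # Token-level importance ratios between the training policy and the
    # (possibly stale) inference-engine policy that generated the rollout.
    # Clamping on BOTH sides is a guess at what "dual-side" means here.
    ratio = torch.exp(logp_train - logp_rollout)
    return ratio.clamp(min=low, max=high)

# In a decoupled setup, the inference engine keeps generating trajectories
# while the training engine updates weights, so the two policies drift apart.
logp_train = torch.tensor([-1.2, -0.4, -2.3])
logp_rollout = torch.tensor([-1.0, -0.9, -2.0])
advantages = torch.tensor([0.5, 0.5, -0.2])
loss = -(dual_side_is_weights(logp_train, logp_rollout) * advantages).mean()
```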

Why Open‑Source Lags Further

Beyond purely technical differences, open-source providers lack the extensive enterprise-level API feedback data that closed-source vendors accumulate, which limits how far they can push agentic capabilities.

Lessons from the Reasoning Era

The “foundation model era” emphasized curated, balanced data and a multi-stage training schedule (broad → refined, attribution-reproduction, annealing-ablation) to preserve emergent abilities while improving quality on high-frequency issues. Long-context scenarios for enterprise needs were also optimized.

Open models such as DeepSeek V3 and Qwen 1.0 performed well under this regime.

Post‑training pipelines followed a three‑phase sequence: SFT → Preference Alignment → RL. Early observations indicated that synthetic data often outperformed RL when high‑frequency problems were known.

DeepSeek-R1 demonstrated a “zero-to-R1” bootstrap-sampling recipe that leveraged massive annotation advantages, embodying the “broad-then-refine”, “attribution-reproduction”, and “annealing-ablation” principles.
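As a rough illustration of bootstrap sampling, the loop below draws many candidate reasoning traces, keeps only those whose final answers pass a verifier, and reuses the survivors as fine-tuning data; `model.sample` and `verify` are hypothetical stand-ins, not DeepSeek's actual interfaces.

```python
def bootstrap_round(model, prompts, verify, k: int = 16):
    # One bootstrap round: sample k candidate traces per prompt, keep the
    # verified ones, and return them as SFT data for the next, stronger model.
    sft_data = []
    for prompt in prompts:
        for _ in range(k):
            trace, answer = model.sample(prompt)  # hypothetical API
            if verify(prompt, answer):            # e.g. exact-match outcome check
                sft_data.append({"prompt": prompt, "completion": trace})
    return sft_data
```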

Emerging Mechanisms in Agentic RL

Interleaved Thinking: continuously incorporates environment feedback while deciding the next action, analogous to a driver adjusting the steering based on road conditions.

Preserved Thinking: retains the full reasoning trace across turns to avoid logical breaks, similar to a student keeping draft work before writing the final answer (both mechanisms are sketched below).
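A minimal sketch of how the two mechanisms might combine in an agent loop, assuming generic `llm` and `env` interfaces (both illustrative): the model re-thinks after every fresh observation (interleaved), and the accumulated trace is carried forward intact rather than discarded between turns (preserved).

```python
def agent_loop(llm, env, task: str, max_turns: int = 20):
    # Interleaved thinking: re-think after every observation.
    # Preserved thinking: the trace of thoughts, actions, and observations
    # is never truncated between turns.
    trace = [f"TASK: {task}"]
    obs = env.reset()
    for _ in range(max_turns):
        trace.append(f"OBSERVATION: {obs}")
        # The model sees the entire preserved trace, not just the last step.
        thought, action = llm("\n".join(trace))
        trace.append(f"THOUGHT: {thought}")
        trace.append(f"ACTION: {action}")
        obs, done = env.step(action)
        if done:
            break
    return trace
```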

DeepSeek V3.2 introduced a “thinking retention” mechanism for environment feedback; DeepSeek-Math 2.0 proposed a self-verifiable structured-reasoning approach with a “super-verification” design.

Most newly released models (e.g., Qwen3) have not added specific Agentic RL enhancements, whereas GLM-5 is the first open-source model to explicitly target Agentic RL.

Skill‑Centric Data Paradigm

Claude’s “Skills” methodology treats mentor-junior interactions as high-quality data for strong agents. Applying DeepSeek-R1’s methodology to such data could first produce an “Agent Skill Zero” model, then refine it through RL + SFT + super-verification + thinking retention, moving from Skill Zero to Skill R1. This pathway may let open-source models with limited user data approach closed-source performance (a schematic staging follows).
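Read as a pipeline, that proposal might stage as follows; everything here is schematic, with the stage names taken from the article and all functions as toy placeholders rather than real training code.

```python
# Schematic staging of the proposed Skill Zero -> Skill R1 pathway.
# Every function is a toy placeholder; only the stage names come from the text.
def curate_skills(logs):
    # Mentor-junior interaction logs -> supervised fine-tuning pairs.
    return [{"prompt": p, "completion": c} for p, c in logs]

def sft(model, data):
    return {**model, "stage": "Agent Skill Zero", "sft_examples": len(data)}

def agentic_rl(model):
    # RL stage in which only super-verified rollouts earn reward.
    return {**model, "stage": "RL + super-verification"}

def enable_thinking_retention(model):
    return {**model, "preserved_thinking": True, "stage": "Skill R1"}

base = {"name": "open-source-model"}
logs = [("fix the failing test", "patch with explanation")]
skill_zero = sft(base, curate_skills(logs))
skill_r1 = enable_thinking_retention(agentic_rl(skill_zero))
```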

Overall Training Pipeline

The pipeline resembles a staged learning-rate schedule: start broad, then refine, attribute, anneal, and finally verify. Data, especially curated, synthetic, long-context, chain-of-thought, and Skills data, drives emergence, task navigation, reasoning, and agentic capability. The process constitutes a long-horizon structured-optimization search.
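Taken literally as a schedule, the stages can be written down as an ordered configuration; the stage names below come from the article, while the data labels and learning rates are invented placeholders for illustration.

```python
# Ordered stages of the described pipeline; all numbers are invented placeholders.
PIPELINE = [
    {"stage": "broad pretraining",        "data": "curated + synthetic",        "lr": 3e-4},
    {"stage": "refinement",               "data": "high-frequency issues",      "lr": 1e-4},
    {"stage": "attribution-reproduction", "data": "error traces",               "lr": 5e-5},
    {"stage": "annealing-ablation",       "data": "long-context + CoT",         "lr": 1e-5},
    {"stage": "verification (RL)",        "data": "Skills + verified rollouts", "lr": 5e-6},
]

for step in PIPELINE:
    print(f"{step['stage']:<26} lr={step['lr']:.0e}  data={step['data']}")
```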

