Agentic RL: Transforming LLMs into Autonomous Decision‑Making Agents
This survey formalizes the shift from preference‑based reinforcement fine‑tuning to Agentic Reinforcement Learning, defines Agentic RL via MDP/POMDP abstractions, proposes a dual taxonomy of capabilities and task domains, compiles over 500 recent works, and outlines open challenges for scalable, robust AI agents.
1. Introduction
Agentic Reinforcement Learning (Agentic RL) treats large language models (LLMs) as learnable policies that operate inside sequential decision processes rather than as static, single‑step generators. In this paradigm an LLM receives observations from a partially observable, dynamic environment, maintains internal state (e.g., memory), plans actions, invokes external tools, and can self‑improve through reflection. Reinforcement learning provides the optimization signal that turns these capabilities into adaptive, robust behaviours.
2. Formalizing the Paradigm Shift
Traditional LLM‑RL is modelled as a Markov Decision Process (MDP) with a single decision step and fully observable state. Agentic RL is modelled as a Partially Observable Markov Decision Process (POMDP) where:
State s_t : the true environment configuration at step t, which may be hidden from the agent.
Observation o_t : the token sequence or sensor input available at time t.
Action a_t : a token, tool call, or higher‑level operation issued by the LLM.
Reward r_t : scalar feedback derived from task performance, human preference, or auxiliary metrics.
Transition P(s_{t+1}|s_t,a_t) and Observation model O(o_t|s_t) define the dynamics.
The objective is to maximise the expected discounted return J(π) = E_π[ ∑_{t=0}^{T} γ^t r_t ], where the policy π(a_t|o_{0:t}) is parameterised by the LLM.
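To make the formalism concrete, below is a minimal, framework-agnostic sketch of one rollout under this POMDP view. The `env` and `llm_policy` interfaces are hypothetical placeholders introduced for illustration, not APIs from the survey or any specific library.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)

def rollout(env, llm_policy, max_steps=32, gamma=0.99):
    """Collect one episode: the policy conditions on the full observation
    history o_{0:t}, emits an action a_t (free-form tokens or a tool call),
    and the environment returns r_t and the next observation."""
    traj = Trajectory()
    obs = env.reset()                                   # o_0: initial observation
    for t in range(max_steps):
        action = llm_policy(traj.observations + [obs])  # a_t ~ π(·|o_{0:t})
        next_obs, reward, done = env.step(action)       # hypothetical env interface
        traj.observations.append(obs)
        traj.actions.append(action)
        traj.rewards.append(reward)
        obs = next_obs
        if done:
            break
    # Discounted return ∑_t γ^t r_t — the quantity J(π) averages over episodes.
    episode_return = sum(gamma**t * r for t, r in enumerate(traj.rewards))
    return traj, episode_return
```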
3. Dual Taxonomy of Agentic RL
3.1 Capability‑centric axis
Core functions that can be enhanced by RL:
Planning : long‑horizon reasoning over future states.
Tool use : invoking APIs, retrieval‑augmented generation, or external executables (see the parsing sketch after this list).
Memory : persistent context across episodes.
Reasoning : logical inference and problem solving.
Self‑improvement (reflection) : updating internal knowledge or policies without external supervision.
Perception : processing multimodal observations (e.g., images, code).
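As one illustration of the tool‑use capability above, the sketch below parses a tagged tool call out of raw LLM output and dispatches it through a registry. The `<tool>…</tool>` tag format and the registry are assumptions made for illustration, not a convention defined by the survey.

```python
import json
import re

# Assumed tag format for tool calls embedded in LLM output (illustrative only).
TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def parse_action(llm_output: str, tools: dict):
    """Route LLM output: a tagged tool call is executed via the registry;
    anything else is treated as a plain text action."""
    match = TOOL_PATTERN.search(llm_output)
    if match is None:
        return {"type": "text", "content": llm_output}
    call = json.loads(match.group(1))      # e.g. {"name": "search", "args": {...}}
    result = tools[call["name"]](**call.get("args", {}))
    return {"type": "tool", "name": call["name"], "result": result}

# Example usage with a trivial one-entry registry:
tools = {"search": lambda query: f"results for {query!r}"}
print(parse_action('<tool>{"name": "search", "args": {"query": "POMDP"}}</tool>', tools))
```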
3.2 Task‑centric axis
Typical application domains where the above capabilities are instantiated:
Search and information retrieval.
GUI navigation and web automation.
Code generation and program synthesis.
Mathematical reasoning and theorem proving.
Multi‑agent coordination and collaborative problem solving.
4. Research Gaps and Contributions
Existing literature often isolates a single capability, targets a narrow domain, or uses custom environments, resulting in fragmented terminology and evaluation protocols. The survey addresses this gap by:
Formally distinguishing Agentic RL from traditional LLM‑RL using MDP/POMDP abstractions.
Introducing a capability‑centric taxonomy that enumerates the functions RL can optimise.
Compiling an open‑source reference of environments (e.g., Gymnasium and OpenAI Gym extensions for language agents), benchmark suites (e.g., LLM‑Bench, MiniWoB), and RL frameworks (e.g., Stable‑Baselines3, RLlib) that support training and evaluation of agentic LLMs; a minimal Gymnasium‑style sketch follows this list.
Analysing more than 500 recent papers to map the rapid development of the field and to highlight open challenges such as scalability, safety, and evaluation standardisation.
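For flavour, here is a minimal sketch of what a language‑agent environment can look like in the Gymnasium interface mentioned above. The toy QA task, reward rule, and class name are invented purely for illustration.

```python
import gymnasium as gym
from gymnasium import spaces

class ToyQAEnv(gym.Env):
    """Single-turn question answering framed as a one-step episode."""

    def __init__(self):
        # Observations and actions are both free-form text strings.
        self.observation_space = spaces.Text(max_length=256)
        self.action_space = spaces.Text(max_length=256)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return "Q: What does POMDP stand for?", {}  # (observation, info)

    def step(self, action: str):
        # Toy keyword-match reward; real benchmarks use richer verifiers.
        reward = float("partially observable" in action.lower())
        return "", reward, True, False, {}  # obs, reward, terminated, truncated, info
```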
5. Survey Structure
The remainder of the review is organised as follows:
Section 2 formalises the paradigm shift via MDP/POMDP notation.
Section 3 details the capability‑centric taxonomy (planning, reasoning, tool use, memory, self‑improvement, perception).
Section 4 surveys cross‑domain applications (search, GUI navigation, code generation, mathematical reasoning, multi‑agent systems).
Section 5 aggregates open‑source environments, RL frameworks, and benchmark suites.
Section 6 discusses open challenges (sample efficiency, safety, interpretability) and future directions toward scalable, trustworthy autonomous agents.
Section 7 concludes the review.
6. Representative Algorithms and References
Key RL algorithms applied to LLM agents include on‑policy methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), as well as off‑policy actor‑critic and Q‑learning variants. Preference‑based reinforcement fine‑tuning (PBRFT) pipelines train a reward model on human or AI preferences and optimise the LLM via RLHF, DPO, or RLAIF. Recent high‑capability models (e.g., OpenAI o1, DeepSeek‑R1) demonstrate the feasibility of integrating tool use and self‑evolution, motivating further research on Agentic RL.
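As a concrete instance, GRPO replaces a learned value critic with group statistics: for each prompt it samples a group of responses, scores them, and normalises each reward against the group mean and standard deviation. The sketch below shows only that advantage computation; clipping and other optimisation details vary across implementations.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Map per-response rewards within one sampled group to advantages:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 responses to the same prompt, scored 0/1 by a verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1.0, -1.0, -1.0, 1.0]
```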
7. Visual Overview
Figure (overview): the transition from static LLM‑RL (single‑step MDP) to Agentic RL (multi‑step POMDP), together with the dual capability/task taxonomy.
