Mastering Reinforcement Learning: From Basics to Advanced Agentic RL Techniques
This comprehensive guide walks through reinforcement learning fundamentals, MDP modeling, value functions, Bellman equations, and key algorithms such as Q‑learning, REINFORCE, PPO, DPO, and GRPO, then contrasts LLM‑RL with Agentic‑RL and surveys leading industry frameworks and real‑world applications.
Reinforcement Learning (RL) Fundamentals
RL is a core branch of machine learning in which an agent interacts with an environment to learn a policy that maximizes the expected cumulative reward. The main components, tied together in the interaction-loop sketch after this list, are:
Agent – the learning entity (e.g., robot, language model, industrial arm)
Environment – the external context (e.g., road network, dialogue system, factory floor)
State – the current observation of the environment
Action – the decision taken by the agent (e.g., brake, generate a token, move a robotic joint)
Reward – scalar feedback from the environment (positive or negative)
Policy – a mapping from states to action probabilities
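To make these pieces concrete, here is a minimal sketch of one episode of the interaction loop. It assumes the Gymnasium library and its CartPole-v1 environment with a uniformly random policy; the specific environment and policy are illustrative choices, not part of the discussion above.

import gymnasium as gym

env = gym.make("CartPole-v1")            # environment
state, info = env.reset(seed=0)          # initial state (observation)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()   # policy: here, uniformly random
    state, reward, terminated, truncated, info = env.step(action)  # environment feedback
    total_reward += reward               # accumulate the scalar reward signal
    done = terminated or truncated

print("episode return:", total_reward)

A learned policy replaces the random sampling line; everything else in the loop stays the same.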
Markov Decision Process (MDP) Formalism
An RL problem is modeled as an MDP, a 5‑tuple (S, A, P, R, \gamma):
State space S : all possible observations (board configurations, robot poses, dialogue histories).
Action space A : all admissible actions in a state (move, speak, query, write).
Transition probability P(s'\mid s,a) : the (often unknown) distribution over next states given the current state and action.
Reward function R(s,a,s') : immediate scalar feedback.
Discount factor \gamma \in (0,1) : determines the importance of future rewards.
Value Functions
Value functions compress the expected return of a trajectory into a scalar:
State‑value V(s) : expected discounted return when starting from state s and following policy \pi.
State‑action value Q(s,a) : expected return after taking action a in state s and then following \pi.
Advantage A(s,a)=Q(s,a)-V(s) : measures how much better an action is compared with the average action under the current policy; used to reduce variance in policy‑gradient methods.
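As a small numerical sketch of these definitions, the snippet below computes discounted returns G_t for a toy trajectory and a Monte-Carlo advantage estimate by subtracting assumed state-value estimates; the reward and value numbers are made up purely for illustration.

import numpy as np

gamma = 0.99
rewards = [0.0, 0.0, 1.0]          # toy trajectory rewards (illustrative)
values  = [0.5, 0.6, 0.9]          # assumed V(s_t) estimates, e.g. from a critic

# Discounted return G_t = sum_{k>=t} gamma^(k-t) * r_k, computed backwards
returns = np.zeros(len(rewards))
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G

# Monte-Carlo advantage estimate: A(s_t, a_t) ≈ G_t - V(s_t)
advantages = returns - np.array(values)
print(returns, advantages)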
Bellman Equations
The Bellman expectation equation expresses the recursive relationship of value functions. For the optimal policy the Bellman optimality equation holds, leading to the classic Q‑learning target:
V(s) = \mathbb{E}_{a\sim\pi,\; s'\sim P}[R(s,a) + \gamma \; V(s')]
Q(s,a) = R(s,a) + \gamma \; \mathbb{E}_{s'\sim P}[\max_{a'} Q(s',a')]
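To see the Bellman optimality equation in action, here is a small value-iteration sketch on a toy, fully known MDP with two states and two actions; the transition and reward numbers are made up purely for illustration.

import numpy as np

# Toy MDP (all numbers illustrative): |S| = 2 states, |A| = 2 actions
P = np.array([  # P[s, a, s'] = transition probability
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([  # R[s, a] = expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * (P @ V)        # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("V* ≈", V, "greedy policy:", Q.argmax(axis=1))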
Core RL Algorithms
1. Value‑Based: Q‑Learning
Q‑learning learns an approximation of the optimal Q function without a model of the transition dynamics. The update rule for a tabular implementation is:
import random
import numpy as np

# Hyper‑parameters
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.1    # exploration probability

# Initialise Q(s, a) arbitrarily (zeros); assumes a discrete-state, discrete-action env
# and that `env` (classic gym API) and `num_episodes` are defined elsewhere
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        # ε‑greedy action selection
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # TD target: bootstrap from the best next action unless the episode ended
        target = r + (0 if done else gamma * np.max(Q[s_next]))
        # Q‑update: move Q(s, a) a step of size alpha toward the TD target
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

In deep RL the table Q is replaced by a neural network Q_\theta(s,a) trained with stochastic gradient descent on the same target.
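As a hedged sketch of that function-approximation step, the snippet below computes the same TD target with a small PyTorch network instead of a table. The network size and the separate frozen target network (as in DQN) are assumptions, not something the tabular pseudocode above prescribes.

import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # frozen copy, refreshed periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(s, a, r, s_next, done):
    """One gradient step toward the Q-learning target r + γ max_a' Q_target(s', a').

    s, s_next: float tensors [batch, n_states]; a: long tensor [batch];
    r, done: float tensors [batch].
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q_θ(s, a)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()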
2. Policy‑Based: REINFORCE
REINFORCE treats the policy \pi_\theta as a differentiable model and performs gradient ascent on the expected return. The Monte‑Carlo policy‑gradient estimator is:
for each episode:
    collect a trajectory (s_0, a_0, r_0, …, s_T)
    compute discounted returns G_t = \sum_{k=t}^{T}\gamma^{k-t} r_k
    g = 0
    for each timestep t:
        g += G_t * ∇_θ log π_θ(a_t|s_t)
    θ ← θ + α * g

Baseline techniques (e.g., subtracting a state‑value estimate) are often added to reduce variance.
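A minimal PyTorch rendering of the same estimator is sketched below. The policy network, the Categorical action distribution, and the use of a negative weighted log-probability as a surrogate loss are assumptions about one common implementation style, not the only way to write it.

import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE step on a single trajectory.

    states/actions: lists of tensors (one per timestep); rewards: list of floats.
    """
    # Discounted returns G_t, computed backwards over the trajectory
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Surrogate loss whose gradient equals -sum_t G_t * ∇_θ log π_θ(a_t|s_t)
    logits = policy_net(torch.stack(states))                        # (T, n_actions)
    log_probs = Categorical(logits=logits).log_prob(torch.stack(actions))
    loss = -(returns * log_probs).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Subtracting a learned value estimate from `returns` before the multiplication gives the lower-variance baseline variant mentioned above.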
3. Actor‑Critic: Proximal Policy Optimization (PPO)
PPO stabilizes policy‑gradient training by clipping the probability ratio between the new and old policies. The clipped surrogate objective is:
r_t(θ) = \frac{π_θ(a_t|s_t)}{π_{old}(a_t|s_t)}
L^{CLIP}_t(θ) = \min\big(r_t(θ) A_t, \text{clip}(r_t(θ), 1-\epsilon, 1+\epsilon) A_t\big)

The full loss combines the surrogate, a value‑function loss, and an entropy bonus:

loss = -L^{CLIP} + c_1 \; \text{MSE}(V_φ(s_t), G_t) - c_2 \; \mathcal{H}(π_θ(·|s_t))

Generalized Advantage Estimation (GAE) is commonly used to compute A_t, with a bias‑variance trade‑off controlled by λ.
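The clipped objective itself is only a few lines. The sketch below assumes per-step log-probabilities, advantages, value predictions, returns, and entropies have already been computed, so it shows just the loss combination rather than a full training loop; the coefficient values are illustrative defaults.

import torch

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate + value loss + entropy bonus (signs follow the text above)."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()           # -L^CLIP
    value_loss = torch.nn.functional.mse_loss(values, returns)    # critic regression
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

In practice the advantages passed in are usually GAE estimates, often normalized per batch before the update.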
4. Direct Preference Optimization (DPO)
DPO eliminates the RL loop by solving the KL‑regularized RLHF objective in closed form. The training objective is a logistic loss that aligns the policy with human preference pairs (x, y^+, y^-):
# Inputs: reference model π_ref (frozen), trainable model π_θ, preference pairs (x, y⁺, y⁻)
for minibatch in data:
    logp⁺_θ = log π_θ(y⁺|x)
    logp⁻_θ = log π_θ(y⁻|x)
    logp⁺_ref = log π_ref(y⁺|x)
    logp⁻_ref = log π_ref(y⁻|x)
    Δθ = logp⁺_θ - logp⁻_θ
    Δref = logp⁺_ref - logp⁻_ref
    logits = β * (Δθ - Δref)
    dpo_loss = -mean(log sigmoid(logits))
    θ ← θ - lr * ∇_θ dpo_loss

Only two language models are required (reference and trainable), and no separate reward model or on‑policy sampling is needed.
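Translated into PyTorch, the same loss looks roughly like the sketch below; it assumes the per-sequence log-probabilities (summed over tokens) of the chosen and rejected responses have already been computed for both the trainable model and the frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO logistic loss over preference pairs (all inputs: tensors of shape [batch])."""
    delta_theta = logp_chosen - logp_rejected        # policy log-ratio on the pair
    delta_ref = ref_logp_chosen - ref_logp_rejected  # reference log-ratio on the pair
    logits = beta * (delta_theta - delta_ref)
    return -F.logsigmoid(logits).mean()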
5. Group‑Relative Policy Optimization (GRPO)
GRPO removes the critic and uses the average reward of a sampled group as a baseline, keeping PPO‑style clipping and an explicit KL regularizer against a frozen reference policy.
for outer iteration:
    π_ref ← π_θ (frozen)
    for step:
        # Sample a batch of prompts q
        for each q:
            # Sample G outputs {o_i} from the current policy π_old
            o_i ~ π_old(·|q)
            r_i = reward_model(o_i, q)
            # Group‑relative advantage
            Â_i = (r_i - mean(r)) / std(r)
        for k in range(μ):
            ratio = π_θ(o_i|q) / π_old(o_i|q)
            L_clip = mean(min(ratio * Â_i, clip(ratio, 1-ε, 1+ε) * Â_i))
            KL = KL(π_θ || π_ref)
            loss = -L_clip + λ * KL
            θ ← θ - lr * ∇_θ loss
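A sketch of the group-relative step in PyTorch is shown below. It assumes sequence-level log-probabilities and one scalar reward per sampled output, and it omits the sampling loop and the outer reference-policy refresh; the clipping and KL coefficients are illustrative defaults.

import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, clip_eps=0.2, kl_coef=0.04):
    """Group-relative clipped loss for G sampled outputs of one prompt.

    logp_new / logp_old : log π_θ(o_i|q) and log π_old(o_i|q), shape [G]
    rewards             : scalar rewards r_i, shape [G]
    kl_to_ref           : per-sample KL estimate against the frozen reference, shape [G]
    """
    # Group-relative advantage: normalize rewards within the sampled group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    l_clip = torch.min(ratio * adv, clipped * adv).mean()
    return -l_clip + kl_coef * kl_to_ref.mean()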
LLM‑RL vs. Agentic‑RL
LLM‑RL treats a language model as a one‑shot policy: given a prompt it generates a complete answer, receives a single scalar reward (often from a reward model or human preference), and updates the token distribution (e.g., via PPO or DPO). The environment is static, actions are tokens or whole responses, and credit assignment is coarse.
Agentic‑RL models the full decision loop: the state includes environment observations and internal memory, actions are tool selections, API calls, or sub‑task planning, and rewards can be provided at multiple steps. This enables optimization of system‑level KPIs such as task success rate, latency, cost, and safety.
Key Differences
Environment & Interaction : LLM‑RL interacts once per prompt; Agentic‑RL interacts repeatedly with a dynamic environment (databases, APIs, users).
Action Granularity : LLM‑RL actions are tokens or whole responses; Agentic‑RL actions are high‑level decisions (choose tool, issue SQL, decide to continue).
Reward Signal : LLM‑RL typically receives a terminal reward; Agentic‑RL can receive dense rewards at intermediate steps, allowing precise credit assignment (see the sketch after this list).
Optimization Objective : LLM‑RL optimizes alignment of output distribution; Agentic‑RL optimizes task‑level performance metrics (success, cost, safety).
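To contrast the two reward structures, here is a toy, entirely hypothetical agentic episode: each tool-use step emits its own reward, and discounted returns give every step its own credit, whereas a one-shot LLM-RL setup would collapse the whole episode into a single terminal score. The environment class, rewards, and tool names below are invented for illustration only.

class ToyAgentEnv:
    """Hypothetical three-step tool-use task with dense, step-level rewards."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return "task: answer the user question"
    def step(self, action):
        self.t += 1
        reward = {1: 0.1, 2: 0.2, 3: 1.0}[self.t]   # intermediate feedback per step
        done = self.t == 3
        return f"observation after {action}", reward, done

env = ToyAgentEnv()
state, done, gamma = env.reset(), False, 0.99
actions = ["search_web", "query_database", "write_answer"]  # illustrative tool choices
trajectory = []
while not done:
    action = actions[len(trajectory)]        # stand-in for a learned agent policy
    state, reward, done = env.step(action)
    trajectory.append((action, reward))

# Step-level credit assignment: discounted return from each step onward
G, returns = 0.0, []
for _, r in reversed(trajectory):
    G = r + gamma * G
    returns.insert(0, G)
print(returns)   # each step gets its own return, unlike a single terminal score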
Why Agentic‑RL Is Necessary
Real‑world tasks are long‑horizon, multi‑step processes (e.g., data agents that query databases, call APIs, write back results).
Static preference data cannot capture structured strategies such as tool selection, error recovery, or budget‑aware planning.
Online interaction creates a data flywheel: logs become RL episodes, enabling continuous on‑policy or off‑policy learning and rapid adaptation.
Industry Frameworks for Agentic‑RL
Hugging Face TRL – GitHub: https://github.com/huggingface/trl
ModelScope ms‑swift – GitHub: https://github.com/modelscope/ms-swift
Volcengine verl – GitHub: https://github.com/volcengine/verl
OpenPipe ART – GitHub: https://github.com/OpenPipe/ART
Microsoft Agent‑Lightning – GitHub: https://github.com/microsoft/agent-lightning
Notable Agentic‑RL Practices
GPT‑5‑Codex trains a coding agent that iteratively writes, runs, fixes, and submits code, using tool calls to IDEs and version control.
Tongyi DeepResearch combines long‑term research planning, web retrieval, and knowledge‑base updates with RL‑driven policy refinement.
Cursor 2.0 Composer provides an AI‑native coding environment where natural‑language intent is turned into full projects with multi‑turn interaction and real‑time debugging.
References
[1] Sutton R. S., Barto A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] Williams R. J. Simple statistical gradient‑following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[3] Christiano P. F. et al. Deep reinforcement learning from human preferences. NeurIPS, 2017.
[4] Schulman J. et al. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[5] Ouyang L. et al. Training language models to follow instructions with human feedback. NeurIPS, 2022.
[6] Rafailov R. et al. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023.
[7] Shao Z. et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
[8] Yao S. et al. ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629, 2022.
[9] Zhang G. et al. The landscape of agentic reinforcement learning for LLMs: A survey. arXiv:2509.02547, 2025.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.