Mastering Reinforcement Learning: From Basics to Advanced Agentic RL Techniques
This comprehensive guide walks through reinforcement learning fundamentals, MDP modeling, value functions, Bellman equations, and key algorithms such as Q‑learning, REINFORCE, PPO, DPO, and GRPO, then contrasts LLM‑RL with Agentic‑RL and surveys leading industry frameworks and real‑world applications.
Reinforcement Learning (RL) Fundamentals
RL is a core branch of machine learning in which an agent interacts with an environment to learn a policy that maximizes the expected cumulative reward. The main components, tied together in the interaction-loop sketch after this list, are:
Agent – the learning entity (e.g., robot, language model, industrial arm)
Environment – the external context (e.g., road network, dialogue system, factory floor)
State – the current observation of the environment
Action – the decision taken by the agent (e.g., brake, generate a token, move a robotic joint)
Reward – scalar feedback from the environment (positive or negative)
Policy – a mapping from states to action probabilities
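To make these pieces concrete, here is a minimal sketch of one episode of the interaction loop. It assumes the Gymnasium library and its CartPole-v1 environment with a uniformly random policy; the specific environment and policy are illustrative choices, not part of the discussion above.

import gymnasium as gym

env = gym.make("CartPole-v1")            # environment
state, info = env.reset(seed=0)          # initial state (observation)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()   # policy: here, uniformly random
    state, reward, terminated, truncated, info = env.step(action)  # environment feedback
    total_reward += reward               # accumulate the scalar reward signal
    done = terminated or truncated

print("episode return:", total_reward)

A learned policy replaces the random sampling line; everything else in the loop stays the same.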
Markov Decision Process (MDP) Formalism
An RL problem is modeled as an MDP, a 5‑tuple (S, A, P, R, \gamma):
State space S : all possible observations (board configurations, robot poses, dialogue histories).
Action space A : all admissible actions in a state (move, speak, query, write).
Transition probability P(s'\mid s,a) : the (often unknown) distribution over next states given the current state and action.
Reward function R(s,a,s') : immediate scalar feedback.
Discount factor \gamma \in (0,1) : determines the importance of future rewards.
Value Functions
Value functions compress the expected return of a trajectory into a scalar:
State‑value V(s) : expected discounted return when starting from state s and following policy \pi.
State‑action value Q(s,a) : expected return after taking action a in state s and then following \pi.
Advantage A(s,a)=Q(s,a)-V(s) : measures how much better an action is compared with the average action under the current policy; used to reduce variance in policy‑gradient methods.
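As a small numerical sketch of these definitions, the snippet below computes discounted returns G_t for a toy trajectory and a Monte-Carlo advantage estimate by subtracting assumed state-value estimates; the reward and value numbers are made up purely for illustration.

import numpy as np

gamma = 0.99
rewards = [0.0, 0.0, 1.0]          # toy trajectory rewards (illustrative)
values  = [0.5, 0.6, 0.9]          # assumed V(s_t) estimates, e.g. from a critic

# Discounted return G_t = sum_{k>=t} gamma^(k-t) * r_k, computed backwards
returns = np.zeros(len(rewards))
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G

# Monte-Carlo advantage estimate: A(s_t, a_t) ≈ G_t - V(s_t)
advantages = returns - np.array(values)
print(returns, advantages)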
Bellman Equations
The Bellman expectation equation expresses the recursive relationship of value functions. For the optimal policy the Bellman optimality equation holds, leading to the classic Q‑learning target:
V(s) = \mathbb{E}_{a\sim\pi,\; s'\sim P}[R(s,a) + \gamma \; V(s')]
Q(s,a) = R(s,a) + \gamma \; \mathbb{E}_{s'\sim P}[\max_{a'} Q(s',a')]
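To see the Bellman optimality equation in action, here is a small value-iteration sketch on a toy, fully known MDP with two states and two actions; the transition and reward numbers are made up purely for illustration.

import numpy as np

# Toy MDP (all numbers illustrative): |S| = 2 states, |A| = 2 actions
P = np.array([  # P[s, a, s'] = transition probability
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([  # R[s, a] = expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * (P @ V)        # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("V* ≈", V, "greedy policy:", Q.argmax(axis=1))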
Core RL Algorithms
1. Value‑Based: Q‑Learning
Q‑learning learns an approximation of the optimal Q function without a model of the transition dynamics. The update rule for a tabular implementation is:
import random
import numpy as np

# Hyper‑parameters
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.1    # exploration probability

# Initialise Q(s, a) arbitrarily (zeros); assumes a discrete-state, discrete-action env
# and that `env` (classic gym API) and `num_episodes` are defined elsewhere
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        # ε‑greedy action selection
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # TD target: bootstrap from the best next action unless the episode ended
        target = r + (0 if done else gamma * np.max(Q[s_next]))
        # Q‑update: move Q(s, a) a step of size alpha toward the TD target
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

In deep RL the table Q is replaced by a neural network Q_\theta(s,a) trained with stochastic gradient descent on the same target.
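As a hedged sketch of that function-approximation step, the snippet below computes the same TD target with a small PyTorch network instead of a table. The network size and the separate frozen target network (as in DQN) are assumptions, not something the tabular pseudocode above prescribes.

import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # frozen copy, refreshed periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(s, a, r, s_next, done):
    """One gradient step toward the Q-learning target r + γ max_a' Q_target(s', a').

    s, s_next: float tensors [batch, n_states]; a: long tensor [batch];
    r, done: float tensors [batch].
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q_θ(s, a)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()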
2. Policy‑Based: REINFORCE
REINFORCE treats the policy \pi_\theta as a differentiable model and performs gradient ascent on the expected return. The Monte‑Carlo policy‑gradient estimator is:
for each episode:
    collect a trajectory (s_0, a_0, r_0, …, s_T)
    compute discounted returns G_t = \sum_{k=t}^{T}\gamma^{k-t} r_k
    g = 0
    for each timestep t:
        g += G_t * ∇_θ log π_θ(a_t|s_t)
    θ ← θ + α * g

Baseline techniques (e.g., subtracting a state‑value estimate) are often added to reduce variance.
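A minimal PyTorch rendering of the same estimator is sketched below. The policy network, the Categorical action distribution, and the use of a negative weighted log-probability as a surrogate loss are assumptions about one common implementation style, not the only way to write it.

import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE step on a single trajectory.

    states/actions: lists of tensors (one per timestep); rewards: list of floats.
    """
    # Discounted returns G_t, computed backwards over the trajectory
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Surrogate loss whose gradient equals -sum_t G_t * ∇_θ log π_θ(a_t|s_t)
    logits = policy_net(torch.stack(states))                        # (T, n_actions)
    log_probs = Categorical(logits=logits).log_prob(torch.stack(actions))
    loss = -(returns * log_probs).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Subtracting a learned value estimate from `returns` before the multiplication gives the lower-variance baseline variant mentioned above.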
3. Actor‑Critic: Proximal Policy Optimization (PPO)
PPO stabilizes policy‑gradient training by clipping the probability ratio between the new and old policies. The clipped surrogate objective is:
r_t(θ) = \frac{π_θ(a_t|s_t)}{π_{old}(a_t|s_t)}
L^{CLIP}_t(θ) = \min\big(r_t(θ) A_t, \text{clip}(r_t(θ), 1-\epsilon, 1+\epsilon) A_t\big)

The full loss combines the surrogate, a value‑function loss, and an entropy bonus:

loss = -L^{CLIP} + c_1 \; \text{MSE}(V_φ(s_t), G_t) - c_2 \; \mathcal{H}(π_θ(·|s_t))

Generalized Advantage Estimation (GAE) is commonly used to compute A_t, with a bias‑variance trade‑off controlled by λ.
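The clipped objective itself is only a few lines. The sketch below assumes per-step log-probabilities, advantages, value predictions, returns, and entropies have already been computed, so it shows just the loss combination rather than a full training loop; the coefficient values are illustrative defaults.

import torch

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate + value loss + entropy bonus (signs follow the text above)."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()           # -L^CLIP
    value_loss = torch.nn.functional.mse_loss(values, returns)    # critic regression
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

In practice the advantages passed in are usually GAE estimates, often normalized per batch before the update.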
4. Direct Preference Optimization (DPO)
DPO eliminates the RL loop by solving the KL‑regularized RLHF objective in closed form. The training objective is a logistic loss that aligns the policy with human preference pairs (x, y^+, y^-):
# Inputs: reference model π_ref (frozen), trainable model π_θ, preference pairs (x, y⁺, y⁻)
for minibatch in data:
    logp⁺_θ = log π_θ(y⁺|x)
    logp⁻_θ = log π_θ(y⁻|x)
    logp⁺_ref = log π_ref(y⁺|x)
    logp⁻_ref = log π_ref(y⁻|x)
    Δθ = logp⁺_θ - logp⁻_θ
    Δref = logp⁺_ref - logp⁻_ref
    logits = β * (Δθ - Δref)
    dpo_loss = -mean(log sigmoid(logits))
    θ ← θ - lr * ∇_θ dpo_loss

Only two language models are required (reference and trainable), and no separate reward model or on‑policy sampling is needed.
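Translated into PyTorch, the same loss looks roughly like the sketch below; it assumes the per-sequence log-probabilities (summed over tokens) of the chosen and rejected responses have already been computed for both the trainable model and the frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO logistic loss over preference pairs (all inputs: tensors of shape [batch])."""
    delta_theta = logp_chosen - logp_rejected        # policy log-ratio on the pair
    delta_ref = ref_logp_chosen - ref_logp_rejected  # reference log-ratio on the pair
    logits = beta * (delta_theta - delta_ref)
    return -F.logsigmoid(logits).mean()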
5. Group‑Relative Policy Optimization (GRPO)
GRPO removes the critic and uses the average reward of a sampled group as a baseline, keeping PPO‑style clipping and an explicit KL regularizer against a frozen reference policy.
for outer iteration:
    π_ref ← π_θ (frozen)
    for step:
        # Sample a batch of prompts q
        for each q:
            # Sample G outputs {o_i} from the current policy π_old
            o_i ~ π_old(·|q)
            r_i = reward_model(o_i, q)
            # Group‑relative advantage
            Â_i = (r_i - mean(r)) / std(r)
        for k in range(μ):
            ratio = π_θ(o_i|q) / π_old(o_i|q)
            L_clip = mean(min(ratio * Â_i, clip(ratio, 1-ε, 1+ε) * Â_i))
            KL = KL(π_θ || π_ref)
            loss = -L_clip + λ * KL
            θ ← θ - lr * ∇_θ loss
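A sketch of the group-relative step in PyTorch is shown below. It assumes sequence-level log-probabilities and one scalar reward per sampled output, and it omits the sampling loop and the outer reference-policy refresh; the clipping and KL coefficients are illustrative defaults.

import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, clip_eps=0.2, kl_coef=0.04):
    """Group-relative clipped loss for G sampled outputs of one prompt.

    logp_new / logp_old : log π_θ(o_i|q) and log π_old(o_i|q), shape [G]
    rewards             : scalar rewards r_i, shape [G]
    kl_to_ref           : per-sample KL estimate against the frozen reference, shape [G]
    """
    # Group-relative advantage: normalize rewards within the sampled group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    l_clip = torch.min(ratio * adv, clipped * adv).mean()
    return -l_clip + kl_coef * kl_to_ref.mean()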
LLM‑RL vs. Agentic‑RL
LLM‑RL treats a language model as a one‑shot policy: given a prompt it generates a complete answer, receives a single scalar reward (often from a reward model or human preference), and updates the token distribution (e.g., via PPO or DPO). The environment is static, actions are tokens or whole responses, and credit assignment is coarse.
Agentic‑RL models the full decision loop: the state includes environment observations and internal memory, actions are tool selections, API calls, or sub‑task planning, and rewards can be provided at multiple steps. This enables optimization of system‑level KPIs such as task success rate, latency, cost, and safety.
Key Differences
Environment & Interaction : LLM‑RL interacts once per prompt; Agentic‑RL interacts repeatedly with a dynamic environment (databases, APIs, users).
Action Granularity : LLM‑RL actions are tokens or whole responses; Agentic‑RL actions are high‑level decisions (choose tool, issue SQL, decide to continue).
Reward Signal : LLM‑RL typically receives a terminal reward; Agentic‑RL can receive dense rewards at intermediate steps, allowing precise credit assignment (see the sketch after this list).
Optimization Objective : LLM‑RL optimizes alignment of output distribution; Agentic‑RL optimizes task‑level performance metrics (success, cost, safety).
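To contrast the two reward structures, here is a toy, entirely hypothetical agentic episode: each tool-use step emits its own reward, and discounted returns give every step its own credit, whereas a one-shot LLM-RL setup would collapse the whole episode into a single terminal score. The environment class, rewards, and tool names below are invented for illustration only.

class ToyAgentEnv:
    """Hypothetical three-step tool-use task with dense, step-level rewards."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return "task: answer the user question"
    def step(self, action):
        self.t += 1
        reward = {1: 0.1, 2: 0.2, 3: 1.0}[self.t]   # intermediate feedback per step
        done = self.t == 3
        return f"observation after {action}", reward, done

env = ToyAgentEnv()
state, done, gamma = env.reset(), False, 0.99
actions = ["search_web", "query_database", "write_answer"]  # illustrative tool choices
trajectory = []
while not done:
    action = actions[len(trajectory)]        # stand-in for a learned agent policy
    state, reward, done = env.step(action)
    trajectory.append((action, reward))

# Step-level credit assignment: discounted return from each step onward
G, returns = 0.0, []
for _, r in reversed(trajectory):
    G = r + gamma * G
    returns.insert(0, G)
print(returns)   # each step gets its own return, unlike a single terminal score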
Why Agentic‑RL Is Necessary
Real‑world tasks are long‑horizon, multi‑step processes (e.g., data agents that query databases, call APIs, write back results).
Static preference data cannot capture structured strategies such as tool selection, error recovery, or budget‑aware planning.
Online interaction creates a data flywheel: logs become RL episodes, enabling continuous on‑policy or off‑policy learning and rapid adaptation.
Industry Frameworks for Agentic‑RL
Hugging Face TRL – GitHub: https://github.com/huggingface/trl
ModelScope ms‑swift – GitHub: https://github.com/modelscope/ms-swift
Volcengine verl – GitHub: https://github.com/volcengine/verl
OpenPipe ART – GitHub: https://github.com/OpenPipe/ART
Microsoft Agent‑Lightning – GitHub: https://github.com/microsoft/agent-lightning
Notable Agentic‑RL Practices
GPT‑5‑Codex trains a coding agent that iteratively writes, runs, fixes, and submits code, using tool calls to IDEs and version control.
Tongyi DeepResearch combines long‑term research planning, web retrieval, and knowledge‑base updates with RL‑driven policy refinement.
Cursor 2.0 Composer provides an AI‑native coding environment where natural‑language intent is turned into full projects with multi‑turn interaction and real‑time debugging.
References
[1] Sutton R. S., Barto A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] Williams R. J. Simple statistical gradient‑following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[3] Christiano P. F. et al. Deep reinforcement learning from human preferences. NeurIPS, 2017.
[4] Schulman J. et al. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[5] Ouyang L. et al. Training language models to follow instructions with human feedback. NeurIPS, 2022.
[6] Rafailov R. et al. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023.
[7] Shao Z. et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
[8] Yao S. et al. ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629, 2022.
[9] Zhang G. et al. The landscape of agentic reinforcement learning for LLMs: A survey. arXiv:2509.02547, 2025.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.