Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions
This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.
Basic Concepts
Reinforcement learning (RL) models an agent interacting with an environment by taking actions and receiving scalar rewards. The goal is to learn a policy that maximizes the expected cumulative discounted return.
Four Elements in Large‑Model RL
Agent: the decision‑making model (e.g., a large language model).
Environment: the observable context or history.
Action: the token or output generated by the agent.
Reward: immediate score provided by a reward model.
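The interaction among these four elements can be sketched as a generic rollout loop. The classes below (`policy`, `env`, `reward_model`) are hypothetical placeholders, not a real library API; they only illustrate the control flow.

```python
# Minimal agent-environment loop; the objects passed in are illustrative
# stand-ins for a language-model policy, a context environment, and a
# reward model.
def rollout(policy, env, reward_model, max_steps=16):
    """Generate one episode and return a list of (state, action, reward)."""
    state = env.reset()                      # e.g., the prompt / history
    trajectory = []
    for _ in range(max_steps):
        action = policy.act(state)           # e.g., sample the next token
        next_state = env.step(state, action) # append the token to the context
        reward = reward_model.score(next_state)
        trajectory.append((state, action, reward))
        state = next_state
        if env.is_terminal(state):           # e.g., end-of-sequence reached
            break
    return trajectory
```

In the large-model setting, one "episode" is typically the generation of one full response, with the reward model scoring the completed text.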
Markov Decision Process (MDP)
An MDP is a tuple (S, A, P, R, γ) where:
State S: current environment configuration.
Action A: set of possible decisions.
Transition P(s', r | s, a): probability of next state s' and reward r given s and a.
Reward R(s, a, s'): immediate scalar feedback.
Discount factor γ ∈ [0, 1]: weights future rewards.
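The role of the discount factor can be made concrete by computing a finite-horizon discounted return; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

With γ close to 1 the agent values long-term reward almost as much as immediate reward; with γ close to 0 it becomes myopic.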
The objective is to find a policy π(a|s) that maximizes
J(π) = E_{π}[ Σ_{t=0}^{∞} γ^{t} r_{t} ]
Value Functions
State‑Value Function
The expected return when starting from state s and following policy π:
V^{π}(s) = E_{π}[ Σ_{t=0}^{∞} γ^{t} r_{t} | s_{0}=s ]
Action‑Value Function
The expected return after taking action a in state s and then following π:
Q^{π}(s,a) = E_{π}[ Σ_{t=0}^{∞} γ^{t} r_{t} | s_{0}=s, a_{0}=a ]
Bellman Equations
The optimal state‑value function satisfies
V^{*}(s) = max_{a} E_{s',r∼P}[ r + γ V^{*}(s') | s, a ]
The optimal action‑value function satisfies
Q^{*}(s,a) = E_{s',r∼P}[ r + γ max_{a'} Q^{*}(s',a') | s, a ]
Advantage Function
The advantage of an action quantifies how much better it is than the average action in the same state:
A^{π}(s,a) = Q^{π}(s,a) − V^{π}(s)
Advantage functions are used in policy‑gradient methods (A2C, A3C, PPO) to reduce variance by centering updates around a baseline.
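For intuition, the Bellman optimality backup and the advantage can be computed exactly in a small tabular MDP. The two-state chain below is an invented toy example (the transition and reward numbers are arbitrary), not something from the article:

```python
# Toy 2-state, 2-action deterministic MDP used to illustrate value
# iteration (the Bellman optimality backup) and A(s,a) = Q(s,a) - V(s).
GAMMA = 0.9
# P[s][a] = (next_state, reward); hypothetical numbers.
P = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}

V = {0: 0.0, 1: 0.0}
for _ in range(500):  # repeat the Bellman backup until convergence
    V = {s: max(r + GAMMA * V[s2] for (s2, r) in P[s].values()) for s in P}

# Q from the converged V, then the advantage of each action.
Q = {(s, a): r + GAMMA * V[s2] for s in P for a, (s2, r) in P[s].items()}
A = {(s, a): Q[(s, a)] - V[s] for (s, a) in Q}
# The greedy action in each state has advantage 0; worse actions come
# out negative, which is exactly what a policy-gradient update penalizes.
```

Here V(0) converges to 19 and V(1) to 20, so action 1 (which leads toward the rewarding state) has advantage 0 in both states while action 0 has negative advantage.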
Key Algorithms (brief)
Value‑based: Q‑learning and its deep variant DQN approximate Q^{*} with neural networks.
Policy‑gradient: directly optimize the policy parameters using the gradient of J(π).
Actor‑Critic: combine a policy (actor) with a learned value function (critic) to estimate advantages.
Proximal Policy Optimization (PPO): constrains policy updates with a clipped surrogate objective for stable learning.
Soft Actor‑Critic (SAC): adds an entropy term to encourage exploration while maximizing return.
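To make the PPO bullet above concrete, the clipped surrogate objective for a single sample can be sketched as a scalar function of the old and new log-probabilities and the advantage. This is a per-sample sketch under those assumptions, not a full training loop:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = pi_new(a|s) / pi_old(a|s) is the probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min makes the objective a pessimistic bound, so large
    # policy steps away from pi_old stop being rewarded.
    return min(ratio * advantage, clipped * advantage)
```

When the new policy matches the old one the ratio is 1 and the objective is just the advantage; once the ratio leaves [1 − ε, 1 + ε], the clip removes the incentive to move further, which is what stabilizes the updates.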
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.