Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions

This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.

Data Party THU

Basic Concepts

Reinforcement learning (RL) models an agent interacting with an environment by taking actions and receiving scalar rewards. The goal is to learn a policy that maximizes the expected cumulative discounted return.

Four Elements in Large‑Model RL

Agent: the decision-making model (e.g., a large language model).

Environment: the observable context the agent acts on, such as the prompt and the generation history so far.

Action: the token or output the agent generates.

Reward: an immediate scalar score, typically produced by a reward model.

Figure: RL interaction diagram (agent, environment, action, reward loop).
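
In code, this loop can be sketched roughly as follows; a minimal Python illustration in which policy.generate and reward_model.score are hypothetical interfaces standing in for the actual model and reward model, not a specific library:

    # Hypothetical single-episode rollout for LLM RL; all interfaces are illustrative.
    def rollout(policy, reward_model, prompt, max_tokens=128):
        state = prompt                       # environment: the observable context so far
        tokens = []
        for _ in range(max_tokens):
            action = policy.generate(state)  # agent: pick the next token
            tokens.append(action)
            state = state + action           # the context grows with every action
            if action == "<eos>":            # stop when the model emits an end token
                break
        reward = reward_model.score(state)   # reward: scalar score for the final output
        return tokens, reward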

Markov Decision Process (MDP)

An MDP is a tuple (S, A, P, R, γ) where:

State space S: the set of possible environment configurations.

Action space A: the set of possible decisions.

Transition kernel P(s', r | s, a): the probability of reaching state s' and receiving reward r after taking action a in state s.

Reward R(s, a, s'): the immediate scalar feedback for a transition.

Discount factor γ ∈ [0, 1]: down-weights future rewards relative to immediate ones.

The objective is to find a policy π(a|s) that maximizes

J(π)=E_{π}[ Σ_{t=0}^{∞} γ^{t} r_{t} ]
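
The discounted sum inside this expectation is easy to compute for a recorded trajectory; a minimal Python sketch:

    # Discounted return G = sum_t gamma^t * r_t for one trajectory of rewards.
    def discounted_return(rewards, gamma=0.99):
        g = 0.0
        for r in reversed(rewards):   # accumulate from the last step backwards
            g = r + gamma * g
        return g

    # A reward of 1.0 arriving three steps in the future is worth gamma^3 today:
    print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # 0.729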

Value Functions

State‑Value Function

The expected return when starting from state s and following policy π:

V^{π}(s)=E_{π}[ Σ_{t=0}^{∞} γ^{t} r_{t} \mid s_{0}=s ]

Action‑Value Function

The expected return after taking action a in state s and then following π:

Q^{π}(s,a)=E_{π}[ Σ_{t=0}^{∞} γ^{t} r_{t} \mid s_{0}=s, a_{0}=a ]
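
Both definitions suggest a simple Monte Carlo estimator: roll the policy out many times and average the discounted returns. A sketch under the assumption of a generic environment interface (env.reset(s0) returns the start state, env.step(a) returns (next_state, reward, done)) and a policy(s) that samples an action:

    def mc_state_value(env, policy, s0, episodes=1000, gamma=0.99):
        """Monte Carlo estimate of V^pi(s0): average discounted return over rollouts."""
        total = 0.0
        for _ in range(episodes):
            s = env.reset(s0)
            g, discount, done = 0.0, 1.0, False
            while not done:
                a = policy(s)                 # sample a_t ~ pi(.|s_t)
                s, r, done = env.step(a)
                g += discount * r             # accumulate gamma^t * r_t
                discount *= gamma
            total += g
        return total / episodes

    # Q^pi(s0, a0) is estimated the same way, except the first action is fixed to a0
    # instead of being sampled from the policy.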

Bellman Equations

The optimal state‑value function satisfies the Bellman optimality equation

V^{*}(s)=\max_{a} \; E_{s',r\sim P}[ r + γ V^{*}(s') \mid s, a ]

Similarly, the optimal action‑value function satisfies

Q^{*}(s,a)=E_{s',r\sim P}[ r + γ \max_{a'} Q^{*}(s',a') \mid s, a ]
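
On a small tabular MDP these fixed-point equations can be solved by repeatedly applying the Bellman optimality backup (value iteration). A minimal sketch, assuming the model is given as P[s][a] = list of (probability, next_state, reward) triples:

    def value_iteration(states, actions, P, gamma=0.9, tol=1e-8):
        """Iterate V(s) <- max_a sum_{s',r} P(s',r|s,a) * (r + gamma * V(s')) to a fixed point."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

    # Q*(s, a) then follows from one more backup:
    # Q*(s, a) = sum_{s', r} P(s', r | s, a) * (r + gamma * V[s'])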

Advantage Function

The advantage of an action quantifies how much better it is than the policy's average behaviour in the same state:

A^{π}(s,a) = Q^{π}(s,a) - V^{π}(s)

Advantage functions are used in policy‑gradient methods (A2C, A3C, PPO) to reduce variance by centering updates around a baseline.
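
In practice the advantage is usually estimated rather than computed exactly, for example by subtracting a learned value baseline from the discounted return-to-go. A minimal sketch (the value estimates here would come from a critic; more refined estimators such as GAE follow the same idea):

    # A_t = G_t - V(s_t), where G_t is the discounted return-to-go from step t.
    def advantage_estimates(rewards, values, gamma=0.99):
        advs, g = [], 0.0
        for r, v in zip(reversed(rewards), reversed(values)):
            g = r + gamma * g          # return-to-go G_t, built backwards
            advs.append(g - v)         # center around the baseline V(s_t)
        return list(reversed(advs))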

Key Algorithms (brief)

Value‑based: Q‑learning learns the optimal action‑value function Q^{*} directly; its deep variant DQN approximates Q^{*} with a neural network.
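
For reference, a sketch of the tabular Q-learning update that DQN generalizes with a neural network (Q is assumed to be a dictionary keyed by (state, action) and initialized to 0.0):

    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])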

Policy‑gradient: directly optimize the policy parameters by following the gradient of J(π).
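
The simplest such estimator is REINFORCE, which weights each sampled action's log-probability by the return. A PyTorch-flavored sketch (log_probs and returns are assumed to be collected from rollouts of the current policy):

    import torch

    # REINFORCE surrogate loss: minimizing it ascends E[G_t * grad log pi(a_t|s_t)].
    def reinforce_loss(log_probs, returns):
        log_probs = torch.stack(log_probs)          # per-step log pi(a_t|s_t)
        returns = torch.as_tensor(returns, dtype=log_probs.dtype)
        return -(log_probs * returns).sum()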

Actor‑Critic: combine a policy (the actor) with a learned value function (the critic) used to estimate advantages.

Proximal Policy Optimization (PPO): constrains policy updates with a clipped surrogate objective for stable learning.
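
A sketch of the clipped surrogate loss (eps is the clip range; the log-probabilities and advantages are assumed to be precomputed per step or per token):

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        """Clipped surrogate: -E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)]."""
        ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()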

Soft Actor‑Critic (SAC): adds an entropy term to encourage exploration while maximizing return.
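
The entropy bonus appears directly in the actor's objective; a minimal sketch (q_value and log_prob are assumed to be tensors from the critic and from the sampled action, and alpha is the temperature):

    # Soft actor loss: minimize E[alpha * log pi(a|s) - Q(s,a)], i.e. maximize Q plus entropy.
    def sac_actor_loss(q_value, log_prob, alpha=0.2):
        return (alpha * log_prob - q_value).mean()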


Written by Data Party THU

Official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.