Demystifying Actor‑Critic and PPO: From Policy Gradients to Practical RL
This article provides a thorough, step‑by‑step explanation of reinforcement‑learning theory—covering policy‑based objectives, value‑function definitions, the derivation of policy gradients, actor‑critic architecture, advantage estimation, importance sampling, GAE, and the PPO algorithm—aimed at readers with little prior RL knowledge.
Inspired by a previous intuitive walkthrough of the RLHF pipeline, this article offers a rigorous yet accessible deep‑dive into reinforcement‑learning theory, focusing on the actor‑critic framework and Proximal Policy Optimization (PPO). The exposition follows Sutton’s textbook and reorganises the material around the author’s own logical flow.
Outline of the article
Introduce the optimisation target for policy‑based methods.
Define the value‑function concepts.
Introduce actor‑critic and discuss how to optimise the value part within a policy‑based objective.
Derive PPO from the actor‑critic perspective.
Why write another RL theory tutorial?
The author, coming from a non‑RL background, often felt that existing resources either omitted crucial notation (e.g., subscripts on expectations) or presented formulas too tersely, making it hard to grasp why certain terms appear. By re‑deriving the equations step‑by‑step, the article aims to fill that gap.
1. Policy (π)
Policies can be deterministic (a fixed action for each state) or stochastic (actions sampled from a probability distribution). Throughout the article a stochastic policy is assumed.
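The difference between the two kinds of policy can be illustrated with a minimal sketch. The softmax parameterisation, the logits, and the function names below are assumptions made for illustration; they are not taken from the original article.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deterministic_policy(logits):
    """Always return the highest-scoring action for this state."""
    return max(range(len(logits)), key=lambda a: logits[a])

def stochastic_policy(logits, rng):
    """Sample an action from the distribution softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # hypothetical scores for 3 actions in one state
greedy_action = deterministic_policy(logits)
sampled_actions = [stochastic_policy(logits, rng) for _ in range(1000)]
```

The deterministic policy returns the same action every time, while the stochastic one visits every action with a frequency that follows the softmax probabilities.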
2. Reward
Reward depends on the current state, the taken action, and the next state. Two common notions are:
Single‑step reward (independent of the policy).
T‑step cumulative reward (sum of rewards over a trajectory).
3. Trajectory and State Transition
A trajectory (or rollout) is a sequence of states, actions, and rewards generated by interacting the agent with the environment. The article assumes stochastic state transitions.
4. Policy‑based RL optimisation target
The overall optimisation alternates between value evaluation (estimating the expected return of the current policy) and policy improvement (updating the policy based on that evaluation). The goal is to find a policy that maximises the expected return J(π)=E_{τ∼π}[R(τ)].
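Contrary to a common misconception, RL does not require two separate neural networks for policy and value; a single network with shared parameters can output both. The author labels this shared design a value-based approach, although that term more commonly refers to methods such as Q-learning that derive the policy from a learned value function.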
5. Policy Gradient Derivation
The optimisation target can be written as
J(π_θ)=E_{τ∼π_θ}[R(τ)]
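∇_θ J(π_θ)=E_{τ∼π_θ}[∇_θ log π_θ(τ)·R(τ)]
The log-probability of a trajectory expands as log π_θ(τ) = log p(s_0) + ∑_{t}[log π_θ(a_t|s_t) + log P(s_{t+1}|s_t,a_t)]. The initial-state and transition terms do not depend on θ, so their gradients vanish and only the per-step policy terms remain. Replacing the whole-trajectory return R(τ) with a per-step weight then gives the final policy-gradient expression
∇_θ J(π_θ)=E_{τ∼π_θ}[∑_{t}∇_θ log π_θ(a_t|s_t)·Â_t]
where Â_t denotes an advantage estimator.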
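The score-function form of the gradient can be checked numerically on a one-step problem. This is a minimal sketch under assumptions that are not in the original article: a softmax policy over three actions, a fixed hypothetical reward per action, and exact enumeration in place of sampling.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * R(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def policy_gradient(theta, rewards):
    """E_{a~pi}[grad log pi(a) * R(a)], computed by enumerating every action."""
    probs = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # For a softmax policy: d log pi(a) / d theta_k = 1[a == k] - pi(k)
            grad_log = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += probs[a] * grad_log * rewards[a]
    return grad

def finite_difference(theta, rewards, eps=1e-6):
    """Central-difference approximation of dJ/dtheta, used as a reference."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_return(up, rewards) - expected_return(down, rewards)) / (2 * eps))
    return grad

theta = [0.2, -0.1, 0.5]      # hypothetical policy parameters
rewards = [1.0, 3.0, -2.0]    # hypothetical reward for each action
pg = policy_gradient(theta, rewards)
fd = finite_difference(theta, rewards)
pg_with_baseline = policy_gradient(theta, [r - 5.0 for r in rewards])
```

The two gradients agree, and subtracting a constant baseline from the rewards leaves the expected gradient unchanged, which is why a baseline-subtracted weight such as Â_t can stand in for the raw return.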
6. Value Functions
6.1 Overview of value representations
Three common ways to measure value are:
Full‑trajectory cumulative reward (or discounted return).
Reward‑to‑go from timestep t onward.
Baseline‑subtracted returns to reduce variance.
6.2 Return
The discounted return from timestep t is defined as G_t = ∑_{k=0}^{∞} γ^k r_{t+k}, emphasising that near‑future rewards are weighted more heavily.
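For a finite episode, the return at every timestep can be computed in one backward pass using G_t = r_t + γ G_{t+1}. The sketch below assumes the episode terminates after the last reward; the function name and the numbers are illustrative.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for a finite episode that ends after the last reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 2.0]
returns = discounted_returns(rewards, gamma=0.5)
# returns == [1.5, 1.0, 2.0]
```

Each entry is the reward-to-go from that timestep, so the first entry is the discounted return of the whole trajectory.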
6.3 State‑Value Function V(s)
Formally, V^π(s)=E_{π}[G_t | s_t=s]. The expectation is taken over the stochastic policy and environment dynamics. The article walks through the expansion of the expectation, highlighting the role of the state‑distribution and transition probabilities.
6.4 Action‑Value Function Q(s,a)
Similarly, Q^π(s,a)=E_{π}[G_t | s_t=s, a_t=a]. The same notation conventions apply.
6.5 Relationship between V and Q
From the definitions, V^π(s)=∑_a π(a|s) Q^π(s,a). Conversely, Q^π(s,a)=R(s,a)+γ E_{s'}[V^π(s')]. This clarifies the often‑vague intuition that V is the expectation of Q.
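Both identities can be verified on a tiny tabular problem. The two-state MDP, its transition table, rewards, and policy below are invented purely for illustration; V is obtained by iterating the Bellman expectation equation until it stops changing.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# Hypothetical dynamics: P[s][a] is a list of (next_state, probability) pairs.
P = {0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]}}
# Hypothetical expected one-step reward R(s, a).
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: -1.0}}
# Hypothetical stochastic policy: PI[s][a] = pi(a | s).
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}

def q_from_v(V, s, a):
    """Q(s, a) = R(s, a) + gamma * E_{s'}[V(s')]."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])

def evaluate_policy(iterations=2000):
    """Iterate V(s) <- sum_a pi(a|s) * Q(s, a) until it converges."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V = {s: sum(PI[s][a] * q_from_v(V, s, a) for a in ACTIONS) for s in STATES}
    return V

V = evaluate_policy()
Q = {s: {a: q_from_v(V, s, a) for a in ACTIONS} for s in STATES}
```

After convergence, averaging Q(s, a) over the policy's action probabilities reproduces V(s) in every state.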
6.6 Advantage and TD‑error
The advantage is defined as A^π(s,a)=Q^π(s,a)-V^π(s). When the value estimator is exact, the TD‑error δ_t = r_t + γ V(s_{t+1}) - V(s_t) is an unbiased estimator of the advantage. If the critic is inaccurate, the TD‑error becomes biased, motivating the use of Generalised Advantage Estimation (GAE).
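The claim about bias can be made concrete for a single state-action pair. In this sketch the value table V_EXACT is simply assumed to be exact, and the states, reward, and transition probabilities are hypothetical.

```python
GAMMA = 0.9
REWARD = 0.5                                  # r for the chosen (s, a) pair
NEXT_STATE_PROBS = [("x", 0.5), ("y", 0.5)]   # P(s' | s, a), hypothetical

def expected_td_error(V):
    """E_{s'}[r + gamma * V(s') - V(s)] for the fixed pair (s, a)."""
    return sum(p * (REWARD + GAMMA * V[s2] - V["s"]) for s2, p in NEXT_STATE_PROBS)

def advantage(V):
    """A(s, a) = Q(s, a) - V(s), with Q(s, a) = r + gamma * E_{s'}[V(s')]."""
    q = REWARD + GAMMA * sum(p * V[s2] for s2, p in NEXT_STATE_PROBS)
    return q - V["s"]

V_EXACT = {"s": 1.0, "x": 2.0, "y": 0.0}      # assumed to be the true value function
V_CRITIC = {"s": 1.5, "x": 2.0, "y": 0.0}     # a critic that overestimates V(s)

true_advantage = advantage(V_EXACT)           # 0.4
td_with_exact_v = expected_td_error(V_EXACT)  # matches the true advantage
td_with_bad_critic = expected_td_error(V_CRITIC)
```

With the exact values the expected TD-error equals the advantage, while the critic that overestimates V(s) by 0.5 shifts the expected TD-error by the same amount.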
7. Actor‑Critic
The actor represents the policy π_θ, and the critic approximates the value function V_w. The actor loss uses the advantage estimator, while the critic loss minimises the squared TD‑error.
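# Pseudo-code for PPO training loop (helper names are placeholders)
for i in range(steps):
    # Roll out the current actor on a batch of prompts and score the results.
    exps = generate_experience(prompts, actor, critic, reward, ref)
    # Reuse the same batch of experience for several optimisation epochs.
    for j in range(ppo_epochs):
        actor_loss = cal_actor_loss(exps, actor)
        critic_loss = cal_critic_loss(exps, critic)
        # Update the policy first, then the value function.
        actor.backward(actor_loss)
        actor.step()
        critic.backward(critic_loss)
        critic.step()
8. Proximal Policy Optimization (PPO)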
8.1 Issues with naive actor‑critic
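Two main problems are high sample cost (every update needs fresh environment interactions) and bias introduced by an imperfect critic. Importance sampling addresses the first by allowing collected data to be reused, and GAE addresses the second by trading bias against variance.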
8.2 Importance Sampling
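To reuse data off-policy, the objective is re-weighted by the likelihood ratio ρ_t = π_θ(a_t|s_t) / π_{θ_{old}}(a_t|s_t). The re-weighted estimator stays unbiased, but its variance grows as the two policies drift apart, so a large distribution shift requires many more samples for an accurate estimate.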
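The effect of the likelihood ratio can be shown exactly on a two-action example, without any sampling noise. The distributions and the per-action values f(a) below are hypothetical, and the function names are chosen for this sketch.

```python
def expectation(probs, f_values):
    """E_{a~probs}[f(a)] for a discrete distribution."""
    return sum(p * f for p, f in zip(probs, f_values))

def is_mean_and_variance(pi_new, pi_old, f_values):
    """Mean and variance of rho(a) * f(a) when actions are drawn from pi_old."""
    weighted = [(n / o) * f for n, o, f in zip(pi_new, pi_old, f_values)]
    mean = sum(o * w for o, w in zip(pi_old, weighted))
    second_moment = sum(o * w * w for o, w in zip(pi_old, weighted))
    return mean, second_moment - mean * mean

f_values = [0.0, 1.0]          # hypothetical per-action quantity, e.g. an advantage
pi_new = [0.5, 0.5]            # policy being optimised
pi_old_close = [0.4, 0.6]      # data-collecting policy close to pi_new
pi_old_far = [0.99, 0.01]      # data-collecting policy far from pi_new

target = expectation(pi_new, f_values)
mean_close, var_close = is_mean_and_variance(pi_new, pi_old_close, f_values)
mean_far, var_far = is_mean_and_variance(pi_new, pi_old_far, f_values)
```

Both re-weighted means equal the on-policy expectation, but the variance of a single re-weighted sample is far larger when the old policy rarely picks an action that the new policy favours.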
8.3 GAE: Balancing bias and variance
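GAE computes an exponentially weighted sum of one-step TD-errors: Â_t^{GAE(γ,λ)} = ∑_{l=0}^{∞} (γλ)^l δ_{t+l}, which is equivalent to a weighted average of n-step advantage estimates. It interpolates between the Monte-Carlo estimate (λ=1: high variance, low bias) and the one-step TD estimate (λ=0: low variance, but biased whenever the critic is inaccurate).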
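For a finite trajectory the sum can be evaluated with the backward recursion Â_t = δ_t + γλ Â_{t+1}. The following is a minimal sketch; the convention that `values` carries one extra bootstrap entry for the state after the last step is an assumption of this example.

```python
def gae(rewards, values, gamma, lam):
    """Generalised advantage estimates for one trajectory.

    `values` has len(rewards) + 1 entries; the last one is the bootstrap value
    of the state after the final step (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD-error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [1.0, 0.5, 1.5, 0.0]   # hypothetical critic outputs, terminal bootstrap = 0

adv_td = gae(rewards, values, gamma=0.5, lam=0.0)   # [0.25, 0.25, 0.5]
adv_mc = gae(rewards, values, gamma=0.5, lam=1.0)   # [0.5, 0.5, 0.5]
```

With λ=0 each estimate is the single TD-error, and with λ=1 it equals the discounted return minus the critic's value for that state.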
8.4 From TRPO to PPO
TRPO enforces a KL‑divergence constraint on policy updates. PPO simplifies this by using a clipped surrogate objective, avoiding the need for a second‑order optimizer.
8.5 PPO‑Clip
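The clipped surrogate objective from the PPO paper is L^{CLIP}(θ) = E_t[min(ρ_t Â_t, clip(ρ_t, 1-ε, 1+ε) Â_t)], with ρ_t the likelihood ratio from section 8.2. The sketch below evaluates it for a single sample; the function name and the default ε = 0.2 are choices made for this example.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)   # rho = pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

log_old = 0.0
# Positive advantage: the gain from raising the ratio is capped at 1 + eps.
gain_capped = ppo_clip_objective(math.log(1.5), log_old, advantage=1.0)
# Negative advantage: pushing the ratio below 1 - eps earns no further credit.
loss_capped = ppo_clip_objective(math.log(0.5), log_old, advantage=-1.0)
# Moves that make the objective worse are never clipped.
not_clipped = ppo_clip_objective(math.log(1.5), log_old, advantage=-1.0)
```

Taking the minimum makes the objective a pessimistic bound, so the policy gains nothing from moving the ratio outside [1-ε, 1+ε] in the direction that would increase the objective. In training, the actor loss is the negative mean of this quantity over the batch.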
8.6 PPO‑Penalty
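The penalty variant from the PPO paper replaces clipping with a KL term, L^{KLPEN}(θ) = E_t[ρ_t Â_t - β KL(π_{θ_{old}} ‖ π_θ)], and adapts β after each update so that the measured KL stays near a target. The sketch below uses discrete distributions; the function names and example numbers are assumptions, while the factors 1.5 and 2 follow the heuristic described in that paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def ppo_penalty_objective(ratios, advantages, mean_kl, beta):
    """Mean of rho * A over the batch, minus beta times the mean KL."""
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return surrogate - beta * mean_kl

def update_beta(beta, mean_kl, kl_target):
    """Loosen the penalty if the policy barely moved, tighten it if it moved too far."""
    if mean_kl < kl_target / 1.5:
        return beta / 2.0
    if mean_kl > kl_target * 1.5:
        return beta * 2.0
    return beta

pi_old = [0.5, 0.5]            # hypothetical old policy in one state
pi_new = [0.6, 0.4]            # hypothetical updated policy
kl = kl_divergence(pi_old, pi_new)
ratios = [pi_new[0] / pi_old[0], pi_new[1] / pi_old[1]]  # one sample of each action
advantages = [1.0, -1.0]
objective = ppo_penalty_objective(ratios, advantages, kl, beta=1.0)
next_beta = update_beta(1.0, kl, kl_target=0.01)
```

Here the measured KL is roughly twice the target, so β is doubled for the next update.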
8.7 Critic loss in PPO
The critic loss typically minimises the mean‑squared error between the predicted value and the GAE‑estimated return. The article shows the concrete implementation used in the deepspeed‑chat RLHF code.
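A common way to build this loss is to form the regression target as advantage plus the old value estimate, and optionally to clip how far the new prediction may move from the old one. The sketch below illustrates that general pattern only; it is not a copy of the deepspeed-chat code, and the function names and the `clip_value` default are assumptions.

```python
def value_targets(advantages, old_values):
    """Regression targets for the critic: return_t = A_t + V_old(s_t)."""
    return [a + v for a, v in zip(advantages, old_values)]

def critic_loss(values, old_values, returns, clip_value=0.2):
    """Half the mean of max(unclipped, clipped) squared errors."""
    total = 0.0
    for v, v_old, ret in zip(values, old_values, returns):
        # Keep the new prediction within clip_value of the old one.
        v_clipped = max(v_old - clip_value, min(v, v_old + clip_value))
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return 0.5 * total / len(values)

old_values = [1.0, 1.0]        # critic outputs when the experience was collected
advantages = [0.5, 0.0]        # hypothetical GAE advantages
returns = value_targets(advantages, old_values)   # [1.5, 1.0]
values = [1.0, 2.0]            # current critic outputs
loss = critic_loss(values, old_values, returns)
# loss == 0.5 * (0.25 + 1.0) / 2 == 0.3125
```

Setting `clip_value` to a very large number reduces this to a plain mean-squared error scaled by one half.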
Overall, the article walks the reader from the fundamentals of MDPs and value functions to modern policy‑optimisation algorithms, providing explicit derivations, intuitive examples (e.g., a Mario‑style game), and practical code snippets.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
