Understanding Reinforcement Learning: From Basics to PPO and Policy Gradient
This article provides a comprehensive overview of reinforcement learning, covering fundamental concepts, differences from supervised learning, algorithm families, policy gradient methods, practical tricks like baselines and reward‑to‑go, and detailed explanations of TRPO and PPO variants with illustrative diagrams.
1.1 Basic Concepts of Reinforcement Learning
Reinforcement learning (RL) enables an agent to learn decision‑making by interacting with an environment and receiving a scalar reward. The goal is to maximize the expected cumulative reward.
Agent : entity that selects actions (e.g., robot, virtual character, algorithm).
Environment : everything the agent interacts with, providing observations and rewards.
Action : a move the agent can execute, potentially changing the environment.
State : representation of the environment at a given moment.
Reward : scalar feedback indicating the quality of an action.
Policy : mapping from states (or observations) to a probability distribution over actions.
Value Function : expected cumulative reward obtained from a state (or state‑action pair) when following a given policy.
Learning Process : the agent samples actions, observes rewards, and updates its policy/value estimates using algorithms such as Q‑learning, DQN, or policy‑gradient methods.
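The interaction loop behind these terms can be sketched in a few lines of plain Python. The environment below (CoinFlipEnv), its reset/step interface, and the fixed policy are hypothetical and exist only for illustration; the sketch shows the data flow (state, action, reward), not a learning algorithm.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a biased coin, reward 1.0 when correct."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current timestep

    def step(self, action):
        coin = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def always_heads(state):
    """A fixed (non-learning) policy: always pick action 1."""
    return 1


def run_episode(env, policy):
    """Roll out one episode and return the cumulative reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)                   # agent selects an action
        state, reward, done = env.step(action)   # environment responds
        total += reward
    return total


episode_return = run_episode(CoinFlipEnv(), always_heads)
```

A learning agent would replace always_heads with a policy whose parameters are updated from the observed rewards.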
1.2 RL vs. Supervised Learning
Supervised learning assumes i.i.d. data, provides explicit labels, and optimizes a loss that matches predictions to those labels. It struggles when labeled data are scarce or when temporal dependencies exist.
Reinforcement learning deals with sequential, temporally correlated data and delayed rewards, and must balance exploration against exploitation. Because the agent learns from its own experience rather than from human‑provided labels, it can surpass human performance in some tasks.
1.3 Types of RL Algorithms
Algorithms can be grouped by the agent’s approach:
Value‑Based : learn a value function (e.g., Q‑learning, SARSA) and derive a policy from it.
Policy‑Based : directly learn a stochastic policy (e.g., Policy Gradient).
Actor‑Critic : combine both; an actor learns the policy while a critic estimates the value function.
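As an illustration of the value‑based family, here is a minimal sketch of the tabular Q‑learning update and the greedy policy derived from it. The state and action names, the learning rate alpha, and the discount gamma are assumptions chosen for the example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9, done=False):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped target."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state, actions):
    """Derive a policy from the value estimates: pick the highest-valued action."""
    return max(actions, key=lambda a: q[(state, a)])


q = defaultdict(float)           # all Q-values start at 0.0
actions = ["left", "right"]      # hypothetical action set

# Terminal transition: taking "right" in s0 yields reward 1.0.
q_learning_update(q, "s0", "right", 1.0, "s1", actions, done=True)
# Non-terminal transition that leads into s0 with no immediate reward.
q_learning_update(q, "s_prev", "left", 0.0, "s0", actions)
```

The second update shows bootstrapping: s_prev gains value only because it leads to s0, whose value was learned first.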
They can also be classified by whether they model the environment:
Model‑Based : build a transition and reward model, enabling planning with fewer environment interactions.
Model‑Free : learn solely from real interactions, often requiring more samples but avoiding model bias.
2. Policy Gradient (PG)
PG methods treat the policy as a parameterized network π_θ(a|s). The objective is to maximize the expected cumulative reward J(θ)=E_{τ∼π_θ}[R(τ)]. Gradient ascent updates the parameters using the policy‑gradient theorem.
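To make the update concrete, the sketch below estimates the policy gradient for a softmax policy over two actions in a single state, using the score‑function form: the average of R · ∇_θ log π_θ(a). The single‑state setting, the logits theta, the sample list, and the step size 0.1 are assumptions for illustration; a real implementation would use automatic differentiation over a network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_estimate(logits, samples):
    """Monte Carlo estimate of grad J from (action, return) samples."""
    grad = [0.0] * len(logits)
    for action, ret in samples:
        g = grad_log_prob(logits, action)
        for i in range(len(grad)):
            grad[i] += ret * g[i] / len(samples)
    return grad


theta = [0.0, 0.0]                   # uniform policy over two actions
samples = [(0, 1.0), (1, 0.0)]       # action 0 earned return 1, action 1 earned 0
grad = policy_gradient_estimate(theta, samples)
new_theta = [t + 0.1 * g for t, g in zip(theta, grad)]  # gradient ascent step
```

After the step, the policy assigns slightly more probability to action 0, the action that earned the higher return.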
Key practical issues:
When all rewards are positive, every sampled action has its probability pushed up, differing only in magnitude; actions that happen not to be sampled lose probability even if they are good.
Assigning the total episode reward to every state‑action pair misattributes credit: an action is also credited with rewards earned before it was taken, which it could not have influenced.
2.1 Baseline and Reward‑to‑Go
Introduce a baseline b (often the average reward) to reduce variance: A(s,a)=R(τ)-b. Subtracting a baseline helps distinguish truly good actions when rewards are all positive.
Replace the total episode reward with reward‑to‑go, i.e., the sum of rewards from the current timestep to the end, providing more accurate credit assignment.
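Both tricks are simple to express in code. The sketch below uses the mean as the baseline, which is one common choice and an assumption here; the reward values are made up for illustration.

```python
def reward_to_go(rewards):
    """For each timestep t, sum the rewards from t to the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def subtract_baseline(values):
    """Center values around their mean so that below-average ones become negative."""
    baseline = sum(values) / len(values)
    return [v - baseline for v in values]


# Baseline: three episodes with all-positive returns.
episode_returns = [1.0, 3.0, 2.0]
advantages = subtract_baseline(episode_returns)

# Reward-to-go: per-step rewards of a single episode.
rewards = [1.0, 2.0, 3.0]
rtg = reward_to_go(rewards)
```

With the baseline subtracted, the worst episode gets a negative weight and the best a positive one, even though all raw returns were positive. With reward‑to‑go, the last action is credited only with the final reward rather than the whole episode total.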
2.2 Discount Factor
Apply a discount γ∈[0,1] to future rewards, so that a reward received k steps in the future is weighted by γ^k, reflecting that distant outcomes have less influence on current decisions.
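A minimal sketch of discounted reward‑to‑go, computed backward through the episode with the recurrence G_t = r_t + γ · G_{t+1}. The reward sequence and γ = 0.5 are illustrative values.

```python
def discounted_reward_to_go(rewards, gamma):
    """Discounted return from each timestep, computed in one backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


returns = discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5)
```

Setting gamma to 1.0 recovers the undiscounted reward‑to‑go, and gamma of 0.0 keeps only the immediate reward.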
2.3 Importance Sampling
When reusing data collected by an older policy π_old, weight each sample by the likelihood ratio π_θ(a|s) / π_old(a|s) to correct for distribution shift. Large divergence inflates variance, motivating trust‑region methods.
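The correction can be checked numerically. In the sketch below, actions are sampled from an old policy, yet the weighted average estimates the expected reward under the new policy. The two‑action setup, the probabilities, and the sample count are assumptions for illustration.

```python
import math
import random


def importance_weights(new_log_probs, old_log_probs):
    """Likelihood ratios pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_log_probs, old_log_probs)]


def weighted_mean(values, weights):
    """Importance-sampling estimate of the mean of `values` under the new policy."""
    return sum(w * v for w, v in zip(weights, values)) / len(values)


pi_old = [0.5, 0.5]    # behavior policy that generated the data
pi_new = [0.8, 0.2]    # policy we want to evaluate
reward = [1.0, 0.0]    # reward for action 0 and action 1

rng = random.Random(0)
actions = [0 if rng.random() < pi_old[0] else 1 for _ in range(20000)]

weights = importance_weights(
    [math.log(pi_new[a]) for a in actions],
    [math.log(pi_old[a]) for a in actions],
)
estimate = weighted_mean([reward[a] for a in actions], weights)
# The true expected reward under pi_new is 0.8 * 1.0 + 0.2 * 0.0 = 0.8.
```

The further pi_new moves from pi_old, the more extreme the weights become, which is the variance problem described above.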
3. Trust‑Region and Proximal Policy Optimization
3.1 TRPO
Trust‑Region Policy Optimization constrains the KL divergence between the new and old policies to stay below a threshold, ensuring stable updates but requiring costly second‑order optimization.
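The constraint itself is easy to state for discrete action distributions. The sketch below only checks whether a candidate policy stays inside the trust region; it does not implement TRPO's actual optimization procedure. The threshold delta and the example distributions are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_trust_region(old_probs, new_probs, delta):
    """True if the new policy is within KL distance delta of the old policy."""
    return kl_divergence(old_probs, new_probs) <= delta


old = [0.5, 0.5]
small_step = [0.55, 0.45]   # KL is roughly 0.005
large_step = [0.95, 0.05]   # KL is roughly 0.83

accept_small = within_trust_region(old, small_step, delta=0.01)
accept_large = within_trust_region(old, large_step, delta=0.01)
```

Here the small step is accepted and the large step is rejected.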
3.2 PPO (Penalty and Clip)
PPO simplifies TRPO by replacing the hard KL constraint with a first‑order objective that discourages large policy changes. Two common variants:
PPO‑penalty : adds an adaptive KL penalty term; if the KL is below a target, the penalty weight is decreased, otherwise increased.
PPO‑clip : clips the probability ratio r = π_θ(a|s) / π_old(a|s) to [1‑ε, 1+ε] and takes the minimum of the clipped and unclipped surrogate terms, removing the incentive for large policy updates.
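For the PPO‑penalty variant, the adaptive rule can be sketched as follows. The tolerance of 1.5 and the scaling factor of 2 are commonly used heuristic values and are assumptions here, as is the dead band in which the weight is left unchanged; the function names are hypothetical.

```python
def update_kl_coef(beta, kl, kl_target, tolerance=1.5, factor=2.0):
    """Adapt the KL penalty weight after a policy update."""
    if kl < kl_target / tolerance:
        return beta / factor   # policy moved too little: relax the penalty
    if kl > kl_target * tolerance:
        return beta * factor   # policy moved too far: strengthen the penalty
    return beta                # close enough to the target: keep the weight


def penalized_objective(surrogate, kl, beta):
    """Objective to maximize: importance-weighted surrogate minus the KL penalty."""
    return surrogate - beta * kl


beta_small_kl = update_kl_coef(1.0, kl=0.001, kl_target=0.01)
beta_large_kl = update_kl_coef(1.0, kl=0.05, kl_target=0.01)
```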
Typical PPO‑clip implementation (PaddlePaddle syntax):
ratios = paddle.exp(cur_batch_log_probs - batch_log_probs.detach())
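The ratio is only the first step of the clipped objective. As a framework‑free illustration, the sketch below computes the full clipped surrogate loss in plain Python rather than PaddlePaddle. The function name, the ε default of 0.2, and the per‑sample lists are assumptions, not the original article's code.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                        # pi_new / pi_old
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))  # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)                 # pessimistic bound
    return -total / len(advantages)                              # negate to get a loss


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
loss_capped = ppo_clip_loss([math.log(1.5)], [0.0], [1.0])
# Ratio 0.5 with a positive advantage: the minimum keeps the unclipped, lower value.
loss_uncapped = ppo_clip_loss([math.log(0.5)], [0.0], [1.0])
```

Taking the minimum means clipping only ever removes the benefit of moving the ratio further in the favorable direction; it never hides a change that makes the objective worse.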
