Understanding Reinforcement Learning: From Basics to PPO and Policy Gradient

This article provides a comprehensive overview of reinforcement learning, covering fundamental concepts, differences from supervised learning, algorithm families, policy gradient methods, practical tricks like baselines and reward‑to‑go, and detailed explanations of TRPO and PPO variants with illustrative diagrams.

Baidu Geek Talk

1.1 Basic Concepts of Reinforcement Learning

Reinforcement learning (RL) enables an agent to learn decision‑making by interacting with an environment and receiving a scalar reward. The goal is to maximize the expected cumulative reward.

Agent: the entity that selects actions (e.g., a robot, a virtual character, or an algorithm).

Environment: everything the agent interacts with; it provides observations and rewards.

Action: a move the agent can execute, potentially changing the environment.

State: a representation of the environment at a given moment.

Reward: scalar feedback indicating the quality of an action.

Policy: a mapping from states (or observations) to a probability distribution over actions.

Value Function: the expected cumulative reward from a state (or state‑action pair).

Learning Process: the agent samples actions, observes rewards, and updates its policy or value estimates using algorithms such as Q‑learning, DQN, or policy‑gradient methods.
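The learning process above can be sketched with tabular Q‑learning, one of the algorithms just named. This is a minimal illustration, not the article's own code: the corridor environment, the `step` function, and every hyperparameter value are hypothetical choices.

```python
import random

N_STATES = 5          # hypothetical corridor: cells 0..4, goal at cell 4
ACTIONS = (0, 1)      # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment (an assumption): return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train_q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (or on ties), otherwise exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train_q_learning()
# Greedy action in each non-terminal cell after training.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

The loop shows the three ingredients from the list: sampling an action from the current policy, observing a reward, and updating a value estimate. The reward arrives only at the goal, so earlier cells learn their value through the bootstrapped target.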

1.2 RL vs. Supervised Learning

Supervised learning assumes i.i.d. data, provides explicit labels, and optimizes a loss that matches predictions to those labels. It struggles when labeled data are scarce or when temporal dependencies exist.

Reinforcement learning deals with sequential, temporally correlated data, delayed rewards, and a trade‑off between exploration and exploitation, allowing agents to surpass human performance in some tasks.

1.3 Types of RL Algorithms

Algorithms can be grouped by the agent’s approach:

Value‑Based: learn a value function (e.g., Q‑learning, SARSA) and derive a policy from it.

Policy‑Based: directly learn a stochastic policy (e.g., Policy Gradient).

Actor‑Critic: combine both; an actor learns the policy while a critic estimates the value function.

They can also be classified by whether they model the environment:

Model‑Based: build a transition and reward model, enabling planning with fewer environment interactions.

Model‑Free: learn solely from real interactions, often requiring more samples but avoiding model bias.

2. Policy Gradient (PG)

PG methods treat the policy as a parameterized network π_θ(a|s). The objective is to maximize the expected cumulative reward J(θ)=E_{τ∼π_θ}[R(τ)]. Gradient ascent updates the parameters using the policy‑gradient theorem.
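For reference, the gradient that the policy‑gradient theorem yields for this objective is commonly written in the likelihood‑ratio (REINFORCE) form below. The trajectory notation τ = (s_0, a_0, …, s_T, a_T), the horizon T, the sample count N, and the learning rate α are symbols introduced here for illustration; they do not appear in the text above.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
  \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{n} \mid s_t^{n}),
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

The sample average on the right is what is computed in practice: each log‑probability gradient is weighted by the reward of the trajectory it came from.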

Key practical issues:

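When every reward is positive, each sampled action has its probability pushed up, only by different amounts. Actions that happen not to be sampled therefore lose probability relative to the rest, even if they are good.

Assigning the total episode reward to every state‑action pair gives poor credit assignment: every step receives the same credit, including early steps unrelated to a later outcome and late steps credited with rewards earned before they were taken.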

2.1 Baseline and Reward‑to‑Go

Introduce a baseline b (often the average reward) to reduce variance: A(s,a)=R(τ)-b. Subtracting a baseline helps distinguish truly good actions when rewards are all positive.

Replace the total episode reward with the reward‑to‑go, i.e., the sum of rewards from the current timestep to the end of the episode, which gives more accurate credit assignment.
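Both tricks can be sketched in a few lines of plain Python. The function names are hypothetical, and using the mean of the reward‑to‑go values as the baseline is an assumption made for this example; the text only says the baseline is often an average reward.

```python
def rewards_to_go(rewards):
    """Sum of rewards from each timestep to the end of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns

def subtract_baseline(returns):
    """Center the returns on their mean (assumed baseline b)."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

episode_rewards = [1.0, 2.0, 3.0]
returns = rewards_to_go(episode_rewards)   # [6.0, 5.0, 3.0]
advantages = subtract_baseline(returns)    # positive for above-average steps only
```

Although every reward in the example is positive, the centered values have mixed signs, so below‑average steps are pushed down rather than merely pushed up less.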

2.2 Discount Factor

Apply a discount γ∈[0,1] to future rewards, reflecting that distant outcomes have less influence on current decisions.
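A minimal sketch of the discounted reward‑to‑go, using the recursion G_t = r_t + γ·G_{t+1}. The function name and the default γ = 0.99 are assumptions for illustration.

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards for every timestep of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

discounted = discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```

With γ = 1 this reduces to the undiscounted reward‑to‑go; with γ = 0 each step is credited only with its immediate reward.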

2.3 Importance Sampling

When reusing data collected by an older policy π_old, weight each sample by the likelihood ratio π_θ(a|s) / π_old(a|s) to correct for distribution shift. Large divergence inflates variance, motivating trust‑region methods.
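The reweighting can be illustrated on a two‑action toy problem. All distributions, rewards, and names below are made up for the example; the point is only that samples drawn under π_old, once weighted by the likelihood ratio, estimate an expectation under π_θ.

```python
import random

def importance_sampling_estimate(samples, pi_new, pi_old, reward):
    """Estimate E_{a ~ pi_new}[reward(a)] from actions sampled under pi_old."""
    total = 0.0
    for a in samples:
        weight = pi_new[a] / pi_old[a]   # likelihood ratio for this sample
        total += weight * reward[a]
    return total / len(samples)

rng = random.Random(0)
pi_old = [0.5, 0.5]      # hypothetical behaviour policy over two actions
pi_new = [0.8, 0.2]      # hypothetical policy being evaluated
reward = [1.0, 0.0]      # hypothetical reward per action

samples = rng.choices([0, 1], weights=pi_old, k=10_000)
estimate = importance_sampling_estimate(samples, pi_new, pi_old, reward)
exact = sum(p * r for p, r in zip(pi_new, reward))   # expectation under pi_new
print(estimate, exact)
```

The further π_θ drifts from π_old, the larger the individual weights become and the noisier the estimate gets, which is the variance problem mentioned above.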

3. Trust‑Region and Proximal Policy Optimization

3.1 TRPO

Trust‑Region Policy Optimization constrains the KL divergence between the new and old policies to stay below a threshold, ensuring stable updates but requiring costly second‑order optimization.
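The constrained problem is commonly written as follows. The threshold symbol δ is introduced here for illustration, and A(s,a) denotes the advantage estimate from Section 2.1.

```latex
\max_\theta \;
  \mathbb{E}_{(s,a) \sim \pi_{\text{old}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \, A(s,a) \right]
\quad \text{subject to} \quad
  \mathbb{E}_{s \sim \pi_{\text{old}}}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta
```

The objective is the importance‑weighted surrogate from Section 2.3; the constraint keeps the new policy close enough to the old one for that surrogate to remain a reliable estimate.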

3.2 PPO (Penalty and Clip)

PPO simplifies TRPO by replacing the hard KL constraint with a first‑order objective that discourages large policy changes. Two common variants:

PPO‑penalty: adds an adaptive KL penalty term to the objective; if the measured KL is below a target, the penalty weight is decreased, otherwise it is increased.

PPO‑clip: clips the probability ratio r = π_θ(a|s) / π_old(a|s) to [1‑ε, 1+ε] inside the objective, which removes the incentive to move the policy far from π_old.
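For the PPO‑penalty variant, the weight adjustment can be sketched as below. The function name is hypothetical. The tolerance factor 1.5 and the halving/doubling rule follow the heuristic described in the original PPO paper; other implementations may use different values.

```python
def update_kl_coefficient(beta, observed_kl, target_kl):
    """Adapt the KL penalty weight after a policy update (heuristic sketch)."""
    if observed_kl < target_kl / 1.5:
        return beta / 2.0    # policy barely moved: relax the penalty
    if observed_kl > target_kl * 1.5:
        return beta * 2.0    # policy moved too far: tighten the penalty
    return beta              # within tolerance: leave the weight unchanged

beta = 1.0
beta = update_kl_coefficient(beta, observed_kl=0.05, target_kl=0.01)   # doubled
```

The penalized objective then takes the form E[r·A] − β·KL(π_old ‖ π_θ), with β carried over from one update to the next.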

Typical PPO‑clip implementation (PaddlePaddle syntax):

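# Probability ratio r = π_θ(a|s) / π_old(a|s), computed from log-probabilities; the old ones are detached so no gradient flows through them
ratios = paddle.exp(cur_batch_log_probs - batch_log_probs.detach())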
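The line above computes only the probability ratio. The rest of the clipped surrogate can be sketched in plain Python as follows; this is an illustrative stand‑in under stated assumptions, not the PaddlePaddle code from the original article. The function name, the default ε = 0.2, and the example numbers are all hypothetical.

```python
import math

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                        # r = pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)  # clip r to [1-eps, 1+eps]
        # Taking the minimum keeps the more pessimistic of the two estimates.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

objective = ppo_clip_objective(
    new_log_probs=[math.log(0.6), math.log(0.2)],   # ratios 1.5 and 0.5
    old_log_probs=[math.log(0.4), math.log(0.4)],
    advantages=[1.0, -1.0],
)
print(objective)
```

In the example, the first sample has a positive advantage and a ratio above 1+ε, so its contribution is capped at 1.2. The second has a negative advantage and a ratio below 1‑ε, so its contribution is held at −0.8. In a training loop the negative of this quantity would typically be minimized as the policy loss.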