Reinforcement Learning Tutorial Part 1: Core Concepts Explained

This article introduces the fundamental concepts of reinforcement learning, covering the agent‑environment interaction, key terminology, reward structures, task types, policies, value functions, the Bellman equations, and how optimal strategies are derived and approximated in practice.


Reinforcement Learning Framework

Reinforcement learning defines an agent that interacts stepwise with an environment. At each step the agent selects an action, receives a numeric reward, and observes the next state. This interaction generates tuples (s, a, r, s′) that are used to learn a policy π, a mapping from states to probability distributions over actions, with the goal of maximizing the expected cumulative reward.
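The loop below is a minimal sketch of this interaction. The toy `Env` class and the random policy are assumptions invented here for illustration; the `reset()`/`step(action)` shape simply mirrors the convention popularized by Gym/Gymnasium.

```python
import random

class Env:
    """Toy environment: the state is a counter; the episode ends at 5."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += action                 # action in {0, 1}
        reward = 1.0 if self.state == 5 else 0.0
        done = self.state >= 5
        return self.state, reward, done

env = Env()
s = env.reset()
trajectory = []                              # collected (s, a, r, s') tuples
done = False
while not done:
    a = random.choice([0, 1])                # a random policy pi
    s_next, r, done = env.step(a)
    trajectory.append((s, a, r, s_next))     # experience used for learning
    s = s_next
print(trajectory)
```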

Reward Types

The cumulative reward (return) from time step t to the final step T is Gₜ = Σ_{k=t}^{T} Rₖ. In practice a discounted return is used: Gₜ = Σ_{k=t}^{T} γ^{k−t} Rₖ, where the discount factor γ satisfies 0 ≤ γ ≤ 1 and weights immediate rewards more heavily than distant ones.
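As a quick illustration, the helper below computes the discounted return for t = 0 from a recorded reward sequence; the reward values and γ are arbitrary assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_{k=t}^{T} gamma^(k-t) * R_k, evaluated at t = 0."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# 0.9^2 * 1 + 0.9^4 * 5 ≈ 4.09
print(discounted_return([0, 0, 1, 0, 5], gamma=0.9))
```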

Task Types

Episodic tasks consist of independent episodes that start from a state sampled from an initial‑state distribution and terminate in a final state with a terminal reward. Continuing tasks have no terminal state; over an infinite horizon the undiscounted return can grow without bound, which is why the discounted return with γ < 1 is used.
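A one‑line check of why discounting fixes the infinite‑horizon case: with a constant reward of 1 per step, the undiscounted return diverges, while the discounted return converges to 1/(1 − γ). The choice γ = 0.9 is an assumption for illustration.

```python
gamma = 0.9
G = sum(gamma**k for k in range(10_000))  # truncation of the infinite sum
print(G, 1 / (1 - gamma))                 # both print ~10.0
```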

Policy and Value Functions

A policy π(a|s) gives the probability of selecting action a in state s. The state‑value function v_π(s) = E_π[Gₜ | Sₜ = s] estimates the expected discounted return when starting from s and following π. The action‑value function q_π(s,a) = E_π[Gₜ | Sₜ = s, Aₜ = a] estimates the expected return after taking action a in state s and thereafter following π.

Example: a 3×3 maze where the agent starts at A1 and must reach the terminal state C1. Cell A3 contains a large reward; cells B1 and C3 are walls. A random policy selects each allowed move with equal probability. The resulting V‑values are higher near the high‑reward cell A3, while the terminal state’s V‑value is 0.
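The sketch below reproduces this maze with iterative policy evaluation under the uniform random policy. The reward of +10 for entering A3, the zero reward elsewhere, and γ = 0.9 are assumptions; the article does not specify the magnitudes.

```python
GAMMA = 0.9
ROWS, COLS = "ABC", "123"
WALLS = {"B1", "C3"}
TERMINAL = "C1"
REWARD = {"A3": 10.0}                        # reward for *entering* a cell
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def neighbors(cell):
    """Legal successor cells: stay inside the grid, never enter a wall."""
    r, c = ROWS.index(cell[0]), COLS.index(cell[1])
    out = []
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3 and ROWS[nr] + COLS[nc] not in WALLS:
            out.append(ROWS[nr] + COLS[nc])
    return out

states = [r + c for r in ROWS for c in COLS if r + c not in WALLS]
V = {s: 0.0 for s in states}                 # terminal C1 stays at 0

for _ in range(200):                         # sweep until values settle
    for s in states:
        if s == TERMINAL:
            continue
        succ = neighbors(s)
        # Bellman expectation backup under the uniform random policy
        V[s] = sum(REWARD.get(s2, 0.0) + GAMMA * V[s2] for s2 in succ) / len(succ)

for r in ROWS:
    print(" ".join("  WALL" if r + c in WALLS else f"{V[r + c]:6.2f}" for c in COLS))
```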

Q(s,a) differs from V(s′) because Q includes the immediate reward for the transition out of s, i.e. Q(s,a) = E[Rₜ + γ V(s′)], whereas V(s′) starts counting rewards from the next step.
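In model‑based terms the relation is Q(s,a) = Σ_{s′,r} p(s′,r|s,a) [r + γ V(s′)], which the one‑liner below implements. The `p(s, a)` interface returning (probability, next state, reward) triples is an assumption for illustration.

```python
def q_from_v(s, a, V, p, gamma=0.9):
    """Q(s,a) = sum_{s',r} p(s',r|s,a) * (r + gamma * V(s'))."""
    return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))
```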

Bellman Equations

The Bellman equation for the state‑value function expresses the value of a state as the expected immediate reward plus the discounted value of the next state:

V(s) = Σ_a π(a|s) Σ_{s′,r} p(s′,r|s,a) [r + γ V(s′)]

The analogous recursion for the action‑value function is:

Q(s,a) = Σ_{s′,r} p(s′,r|s,a) [r + γ Σ_{a′} π(a′|s′) Q(s′,a′)]
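Here is a sketch of one synchronous sweep of each backup over a tabular model. The `policy[s][a] = π(a|s)` and `model[s][a] = [(prob, next_state, reward), ...]` interfaces are assumptions, not a standard API; states absent from the tables (terminals) are read as value 0.

```python
def bellman_v_backup(V, policy, model, gamma=0.9):
    """One sweep of V(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma V(s')]."""
    return {s: sum(pi_a * sum(p * (r + gamma * V.get(s2, 0.0))
                              for p, s2, r in model[s][a])
                   for a, pi_a in policy[s].items())
            for s in model}

def bellman_q_backup(Q, policy, model, gamma=0.9):
    """One sweep of Q(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma sum_a' pi(a'|s') Q(s',a')]."""
    return {(s, a): sum(p * (r + gamma * sum(pi_a2 * Q.get((s2, a2), 0.0)
                                             for a2, pi_a2 in policy.get(s2, {}).items()))
                        for p, s2, r in model[s][a])
            for s in model for a in model[s]}
```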

Optimal Policy

Policy π₁ dominates π₂ if for every state s, V^{π₁}(s) ≥ V^{π₂}(s). An optimal policy π* dominates all others and is associated with optimal value functions V* and Q*. The Bellman optimality equations replace the policy term with a max operator:

V*(s) = max_a Σ_{s′,r} p(s′,r|s,a) [r + γ V*(s′)]

Q*(s,a) = Σ_{s′,r} p(s′,r|s,a) [r + γ max_{a′} Q*(s′,a′)]
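Value iteration applies the V* backup repeatedly until the values stop changing, then reads off a greedy policy. This sketch assumes the same `model[s][a]` interface as above, with terminal states simply absent from the model (their value defaults to 0).

```python
from collections import defaultdict

def value_iteration(model, gamma=0.9, tol=1e-8):
    """model[s][a] -> list of (prob, next_state, reward); terminals need no entry."""
    V = defaultdict(float)                   # absent (terminal) states read as 0
    while True:
        delta = 0.0
        for s in model:
            # Bellman optimality backup: max over actions of the expected target
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
                        for a in model[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Greedy extraction: pi*(s) = argmax_a sum_{s',r} p(s',r|s,a)[r + gamma V*(s')]
    pi = {s: max(model[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in model[s][a]))
          for s in model}
    return dict(V), pi
```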

Solving these equations exactly is infeasible for large state spaces; practical RL algorithms therefore approximate optimal policies, typically from sampled experience and with far fewer computational resources.
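One example of such an approximation, sketched under assumed hyperparameters and the toy environment interface from earlier, is tabular Q‑learning: it replaces the expectation in the Bellman optimality equation with individual sampled transitions, so no explicit model p(s′,r|s,a) is needed.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                   # Q[(s, a)], unseen pairs default to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:    # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # Sampled Bellman optimality target: r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```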

Conclusion

The agent learns from experience by iteratively improving its policy based on the Bellman equations for V and Q. While exact optimal value functions are rarely obtainable in practice, approximation methods achieve strong performance on real‑world problems.

Tags: reinforcement learning, policy, Markov decision process, Bellman equation, value function, optimal policy, reward
Written by AI Algorithm Path, a public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.