
What Is Reinforcement Learning? Core Concepts and Key Algorithms Explained

This article introduces reinforcement learning, compares it with supervised and unsupervised learning, explains its components and Markov Decision Processes, and reviews fundamental model‑free and model‑based algorithms such as Q‑Learning, SARSA, TD learning, and exploration strategies.


Reinforcement Learning

Machine learning (ML) is commonly divided into three branches: supervised learning, unsupervised learning, and reinforcement learning (RL).

Supervised Learning (SL): learns to map inputs to correct outputs from labeled training data.

Unsupervised Learning (UL): discovers patterns in data without pre‑existing labels.

Reinforcement Learning (RL): an agent interacts with an environment to maximize cumulative reward.

RL resembles how a baby learns: positive reinforcement encourages a behavior, negative reinforcement discourages it. The main difference between supervised and reinforcement learning is that the former learns from a static labeled dataset, while the latter learns through trial and error while interacting with its environment.

Before exploring RL algorithms, we introduce its basic components.

Agent: program that perceives the environment and takes actions.

Environment: the real or virtual world the agent inhabits.

State: the situation the agent is in.

Action: possible moves the agent can take in a given state.

Reward: feedback that may depend on the action, the state, or both.

The output of RL is an optimal policy, which specifies the action to take in each state; supervised learning, by contrast, yields a model that produces a single prediction for each input.

The goal of RL is to maximize total cumulative reward.
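The agent-environment loop implied by these components can be sketched in a few lines of Python. The `ChainEnv` below is a hypothetical toy environment invented for illustration, not part of the article:

```python
class ChainEnv:
    """Hypothetical toy environment: states 0..3; action +1 moves right;
    reaching state 3 yields reward 1 and ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = min(self.state + action, 3)
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """One agent-environment episode, accumulating total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent acts on the current state
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # the quantity RL seeks to maximize
        if done:
            break
    return total_reward

total = run_episode(ChainEnv(), lambda s: 1)  # a fixed "always move right" policy
```

An RL algorithm's job is to improve the `policy` function between or during such episodes.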

Markov Decision Process (MDP)

RL problems can be modeled as sequential decision problems using an MDP.

States are denoted S, actions A, and transition probabilities T(S, A, S'). For simplicity we assume deterministic transitions.

Example grid (figure not shown) with reward cells: yellow (+1), red (‑1), purple (+100). The optimal route A2‑A1‑A1 yields a total reward of +103. In practice, a discount factor is applied so that future rewards count for less than immediate ones.
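As a sketch of how discounting works, the small helper below (not from the original article) weights the reward received at step t by gamma to the power t:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of a reward sequence, with the reward at step t weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 1.0 rewards simply add up; smaller gamma shrinks later rewards.
```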

The general expected utility equation (Bellman equation) accounts for transition probabilities, rewards, and discounting, and can be extended to non‑deterministic transitions, unknown optimal actions, reward functions depending on state‑action‑next‑state triples, and potentially infinite horizons.
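One common statement of this equation, assuming a state reward R(s), transition model T(s, a, s'), and discount factor gamma (notation follows the components above; the exact form varies by textbook), is:

```latex
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')
```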

Model‑Free vs. Model‑Based Reinforcement Learning

Model‑based RL uses known transition and reward functions (e.g., value iteration, policy iteration).

Model‑free RL learns policies directly without explicit models (e.g., Q‑Learning, policy search).
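As a sketch of the model-based side, the value-iteration routine below assumes the transition model T and reward function R are fully known; the dictionary layout is illustrative:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Solve a known MDP. T[s][a] is a list of (prob, next_state) pairs,
    R[s] is the reward for being in state s."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected utility over all actions
            best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in actions)
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:  # stop when utilities have stabilized
            return U
```

Model-free methods such as Q-Learning skip this step entirely: they never build T or R.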

Offline Learning vs. Online Learning

Offline Learning

Also called passive learning; the agent learns a utility function from a fixed policy while the environment’s transition and reward functions are unknown.

Examples include direct utility estimation, adaptive dynamic programming (ADP), and temporal‑difference (TD) learning; value iteration and policy iteration are closely related planning methods that assume a known model.

Online Learning

Also called active learning; the agent alternates between exploration (trying new actions to improve its knowledge of the environment and update the policy) and exploitation (following the current policy to collect reward).

Examples include exploration strategies, Q‑Learning, SARSA, and others.

Direct Utility Estimation

Model‑free offline method where the agent follows a fixed policy and estimates the expected total reward from each state.

Advantage: with infinite trials the sample average converges to the true expected reward.

Disadvantage: learning occurs only after each trial ends, leading to slow convergence.
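A minimal sketch of the idea, assuming each completed trial is recorded as a list of (state, reward) pairs (the data layout is an assumption for illustration):

```python
from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed return-to-go from each state across completed trials."""
    returns = defaultdict(list)
    for trial in trials:
        G = 0.0
        # walk the trial backwards, accumulating the discounted return-to-go
        for state, reward in reversed(trial):
            G = reward + gamma * G
            returns[state].append(G)
    # sample average converges to the true expected return with enough trials
    return {s: sum(v) / len(v) for s, v in returns.items()}
```

Note that nothing updates until a trial ends, which is exactly the slow-convergence drawback mentioned above.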

Adaptive Dynamic Programming

Model‑based offline method where the agent learns transition and reward models from experience and then solves the MDP.

Advantage: easy to learn models in fully observable environments.

Disadvantage: scaling to large state spaces is difficult due to the need for many trials and many equations.
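A minimal sketch of the model-learning half of ADP, assuming experience arrives as (s, a, r, s') tuples; the learned T and R would then be handed to a planner such as value iteration:

```python
from collections import defaultdict

def learn_model(experience):
    """Estimate T(s, a, s') and R(s) by counting observed (s, a, r, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1   # tally observed transitions
        reward_sum[s] += r
        visits[s] += 1
    # relative frequencies approximate the transition probabilities
    T = {sa: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
         for sa, nexts in counts.items()}
    R = {s: reward_sum[s] / visits[s] for s in visits}
    return T, R
```

The scaling problem is visible here: every state-action pair needs enough visits for its frequency estimates to be trustworthy.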

Temporal‑Difference (TD) Learning

Model‑free offline method that updates the utility function after each transition using a learning rate.

TD learning updates more frequently than direct utility estimation, offering higher efficiency, while not requiring a model of transitions or rewards.
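The per-transition update can be sketched in one function; learning rate and discount values here are illustrative defaults:

```python
def td_update(U, s, r, s2, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge U(s) toward the observed target r + gamma * U(s')."""
    u_s = U.get(s, 0.0)  # unseen states start at utility 0
    U[s] = u_s + alpha * (r + gamma * U.get(s2, 0.0) - u_s)
```

Because the target uses only the single observed transition, no transition or reward model is ever needed.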

Exploration

Model‑based online method that adds a curiosity function to encourage the agent to visit states with high uncertainty.

Advantage: quickly converges to a zero‑policy‑loss (optimal) strategy.

Disadvantage: convergence of utility estimates may be slower than with policy‑based methods.
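One common curiosity device of this kind replaces a state's estimated utility with an optimistic value until the state has been tried a minimum number of times; the constants below are illustrative:

```python
def exploration_value(u, n, optimistic_reward=100.0, min_visits=5):
    """Optimistic exploration function f(u, n): under-visited states look
    highly rewarding, so the agent is drawn toward uncertain regions."""
    return optimistic_reward if n < min_visits else u
```

Once a state has been visited `min_visits` times, the agent falls back on its learned estimate and exploitation takes over.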

Q‑Learning

Model‑free, online, off‑policy TD algorithm that learns a state‑action value function Q(s, a), the expected return of taking action a in state s.

Advantage: applicable to complex domains without needing a model.

Disadvantage: struggles when rewards are sparse and learns slower than ADP.
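The core update can be sketched as follows; note the max over next-state actions, which is what makes Q-Learning off-policy:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the greedy next action,
    regardless of which action the behavior policy will actually take."""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # unseen state-action pairs default to 0
q_learning_update(Q, 0, 'right', 1.0, 1, ['left', 'right'])
```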

SARSA

Model‑free, online, on‑policy TD algorithm (State‑Action‑Reward‑State‑Action) that updates Q using the action actually taken in the next state.

Advantage: works well when the policy is controlled by another agent or program.

Disadvantage: less flexible than Q‑Learning and may learn more slowly.
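The SARSA update differs from Q-Learning in exactly one place: it bootstraps from the next action actually chosen, not from the greedy one:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy TD update: a2 is the action the current policy
    actually selected in state s2."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

Q = defaultdict(float)  # unseen state-action pairs default to 0
sarsa_update(Q, 0, 'right', 1.0, 1, 'left')
```

Because the update follows the behavior policy, SARSA evaluates the policy being executed, including its exploratory moves.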

The six algorithms above provide a foundation for understanding reinforcement learning; more advanced methods include Deep Q‑Network (DQN) and Deep Deterministic Policy Gradient (DDPG).

References:

Fundamentals of Reinforcement Learning and 6 Basic Algorithms, https://mp.weixin.qq.com/s/typiXCKrM1Z1uCsCjxcL9A

6 Reinforcement Learning Algorithms Explained, https://towardsdatascience.com/6-reinforcement-learning-algorithms-explained-237a79dbd8e

Tags: Machine Learning, reinforcement learning, Markov decision process, Q-learning, SARSA
Written by Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
