Reinforcement Learning Tutorial 7: Introducing Value Function Approximation Methods

This article explains why tabular reinforcement‑learning methods scale poorly, introduces supervised‑learning‑based value‑function approximation using a parameterized vector w, discusses loss design, stochastic‑gradient updates, bootstrapping, semi‑gradient techniques, and linear function approximation, and summarizes practical implications.


Introduction

Reinforcement learning (RL) studies how an agent can learn optimal policies by interacting with an environment and receiving rewards. Unlike other machine‑learning fields, RL must handle sequential decision making in potentially unknown, complex settings.

Limitations of Tabular Methods

Tabular RL assumes that all states and actions can be enumerated, allowing the value function V or Q to be stored as a table. This approach quickly becomes infeasible when the state-action space grows. For example, a 128×128 RGB image with 8-bit color channels can take 256^(128·128·3) distinct values, an astronomically large number of states that no table could ever hold. Most real-world RL problems involve similarly massive state spaces, making tabular value estimation impractical.
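A quick back-of-the-envelope check of that scale, sketched in Python (the resolution and 8-bit channel depth are just the example values from above):

```python
# Rough arithmetic for the number of distinct 8-bit 128x128 RGB images.
import math

pixels = 128 * 128 * 3                      # one value per color channel
values_per_channel = 256                    # 8-bit depth
log10_states = pixels * math.log10(values_per_channel)
print(f"about 10^{log10_states:.0f} possible states")   # ~10^118370: no table can hold this
```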

Value Function Approximation via Supervised Learning

To overcome scalability issues, the article proposes approximating the value function with a parameterized weight vector w. The approximated value v̂(s,w) is a function of the state s and the weight vector w; in the simplest case it is a dot product between w and a feature vector x(s). Supervised-learning algorithms such as linear regression, decision trees, or neural networks can be used to learn w, because they generalize from a training set (X₁, y₁) to unseen samples X₂.
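As a minimal sketch of what v̂(s,w) = w·x(s) looks like in code (the feature map below is a made-up stand-in; real features are domain-specific):

```python
import numpy as np

def x(state: int, num_features: int = 4) -> np.ndarray:
    """Hypothetical feature map: deterministic pseudo-features derived from the state id."""
    rng = np.random.default_rng(state)
    return rng.standard_normal(num_features)

def v_hat(state: int, w: np.ndarray) -> float:
    """Approximate value: dot product of the weight vector and the state's features."""
    return float(w @ x(state, w.shape[0]))

w = np.zeros(4)          # parameter vector to be learned
print(v_hat(7, w))       # 0.0 before any training
```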

Why use supervised learning for v̂? Because, after training on known state-value pairs, the model can predict values for previously unseen states, which addresses RL's generalization problem.

Challenges Specific to RL

Computational complexity: tabular methods require enumerating every state, so memory and computation grow exponentially with the dimensionality of the state.

Lack of true target values: the environment never reveals the true state values, so a conventional supervised loss, which needs ground-truth labels, cannot be formed directly.

These challenges motivate the design of a surrogate loss that incorporates a state‑distribution weighting function μ(s), which reflects how often the agent visits each state.

Loss Function and Objective

The weighted mean-squared error (MSE) is adopted as the objective: VE(w) = Σ_s μ(s)·(v_π(s) − v̂(s,w))², where v_π(s) is the true value of state s under the policy being evaluated and μ(s) weights each state by how often the agent visits it. The ultimate goal is to find the weight vector w that minimizes VE(w). In practice, convergence to a global optimum is rare; most algorithms guarantee convergence only to a local optimum w*, which is often sufficient.
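For concreteness, here is a small sketch of evaluating VE(w) for a linear v̂, assuming we are handed the visit distribution μ, the true values v_π (unknown in practice), and a feature matrix X with one row per state:

```python
import numpy as np

def ve(w: np.ndarray, X: np.ndarray, v_pi: np.ndarray, mu: np.ndarray) -> float:
    """Weighted MSE objective: sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2 for linear v_hat."""
    v_hat = X @ w                                  # approximate values of all states at once
    return float(np.sum(mu * (v_pi - v_hat) ** 2))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy features for 3 states
v_pi = np.array([1.0, 2.0, 3.0])                      # true values (never observed in RL)
mu = np.array([0.5, 0.3, 0.2])                        # visit distribution, sums to 1
print(ve(np.zeros(2), X, v_pi, mu))                   # 3.5 for the all-zero weight vector
```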

Stochastic Gradient Descent (SGD)

Assuming the loss is differentiable, the SGD update for a single sampled state Sₜ at iteration t with current weights wₜ follows directly from the MSE loss: wₜ₊₁ = wₜ + α·[v_π(Sₜ) − v̂(Sₜ,wₜ)]·∇v̂(Sₜ,wₜ), where α is the step size. The update adjusts w in the direction that reduces the error for the sampled state.

Because true state values are unavailable, the article introduces a surrogate target Uₜ to replace the unknown v_π(Sₜ). This surrogate can be a Monte-Carlo return Gₜ or a bootstrapped estimate.
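A minimal sketch of one such update, with a Monte-Carlo return Gₜ standing in as the surrogate target (the feature table, step size, and return below are illustrative values, not from the article):

```python
import numpy as np

def sgd_step(w, v_hat, grad_v_hat, s, target, alpha=0.1):
    """One stochastic-gradient update: w <- w + alpha * (U_t - v_hat(s, w)) * grad v_hat(s, w)."""
    error = target - v_hat(s, w)
    return w + alpha * error * grad_v_hat(s, w)

# Instantiated for a linear approximator, where the gradient is just the feature vector.
features = {0: np.array([1.0, 0.5, 0.0])}             # toy feature table
v_hat = lambda s, w: w @ features[s]
grad_v_hat = lambda s, w: features[s]

w = np.zeros(3)
w = sgd_step(w, v_hat, grad_v_hat, s=0, target=2.0)   # G_t = 2.0 observed from state 0
print(w)                                               # [0.2 0.1 0. ]
```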

Bootstrapping and Semi‑Gradient Methods

Bootstrapping replaces the full return with the one-step reward plus the discounted value estimate of the next state, enabling faster, online learning. However, two difficulties arise:

The target is biased, because it relies on the current value estimate, which starts from an arbitrary (randomly initialized) weight vector.

The target depends on the weight vector itself, which complicates the gradient computation.

Semi-gradient methods handle the second difficulty by treating the target as a constant with respect to w during the gradient step; the bias from the initial estimates diminishes as the value estimates improve over training.
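One way to see the "target as a constant" idea concretely is with an autodiff library; the sketch below assumes PyTorch and uses torch.no_grad() (a .detach() on the target works equally well) so that no gradient flows through the bootstrapped target:

```python
import torch

w = torch.zeros(4, requires_grad=True)            # weight vector being learned
x_s, x_next = torch.rand(4), torch.rand(4)        # features of S_t and S_{t+1} (illustrative)
reward, gamma, alpha = 1.0, 0.99, 0.1

v_s = w @ x_s                                     # v_hat(S_t, w), tracked by autograd
with torch.no_grad():
    target = reward + gamma * (w @ x_next)        # bootstrapped target, held constant w.r.t. w

loss = (target - v_s) ** 2                        # differentiating this ignores the target
loss.backward()
with torch.no_grad():
    w -= alpha * w.grad                           # one semi-gradient descent step
    w.grad.zero_()
```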

Linear Function Approximation

When the value function is linear, v̂(s,w) = w·x(s), the gradient ∇v̂(s,w) is simply the feature vector x(s), and the update rule collapses to wₜ₊₁ = wₜ + α·[Uₜ − wₜ·x(Sₜ)]·x(Sₜ). Linear approximation is attractive because it turns the optimization into a convex problem with a unique global optimum. Besides SGD, ordinary least squares can also be applied.
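A sketch of the resulting linear semi-gradient TD(0) update (names and hyperparameters here are illustrative):

```python
import numpy as np

def linear_td0_step(w, x_s, x_next, reward, gamma=0.99, alpha=0.1, terminal=False):
    """w <- w + alpha * [R + gamma * w.x(S') - w.x(S)] * x(S); gradient of w.x(S) is x(S)."""
    target = reward + (0.0 if terminal else gamma * (w @ x_next))
    td_error = target - w @ x_s
    return w + alpha * td_error * x_s

w = np.zeros(2)
w = linear_td0_step(w, x_s=np.array([1.0, 0.0]), x_next=np.array([0.0, 1.0]), reward=1.0)
print(w)                                          # [0.1 0. ] after a single transition
```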

Gradient Monte Carlo and Temporal‑Difference (TD) Methods

Gradient Monte Carlo uses the full return Gₜ as the target, guaranteeing convergence to a local optimum; with linear function approximation, this local optimum is also the global optimum. Semi-gradient TD methods do not follow the true gradient and lack the same guarantee, but with linear approximation they converge to the TD fixed point, which lies near the global optimum, and in practice they typically converge faster.
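For comparison, a sketch of the gradient Monte Carlo update with linear features, assuming the episode data (feature vectors and rewards in time order) comes from elsewhere:

```python
import numpy as np

def gradient_mc_update(w, episode, gamma=0.99, alpha=0.01):
    """episode: list of (x_s, reward) pairs in time order, x_s being the features of S_t."""
    G = 0.0
    # Walk the episode backwards so each state's full return accumulates incrementally.
    for x_s, reward in reversed(episode):
        G = reward + gamma * G                   # Monte-Carlo return from this time step
        w = w + alpha * (G - w @ x_s) * x_s      # unbiased gradient step toward the return
    return w
```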

Conclusion

Tabular RL algorithms do not scale to large state spaces, motivating the use of value‑function approximation. By casting RL as a supervised‑learning problem, we can apply powerful function‑approximation techniques. Gradient Monte Carlo offers strong theoretical guarantees, but bootstrapped TD methods remain the practical choice due to their faster convergence.

Reference: Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2020), Chapter 9.

Tags: reinforcement learning, supervised learning, temporal-difference, gradient Monte Carlo, linear function approximation, value function approximation
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
