Understanding Policy Evaluation and Improvement in Reinforcement Learning

This article explains how to solve Bellman equations, use iterative policy‑evaluation methods, apply the policy‑improvement theorem, and combine both steps in policy iteration, value iteration, and asynchronous variants, illustrated with a 5‑state example and a 4×4 gridworld.

Introduction

Reinforcement learning (RL) studies agents that learn optimal policies by interacting with an environment and receiving rewards. Unlike methods that fit a fixed dataset, RL must reason explicitly about state‑transition dynamics and reward structure, and that analysis is the subject of this article.

Solving the Bellman Equation

Assuming the environment dynamics are fully known for a set of |S| states, the Bellman equation for the state‑value function V under a fixed policy π forms a linear system of |S| equations in |S| unknowns (or |S|×|A| equations for the action‑value function Q). Solving this system yields the exact value of every state under π.
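
As a concrete sketch of that direct solution, the snippet below builds a hypothetical four‑state MDP under a fixed policy (the transition matrix, rewards, and discount factor are invented for illustration and are not the values of the example below) and solves v = r_π + γ P_π v as one linear system with NumPy.

```python
import numpy as np

# Hypothetical 4-state MDP under a fixed policy (the terminal state T is
# excluded because v(T) = 0): P_pi[i, j] is the probability of moving from
# state i to state j in one step, r_pi[i] is the expected immediate reward
# received when leaving state i.
P_pi = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.7, 0.0],   # remaining 0.3 probability mass goes to T
    [0.2, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0],   # all probability mass goes to T
])
r_pi = np.array([-1.0, -1.0, -1.0, 5.0])
gamma = 0.9

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
# => (I - gamma * P_pi) v = r_pi, an |S| x |S| linear system.
v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)
```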

Example: A toy environment with five states (including a terminal state T) is shown in the figure, where blue numbers denote transition probabilities and red numbers denote rewards. Because v(T) = 0, only four equations need to be solved.

Policy Evaluation

Directly solving the linear system has cubic complexity O(|S|^3), which is impractical for large state spaces. Instead, the iterative policy‑evaluation algorithm proceeds as follows:

Randomly initialise V(s) for all non‑terminal states; the value of any terminal state is fixed at 0.

Update each non‑terminal state using the Bellman backup.

Repeat step 2 until the maximum change across states is below a threshold θ.

If the state space is finite and the discount factor γ < 1 (or every episode is guaranteed to terminate), the iterative estimates converge to the true V values for the given policy π.

The update of a single state is called an expected update because it averages over all actions and successor states rather than sampling one of them. A full pass over all states is termed a sweep. The same idea can be applied to compute Q values.
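
The three steps above can be written as a short, self‑contained sketch. The transition table P, the 50/50 random policy, and the small three‑state layout are assumptions made purely for illustration; the loop uses the single‑array style described in the next subsection.

```python
import numpy as np

# P[(s, a)] is a list of (probability, next_state, reward) triples;
# policy[s][a] is the probability of taking action a in state s.
gamma, theta = 0.9, 1e-6
states = [0, 1, 2]            # state 3 is terminal, so V[3] stays 0
actions = [0, 1]
P = {
    (0, 0): [(1.0, 1, -1.0)],
    (0, 1): [(0.5, 1, -1.0), (0.5, 2, -1.0)],
    (1, 0): [(1.0, 2, -1.0)],
    (1, 1): [(1.0, 3,  0.0)],
    (2, 0): [(1.0, 3,  5.0)],
    (2, 1): [(1.0, 0, -1.0)],
}
policy = {s: {a: 0.5 for a in actions} for s in states}

V = np.zeros(4)               # step 1: initialise (zeros instead of random values)
while True:
    delta = 0.0
    for s in states:          # one sweep over all non-terminal states
        v_old = V[s]
        # step 2: expected update, averaging over actions and successor states
        V[s] = sum(policy[s][a] * p * (r + gamma * V[s2])
                   for a in actions for (p, s2, r) in P[(s, a)])
        delta = max(delta, abs(v_old - V[s]))
    if delta < theta:         # step 3: stop when the largest change is below theta
        break
print(V)
```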

Update Variants

Two implementation styles exist:

Double‑array method: new values are written to a separate array while the old values are read from the original array, so every update in the sweep uses the values from the previous sweep.

Single‑array method: new values overwrite the old ones immediately, so later updates in the same sweep already use the most recent estimates. In practice the single‑array (in‑place) approach usually converges faster.

The order in which states are updated is not prescribed, but it can significantly affect convergence speed.
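
The two styles differ only in where the backup reads its inputs from, as the sketch below illustrates; bellman_backup stands in for the expected update described earlier and is a hypothetical placeholder, not code from the article.

```python
# Double-array sweep: every update in this pass reads from the old array V
# and writes into a fresh array V_new, so update order within the sweep
# does not matter.
def sweep_two_arrays(V, states, bellman_backup):
    V_new = V.copy()
    for s in states:
        V_new[s] = bellman_backup(s, V)        # always reads old values
    return V_new

# Single-array sweep: each update overwrites V immediately, so later states
# in the same sweep already see the freshest estimates.
def sweep_in_place(V, states, bellman_backup):
    for s in states:
        V[s] = bellman_backup(s, V)            # may read already-updated values
    return V
```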

Policy Improvement

Using the evaluated V (or Q) function, the policy can be greedily improved: in each state, select the action that yields the highest expected return. In the 4×4 gridworld example, the random policy gives expected returns of –14 or –20 from state B3; switching to the action that leads to A3 or B4 (the –14 outcome) improves the policy.

Greedy selection of the action with the highest expected return therefore yields a policy at least as good as the original one, as the following theorem guarantees.
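
A minimal sketch of this greedy step, assuming the same hypothetical transition table P and value estimates V used in the evaluation sketch above (the names and layout are illustrative, not code from the article):

```python
# For each state, pick the action with the highest one-step expected return
# computed from the current value estimates.
def greedy_policy(states, actions, P, V, gamma=0.9):
    new_policy = {}
    for s in states:
        # expected return of each action: q(s, a) = sum_s' p * (r + gamma * V[s'])
        q = {a: sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[(s, a)])
             for a in actions}
        new_policy[s] = max(q, key=q.get)   # deterministic greedy choice
    return new_policy
```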

The policy‑improvement theorem (from Sutton & Barto) states that if a new deterministic policy π′ chooses, in every state, an action that is at least as good as the one taken by π, then π′ is guaranteed to be no worse than π. Formally, if for all s ∈ S

q_π(s, π′(s)) ≥ v_π(s),

then v_π′(s) ≥ v_π(s) for all s ∈ S.

Thus, by repeatedly applying greedy improvement after each evaluation, the policy keeps improving until no further improvement is possible, at which point it is optimal.

Policy Iteration

Starting from any policy π, evaluate its value function V, greedily improve it to π′, evaluate V′, and repeat. Because a finite MDP admits only finitely many deterministic policies, this alternating process converges to the optimal policy and the optimal value function in a finite number of iterations.
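
A compact sketch of this loop, assuming hypothetical evaluate() and greedy_policy() helpers along the lines of the sketches above (the names and signatures are illustrative):

```python
# Alternate full evaluation and greedy improvement until the greedy policy
# no longer changes, at which point it is optimal.
def policy_iteration(states, actions, P, evaluate, greedy_policy):
    policy = {s: actions[0] for s in states}   # start from an arbitrary policy
    while True:
        V = evaluate(policy)                                # policy evaluation
        new_policy = greedy_policy(states, actions, P, V)   # policy improvement
        if new_policy == policy:                            # stable => optimal
            return policy, V
        policy = new_policy
```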

Value Iteration

Because full policy evaluation can be costly, value iteration interleaves a single evaluation sweep with the greedy improvement step, which amounts to applying the Bellman optimality backup to every state. It converges to the same optimal solution while avoiding a complete evaluation in every iteration, and it can be stopped once the change in V over a sweep falls below a threshold.
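
A minimal value‑iteration sketch under the same assumed MDP layout as earlier; the max over actions replaces the policy‑weighted expectation, fusing one evaluation sweep with greedy improvement.

```python
import numpy as np

def value_iteration(states, actions, P, n_states, gamma=0.9, theta=1e-6):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman optimality backup: best one-step expected return
            V[s] = max(sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[(s, a)])
                       for a in actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:       # stop once a sweep barely changes V
            break
    # extract the greedy policy from the converged values
    policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                                for (p, s2, r) in P[(s, a)]))
              for s in states}
    return V, policy
```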

Asynchronous Value Iteration

When the state space is huge, updating all states each sweep is expensive. Asynchronous versions update only a subset of states (or a single state) in each iteration, possibly revisiting some states more often than others. Convergence is still guaranteed as long as every state is updated infinitely often, though in practice a few passes often suffice.
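
A sketch of one asynchronous variant that backs up a single randomly chosen state per iteration; the uniform random selection and the fixed update budget are assumptions made for illustration, and any schedule that keeps visiting every state would also converge.

```python
import random
import numpy as np

def async_value_iteration(states, actions, P, n_states, gamma=0.9, n_updates=10000):
    V = np.zeros(n_states)
    for _ in range(n_updates):
        s = random.choice(states)           # update a single state per iteration
        V[s] = max(sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[(s, a)])
                   for a in actions)
    return V
```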

Generalized Policy Iteration (GPI)

Policy iteration, value iteration, and their asynchronous variants are all instances of Generalized Policy Iteration: letting a policy‑evaluation process and a policy‑improvement process interact repeatedly, regardless of the granularity of each step. Sutton & Barto illustrate GPI with a geometric diagram in which the value‑function line and the greedy‑policy line converge to their intersection, representing the optimal value function and the optimal policy.

Conclusion

The article reviewed the core ideas behind policy evaluation and policy improvement, showing how their interaction leads to optimal solutions. While the presented methods assume full knowledge of transition probabilities, many modern RL algorithms build on the Generalized Policy Iteration framework and employ asynchronous updates to balance computational cost and convergence speed in large‑scale problems.

Tags: reinforcement learning, policy evaluation, value iteration, Bellman equation, generalized policy iteration, gridworld, policy improvement