Demystifying MaxEnt Inverse Reinforcement Learning: Theory, Algorithms, and Practical Implementation

This article provides a comprehensive, step‑by‑step exploration of MaxEnt Inverse Reinforcement Learning, covering its statistical foundations, feature‑expectation matching, algorithmic details, deep extensions, and practical engineering considerations for complex decision‑making tasks.


Imitation Learning (IL) and Inverse Reinforcement Learning (IRL)

IL extracts a decision-making policy from expert demonstration trajectories. When the task environment is complex or open‑ended (e.g., autonomous driving, robotic manipulation), hand‑designing a controller or reward is impractical, so learning from demonstrations is the natural alternative. Two main IL approaches exist:

Behavior cloning: treat imitation as supervised learning, directly regressing expert actions from states.

Inverse reinforcement learning: first infer a reward function that explains the demonstrations, then solve a reinforcement‑learning problem using the learned reward.

This article focuses on the second approach, specifically the Maximum‑Entropy (MaxEnt) IRL algorithm.

Maximum‑Entropy Principle

Given limited information, the distribution with the highest Shannon entropy is the least biased estimate: it commits to nothing beyond what the data requires. In IRL this principle is used to select, among all trajectory distributions that match the expert's feature expectations, the maximum‑entropy one, resolving the ambiguity that many reward functions can explain the same demonstrations.

Formally, for a set of trajectories \(\tau\) the MaxEnt distribution is

P(\tau) = \frac{\exp(\omega^{\top}\Phi(\tau))}{Z(\omega)}

where \(\Phi(\tau)=\sum_{(s,a)\in\tau}\phi(s,a)\) is the cumulative feature vector of trajectory \(\tau\), \(\omega\) are the reward parameters, and \(Z(\omega)=\sum_{\tau}\exp(\omega^{\top}\Phi(\tau))\) is the partition function.
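
For intuition, \(P(\tau)\) can be computed directly for a small finite set of trajectories. A minimal numpy sketch, where the feature vectors and reward weights are made‑up toy values:

```python
import numpy as np

# Cumulative feature vectors Phi(tau) for three hypothetical trajectories.
Phi = np.array([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
])
omega = np.array([2.0, 1.0])  # toy reward parameters

scores = Phi @ omega                       # omega^T Phi(tau) per trajectory
scores -= scores.max()                     # shift for numerical stability
P = np.exp(scores) / np.exp(scores).sum()  # exp(.) / Z(omega)
print(P)  # higher cumulative reward -> exponentially more likely trajectory
```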

Feature Expectation Matching (Linear Reward Assumption)

Assume the expert maximises a linear reward \(R(s,a)=\omega^{\top}\phi(s,a)\). The goal is to find \(\omega\) such that the expected feature counts under the learned policy \(\pi\) match those of the expert demonstrations \(D\):

\mu_E = \frac{1}{|D|}\sum_{\tau\in D}\Phi(\tau)
\mu_\pi(\omega) = \mathbb{E}_{\tau\sim\pi_{\omega}}[\Phi(\tau)]
\text{Find }\omega\text{ s.t. }\mu_E = \mu_\pi(\omega)

Because the reward is linear in \(\omega\), a policy's expected return is \(\omega^{\top}\mu_\pi\), so matching feature expectations guarantees matching expected returns for every choice of \(\omega\). In practice the matching problem is cast as a max‑margin quadratic program (QP), analogous to an SVM, that separates the expert's feature expectation from those of candidate policies.
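
The expert side of the constraint is just an average over demonstrations. A minimal sketch, assuming trajectories are lists of (state, action) pairs and `phi` is a user‑supplied feature map (both hypothetical interfaces):

```python
import numpy as np

def empirical_feature_expectation(demos, phi, gamma=1.0):
    """mu_E: average (discounted) cumulative feature vector over expert demos.

    demos: list of trajectories, each a list of (state, action) pairs
    phi:   feature map phi(s, a) -> np.ndarray
    """
    total = None
    for tau in demos:
        acc = sum((gamma ** t) * phi(s, a) for t, (s, a) in enumerate(tau))
        total = acc if total is None else total + acc
    return total / len(demos)
```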

Algorithm (Linear Reward)

1. Initialize a random policy \(\pi_0\) and compute its feature expectation \(\mu(\pi_0)\).

2. Solve the QP to obtain reward parameters \(\omega_i\) that separate \(\mu_E\) from the feature expectations of all policies found so far.

3. If the margin \(t \le \epsilon\), stop; otherwise train a new policy \(\pi_i\) using the reward \(R(s,a)=\omega_i^{\top}\phi(s,a)\), compute \(\mu(\pi_i)\), add \(\pi_i\) to the policy pool, and return to step 2.

4. Optionally, a convex combination of policies in the pool can be used to better approximate \(\mu_E\) (see the sketch below).
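
The QP step can also be replaced by the closed‑form projection update discussed in the notes at the end of this article. A minimal numpy sketch of that variant, where `mu_of` is a hypothetical helper that trains a policy under the reward \(\omega^{\top}\phi\) and returns its feature expectation:

```python
import numpy as np

def projection_irl(mu_E, mu_of, eps=1e-3, max_iter=50):
    """Projection variant of feature-expectation matching (after Abbeel & Ng)."""
    # Feature expectation of an initial policy trained on a random reward.
    mu_bar = mu_of(np.random.randn(*mu_E.shape))
    omega = mu_E - mu_bar
    for _ in range(max_iter):
        omega = mu_E - mu_bar              # closed-form reward direction
        if np.linalg.norm(omega) <= eps:   # margin small: mu_E is matched
            break
        mu_i = mu_of(omega)                # train a new policy, get its mu
        d = mu_i - mu_bar                  # project mu_E onto segment [mu_bar, mu_i]
        mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / (d @ d) * d
    return omega
```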

Reverse Pass (Policy Training)

Given reward parameters \(\omega\), train a policy by solving a value‑based problem (in the tabular setting this is a soft backward recursion; for large state spaces a value‑based learner such as DQN can substitute). Define the state‑action partition function \(Z(s,a)\) recursively from the terminal states, with \(Z(s')=\sum_{a'}Z(s',a')\) and \(Z=1\) at terminal states:

Z(s,a) = \sum_{s'} P(s'\mid s,a)\,\exp(R(s,a))\,Z(s')

The resulting stochastic policy is

\pi(a\mid s) = \frac{Z(s,a)}{\sum_{a'} Z(s,a')}
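
A minimal tabular sketch of this backward recursion, assuming a transition tensor `P[s, a, s']`, a reward matrix `R[s, a]`, and a boolean terminal mask (all hypothetical inputs):

```python
import numpy as np

def backward_pass(P, R, terminal, n_iter=100):
    """Soft backward recursion yielding the MaxEnt stochastic policy."""
    Zs = terminal.astype(float)           # Z(s) = 1 at terminal states
    for _ in range(n_iter):
        # Z(s,a) = sum_s' P(s'|s,a) * exp(R(s,a)) * Z(s')
        Zsa = np.exp(R) * (P @ Zs)
        # Z(s) = sum_a Z(s,a); terminal states keep their unit mass
        Zs = Zsa.sum(axis=1) + terminal
    return Zsa / Zs[:, None]              # pi(a|s) = Z(s,a) / Z(s)
```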

Forward Pass (State Visitation Frequencies)

Compute the expected state visitation frequencies \(D(s)\) by propagating from the initial state distribution through the MDP using the current policy:

D_{t+1}(s') = \sum_{s}\sum_{a}\pi(a\mid s)\,P(s'\mid s,a)\,D_t(s)

with \(D_0\) given by the initial state distribution. Iterate over the horizon (or until convergence) and accumulate \(D(s)=\sum_t D_t(s)\); the gradient of the log‑likelihood is then the gap between expert and expected feature counts,

\nabla_{\omega} L = \mu_E - \sum_{s} D(s)\,\phi(s)
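
A matching sketch of the forward pass, under the same tabular assumptions as the backward‑pass code above:

```python
import numpy as np

def forward_pass(P, policy, D0, T):
    """Expected state-visitation frequencies under the current policy.

    P:      transition tensor P[s, a, s']
    policy: pi(a|s) as an (S, A) matrix (e.g., from backward_pass)
    D0:     initial state distribution; T: horizon
    """
    Dt = D0.copy()
    D = D0.copy()
    for _ in range(T - 1):
        # D_{t+1}(s') = sum_{s,a} pi(a|s) P(s'|s,a) D_t(s)
        Dt = np.einsum('s,sa,sap->p', Dt, policy, P)
        D += Dt
    return D  # D(s) = sum_t D_t(s)
```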

Limitations of Linear Rewards

Linear reward models cannot capture inherently non‑linear relationships (e.g., a gravitational force that falls off as \(1/h^2\)). Such non‑linearities must be embedded by hand through feature engineering; otherwise the learned reward will be a poor approximation of the true objective.

Maximum‑Entropy Deep IRL

To remove the linearity restriction, replace the linear reward with a deep neural network \(r_{\psi}(\phi(s))\) parameterised by \(\psi\). The network can process high‑dimensional inputs (e.g., images) and represent arbitrary non‑linear reward functions.

The training objective maximises the log‑likelihood of expert trajectories while penalising large network weights:

L(\psi) = \sum_{\tau\in D}\bigl[ r_{\psi}(\tau) - \log Z(\psi) \bigr] - \lambda\,\|\psi\|_p

where \(r_{\psi}(\tau)=\sum_{(s,a)\in\tau} r_{\psi}(\phi(s))\) and \(Z(\psi)=\sum_{\tau}\exp(r_{\psi}(\tau))\).
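
A minimal PyTorch sketch of such a reward network; the layer sizes and activations are illustrative assumptions, not a prescribed architecture:

```python
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps state features phi(s) to a scalar reward r_psi(phi(s))."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, phi_s):                # phi_s: (batch, feat_dim)
        return self.net(phi_s).squeeze(-1)   # one scalar reward per state
```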

Deep IRL Training Loop

1. Randomly initialise the reward network parameters \(\psi\).

2. Generate a batch of trajectories using a stochastic policy (e.g., a soft‑max over \(r_{\psi}\)).

3. For each trajectory compute the unnormalised probability \(\tilde{P}(\tau)=\exp(r_{\psi}(\tau))\) and approximate the partition function \(Z\) by averaging \(\tilde{P}\) over the batch.

4. Normalise to obtain \(P(\tau)=\tilde{P}(\tau)/Z\). Compute the loss, either the squared difference between expert and learner state‑visitation frequencies or the negative log‑likelihood directly, and back‑propagate through \(r_{\psi}\).

5. Repeat until the loss plateaus; the final \(\psi\) defines the reward model used for downstream policy optimisation.

This sample‑based approximation avoids explicit value‑function learning and direct state‑frequency estimation while preserving the MaxEnt objective; a sketch of one such gradient step follows.
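
A minimal PyTorch sketch of one gradient step under the negative‑log‑likelihood form of the loss. The trajectory batching interface is a hypothetical assumption, and the importance weights used in guided‑cost‑learning‑style estimators of \(Z\) are omitted for brevity:

```python
import torch

def deep_maxent_step(reward_net, opt, expert_feats, sampled_feats, lam=1e-3):
    """One sample-based gradient step on the MaxEnt objective (sketch).

    expert_feats / sampled_feats: lists of (T_i, feat_dim) tensors, one per
    trajectory, from the expert data and the current sampling policy.
    """
    # r_psi(tau) = sum of per-state rewards along each trajectory
    r_expert = torch.stack([reward_net(f).sum() for f in expert_feats])
    r_sample = torch.stack([reward_net(f).sum() for f in sampled_feats])

    # log Z approximated from the sampled batch
    log_Z = torch.logsumexp(r_sample, dim=0) - torch.log(
        torch.tensor(float(len(sampled_feats))))

    # negative log-likelihood plus an L2 penalty on the network weights
    nll = -(r_expert - log_Z).mean()
    reg = lam * sum((p ** 2).sum() for p in reward_net.parameters())
    loss = nll + reg

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```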

Key Technical Notes

When the environment is stochastic, the trajectory probability must include the product of transition probabilities, \(P(\tau) \propto \exp(\omega^{\top}\Phi(\tau))\prod_{t} P(s_{t+1}\mid s_t,a_t)\); extensions such as Maximum Causal Entropy handle this case in a principled way.

Evaluation can be performed by comparing the expected return under the learned reward with the true reward (if available) or by measuring the divergence between expert and learner state‑visitation distributions.

The linear feature‑matching (FM) method can be interpreted as a projection: \(\mu_E\) is projected onto the line through the last two feature expectations, giving a closed‑form \(\omega\) without solving a QP.
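
Concretely, writing \(\bar{\mu}_{i-1}\) for the current projection point and \(\mu_i\) for the newest policy's feature expectation, the projection update is

\bar{\mu}_{i} = \bar{\mu}_{i-1} + \frac{(\mu_i - \bar{\mu}_{i-1})^{\top}(\mu_E - \bar{\mu}_{i-1})}{(\mu_i - \bar{\mu}_{i-1})^{\top}(\mu_i - \bar{\mu}_{i-1})}\,(\mu_i - \bar{\mu}_{i-1}), \qquad \omega_{i+1} = \mu_E - \bar{\mu}_{i}

which is exactly the update used in the `projection_irl` sketch earlier.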

Tags: Imitation Learning, Feature Matching, Maximum Entropy, Deep IRL, Inverse Reinforcement Learning
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.