Offline Multi-Agent Reinforcement Learning via In‑Sample Sequential Policy Optimization (InSPO)

Offline multi‑agent reinforcement learning (MARL) suffers from out‑of‑distribution (OOD) joint actions and convergence to poor local optima. This article introduces the In‑Sample Sequential Policy Optimization (InSPO) algorithm, which combines inverse KL regularization and a maximum‑entropy objective within cooperative Markov games to achieve monotonic policy improvement and strong performance across benchmark tasks.


Cooperative Markov Game

Cooperative Markov games model multi‑agent interactions as a tuple G=⟨N,S,A,P,r,γ,d⟩, where N is the set of agents, S the state space, A the joint action space, P the transition dynamics, r a shared reward function, γ the discount factor, and d the initial state distribution. At each timestep, every agent selects an action conditioned on the current state; the team then receives the shared reward and the environment transitions to the next state.
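For reference, the quantity all agents jointly maximize is the expected discounted return (standard formulation, using the notation of the tuple above):

\[
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{\,s_0 \sim d,\; \boldsymbol{a}_t \sim \boldsymbol{\pi}(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, \boldsymbol{a}_t)} \left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, \boldsymbol{a}_t) \right]
\]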

IGM Principle and Value Decomposition

Value‑decomposition methods factorize the joint Q‑function Q(s,a) into a combination of individual agent Q‑functions, relying on the Individual‑Global‑Max (IGM) principle: the optimal joint action is recovered by each agent acting greedily with respect to its own Q‑function. This simplifies computation but can break when the environment exhibits multimodal reward structures.
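Formally, the IGM condition states that the joint greedy action factorizes into per‑agent greedy actions (standard statement; Q_i denotes agent i's individual utility and Q_tot the joint value):

\[
\arg\max_{\boldsymbol{a}} Q_{\mathrm{tot}}(s, \boldsymbol{a}) \;=\; \Big( \arg\max_{a^{1}} Q_{1}(s, a^{1}),\; \dots,\; \arg\max_{a^{n}} Q_{n}(s, a^{n}) \Big)
\]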

Behavior‑Regularized Markov Game for Offline MARL

To mitigate distribution shift in offline MARL, a behavior‑regularization term is added to the reward, penalizing policies that deviate from the behavior policy underlying the dataset. The objective maximizes expected discounted return minus this penalty, trading return maximization against fidelity to the data and curbing value overestimation on out‑of‑distribution joint actions.
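A common way to write such an objective (a sketch, since the article does not fix the exact form here; μ denotes the behavior policy, D the chosen divergence, and α the regularization weight):

\[
J(\boldsymbol{\pi}) \;=\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, \boldsymbol{a}_t) \;-\; \alpha\, D\big( \boldsymbol{\pi}(\cdot \mid s_t),\, \boldsymbol{\mu}(\cdot \mid s_t) \big) \Big) \right]
\]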

In‑Sample Sequential Policy Optimization (InSPO)

InSPO addresses OOD joint actions and coordination failures by sequentially updating each agent’s policy within a behavior‑regularized Markov game framework. It combines inverse KL divergence regularization with a maximum‑entropy term, ensuring policies stay within the support of the behavior data while encouraging exploration of low‑probability actions.

Mathematical Derivation

The core objective includes an inverse KL term that decomposes across agents, enabling sequential updates. Applying the Karush‑Kuhn‑Tucker (KKT) conditions yields a closed‑form solution for each agent's update, giving a tractable per‑agent optimization with a monotonic policy‑improvement guarantee.
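As an illustration of the derivation pattern (shown here for the generic KL‑regularized single‑state problem; the paper's inverse‑KL, per‑agent case follows the same KKT recipe but produces a different expression):

\[
\max_{\pi}\; \mathbb{E}_{a \sim \pi}\big[ Q(s,a) \big] \;-\; \alpha\, D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big)
\quad \text{s.t.} \quad \sum_{a} \pi(a \mid s) = 1
\]

Stationarity of the Lagrangian gives the behavior‑weighted Boltzmann solution

\[
\pi^{*}(a \mid s) \;=\; \frac{\mu(a \mid s)\, \exp\!\big( Q(s,a)/\alpha \big)}{Z(s)},
\]

where Z(s) normalizes over actions.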

Maximum‑Entropy Behavior‑Regularized Markov Game

InSPO augments the inverse KL objective with an entropy term, forming the Maximum‑Entropy Behavior‑Regularized Markov Game (MEBR‑MG). This promotes balanced attention to high‑ and low‑probability actions and theoretically ensures convergence to a Quantal Response Equilibrium (QRE), which remains robust under perturbed rewards.
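Schematically, each agent i then optimizes a combination of value, behavior regularization, and entropy (notation illustrative, not the paper's exact expression; D denotes the inverse KL direction used in the paper, H the Shannon entropy, and β the entropy weight):

\[
J_{i}(\pi^{i}) \;=\; \mathbb{E}\big[ Q_{i}(s, a^{i}) \big] \;-\; \alpha\, D\big( \pi^{i},\, \mu^{i} \big) \;+\; \beta\, \mathcal{H}\big( \pi^{i}(\cdot \mid s) \big)
\]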

Algorithm Details

Algorithm 1: InSPO Steps

Input: offline dataset D, initial policies, and initial Q‑functions.

Output: final policies.

1. Compute a behavior policy via simple behavior cloning on D.
2. Policy evaluation: fit the current Q‑functions.
3. Randomly permute the agents and update each agent's policy sequentially using the derived closed‑form objective (see the toy sketch below).
4. Repeat steps 2–3 until a predefined number of iterations K is reached.
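To make the loop concrete, here is a minimal, runnable toy sketch in Python for a two‑agent, one‑state matrix game. It illustrates only the loop's structure (behavior cloning, evaluation, randomized sequential improvement with a behavior‑weighted closed form); it omits the entropy term and Q‑estimation from data, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cooperative one-state matrix game: two agents, two actions each.
# Two coordination optima (payoff 1.0 and 0.9) sit on the diagonal.
R = np.array([[1.0, 0.0],
              [0.0, 0.9]])

# Offline dataset: joint actions drawn from an imbalanced behavior policy
# that mostly plays the suboptimal second action.
mu = np.array([0.2, 0.8])
a1 = rng.choice(2, size=5000, p=mu)
a2 = rng.choice(2, size=5000, p=mu)

# Step 1: behavior cloning via empirical per-agent action frequencies.
bc1 = np.bincount(a1, minlength=2) / len(a1)
bc2 = np.bincount(a2, minlength=2) / len(a2)

pi1, pi2 = bc1.copy(), bc2.copy()
alpha = 0.1  # temperature: larger values stay closer to the behavior policy

for _ in range(50):
    # Steps 2-3: compute each agent's local Q under the other agent's
    # *current* policy, then improve sequentially in a random agent order.
    # (Q comes from the known payoff matrix here for brevity; an offline
    # method would have to estimate it from the dataset instead.)
    for i in rng.permutation(2):
        if i == 0:
            q1 = R @ pi2                    # Q1(a1) = E_{a2~pi2}[R[a1, a2]]
            pi1 = bc1 * np.exp(q1 / alpha)  # behavior-weighted Boltzmann update
            pi1 /= pi1.sum()
        else:
            q2 = R.T @ pi1                  # Q2(a2) = E_{a1~pi1}[R[a1, a2]]
            pi2 = bc2 * np.exp(q2 / alpha)
            pi2 /= pi2.sum()

print("final policies:", pi1.round(3), pi2.round(3))
```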

Policy Evaluation

Policy evaluation approximates the joint Q‑function with local Q‑functions to sidestep the exponential growth of the joint action space. Importance resampling is employed to construct a low‑variance dataset, stabilizing training.
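One way to realize this (a sketch of the generic sampling‑importance‑resampling technique, not necessarily the paper's exact construction; names are illustrative): instead of weighting each sample by π/μ inside the loss, resample transitions with probability proportional to those weights and train on the resampled batch with uniform weights.

```python
import numpy as np

def importance_resample(transitions, pi_probs, mu_probs, batch_size, rng):
    """Sampling-importance-resampling: draw transitions with probability
    proportional to w = pi/mu, then treat the resampled batch as unweighted.
    transitions: list of transition tuples; pi_probs/mu_probs: arrays of the
    current and behavior policies' probabilities of the logged actions."""
    w = pi_probs / np.maximum(mu_probs, 1e-8)  # per-sample importance weights
    p = w / w.sum()                            # normalized resampling distribution
    idx = rng.choice(len(transitions), size=batch_size, p=p)
    return [transitions[i] for i in idx]
```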

Policy Improvement

With the updated local Q‑functions, each agent's policy is improved by minimizing its KL divergence to the closed‑form target induced by the inverse‑KL‑regularized objective, which keeps the update in‑sample and guarantees convergence.
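With function approximation, such a projection is commonly implemented as an advantage‑weighted maximum‑likelihood loss over dataset actions (a sketch under that assumption; the paper's exact loss may differ, and all names are illustrative):

```python
import torch

def policy_improvement_loss(logits, actions, q_values, values, alpha):
    """Weighted maximum-likelihood projection of agent i's policy onto the
    closed-form target pi* ∝ mu * exp((Q - V)/alpha), using only in-sample
    (dataset) actions.
    logits:  (B, num_actions) policy logits for agent i
    actions: (B,) long tensor of logged dataset actions
    q_values, values: (B,) critic outputs for the logged state-action pairs
    """
    log_pi = torch.log_softmax(logits, dim=-1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Exponentiated advantages act as weights; clamp for numerical stability.
    weights = torch.exp((q_values - values) / alpha).clamp(max=100.0)
    return -(weights.detach() * log_pi_a).mean()
```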

Practical Implementation

Key engineering tricks include optimizing local Q‑functions to avoid exponential blow‑up, applying importance resampling to reduce variance, and automatically adjusting the temperature parameter α to control conservatism.
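The article does not spell out the α‑adjustment rule; a common generic mechanism is SAC‑style temperature learning toward a target entropy, sketched below under that assumption (all names illustrative):

```python
import torch

# Learnable log-temperature, adjusted so the policy's entropy tracks a target
# (here a fraction of maximum entropy for a 5-action space -- illustrative).
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = 0.98 * torch.log(torch.tensor(5.0))

def update_alpha(policy_entropy):
    """One gradient step on alpha: it grows when entropy falls below the
    target (more conservatism) and shrinks when entropy exceeds it.
    policy_entropy: scalar tensor, the current batch-average policy entropy."""
    loss = log_alpha * (policy_entropy.detach() - target_entropy)
    alpha_optimizer.zero_grad()
    loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```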

Experimental Validation

In the M‑NE (bridge) game, InSPO uniquely identifies the global optimum on imbalanced datasets where other methods fail. On the StarCraft II micromanagement benchmark (four maps, multiple dataset regimes), InSPO achieves state‑of‑the‑art win rates, demonstrating scalability to high‑dimensional environments.

Ablation Studies

Removing the entropy term prevents the policy from escaping local optima; replacing sequential updates with synchronous ones causes conflicting gradient directions and OOD joint actions; adaptive α tuning outperforms fixed temperatures.

Conclusion

InSPO is a novel offline MARL algorithm that resolves OOD joint actions and local‑optimum traps through inverse KL regularization and entropy maximization. It guarantees monotonic improvement and convergence to a QRE, and it empirically outperforms existing baselines across diverse tasks.

Future Directions

Potential extensions include combining InSPO with other MARL advances, generating high‑quality offline datasets via GANs, designing regularizers for multimodal reward landscapes, and deploying the method in real‑world domains such as intelligent scheduling, autonomous driving, and smart manufacturing.

Tags: policy optimization, maximum entropy, cooperative Markov game, InSPO, inverse KL regularization, offline MARL
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
