How MemPO Uses Reinforcement Learning to Turn Agent Memory into a Trainable Policy

MemPO introduces a self‑memory policy optimization framework that lets long‑horizon LLM agents autonomously manage and refine their memory via reinforcement learning, using global‑trajectory and informative‑memory advantage estimates, achieving up to 25.98% F1 gain and 73% token reduction on benchmark tasks.

Data Party THU

Problem

Long‑horizon LLM agents accumulate interaction histories, causing linear context inflation, soaring token costs, and the “Lost in the Middle” degradation of accuracy and stability. Existing solutions rely on external memory stores or RAG retrieval, which only passively fetch similar fragments and cannot be jointly optimized with the task objective. Reinforcement‑learning‑based memory managers also lack explicit signals that guide the quality of written memory, leading to redundant or noisy context.

MemPO Overview

MemPO (Self‑Memory Policy Optimization) treats the memory buffer as a trainable policy component. During multi‑turn rollout sampling, the agent interacts with the environment and, at each turn, writes a memory fragment that will be available for subsequent steps. The memory writing operation is incorporated into the RL credit‑assignment chain, so the agent receives direct feedback on how useful each fragment is for the final answer.
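To make the mechanism concrete, below is a minimal sketch of such a rollout loop. The names (`agent`, `env`, the act/write split in `agent.step`) are illustrative assumptions, not the paper's actual interface.

```python
def rollout(agent, env, max_turns=10):
    """Sketch of a memory-writing rollout: the agent's context is its own
    previously written fragments plus the current observation, rather than
    the full raw interaction history."""
    memory = []       # self-written fragments, visible to all later turns
    trajectory = []   # per-turn records kept for RL credit assignment
    reward = 0.0
    obs, done = env.reset(), False
    for _ in range(max_turns):
        context = "\n".join(memory + [obs])
        action, fragment = agent.step(context)   # act AND write a memory fragment
        trajectory.append({"obs": obs, "memory": fragment, "action": action})
        memory.append(fragment)                  # fragment feeds later turns
        obs, reward, done = env.step(action)
        if done:
            break
    return trajectory, memory, reward
```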

Advantage Estimation

Two advantage estimators are combined:

Global‑trajectory advantage – a reward based on overall answer correctness and format compliance, measuring the quality of the entire interaction sequence.

Informative‑memory advantage – a reward based on the posterior probability of the correct answer given a specific memory fragment. The posterior is computed as the geometric mean of the token‑level probabilities of the correct answer.

The combined advantage is the weighted sum of the two terms (see Figure 1).

Figure 1: Combined advantage formula.
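As a rough illustration of how the two terms might be computed and combined, here is a sketch. The group normalization, the 1e‑8 stabilizer, and the weight `w` are assumptions for illustration, not values taken from the paper.

```python
import math
from statistics import mean, pstdev

def answer_posterior(token_logprobs):
    """Posterior of the correct answer given a memory fragment: the geometric
    mean of its token-level probabilities, i.e. exp(mean of the log-probs)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def combined_advantage(traj_reward, group_traj_rewards,
                       mem_posterior, group_mem_posteriors, w=0.5):
    """Weighted sum of the global-trajectory and informative-memory advantages.
    Each term is normalized against a group of sampled rollouts (GRPO-style);
    the grouping and the weight w are illustrative assumptions."""
    a_traj = (traj_reward - mean(group_traj_rewards)) / (pstdev(group_traj_rewards) + 1e-8)
    a_mem = (mem_posterior - mean(group_mem_posteriors)) / (pstdev(group_mem_posteriors) + 1e-8)
    return w * a_traj + (1.0 - w) * a_mem
```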

Training Signal

During training the agent receives the summed advantage as a scalar reward. This drives the policy to generate memory that is both concise and highly informative for the downstream task, suppressing uncontrolled growth of irrelevant fragments.
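In a plain policy-gradient form (one possible instantiation; the paper's actual objective may be a clipped PPO/GRPO-style surrogate), the scalar advantage simply weights the log-probabilities of the tokens the agent generated in that turn, memory-fragment tokens included:

```python
import torch

def mempo_style_loss(token_logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate loss (a simplified stand-in for the paper's
    objective): every generated token, whether it belongs to a thought, an
    action, or a memory fragment, is weighted by the combined advantage of its
    turn, so memory writing receives the same credit-assignment signal as acting.

    token_logprobs: (T,) log-probs of the generated tokens under the current policy
    advantages:     (T,) the turn-level combined advantage broadcast to each token
    """
    return -(token_logprobs * advantages).mean()
```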

Experimental Evaluation

Benchmarks: a multi‑objective web‑search dataset with varying numbers of objectives. Baselines: ReAct, Agentic‑RL, and several RAG‑based memory methods.

Key results:

F1 improvement of up to 25.98% over the base model and 7.1% over the previous state‑of‑the‑art.

Token consumption reduced by 67.58%–73.12%, i.e., only about one‑third of the tokens used by ReAct.

An effective performance gain of roughly threefold compared with ReAct.

Figure: Performance vs. baselines.

Analysis of task complexity shows that as the number of objectives increases, MemPO’s advantage over GRPO widens. Ablation studies reveal a trade‑off: simple tasks benefit from richer context, while overly long histories introduce noise that harms accuracy, confirming the need for selective memory compression.

Figure: Ablation and complexity analysis.

Conclusion

MemPO converts the memory buffer into a learnable policy variable that is jointly optimized with the agent’s thinking and acting stages. By embedding memory writing into the RL credit‑assignment loop, the agent learns to allocate context budget to truly useful intermediate information and discard noise, achieving shorter contexts, higher information density, lower token cost, and superior task performance. The results suggest that future long‑horizon agent research should shift from purely retrieval‑based memory toward learned, controllable internal memory generation.

Paper: https://arxiv.org/abs/2603.00680

Code: https://github.com/TheNewBeeKing/MemPO
