How MemPO Gives AI Agents Long‑Term Memory and Cuts Costs by 70%

The paper introduces MemPO, a self‑memory strategy optimization algorithm that lets large language model agents actively manage their own memory. It substantially improves accuracy on complex multi‑step tasks while reducing token consumption by up to 73%, and the authors validate the approach with extensive experiments and ablations.

SuanNi

Problem: Long‑term Forgetting in AI Agents

When an AI agent must perform many interaction rounds (e.g., deep research, data analysis, complex code generation), the accumulated context grows linearly with the number of rounds. Once it exceeds the model's context window, the agent suffers "context overload": earlier information is forgotten and reasoning collapses.

MemPO: Self‑Memory Strategy Optimization

MemPO (Self‑Memory Strategy Optimization) is an algorithm that gives large language models the ability to actively manage their own memory. At each step the agent emits one of three explicit actions:

memory : a compact memory block that summarizes the most relevant information from the previous interaction.

reasoning : the usual chain‑of‑thought step.

tool call : invocation of an external tool (search, calculator, etc.) and its response.

The agent then discards the full history and feeds only the memory block together with the current observation into the next forward pass. This keeps the effective context length bounded, shrinking it by up to roughly 70% and cutting token consumption accordingly.
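As a rough sketch of this loop (the `Step` container and the prompt template below are illustrative assumptions, not the paper's actual interface):

```python
from dataclasses import dataclass

@dataclass
class Step:
    memory: str       # compact summary of everything relevant so far
    observation: str  # newest tool response or environment feedback

def next_prompt(step: Step, task: str) -> str:
    """Build the input for the next forward pass.

    Unlike a standard agent loop, the full interaction history is
    discarded: only the latest memory block plus the current
    observation are fed back in, so the prompt length stays bounded
    no matter how many rounds the episode runs.
    """
    return (
        f"Task: {task}\n"
        f"Memory: {step.memory}\n"
        f"Observation: {step.observation}\n"
        "Next action (memory / reasoning / tool call):"
    )
```

Because the prompt is rebuilt from scratch each round, its size is governed by the memory block cap rather than by episode length.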

Training with Fine‑grained Memory Rewards

Standard reinforcement‑learning‑from‑human‑feedback (RLHF) uses a single trajectory‑level reward, which is too coarse for long‑horizon tasks: the model cannot tell which intermediate memory contributed to the final answer. MemPO introduces an additional memory‑level reward that evaluates each generated memory segment independently.

During policy optimization, the total reward for a token belonging to a memory segment is

R_total = R_trajectory + R_memory − b

where b is a baseline bias term that normalizes for difficulty differences across trajectories.
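A minimal sketch of this reward combination, under the assumption (not stated explicitly in the text) that tokens outside memory segments receive only the trajectory-level reward:

```python
def token_reward(traj_reward: float, mem_reward: float, baseline: float,
                 in_memory_segment: bool) -> float:
    """Combine trajectory- and memory-level rewards for one token.

    Tokens inside a memory segment get the extra memory-level signal,
    normalized by the baseline bias b:
        R_total = R_trajectory + R_memory - b
    All other tokens fall back to the trajectory reward alone.
    """
    if in_memory_segment:
        return traj_reward + mem_reward - baseline
    return traj_reward
```

The fine-grained term is what lets credit assignment reach individual memory segments instead of being smeared across the whole trajectory.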

Probability‑based Metric for Memory Quality

Because language models generate text by estimating conditional probabilities, MemPO uses the probability of producing the correct answer given a memory as a quantitative quality metric. For a memory m and target answer a:

Q(m) = P(a | m, prompt)

A higher Q(m) indicates that the memory contains more useful information. The baseline bias b is estimated by averaging Q(m) over a set of reference trajectories.
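In practice Q(m) can be computed from the model's per-token log-probabilities of the answer tokens; a minimal sketch (function names are illustrative, not the paper's API):

```python
import math

def memory_quality(answer_logprobs: list[float]) -> float:
    """Q(m) = P(a | m, prompt).

    The model's per-token log-probs for the answer tokens (scored
    with the memory m in the prompt) sum to log P(a | m, prompt);
    exponentiating recovers the probability itself.
    """
    return math.exp(sum(answer_logprobs))

def baseline_bias(reference_qualities: list[float]) -> float:
    """Estimate the bias term b as the mean Q(m) over a set of
    reference trajectories, normalizing for task difficulty."""
    return sum(reference_qualities) / len(reference_qualities)
```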

Experimental Setup

All experiments use the Qwen2.5‑7B model as the base LLM. A multi‑goal benchmark was built that requires the agent to locate up to ten distinct retrieval targets in a single episode, with difficulty increasing with the number of targets.

Key hyper‑parameters:

Learning rate: 5e‑5

RL steps: 30k

Memory block size: 256 tokens (max)

Tool‑call budget: ≤ 5 per episode
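For reference, the hyper-parameters above could be grouped into a single config object like this (a hypothetical structure for readability, not the paper's code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemPOConfig:
    """Key hyper-parameters from the experimental setup."""
    learning_rate: float = 5e-5
    rl_steps: int = 30_000
    max_memory_tokens: int = 256  # cap on each memory block
    max_tool_calls: int = 5       # per-episode tool-call budget
```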

Results

Compared with the baseline (no memory optimization), MemPO achieves:

+25.98 absolute F1 over the baseline (7.1 points above the previous state‑of‑the‑art).

67.58 % reduction in total token usage per episode.

73.12 % reduction in peak token consumption per step.

Higher concentration of samples in high‑probability bins, indicating better memory quality.

Ablation studies show that removing the independent memory reward drops performance to baseline levels, while retaining the full interaction history (no memory compression) leads to rapid degradation on long‑horizon tasks.

Limitations

The reward signal is still sensitive to token count fluctuations caused by tool calls, and the bias term only partially compensates. Extending MemPO to open‑world environments and reducing the compute overhead of large models remain open challenges.

Resources

arXiv pre‑print: https://arxiv.org/pdf/2603.00680

Hugging Face collection: https://huggingface.co/collections/NewBeeKing/mempo

GitHub repository: https://github.com/TheNewBeeKing/MemPO

Tags: efficiency, AI, memory optimization, large language models, reinforcement learning, long-term memory
Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.
