How MemPO Gives AI Agents Long‑Term Memory and Cuts Costs by 70%
The paper introduces MemPO, a self‑memory strategy optimization algorithm that lets large language model agents actively manage their own memory. It substantially improves accuracy on complex multi‑step tasks while reducing token consumption by up to 73%, and the authors validate the approach with extensive experiments and analysis.
Problem: Long‑term Forgetting in AI Agents
When an AI agent must perform many interaction rounds (e.g., deep research, data analysis, complex code generation), the accumulated context grows linearly with the number of rounds. Once it exceeds the model’s context‑window limit, the agent suffers “context overload”: earlier information is forgotten and its reasoning collapses.
MemPO: Self‑Memory Strategy Optimization
MemPO (Self‑Memory Strategy Optimization) is an algorithm that gives large language models the ability to actively manage their own memory. At each step the agent emits one of three explicit actions:
memory: a compact memory block that summarizes the most relevant information from the previous interaction.
reasoning: the usual chain‑of‑thought step.
tool call: invocation of an external tool (search, calculator, etc.) and its response.
The agent discards the full history and feeds only the memory block together with the current observation into the next forward pass. This reduces the effective context length by up to 70 % and cuts token consumption dramatically.
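Concretely, the interaction loop can be pictured as below. This is a minimal sketch under our own assumptions: the `llm_step` and `run_tool` interfaces and the action schema are illustrative, not the paper’s actual API.

```python
# Minimal sketch of a memory-compressed agent loop (illustrative only;
# `llm_step`, `run_tool`, and the action fields are assumed, not MemPO's API).
def run_episode(llm_step, run_tool, task_prompt, max_steps=20):
    memory = ""               # compact memory block carried between steps
    observation = task_prompt
    for _ in range(max_steps):
        # Only the memory block and the latest observation enter the context;
        # the full interaction history is never re-fed to the model.
        action = llm_step(memory=memory, observation=observation)
        if action["type"] == "memory":
            memory = action["content"]        # replace with the new summary
        elif action["type"] == "reasoning":
            observation = action["content"]   # reasoning feeds the next step
        elif action["type"] == "tool_call":
            observation = run_tool(action["name"], action["args"])
        elif action["type"] == "answer":
            return action["content"]
    return None
```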
Training with Fine‑grained Memory Rewards
Standard reinforcement‑learning‑from‑human‑feedback (RLHF) uses a single trajectory‑level reward, which is too coarse for long‑horizon tasks: the model cannot tell which intermediate memory contributed to the final answer. MemPO introduces an additional memory‑level reward that evaluates each generated memory segment independently.
During policy optimization, the total reward for a token belonging to a memory segment is

R_total = R_trajectory + R_memory − b

where b is a baseline bias term that normalizes for difficulty differences across trajectories.
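As a rough illustration of how this fine‑grained reward could be assigned during policy optimization (the segment bookkeeping and function signature are our assumptions, not the paper’s implementation):

```python
# Sketch: assign R_total = R_trajectory + R_memory - b to tokens inside memory
# segments, and the plain trajectory reward elsewhere (illustrative only).
def per_token_rewards(trajectory_reward, memory_rewards, memory_spans,
                      num_tokens, baseline):
    """memory_spans: list of (start, end) token indices, one per memory segment;
    memory_rewards: the R_memory value for each corresponding segment."""
    rewards = [trajectory_reward] * num_tokens
    for (start, end), r_mem in zip(memory_spans, memory_rewards):
        for t in range(start, end):
            rewards[t] = trajectory_reward + r_mem - baseline
    return rewards
```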
Probability‑based Metric for Memory Quality
Because language models generate text by estimating conditional probabilities, MemPO uses the probability of producing the correct answer given a memory as a quantitative quality metric. For a memory m and target answer a:

Q(m) = P(a | m, prompt)

A higher Q(m) indicates that the memory contains more useful information. The baseline bias b is estimated by averaging Q(m) over a set of reference trajectories.
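A sketch of how Q(m) could be estimated with Hugging Face `transformers`, by scoring the answer tokens conditioned on the prompt and the memory. The prompt concatenation and scoring details below are our assumptions, not the paper’s exact procedure.

```python
# Estimate Q(m) = P(a | m, prompt) by multiplying per-token answer probabilities.
# The base model matches the experiments; everything else is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def memory_quality(memory: str, prompt: str, answer: str) -> float:
    context_ids = tok(prompt + "\n" + memory, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so answer tokens are predicted from
    # positions context_len-1 .. total_len-2.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    positions = range(context_ids.shape[-1] - 1, input_ids.shape[-1] - 1)
    token_logps = [log_probs[0, p, input_ids[0, p + 1]] for p in positions]
    return float(torch.exp(torch.stack(token_logps).sum()))

def estimate_baseline(reference_memories, prompt, answer):
    # b: average Q(m) over memories from a set of reference trajectories.
    qs = [memory_quality(m, prompt, answer) for m in reference_memories]
    return sum(qs) / len(qs)
```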
Experimental Setup
All experiments use the Qwen2.5‑7B model as the base LLM. A multi‑goal benchmark was built that requires the agent to locate up to ten distinct retrieval targets in a single episode, with difficulty increasing with the number of targets.
Key hyper‑parameters (collected into a config sketch after this list):
Learning rate: 5e‑5
RL steps: 30 k
Memory block size: 256 tokens (max)
Tool‑call budget: ≤ 5 per episode
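For reference, the same values gathered into one place; the field names are ours, only the numbers come from the paper.

```python
# Reported training setup as a plain config dict (field names are illustrative).
mempo_config = {
    "base_model": "Qwen/Qwen2.5-7B",
    "learning_rate": 5e-5,
    "rl_steps": 30_000,
    "max_memory_tokens": 256,         # memory block size cap
    "max_tool_calls_per_episode": 5,  # tool-call budget
}
```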
Results
Compared with the baseline (no memory optimization), MemPO achieves:
+25.98 absolute F1 score (7.1 points higher than the previous state‑of‑the‑art).
67.58 % reduction in total token usage per episode.
73.12 % reduction in peak token consumption per step.
Higher concentration of samples in high‑probability bins, indicating better memory quality.
Ablation studies show that removing the independent memory reward drops performance to baseline levels, while retaining the full interaction history (no memory compression) leads to rapid degradation on long‑horizon tasks.
Limitations
The reward signal is still sensitive to token count fluctuations caused by tool calls, and the bias term only partially compensates. Extending MemPO to open‑world environments and reducing the compute overhead of large models remain open challenges.
Resources
arXiv pre‑print: https://arxiv.org/pdf/2603.00680
Hugging Face collection: https://huggingface.co/collections/NewBeeKing/mempo
GitHub repository: https://github.com/TheNewBeeKing/MemPO
