Can Online Evaluation Unlock AI Assistants' Long-Term Memory? Inside AMemGym

AMemGym introduces an on‑policy, interactive benchmark that evaluates and trains AI assistants' long‑term memory by structuring state evolution, diagnosing memory failures, and enabling agents to self‑evolve, revealing that selective memory writing outperforms passive approaches across various LLM and agent architectures.


Memory Evaluation: Limitations of Static Offline Policies

Most existing benchmarks (e.g., MSC, LoCoMo) evaluate AI assistants with static, off-policy scripts. This creates reuse bias: models can exploit fixed dialogues without truly capturing user information during interaction. Consequently, retrieval-augmented generation (RAG) systems look stronger than they really are, while agent-based memory systems are often underestimated when judged off-policy.

To measure real interactive ability, an on‑policy evaluation is required.
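To make the distinction concrete, here is a minimal Python sketch of the two evaluation modes, assuming hypothetical `assistant` and `user_sim` objects; none of these names come from AMemGym's actual interface.

```python
def evaluate_off_policy(assistant, transcript, question, answer):
    """Off-policy: the assistant only reads a fixed, pre-written script,
    so it is never forced to capture information during live interaction."""
    for user_turn, _scripted_reply in transcript:
        assistant.observe(user_turn)           # passive exposure to the script
    return assistant.answer(question) == answer

def evaluate_on_policy(assistant, user_sim, question, answer, turns=10):
    """On-policy: each user turn reacts to the assistant's own replies,
    so memory must be built during the interaction itself."""
    user_turn = user_sim.open_dialogue()
    for _ in range(turns):
        reply = assistant.respond(user_turn)   # the assistant acts...
        user_turn = user_sim.react(reply)      # ...and the user reacts
    return assistant.answer(question) == answer
```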

AMemGym Framework

Phase 1 – Offline Structured Data Generation

User profile sampling: generate diverse personas.

Question & state sampling: define evaluation questions and state patterns.

State evolution: simulate natural preference changes over time (e.g., fitness intensity shifting from "medium" to "high").

Personalized response generation: produce ground-truth benchmark answers for each trajectory (sketched below).
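Here is that sketch: a hedged Python illustration of the Phase 1 pipeline with an assumed trajectory schema; the field names and sampled values are illustrative, not AMemGym's actual data format.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    persona: dict
    question: str
    states: list = field(default_factory=list)  # ordered state evolution
    ground_truth: str = ""

def sample_trajectory(rng: random.Random) -> Trajectory:
    # 1. User profile sampling: draw a diverse persona.
    persona = {"name": rng.choice(["Alex", "Sam"]), "hobby": "fitness"}
    # 2. Question & state sampling: pick an evaluation question and the
    #    preference that will evolve during the conversation.
    question = "What workout intensity does the user currently prefer?"
    # 3. State evolution: simulate natural preference drift over time,
    #    e.g. fitness intensity moving from "medium" to "high".
    states = [{"fitness_intensity": "medium"}, {"fitness_intensity": "high"}]
    # 4. Personalized response generation: the latest state defines the
    #    ground-truth benchmark answer for this trajectory.
    return Trajectory(persona, question, states,
                      states[-1]["fitness_intensity"])

print(sample_trajectory(random.Random(0)).ground_truth)  # -> high
```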

Phase 2 – Online Interactive Evaluation

A large language model (e.g., GPT-4) plays the role of a virtual user, gradually revealing the sampled profile and current state through role-play. The AI assistant under test must actively recognize, remember, and use this personal information during free-form dialogue.
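A minimal sketch of this loop, assuming a generic `chat()` LLM call and the `Trajectory` layout from the Phase 1 sketch; the role-play prompt is illustrative, not the paper's actual prompt.

```python
USER_SIM_PROMPT = """You are role-playing a user with this profile: {persona}.
Your current state is: {state}. Reveal these facts naturally over the
conversation; never list them all at once."""

def run_online_episode(chat, assistant, trajectory, turns_per_state=5):
    history = []
    for state in trajectory.states:               # states change mid-dialogue
        sim_prompt = USER_SIM_PROMPT.format(persona=trajectory.persona,
                                            state=state)
        for _ in range(turns_per_state):
            user_turn = chat(system=sim_prompt, messages=history)
            reply = assistant.respond(user_turn)  # must notice and store facts
            history += [{"role": "user", "content": user_turn},
                        {"role": "assistant", "content": reply}]
    # After the free-form dialogue, probe what the assistant remembered.
    return assistant.answer(trajectory.question)
```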

paper: https://openreview.net/forum?id=sfrVLzsmlf
code: https://github.com/AGI-Eval-Official/amemgym

Comprehensive Scan of Mainstream Memory Systems

Pure LLMs with long context: When all required facts are supplied directly, models such as GPT, Claude, Gemini, and DeepSeek achieve >80% accuracy, demonstrating strong information utilization. However, in multi-turn conversations where the model must retrieve key facts from a long dialogue history, performance drops sharply, confirming that longer context windows do not equal effective memory.

Agent memory architectures: Four designs were compared:

Pure context model

Standard RAG

Agent‑Write with external storage (AWE)

Agent‑Write within context

Results show that allowing the model to selectively write and retrieve memories (Agent‑Write) yields far higher scores than passive storage (RAG) or no explicit memory handling (pure context). Memory quality matters more than quantity.
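The difference is easiest to see in code. Below is a hedged sketch of the Agent-Write idea, where the model gates what enters storage; `llm` is a stand-in chat function and the extraction prompt is illustrative, not the paper's implementation.

```python
class AgentWriteMemory:
    """Selective memory: the model decides what enters external storage."""

    def __init__(self, llm):
        self.llm = llm
        self.store = []                       # curated, not exhaustive

    def maybe_write(self, user_turn: str) -> None:
        # Selective writing: ask the model to extract a durable user fact,
        # or explicitly decline to store anything.
        prompt = ("Extract any lasting user fact from this turn, or reply "
                  f"NONE if nothing is worth remembering: {user_turn}")
        note = self.llm(prompt)
        if note.strip().upper() != "NONE":
            self.store.append(note)           # quality over quantity

    def retrieve(self, query: str, k: int = 3) -> list:
        # Naive keyword-overlap ranking; a real system would use embeddings,
        # but the decisive difference from passive RAG is the gated write.
        def overlap(memory: str) -> int:
            return len(set(memory.lower().split()) & set(query.lower().split()))
        return sorted(self.store, key=overlap, reverse=True)[:k]
```

By contrast, a passive RAG baseline indexes every turn verbatim and pushes the entire burden onto retrieval, which is exactly where quantity loses to quality.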

Diagnostic Analysis of Memory Failures

AMemGym decomposes failures into three stages:

Write failure: needed information is not stored.

Read failure: stored information cannot be retrieved when required.

Use failure: retrieved information is misapplied during reasoning.

External‑storage agents (AWE) reduce read failures by curating memories, but they exhibit higher write‑failure rates, similar to standard RAG.
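The decomposition can be operationalized with simple membership checks. The heuristics below are illustrative, not AMemGym's actual diagnostic tooling:

```python
def diagnose_failure(fact: str, store: list, retrieved: list,
                     answer: str) -> str:
    """Classify a wrong answer by where the required fact was lost."""
    if not any(fact in memory for memory in store):
        return "write failure"  # the fact was never stored
    if not any(fact in memory for memory in retrieved):
        return "read failure"   # stored, but not surfaced when needed
    if fact not in answer:
        return "use failure"    # surfaced, but misapplied in reasoning
    return "success"
```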

Self‑Evolving Memory Agents

AMemGym can also train agents. By making the prompt that controls memory writing evolvable, agents interact with the environment, receive feedback, and update their memory-writing strategy. After a few iterations, agents significantly improve their memory scores, primarily by lowering write failures. Qualitative analysis shows prompts evolving from vague instructions (e.g., "skill level") to concrete, actionable rules (e.g., "teaching method"), along with new memory patterns emerging for previously unseen topics.
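A hedged sketch of this loop as simple hill-climbing over the memory-writing prompt; `llm` and `run_benchmark` are hypothetical stand-ins for the paper's training setup.

```python
def evolve_write_prompt(llm, run_benchmark, write_prompt, iterations=3):
    """Hill-climb on the memory-writing prompt using on-policy feedback."""
    best_prompt, best_score = write_prompt, run_benchmark(write_prompt)
    for _ in range(iterations):
        # Ask an LLM to sharpen the prompt; in the paper, feedback from
        # observed write failures drives this rewrite step.
        candidate = llm(
            "Rewrite this memory-writing instruction into concrete, "
            "actionable rules so fewer user facts are missed:\n" + best_prompt)
        score = run_benchmark(candidate)      # re-evaluate interactively
        if score > best_score:                # keep only improvements
            best_prompt, best_score = candidate, score
    return best_prompt
```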

Conclusion

AMemGym shifts evaluation from static, post‑hoc testing to dynamic, online, interaction‑aligned assessment and training of AI‑assistant memory. It provides a reproducible benchmark, diagnostic tooling, and a training loop that enables memory agents to self‑evolve, offering a concrete foundation for building long‑term personalized AI services.
