A Comprehensive Survey of Agent Memory: Benchmarks, Evaluation Frameworks, and System Designs

This article systematically reviews the state of agent long‑term memory by covering three core dimensions—benchmark datasets such as MUSE and LOCOMO, evaluation frameworks like MemoryAgentBench, LONGMEMEVAL and MemBench, and representative memory system implementations (THEANINE, RMM, M3‑Agent, Mem0)—while highlighting key capabilities, performance gaps, and future research directions.

DaTaobao Tech

Jun 3, 2026

Introduction

As large language models (LLMs) become central to dialogue systems and intelligent agents, long‑term memory is no longer an optional feature but a decisive factor for consistency, knowledge reuse, and cross‑session reasoning. The article outlines the current research landscape, identifies four essential memory capabilities—Accurate Retrieval (AR), Test‑Time Learning (TTL), Long‑Range Understanding (LRU), and Conflict Resolution (CR)—and argues that evaluation must answer three questions: what to remember, how to store it, and whether it yields measurable task gains.

Benchmark Datasets (Memory Benchmark)

MUSE (ACL 2025) provides 7 000 cases and 83 000 dialogues covering multimodal conversational recommendation scenarios.

LOCOMO (ACL 2024; listed with 274 citations) focuses on realistic interaction data.

MemoryAgentBench reconstructs existing datasets and adds EventQA and FactConsolidation to evaluate AR and CR.

LONGMEMEVAL offers two configurations: LONGMEMEVAL‑S (~115 k tokens) and LONGMEMEVAL‑M (500 conversations, ~1.5 M tokens) to test information extraction, multi‑session inference, temporal reasoning, knowledge update, and refusal handling.

MemBench introduces factual and reflective memory across participation and observation scenarios, measuring accuracy, recall, capacity, and efficiency.

Evaluation Frameworks (Memory Evaluation)

The article describes unified evaluation pipelines that consist of indexing, retrieval, reading, and scoring. For example, LONGMEMEVAL employs session decomposition, fact‑augmented key expansion, and time‑aware query expansion to improve temporal reasoning. MemoryAgentBench uses a four‑stage process (index, retrieve, read, evaluate) and reports that Retrieval‑Augmented Generation (RAG) excels at AR and long‑context models dominate TTL and LRU, while no method exceeds 6 % accuracy on multi‑hop CR.
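The four stages named above can be pictured with a toy pipeline. This is a minimal sketch under stated assumptions, not the code of MemoryAgentBench or LONGMEMEVAL: the function names, the bag‑of‑words retriever, the stand‑in reader (which simply returns the retrieved text instead of calling an LLM), and the substring scorer are all hypothetical simplifications.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index(sessions):
    """Stage 1: store each session next to a bag-of-words key."""
    return [(Counter(tokenize(s)), s) for s in sessions]

def retrieve(store, query, k=1):
    """Stage 2: rank sessions by word overlap with the query, keep the top k."""
    q = Counter(tokenize(query))
    ranked = sorted(store, key=lambda item: sum((item[0] & q).values()), reverse=True)
    return [text for _, text in ranked[:k]]

def read(contexts, query):
    """Stage 3: stand-in for the LLM reader (assumption); echoes the retrieved text."""
    return " ".join(contexts)

def evaluate(prediction, gold):
    """Stage 4: lenient scoring, 1.0 if the gold string appears in the prediction."""
    return 1.0 if gold.lower() in prediction.lower() else 0.0

sessions = [
    "User: I adopted a cat named Miso last spring.",
    "User: My flight to Lisbon leaves on Friday.",
]
store = index(sessions)
question, gold = "What is my cat named?", "Miso"
answer = read(retrieve(store, question), question)
score = evaluate(answer, gold)
print(score)
```

Keeping the stages separate is what lets a benchmark swap in a different retriever or reader while holding the scoring rule fixed.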

Memory System Implementations (Memory System)

THEANINE & TeaFarm (NAACL 2025): THEANINE builds timeline‑based memory graphs, and TeaFarm evaluates agents with counterfactual questions to check that they reference past dialogue correctly.
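One way to picture a timeline‑based memory graph is as memories linked to their predecessors on the same topic, so that retrieval returns a chain of events rather than an isolated snippet. The sketch below is an illustrative assumption, not THEANINE's actual algorithm: the `Memory` class, the topic labels, and the "same topic" linking rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Memory:
    turn: int                          # position in the dialogue history
    topic: str                         # hypothetical topic label
    text: str
    prev: Optional["Memory"] = None    # edge to the previous memory on the same topic

def build_graph(memories):
    """Link each memory to the most recent earlier memory with the same topic."""
    latest = {}
    for m in sorted(memories, key=lambda m: m.turn):
        m.prev = latest.get(m.topic)
        latest[m.topic] = m
    return latest  # maps topic -> newest memory on that topic

def timeline(memory):
    """Walk the edges backwards and return the texts in chronological order."""
    chain = []
    while memory is not None:
        chain.append(memory.text)
        memory = memory.prev
    return chain[::-1]

memories = [
    Memory(1, "job", "works at a bakery"),
    Memory(2, "pet", "has a dog"),
    Memory(5, "job", "quit the bakery to study nursing"),
]
heads = build_graph(memories)
print(timeline(heads["job"]))
```

Returning the whole chain gives the responder the earlier state along with the change, which is the kind of context a counterfactual question about past dialogue needs.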

RMM (ICLR 2026) combines prospective reflection (summarizing dialogue history into topics) with retrospective reflection (online RL‑based retrieval refinement).

M3‑Agent (ICLR 2026) integrates multimodal perception, episodic and semantic memory graphs, and reinforcement‑learning‑driven control for long‑term reasoning.

Mem0 (2025) dynamically extracts and integrates salient information from conversations, and Mem0g (2025) adds a graph‑based memory representation. The survey credits them with superior single‑hop and multi‑hop performance and lower latency than full‑context baselines.
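The extract‑then‑integrate idea can be sketched as a store that decides, for each newly extracted fact, whether to add it, update an existing entry, delete one, or do nothing. This is a rule‑based stand‑in under stated assumptions: in a real system an LLM would perform both the extraction and the decision, and the `integrate` function, the tuple format, and the operation labels here are hypothetical.

```python
def integrate(store, candidate):
    """Apply one extracted fact to the store and report the operation taken.

    store: dict mapping (subject, attribute) -> value
    candidate: (subject, attribute, value); value None means a retraction.
    """
    subject, attribute, value = candidate
    key = (subject, attribute)
    if value is None:
        if key in store:
            del store[key]
            return "DELETE"
        return "NOOP"
    if key not in store:
        store[key] = value
        return "ADD"
    if store[key] == value:
        return "NOOP"          # already known, nothing to change
    store[key] = value         # newer value replaces the stale one
    return "UPDATE"

store = {}
extracted = [
    ("user", "city", "Berlin"),
    ("user", "city", "Berlin"),
    ("user", "city", "Madrid"),
    ("user", "diet", None),
]
ops = [integrate(store, fact) for fact in extracted]
print(ops, store)
```

Because the store holds only the consolidated facts, the prompt at answer time stays small, which is the intuition behind the latency advantage over full‑context baselines.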

Key Findings

RAG achieves the highest scores on accurate retrieval tasks, but its benefit diminishes as the number of retrieved observations grows.

Long‑context models (e.g., GPT‑4‑turbo‑16k) perform best on TTL and LRU, yet still lag far behind human baselines (e.g., 32.4 vs. 87.9 on QA).

All current approaches struggle with conflict resolution; the best reported accuracy is only 6 % on multi‑hop CR tasks.
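To see why multi‑hop conflict resolution is demanding, consider a toy case in which a later fact overrides an earlier one and the question chains two relations. The sketch is a hypothetical illustration, not data or code from any benchmark: the facts, names, and the "latest entry wins" rule are assumptions.

```python
def consolidate(facts):
    """Keep only the newest value for each (subject, relation) pair.

    Facts are listed oldest first, so later entries overwrite earlier ones.
    """
    current = {}
    for subject, relation, obj in facts:
        current[(subject, relation)] = obj
    return current

def multi_hop(current, start, relations):
    """Follow a chain of relations from a start entity through the consolidated facts."""
    entity = start
    for relation in relations:
        entity = current[(entity, relation)]
    return entity

facts = [
    ("Ana", "employer", "Acme"),
    ("Acme", "headquarters", "Oslo"),
    ("Globex", "headquarters", "Lima"),
    ("Ana", "employer", "Globex"),  # contradicts and supersedes the first fact
]
current = consolidate(facts)
answer = multi_hop(current, "Ana", ["employer", "headquarters"])
print(answer)
```

A system that retrieves the stale first fact answers "Oslo": the error made at the first hop carries through every later hop, which is consistent with accuracy collapsing in multi‑hop settings.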

Mem0 and Mem0g improve both accuracy (up to 7 % absolute gain) and efficiency (median latency reduced by >50 % compared to full‑context methods).

Synthetic datasets such as MUSE may not fully capture real‑world user behavior, highlighting a limitation for future work.

Discussion and Future Directions

The survey emphasizes that meaningful agent‑memory evaluation must jointly consider retrieval correctness, usage effectiveness, temporal dynamics (cross‑session, updates, forgetting), and cost constraints (latency, token usage, storage, privacy compliance). Only a unified, reproducible framework can guide system selection and engineering iteration. The authors invite collaboration on real‑world memory systems, dynamic update handling, and comprehensive benchmarking.
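Reporting correctness and cost side by side can be as simple as aggregating per‑question records. The sketch below assumes a hypothetical `Run` record and uses made‑up numbers purely to show the shape of such a report; none of the figures come from the surveyed papers.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Run:
    correct: bool      # whether the answer was judged correct
    latency_s: float   # end-to-end response time in seconds
    tokens: int        # tokens sent to the model for this question

def summarize(runs):
    """Aggregate accuracy together with latency and token cost."""
    n = len(runs)
    return {
        "accuracy": sum(r.correct for r in runs) / n,
        "median_latency_s": median([r.latency_s for r in runs]),
        "mean_tokens": sum(r.tokens for r in runs) / n,
    }

# Made-up records for two hypothetical systems.
full_context = [Run(True, 9.0, 26000), Run(False, 11.0, 26000), Run(True, 10.0, 26000)]
memory_based = [Run(True, 1.2, 1800), Run(True, 1.5, 1700), Run(False, 1.0, 1900)]

for name, runs in [("full_context", full_context), ("memory_based", memory_based)]:
    print(name, summarize(runs))
```

Putting the three numbers in one record makes trade‑offs visible: two systems with equal accuracy can differ by an order of magnitude in latency and token usage.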
