Does OpenClaw Remember You? Cambridge Launches ATM‑Bench for Long‑Term Memory

Cambridge's new ATM‑Bench evaluates whether AI assistants can retrieve personal memories spanning years of multimodal data. Leading agents such as OpenClaw, Codex, and Claude Code stay below 40% accuracy despite extensive toolchains, pointing to a fundamental long‑term memory challenge.

Machine Heart

Cambridge's machine intelligence lab has released ATM‑Bench, a benchmark that tests whether AI personal assistants can recall a user's multi‑year, multimodal personal history and answer questions about it accurately.

The benchmark covers roughly four years of data, includes over ten thousand memory items from three modalities (photos, videos, and emails), and contains more than 1,000 fully human‑annotated question‑answer‑evidence triples. All memory data come from real personal life rather than synthetic dialogues, and the visual data are enriched with timestamps, locations across four continents, and other metadata.

Key challenges identified by the authors are:

Personal referents, such as a pet name or a specific trip, that require disambiguation.

Cross‑source stitching, e.g., aligning photo timestamps with email confirmations (a minimal sketch follows this list).

Memory conflicts, for example differing amounts in a reservation email versus the final invoice.

Metadata noise, such as GPS inaccuracies.
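
To make the stitching challenge concrete, here is a minimal sketch that joins photos to email confirmations by timestamp proximity. The `Photo` and `Email` records and the two‑hour window are assumptions for illustration; the paper does not publish reference code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record types; ATM-Bench's actual item schema is not public.
@dataclass
class Photo:
    taken_at: datetime
    path: str

@dataclass
class Email:
    received_at: datetime
    subject: str

def stitch(photos: list[Photo], emails: list[Email],
           window: timedelta = timedelta(hours=2)) -> list[tuple[Photo, Email]]:
    """Pair each photo with any email whose timestamp falls within
    `window` of the photo's capture time."""
    return [(p, e) for p in photos for e in emails
            if abs(p.taken_at - e.received_at) <= window]
```

A real pipeline would also have to tolerate the metadata noise noted above, for example by widening the window when device clocks or GPS readings are unreliable.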

Personal referent parsing example – “Who is Grace?” To answer, the system must resolve who Grace is, find her in the visual record, and make sense of subjective descriptors like “stealthily”:

Identify Grace’s role (friend, family, pet).

Detect Grace in images or videos.

Interpret the phrase “stealthily” in context.
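
One way to picture the first two steps is a crude cue‑voting heuristic over every text that mentions the name. This is a minimal sketch, with keyword cue lists invented purely for illustration; a real agent would combine retrieval with a language model rather than keyword matching.

```python
from collections import Counter

# Hypothetical contextual cues per role; not the benchmark's method.
ROLE_CUES = {
    "pet": ("vet", "leash", "kibble"),
    "family": ("mom", "dad", "sister", "reunion"),
    "friend": ("drinks", "meetup", "dinner with"),
}

def resolve_referent(name: str, texts: list[str]) -> str:
    """Vote over contextual cues in emails/captions that mention `name`."""
    votes = Counter()
    mentions = [t.lower() for t in texts if name.lower() in t.lower()]
    for text in mentions:
        for role, cues in ROLE_CUES.items():
            votes[role] += sum(cue in text for cue in cues)
    top = votes.most_common(1)
    return top[0][0] if top and top[0][1] > 0 else "unknown"

print(resolve_referent("Grace", [
    "Grace's vet appointment is on Tuesday",
    "The new leash for Grace arrived",
]))  # -> "pet"
```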

Evidence‑conflict example – “How much did I spend on the Portugal hotel?” Multiple pieces of evidence exist (an outdated reservation email and a final invoice). The AI must judge which source is more recent and trustworthy, a difficulty that even advanced models such as GPT‑5.2 or Opus‑4.6 can mishandle.
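
One plausible resolution strategy, sketched here as an assumption since the paper prescribes no policy, is to rank conflicting evidence by recency and break ties by source authority:

```python
from datetime import datetime

# Assumed authority ranking: a final invoice outranks a reservation email.
AUTHORITY = {"invoice": 2, "reservation_email": 1}

def resolve_amount(evidence: list[dict]) -> float:
    """Pick the amount from the newest, most authoritative document."""
    best = max(evidence, key=lambda e: (e["timestamp"],
                                        AUTHORITY.get(e["source"], 0)))
    return best["amount"]

# Toy data for illustration only.
conflicting = [
    {"source": "reservation_email", "timestamp": datetime(2024, 5, 1), "amount": 420.0},
    {"source": "invoice", "timestamp": datetime(2024, 5, 9), "amount": 455.0},
]
print(resolve_amount(conflicting))  # 455.0 -- the later, final invoice wins
```

The failure mode the authors describe corresponds to answering from the first document retrieved (the stale reservation) instead of applying any such ordering.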

Invisible‑clue example – “What did I order at Fancett?” The restaurant name appears only in an email; the photos lack GPS tags. The required reasoning steps are:

Find the email containing the Fancett reservation.

Extract the timestamp and define a time window.

Search the photo album for images from that window.

Visually infer the ordered dish from the selected photo.

This multi‑step, cross‑modal process demonstrates why single‑modality approaches fail.
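
As a self‑contained sketch, the four steps might compose as follows; the record types, the three‑hour window, and the deferred vision step are assumptions rather than the benchmark's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record types standing in for indexed personal data.
@dataclass
class Email:
    body: str
    reservation_time: datetime

@dataclass
class Photo:
    taken_at: datetime
    path: str

def find_meal_photos(restaurant: str, emails: list[Email],
                     photos: list[Photo]) -> list[Photo]:
    # Step 1: locate the email mentioning the restaurant (photos lack GPS).
    email = next((e for e in emails
                  if restaurant.lower() in e.body.lower()), None)
    if email is None:
        return []
    # Step 2: derive a time window from the reservation timestamp.
    start = email.reservation_time
    end = start + timedelta(hours=3)  # assumed window length
    # Step 3: select photos captured inside that window.
    candidates = [p for p in photos if start <= p.taken_at <= end]
    # Step 4 happens outside this function: pass `candidates` to a
    # vision model and ask which dish appears in the frame.
    return candidates
```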

Experimental results on the hard version of ATM‑Bench show that specialized memory systems (A‑Mem, HippoRAG2, mem0, MemoryOS) achieve less than 20% accuracy. Among generalist agents, the best performer, Codex, reaches only 39.7% accuracy; Claude Code + Opus 4.6 scores 33.8%; OpenCode (Kimi K2.5) 30.3%; and OpenClaw (Kimi K2.5) 25.4%.

Token consumption is also high: Codex uses 15.46 M tokens, while OpenClaw consumes 9.63 M tokens, indicating that even massive tool‑chain usage yields limited gains.

The authors conclude that, despite equipping agents with code execution, file search, and indexing capabilities, long‑term personalized memory QA remains a fundamental obstacle. They hope the benchmark will spur research into more robust memory architectures for truly personalized AI assistants.
