RoboMemArena: A Comprehensive Benchmark that Truly Tests Robot Memory for Embodied AI

RoboMemArena is a systematic, long‑horizon robot memory benchmark with 26 tasks, 151 sub‑tasks, multimodal annotations, and real‑robot evaluations. It exposes the limitations of existing benchmarks and shows that the dual‑system PrediMem model outperforms the baselines both in simulation and on physical robots.

Machine Heart

Motivation

Embodied foundation models such as vision‑language‑action (VLA) models and world models have progressed rapidly, but long‑duration, complex tasks expose a practical limitation: robots often “can’t remember”. Whether a cabinet was previously opened, where an occluded object was placed, how many times an action has been repeated, or the exact order of a demonstration cannot be determined from a single‑frame observation. Existing robot benchmarks do not adequately characterize this kind of memory‑dependent, long‑horizon manipulation.

RoboMemArena benchmark

RoboMemArena is a systematic benchmark focused on robot memory. It defines four core memory scenarios (Transferring, Occlusion, Counting, and Sequence) across 26 long‑horizon tasks and 151 fine‑grained sub‑tasks. The benchmark provides 2 600 expert demonstration trajectories, which are further split into 15 100 keyframe‑aligned short segments. Statistics:

Average task length > 1 000 steps

68.9 % of the 151 sub‑tasks are memory‑dependent
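As a quick consistency check on these figures, the arithmetic below uses only the numbers quoted above. The per‑task and per‑sub‑task averages are derived values, and reading them as an even split is an inference rather than something the source states.

```python
# Figures quoted in the text above.
NUM_TASKS = 26
NUM_SUBTASKS = 151
NUM_DEMOS = 2600
NUM_SEGMENTS = 15100
MEMORY_DEPENDENT_SHARE = 0.689  # 68.9 %

# Derived averages (assumption: reported totals, no per-task breakdown given).
demos_per_task = NUM_DEMOS / NUM_TASKS
segments_per_subtask = NUM_SEGMENTS / NUM_SUBTASKS
subtasks_per_task = NUM_SUBTASKS / NUM_TASKS

# Number of sub-tasks implied by the 68.9 % share.
memory_dependent_subtasks = round(MEMORY_DEPENDENT_SHARE * NUM_SUBTASKS)

print(f"demonstrations per task:      {demos_per_task:.0f}")
print(f"segments per sub-task:        {segments_per_subtask:.0f}")
print(f"sub-tasks per task (average): {subtasks_per_task:.2f}")
print(f"memory-dependent sub-tasks:   {memory_dependent_subtasks} of {NUM_SUBTASKS}")
```

The totals work out to 100 demonstrations per task and 100 segments per sub‑task on average, and the 68.9 % share corresponds to 104 of the 151 sub‑tasks.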

Annotations include:

Subtask‑level annotations: decomposition of long trajectories into executable sub‑tasks

Native keyframe annotations: explicit labeling of critical physical state transitions

Aligned visual observations, actions, and robot states for each trajectory
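To make the annotation layers concrete, here is a minimal sketch of how a trajectory with sub‑task spans and keyframe indices could be represented and cut into keyframe‑aligned segments. The class names, field names, and the example task are hypothetical and are not taken from the RoboMemArena dataset format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubtaskSpan:
    """One executable sub-task inside a long trajectory (hypothetical schema)."""
    instruction: str
    start_step: int  # inclusive
    end_step: int    # exclusive
    memory_dependent: bool = False


@dataclass
class AnnotatedTrajectory:
    """A trajectory with sub-task spans and keyframe step indices (hypothetical schema)."""
    task_name: str
    num_steps: int
    subtasks: List[SubtaskSpan] = field(default_factory=list)
    keyframe_steps: List[int] = field(default_factory=list)

    def subtask_at(self, step: int) -> SubtaskSpan:
        """Return the sub-task span that covers the given step."""
        for span in self.subtasks:
            if span.start_step <= step < span.end_step:
                return span
        raise IndexError(f"step {step} is not covered by any sub-task")

    def split_at_keyframes(self) -> List[range]:
        """Cut the trajectory into short segments whose boundaries are keyframes."""
        inner = [k for k in self.keyframe_steps if 0 < k < self.num_steps]
        bounds = sorted({0, self.num_steps, *inner})
        return [range(a, b) for a, b in zip(bounds, bounds[1:])]


trajectory = AnnotatedTrajectory(
    task_name="example_counting_task",  # placeholder name
    num_steps=1200,
    subtasks=[
        SubtaskSpan("pick up the bottle", 0, 300),
        SubtaskSpan("pour once", 300, 750, memory_dependent=True),
        SubtaskSpan("pour a second time", 750, 1200, memory_dependent=True),
    ],
    keyframe_steps=[300, 750],
)
segments = trajectory.split_at_keyframes()
print([(s.start, s.stop) for s in segments])
```

Keeping keyframes as plain step indices means the observation, action, and robot‑state streams can all be sliced with the same boundaries.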

Five real‑robot evaluation tasks are provided: Pour Bottle ×2 (counting), Brush Plates with Swap (state occlusion), Transfer Objects (sequence), Shell Game (hidden‑state tracking), and IHMB – Imitate Human to Make Breakfast (long‑horizon imitation, > 3 min). The benchmark supplies BDDL task definitions, LIBERO‑compatible evaluation environments, and code compatible with MuJoCo, RoboSuite, and OpenGL/EGL.
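For readers running MuJoCo‑based environments on a machine without a display, the sketch below shows one common way to select an off‑screen OpenGL backend. `MUJOCO_GL` and `PYOPENGL_PLATFORM` are the standard MuJoCo and PyOpenGL environment variables; whether the RoboMemArena code expects exactly this setup is an assumption, and the helper name `configure_headless_rendering` is hypothetical.

```python
import os


def configure_headless_rendering(backend: str = "egl") -> dict:
    """Select an OpenGL backend; call this before importing any simulator package."""
    allowed = {"egl", "osmesa", "glfw"}
    if backend not in allowed:
        raise ValueError(f"backend must be one of {sorted(allowed)}, got {backend!r}")

    applied = {"MUJOCO_GL": backend}
    # PyOpenGL only needs an explicit platform for the off-screen backends.
    if backend in {"egl", "osmesa"}:
        applied["PYOPENGL_PLATFORM"] = backend

    os.environ.update(applied)
    return applied


settings = configure_headless_rendering("egl")
print(settings)
```

The variables have to be set before the simulator is imported, because the rendering backend is typically chosen at import time.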

PrediMem architecture

PrediMem is a dual‑system VLA baseline. A high‑level vision‑language model (VLM) performs planning and memory management, while a low‑level VLA policy executes action chunks. Key components:

Recent‑frame buffer: stores the latest observations

Keyframe buffer: retains selected keyframes that mark important state changes

Predictive‑coding head: makes the high‑level representation sensitive to physical state transitions

This lightweight design explicitly organizes historical information rather than relying on increased model size.
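The two buffers described above can be sketched as a small data structure. This is an illustrative sketch only, not the PrediMem implementation: the class name, the capacities, the oldest‑first eviction, and the `is_keyframe` flag passed in by the caller are all assumptions, since the text does not detail how keyframes are selected or evicted.

```python
from collections import deque
from typing import Any, Deque, List, Tuple

Frame = Tuple[int, Any]  # (step index, observation)


class DualBufferMemory:
    """Short-term recent frames plus longer-lived keyframes (illustrative sketch)."""

    def __init__(self, recent_capacity: int = 4, keyframe_capacity: int = 8) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.recent: Deque[Frame] = deque(maxlen=recent_capacity)
        self.keyframes: Deque[Frame] = deque(maxlen=keyframe_capacity)

    def add(self, step: int, observation: Any, is_keyframe: bool = False) -> None:
        """Store every frame as recent; additionally keep it if flagged as a keyframe."""
        self.recent.append((step, observation))
        if is_keyframe:
            self.keyframes.append((step, observation))

    def context(self) -> List[Frame]:
        """Return keyframes and recent frames merged in time order, without duplicates."""
        merged = dict(self.keyframes)
        merged.update(dict(self.recent))
        return [(step, merged[step]) for step in sorted(merged)]


memory = DualBufferMemory(recent_capacity=2, keyframe_capacity=4)
for step in range(10):
    memory.add(step, f"frame_{step}", is_keyframe=step in {1, 5})

context_steps = [step for step, _ in memory.context()]
print(context_steps)
```

After ten steps the planner's context holds only four frames: the two flagged keyframes and the two most recent observations. Context size stays bounded while state changes from much earlier in the episode remain available.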

Experimental results

Simulation

PrediMem: 38.5 % Task Success Rate (TSR) / 55.2 % Completion Success Rate (CSR)

MemER: 27.3 % TSR / 49.1 % CSR

π0.5: 21.5 % TSR / 38.7 % CSR

Across the four memory scenarios, PrediMem achieves the highest average performance. Its strongest results are in the Sequence setting (72.5 % TSR / 89.5 % CSR). It also outperforms the baselines on the most memory‑intensive Occlusion and Counting tasks.

Real‑robot

PrediMem: 52 % average success

MemER: 40 % average success

π0.5: 20 % average success
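The margins implied by the reported numbers can be worked out directly. The snippet below only subtracts figures listed above and introduces no new results; the helper name `margin_over` is hypothetical.

```python
# Reported results in percent, copied from the lists above.
simulation = {  # model -> (TSR, CSR)
    "PrediMem": (38.5, 55.2),
    "MemER": (27.3, 49.1),
    "π0.5": (21.5, 38.7),
}
real_robot = {"PrediMem": 52.0, "MemER": 40.0, "π0.5": 20.0}


def margin_over(baseline: str) -> dict:
    """Percentage-point lead of PrediMem over the given baseline."""
    tsr, csr = simulation["PrediMem"]
    base_tsr, base_csr = simulation[baseline]
    return {
        "sim_tsr_points": round(tsr - base_tsr, 1),
        "sim_csr_points": round(csr - base_csr, 1),
        "real_points": round(real_robot["PrediMem"] - real_robot[baseline], 1),
    }


for name in ("MemER", "π0.5"):
    print(name, margin_over(name))
```

Against MemER the lead is 11.2 TSR points and 6.1 CSR points in simulation and 12 points on real robots; against π0.5 it is 17.0, 16.5, and 32 points respectively.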

Only PrediMem succeeds on the longest and most complex task, IHMB. Ablation studies show that removing either the predictive‑coding head or the keyframe buffer degrades performance, indicating that the advantage stems from better organization of historical information.

Resources

Paper: https://arxiv.org/abs/2605.10921

Project site: https://robomemarena.github.io/

Code repository: https://github.com/OpenHelix-Team/RoboMemArena

Dataset: https://huggingface.co/datasets/RoboMemArenaBenchmark/RoboMemArena
