RoboMemArena: A Comprehensive Benchmark that Truly Tests Robot Memory for Embodied AI
RoboMemArena introduces a systematic, long‑horizon robot memory benchmark with 26 tasks, 151 sub‑tasks, multimodal annotations, and real‑robot evaluations, exposing the limitations of existing benchmarks and demonstrating that the dual‑system PrediMem model markedly outperforms baselines both in simulation and on physical robots.
Motivation
Embodied foundation models such as vision‑language‑action (VLA) models and world models have progressed rapidly, but long‑duration, complex tasks expose a practical limitation: robots often “can’t remember”. Questions such as whether a cabinet was previously opened, where an occluded object was placed, how many times an action was repeated, or in what order a demonstration was performed cannot be answered from a single‑frame observation. Existing robot benchmarks do not adequately characterize this kind of memory‑dependent, long‑horizon manipulation.
RoboMemArena benchmark
RoboMemArena is a systematic benchmark focused on robot memory. It defines four core memory scenarios (Transferring, Occlusion, Counting, and Sequence) across 26 long‑horizon tasks and 151 fine‑grained sub‑tasks. The benchmark provides 2,600 expert demonstration trajectories, which are further split into 15,100 keyframe‑aligned short segments. Statistics:
Average task length: more than 1,000 steps
68.9 % of the 151 sub‑tasks are memory‑dependent
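As a quick sanity check, the totals above imply some simple averages. The short calculation below only rearranges the quoted figures; the even split per task and per sub‑task is an assumption, since the article reports totals rather than a per‑task breakdown.

```python
# Figures quoted in the article.
tasks = 26
sub_tasks = 151
trajectories = 2600
segments = 15100
memory_dependent_share = 0.689

# Averages implied by the totals.
# Assumption: an even split, which the article does not state explicitly.
demos_per_task = trajectories / tasks
segments_per_sub_task = segments / sub_tasks
segments_per_trajectory = segments / trajectories

# Number of memory-dependent sub-tasks implied by the 68.9 % share.
memory_dependent = round(memory_dependent_share * sub_tasks)

print(demos_per_task)                      # 100.0
print(segments_per_sub_task)               # 100.0
print(round(segments_per_trajectory, 2))   # 5.81
print(memory_dependent)                    # 104
```

The roughly 5.8 segments per trajectory matches the average of 151 / 26 ≈ 5.8 sub‑tasks per task, which is consistent with one keyframe‑aligned segment per sub‑task.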
Annotations include:
Subtask‑level annotations: decomposition of long trajectories into executable sub‑tasks
Native keyframe annotations: explicit labeling of critical physical state transitions
Aligned visual observations, actions, and robot states for each trajectory
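To make the annotation layers concrete, here is a minimal sketch of how one annotated trajectory might be represented and cut into keyframe‑aligned segments. The class names, field names, and the `split_at_keyframes` helper are hypothetical; the article does not describe the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SubTask:
    instruction: str  # language description of the sub-task
    start: int        # first step index (inclusive)
    end: int          # last step index (exclusive)


@dataclass
class Trajectory:
    num_steps: int
    sub_tasks: List[SubTask] = field(default_factory=list)
    # Step indices at which a critical physical state transition occurs.
    keyframes: List[int] = field(default_factory=list)


def split_at_keyframes(traj: Trajectory) -> List[Tuple[int, int]]:
    """Cut a trajectory into (start, end) step ranges bounded by its keyframes."""
    # Ignore keyframes at or outside the trajectory boundaries and drop duplicates.
    bounds = sorted({k for k in traj.keyframes if 0 < k < traj.num_steps})
    edges = [0] + bounds + [traj.num_steps]
    return list(zip(edges[:-1], edges[1:]))


traj = Trajectory(num_steps=1200, keyframes=[300, 650, 900])
print(split_at_keyframes(traj))
# [(0, 300), (300, 650), (650, 900), (900, 1200)]
```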
Five real‑robot evaluation tasks are provided: Pour Bottle ×2 (counting), Brush Plates with Swap (state occlusion), Transfer Objects (sequence), Shell Game (hidden‑state tracking), and IHMB – Imitate Human to Make Breakfast (long‑horizon imitation, > 3 min). The benchmark supplies BDDL task definitions, LIBERO‑compatible evaluation environments, and code compatible with MuJoCo, RoboSuite, and OpenGL/EGL.
PrediMem architecture
PrediMem is a dual‑system VLA baseline. A high‑level visual‑language model (VLM) performs planning and memory management, while a low‑level VLA executes action chunks. Key components:
Recent‑frame buffer: stores the latest observations
Keyframe buffer: retains selected keyframes that mark important state changes
Predictive‑coding head: makes the high‑level representation sensitive to physical state transitions
This lightweight design explicitly organizes historical information rather than relying on increased model size.
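The two buffers described above can be sketched as a pair of bounded queues. This is an illustrative toy, not PrediMem's implementation: the class name, the buffer sizes, and the externally supplied `is_keyframe` flag are all assumptions, and the article does not detail how the predictive‑coding head actually selects keyframes.

```python
from collections import deque


class DualBufferMemory:
    """Toy memory with a short recent-frame window and a longer-lived keyframe store."""

    def __init__(self, recent_size=4, keyframe_size=8):
        # Both buffers evict their oldest entry once full (sizes are hypothetical).
        self.recent = deque(maxlen=recent_size)
        self.keyframes = deque(maxlen=keyframe_size)

    def add(self, step, frame, is_keyframe=False):
        # Every observation enters the recent buffer.
        self.recent.append((step, frame))
        # Only frames flagged as state transitions are kept long term.
        if is_keyframe:
            self.keyframes.append((step, frame))

    def context(self):
        """Return stored frames in temporal order, without duplicates."""
        merged = {step: frame for step, frame in self.keyframes}
        merged.update({step: frame for step, frame in self.recent})
        return [merged[step] for step in sorted(merged)]


memory = DualBufferMemory(recent_size=2, keyframe_size=3)
for step in range(10):
    memory.add(step, f"obs{step}", is_keyframe=step in (1, 5))

print(memory.context())  # ['obs1', 'obs5', 'obs8', 'obs9']
```

In this toy run the planner would still see the observations from steps 1 and 5 long after they have left the recent window, which is the kind of information a question such as "was the cabinet opened earlier?" depends on.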
Experimental results
Simulation
PrediMem: 38.5 % Task Success Rate (TSR) / 55.2 % Completion Success Rate (CSR)
MemER: 27.3 % TSR / 49.1 % CSR
π0.5: 21.5 % TSR / 38.7 % CSR
Averaged across the four memory scenarios, PrediMem achieves the highest performance. Its strongest results appear in the Sequence setting (72.5 % TSR / 89.5 % CSR), and it also outperforms the baselines on the most memory‑intensive Occlusion and Counting tasks.
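The article does not define TSR and CSR. One plausible reading, stated here purely as an assumption, is that TSR counts an episode as successful only if every sub‑task succeeds, while CSR gives partial credit per completed sub‑task. That reading would explain why CSR is higher than TSR for every method. A minimal sketch under that assumption, using made‑up episode outcomes:

```python
def task_success_rate(episodes):
    """Share of episodes in which every sub-task succeeded (assumed definition)."""
    return sum(all(ep) for ep in episodes) / len(episodes)


def completion_success_rate(episodes):
    """Share of all sub-tasks, pooled over episodes, that succeeded (assumed definition)."""
    total = sum(len(ep) for ep in episodes)
    return sum(sum(ep) for ep in episodes) / total


# Hypothetical outcomes: one list of per-sub-task results per episode.
episodes = [
    [True, True, True],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]

print(task_success_rate(episodes))        # 0.5
print(completion_success_rate(episodes))  # 0.75
```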
Real‑robot
PrediMem: 52 % average success
MemER: 40 % average success
π0.5: 20 % average success
Only PrediMem succeeds on IHMB, the longest and most complex task. Ablation studies show that removing either the predictive‑coding head or the keyframe buffer degrades performance, which supports the claim that the advantage stems from better organization of historical information.
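The size of PrediMem's lead can be read directly off the reported numbers. The snippet below only subtracts the figures quoted in this article, in percentage points; the `margins` helper is an illustrative name.

```python
# Scores as reported in the article (percent).
simulation_tsr = {"PrediMem": 38.5, "MemER": 27.3, "π0.5": 21.5}
simulation_csr = {"PrediMem": 55.2, "MemER": 49.1, "π0.5": 38.7}
real_robot = {"PrediMem": 52.0, "MemER": 40.0, "π0.5": 20.0}


def margins(scores, reference="PrediMem"):
    """Percentage-point lead of the reference method over each other method."""
    return {
        name: round(scores[reference] - value, 1)
        for name, value in scores.items()
        if name != reference
    }


print(margins(simulation_tsr))  # {'MemER': 11.2, 'π0.5': 17.0}
print(margins(simulation_csr))  # {'MemER': 6.1, 'π0.5': 16.5}
print(margins(real_robot))      # {'MemER': 12.0, 'π0.5': 32.0}
```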
Resources
Paper: https://arxiv.org/abs/2605.10921
Project site: https://robomemarena.github.io/
Code repository: https://github.com/OpenHelix-Team/RoboMemArena
Dataset: https://huggingface.co/datasets/RoboMemArenaBenchmark/RoboMemArena
