ArXiv

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Authors
Huashuo Lei, Wenxuan Song, Huarui Zhang...
Categories
cs.RO
arXiv
https://arxiv.org/abs/2605.10921v1
PDF
https://arxiv.org/pdf/2605.10921v1

Brief

RoboMemArena is a large-scale robotic memory benchmark with 26 long-horizon tasks (average trajectory length above 1,000 steps), multimodal memory annotations, and paired real-world evaluations. Its generation pipeline uses a VLM to design and compose subtasks and generates full trajectories through atomic functions. The authors also introduce PrediMem, a dual-system VLA whose high-level VLM planner manages recent and keyframe memory buffers and uses a predictive coding head; it outperforms all baselines on this benchmark.

Why it matters

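According to the authors, existing robotic memory benchmarks lack multimodal annotations for memory formation, offer limited task coverage and structural complexity, and are restricted to simulation. RoboMemArena (arXiv preprint, 2026-05-11) targets all three gaps. It contains 26 tasks with average trajectory lengths exceeding 1,000 steps, and 68.9% of its subtasks are labeled memory-dependent. It also supplies multimodal memory annotations (subtask instructions and native keyframe annotations), a VLM-driven generation pipeline built on atomic functions, and paired real-world tasks for physical evaluation (project: https://robomemarena.github.io).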
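
To make the pipeline description concrete, the following is a minimal sketch, not the authors' implementation. It assumes a VLM has already produced a plan of subtasks, each carrying an instruction, a memory-dependence label, and a list of atomic function calls. The atomic functions (`move_to`, `grasp`, `release`), the `Subtask` and `Step` types, and the rule that the last step of each subtask is the keyframe are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical atomic functions: each expands to a list of low-level actions.
def move_to(target: str) -> List[str]:
    return [f"move_to({target})"]

def grasp(obj: str) -> List[str]:
    return [f"grasp({obj})"]

def release(obj: str) -> List[str]:
    return [f"release({obj})"]

AtomicCall = Tuple[Callable[[str], List[str]], str]

@dataclass
class Subtask:
    instruction: str        # subtask instruction annotation
    memory_dependent: bool  # whether solving it needs past observations
    calls: List[AtomicCall]

@dataclass
class Step:
    action: str
    instruction: str
    is_keyframe: bool

def generate_trajectory(subtasks: List[Subtask]) -> List[Step]:
    """Expand each subtask into steps; mark its final step as a keyframe (assumption)."""
    steps: List[Step] = []
    for subtask in subtasks:
        actions = [a for fn, arg in subtask.calls for a in fn(arg)]
        for i, action in enumerate(actions):
            is_last = i == len(actions) - 1
            steps.append(Step(action, subtask.instruction, is_keyframe=is_last))
    return steps

def memory_dependent_fraction(subtasks: List[Subtask]) -> float:
    """Share of subtasks labeled memory-dependent (0.0 for an empty plan)."""
    if not subtasks:
        return 0.0
    return sum(s.memory_dependent for s in subtasks) / len(subtasks)

# A hand-written plan standing in for VLM output.
plan = [
    Subtask("pick up the red block", False,
            [(move_to, "red_block"), (grasp, "red_block")]),
    Subtask("return it to where it started", True,
            [(move_to, "start_pose"), (release, "red_block")]),
]
trajectory = generate_trajectory(plan)
fraction = memory_dependent_fraction(plan)
```

The same fraction computed over the benchmark's own subtask labels is what the 68.9% figure describes; the toy plan here is only meant to show the shape of the annotations.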

Key details

  • PrediMem is the proposed dual-system vision-language-action (VLA) model. Its high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and yield empirical insights into memory management, model architecture choices, and scaling laws for complex memory systems.
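
The memory bank described above can be pictured with a small sketch. This is an illustration under stated assumptions, not PrediMem's actual design: the `MemoryBank` and `Observation` names, the buffer sizes, the first-in-first-out eviction, and the way the two buffers are merged into one planner context are all hypothetical, and the predictive coding head is left out entirely.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

@dataclass
class Observation:
    step: int
    frame: str                # placeholder for an image or embedding
    is_keyframe: bool = False

class MemoryBank:
    """Two bounded buffers: a sliding window of recent frames plus stored keyframes."""

    def __init__(self, recent_size: int = 4, keyframe_size: int = 8) -> None:
        # deque(maxlen=...) drops the oldest entry once the buffer is full.
        self.recent: Deque[Observation] = deque(maxlen=recent_size)
        self.keyframes: Deque[Observation] = deque(maxlen=keyframe_size)

    def add(self, obs: Observation) -> None:
        self.recent.append(obs)
        if obs.is_keyframe:
            self.keyframes.append(obs)

    def context(self) -> List[Observation]:
        """Merge both buffers, drop duplicates by step, return in time order."""
        merged = {o.step: o for o in list(self.keyframes) + list(self.recent)}
        return [merged[step] for step in sorted(merged)]

# Ten observations; steps 0 and 5 are flagged as keyframes.
bank = MemoryBank(recent_size=3, keyframe_size=2)
for t in range(10):
    bank.add(Observation(step=t, frame=f"frame_{t}", is_keyframe=t in (0, 5)))
context_steps = [o.step for o in bank.context()]
```

With these settings the planner context keeps the two keyframes (steps 0 and 5) alongside the three most recent frames (steps 7 to 9), which is the intuition behind pairing a short recent window with a sparse long-term store.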

Source evidence

Abstract

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.

Comment: Project website: https://robomemarena.github.io