ArXiv

MEME: Multi-entity & Evolving Memory Evaluation

Authors
Seokwon Jung, Alexander Rubinstein, Arnas Uselis...
Categories
cs.LG, cs.CL
arXiv
https://arxiv.org/abs/2605.12477v1
PDF
https://arxiv.org/pdf/2605.12477v1

Brief

MEME (Multi-entity & Evolving Memory Evaluation) is a benchmark for LLM-agent failures in storing, updating, and reasoning about many entities across sessions. It defines six tasks (including Cascade, Absence, and Deletion) and tests six memory systems spanning three paradigms on 100 controlled episodes. Under the default configuration, all systems collapse on dependency reasoning (average accuracy: Cascade 3%, Absence 1%). Only a file-based agent paired with Claude Opus 4.7 partially closes the gap, and only at ~70× the baseline cost, which points to a cost-performance tradeoff that is not practical at scale.

Why it matters

Prior benchmarks evaluate only single-entity updates, while agents in persistent environments must store, update, and reason over many entities across sessions. MEME covers the full space spanned by the multi-entity and evolving axes with six tasks, three of which prior work did not score: Cascade and Absence (dependency reasoning) and Deletion (post-removal state).

Key details

  • All six systems fail at dependency reasoning: average accuracy under the default configuration was 3% on Cascade and 1% on Absence, despite adequate static retrieval performance.
  • Mitigations (prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs) do not close the gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially recovers performance, at ~70× the baseline cost.
  • Code and data: https://seokwonjung-jay.github.io/meme-eval/.

Source evidence

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
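
The abstract reports per-task figures "in average accuracy" and a cost multiple relative to a baseline, without spelling out the aggregation. The sketch below shows one plausible way such numbers could be computed: accuracy per (system, task) cell, then an unweighted mean over systems. This is an assumption, not the paper's documented procedure. The function names, the record layout, and the toy records are all hypothetical; the records are synthetic and are not results from the paper.

```python
from collections import defaultdict


def average_accuracy(results):
    """Mean per-task accuracy across systems (assumed aggregation).

    results: iterable of (system, task, correct) tuples, one per scored question.
    Returns {task: unweighted mean over systems of that system's accuracy}.
    """
    counts = defaultdict(lambda: [0, 0])  # (system, task) -> [n_correct, n_total]
    for system, task, correct in results:
        cell = counts[(system, task)]
        cell[0] += int(correct)
        cell[1] += 1

    per_task = defaultdict(list)  # task -> list of per-system accuracies
    for (_system, task), (n_correct, n_total) in counts.items():
        per_task[task].append(n_correct / n_total)

    return {task: sum(accs) / len(accs) for task, accs in per_task.items()}


def cost_ratio(config_cost, baseline_cost):
    """Cost of a configuration as a multiple of the baseline cost."""
    return config_cost / baseline_cost


# Synthetic toy records for illustration only, NOT data from the paper.
toy_results = [
    ("system_a", "Cascade", False),
    ("system_a", "Cascade", True),
    ("system_b", "Cascade", False),
    ("system_b", "Cascade", False),
    ("system_a", "Deletion", True),
    ("system_b", "Deletion", False),
]

summary = average_accuracy(toy_results)
print(summary)
print(cost_ratio(140.0, 2.0))
```

Averaging over systems after computing each system's accuracy keeps a system with more scored questions from dominating the task-level figure. Whether the paper weights systems this way is not stated in the abstract.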