ArXiv

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

2026-05-12 · 17:59 UTC

Authors: Di Wu, Zixiang Ji, Asmi Kawatkar...
Categories: cs.CL
arXiv: https://arxiv.org/abs/2605.12493v1
PDF: https://arxiv.org/pdf/2605.12493v1

Brief

LongMemEval-V2 (LME-V2) is a benchmark for assessing whether memory systems let agents acquire environment-specific experience. It provides 451 questions across five memory abilities, with histories of up to 500 trajectories (115M tokens). The authors evaluate two methods: the RAG-style AgentRunbook-R and AgentRunbook-C, which combines file-backed trajectories with a coding agent. AgentRunbook-C reaches 72.5% accuracy, compared with 48.5% for the best RAG baseline and 69.3% for a coding-agent baseline, at the cost of higher latency. This summary is based on the abstract.

Why it matters

LongMemEval-V2 (LME-V2) comprises 451 manually curated questions covering five agent memory abilities: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. The questions are paired with histories of up to 500 trajectories and 115M tokens.

Key details

The paper introduces two memory methods: AgentRunbook-R (RAG with knowledge pools) and AgentRunbook-C (file-backed trajectories plus a coding agent in an augmented sandbox). AgentRunbook-C achieves 72.5% average accuracy, versus 48.5% for the strongest RAG baseline and 69.3% for an off-the-shelf coding-agent baseline. Coding-agent methods, however, incur high latency.
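
To make the reported gaps concrete, the short sketch below computes AgentRunbook-C's lead over each baseline in percentage points. The three accuracy figures come from the summary above; the dictionary keys and the `margin_over` helper are illustrative names chosen here, not anything taken from the paper.

```python
# Average accuracies (percent) as quoted in the summary above.
# The labels are descriptive placeholders, not official system names.
reported_accuracy = {
    "AgentRunbook-C": 72.5,
    "strongest RAG baseline": 48.5,
    "off-the-shelf coding-agent baseline": 69.3,
}


def margin_over(reference, scores):
    """Return the gap, in percentage points, between `reference` and every other entry."""
    ref = scores[reference]
    return {
        name: round(ref - acc, 1)  # round to one decimal to hide float noise
        for name, acc in scores.items()
        if name != reference
    }


margins = margin_over("AgentRunbook-C", reported_accuracy)
for name, gap in margins.items():
    print(f"AgentRunbook-C leads the {name} by {gap} points")
```

On these figures the lead over the strongest RAG baseline is 24.0 points, while the lead over the off-the-shelf coding-agent baseline is 3.2 points. Most of the gain therefore tracks the move from RAG to a coding agent, which is also where the summary reports the latency cost.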

Cleaned source text

Abstract

Comment: Work in Progress