Abstract
Comment: Work in Progress
LongMemEval-V2 (LME-V2) is a benchmark for assessing whether memory systems let agents acquire environment-specific experience. It provides 451 questions across five memory abilities, with histories of up to 500 trajectories (115M tokens). The authors evaluate two methods: AgentRunbook-R, a RAG-style approach, and AgentRunbook-C, which combines files with a coding agent. AgentRunbook-C reaches 72.5% accuracy, versus 48.5% for the best RAG baseline and 69.3% for a coding-agent baseline, though the authors note that it has higher latency. This summary is based on the abstract.
LongMemEval-V2 (LME-V2) is a benchmark of 451 manually curated questions covering five agent memory abilities (static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness), paired with histories of up to 500 trajectories and 115M tokens.