ArXiv

Personal Visual Context Learning in Large Multimodal Models

Authors
Zihui Xue, Ami Baid, Sangho Kim...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.10936v1
PDF
https://arxiv.org/pdf/2605.10936v1

Brief

Personal Visual Context Learning (Personal VCL) formalizes how LMMs on wearable devices should use user-specific first-person visual context to resolve personalized queries at prompt time. Xue et al. introduce Personal-VCL-Bench, a benchmark covering persons, objects, and behaviors; identify a pronounced context-utilization gap in frontier LMMs; and propose the Agentic Context Bank, an inference-time baseline that combines a self-refining memory bank with query-adaptive evidence selection. The baseline consistently improves over standard context prompting across tasks and evaluated backbones.

Why it matters

The paper formalizes Personal Visual Context Learning (Personal VCL) and introduces Personal-VCL-Bench, a benchmark capturing a wearer's personal visual world across persons, objects, and behaviors (arXiv preprint, 2026-05-11, by Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman).

Key details

  • The authors analyze frontier LMMs and find a pronounced context-utilization gap. They propose the Agentic Context Bank, a self-refining memory bank with query-adaptive evidence selection, which consistently improves over standard context prompting across tasks and evaluated backbones.
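
To make the memory-bank idea concrete, here is a toy Python sketch. It is not the paper's Agentic Context Bank: the class ToyContextBank, the Entry record, and the word-overlap scoring are all hypothetical stand-ins chosen for illustration, and no LMM is involved. It only shows the general shape of the two ingredients named above: merging repeated observations of one entity into a single entry, and picking evidence per query instead of passing all context to the model.

```python
from dataclasses import dataclass, field
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a learned embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Entry:
    # Hypothetical record: one entity in the wearer's visual world.
    name: str
    kind: str  # "person", "object", or "behavior"
    notes: list[str] = field(default_factory=list)


class ToyContextBank:
    """Illustrative only; NOT the paper's Agentic Context Bank."""

    def __init__(self) -> None:
        self.entries: dict[str, Entry] = {}

    def observe(self, name: str, kind: str, note: str) -> None:
        # "Refinement" here only merges notes per entity and drops exact repeats.
        entry = self.entries.setdefault(name, Entry(name, kind))
        if note not in entry.notes:
            entry.notes.append(note)

    def select(self, query: str, k: int = 2) -> list[Entry]:
        # Query-adaptive selection: rank entries by word overlap with the query
        # and keep at most k entries that share at least one word with it.
        q = tokens(query)
        scored = [
            (len(q & tokens(e.name + " " + " ".join(e.notes))), e)
            for e in self.entries.values()
        ]
        scored = [(s, e) for s, e in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:k]]


bank = ToyContextBank()
bank.observe("Alex", "person", "wears a red jacket, often at the office kitchen")
bank.observe("my mug", "object", "blue ceramic mug with a chipped handle")
bank.observe("my mug", "object", "blue ceramic mug with a chipped handle")  # repeat
bank.observe("my mug", "object", "usually left on the desk near the monitor")

selected = bank.select("Where did I leave my mug?")
for entry in selected:
    print(entry.kind, entry.name, entry.notes)
```

In this sketch only the selected entries would be placed in the prompt alongside the query. A real system would presumably store visual observations and use a model to refine and retrieve them; those details are not given in the abstract and are not modeled here.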

Source evidence

Abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

Comment: Project website: https://vision.cs.utexas.edu/projects/PersonalVCL/