ArXiv

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

Authors
Giacomo Spigler
Categories
cs.RO, cs.AI, cs.CV, cs.LG
arXiv
https://arxiv.org/abs/2605.07943v1
PDF
https://arxiv.org/pdf/2605.07943v1

Brief

TAVIS is a benchmark for egocentric active vision and anticipatory gaze in imitation learning that provides two complementary suites (Head: 5 tasks; Hands: 3 tasks) on GR1T2 and Reachy2 torsos in IsaacLab, plus ~2,200 demonstrations. It introduces a paired headcam vs fixedcam protocol, the GALT anticipatory-gaze metric, and ID/OOD splits; baselines show selective benefits from active gaze, brittleness of multi-task policies under shift, and human-like anticipatory gaze emerging from imitation.

Why it matters

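Active vision, where a policy controls its own gaze during manipulation, has shown benefits in several independent systems, but there has been no shared benchmark for comparing approaches or for quantifying what active gaze contributes, on which task types, and under what conditions. TAVIS (released 2026-05-08) addresses this with two task suites, TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras), on two humanoid torso embodiments (GR1T2, Reachy2) built on IsaacLab. The demonstrations (LeRobot v3.0 format, ~2,200 episodes), code, evaluation scripts, and trained baselines are available on GitHub and Hugging Face.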

Key details

  • Evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a new metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural in-distribution/out-of-distribution (ID/OOD) splits.
  • Baselines (Diffusion Policy and π_0) show that active vision generally improves imitation performance but benefits are task-conditional; multi-task policies suffer sharp degradation under controlled distribution shifts; and imitation alone yields anticipatory gaze with median lead times comparable to the human teleoperator reference.
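
To make the lead-time idea concrete, the sketch below computes a gaze-action lead time per episode and a median across episodes. It is an illustration under stated assumptions, not the GALT definition from the paper, which this summary does not spell out. It assumes each episode provides per-frame timestamps, a per-frame flag saying whether the gaze is on the target, and a single action-onset time; lead time is taken as the onset time minus the most recent gaze arrival at or before onset. The function names and input layout are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence, Tuple

Episode = Tuple[Sequence[float], Sequence[bool], float]


def gaze_action_lead_time(
    timestamps: Sequence[float],
    gaze_on_target: Sequence[bool],
    action_onset: float,
) -> Optional[float]:
    """Seconds by which gaze arrival on the target precedes action onset.

    Assumed definition (not taken from the paper): action_onset minus the
    time of the most recent off-target -> on-target transition that occurs
    at or before action_onset. Returns None if the gaze never reaches the
    target before the action starts.
    """
    if len(timestamps) != len(gaze_on_target):
        raise ValueError("timestamps and gaze_on_target must have equal length")

    arrival = None
    previously_on_target = False
    for t, on_target in zip(timestamps, gaze_on_target):
        if t > action_onset:
            break  # only gaze events up to the action onset count
        if on_target and not previously_on_target:
            arrival = t  # a new gaze arrival on the target
        previously_on_target = on_target

    if arrival is None:
        return None
    return action_onset - arrival


def median_lead_time(episodes: Sequence[Episode]) -> Optional[float]:
    """Median lead time over episodes where a lead time is defined."""
    lead_times = [gaze_action_lead_time(ts, gaze, onset) for ts, gaze, onset in episodes]
    defined = [lt for lt in lead_times if lt is not None]
    return median(defined) if defined else None
```

Under this convention a positive value means the gaze reached the target before the action began, so the same function can be applied to policy rollouts and to teleoperator demonstrations to compare medians.
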

Source evidence

Abstract

Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $π_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.
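
As an illustration of how a paired headcam-vs-fixedcam comparison could be summarized, the sketch below aggregates per-task success rates for two policies and reports the difference. It assumes both policies are evaluated on the same set of trials, so each trial yields one outcome per camera condition; the abstract states only that the two conditions share identical demonstrations, so the evaluation pairing, the `PairedTrial` record, and the task names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class PairedTrial(NamedTuple):
    """One evaluation trial run under both camera conditions (hypothetical record)."""
    task: str
    headcam_success: bool
    fixedcam_success: bool


def paired_summary(trials: Iterable[PairedTrial]) -> Dict[str, Dict[str, float]]:
    """Per-task success rates, their difference, and discordant-trial counts."""
    counts = defaultdict(
        lambda: {"n": 0, "head": 0, "fixed": 0, "head_only": 0, "fixed_only": 0}
    )
    for trial in trials:
        c = counts[trial.task]
        c["n"] += 1
        c["head"] += int(trial.headcam_success)
        c["fixed"] += int(trial.fixedcam_success)
        # Discordant trials: exactly one of the two conditions succeeded.
        c["head_only"] += int(trial.headcam_success and not trial.fixedcam_success)
        c["fixed_only"] += int(trial.fixedcam_success and not trial.headcam_success)

    summary = {}
    for task, c in counts.items():
        head_rate = c["head"] / c["n"]
        fixed_rate = c["fixed"] / c["n"]
        summary[task] = {
            "headcam_rate": head_rate,
            "fixedcam_rate": fixed_rate,
            "delta": head_rate - fixed_rate,  # > 0 means active gaze helped
            "head_only": c["head_only"],
            "fixed_only": c["fixed_only"],
        }
    return summary
```

Reporting the difference per task rather than pooled across tasks matches the abstract's point that the benefit of active vision is task-conditional rather than uniform.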