ArXiv

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Authors
Xiaosong Jia, Bowen Yang, Zuhao Ge...
Categories
cs.RO
arXiv
https://arxiv.org/abs/2605.12369v1
PDF
https://arxiv.org/pdf/2605.12369v1

Brief

GuidedVLA guides Vision-Language-Action (VLA) models by supervising individual attention heads with manually defined auxiliary signals, rather than relying on implicit end-to-end learning. The paper instantiates three specialized heads (object grounding, spatial geometry, temporal skill logic) and reports higher success rates on simulated and real-robot tasks than strong VLA baselines. This summary is based on the abstract only; the full text was not available.

Why it matters

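According to the abstract, VLAs trained only with end-to-end supervision often overfit to spurious correlations such as visual shortcuts or environmental noise, which limits their generalization. GuidedVLA (paper posted 2026-05-12; accepted to RSS 2026) instead treats the action decoder as an assembly of functional components and supervises individual attention heads with manually defined auxiliary signals, so that action generation focuses on task-relevant factors.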

Key details

  • The authors instantiate three specialized attention heads — object grounding, spatial geometry, and temporal skill logic — and report improved success rates in both in-domain and out-of-domain simulation and real-robot experiments compared to strong VLA baselines.
  • Evaluation shows the quality of these specialized factors correlates positively with task performance and that the method yields decoupled, high-quality features, suggesting explicit guidance of action-decoder learning improves robustness and generalization.
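
The abstract does not specify how the auxiliary supervision is computed, so the following is only a minimal sketch of one plausible way to supervise a single attention head. It assumes the auxiliary signal is a binary mask over input tokens (for example, image patches covering the target object) and that the head is penalized with a cross-entropy term added to the action loss. The function names, the mask format, and the loss weight are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_guidance_loss(attn_scores, target_mask, eps=1e-8):
    """Cross-entropy between a normalized target mask and one head's attention.

    attn_scores: raw (pre-softmax) scores of ONE attention head over N tokens.
    target_mask: 0/1 list marking task-relevant tokens. This is a hypothetical
                 auxiliary signal; the paper's actual signals may differ.
    """
    attn = softmax(attn_scores)
    mass = sum(target_mask)
    if mass == 0:
        return 0.0  # no supervision available for this sample
    target = [t / mass for t in target_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))


def total_loss(action_loss, aux_losses, weight=0.1):
    """Action loss plus a weighted sum of per-head auxiliary losses.

    The weight 0.1 is an arbitrary illustrative value.
    """
    return action_loss + weight * sum(aux_losses)


# Toy example: 4 tokens, of which tokens 1 and 2 are task-relevant.
mask = [0, 1, 1, 0]
focused = head_guidance_loss([0.0, 4.0, 4.0, 0.0], mask)  # attends to the mask
diffuse = head_guidance_loss([0.0, 0.0, 0.0, 0.0], mask)  # uniform attention
print(focused < diffuse)  # the focused head incurs the smaller penalty
```

Under these assumptions, one such term per specialized head (object grounding, spatial geometry, temporal skill logic) would be passed to `total_loss`, leaving the remaining heads unsupervised.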

Source evidence

Abstract

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.

Comment: Accepted to RSS 2026. Project page: https://guidedvla.github.io/project_page/