ArXiv

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Authors
Wenxuan Song, Han Zhao, Fuhao Li...
Categories
cs.CV, cs.RO
arXiv
https://arxiv.org/abs/2605.10903v1
PDF
https://arxiv.org/pdf/2605.10903v1

Brief

CapVector decouples auxiliary-objective finetuning into two parameter-space components: general capability enhancement and task-specific action fitting. A model is finetuned to convergence on a small-scale task set under two different training strategies, and the parameter difference between the two resulting models is taken as a capability vector. Merging that vector into a pretrained VLA model, then running standard SFT with a lightweight orthogonal regularization loss, yields performance comparable to auxiliary-objective finetuning at lower computational cost. Full text not provided here.

Why it matters

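Standard supervised finetuning (SFT) of pretrained VLA models often fails to improve performance or reduce adaptation cost, while finetuning with auxiliary objectives helps but adds significant computational overhead. CapVector (Wenxuan Song et al., arXiv 2026-05-11) targets the benefits of the latter with the simplicity of the former. It finetunes a model on a small task set with two distinct training strategies and takes the parameter difference between the two results as a transferable "capability vector", separating the general-capability gain from task-specific action fitting.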
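
The extraction and merging steps can be sketched as plain parameter arithmetic. This is a minimal illustration, not the authors' implementation: the subtraction order (auxiliary-finetuned minus standard-SFT), the `scale` factor, the parameter names, and the use of flat Python lists in place of real tensors are all assumptions made here.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened weights.
Params = Dict[str, List[float]]


def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Element-wise difference theta_aux - theta_sft for every shared parameter."""
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
        for name in theta_aux
        if name in theta_sft
    }


def merge(theta_pre: Params, vector: Params, scale: float = 1.0) -> Params:
    """Add the scaled capability vector to the pretrained parameters."""
    merged: Params = {}
    for name, weights in theta_pre.items():
        delta = vector.get(name)
        if delta is None:
            # No capability update for this parameter: copy it unchanged.
            merged[name] = list(weights)
        else:
            merged[name] = [w + scale * d for w, d in zip(weights, delta)]
    return merged


# Toy values (assumed, chosen to be exact in binary floating point).
theta_pre = {"head.weight": [1.0, 2.0], "head.bias": [0.5]}
theta_aux = {"head.weight": [1.5, 2.5], "head.bias": [0.75]}
theta_sft = {"head.weight": [1.25, 2.0], "head.bias": [0.5]}

vector = capability_vector(theta_aux, theta_sft)
theta_meta = merge(theta_pre, vector)
print(vector)
print(theta_meta)
```

The merged parameters `theta_meta` play the role of the "capability-enhanced meta model" from the abstract, which is then finetuned with standard SFT.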

Key details

  • Merging the capability vectors into a pretrained vision-language-action (VLA) model and adding a lightweight orthogonal regularization loss to standard supervised finetuning (SFT) achieves performance comparable to auxiliary-objective finetuning with reduced computational overhead.
  • The authors report that the vectors are effective across diverse models and generalize to novel environments and embodiments out of the box.
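
The summary does not give the form of the orthogonal regularization loss, so the following is only one plausible reading, stated as an assumption: penalize the squared cosine similarity between the SFT update (current parameters minus merged parameters) and the capability vector, so that finetuning moves mostly in directions that leave the merged capability intact. The function name and `eps` are hypothetical.

```python
from typing import List


def orthogonal_penalty(
    theta: List[float],
    theta_meta: List[float],
    vector: List[float],
    eps: float = 1e-12,
) -> float:
    """Squared cosine between the SFT update and the capability vector.

    Assumed form, not taken from the paper: 0.0 when the update is
    orthogonal to the vector, close to 1.0 when it is parallel.
    """
    update = [t - m for t, m in zip(theta, theta_meta)]
    dot = sum(u * v for u, v in zip(update, vector))
    norm_sq = sum(u * u for u in update) * sum(v * v for v in vector)
    # eps guards against division by zero before any update has happened.
    return dot * dot / (norm_sq + eps)


theta_meta = [0.0, 0.0]
vector = [1.0, 0.0]

orthogonal = orthogonal_penalty([0.0, 2.0], theta_meta, vector)
parallel = orthogonal_penalty([3.0, 0.0], theta_meta, vector)
print(orthogonal, parallel)
```

Under this assumption the training loss would be the usual SFT loss plus a small weight times the penalty. The weight is a hyperparameter that this summary does not report.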

Source evidence

Abstract

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters' difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.