ArXiv

HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

Authors
Qiuxuan Feng, Jiale Yu, Jiaming Liu...
Categories
cs.RO
arXiv
https://arxiv.org/abs/2605.10942v1
PDF
https://arxiv.org/pdf/2605.10942v1

Brief

The paper presents HarmoWAM, an end-to-end World Action Model (WAM) that unifies predictive and reactive control: a world model supplies spatio-temporal priors that condition a predictive expert and a reactive expert, and a Process-Adaptive Gating Mechanism decides when and where to switch between them. In zero-shot evaluation on three training-unseen test environments across six real-world manipulation tasks, it reportedly outperforms prior VLA models and WAMs by margins of 33% and 29%, respectively. Full paper text not available (abstract only).

Why it matters

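Existing WAMs trade generalizable transit against precise interaction. HarmoWAM aims for both in a single end-to-end model: its world model conditions two complementary action experts, a predictive expert that generates actions iteratively from latent dynamics and a reactive expert that infers actions directly from predicted visual evolution. A Process-Adaptive Gating Mechanism automatically determines when and where to switch between them, so that the reactive expert expands the exploration space and the predictive expert handles precise interactions at different stages of a task.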
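
The coordination described above can be sketched as a small routing function. This is a minimal illustration only, not the authors' implementation: the abstract does not say what signal the gate uses or how the experts are built, so `WorldModelPrior`, its fields, the `interaction_score` input, the threshold, and both toy experts are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Action = List[float]


@dataclass
class WorldModelPrior:
    """Hypothetical container for the world model's spatio-temporal priors."""
    latent: Sequence[float]                      # latent dynamics features
    predicted_frames: Sequence[Sequence[float]]  # predicted visual evolution
    interaction_score: float                     # assumed gate input in [0, 1]


def predictive_expert(prior: WorldModelPrior, steps: int = 3) -> Action:
    """Toy iterative generation: move the action estimate halfway toward
    the latent target at every step (an assumption, not the paper's model)."""
    action = [0.0] * len(prior.latent)
    for _ in range(steps):
        action = [a + 0.5 * (z - a) for a, z in zip(action, prior.latent)]
    return action


def reactive_expert(prior: WorldModelPrior) -> Action:
    """Toy inverse dynamics: the action is the difference between the
    first two predicted frames (an assumption, not the paper's model)."""
    first, second = prior.predicted_frames[0], prior.predicted_frames[1]
    return [b - a for a, b in zip(first, second)]


def gated_action(prior: WorldModelPrior, threshold: float = 0.5) -> Tuple[str, Action]:
    """Route to the predictive expert when the (assumed) interaction score
    is high, and to the reactive expert otherwise."""
    if prior.interaction_score >= threshold:
        return "predictive", predictive_expert(prior)
    return "reactive", reactive_expert(prior)
```

A hard threshold is the simplest possible gate. The paper's mechanism is described as adaptive and as deciding both the timing and the location of switching, so a learned gate is the more likely design.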

Key details

  • On three training-unseen test environments across six real-world robotic manipulation tasks, covering variations in background, position, and object semantics, HarmoWAM achieves strong zero-shot generalization and outperforms prior state-of-the-art VLA models by a margin of 33% and prior WAMs by a margin of 29%.
  • The authors identify a fundamental trade-off between paradigms: 'Imagine-then-Execute' provides generalizable transit but weaker interaction precision, while 'Joint Modeling' yields fine-grained, temporally coherent actions but is constrained by the training exploration space.
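
As the abstract describes it, the 'Imagine-then-Execute' paradigm uses video prediction to infer actions via inverse dynamics. The toy below illustrates that pipeline on a one-dimensional state. It is a sketch under strong assumptions: the "video" is a list of scalar states, the predictor is linear interpolation, and the dynamics are assumed to be `s_next = s + a`, so the inverse dynamics reduce to a state difference. None of these function names or modelling choices come from the paper.

```python
from typing import List


def imagine_states(start: float, goal: float, horizon: int) -> List[float]:
    """Stand-in for video prediction: linearly interpolate from start to goal."""
    return [start + (goal - start) * k / horizon for k in range(horizon + 1)]


def inverse_dynamics(states: List[float]) -> List[float]:
    """For the assumed dynamics s_next = s + a, recover each action as the
    difference between consecutive imagined states."""
    return [nxt - cur for cur, nxt in zip(states, states[1:])]


def execute(start: float, actions: List[float]) -> float:
    """Roll the recovered actions forward through the same assumed dynamics."""
    state = start
    for a in actions:
        state += a
    return state


imagined = imagine_states(0.0, 1.0, 4)
actions = inverse_dynamics(imagined)
final_state = execute(0.0, actions)
```

In this toy the actions are only as good as the imagined trajectory, which loosely mirrors the reported weakness of the paradigm: transit follows whatever the world model predicts, but nothing refines the actions at the point of contact.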

Source evidence

Abstract

World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the "Imagine-then-Execute" approach, which uses video prediction to infer actions via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations. Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. Specifically, the world model provides spatio-temporal physical priors that condition two complementary action experts: a predictive expert that leverages latent dynamics for iterative action generation, and a reactive expert that directly infers actions from predicted visual evolution. To enable adaptive coordination, a Process-Adaptive Gating Mechanism is proposed to automatically determine the timing and location of switching between them. This allows the world model to drive the reactive expert to expand the exploration space and the predictive expert to perform precise interactions across different stages of a task. For evaluation, we construct three training-unseen test environments across six real-world robotic tasks, covering variations in background, position, and object semantics. Notably, HarmoWAM achieves strong zero-shot generalization across these scenarios, significantly outperforming prior state-of-the-art VLA models and WAMs by margins of 33% and 29%, respectively.