ArXiv

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

2026-05-12 · 14:15 UTC · Yajie Li, Bozhou Zhang, Chun Gu...

Authors: Yajie Li, Bozhou Zhang, Chun Gu...
Categories: cs.RO, cs.CV
arXiv: https://arxiv.org/abs/2605.12167v1
PDF: https://arxiv.org/pdf/2605.12167v1

Brief

MoLA (Mixture of Latent Actions) targets the gap between video-based imagination and actionable control. Instead of feeding predicted frames to a policy or decoding videos directly into controls, it uses pretrained, modality-aware inverse-dynamics models (semantic, depth, flow) to infer a mixture of latent actions, which serves as a physically grounded action interface. The abstract reports consistent improvements in task success, temporal consistency, and generalization on LIBERO, CALVIN, LIBERO-Plus, and real robots. This summary is based on the abstract only; the full paper was not reviewed.

Why it matters

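According to the abstract, policies conditioned on predicted frames and actions decoded directly from generated video both suffer from a mismatch between visual realism and control relevance: generated frames emphasize perceptual fidelity rather than the action-centric causes of state transitions, which leads to indirect and unstable control. MoLA (Mixture of Latent Actions), proposed 2026-05-12 by Yajie Li et al., addresses this by converting imagined future videos into executable control representations. A mixture of pretrained, modality-aware inverse-dynamics models infers latent actions from the predicted visual transitions, drawing on complementary semantic, depth, and flow cues.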
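
To make the data flow concrete, here is a minimal sketch of the interface described above: one inverse-dynamics model per modality maps each pair of consecutive imagined frames to a latent action, and the per-modality latents are combined into a single representation. Everything in it is an assumption for illustration, not the paper's implementation: the names make_toy_idm and mixture_of_latent_actions are hypothetical, the toy models are simple feature differences rather than pretrained networks, and concatenation stands in for whatever combination rule the authors actually use.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a frame is a dict of per-modality feature vectors.
Frame = Dict[str, List[float]]
LatentAction = List[float]
InverseDynamics = Callable[[Frame, Frame], LatentAction]


def make_toy_idm(modality: str) -> InverseDynamics:
    """Toy stand-in for a pretrained inverse-dynamics model of one modality.

    It returns the element-wise feature difference between two consecutive
    frames; a real model would be a learned network.
    """
    def idm(prev: Frame, nxt: Frame) -> LatentAction:
        return [b - a for a, b in zip(prev[modality], nxt[modality])]
    return idm


def mixture_of_latent_actions(
    frames: Sequence[Frame], idms: Dict[str, InverseDynamics]
) -> List[LatentAction]:
    """Infer one combined latent action per consecutive frame pair.

    Assumption: per-modality latents are concatenated in sorted key order.
    """
    actions: List[LatentAction] = []
    for prev, nxt in zip(frames, frames[1:]):
        combined: LatentAction = []
        for name in sorted(idms):
            combined.extend(idms[name](prev, nxt))
        actions.append(combined)
    return actions


# Three imagined frames with made-up features for each modality.
idms = {m: make_toy_idm(m) for m in ("semantic", "depth", "flow")}
imagined = [
    {"semantic": [0.0, 1.0], "depth": [2.0], "flow": [0.0, 0.0]},
    {"semantic": [0.5, 1.0], "depth": [1.5], "flow": [1.0, 0.0]},
    {"semantic": [1.0, 1.0], "depth": [1.0], "flow": [1.0, 2.0]},
]
latent_actions = mixture_of_latent_actions(imagined, idms)
print(latent_actions)
```

In this sketch a downstream policy would consume latent_actions instead of the imagined frames themselves, which is the separation between imagination and control that the abstract describes.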

Key details

The method was evaluated on the simulated benchmarks LIBERO, CALVIN, and LIBERO-Plus and on real-world robot manipulation tasks. The abstract reports consistent gains in task success, temporal consistency, and generalization but gives no specific numbers. The arXiv comment field lists the work as ICML 2026.

Source evidence

Abstract

Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.

Comment: ICML 2026