ArXiv

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

Authors
Wen Huang, Haoran Sun, Yongjian Guo...
Categories
cs.RO
arXiv
https://arxiv.org/abs/2605.07794v1
PDF
https://arxiv.org/pdf/2605.07794v1

Brief

NoiseGate reframes per-latent timestep selection in joint video–action World Action Models as a learnable information-gating policy: by adjusting each predicted latent frame's noise level, a Gating Policy Network controls how reliable that frame's Key/Value contribution is for action generation. The approach combines independent per-latent timestep sampling during backbone training with task-reward optimization of the schedule, and the abstract reports consistent gains on RoboTwin random-scene manipulation tasks. This summary is based on the paper's abstract; the full text was not available here.

Why it matters

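NoiseGate (Wen Huang et al., arXiv:2605.07794v1, published 2026-05-08) replaces the single shared timestep t that Mixture-of-Transformers (MoT) World Action Models typically use with a learnable per-latent timestep schedule. Under the noise-as-masking view of Diffusion Forcing, a shared schedule implicitly assumes that every predicted latent frame is equally reliable for action generation; NoiseGate instead treats each frame's noise level as an information-gating decision.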
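
The contrast between a shared timestep and independent per-latent timesteps can be sketched in plain Python. This is only an illustration: the function names, the uniform sampling distribution, and the linear interpolation convention (t = 0 is pure noise, t = 1 is the clean latent) are assumptions, since the abstract does not specify them.

```python
import random

def sample_timesteps(num_latents, per_latent, rng):
    """Return one timestep in [0, 1) per predicted latent frame."""
    if per_latent:
        # Independent draw for every latent frame (assumed uniform).
        return [rng.random() for _ in range(num_latents)]
    # Baseline: one shared scalar t broadcast to all frames.
    t = rng.random()
    return [t] * num_latents

def noise_latents(latents, timesteps, rng):
    """Linearly interpolate each latent frame toward Gaussian noise.

    Assumed convention: t = 0 is pure noise, t = 1 is the clean latent.
    """
    noised = []
    for frame, t in zip(latents, timesteps):
        noised.append([t * x + (1.0 - t) * rng.gauss(0.0, 1.0) for x in frame])
    return noised

rng = random.Random(0)
latents = [[1.0, 2.0, 3.0] for _ in range(4)]  # 4 toy latent frames
shared = sample_timesteps(4, per_latent=False, rng=rng)
independent = sample_timesteps(4, per_latent=True, rng=rng)
noised = noise_latents(latents, independent, rng)
```

With the shared schedule every frame is corrupted to the same degree; with independent draws the backbone sees frames at mixed noise levels during training, which is what would let a schedule policy later choose different levels per frame.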

Key details

  • The method has three components: independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments during denoising, and task-reward optimization that trains the schedule policy without hand-crafted shape priors.
  • Built on a joint video–action MoT backbone, NoiseGate yields consistent gains on diverse RoboTwin random-scene manipulation tasks (reported in the paper's abstract).
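
As a rough illustration of per-latent time increments during denoising, the sketch below advances each latent frame's timestep by its own step size. The logistic "policy" with hand-set `weights` and `bias` is a hypothetical stand-in for the Gating Policy Network, whose real inputs and architecture are not described in the abstract. The update of the latents themselves is omitted, and the convention that t grows from 0 (noise) to 1 (clean) is an assumption.

```python
import math

def gating_policy(timesteps, weights, bias, max_step):
    """Hypothetical stand-in for the Gating Policy Network.

    Maps each latent's current timestep to a positive increment in (0, max_step).
    """
    increments = []
    for f, t in enumerate(timesteps):
        logit = weights[f] * t + bias[f]
        gate = 1.0 / (1.0 + math.exp(-logit))  # squashes to (0, 1)
        increments.append(max_step * gate)
    return increments

def advance_schedule(timesteps, increments):
    """Advance every latent by its own increment, clipping at t = 1 (clean)."""
    return [min(1.0, t + dt) for t, dt in zip(timesteps, increments)]

def rollout_schedule(num_latents, weights, bias, max_step, num_steps):
    """Record the per-latent timesteps visited over a denoising rollout."""
    t = [0.0] * num_latents  # all frames start as pure noise
    history = [t]
    for _ in range(num_steps):
        dt = gating_policy(t, weights, bias, max_step)
        t = advance_schedule(t, dt)
        history.append(t)
    return history

# Three latent frames with different (hand-picked) biases denoise at different rates.
history = rollout_schedule(
    num_latents=3,
    weights=[0.0, 0.0, 0.0],
    bias=[-2.0, 0.0, 2.0],
    max_step=0.25,
    num_steps=8,
)
```

In this toy rollout the frame with the largest bias reaches t = 1 first, while the frame with the smallest bias stays noisy, so it would contribute less reliable Keys and Values to the action tokens.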

Source evidence

Abstract

World Action Models (WAMs) are an emerging family of policies that tie robot action generation to future-observation modeling. In this work, we focus on the joint video–action modeling paradigm, where actions and imagined future observations are co-generated along a shared denoising or flow trajectory, so that perception, prediction, and control are coupled within one generative process. Existing WAMs typically realize this paradigm with a Mixture-of-Transformers (MoT), where video and action tokens interact through shared self-attention. This architecture can in principle assign a separate timestep $t_f$ to each predicted latent frame, yet current systems collapse this degree of freedom onto a single shared scalar $t$. Under the noise-as-masking view of Diffusion Forcing, this shared schedule imposes the unjustified prior that every predicted latent is equally reliable for action generation. We instead view the per-latent schedule as a learnable information-gating policy: by changing a latent frame's noise level, the policy modulates the reliability of its Key/Value contribution to the action tokens. We propose NoiseGate, which combines independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments during denoising, and task-reward optimization that trains the schedule policy without hand-crafted shape priors. Built on a joint video–action MoT backbone, NoiseGate delivers consistent gains on diverse RoboTwin random-scene manipulation tasks.
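
The abstract does not say which algorithm implements the task-reward optimization. As one possible reading, the sketch below uses a score-function (REINFORCE) update, which is a swapped-in assumption rather than the paper's stated method. The Gaussian policy over per-latent increments, the `toy_reward` function, and all hyperparameters are hypothetical; a real system would obtain the reward from task success in the simulator.

```python
import random

def toy_reward(increments, target):
    """Hypothetical reward: higher when increments are close to a hidden target."""
    return -sum((d - g) ** 2 for d, g in zip(increments, target))

def train_schedule_means(target, steps=300, batch=32, lr=0.1, sigma=0.1, seed=0):
    """REINFORCE updates on the mean of a Gaussian policy over per-latent increments."""
    rng = random.Random(seed)
    mean = [0.5] * len(target)  # no hand-crafted shape: every latent starts the same
    for _ in range(steps):
        samples = [[rng.gauss(m, sigma) for m in mean] for _ in range(batch)]
        rewards = [toy_reward(s, target) for s in samples]
        baseline = sum(rewards) / batch  # batch-mean baseline reduces variance
        grad = [0.0] * len(mean)
        for s, r in zip(samples, rewards):
            advantage = r - baseline
            for k in range(len(mean)):
                # d/d mean_k of log N(s_k; mean_k, sigma^2) = (s_k - mean_k) / sigma^2
                grad[k] += advantage * (s[k] - mean[k]) / sigma ** 2 / batch
        mean = [m + lr * g for m, g in zip(mean, grad)]
    return mean

target = [0.1, 0.5, 0.9]  # hidden "good" schedule used only by the toy reward
learned = train_schedule_means(target)
```

The point of the sketch is that the schedule is shaped only by the reward signal: the policy starts from identical increments for every latent and drifts toward whatever per-latent pattern the reward favors.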