ArXiv

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

Authors
Matthew M. Hong, Jesse Zhang, Anusha Nagabandi...
Categories
cs.RO, cs.AI, cs.LG
arXiv
https://arxiv.org/abs/2605.12236v1
PDF
https://arxiv.org/pdf/2605.12236v1

Brief

Context-Smoothed Pre-training (CSP) injects forward-diffusion noise into policy inputs during pre-training, creating a continuum from precise imitation to broad action coverage. Timestep-Modulated Reinforcement Learning (TMRL) then trains the agent to modulate the diffusion timestep during RL fine-tuning, which gives explicit control over exploration. The method works with state, 3D point cloud, and image-based VLA policy inputs, and the authors report real-world manipulation fine-tuning in under one hour. The full paper is on arXiv; videos and code are on the project site (https://weirdlabuw.github.io/tmrl/).

Why it matters

Behavioral cloning (BC) pre-training tends to produce narrow action distributions that lack the coverage RL fine-tuning needs for exploration. Context-Smoothed Pre-training (CSP) and Timestep-Modulated Reinforcement Learning (TMRL) together bridge BC pre-training and RL fine-tuning, and the authors report successful real-world fine-tuning on complex manipulation tasks in under one hour (Hong et al., arXiv 2026-05-12).

Key details

  • TMRL trains the agent to modulate the diffusion timestep during fine-tuning, giving explicit control over exploration.
  • The approach integrates with arbitrary policy inputs: states, 3D point clouds, or image-based VLA policies.
  • The authors show that TMRL improves RL fine-tuning sample efficiency; code and videos are available at the project page.

Source evidence

Abstract

Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.
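
As a rough illustration of the noise-injection idea described in the abstract, the sketch below applies the standard DDPM-style forward-diffusion perturbation to a policy input vector, so that a larger timestep gives a noisier input. This is not the authors' implementation: the linear beta schedule, the 100-step horizon, and the names alpha_bar and noise_context are all assumptions made for illustration, and the paper may use a different schedule or parameterization.

```python
import math
import random


def alpha_bar(t, num_steps=100, beta_min=1e-4, beta_max=0.02):
    """Fraction of signal variance kept after t forward-diffusion steps.

    Assumes a linear beta schedule (an illustrative choice, not from the paper).
    """
    kept = 1.0
    for i in range(t):
        beta = beta_min + (beta_max - beta_min) * i / (num_steps - 1)
        kept *= 1.0 - beta
    return kept


def noise_context(context, t, rng, num_steps=100):
    """Perturb a policy input: sqrt(a) * x + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    a = alpha_bar(t, num_steps)
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in context
    ]


rng = random.Random(0)
state = [0.5, -1.0, 2.0]  # hypothetical low-dimensional state input

clean = noise_context(state, 0, rng)    # t = 0 leaves the input untouched
noisy = noise_context(state, 100, rng)  # largest t gives the noisiest input

for t in (0, 25, 50, 100):
    print(t, round(alpha_bar(t), 3), noise_context(state, t, rng))
```

In this sketch the timestep t is passed in by hand. In the setting the abstract describes, the timestep would instead be chosen by the agent during fine-tuning: small t keeps the policy input close to the clean observation (precise imitation), while large t pushes it toward noise (broader action coverage).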