ArXiv

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Authors
Hanchao Liu, Fang-Lue Zhang, Shining Zhang...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.08054v1
PDF
https://arxiv.org/pdf/2605.08054v1

Brief

The paper addresses zero-shot human motion generation under highly challenging spatiotemporal constraints by augmenting training-free diffusion noise optimization with retrieval guidance. It uses LLM-powered relational task parsing to group the task constraints, retrieves reference motions for the hardest ones, and builds a reward-guided mask that blends retrieved noise with random noise to give the diffusion noise a better initialization. The authors report that optimizing from this initialization solves highly constrained tasks on which prior methods fail. The paper was accepted to CVPR 2026 (arXiv 2026-05-08).
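To make "diffusion noise optimization" concrete, here is a minimal sketch of the general idea: keep the generator frozen and run gradient descent on the input noise until the output satisfies a goal function. Everything in it is an assumption for illustration, not the paper's implementation: `toy_generator` is a hypothetical linear stand-in for a differentiable diffusion sampler, `goal_loss` is a hypothetical squared-distance goal, and the step count and learning rate are arbitrary.

```python
import numpy as np

def toy_generator(z, W):
    # Hypothetical stand-in for a frozen, differentiable diffusion sampler:
    # maps noise z to an "output motion" x. A real sampler is non-linear.
    return W @ z

def goal_loss(x, target):
    # Hypothetical goal function: squared distance between output and target.
    return float(np.sum((x - target) ** 2))

def optimize_noise(z_init, W, target, steps=100, lr=0.1):
    # Gradient descent on the noise only; the generator weights W never change.
    z = np.array(z_init, dtype=float)  # copy so the caller's array is untouched
    for _ in range(steps):
        residual = toy_generator(z, W) - target
        grad = 2.0 * W.T @ residual  # chain rule: d(loss)/dz for the linear toy
        z -= lr * grad
    return z

W = np.array([[1.0, 0.2], [0.0, 0.8]])
target = np.array([1.0, -0.5])
z0 = np.zeros(2)

z_opt = optimize_noise(z0, W, target)
print(goal_loss(toy_generator(z0, W), target))     # loss at the initial noise
print(goal_loss(toy_generator(z_opt, W), target))  # loss after optimization
```

In this toy setting the loss is convex, so any starting point converges. With a real diffusion sampler the objective is non-convex, which is presumably why the paper puts its effort into a better initialization.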

Why it matters

The paper proposes a retrieval-guided diffusion noise optimization method that mixes retrieved noise with random noise via a reward-guided mask, giving diffusion sampling a better starting point. The authors claim this lets generation satisfy highly constrained spatiotemporal goals, such as severe spatial obstacles or a specified number of walking steps.
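The blending step can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation: the summary does not say how the mask is computed, so the per-frame `reward` scores, the `threshold`, and the binary per-frame mask are all hypothetical choices, as are the function and variable names.

```python
import numpy as np

def blend_noise(retrieved_noise, random_noise, reward, threshold=0.5):
    """Pick retrieved noise on frames whose reward passes the threshold,
    and random noise everywhere else.

    retrieved_noise, random_noise: arrays of shape (frames, features)
    reward: array of shape (frames,), hypothetical per-frame usefulness score
    """
    mask = reward >= threshold  # boolean mask, shape (frames,)
    # Broadcast the per-frame mask over the feature axis.
    blended = np.where(mask[:, None], retrieved_noise, random_noise)
    return blended, mask

rng = np.random.default_rng(0)
frames, features = 6, 4
retrieved = rng.standard_normal((frames, features))  # stands in for noise from a retrieved motion
random_part = rng.standard_normal((frames, features))
reward = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.0])    # made-up scores for the demo

init_noise, mask = blend_noise(retrieved, random_part, reward)
print(mask)
```

A binary mask copies each entry from exactly one source. A soft weighted average of two independent unit-variance noise tensors would instead shrink the variance below one, moving the initialization away from the distribution a diffusion sampler expects.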

Key details

  • Adds relational task parsing (using an LLM) to identify the hardest constraints and select retrieval references.
  • The framework is training-free.
  • Released on arXiv 2026-05-08 (authors: Hanchao Liu et al.); accepted to CVPR 2026.

Source evidence

Abstract

Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.

Comment: Accepted to CVPR2026