ArXiv

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Authors
Guohui Zhang, XiaoXiao Ma, Jie Huang...
Categories
cs.CV, cs.AI
arXiv
https://arxiv.org/abs/2605.12480v1
PDF
https://arxiv.org/pdf/2605.12480v1

Brief

OmniNFT (Zhang et al., arXiv 2026-05-12) targets RL fine-tuning for joint audio-video generation. It diagnoses three failure modes (inconsistent advantages across objectives, gradient imbalance between modalities, and uniform credit assignment) and proposes one remedy for each within an online diffusion-RL pipeline: modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting. Experiments on JavisBench and VBench with the LTX-2 backbone report improved per-modality quality, cross-modal alignment, and synchronization. Summary based on the abstract.

Why it matters

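The authors state that RL has not yet been extended to multi-objective, multi-modal joint audio-video generation, and that vanilla RL fine-tuning with a single global advantage often gives suboptimal results in this setting. OmniNFT (Guohui Zhang et al., arXiv 2026-05-12) responds with a modality-aware online diffusion RL framework built on three components: modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting.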

Key details

  • The paper identifies three RL obstacles for joint audio-video generation: (i) multi-objective advantages inconsistency (the advantages of the multimodal outputs do not always agree within a group), (ii) multi-modal gradients imbalance (video-branch gradients leak into shallow audio layers responsible for intra-modal generation), and (iii) uniform credit assignment (fine-grained cross-modal alignment regions are not explored efficiently).
  • Evaluated on JavisBench and VBench using the LTX-2 backbone, OmniNFT is reported to improve audio and video perceptual quality, cross-modal alignment, and audio-video synchronization. The abstract gives no figures; see the paper or project page for details.
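
To make obstacle (i) concrete, the following minimal Python sketch contrasts a single global advantage with per-reward advantages routed to each modality. It assumes GRPO-style group normalization (subtract the group mean, divide by the group standard deviation), which the abstract does not specify. The function names and reward values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def global_advantages(video_rewards, audio_rewards):
    """Baseline: sum the rewards, then compute one advantage per sample."""
    totals = [v + a for v, a in zip(video_rewards, audio_rewards)]
    return group_advantages(totals)


def routed_advantages(video_rewards, audio_rewards):
    """Sketch of routing: each reward is normalized on its own and would
    weight only the loss of its own modality branch."""
    return {
        "video": group_advantages(video_rewards),
        "audio": group_advantages(audio_rewards),
    }


# Hypothetical scores for a group of four samples from the same prompt.
video_rewards = [0.9, 0.2, 0.5, 0.4]
audio_rewards = [0.1, 0.8, 0.5, 0.6]

global_adv = global_advantages(video_rewards, audio_rewards)
routed_adv = routed_advantages(video_rewards, audio_rewards)
```

In this toy group every sample has the same summed reward, so the global advantage is zero everywhere and carries no learning signal, while the routed advantages still mark sample 0 as strong on video and weak on audio.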

Source evidence

Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
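
Components (2) and (3) of the abstract can be sketched in plain Python as well. The sketch below is one illustrative reading of the abstract, not the authors' implementation: gradients are represented as plain lists per layer, the layer names (`audio_shallow_0`, `cross_modal_0`) and importance scores are hypothetical, and the `1 + alpha * importance` weighting is an assumed form of reweighting. In an autograd framework the first step would more likely be done by detaching tensors than by editing gradients by hand.

```python
def gradient_surgery(audio_loss_grads, video_loss_grads, shallow_audio_layers):
    """Combine per-layer gradients, dropping the video-loss contribution
    on layers listed in shallow_audio_layers (assumed to be known)."""
    combined = {}
    for name, audio_grad in audio_loss_grads.items():
        video_grad = video_loss_grads[name]
        if name in shallow_audio_layers:
            # Shallow audio layer: keep only the audio-loss gradient.
            combined[name] = list(audio_grad)
        else:
            # Cross-modal layer: keep both contributions.
            combined[name] = [a + v for a, v in zip(audio_grad, video_grad)]
    return combined


def region_weighted_loss(region_losses, region_importance, alpha=1.0):
    """Weighted mean of per-region losses; regions with higher importance
    (e.g. frames around a sync-critical event) count more."""
    weights = [1.0 + alpha * s for s in region_importance]
    weighted = sum(w * loss for w, loss in zip(weights, region_losses))
    return weighted / sum(weights)


# Hypothetical two-layer example.
audio_loss_grads = {"audio_shallow_0": [0.5, -0.25], "cross_modal_0": [0.125, 0.25]}
video_loss_grads = {"audio_shallow_0": [2.0, 2.0], "cross_modal_0": [1.0, -1.0]}
grads = gradient_surgery(audio_loss_grads, video_loss_grads, {"audio_shallow_0"})

# Hypothetical per-region losses; region 2 is marked as alignment-critical.
losses = [0.5, 0.5, 2.0, 0.5]
uniform = region_weighted_loss(losses, [0.0, 0.0, 0.0, 0.0])
focused = region_weighted_loss(losses, [0.0, 0.0, 1.0, 0.0])
```

Here the shallow audio layer receives only its audio-loss gradient while the cross-modal layer still sees both contributions, and marking region 2 as important raises its share of the loss relative to the uniform mean.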

Comment: Project page: https://zghhui.github.io/OmniNFT/