ArXiv

Normalizing Trajectory Models

Authors
Jiatao Gu, Tianrong Chen, Ying Shen...
Categories
cs.CV, cs.LG
arXiv
https://arxiv.org/abs/2605.08078v1
PDF
https://arxiv.org/pdf/2605.08078v1

Brief

Normalizing Trajectory Models (NTM) address the breakdown of the Gaussian denoising assumption when diffusion sampling is compressed to a few coarse steps. NTM models each reverse step as a conditional normalizing flow trained with exact likelihood. The design combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, and it can be trained from scratch or initialized from pretrained flow-matching models. According to the abstract, the exact trajectory likelihood also enables self-distillation, which yields high-quality text-to-image samples in four steps while retaining exact likelihood.
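
As a rough illustration of what "exact trajectory likelihood" can mean for a chain of conditional flows, the generic change-of-variables form is written out below. The notation is an assumption for this sketch and is not taken from the paper: x_T is the noisy endpoint, x_0 the final sample, and f_θ is a map that is invertible in its first argument and sends x_{t-1} to a standard-normal latent given x_t. The paper's own formulation may differ.

```latex
% Assumed notation (hypothetical, not from the paper):
%   z_t = f_\theta(x_{t-1}; x_t, t), with z_t \sim \mathcal{N}(0, I).
\begin{align}
\log p_\theta(x_{t-1} \mid x_t)
  &= \log \mathcal{N}\!\big(f_\theta(x_{t-1}; x_t, t);\, 0, I\big)
   + \log \left| \det \frac{\partial f_\theta(x_{t-1}; x_t, t)}{\partial x_{t-1}} \right|, \\
\log p_\theta(x_{0:T})
  &= \log p(x_T) + \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t).
\end{align}
```

Because each term is an exact density rather than a bound, the sum can be evaluated for any trajectory without approximation.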

Why it matters

Existing few-step methods rely on distillation, consistency training, or adversarial objectives, and they give up the likelihood framework in the process. NTM keeps it: each reverse diffusion step is an expressive conditional normalizing flow optimized with exact trajectory likelihood. The architecture pairs shallow invertible blocks inside each step with a deep parallel predictor, and it can be trained end-to-end from scratch or initialized from pretrained flow-matching models.
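
The following toy sketch shows the general idea of a reverse step built from shallow invertible blocks whose parameters depend on a separate predictor. It is not the authors' architecture: `predictor`, `scale_shift`, the two-dimensional state, and all constants are hypothetical stand-ins chosen so that the example runs with the Python standard library only.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def std_normal_logpdf(v):
    """Log-density of a standard normal evaluated on a tuple of floats."""
    return sum(-0.5 * (x * x + LOG_2PI) for x in v)

def predictor(x_t, t):
    """Hypothetical stand-in for the deep predictor: maps (x_t, t) to a context pair."""
    return (0.8 * x_t[0] + 0.1 * t, 0.8 * x_t[1] - 0.1 * t)

def scale_shift(u, c):
    """Toy conditioner for one coupling block: returns (log_scale, shift)."""
    return 0.3 * math.tanh(u + c), c + 0.5 * math.sin(u)

def flow_forward(x_prev, ctx):
    """Data -> latent through two affine coupling blocks; returns (z, log|det J|)."""
    a, b = x_prev
    s1, m1 = scale_shift(a, ctx[1])
    b = (b - m1) * math.exp(-s1)   # block 1 transforms b given a
    s2, m2 = scale_shift(b, ctx[0])
    a = (a - m2) * math.exp(-s2)   # block 2 transforms a given the new b
    return (a, b), -(s1 + s2)

def flow_inverse(z, ctx):
    """Latent -> data: undo the two coupling blocks in reverse order."""
    a, b = z
    s2, m2 = scale_shift(b, ctx[0])
    a = a * math.exp(s2) + m2      # undo block 2
    s1, m1 = scale_shift(a, ctx[1])
    b = b * math.exp(s1) + m1      # undo block 1
    return (a, b)

def step_log_prob(x_prev, x_t, t):
    """Exact log p(x_{t-1} | x_t) via the change-of-variables formula."""
    z, log_det = flow_forward(x_prev, predictor(x_t, t))
    return std_normal_logpdf(z) + log_det

def sample_trajectory(num_steps=4, seed=0):
    """Draw x_T from a standard normal, then apply num_steps reverse steps."""
    rng = random.Random(seed)
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    traj = [x]
    for t in range(num_steps, 0, -1):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        x = flow_inverse(z, predictor(x, t))
        traj.append(x)
    return traj  # ordered [x_T, ..., x_0]

def trajectory_log_prob(traj):
    """Sum the prior term and every exact per-step conditional log-density."""
    num_steps = len(traj) - 1
    total = std_normal_logpdf(traj[0])
    for i in range(num_steps):
        total += step_log_prob(traj[i + 1], traj[i], num_steps - i)
    return total

if __name__ == "__main__":
    trajectory = sample_trajectory(num_steps=4)
    print("x_0:", trajectory[-1])
    print("trajectory log-likelihood:", trajectory_log_prob(trajectory))
```

Because the coupling blocks use nonlinear conditioners, each step's conditional density is not Gaussian, yet both sampling and likelihood evaluation stay in closed form. In a real model the toy functions would be replaced by learned networks.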

Key details

  • NTM's exact trajectory likelihood enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four sampling steps.
  • On text-to-image benchmarks, NTM matches or outperforms strong image-generation baselines in four sampling steps.

Source evidence

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.