ArXiv

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Authors
Haoyuan Sun, Jing Wang, Yuxin Song...
Categories
cs.CV
arXiv
https://arxiv.org/abs/2605.10937v1
PDF
https://arxiv.org/pdf/2605.10937v1

Brief

Reinforcement-learning post-training of text-to-image models is prone to reward hacking, and the abstract argues that advantage normalization can miscalibrate updates. The authors propose Super-Linear Advantage Shaping (SLAS), which extends the Fisher–Rao information metric with advantage-dependent weighting to amplify informative updates and suppress illusory gradients, and which adds batch-level normalization to stabilize training under varying reward scales. The reported evaluations show consistent gains over DanceGRPO: faster training, better out-of-domain performance on GenEval and UniGenBench++, and greater robustness to model scaling, with less reward hacking and semantic and compositional fidelity preserved.

Why it matters

Introduces Super-Linear Advantage Shaping (SLAS): an advantage-dependent extension of the Fisher–Rao information metric that reshapes the local policy geometry, relaxing constraints along high-advantage directions to amplify informative updates and tightening them in low-advantage regions to suppress illusory gradients. Batch-level normalization is added to stabilize training under varying reward scales.
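
To make the shaping idea concrete, here is a minimal Python sketch. The abstract does not state the SLAS formula, so the signed power transform below (the function `superlinear_shape` and its `power` parameter are hypothetical names) is only an illustrative assumption suggested by the term "super-linear", not the authors' method.

```python
import numpy as np

def superlinear_shape(adv, power=2.0):
    """Hypothetical shaping rule: sign(A) * |A|**power.

    power=1.0 leaves the advantage unchanged (the linear case);
    power>1.0 grows large |A| faster than small |A|.
    This form is an assumption for illustration only.
    """
    adv = np.asarray(adv, dtype=np.float64)
    return np.sign(adv) * np.abs(adv) ** power

# Two small (possibly noisy) advantages and two large ones.
adv = np.array([-2.0, -0.1, 0.1, 2.0])
shaped = superlinear_shape(adv, power=2.0)

# Ratio between the largest and smallest magnitudes, before and after shaping.
linear_ratio = np.abs(adv).max() / np.abs(adv).min()
shaped_ratio = np.abs(shaped).max() / np.abs(shaped).min()
print(shaped, linear_ratio, shaped_ratio)
```

Under this assumed rule, magnitudes below 1 shrink and magnitudes above 1 grow, so the effect depends on the scale of the advantages. That is one plausible reason a normalization step would be paired with the shaping.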

Key details

  • SLAS consistently outperforms the DanceGRPO baseline across multiple backbones and benchmarks, delivering faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, greater robustness to model scaling, and reduced reward hacking while preserving semantic and compositional fidelity.
  • The paper finds that simply removing the prompt-level standard-deviation term yields an optimal policy ascent direction that is linear in the advantage, which still limits how well genuine signal can be separated from noise; this motivates the super-linear geometric reshaping in SLAS.
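
The miscalibration point can be illustrated with a small Python sketch. The first function is the usual group-relative advantage (per-prompt mean and standard deviation). The second, `batch_normalized_advantages`, is an assumed form of batch-level normalization (per-prompt mean-centering followed by one standard deviation over the whole batch); the abstract does not specify the exact computation, and the reward values are made up for illustration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages with per-prompt mean and std.

    rewards has shape (num_prompts, group_size).
    """
    r = np.asarray(rewards, dtype=np.float64)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)

def batch_normalized_advantages(rewards, eps=1e-8):
    """Assumed variant: center per prompt, scale by one batch-wide std."""
    r = np.asarray(rewards, dtype=np.float64)
    centered = r - r.mean(axis=1, keepdims=True)
    return centered / (centered.std() + eps)

# Illustrative rewards only (not from the paper).
rewards = np.array([
    [0.50, 0.51, 0.49, 0.50],  # near-identical samples: differences are mostly noise
    [0.10, 0.90, 0.20, 0.80],  # clearly separated samples
])

a_prompt = grpo_advantages(rewards)
a_batch = batch_normalized_advantages(rewards)

# Largest advantage magnitude per prompt under each scheme.
print(np.abs(a_prompt).max(axis=1))
print(np.abs(a_batch).max(axis=1))
```

With per-prompt scaling, the near-identical group ends up with advantages as large as the well-separated group, because dividing by a tiny standard deviation inflates small differences. With the assumed batch-wide scale, the first group's advantages stay small relative to the second's.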

Source evidence

Abstract

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.