ArXiv

What should post-training optimize? A test-time scaling law perspective

Authors
Muheng Li, Jian Qian, Wenlong Mou
Categories
cs.LG, stat.ML
arXiv
https://arxiv.org/abs/2605.10716v1
PDF
https://arxiv.org/pdf/2605.10716v1

Brief

Large-language-model post-training often optimizes single-response mean reward, creating a mismatch with best-of-N test-time selection. The paper analyzes the budget-mismatch setting (m << N per-prompt rollouts at training time) and shows that, under structural assumptions on the reward tails, the best-of-N policy gradient can be approximated from small rollout groups by extrapolating upper-tail statistics. It introduces a family of Tail-Extrapolated estimators, including TEA and Prefix-TEA, and reports improved best-of-N performance on instruction-following tasks. Summary based on the abstract only.

Why it matters

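Existing test-time-aware objectives typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts. The paper studies this 'budget-mismatch' regime, where training has only m << N rollouts per prompt but deployment uses best-of-N selection. Under structural assumptions on the reward tails, the authors show that the best-of-N policy gradient can be approximated by extrapolating upper-tail statistics from much smaller rollout groups.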
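To make the budget mismatch concrete, here is a minimal Python sketch of the standard order-statistics estimator of best-of-k reward from m sampled rewards. It is not the paper's TEA or Prefix-TEA estimator, which the abstract does not specify; the function name `best_of_k_estimate` and the sample rewards are hypothetical. The estimator averages the maximum over all size-k subsets of the sample, so it is only defined for k <= m. A target of N > m is out of its reach, which is the gap that tail extrapolation is meant to address.

```python
from math import comb


def best_of_k_estimate(rewards, k):
    """Unbiased estimate of E[max of k i.i.d. rewards] from m >= k samples.

    Equals the average of max(S) over all size-k subsets S of the sample,
    computed in closed form from the sorted rewards.
    """
    m = len(rewards)
    if not 1 <= k <= m:
        # No size-k subset exists when k > m: the budget-mismatch case.
        raise ValueError("need 1 <= k <= m; k > m requires extrapolation")
    xs = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the C(m, k) size-k subsets.
    weighted = sum(comb(i - 1, k - 1) * xs[i - 1] for i in range(k, m + 1))
    return weighted / comb(m, k)


if __name__ == "__main__":
    sample = [0.1, 0.5, 0.3, 0.9]  # hypothetical rewards for one prompt, m = 4
    for k in (1, 2, 4):
        print(k, best_of_k_estimate(sample, k))
```

With k = 1 the estimate reduces to the sample mean, and with k = m it is the sample maximum. Any k beyond m raises an error rather than silently guessing.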

Key details

  • Proposes a family of Tail-Extrapolated estimators (a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA based on moment cancellation) and reports that TEA and Prefix-TEA improve best-of-N performance on instruction-following tasks across different language models, reward models, and datasets, under various training and test-time budget settings.
  • Paper: Muheng Li, Jian Qian, Wenlong Mou; arXiv:2605.10716v1 (posted 2026-05-11). PDF available at https://arxiv.org/pdf/2605.10716v1.

Source evidence

Abstract

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
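The mismatch between mean reward and best-of-$N$ reward described in the abstract can be illustrated with a toy calculation. The Python sketch below computes the exact expected best-of-n reward for a finite discrete reward distribution and compares two made-up policies. The policies, their reward values, and the function name `expected_best_of_n` are assumptions chosen for illustration and do not come from the paper.

```python
def expected_best_of_n(values, probs, n):
    """Exact E[max of n i.i.d. draws] for a finite discrete reward distribution."""
    cdf_prev = 0.0
    total = 0.0
    for value, prob in sorted(zip(values, probs)):
        cdf = cdf_prev + prob
        # P(max == value) = F(value)^n - F(value-)^n
        total += value * (cdf ** n - cdf_prev ** n)
        cdf_prev = cdf
    return total


# Two hypothetical policies for a single prompt.
steady = ([0.6], [1.0])           # always earns reward 0.6
spiky = ([0.0, 1.0], [0.5, 0.5])  # reward 1.0 half the time, else 0.0

if __name__ == "__main__":
    for n in (1, 8):
        print(
            n,
            expected_best_of_n(*steady, n),
            expected_best_of_n(*spiky, n),
        )
```

At n = 1 the steady policy has the higher mean reward (0.6 against 0.5). At n = 8 the spiky policy is far ahead, because its best-of-8 reward is 1 - 0.5^8. An objective that only sees the mean would prefer the policy that performs worse under best-of-$N$ deployment.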