ArXiv

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Authors
Gugan Thoppe, L. A. Prashanth, Ankur Naskar...
Categories
cs.LG
arXiv
https://arxiv.org/abs/2605.08053v1
PDF
https://arxiv.org/pdf/2605.08053v1

Brief

This paper studies reinforcement learning for exponential-utility optimization in discounted MDPs. It derives two Q-value-style extensions of the Bellman equation whose operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively, characterizes their fixed points, and proves that the induced greedy stationary policy is optimal among stationary policies. It then presents two model-free algorithms: a two-timescale Q-learning-style method with almost-sure convergence and finite-time rates, and a one-timescale method governed by a sublinear power-law operator, whose convergence is established through local arguments because no global contraction is available. This summary is based on the abstract; the full text is on arXiv.

Why it matters

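Per the abstract, RL for exponential-utility optimization in discounted MDPs has lacked principled value-based algorithms, and the paper addresses this gap in the fixed risk-aversion setting. Its structural results (two Q-value-style Bellman extensions whose operators contract in the $L_\infty$ and sup-log/Thompson metrics, with the induced greedy stationary policy optimal among stationary policies) underpin the two model-free algorithms, and the authors position the work as a foundation for value-based RL under exponential-utility objectives.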
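
To make the two metrics concrete, here is a minimal numerical sketch. It does not reproduce the paper's operators, which this summary does not state. It assumes a commonly used recursive exponential-utility operator, W(s) -> max_a exp(beta r(s,a)) (E[W(s')])^gamma, together with its log-domain counterpart V = log(W)/beta. The toy MDP, the function names, and the values of beta and gamma are all hypothetical. For this assumed operator, the multiplicative form contracts in the sup-log/Thompson metric and the log-domain form contracts in the sup norm, both with modulus gamma.

```python
import math
import random

def thompson(x, y):
    """Sup-log (Thompson) distance between two strictly positive vectors."""
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(x, y))

def sup_norm(x, y):
    """L_infinity distance between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

def make_mdp(n_states, n_actions, rng):
    """Toy data: rewards in [0, 1] and random row-stochastic transitions."""
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    P = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([v / total for v in w])
        P.append(rows)
    return R, P

def T_mult(W, R, P, beta, gamma):
    """Assumed operator (not from the paper): max_a e^{beta r} (E[W(s')])^gamma."""
    return [
        max(math.exp(beta * R[s][a])
            * sum(p * w for p, w in zip(P[s][a], W)) ** gamma
            for a in range(len(R[s])))
        for s in range(len(R))
    ]

def T_log(V, R, P, beta, gamma):
    """The same assumed operator in the log domain, V = log(W) / beta."""
    return [
        max(R[s][a] + (gamma / beta)
            * math.log(sum(p * math.exp(beta * v) for p, v in zip(P[s][a], V)))
            for a in range(len(R[s])))
        for s in range(len(R))
    ]

def worst_ratios(trials=200, seed=0, beta=0.5, gamma=0.9, n_states=4):
    """Largest observed distance ratio d(Tx, Ty) / d(x, y) in each metric."""
    rng = random.Random(seed)
    R, P = make_mdp(n_states, 3, rng)
    worst_t, worst_s = 0.0, 0.0
    for _ in range(trials):
        W1 = [math.exp(rng.uniform(-2, 2)) for _ in range(n_states)]
        W2 = [math.exp(rng.uniform(-2, 2)) for _ in range(n_states)]
        V1 = [math.log(w) / beta for w in W1]
        V2 = [math.log(w) / beta for w in W2]
        rt = thompson(T_mult(W1, R, P, beta, gamma),
                      T_mult(W2, R, P, beta, gamma)) / thompson(W1, W2)
        rs = sup_norm(T_log(V1, R, P, beta, gamma),
                      T_log(V2, R, P, beta, gamma)) / sup_norm(V1, V2)
        worst_t, worst_s = max(worst_t, rt), max(worst_s, rs)
    return worst_t, worst_s
```

Calling `worst_ratios()` returns the largest observed ratio in each metric. Under the stated assumptions, neither should exceed `gamma`, up to floating-point error.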

Key details

  • They propose two model-free algorithms. The first is a two-timescale Q-learning-style method with almost-sure convergence and finite-time convergence rates obtained via timescale separation. The second is a one-timescale algorithm governed by a sublinear power-law operator; it admits no global contraction in standard metrics, so convergence is proved using local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and the finite-time analysis covers only the scalar case.
  • Preprint: Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat; arXiv:2605.08053v1 (cs.LG), published 2026-05-08. It builds on the Bellman-type equation for exponential utility studied by Porteus (1975) and targets value-based RL in the fixed risk-aversion setting.
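
The properties named for the one-timescale method can be seen in a scalar toy problem. The sketch below is a hypothetical illustration, not the paper's algorithm: it runs a stochastic-approximation recursion toward the fixed point of the assumed map f(x) = c x^gamma with 0 < gamma < 1. This map is monotone and homogeneous of degree gamma, but its slope c gamma x^(gamma-1) is unbounded as x approaches 0, so it is only locally Lipschitz on the positive reals. The constants, the step-size schedule, and the noise model are assumptions chosen for illustration.

```python
import random

def scalar_power_law_sa(c=2.0, gamma=0.5, x0=0.1, steps=5000,
                        noise=0.0, seed=0):
    """Iterate x <- x + a_n * (c * x**gamma + noise_n - x) for a scalar x > 0.

    The noiseless fixed point solves x = c * x**gamma,
    i.e. x* = c ** (1 / (1 - gamma)).
    """
    rng = random.Random(seed)
    x = x0
    for n in range(1, steps + 1):
        a = 1.0 / n ** 0.6                        # hypothetical step-size schedule
        target = c * x ** gamma + noise * rng.gauss(0.0, 1.0)
        x = max(x + a * (target - x), 1e-12)      # clamp to keep the iterate positive
    return x

if __name__ == "__main__":
    x_star = 2.0 ** (1.0 / (1.0 - 0.5))           # fixed point for c=2, gamma=0.5
    print("fixed point:", x_star)
    print("noiseless iterate:", scalar_power_law_sa())
    print("noisy iterate:", scalar_power_launch := scalar_power_law_sa(noise=0.1, seed=1))
```

With the defaults (c = 2, gamma = 0.5) the fixed point is 4, and the noiseless iterate settles there. The clamp at a small positive value is a design choice for this sketch only, since x**gamma is undefined over the reals for negative x.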

Source evidence

Abstract

Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.