ArXiv

Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

2026-05-11 · 17:49 UTC · Alex DeWeese, Guannan Qu

Authors: Alex DeWeese, Guannan Qu
Categories: cs.LG, stat.ML
arXiv: https://arxiv.org/abs/2605.10909v1
PDF: https://arxiv.org/pdf/2605.10909v1

Brief

The paper revisits policy gradient methods on restricted policy classes and identifies one-step myopia (the policy gradient improves the policy based only on the one-step Q-function) as an important cause of suboptimal critical points. It proposes a generalized k-step policy gradient that couples randomness within a k-step time window. The method is guaranteed to converge to a solution whose performance is exponentially close in k to that of the optimal deterministic policy. With projected gradient descent or mirror descent it converges at a rate of O(1/T), assuming only smoothness and differentiability of the value function. The bounds avoid the common distribution-mismatch factors, and the authors target applications such as state aggregation and partially observable cooperative multi-agent problems.
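For context, the one-step dependence can be read off the standard policy gradient theorem. The form below is the usual textbook statement for a discounted MDP; the discount factor $\gamma$, start distribution $\mu$, parameterization $\pi_\theta$ and visitation distribution $d_\mu^{\pi_\theta}$ are assumptions made here for illustration and may not match the paper's exact setting or notation.

```latex
\nabla_\theta V^{\pi_\theta}(\mu)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d_\mu^{\pi_\theta}}\,
    \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
```

Each action is scored by $Q^{\pi_\theta}(s, a)$, which evaluates a single deviation at the current step and assumes the current policy is followed afterwards. That is the sense in which the update looks only one step ahead.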

Why it matters

Alex DeWeese and Guannan Qu (arXiv, 2026-05-11) introduce a generalized k-step policy gradient that couples randomness within a k-step time window; they prove that the solution it converges to is exponentially close in performance, with respect to k, to the optimal deterministic policy.

Key details

Projected gradient descent and mirror descent using the k-step policy gradient achieve the exponential guarantee in O(1/T) iterations while assuming only smoothness and differentiability of the value function. The analysis also avoids the distribution-mismatch factors $\|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty$ and $\|d_\mu^{\pi^*} / \mu\|_\infty$.
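As a rough illustration of the two update rules named above, the sketch below applies one projected gradient step and one entropic mirror step to a single state's action distribution. It is a generic sketch, not the paper's algorithm: the gradient values are hypothetical placeholders standing in for whatever gradient estimate is used (the k-step gradient itself is not implemented here), and the function names are invented for this example.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of a list of floats onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, u_i in enumerate(u, start=1):
        cumulative += u_i
        t = (cumulative - 1.0) / i
        if u_i - t > 0:
            theta = t  # the last index passing this check sets the shift
    return [max(x - theta, 0.0) for x in v]


def projected_gradient_step(policy_row, grad_row, step_size):
    """Move along the gradient (ascent), then project back onto the simplex."""
    moved = [p + step_size * g for p, g in zip(policy_row, grad_row)]
    return project_to_simplex(moved)


def mirror_descent_step(policy_row, grad_row, step_size):
    """Entropic mirror step: a multiplicative-weights update, then renormalize.

    Assumes policy_row has positive mass on the highest-gradient action.
    """
    # Subtract the largest exponent for numerical stability.
    m = max(step_size * g for g in grad_row)
    weights = [p * math.exp(step_size * g - m)
               for p, g in zip(policy_row, grad_row)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-action example: the gradient favors the first action.
row = [0.5, 0.5]
grad = [1.0, 0.0]
pgd_row = projected_gradient_step(row, grad, 0.2)
md_row = mirror_descent_step(row, grad, 0.2)
print(pgd_row, md_row)
```

Both updates keep the row a valid probability distribution. The projected step can push an action's probability exactly to zero, while the entropic step keeps every probability that started positive strictly positive.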

Source evidence

Abstract

This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $\|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty$ and $\|d_\mu^{\pi^*} / \mu\|_\infty$, enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.
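To make the distribution mismatch factors concrete, the sketch below evaluates them on a tiny two-state chain. It assumes the common definition of the discounted visitation distribution, d(s) = (1 - gamma) * sum over t of gamma^t * Pr(s_t = s), which the abstract does not spell out. The transition matrices, discount factor and function names are all hypothetical.

```python
def visitation(start, transition, gamma, horizon=2000):
    """Approximate d(s) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s) by truncation.

    start: start-state probabilities (mu).
    transition: row-stochastic matrix P[s][s2] induced by one fixed policy.
    """
    n = len(start)
    state_dist = list(start)
    d = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            d[s] += (1.0 - gamma) * discount * state_dist[s]
        # Push the state distribution one step forward under the policy.
        state_dist = [sum(state_dist[s] * transition[s][s2] for s in range(n))
                      for s2 in range(n)]
        discount *= gamma
    return d


def mismatch(numerator, denominator):
    """Sup-norm of the elementwise ratio; infinite if a needed state has no mass."""
    ratios = []
    for num, den in zip(numerator, denominator):
        if num == 0.0:
            continue  # 0/0 is treated as contributing nothing
        ratios.append(float("inf") if den == 0.0 else num / den)
    return max(ratios, default=0.0)


gamma = 0.9
mu = [1.0, 0.0]                      # always start in state 0
p_current = [[0.9, 0.1], [0.0, 1.0]]  # hypothetical current policy: mostly stays
p_target = [[0.0, 1.0], [0.0, 1.0]]   # hypothetical comparison policy: moves at once

d_current = visitation(mu, p_current, gamma)
d_target = visitation(mu, p_target, gamma)
print(mismatch(d_target, d_current))  # ratio against the current policy's visitation
print(mismatch(d_target, mu))         # ratio against the start distribution
```

In this toy chain the ratio against the start distribution is infinite, because the comparison policy spends time in a state that the start distribution never covers. Bounds that scale with such a factor become vacuous in that case, which appears to be the poor-exploration situation the abstract refers to.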