ArXiv

Model-based Bootstrap of Controlled Markov Chains

Authors
Ziwei Su, Imon Banerjee, Diego Klabjan
Categories
stat.ML, cs.LG, math.OC, math.ST
arXiv
https://arxiv.org/abs/2605.12410v1
PDF
https://arxiv.org/pdf/2605.12410v1

Brief

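The paper develops a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent policies, and proves distributional consistency in both the single long-chain regime and the episodic offline RL regime. Using a novel bootstrap law of large numbers (LLN) and a martingale central limit theorem (CLT), the authors extend the result to offline policy evaluation (OPE) and optimal policy recovery (OPR) by verifying Hadamard differentiability of the Bellman operators, which yields asymptotically valid confidence intervals (CIs) for value and Q-functions. In RiverSwim experiments, the proposed bootstrap CIs are often close to nominal coverage.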
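
To make the delta-method step concrete, here is a worked equation for the simplest case. The setup is an assumption for illustration, not the paper's notation: a fixed stationary target policy $\pi$, discount factor $\gamma \in (0,1)$, policy-averaged kernel $P_{\pi}$ and reward $r_{\pi}$, and a perturbation direction $H$ of the kernel with policy average $H_{\pi}$.

```latex
V^{\pi}(P) = (I - \gamma P_{\pi})^{-1} r_{\pi},
\qquad
DV^{\pi}_{P}[H] = \gamma \,(I - \gamma P_{\pi})^{-1} H_{\pi}\, V^{\pi}(P).
% Delta method: if \sqrt{n}\,(\hat{P} - P) \Rightarrow Z, then
\sqrt{n}\,\bigl(V^{\pi}(\hat{P}) - V^{\pi}(P)\bigr) \Rightarrow DV^{\pi}_{P}[Z].
```

The derivative follows from differentiating the matrix inverse. Under this kind of differentiability, consistency of the bootstrap for the kernel carries over to the bootstrap value estimate $V^{\pi}(P^{*})$, centered at $V^{\pi}(\hat{P})$.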

Why it matters

In offline RL the behavior policy that generated the data is often unknown and may be nonstationary or history-dependent. The paper introduces a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) that remains distributionally consistent in this setting, in both the single long-chain regime and the episodic offline RL regime. The technical contributions include a bootstrap law of large numbers for visitation counts and a martingale CLT for bootstrap transition increments.
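
The idea of bootstrapping a transition kernel can be sketched as follows. This is a simplified illustration, not the paper's procedure: it conditions on the observed visitation counts and redraws next states from the estimated kernel with a multinomial draw. The function names `estimate_kernel` and `bootstrap_kernel`, the uniform fallback for unvisited pairs, and the toy data are all hypothetical.

```python
import numpy as np

def estimate_kernel(transitions, n_states, n_actions):
    """Return visitation counts N[s, a] and the empirical kernel P_hat[s, a, s']."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2)
    # Assumption: unvisited (s, a) pairs fall back to a uniform row.
    p_hat = np.full_like(counts, 1.0 / n_states)
    seen = visits > 0
    p_hat[seen] = counts[seen] / visits[seen][:, None]
    return visits.astype(int), p_hat

def bootstrap_kernel(visits, p_hat, rng):
    """Draw one bootstrap kernel P* by resampling next states from P_hat."""
    p_star = p_hat.copy()
    for s, a in zip(*np.nonzero(visits)):
        # Redraw N[s, a] next states from the estimated row, then renormalize.
        draws = rng.multinomial(visits[s, a], p_hat[s, a])
        p_star[s, a] = draws / visits[s, a]
    return p_star

# Toy data: (state, action, next_state) triples from an unknown behavior policy.
rng = np.random.default_rng(0)
transitions = [(0, 0, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
visits, p_hat = estimate_kernel(transitions, n_states=2, n_actions=2)
p_star = bootstrap_kernel(visits, p_hat, rng)
```

Repeating the `bootstrap_kernel` draw many times gives an empirical distribution of `p_star - p_hat` that stands in for the sampling distribution of the estimator around the true kernel.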

Key details

  • Extends bootstrap consistency to downstream offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of Bellman operators, yielding asymptotically valid confidence intervals for value and Q-functions.
  • Experiments on RiverSwim show that the proposed bootstrap CIs, especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs and are often close to nominal coverage (50%, 90%, 95%), while the baselines are poorly calibrated at small sample sizes and short episode lengths.
  • Paper details: Ziwei Su, Imon Banerjee, Diego Klabjan; arXiv 2026-05-12; 45 pages, 7 figures, 19 tables.
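
A percentile bootstrap CI for a policy's value can be sketched as below. This is a minimal illustration under stated assumptions rather than the paper's algorithm: a discounted setting with a stationary target policy, known rewards, multinomial resampling conditional on visitation counts, and a uniform fallback for unvisited pairs. The names `policy_value` and `percentile_ci` and the toy counts are hypothetical, and the toy problem is not RiverSwim.

```python
import numpy as np

def policy_value(kernel, rewards, policy, gamma):
    """Solve V = r_pi + gamma * P_pi V for a stationary target policy pi[s, a]."""
    p_pi = np.einsum("sa,sat->st", policy, kernel)
    r_pi = np.einsum("sa,sa->s", policy, rewards)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * p_pi, r_pi)

def percentile_ci(counts, rewards, policy, gamma, level=0.90, n_boot=500, seed=0):
    """Percentile bootstrap CI for V(s) from transition counts[s, a, s']."""
    rng = np.random.default_rng(seed)
    visits = counts.sum(axis=2).astype(int)
    n_states = counts.shape[2]
    # Empirical kernel; unvisited (s, a) rows default to uniform (assumption).
    p_hat = np.where(visits[..., None] > 0,
                     counts / np.maximum(visits, 1)[..., None],
                     1.0 / n_states)
    values = np.empty((n_boot, n_states))
    for b in range(n_boot):
        p_star = p_hat.copy()
        for s, a in zip(*np.nonzero(visits)):
            p_star[s, a] = rng.multinomial(visits[s, a], p_hat[s, a]) / visits[s, a]
        values[b] = policy_value(p_star, rewards, policy, gamma)
    alpha = 1.0 - level
    # Percentile interval: empirical quantiles of the bootstrap values.
    lower, upper = np.quantile(values, [alpha / 2, 1 - alpha / 2], axis=0)
    return policy_value(p_hat, rewards, policy, gamma), lower, upper

# Toy problem: 2 states, 2 actions, target policy always plays action 1.
counts = np.array([[[8, 2], [3, 7]], [[5, 5], [1, 9]]])
rewards = np.array([[0.0, 0.0], [0.0, 1.0]])
policy = np.array([[0.0, 1.0], [0.0, 1.0]])
v_hat, lower, upper = percentile_ci(counts, rewards, policy, gamma=0.9)
```

The percentile interval reads the CI endpoints directly off the bootstrap values, so it needs no variance estimate, unlike a plug-in CLT interval.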

Source evidence

Abstract

We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.

Comment: 45 pages, 7 figures, 19 tables