Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

Authors

Ming Yin, Yu-Xiang Wang

Abstract

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data collected by a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds under different data-coverage assumptions, and their upper bounds are expressed through coverage coefficients that lack an explicit characterization of the system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches
$$O\left(\sum_{h=1}^{H}\sum_{s_h,a_h} d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}\!\left(V^\star_{h+1}+r_h\right)}{d^{\mu}_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right).$$
We also prove an information-theoretic lower bound showing that this quantity is required under the weak assumption that $d^{\mu}_h(s_h,a_h)>0$ whenever $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d_h(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound}, since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
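The bound above pairs the per-state-action variance of $V^\star_{h+1}+r_h$ with the inverse behavior density $1/d^{\mu}_h(s_h,a_h)$. Below is a minimal sketch of how a pessimistic value iteration with a variance-based penalty of that shape might look on a tabular finite-horizon MDP; the array layout, the penalty constant `c`, and the function name `pessimistic_value_iteration` are illustrative assumptions, not the paper's exact APVI construction.

```python
import numpy as np

def pessimistic_value_iteration(P_hat, r, n_count, H, S, A, c=1.0):
    """Illustrative sketch (not the authors' exact APVI).

    P_hat[h][s, a] : estimated next-state distribution (length-S vector)
    r[h][s, a]     : estimated mean reward
    n_count[h][s, a]: number of offline samples observed at (h, s, a)
    """
    V = np.zeros((H + 1, S))          # V[H] = 0 by convention
    pi = np.zeros((H, S), dtype=int)  # greedy policy w.r.t. the pessimistic Q
    for h in range(H - 1, -1, -1):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                mean_next = P_hat[h][s, a] @ V[h + 1]
                # empirical variance of V_{h+1} under the estimated transition;
                # a stand-in for Var_{P_{s_h,a_h}}(V_{h+1} + r_h) in the bound
                var_next = P_hat[h][s, a] @ (V[h + 1] ** 2) - mean_next ** 2
                penalty = c * np.sqrt(max(var_next, 0.0) / max(n_count[h][s, a], 1))
                Q[s, a] = r[h][s, a] + mean_next - penalty  # subtract: pessimism
        Q = np.clip(Q, 0.0, H)  # keep values valid for rewards bounded in [0, 1]
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi, V
```

Subtracting the penalty turns the value estimates into one-sided (lower-confidence) estimates at scarcely visited state-actions, which is the pessimism principle referenced in the title; the $\sqrt{\mathrm{Var}/n(s_h,a_h)}$ scaling of the penalty is what lets such an algorithm adapt to the instance-dependent quantity in the bound.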