Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Ming Yin, Yu-Xiang Wang
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy μ. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the information-theoretical lower bounds based on different data-coverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matchesO(H∑h=1∑sh,ahdπ⋆h(sh,ah)√VarPsh,ah(V⋆h+1+rh)dμh(sh,ah)√1n).We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that dμh(sh,ah)>0 if dπ⋆h(sh,ah)>0. Here π⋆ is a optimal policy, μ is the behavior policy and d(sh,ah) is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on μ) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.