Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, Yu Bai
Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of *policy finetuning*, that is, online RL where the learner has additional access to a "reference policy" μ close to the optimal policy π⋆ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with S states, A actions, and horizon length H. We first design a sharp *offline reduction* algorithm---which simply executes μ and runs offline policy optimization on the collected dataset---that finds an ε near-optimal policy within Õ(H³SC⋆/ε²) episodes, where C⋆ is the single-policy concentrability coefficient between μ and π⋆. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an Ω(H³S min{C⋆, A}/ε²) sample complexity lower bound for *any* policy finetuning algorithm, including those that can adaptively explore the environment. This implies that---perhaps surprisingly---the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use μ. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where μ only satisfies concentrability partially up to a certain time step. Overall, our results offer a quantitative understanding of the benefit of a good reference policy, and take a step towards bridging offline and online RL.
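To make the offline-reduction idea concrete, below is a minimal Python sketch: roll out the reference policy μ to collect a dataset, then run a pessimistic (lower-confidence-bound) value iteration on the empirical model. The environment interface (`env.reset()`, `env.step(a)`), the policy callable `mu(h, s)`, and the simple `c/sqrt(n)` pessimism bonus are all illustrative assumptions, not the paper's exact algorithm or bonus construction.

```python
import numpy as np

def offline_reduction(env, mu, S, A, H, num_episodes, c=1.0):
    """Sketch of offline reduction: execute mu, then do pessimistic
    value iteration on the collected dataset (assumed interfaces)."""
    # Per-step dataset statistics: visit counts, transition counts, reward sums.
    count = np.zeros((H, S, A))
    trans = np.zeros((H, S, A, S))
    rew_sum = np.zeros((H, S, A))

    # Step 1: collect episodes by executing the reference policy mu.
    for _ in range(num_episodes):
        s = env.reset()
        for h in range(H):
            a = mu(h, s)                      # action from the reference policy
            s_next, r = env.step(a)
            count[h, s, a] += 1
            trans[h, s, a, s_next] += 1
            rew_sum[h, s, a] += r
            s = s_next

    # Step 2: pessimistic (LCB) value iteration on the empirical model.
    V = np.zeros((H + 1, S))                  # V[H] = 0 (terminal values)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                n = max(count[h, s, a], 1)
                r_hat = rew_sum[h, s, a] / n
                p_hat = trans[h, s, a] / n
                bonus = c * np.sqrt(1.0 / n)  # illustrative pessimism penalty
                Q[s, a] = max(r_hat + p_hat @ V[h + 1] - bonus, 0.0)
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return pi                                 # greedy policy w.r.t. pessimistic Q
```

In this sketch, states rarely visited by μ receive a large penalty, so the returned policy stays on the part of the state space that μ covers; the C⋆ coefficient in the bound quantifies how well that coverage captures π⋆.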