Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Ahmadreza Moradipari, Mohammad Pedramfar, Modjtaba Shokrian Zini, Vaneet Aggarwal
In this paper, we prove state-of-the-art Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We present a refined analysis of the information ratio, and show an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time-inhomogeneous reinforcement learning problem, where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1$-dimension of the space of environments. We then derive concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear, and finite mixtures, and discuss how our results improve the state of the art.
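To make the setting concrete, below is a minimal, hypothetical sketch of posterior-sampling (Thompson Sampling) reinforcement learning in a tabular, episodic MDP, the generic algorithm family whose Bayesian regret the paper analyzes. The conjugate priors, the `env.reset()`/`env.step()` interface, and all names are illustrative assumptions, not the paper's construction.

```python
# Hypothetical sketch of Thompson Sampling (posterior sampling) for episodic,
# time-inhomogeneous tabular RL. Priors and environment API are assumptions.
import numpy as np

def psrl(env, S, A, H, num_episodes, rng=np.random.default_rng(0)):
    # Dirichlet prior over transitions and Beta prior over Bernoulli rewards,
    # one per (state, action) pair; these conjugate choices are assumptions.
    trans_counts = np.ones((S, A, S))          # Dirichlet pseudo-counts
    rew_counts = np.ones((S, A, 2))            # Beta(successes, failures)

    for _ in range(num_episodes):
        # 1. Sample an environment from the current posterior.
        P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                      for s in range(S)])      # shape (S, A, S)
        R = rng.beta(rew_counts[..., 0], rew_counts[..., 1])  # shape (S, A)

        # 2. Solve the sampled MDP by backward induction over the horizon H
        #    (time-inhomogeneous: one greedy decision rule per step h).
        V = np.zeros(S)
        policy = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            Q = R + P @ V                      # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)

        # 3. Run the greedy policy for one episode and update the posterior counts.
        s = env.reset()
        for h in range(H):
            a = policy[h][s]
            s_next, r = env.step(a)            # reward r assumed Bernoulli in {0, 1}
            trans_counts[s, a, s_next] += 1
            rew_counts[s, a, 0] += r
            rew_counts[s, a, 1] += 1 - r
            s = s_next
    return trans_counts, rew_counts
```

The regret bound quoted in the abstract concerns the expected gap, under the prior, between this kind of posterior-sampling policy and the optimal policy of the true environment, accumulated over $T$ episodes.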