Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Zihan Zhang, Jiaqi Yang, Xiangyang Ji, Simon S. Du
This paper presents new \emph{variance-aware} confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds:

For linear bandits, we obtain an $\tilde{O}\big(\mathrm{poly}(d)\sqrt{1+\sum_{k=1}^{K}\sigma_k^2}\big)$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the \emph{unknown} variance of the reward at the $k$-th round. This is the first regret bound that scales only with the variance and the dimension, with \emph{no explicit polynomial dependency on $K$}. When variances are small, this bound can be significantly smaller than the $\tilde{\Theta}(d\sqrt{K})$ worst-case regret bound.

For linear mixture MDPs, we obtain an $\tilde{O}\big(\mathrm{poly}(d,\log H)\sqrt{K}\big)$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that scales only \emph{logarithmically} with $H$ in the reinforcement learning with linear function approximation setting, thus \emph{exponentially improving} existing results and resolving an open problem in \citep{zhou2020nearly}.

We develop three technical ideas that may be of independent interest: 1) applications of the peeling technique to both the input norm and the variance magnitude, 2) a recursion-based estimator for the variance, and 3) a new convex potential lemma that generalizes the seminal elliptical potential lemma.
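To make the low-variance claim concrete, here is a minimal sketch in standard notation (ours, for illustration; the paper's formal setup may differ in details). In the stochastic linear bandit model, at round $k$ the learner selects an action $x_k \in \mathcal{X}_k \subseteq \mathbb{R}^d$ and observes a reward $r_k = \langle \theta^\ast, x_k \rangle + \eta_k$ with $\mathbb{E}[\eta_k] = 0$ and $\mathrm{Var}(\eta_k) = \sigma_k^2$, and the regret is
\[
  R_K \;=\; \sum_{k=1}^{K} \Big( \max_{x \in \mathcal{X}_k} \langle \theta^\ast, x \rangle \;-\; \langle \theta^\ast, x_k \rangle \Big).
\]
Under the data-dependent bound above, the benign case $\sigma_k^2 = 0$ for all $k$ (deterministic rewards) yields $R_K = \tilde{O}(\mathrm{poly}(d))$, independent of $K$ up to logarithmic factors, whereas the worst case $\sigma_k^2 = \Theta(1)$ for every round recovers the familiar $\sqrt{K}$ scaling.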