Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Zeyu Jia, Jian Qian, Alexander Rakhlin, Chen-Yu Wei
We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax regret bounds, we show that the eluder dimension delu−a measure of the complexity of the function class−plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of Ω(min is unavoidable when d_{\text{elu}} \leq \sqrt{A T}, where A is the number of actions, T is the total number of rounds, and \Lambda is the total variance over T rounds. For the A\leq d_{\text{elu}} regime, we derive a nearly matching upper bound \tilde{O}( \sqrt{ A\Lambda } + d_{\text{elu} } ) for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of \Omega( \sqrt{ d_{\text{elu}} \Lambda } + d_{\text{elu}} ) is unavoidable when \sqrt{ d_{\text{elu}} \Lambda } + d_{\text{elu}} \leq \sqrt{A T}. In this setting, we provide an upper bound of order \tilde{O}( d_{\text{elu}}\sqrt{ \Lambda } + d_{\text{elu}} ).Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound \tilde{O}(\sqrt{d_{\text{elu}} \Lambda} + d_{\text{elu}}) established in their work is unimprovable when \sqrt{d_{\text{elu}} \Lambda} + d_{\text{elu}}\leq \sqrt{AT}. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of \tilde{O}(\sqrt{A\Lambda} + d_{\text{elu}}).