Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Shinji Ito, Kei Takemura
In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance in both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020] works well. Specifically, an algorithm constructed using this approach achieves $O(d\sqrt{T \log T})$-regret in adversarial environments and $O\!\left(\frac{d^2 \log T}{\Delta_{\min}}\right)$-regret in stochastic environments. Here, $d$, $T$, and $\Delta_{\min}$ denote the dimensionality of the action set, the time horizon, and the minimum sub-optimality gap, respectively. We also show that this algorithm has even better theoretical guarantees for important special cases, including the multi-armed bandit problem and multitask bandits.
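As a point of reference, the bounds above use the standard regret notions for linear bandits; the following display is a minimal sketch under the usual assumptions (an action set $\mathcal{A} \subset \mathbb{R}^d$ and loss vectors $\ell_t$; this notation is assumed here, not taken from the paper itself):
\[
R_T = \max_{x^\ast \in \mathcal{A}} \mathbb{E}\!\left[\sum_{t=1}^{T} \langle \ell_t,\, x_t - x^\ast \rangle\right],
\qquad
\Delta_{\min} = \min_{x \in \mathcal{A}:\, \Delta_x > 0} \Delta_x,
\qquad
\Delta_x = \langle \ell^\ast,\, x - x^\ast \rangle,
\]
where $x_t$ is the action chosen at round $t$ and, in the stochastic case, the losses $\ell_t$ are i.i.d. with mean $\ell^\ast$ and $x^\ast$ denotes an optimal action.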