Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
David Simchi-Levi, Zeyu Zheng, Feng Zhu
We consider the stochastic multi-armed bandit problem and fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of expected regret exactly affects the decay rate of the regret tail probability for both the worst-case and instance-dependent scenarios. A novel policy is proposed to achieve the optimal regret tail risk for any regret threshold. Concretely, for any given α ∈ [1/2, 1) and β ∈ [0, 1), our policy achieves a worst-case expected regret of Õ(T^α) and an instance-dependent expected regret of Õ(T^β), while enjoying a probability of incurring an Ω(T^δ) regret that decays exponentially in a polynomial of T. This decay rate is proved to be the best achievable. We also generalize our analysis to the stochastic multi-armed bandit problem with non-stationary baseline rewards, where in each time period t the decision maker pulls one of K arms and collects a reward that is the sum of three terms: the mean of the pulled arm, an independent noise, and a non-stationary baseline reward that is a function of t. Our results reveal insights on the trade-off between expected regret and tail risk for both the worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave room for more light-tailed risk of incurring a large regret.
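As a reading aid, here is one hedged formalization of the stated guarantees. Here R_T denotes the cumulative (pseudo-)regret of the policy over horizon T, c is an absolute constant, and ε(α, β, δ) is a placeholder for the exact polynomial rate established in the paper's theorems, which is not reproduced here.

```latex
% Expected-regret guarantees, as stated in the abstract:
\[
\mathbb{E}\big[R_T\big] = \tilde{O}(T^{\alpha}) \ \text{(worst case)},
\qquad
\mathbb{E}\big[R_T\big] = \tilde{O}(T^{\beta}) \ \text{(instance-dependent)}.
\]
% Tail-risk guarantee: the probability of an $\Omega(T^{\delta})$ regret
% decays exponentially in a polynomial of $T$; the exponent
% $\varepsilon(\alpha,\beta,\delta) > 0$ below is a placeholder for the
% paper's exact rate.
\[
\mathbb{P}\big(R_T \ge c\,T^{\delta}\big)
\le \exp\!\big(-\Theta\big(T^{\varepsilon(\alpha,\beta,\delta)}\big)\big).
\]
```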
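To make the non-stationary reward model concrete, below is a minimal Python simulation of the interaction protocol under assumed parameters: the arm means, the Gaussian noise, the sinusoidal baseline b(t), and the naive uniform policy are all hypothetical illustrations, not the paper's algorithm. One point the sketch makes visible is that the baseline is common to all arms, so it cancels in the pseudo-regret even though it confounds the raw reward observations available to the learner.

```python
import numpy as np

# Reward model from the abstract: observed reward in period t equals
# (mean of pulled arm) + (independent noise) + (non-stationary baseline b(t)).
rng = np.random.default_rng(0)
K, T = 3, 10_000
mu = np.array([0.5, 0.3, 0.1])  # hypothetical arm means

def baseline(t):
    # Hypothetical non-stationary baseline reward, shared by all arms.
    return 0.5 * np.sin(2 * np.pi * t / 1_000)

regret = 0.0
for t in range(T):
    arm = rng.integers(K)  # naive uniform policy, purely for illustration
    # The learner observes only `reward`, in which mu[arm] is confounded
    # by the baseline b(t).
    reward = mu[arm] + rng.normal() + baseline(t)
    # Pseudo-regret against the best arm; b(t) is added to every arm's
    # reward, so it cancels here.
    regret += mu.max() - mu[arm]

print(f"pseudo-regret of uniform play over T={T} periods: {regret:.1f}")
```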