Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Julia Olkhovskaya, Jack Mayo, Tim van Erven, Gergely Neu, Chen-Yu Wei
We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde{O}(\sqrt{KdT})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde{O}(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde{O}(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.
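To make the setting concrete, the following is a minimal sketch of the interaction protocol described above: contexts drawn i.i.d. from a fixed known distribution (a standard Gaussian is used here purely as an illustrative example of a log-concave density), adversarially chosen linear losses, and bandit feedback on the chosen arm only. The learner shown is a simple Exp3-style exponential-weights sampler over the $K$ arms, included only as a placeholder; it is not the truncated continuous exponential weights algorithm over the probability simplex analysed in the paper, and all parameter choices (such as the learning rate `eta`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, T = 5, 3, 1000                 # arms, context dimension, rounds
eta = np.sqrt(np.log(K) / (K * T))   # illustrative learning rate, not the paper's tuning

# Cumulative importance-weighted loss estimates per arm (a context-free
# simplification; the paper's policies may depend on the observed context).
cum_loss_est = np.zeros(K)
total_loss = 0.0

for t in range(T):
    # Context drawn i.i.d. from a fixed known distribution; a standard Gaussian
    # is one example of a log-concave density as assumed for the new bounds.
    x_t = rng.standard_normal(d)

    # Adversarial linear losses: the loss of arm a is <x_t, theta[a]>. Here the
    # loss vectors are simply sampled at random and the losses clipped to [0, 1]
    # for illustration; in the adversarial setting they may change arbitrarily.
    theta = rng.uniform(-1.0, 1.0, size=(K, d)) / np.sqrt(d)
    losses = np.clip(0.5 + theta @ x_t, 0.0, 1.0)

    # Exponential-weights sampling distribution over the K arms
    # (an Exp3-flavoured stand-in, not the paper's algorithm).
    w = np.exp(-eta * (cum_loss_est - cum_loss_est.min()))
    p = w / w.sum()

    a_t = rng.choice(K, p=p)
    loss = losses[a_t]
    total_loss += loss

    # Bandit feedback: only the chosen arm's loss is observed, so update its
    # estimate with the standard importance-weighted estimator.
    cum_loss_est[a_t] += loss / p[a_t]

print(f"average loss over {T} rounds: {total_loss / T:.3f}")
```

In this toy protocol the first- and second-order quantities from the abstract correspond to the cumulative loss of the best fixed policy ($L_T^*$) and the cumulative second moment of the learner's losses ($V_T$); when either is much smaller than $T$, the bounds stated above improve on the worst-case $\tilde{O}(\sqrt{KdT})$ rate.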