Part of Advances in Neural Information Processing Systems 28 (NIPS 2015)
Julien Audiffren, Liva Ralaivola
We study the restless bandit problem where arms are associated with stationary $\varphi$-mixing processes and where rewards are therefore dependent: the question raised by this setting is that of carefully recovering some independence by `ignoring' the values of some rewards. As we shall see, the bandit problem we tackle requires us to address an exploration/exploitation/independence trade-off, which we do through the idea of a {\em waiting arm} in Remix-UCB, a new algorithm we introduce that generalizes Improved-UCB to the problem at hand. We provide a regret analysis for this bandit strategy; two notable features of Remix-UCB are that i) it reduces to the regular Improved-UCB when the $\varphi$-mixing coefficients are all $0$, i.e. when the i.i.d. scenario is recovered, and ii) when $\varphi(n)=O(n^{-\alpha})$, it guarantees a controlled regret of order $\widetilde{O}\left( \Delta_*^{(\alpha- 2)/\alpha} \log^{1/\alpha} T\right)$, where $\Delta_*$ encodes the distance between the best arm and the best suboptimal arm, even when $\alpha<1$, i.e. when the $\varphi$-mixing coefficients {\em are not} summable.
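To make the trade-off concrete, here is a minimal, purely illustrative Python sketch of an Improved-UCB-style elimination loop augmented with a waiting phase between the samples that are actually kept. It is not the paper's Remix-UCB: the decay function `phi`, the waiting-gap schedule `wait`, and the inflated confidence radius are hypothetical stand-ins chosen only to mirror the exploration/exploitation/independence idea described above.

```python
import math
import random


def phi(n, alpha=1.5, c=1.0):
    """Assumed phi-mixing coefficient decay, phi(n) = O(n^{-alpha}) (hypothetical form)."""
    return c / float(n) ** alpha


def remix_ucb_sketch(pull, n_arms, horizon, alpha=1.5, seed=0):
    """Illustrative elimination loop in the spirit of Improved-UCB: between the
    rewards that are kept, the learner 'waits' (plays but discards rewards) long
    enough for the process to mix, and widens its confidence radius by phi(wait)."""
    rng = random.Random(seed)
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    delta_hat = 1.0  # current guess of the minimal gap, halved after each round
    t = 0
    while t < horizon and len(active) > 1:
        log_term = math.log(max(math.e, horizon * delta_hat ** 2))
        n_round = math.ceil(2.0 * log_term / delta_hat ** 2)  # kept samples per arm this round
        wait = math.ceil(delta_hat ** (-1.0 / alpha))         # waiting gap between kept samples
        for arm in active:
            for _ in range(n_round):
                sums[arm] += pull(arm)  # this reward is kept
                counts[arm] += 1
                t += 1
                # Waiting phase: keep playing (e.g. a cheap 'waiting arm') but
                # discard the rewards, only to let the process forget its past.
                for _ in range(wait):
                    pull(rng.choice(active))
                    t += 1
                if t >= horizon:
                    break
            if t >= horizon:
                break
        # Confidence radius = usual i.i.d.-style term + dependence penalty phi(wait).
        n_min = max(1, min(counts[a] for a in active))
        radius = math.sqrt(log_term / (2.0 * n_min)) + phi(wait, alpha)
        means = {a: sums[a] / max(1, counts[a]) for a in active}
        best = max(means.values())
        active = [a for a in active if means[a] + radius >= best - radius]
        delta_hat /= 2.0
    return max(active, key=lambda a: sums[a] / max(1, counts[a]))
```

In this sketch the waiting gap grows as the target gap $\Delta$ shrinks, which is one way to see why slowly mixing processes (small $\alpha$) force the learner to spend more rounds on "wasted" pulls and hence incur a larger regret exponent.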