Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Aadirupa Saha, Pierre Gaillard
We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA). In almost all dueling bandit applications, the decision space changes over time; e.g., retail store management, online shopping, restaurant recommendation, search engine optimization, etc. Surprisingly, this 'sleeping aspect' of dueling bandits has never been studied in the literature. As in dueling bandits, the goal is to compete with the best arm by sequentially querying the preference feedback of item pairs. The non-triviality, however, arises from the non-stationary item space, which allows any arbitrary subset of items to go unavailable in each round. The goal is to find an optimal no-regret policy that can identify the best available item at each round, as opposed to the standard 'fixed best-arm regret objective' of dueling bandits. We first derive an instance-specific lower bound for DB-SPAA of $\Omega\big(\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} \frac{\log T}{\Delta(i,j)}\big)$, where K is the number of items and Δ(i,j) is the gap between items i and j. This indicates that the sleeping problem with preference feedback is inherently more difficult than its classical multi-armed bandit (MAB) counterpart. We then propose two algorithms with near-optimal regret guarantees. Our results are corroborated empirically.
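The instance-specific lower bound above can be evaluated numerically for a given problem instance. The sketch below, with hypothetical gap values chosen purely for illustration, sums log(T)/Δ(i,j) over all item pairs i < j:

```python
import math

def db_spaa_lower_bound(gaps, T):
    """Evaluate the instance-specific regret lower bound
    Omega( sum_{i<j} log(T) / Delta(i, j) ) stated in the abstract.

    gaps: dict mapping each item pair (i, j), i < j, to its
          preference gap Delta(i, j) > 0 (hypothetical values here).
    T:    horizon, i.e. the number of rounds.
    """
    return sum(math.log(T) / delta for delta in gaps.values())

# Illustrative instance with K = 3 items; gap values are assumptions.
gaps = {(0, 1): 0.1, (0, 2): 0.2, (1, 2): 0.4}
bound = db_spaa_lower_bound(gaps, T=10_000)
```

Note that the bound grows as the gaps Δ(i,j) shrink and only logarithmically in the horizon T, matching the Ω(log T / Δ) shape familiar from gap-dependent MAB bounds, but summed over all K(K−1)/2 pairs.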