Dueling Bandits with Adversarial Sleeping

Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

Authors

Aadirupa Saha, Pierre Gaillard

Abstract

We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA). In almost all dueling bandit applications, the decision space often changes over time; e.g., retail store management, online shopping, restaurant recommendation, search engine optimization, etc. Surprisingly, this `sleeping aspect' of dueling bandits has never been studied in the literature. Like dueling bandits, the goal is to compete with the best arm by sequentially querying the preference feedback of item pairs. The non-triviality however results due to the non-stationary item spaces that allow any arbitrary subsets of items to go unavailable every round. The goal is to find an optimal `no-regret' policy that can identify the best available item at each round, as opposed to the standard `fixed best-arm regret objective' of dueling bandits. We first derive an instance-specific lower bound for DB-SPAA, $\Omega\big(\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} \frac{\log T}{\Delta(i,j)}\big)$, where $K$ is the number of items and $\Delta(i,j)$ is the gap between items $i$ and $j$. This indicates that the sleeping problem with preference feedback is inherently more difficult than that for classical multi-armed bandits (MAB). We then propose two algorithms with near-optimal regret guarantees. Our results are corroborated empirically.
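
To make the lower bound concrete, the following is a minimal sketch (not the authors' code) that evaluates the summation $\sum_{i<j} \frac{\log T}{\Delta(i,j)}$ for a small hypothetical instance. It assumes items are indexed in decreasing order of strength and that the gap is taken as $\Delta(i,j) = P(i,j) - 1/2$, the usual dueling-bandit convention; consult the paper for the exact definition used in the bound.

```python
import numpy as np

def db_spaa_lower_bound(P, T):
    """Evaluate the instance-specific quantity sum_{i<j} log(T) / Delta(i, j).

    P : (K, K) pairwise preference matrix with P[i, j] = Pr[item i beats item j],
        items assumed to be ordered from strongest to weakest (hypothetical input).
    T : horizon (number of rounds).
    Assumes Delta(i, j) = P[i, j] - 0.5 for i < j.
    """
    K = P.shape[0]
    total = 0.0
    for i in range(K - 1):
        for j in range(i + 1, K):
            gap = P[i, j] - 0.5
            if gap > 0:                     # skip tied pairs to avoid division by zero
                total += np.log(T) / gap
    return total

# Example instance: 4 items with a clear total order.
P = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
])
print(db_spaa_lower_bound(P, T=10_000))     # order-of-magnitude regret floor for this instance
```

Because the bound sums over all $\binom{K}{2}$ pairs rather than over the $K$ arms, it already reflects why preference feedback with sleeping arms is harder than classical MAB, where the analogous lower bound involves only a single sum over arms.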