Choice Bandits

Part of Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020)

Bibtex »Paper »Supplemental »

Bibtek download is not availble in the pre-proceeding


Arpit Agarwal, Nicholas Johnson, Shivani Agarwal


There has been much interest in recent years in the problem of dueling bandits, where on each round the learner plays a pair of arms and receives as feedback the outcome of a relative pairwise comparison between them. Here we study a natural generalization, that we term \emph{choice bandits}, where the learner plays a set of up to $k \geq 2$ arms and receives limited relative feedback in the form of a single multiway choice among the pulled arms, drawn from an underlying multiway choice model. We study choice bandits under a very general class of choice models that is characterized by the existence of a unique `best' arm (which we term generalized Condorcet winner), and includes as special cases the well-studied multinomial logit (MNL) and multinomial probit (MNP) choice models, and more generally, the class of random utility models with i.i.d. noise (IID-RUMs). We propose an algorithm for choice bandits, termed Winner Beats All (WBA), with distribution dependent $O(\log T)$ regret bound under all these choice models. The challenge in our setting is that the decision space is $\Theta(n^k)$, which is large for even moderate $k$. Our algorithm addresses this challenge by extracting just $O(n^2)$ statistics from multiway choices and exploiting the existence of a unique `best' arm to find arms that are competitive to this arm in order to construct sets with low regret. Since these statistics are extracted from the same choice observations, one needs a careful martingale analysis in order to show that these statistics are concentrated. We complement our upper bound result with a lower bound result, which shows that our upper bound is order-wise optimal. Our experiments demonstrate that for the special case of $k=2$, our algorithm is competitive when compared to previous dueling bandit algorithms, and for the more general case $k>2$, outperforms the recently proposed MaxMinUCB algorithm designed for the MNL model.