NeurIPS 2020

Choice Bandits


Review 1

Summary and Contributions: The authors consider a generalization of duelling bandits where, at each time t, a learner selects a subset S_t \subset \{1,...,n\} of items of size at most k < n, and the environment picks one of the items of S_t. The item picked most often is the preferred item of S_t, and it is assumed that there exists a Condorcet winner, i.e. an item which is preferred over any other item. For k = 2 this is simply duelling bandits; for k > 2 it is a generalization of the problem. The authors prove an O(n^2 ln n + n (ln T)/\Delta) regret lower bound, and they propose an algorithm which is order optimal, so that its regret is upper bounded by an expression of the same order.
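For concreteness, here is a minimal sketch of one round of this feedback protocol under an MNL choice model (the utilities, subset, and all names below are illustrative, not the paper's notation):

    import random

    # Hypothetical MNL utilities for n = 4 items; item 0 has the largest
    # utility, so it is preferred in every subset (the Condorcet winner).
    v = [2.0, 1.0, 0.8, 0.5]

    def choice_feedback(S, v):
        # Environment picks item i of S with probability v[i] / sum(v[j] for j in S).
        return random.choices(S, weights=[v[i] for i in S], k=1)[0]

    S_t = [0, 1, 2]                   # learner plays a subset of size k = 3
    winner = choice_feedback(S_t, v)  # here item 0 wins with probability 2.0 / 3.8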

Strengths: The problem is interesting, the proposed algorithm is both novel and order optimal, and the authors perform numerical experiments to demonstrate that their approach is superior to the current state of the art. The paper is well written and easy to follow.

Weaknesses:
- The statement of Theorem 1 seems imprecise. Is \Delta simply a parameter, or is it related to the reward gap for all models? (I understand that it is linked to the reward gap for the MNL model, but for other models this seems unclear.) If \Delta is just a parameter, this would imply that the proposed algorithm cannot be claimed to be order optimal outside of the MNL model, contrary to what the authors seem to claim.
- For this type of general structured bandit problem, general regret lower bounds are known (see for instance http://papers.nips.cc/paper/6773-minimal-exploration-in-structured-stochastic-bandits), so it is a pity that the authors did not investigate these.
- The proposed algorithm has an input parameter C, which must be chosen "large enough" for the regret bound of Theorem 2 to hold. However, the authors never specify what information is necessary to make this choice. Must C be greater than some universal constant? Must it be chosen in a problem-specific manner? If so, this greatly diminishes the value of the proposed algorithm, unless one has some sort of method to select C automatically.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper introduces the choice bandit problem, a natural generalization of the dueling bandits problem, which has attracted significant interest from researchers over the last few years. In particular, the paper studies a problem where the learner is allowed to play up to k (>= 2) arms and receives feedback in the form of which arm "wins" among the selected set of arms. Here, the learner is interested in identifying the "best arm", the one that has the highest probability of winning among any subset of arms. Assuming a fairly generic random utility framework for choice models (the mechanism by which the winning arm is decided), the paper (1) establishes a fundamental lower bound for this problem and (2) presents an algorithm that achieves near-optimal performance (with respect to the regret notion defined in the paper).

Strengths: 1. The problem considered in the paper, online learning in choice models, is an important problem with a growing number of real-world use cases. One particular strength of the paper is that the authors present an algorithm (and analysis) that is applicable beyond the multinomial logit choice model (MNL), whose limitations in practice are well known. 2. The paper presents theoretical guarantees on the proposed algorithm and backs them up with a fundamental lower bound on the problem.

Weaknesses: Though the paper is largely well written, there is room for improvement in the technical presentation. In particular, the paper as written is not completely self-contained (see below for more details), and a general reader might find it hard to understand all the details without reading the previous work.

Correctness: Yes.

Clarity: The theoretical notions (and the motivation) considered in the paper are not self-contained. In particular, the paper relies more than usual on the existing literature to justify many algorithmic and benchmark choices. For example, after reading the paper, I'm not sure about its high-level goal: is it to identify one best arm, or to play assortments with many arms close to the best arm? Related to this point, the notion of benchmark and regret also needs more description. There is a single line stating that this is motivated by previous work; however, there is minimal justification of why this is the right notion of regret and what it aims to achieve. It is clear that the regret for playing S_t = {i*} is 0, but what happens when we play arms whose p(i|S) is much lower? For example, in the MNL model, including items whose v_i ~ 0 shouldn't hurt us if the goal is to ensure the best arm gets the maximum chance to win. It would be ideal to elaborate on the reasons why this notion of regret is in line with what the authors want. Lastly, I'd also like to see the paper motivate the problem from a real-application point of view.
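To illustrate the v_i ~ 0 point above, a small numerical sketch (the utilities are made up):

    # MNL choice probability p(i|S) = v_i / sum_{j in S} v_j.
    def p(i, S, v):
        return v[i] / sum(v[j] for j in S)

    v = [2.0, 1.0, 1e-6]          # item 2 has utility v_2 ~ 0
    print(p(0, [0, 1], v))        # ~0.66667: best arm against one competitor
    print(p(0, [0, 1, 2], v))     # ~0.66667: adding item 2 barely changes p(i*|S)

Including the near-zero-utility item leaves the best arm's win probability essentially unchanged, which is why it is unclear whether such sets should incur regret under the paper's benchmark.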

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Missing reference with respect to dynamic learning in the MNL model: MNL-Bandit: A dynamic learning approach to assortment selection.

Post rebuttal: I acknowledge that I have read the authors' response and the other reviewers' comments, and I am happy to maintain my original score.


Review 3

Summary and Contributions: The paper deals with a variant of the dueling or battling bandit problem in which it is allowed to pull up to k arms in each time step. In addition, a more general class of multiwise comparison models is considered than the ones investigated in the battling bandit problem or the Multinomial Logit model used in related settings. A lower bound on the expected regret is shown for the general class of multiwise comparison models, which in particular gives a more refined bound than the one previously shown for the case of MNL models. Moreover, a learning algorithm is suggested for the considered problem and analyzed theoretically with respect to its expected regret. Finally, the proposed algorithm is investigated in numerous experiments on synthetic as well as real-world datasets and compared to related algorithms.

==================== Post Rebuttal ====================

After reading the other reviews as well as the authors' response, I think that the authors have not addressed two of the concerns mentioned in the reviews thoroughly enough, namely the independence of the complexity terms from k as well as the suggestion to compare with the lower bounds coming from structured bandits. In particular, I think it should be possible to recover a lower bound for their setting from structured bandits, since the lower bound for the dueling bandit problem with a Condorcet winner can be recovered this way as well. However, as the authors give their own proof for their lower bound, I think this is appropriate. Nevertheless, it would have been interesting to include the corresponding OSSB algorithm for their problem in the experiments. The response regarding the independence of the complexity terms from k is somewhat unsatisfactory, as it has not become clear what the "more subtle dependence" is. I really would have liked to see more details on this issue, for instance an enlightening example. On the other hand, the remaining issues, such as the hidden dependencies of some terms in the upper bound, the high-level goal, and the choice of C, are addressed in an appropriate way, and if the authors make the promised adaptations, I will be fine with that. All things considered, I would keep my original score.

Strengths: The paper deals with an interesting as well as practically relevant variant of the dueling/battling bandit problem. The multiwise comparison models considered in the paper are more general than the ones investigated in the battling bandit problem and also go beyond the Multinomial Logit model considered previously in some related works. Lower bounds on the expected regret for these general choice models are shown. The theoretical analysis is, except for some minor issues, sound, and the paper is easy to read. The suggested algorithm is shown to be empirically appealing in comparison with existing methods, especially in the MNL case, and moreover the theoretical guarantees for the MNL case are tight.

Weaknesses: I think that some relevant terms are missing in the regret upper bound (please see the detailed comments below), which slightly reduces the attractiveness of the resulting upper bounds. Also, there seems to be a gap between the upper and lower bounds for the general GCW model case.

Correctness: As already mentioned, there seem to be some terms missing in the regret upper bound. Also, some statements in the discussion following Theorem 2 are questionable or speculative (please see the detailed comments below).

Clarity: The paper is well written and easy to read, and the main contributions are stated clearly. However, the authors do not specify the admissible range of the algorithm's parameter required to obtain the stated regret upper bound, and they are unspecific about one relevant problem parameter in their lower bound result (please see the detailed comments below).

Relation to Prior Work: Yes, as the authors provide a paragraph on related work and also have an extensive section in the supplementary material.

Reproducibility: Yes

Additional Feedback:

% Regret lower bound: The statement of the lower bound is peculiar, as the role of the parameter \Delta is unclear. How does it relate to the underlying GCW problem instance? This relationship is of utmost importance in order to compare the lower bound result with the upper bound result. As far as I can tell, \Delta here is equivalent to \Delta_min^{GCC} of the underlying GCW problem, right?

% Regret upper bound: I think the stated upper bound hides important terms in the log-term. More precisely, in the proof of Theorem 2 you drop at one point the factor "n times C" in the rightmost log-term. In my opinion, this is not entirely justified, since the parameter C has to be chosen larger than C' as defined on page 11, which in turn is large if P_{i^*,i}^{GCC} is near 1/2, or equivalently if d(P_{i^*,i}^{GCC}, 1/2) is near 0 (see the small numerical sketch at the end of this review). Thus, d(P_{i^*,i}^{GCC}, 1/2) is yet another hardness parameter occurring in the regret upper bound, just like the already considered \Delta terms. By the way, each O-term in the proof of Theorem 2 should be O(n^2 log(n)).

% Role of the parameter C: Related to the previous issue, it would be relevant to be more explicit about the smallest possible choice for the parameter C in the statement of Theorem 2, as this choice depends on the relevant hardness parameters of the underlying problem instance, as mentioned in the previous point.

% Discussion after Theorem 2: If my conjecture regarding the \Delta term of the lower bound is correct, then the statement that the regret upper bound is asymptotically optimal is only valid for the case of the MNL model. In other words, there is a gap between the upper and lower bounds for the general GCW model case. Also, the statement "It is also important to note that our regret bound does not depend directly on the choice set size k" is questionable. Both hardness parameters \Delta_min^{GCC} and \Delta_max^{GCC} depend on k and in particular lead to increasing regret bounds as k increases. Only if you could show that the ratio of \Delta_min^{GCC} and \Delta_max^{GCC} is bounded by a constant independent of k for any admissible k could you infer the statement above.

% Other little things:

Main paper:
- line 205: 'anchor arm a_t'
- line 242: Delete "the".
- line 272: measurable "set"

Supplement:
- Some of the long mathematical displays are missing a period at the end.
- Preselection Bandits also use a different notion of regret. It would also make sense to mention them in the main part and include them in Table 1.
- The weights w_i are defined incorrectly on page 5, as it should be exp instead of log. Also, \kappa is not introduced.
- In the proof of Lemma 2, there are various S's instead of S_t's. Moreover, Y_t has nothing to do with the event y_t = i^*, and it should be Z_C = 1 for Y_t = i^* and Z_C = 0 for Y_t = i.
- "Regret" after Lemma 4 on page 13.
- n_i^{suf} is not an integer in general, so appropriate rounding is needed in the sums occurring in the proof of Lemma 4.
- Don't you need a union bound for "\exists t \in [T]" at the passage where Lemma 2 is applied? If not, you could be more precise here.
- One real-world dataset should be the "IrishDublin" dataset on page 16.
- The sum in the proof of Fact 1 is infinite, while you only have a finite set of coins according to the assertion.
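To make the hidden-hardness point under "% Regret upper bound" concrete, a quick numerical sketch (the function name is mine; that C' scales like 1/d(P_{i^*,i}^{GCC}, 1/2) is my reading of page 11): the binary KL term d(p, 1/2) vanishes quadratically as p approaches 1/2, so such a constant blows up on hard instances.

    import math

    # Binary KL divergence d(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)).
    def kl_binary(p, q):
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    for eps in [0.1, 0.01, 0.001]:
        # d(1/2 + eps, 1/2) ~ 2 * eps^2, so 1/d(., 1/2) grows like 1/eps^2.
        print(eps, kl_binary(0.5 + eps, 0.5))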