NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6225
Title: Thresholding Bandit with Optimal Aggregate Regret

Reviewer 1

Overview: This paper studies the thresholding bandit problem under an alternative definition of regret to the one considered previously in the literature. The authors then provide an algorithm specifically tailored to minimizing this regret (up to some potentially enormous constant factors). They provide an analysis of the regret of this algorithm and demonstrate experimentally that it outperforms algorithms developed for minimizing a different regret definition. However, I would have liked to see more discussion and comparison of the performance of the different algorithms under different regret definitions, and details of the parameter choice for their algorithm.

Comments:
- I don't find the motivation for the need to consider aggregate regret particularly convincing.
- Given that there is a lot of discussion of the simple regret, it would be good to include a definition of it and a discussion of how the two regrets relate to each other, e.g. does one imply the other?
- In the discussion of Locatelli et al. (2016), it is stated that the algorithm is parameter-free and then that it takes a parameter epsilon.
- The uniform sampling method introduced in line 69 seems very naive, so I am surprised it is only a factor of K worse than the proposed method.
- Although space is tight, I would have liked to see at least some discussion of the best arm identification problem.
- I think the discussion of the optimization problem in the case where the gaps are known is nice. It is also nice to see how this motivates the algorithm used.
- The outline of the proof of Theorem 1 would be much better placed next to Theorem 1 rather than in its current position in the introduction (lines 107-118). It also seems like a fairly standard analysis, with the only interesting aspect being the avoidance of a union bound.
- A lot of the constants seem quite arbitrary (e.g. why do we need alpha <= 8 and T > 40?).
- The constant in Remark 2 is enormous. The authors should at least attempt to optimize it to give a meaningful result rather than making 'no effort' to do so.
- How should alpha be chosen? In Remark 2 it is 1/20, whereas in the experiments it is (somewhat arbitrarily) 1.35.
- How are a and b chosen in Lemma 19? It seems somewhat unintuitive that for large values of b, the distance between the empirical mean and the expected value is large, but the constant in front of the exponentially small probability more or less stays the same.
- Why is this analysis applicable to other areas but the analysis of MOSS not (line 267)?
- I would like to see the aggregate regret of the APT algorithm of Locatelli et al. (2016) and the simple regret of this algorithm for comparison.
- In Figure 1, why is APT given with lots of different parameter settings, whereas their algorithm is given with only one?
- In Figure 3 of Appendix F.3, the range of alphas considered is quite small. What happens if we set alpha = 1/20 as in Remark 2?
- It would be good to see confidence bounds on the experimental results to see whether the differences in performance are significant.
- I would like to see experimental results for the simple regret considered by Locatelli et al. (2016).

Clarity: The paper is often difficult to read, and care should be put into proof-reading to eliminate typos and grammatical errors. The introduction could also be significantly improved.

After rebuttal: Thank you to the authors for providing a detailed rebuttal that addressed most of my concerns. I was particularly pleased to see a discussion of the relationship between the two regrets, the difference between their analysis and that of MOSS, and the significance of the suboptimality of the uniform sampling method. I have therefore raised my score. For the final version, I hope that the authors will work on the readability of the paper and add more details about the choice of alpha.
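Since the relationship between the two regrets recurs throughout the reviews, a minimal sketch may help fix intuition. This uses assumed definitions consistent with the discussion (threshold 1/2, per-arm labels z_i = 1{theta_i >= 1/2}): an aggregate-style loss averages misclassification errors over arms, while a simple-regret-style loss records whether at least one arm is misclassified. The function names (`per_arm_errors`, `aggregate_loss`, `any_error_loss`) are illustrative, not from the paper.

```python
# Toy illustration (assumed definitions, for intuition only): each arm i
# has mean theta_i, and the learner outputs a guess z_hat_i for
# z_i = 1{theta_i >= 1/2}.

def per_arm_errors(theta, z_hat, tau=0.5):
    """Misclassification indicator for each arm."""
    return [int((t >= tau) != bool(z)) for t, z in zip(theta, z_hat)]

def aggregate_loss(theta, z_hat):
    """Fraction of arms misclassified; its expectation is an
    aggregate-style regret (errors counted on all arms)."""
    errs = per_arm_errors(theta, z_hat)
    return sum(errs) / len(errs)

def any_error_loss(theta, z_hat):
    """1 if at least one arm is misclassified, else 0; its expectation
    is a probability-of-error, simple-regret-style measure."""
    return int(any(per_arm_errors(theta, z_hat)))

theta = [0.1, 0.4, 0.6, 0.9]   # true means, threshold 1/2
z_hat = [0, 1, 1, 1]           # one mistake, on the second arm
print(aggregate_loss(theta, z_hat))  # 0.25
print(any_error_loss(theta, z_hat))  # 1
```

Under these definitions, the aggregate loss is always at most the any-error indicator, so driving the probability of any error to zero forces the aggregate regret to zero, but not conversely: a vanishing average error rate is compatible with some error occurring almost surely when K is large.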

Reviewer 2

Significance and originality: There is a huge (and growing) number of small variations of settings for pure exploration in MABs, and this paper adds to them by changing the notion of regret in thresholding bandits. The algorithm LSA is quite similar in form to the APT algorithm of Locatelli et al. That said, LSA is well motivated (via the optimisation problem), and the authors provide a well-rounded analysis of the problem with non-trivial methods, which is always interesting for experts.

Quality and clarity: The paper is quite complete and thorough, spontaneously answering many natural questions (e.g. illustrating how the uniform sampling algorithm behaves, how to tune alpha, ...). Almost all mathematical operations are well motivated, which is a pleasure for the reader, and the mathematical statements are followed by insightful comments. The experiments are quite complete too and seem to show honestly that their algorithm behaves well in practice. There is, however, a caveat: the theoretical dependence on \alpha is a little worrying, as the lambda_i's themselves depend on \alpha. Therefore, Remark 2 is not completely clear, and the claim of instance optimality is not properly proven. I do not think this is a bad problem, but it should be clarified. In general, I have the feeling that the exposition in Section 4 could be improved, as it confronts the reader very suddenly with some heavy notation that hinders readability. All in all, I think that this is a strong contribution, currently with an exposition problem in a critical part of the paper.

Some typos:
- l.32: z_i = 1 iff \theta_i \geq 1/2. Overall, the description of the problem seems weird to me; why not just say we want to predict whether \theta_i \geq 1/2, without talking about the z_i?
- l.90: why is that?
- l.162: "Note that this is all AN algorithm can do [...]" (the "an" is missing). Why is this?
- l.170: approximates P_c^\star well
- l.173: I think you mean max?
- l.174: function on x <- function of x
- l.182: value <- values
- l.203: alpha = 1/20 <- alpha = 1/10?
- l.206: demonstrated <- proved, or stated?
- l.268: MOSS is not asymptotically optimal in the usual sense, although it is minimax optimal.
- l.302: the delta's in the definition of H are not the delta's in the TBP
- l.332: the celebrated Hoeffding's maximal inequality <- Hoeffding's celebrated maximal inequality
- l.408: applies <- apply to
- l.454: performing THE uniform

Post-rebuttal edit: The setting and the algorithm are interesting. However, the issue I find critical, about optimality, has not been answered properly in the author feedback, and I maintain my score. To me, this is a potentially strong but incomplete work as long as the regret bounds are not more readable/comparable to lower bounds.

Reviewer 3

The paper considers the thresholding bandit problem, where aggregate regret is minimized instead of simple regret (errors counted on all arms, instead of at least one). A new algorithm is proposed for this slightly modified setting. The paper derives the aggregate regret bound for the algorithm, which roughly matches the lower bound provided in the appendix. Experiments on synthetic data show that the algorithm performs better than some existing algorithms (which were designed for a slightly different variant of the problem).

Minimizing aggregate regret (as opposed to simple regret) seems to make sense in some scenarios. Consequently, it does warrant additional analysis. Given that the existing algorithms were designed for different problems, it is difficult to assess the improvement in performance (regret bound or empirical). The theoretical analysis contains sufficient new elements (including the variable confidence level bound) to be considered worthy of publication. The article is well written, and the split of content between the main submission and the appendix is also reasonable. Perhaps the lower bound could be stated in the main part as well.

Update after the author response: I appreciate the authors' effort, in particular the discussion on simple and aggregate regret and the additional experiments.