NeurIPS 2020

Finite Continuum-Armed Bandits


Review 1

Summary and Contributions: This paper introduces the problem of finite continuum-armed bandits, with N arms in [0,1] and the constraint that each arm can be pulled at most once. Learning remains possible thanks to the regularity of the reward function. For this problem, it provides an algorithm that is optimal (up to constants and polylog factors). Surprisingly, its regret bound is of order T^{1/3}, while the minimax rate is T^{1/2} in most bandit problems. Besides proving its upper bound, the paper thus also proves a lower bound.

Strengths: By adding a new constraint to the continuum-armed bandit problem, this work shows that the achievable regret is actually much smaller (but not the total reward). This result is rather surprising at first and very nice. The theoretical analysis is smooth and fairly complete.

Weaknesses: Although the paper does give the elements needed to understand why we get a T^{1/3} bound instead of T^{1/2}, I think they could have been better highlighted. (I give my suggestions, and how I felt when reading the paper, in the additional feedback section.) Slightly related to this, I think that the assumptions could be better justified when they are presented (see the additional feedback section as well). These are only minor weaknesses in my opinion, which justifies my overall score.

Correctness: The proofs seem correct to me. I might have missed some minor bugs in them, but the major elements seem ok.

Clarity: I find the paper globally well written. But, as mentioned in the weaknesses section, I think that some points could have been presented differently, especially the reasons why we obtain T^{1/3} and not T^{1/2}.

Relation to Prior Work: The relation to prior work is clearly discussed. Yet, when reading the introduction (especially the motivation for this model), I think it is necessary to explain the differences from, and complementarity with, contextual bandits.

Reproducibility: Yes

Additional Feedback: I have read the authors' response and the other reviews. I had no personal concern about this work and thus maintain my score.
*********************************************************
- While reading the paper at first, I thought for a long time that the T^{1/3} regret (which is mainly a consequence of Remark 1) was due more to Assumption 3 than to the constraint of one pull per arm put forward by the authors. It is actually a combination of the two, since Assumption 3 already allows an improvement from T^{2/3} to T^{1/2} in the continuum-armed bandit (Auer et al., 2007). This is mentioned in the paper, but only on page 7. I would personally prefer to see it just after Assumption 3 is introduced. Besides justifying this assumption, it would also show that it already improves the regret in the continuum-armed bandit, though not enough to reach T^{1/3}.
- How is Assumption 2 a local Lipschitz condition? This is not the definition of local Lipschitzness I know, and taking the max with M - m(x) is actually strange. It would be great to have a discussion about this.
- Still on Assumption 2, I think this work can easily be extended to any alpha-Hölder reward function (the condition I mean is spelled out after these remarks). It might be worth at least briefly mentioning this after Assumption 2.
******* Minor remarks **********
- Line 83: the sentence about the work of Chakrabarti et al. is confusing. It is not the rewards that are drawn i.i.d. from some known distribution, but the mean rewards.
- In the for loop of the algorithm, it seems that you actually wanted to define I_k as [(k-1)/K, k/K) \cap { a_1, ..., a_N }.
- There is a typo in the title of the CMT submission. This might cause trouble in the proceedings in case of acceptance.
Some typos I noticed:
- line 80: "where" -> "were"
- line 165: "maximas" -> "maxima"
- line 174: "obtains" -> "obtained"
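For concreteness, by alpha-Hölder I mean the standard smoothness condition below (the Lipschitz case corresponding to alpha = 1); this is the generic textbook condition, not the paper's exact Assumption 2:

|m(x) - m(y)| \le L |x - y|^{\alpha}   for all x, y \in [0,1], with L > 0 and \alpha \in (0, 1].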


Review 2

Summary and Contributions: This work proposes a novel bandit framework called finite continuum-armed bandits, where, on top of the classical continuum-armed bandit setting, the authors add the constraint that each arm can be pulled at most once. Furthermore, they assume that the total budget is a fixed proportion p \in (0,1) of the number of arms. The authors prove an Omega(T^{1/3}) lower bound and design a matching algorithm that achieves the optimal bound up to logarithmic factors. After rebuttal and discussion: I am fairly satisfied with the authors' answer regarding the dependency on p. It seems, however, that Assumption 1 is still questionable; I thus set my final score to 6.

Strengths: This paper achieves better regret bounds in a rather novel setting (to the best of my knowledge) than in the classical CAB setting, which is an interesting result.

Weaknesses: I have several questions, however, on which I would like the authors to elaborate a bit more. First of all, could the authors elaborate on the motivation for this setting (I agree that it is theoretically interesting)? The motivating examples do not clearly justify the assumption of fixing T as a fraction p of N. To me, an unknown p would be more realistic (which is the case for the classical CAB or many-armed bandits). This is important since it seems that the gain in the bound comes from this p. If we take the extreme case where N -> \infty, we reduce to the classical CAB case (as mentioned by the authors). This is actually equivalent to considering p -> 0, and in that case, it appears to me that the current lower bound in the paper does not make sense. So I would like to ask whether the authors could give more intuition on how this p affects the analysis. Finally, would it be possible to provide some light experimental illustrations? The algorithm looks implementable, and it could be interesting to see how it behaves in practice.
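To make the request concrete, here is a minimal simulation sketch of the kind of light experiment I have in mind. It implements a generic two-phase bin-then-commit strategy of my own, not the authors' Algorithm 1; the reward function m, the noise level, the number of bins K, and the budget fraction p are all placeholder choices:

import numpy as np

rng = np.random.default_rng(0)

# Toy F-CAB instance: N arms in [0, 1], each pullable at most once,
# budget T = p * N, observed reward = m(arm) + Gaussian noise.
N, p, K = 10_000, 0.3, 50          # number of arms, budget fraction, number of bins
T = int(p * N)
arms = np.sort(rng.uniform(0.0, 1.0, size=N))

def m(x):
    # Placeholder smooth mean-reward function with a single peak at x = 0.7.
    return 0.9 * np.exp(-8.0 * (x - 0.7) ** 2)

pulled = np.zeros(N, dtype=bool)

def pull(i):
    # Pull arm i once and observe a noisy reward.
    pulled[i] = True
    return m(arms[i]) + 0.1 * rng.normal()

# Phase 1 (explore): spend ~10% of the budget uniformly across the K bins.
bins = np.minimum((arms * K).astype(int), K - 1)   # bin index of each arm
n_explore = max(1, int(0.1 * T / K))               # pulls per bin in phase 1
estimates, spent = np.full(K, -np.inf), 0
for k in range(K):
    idx = np.where((bins == k) & ~pulled)[0][:n_explore]
    rewards = [pull(i) for i in idx]
    if rewards:
        estimates[k] = np.mean(rewards)
    spent += len(idx)

# Phase 2 (commit): spend the remaining budget on the bins with the best estimates.
for k in np.argsort(estimates)[::-1]:
    if spent >= T:
        break
    idx = np.where((bins == k) & ~pulled)[0][: T - spent]
    for i in idx:
        pull(i)
    spent += len(idx)

# Pseudo-regret against the oracle that pulls the T arms with the highest mean reward.
oracle = np.sort(m(arms))[::-1][:T].sum()
collected = m(arms[pulled]).sum()
print(f"pseudo-regret of the toy bin-then-commit strategy: {oracle - collected:.1f}")

Even a toy comparison of this kind against the best-T-arms oracle, with the regret reported as N grows and for a few values of p, would help illustrate how the bound behaves in practice.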

Correctness: The paper's contributions are purely theoretical; the claims are correct.

Clarity: The paper is sound in terms of writing.

Relation to Prior Work: It is quite clear to me how the new setting differs from previous ones. However, it seems to me that the paper still lacks some discussion of related work, in particular the line of research on X-armed bandits such as Bubeck et al. 2011, Slivkins 2011, Munos 2011, Bull 2015, Locatelli and Carpentier 2018, etc. (the list is non-exhaustive). Indeed, in those papers, even though it is not explicitly stated, you usually do not sample one arm more than once. I would also suggest that the authors discuss a bit the assumptions they make compared to, e.g., Kleinberg 2004, Minsker 2013, and Locatelli and Carpentier 2018.

Reproducibility: No

Additional Feedback: The most important minor comment: the title in CMT is wrong (missing an 'n').
Minor comments:
- Line 133: "in the F-CAB setting than in the less constrained" -> "in the F-CAB setting than that in the less constrained..."
- Line 149: "the cumulative rewards are lower than in the classical multi-armed" -> "are lower than those in the classical..."


Review 3

Summary and Contributions: The paper introduces a continuum-armed bandit variant in which the player can pull each of the discrete arms only once. Under some assumptions, the algorithm proposed for the new variant achieves O(T^{1/3}) regret. A matching lower bound is also provided.

Strengths: The new problem is interesting, and the theoretical work is valuable.

Weaknesses: Some assumptions seem unnatural, resulting in questionable applicability of the problem.

Correctness: The paper is mostly sound.

Clarity: The paper is mostly clearly written.

Relation to Prior Work: It is sufficient.

Reproducibility: Yes

Additional Feedback: There are many applications where arms die (or degrade) after being pulled.
Assumptions 2 and 3 are similar to the ones in [Auer et al. 2007] with \alpha=1 and \beta=1, which are on the easier side. The shift of the regret order from Kleinberg's T^{2/3} to T^{1/2} is in fact due to \beta (\beta=1 being the easier problem). The problematic assumption is Assumption 1. While it seems an 'innocent' one, it makes it much more difficult for nature to choose a hard instance. Nature can choose a function, and then Assumption 1 smooths it (in expectation). It is also difficult for me to imagine an application where this assumption would be natural.
The paper places a strong focus on the Mth arm. Choosing the first M arms is imperative for the optimal policy, but for a slightly sub-optimal policy, including the top M-1 arms would result in only constant regret. It would probably be possible to construct an easy problem where the Mth arm is difficult but the top M-1 arms are easy (much better than the rest), but this does not work because of Assumption 1.
In Algorithm 1, the alive intervals are defined as those that have at least two arms. I assume that this is incorrect, and that an alive interval is one that has at least one arm that has not been pulled. This should be corrected or clarified.
While the regret of the algorithm is stated as O(T^{1/3}), the constant C_{L,Q,p} hides the nature of the dependency on p. It is clear that for some values of p the regret should get at least closer to O(T^{1/2}). Given the pivotal role of this constant, it is necessary to make it explicit and probably to have a discussion about it. Right now, it is difficult to figure out even the dependence on p.
------------
I have read the authors' feedback. I hope that the authors will indeed include the discussion on p in the revised version (in case of acceptance). I am not at all convinced by their argument on Assumption 1.