Paper ID: 277

Title: Batched Multi-armed Bandits Problem

i. Novelty and significance: The problem addressed in the paper appears new, and the proposed "BaSE" algorithm appears novel for this purpose. The technical contribution seems sound: the authors analyse both the minimax and the problem-dependent (adaptive) regret of their algorithm with pre-specified (hence fixed) and (more natural and challenging) adaptive batch sizes, and they also prove a matching lower bound guarantee, establishing the optimality of their proposed method.

ii. Clarity of some results:
a) Contrary to what is claimed in Cor. 1, it is confusing that I do not see how Thm. 1 recovers the optimal O(\sqrt{KT}) regret bound of the classical MAB framework (i.e., when M = T), unless of course T \to \infty, which is an asymptotic guarantee. A thorough derivation of the Cor. 1 statement would be appreciated.
b) I am also surprised by the matching lower bound statement (Thm. 2), as for M = T it definitely seems to be higher than the classical \Omega(\sqrt{KT}) lower bound for MABs -- what am I missing?
c) Intuitively, the learner should have better control in the data-driven grid setting. The analysis, however, shows that the regret bounds obtained for the two settings are exactly the same (Thm. 1 and 4, or Thm. 2 and 3). Why is that? An elaboration of Lines 154-157 would be useful.

iii. Experiments: It is slightly disappointing that the paper does not provide any empirical studies to validate the theoretical guarantees.

iv. Organization and presentation: The paper is overall well written and easy to follow.

In the multi-armed bandits problem, it is assumed that the player interacts with the environment in a fully sequential manner: the decision in round t can depend on the feedback received during rounds 1,...,t-1. In some applications this is not realistic; e.g., in clinical trials, you may test a drug on several patients simultaneously and receive all the feedback at once. This paper studies such batched multi-armed bandits problems, where the player can have M rounds of interaction with the environment. The rounds of interaction can be predetermined by the policy (the static grid model), or they can be based on the received data (the data-driven grid). The case M=T corresponds to the classical multi-armed bandits problem. The paper provides an algorithm (called BaSE) for the static grid model, which is based on eliminating bad arms. They give minimax and problem-dependent regret upper bounds for this algorithm (Theorem 1). Certainly, these also give upper bounds for the data-driven model, since in the data-driven model the player has more power and so the regret cannot be larger. Theorem 1 implies that M=O(log log T) batches are sufficient to obtain optimal minimax regret, and M = O(log T) batches are sufficient to obtain optimal problem-dependent regret. The paper also discusses lower bounds. In Theorem 2, information-theoretic lower bounds are proved for the static model, which are within polylog factors of the upper bounds. Finally, in Theorem 3, information-theoretic lower bounds are proved for the data-driven model, which are within M^2 times polylog factors of the upper bounds. The paper is well written, the technical contributions are new and solid, and the results are strong (especially for the static model). A systematic treatment of this problem is missing in the literature, and this paper fills the gap. It is likely that some of the technical lemmas of this paper (e.g., Lemma 3) may be useful for other researchers. I haven't checked the proofs.
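To make the summary above concrete, here is a minimal sketch of a *generic* batched successive-elimination policy of the kind the review describes: within each batch of a static grid, every surviving arm is pulled equally, and after the batch closes, arms whose empirical mean falls well below the empirical leader are eliminated. This is not the authors' exact BaSE algorithm; the grid choice, confidence radius, and function names here are my own illustrative assumptions.

```python
import math
import random

def batched_elimination(means, T, M, seed=0):
    """Illustrative batched elimination sketch (NOT the paper's exact BaSE).

    means: true Bernoulli means of the K arms (used only to simulate rewards)
    T:     time horizon; M: number of batches (static grid, chosen below)
    Returns (total_reward, surviving_arm_indices).
    """
    rng = random.Random(seed)
    K = len(means)
    active = list(range(K))
    sums = [0.0] * K
    pulls = [0] * K
    # One possible static grid t_1 < ... < t_M = T (a geometric grid;
    # the paper's optimal grid may differ).
    grid = [min(T, int(T ** ((m + 1) / M))) for m in range(M)]
    t = 0
    total = 0.0
    for t_m in grid:
        budget = t_m - t
        if budget <= 0 or not active:
            continue
        # During rounds t to t_m - 1, pull each active arm equally often.
        per_arm = max(1, budget // len(active))
        for i in active:
            for _ in range(per_arm):
                if t >= T:
                    break
                r = 1.0 if rng.random() < means[i] else 0.0
                sums[i] += r
                pulls[i] += 1
                total += r
                t += 1
        # End of batch: eliminate arms far below the empirical leader.
        emp = {i: sums[i] / max(1, pulls[i]) for i in active}
        best = max(emp.values())
        radius = math.sqrt(math.log(max(T, 2)) /
                           max(1, min(pulls[i] for i in active)))
        active = [i for i in active if emp[i] >= best - 2 * radius]
    return total, active
```

Note how feedback is only used at batch boundaries, which is exactly what distinguishes this setting from the fully sequential one.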
Comments for authors:
- Using the word "adaptive" in definition (2) is unfortunate. Adaptive usually means that the algorithm can adapt its future decisions based on the past, so when the reader first sees the word, she thinks you are talking about choosing the interaction rounds t1, t2,... I suggest using, for instance, "minimax bounds" vs. "problem-dependent bounds". You can then use "adaptive grid" instead of "data-driven grid".
- An interesting open question is finding the correct rates for the data-driven grid model. Maybe pose this explicitly as an open question.
- The algorithm pseudocode is not very clear. Instead of line (a), write something like: "during rounds t_{m-1} to t_m - 1, pull each arm of A the same number of times".
- Fix the author names in reference [BPR13].

== After reading the authors' responses: thanks. Please add these responses to the revised version as well.

- Originality: The task is a straightforward extension of the existing work [PRCS16]; however, the algorithm is not a simple extension of that work.
- Quality, clarity: The writing is clear and easy to follow, and I found no errors in the analysis.
- Significance: I believe this is significant work, as the batched setting is common in practice, and this work gives a complete characterization of the minimax regret for batched multi-armed bandits.