NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 2523 Monte-Carlo Tree Search by Best Arm Identification

### Reviewer 1

This work applies best arm identification (BAI) techniques to the Monte-Carlo tree search problem for two-player, turn-based games. The goal is to find the best action for player A at the root by carefully considering all the subsequent actions that players B and A can take in the following rounds. Access to a stochastic oracle for evaluating the values of leaves is assumed; the goal is therefore to identify an approximately optimal action at the root (with precision \epsilon) with high confidence (at least 1-\delta). Algorithms are proposed that maintain confidence intervals and propagate them upwards from the leaves to the root: upper confidence bounds for the MAX nodes (actions by player A) and lower confidence bounds for the MIN nodes (actions by the opponent, player B). The algorithms are intuitive, well described, and well rooted in the fixed-confidence BAI literature. In particular, recent refinements in this line of work are used but also clearly separated from the new contributions (Section 3.1), which makes the paper very clear. A theorem on the number of leaf evaluations is given, and the complexity H (3) is exhibited and improves on previous work. The comparison with existing algorithms and strategies is precise. Finally, experiments show a big improvement over existing methods, and some insights are given on lower bounds and future directions, which again are grounded in recent results in the fixed-confidence BAI setting. The paper is very well written, and the contributions are made easy to understand and contextualize; thank you.

Some comments to perhaps further improve the clarity of the work. Figure 1 should be pointed to in the text (it makes what is going on easier to grasp immediately than line 111). There could be a discussion explaining why you need two candidates a_t and b_t at time t+1, which can be surprising at first. In the experiments, why do you set \delta = 0.1 |\mathcal L|?
It is strange in Figure 4 to have \delta = 2.7 (though experimentally this of course does not seem to be a problem). Is it common in practice to run these algorithms with such a large confidence parameter? I would suggest running the experiments with \delta = 0.01 * 27 instead to avoid confusion, or simply \delta = 0.1 (without |\mathcal L|). Lemma 7 is hard to parse in its current form, but the example makes this task easier. Would it be possible to add a line in Figure 3 with the optimal proportions (making sure the same \delta is used)? (Extremely minor: since 259.9 x 1.76 = 457.424, maybe remove the numerical value of kl(\delta, 1-\delta).) Figure 4 is unreadable without 300% zoom on my computer; maybe it could be removed. I suggest the following change to the figures: add two columns with the name of the algorithm and the total number of samples used, and then display proportions instead of raw numbers (as for w^* in line 281). Overall, I believe this is a clear and significant contribution with respect to the existing literature.
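The upward propagation of confidence bounds through MAX and MIN nodes described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: payoffs in [0, 1] are assumed, and all names (`Node`, `propagate`, `representative_leaf`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    # Leaves carry confidence bounds from samples; internal nodes derive theirs.
    children: List["Node"] = field(default_factory=list)
    is_max: bool = True   # True at player-A (MAX) nodes, False at MIN nodes
    lo: float = 0.0       # lower confidence bound
    hi: float = 1.0       # upper confidence bound

def propagate(node: Node) -> None:
    """Recompute [lo, hi] bottom-up: a MAX node takes the max of its
    children's bounds, a MIN node the min."""
    if not node.children:  # leaf: bounds are set by the sampling rule
        return
    for c in node.children:
        propagate(c)
    agg = max if node.is_max else min
    node.lo = agg(c.lo for c in node.children)
    node.hi = agg(c.hi for c in node.children)

def representative_leaf(node: Node) -> Node:
    """Walk down: at a MAX node follow the child with the largest upper
    bound, at a MIN node the child with the smallest lower bound."""
    while node.children:
        if node.is_max:
            node = max(node.children, key=lambda c: c.hi)
        else:
            node = min(node.children, key=lambda c: c.lo)
    return node
```

On a depth-2 tree (MAX root over MIN children), the root's interval becomes the max over the children's min-aggregated intervals, and the representative leaf is the one whose sample would tighten that interval.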

### Reviewer 2

The paper considers the game tree of a two-person zero-sum game. The payoff at each leaf of the tree is assumed to be stochastic, and the tree can be of arbitrary depth. The goal is to identify the best move at the root of the tree, i.e. the one that gives the highest payoff. A PAC setting is considered, in which the returned move may be sub-optimal within a parameter $\epsilon$. Notice that the Best Arm Identification (BAI) problem corresponds to such a tree of depth 1. Two algorithms are proposed, based respectively on UGapE and LUCB; they are essentially the same as in the BAI problem, but in the present context they need to deal with the structure (i.e. depth, branches) of the tree. The algorithms maintain confidence intervals at each leaf node and propagate them upwards to the root. They also maintain a "representative" leaf for each internal node of the tree, which corresponds to the leaf that should be sampled if the confidence interval of that internal node needs to be updated. The algorithm then proceeds as in BAI: in each round it identifies the two marginal children of the root and samples the representative leaf of one (or both, depending on the algorithm) of them. The stopping rule is exactly the same as the one typically used for BAI. Theoretical upper and lower bounds are derived; the upper bounds are similar to the results for LUCB in BAI, and the lower bounds are also similar to those in BAI.

In general, the paper is well motivated, and the results indeed improve upon earlier results both theoretically and experimentally (especially experimentally, where the new algorithms take only about 1/7 of the number of samples used by the state-of-the-art algorithm). Moreover, the paper is clearly written and easy to follow. However, while the problem considered is a generalization of the BAI problem, the algorithms and all the analysis are very similar to those in BAI.
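The LUCB-style round at the root described above can be sketched as follows. This is an illustrative assumption about the shape of the rule, not the paper's exact algorithm: `lo`/`hi` map each root action to the lower/upper confidence bound propagated up from the tree.

```python
def should_stop(lo: dict, hi: dict, epsilon: float):
    """LUCB-style check at the root: a_t is the action with the best
    lower confidence bound, b_t the action with the best upper bound
    among the rest; stop when U(b_t) - L(a_t) <= epsilon and recommend
    a_t. (Names and the exact form of the rule are hypothetical.)"""
    a = max(lo, key=lo.get)
    b = max((k for k in hi if k != a), key=hi.get)
    return (hi[b] - lo[a] <= epsilon), a, b
```

If the check fails, the representative leaves of `a` and `b` are the natural candidates to sample next before the bounds are re-propagated.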
In addition, while the paper by Jamieson et al. 2014 on the LIL is listed in the references, it is not discussed whether that technique would help to improve the results, both experimentally and theoretically. It is known that the LIL can improve the CB-like (confidence-bound) results in the BAI setting. It would be good to investigate this point. To sum up, this is in general a good paper, especially given the significant improvement in the experimental results.
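The point about the LIL can be made concrete by comparing the two styles of confidence radius. The constants below are illustrative only, not the exact ones from Jamieson et al. 2014: the first is a generic union-bound-style term, the second a law-of-the-iterated-logarithm-style term, which grows more slowly in the number of samples t.

```python
import math

def union_bound_radius(t: int, delta: float) -> float:
    """Generic deviation term of the form sqrt(log(t/delta) / (2t)),
    as commonly used in fixed-confidence BAI analyses (a sketch)."""
    return math.sqrt(math.log(t / delta) / (2 * t))

def lil_radius(t: int, delta: float, eps: float = 0.01) -> float:
    """LIL-style radius, roughly sqrt((1+eps) log(log((1+eps)t)/delta)
    / (2t)); the log log dependence on t makes it tighter for large t
    (constants are illustrative, valid only once t is large enough)."""
    return math.sqrt(
        (1 + eps) * math.log(math.log((1 + eps) * t) / delta) / (2 * t)
    )
```

For example, at t = 1000 and delta = 0.1 the LIL-style radius is already noticeably smaller than the union-bound one, which is the kind of improvement the comment asks the authors to investigate.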