__ Summary and Contributions__: The paper presents an adaptive MCTS algorithm called MDP-GapE, and proves a gap-dependent sample complexity upper bound. It is also verified empirically that MDP-GapE is more sample-efficient than baseline algorithms.

__ Strengths__: The paper proves a sample complexity upper bound that depends on the suboptimality gaps, which improves on previous results. Empirically, the sample complexity of MDP-GapE is an order of magnitude smaller than that of the baseline algorithms.

__ Weaknesses__: The empirical comparison is not totally fair (see comments below).

__ Correctness__: Proofs are morally correct, but the empirical comparison needs more discussion.

__ Clarity__: The paper is self-contained, and its correctness can be verified in a modest amount of time. It would be better to have more high-level explanation of the algorithm.

__ Relation to Prior Work__: The paper discusses previous results very clearly.

__ Reproducibility__: Yes

__ Additional Feedback__: =====Post-rebuttal=====
The authors addressed some of my concerns. As the authors would redesign some of the experiments in the revision, I'd raise my score to 6.
Comments and questions:
1. Are there any lower bound results on the sample complexity of planning?
2. In the MDP-GapE algorithm, the rule for selecting actions in the first step is much more complicated than in the remaining steps. Is there a particular reason for this, and what is the high-level idea behind the rule? If I understand correctly, this rule is what yields the gap-dependent sample complexity. What if we used the simple greedy policy for the first action, and what would go wrong in the proof? It is a little confusing because algorithms with gap-dependent sample complexity in RL still use greedy policies [1].
3. In the experiment, the actual number of samples is much lower than the sample complexity bound, and I am wondering why.
- In the experimental setting, the exploration function beta is much smaller than the one used in the theoretical analysis, so the actual number of samples needed is smaller than the sample complexity upper bound. However, the baseline algorithm (Sparse Sampling, SS) is measured by its theoretical sample complexity upper bound. Although Sparse Sampling is not an adaptive algorithm, it is still possible to reduce the number of samples it requires by tuning its parameters (or exploration functions). Would it be more reasonable to compare against a tuned version of SS?
- The actual number of samples is even much smaller than (BK)^{H-1}. In the MCTS world, this means there are unexplored nodes in the search tree. But in the experiment, (BK)^{H-1} >> |SA|, so the analysis in this case is very loose. Actually, I think similar techniques in the RL literature suggest that the \sqrt{BK}^{H-1} term in Theorem 1 could be improved to \sqrt{SA}. Is it still fair to test the empirical performance in this setting? Note that the baseline performance of SS is given by a sample complexity *upper bound*, analyzed for general MDPs (e.g., MDPs with unbounded state spaces).
4. The gap-dependent sample complexity is indeed an improvement. [1] has shown that for RL problems the upper bound can depend on the suboptimality gaps of all state-action pairs, instead of only the gaps of the first action. Is it possible to get similar results in the planning setting?
[1] Max Simchowitz and Kevin G Jamieson. Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs.
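To make the question in point 2 concrete, here is a minimal sketch contrasting a UGapE-style first-step rule with the simple greedy rule. The function names, the dictionary representation, and the exact tie between candidate and challenger are my own illustration, not the paper's actual algorithm; I only assume upper/lower confidence bounds U and L on the value of each first action.

```python
def select_first_action(U, L):
    """UGapE/BAI-style rule (illustrative): identify a candidate best action
    and a challenger, then sample the more uncertain of the two."""
    # Worst-case gap of each action a: best competing UCB minus a's LCB.
    gaps = {a: max(U[b] for b in U if b != a) - L[a] for a in U}
    b_t = min(gaps, key=gaps.get)                      # candidate best action
    c_t = max((a for a in U if a != b_t), key=U.get)   # challenger: highest UCB
    # Explore whichever of the two has the wider confidence interval.
    return b_t if U[b_t] - L[b_t] >= U[c_t] - L[c_t] else c_t

def select_greedy(U, L):
    """Simple optimistic/greedy rule, as used at deeper steps: highest UCB."""
    return max(U, key=U.get)
```

In a toy instance with U = {0: 1.0, 1: 0.95} and L = {0: 0.8, 1: 0.3}, the greedy rule commits to action 0, while the gap-based rule still samples the wide-interval challenger 1; this continued discrimination between near-optimal first actions seems to be what the gap-dependent analysis relies on.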

__ Summary and Contributions__: This paper considers a trajectory-based Monte-Carlo tree search algorithm for planning in MDPs. The algorithm follows the idea of UGapE-MCTS: it uses a best arm identification algorithm to choose the first action in the trajectory, and performs optimistic planning thereafter by constructing upper and lower confidence bounds on the Q-values. The authors establish a problem-dependent sample complexity bound for the proposed algorithm in general MDPs with stochastic rewards and transitions. The gap-dependent bounds are interesting and important for understanding hard problem instances.

__ Strengths__: Compared to prior work on Monte-Carlo planning algorithms with theoretical guarantees, the proposed algorithm MDP-GapE is shown to achieve a better gap-dependent sample complexity. Also, MDP-GapE is effective in practice, as shown in the experiments.

__ Weaknesses__: The idea of combining best-arm identification and optimistic planning is not new; it was explored in the prior work UGapE-MCTS for two-player games. The proposed MDP-GapE is adapted from UGapE-MCTS, with a different construction of upper/lower confidence bounds, and the analyses of the two also appear similar. It would be good if the authors commented on the motivation/advantage of using KL confidence sets. Would similar results hold when using standard confidence bounds, as in UGapE-MCTS?
There is no lower bound, so it is difficult to say whether the upper bound presented in Corollary 1 is tight.
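Regarding the question on KL confidence sets, a rough illustration of the contrast (assuming Bernoulli-like rewards in [0, 1]; the function names and the bisection tolerance are mine, not the paper's): the KL upper bound is the largest mean still consistent with the empirical mean under a divergence budget, whereas a Hoeffding-style bound adds a fixed-width radius.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for safety."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mean, n, beta, iters=50):
    """KL-based UCB: largest q in [mean, 1] with n * kl(mean, q) <= beta,
    found by bisection."""
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * kl_bernoulli(mean, mid) <= beta:
            lo = mid
        else:
            hi = mid
    return lo

def hoeffding_ucb(mean, n, beta):
    """Standard Hoeffding-style bound, for comparison."""
    return mean + math.sqrt(beta / (2 * n))
```

Note that the KL bound never leaves [0, 1] and tightens when the empirical mean is near the boundary (e.g., for mean 0.9 with few samples the Hoeffding bound exceeds 1 while the KL bound does not), which is roughly the advantage I would expect the authors to cite.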

__ Correctness__: The theory is sound and the proofs seem to be valid (though I did not verify all the details in appendix).

__ Clarity__: Overall, the paper is well written and relatively easy to follow.

__ Relation to Prior Work__: Yes.
The following two papers are also related: the finite-sample analysis of UCT has recently been investigated in [1], and another related work establishes the convergence rate of MCTS with entropy regularization [2].
[1] D. Shah, Q. Xie, and Z. Xu. Non-Asymptotic Analysis of Monte Carlo Tree Search, 2019
[2] C. Xiao, R. Huang, J. Mei, D. Schuurmans, and M. Müller. Maximum Entropy Monte-Carlo Planning. In Advances in Neural Information Processing Systems, 2019.

__ Reproducibility__: Yes

__ Additional Feedback__: Scaling in epsilon: as pointed out in the paper, the dependence on epsilon is worse than in previous work. It would be good if the authors commented on where this loss occurs, whether in the analysis or in the algorithm itself.
I suggest the authors add a discussion of the results in benign/hard problem settings to illustrate the improved/worst-case sample complexity.
MDP-GapE only adopts BAI for the first step. Would applying a BAI algorithm at multiple steps help improve the sample complexity?
===Post-rebuttal update=== The authors addressed most of my questions. Overall the results are interesting and solid. The paper points to some interesting future directions. I would keep my score and suggest accepting this paper.

__ Summary and Contributions__: This paper considers the problem of on-line planning in MDPs. Given a state, the problem is to compute an epsilon-optimal action with probability at least 1 - delta. The agent has access to a sampling model that can generate a next state and a reward for any given state and action--the compute time can be taken as the number of calls to this sampling model. The paper presents a sampling algorithm, its analysis (shown to enjoy a tighter bound than previous approaches), and also experimental validation of its efficiency.
Addendum: I have read the authors' response. I retain my assessment and score.
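The generative-model interface described in the summary can be sketched as follows (illustrative only; the tabular representation, class name, and deterministic mean reward are my simplifying assumptions, and the compute cost is the call counter):

```python
import random

class GenerativeModel:
    """Sampling model: given (state, action), draws a next state and a reward.
    Planning cost = number of calls made to sample()."""
    def __init__(self, transitions, rewards):
        self.transitions = transitions  # (s, a) -> list of (next_state, prob)
        self.rewards = rewards          # (s, a) -> mean reward
        self.calls = 0

    def sample(self, s, a):
        self.calls += 1
        states, probs = zip(*self.transitions[(s, a)])
        s_next = random.choices(states, probs)[0]
        r = self.rewards[(s, a)]  # deterministic mean reward, for simplicity
        return s_next, r
```

An online planner then queries this model repeatedly from the current state until it can output an epsilon-optimal first action with probability at least 1 - delta.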

__ Strengths__: The paper is very well-written: clear and easy to follow. The algorithm, theoretical analysis, and experimental comparisons are all useful contributions--together making up a good paper.

__ Weaknesses__: No significant weaknesses.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: Although you fix a finite horizon H, there appears to be no need to subscript the MDP's rewards r and transitions p by h in [1, 2, ..., H]--they are h-independent, are they not?
Please explain the intuition behind having different rules for selecting the first action (greedy w.r.t. a confidence width) and subsequent actions (greedy w.r.t. a UCB). A related question is why you need to go along the generated trajectory. You are allowed to sample any arbitrary state-action pair at each step--so why the choice of sticking to a trajectory starting at s_{1}?
In Equation (4), "<= epsilon" must be placed outside the curly brace.