Reviews: Planning in entropy-regularized Markov decision processes and games

This paper has an interesting algorithmic idea and the properties related to such an algorithm is well investigated. This idea and algorithm should be interesting to the readers. The paper is well written and easy to follow. My main concern of this paper is that it is missing experimental results. As a paper with algorithm invention as its core contribution, I believe experimental results should be crucial to demonstrate the possible future impact and would be important for many readers to understand and follow the work. Any result should be very helpful.

This theoretical paper considers the problem of computing optimal value function in entropy-regularized MDPs and two-player games. It shows that the smoothness property of the Bellman operator in the presence of entropy regularized policies (and possibly other forms of regularization), can be used to derive a sample complexity which is polynomial of order O((1/ε)^{4+c}), with c being a problem independent constant and ε the precision of the value function estimate. The proof is built upon the proposed algorithm, SmoothCruiser, an algorithm motivated in the sparse sampling algorithm of Kearns et al that recursively estimates V through samples and subsequently aggregates the results. This sampling dynamic programming is done up to a depth when the required number of samples is no longer polynomial. The paper is very well written and provides a solid result. Despite the practical limitations of the settings, e.g., a model oracle that can sample anywhere in the state-action space, this is a common assumption in theoretical work and the results are significant and deserve publication. It connects the sparse sampling algorithm with the literature on entropy-regularized MDPs. Although this regularization is applied widely in many practical situations and there are a few theoretical results in the direction taken in this work. Besides this model assumption, the little applicability of this result/algorithm is still a concern. Although the constant c is problem independent, the value of \kappa grows very quickly, since it is inversely proportional to the product of K, the number on the of actions, and the strength of the regularization. It would be very nice if the authors can come up with a minimally interesting MDP that can illustrate this in practice. The feeling is that in most of the problems, the interesting region between \lambda infinity and 0 is small, and the algorithm will roll-back to Kearns spase sampling most of the times. I also found a bit stretched the mapping of SmoothCruiser to a generic MCTS algorithm. In MCTS, selecting an action does not involve any sampling. I wonder what is the real motivation of making this parallelism if it is not exploited in the paper. Finally, I would like to hear whether these complexity results can be improved for restricted class of problems, for example, in the case of MDPs or games with deterministic dynamics. ------------- POST-REBUTTAL I read the rebuttal and I am largely satisfied with the authors' replies.

Paper ID:	6711
Title:	Planning in entropy-regularized Markov decision processes and games

Reviewer 1

Reviewer 2

Reviewer 3