NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6614
Title: Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Reviewer 1


Originality: The paper proceeds in a direction somewhat orthogonal to recent literature in the area, swapping out the full-planning strategy for a one-step greedy strategy. As such, it opens up potential new research approaches while also providing an improvement on the SOTA.

Quality: The argument is well developed, and extensive proofs are provided in the supplementary material or referenced in the existing literature. The greedy approach is applied directly to two existing SOTA full-planning algorithms, suggesting it is a generalizable alternative.

Clarity: The paper is generally well organized and clear; it gives an intuitive sense of the results, although the bulk of the proofs is confined to the supplementary material. Several scattered clarity issues are described in the detailed comments below.

Significance: The paper provides a direct theoretical contribution to the field of regret analysis. More generally, challenging the assumed supremacy of full-planning approaches may have ripple effects beyond the specific problem domain (tabular finite-horizon undiscounted MDPs).

Detailed comments:
- Abstract: I think it is important to point out that the results apply to undiscounted MDPs.
- Table 1: Why is the regret bound for UCRL2 stated in terms of the horizon H rather than the diameter D, as in the original paper?
- Line 124: It is unclear here which is the former and which the latter; I would state "since \bar{V}^{k-1}(s_1^k) >= V^*(s_1^k) >= V_1^{\pi_k}(s_1^k)" (see the sketch below).
- Line 212 and Algorithm 2: Algorithm 5 in the supplementary material requires updating the lower-bound variables after action selection, and therefore seemingly cannot be used as a drop-in replacement for ModelBaseOptimisticQ as suggested. Ideally Algorithm 2 should include an additional line to be fully general to both UCRL2 and EULER, or this should at least be mentioned in a footnote.
- Line 263: Prioritized sweeping, particularly prioritized sweeping with small backups (van Seijen & Sutton, 2013), should be referenced as a highly efficient full-planning approach.
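For concreteness, the inequality referenced in the Line 124 comment is the standard optimism chain used to bound the per-episode regret. A minimal LaTeX sketch, using the notation from the comment above rather than anything quoted from the paper:

```latex
% Optimism chain: the optimistic value \bar{V}^{k-1}, computed from the
% confidence set available at the start of episode k, upper-bounds the optimal
% value, which in turn upper-bounds the value of the executed policy \pi_k:
\[
  \bar{V}^{k-1}(s_1^k) \;\ge\; V^*(s_1^k) \;\ge\; V_1^{\pi_k}(s_1^k).
\]
% Consequently, the per-episode regret is bounded by the optimistic gap:
\[
  V^*(s_1^k) - V_1^{\pi_k}(s_1^k) \;\le\; \bar{V}^{k-1}(s_1^k) - V_1^{\pi_k}(s_1^k).
\]
```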

Reviewer 2


The paper presents an original and novel contribution that is of interest to the field. However, the paper is not very accessible and makes little effort to summarize the results in an intuitive manner or to explain why they hold. For the NeurIPS audience, the writing should put far more emphasis on building intuitive examples that explain the theoretical findings, rather than just stating them. Including a proof sketch in the main body of the paper does not help the overall presentation.

In Sec. 4, why are the model-based algorithms optimistic (as stated in line 177)? According to my understanding, Alg. 2 builds an approximate model (rhat_k, phat_k, ...). Couldn't approximation errors bias the model so that the algorithm is not strictly optimistic?

Regarding Lemma 1 and line 109: Isn't Lemma 1 a restatement of the proof presented at this link: https://m.tau.ac.il/~mansour/rl-course/scribe6/node4.html
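To make the optimism question concrete: in this family of algorithms, optimism is not expected to come from the point estimates alone but from adding a confidence bonus to the empirical model, so that estimation errors are absorbed with high probability. A minimal sketch of such an optimistic backup (notation is illustrative, not taken from the paper):

```latex
% Optimistic one-step backup using the empirical model (\hat{r}_k, \hat{p}_k)
% together with a confidence bonus b_k(s,a):
\[
  \bar{Q}_h^k(s,a) \;=\; \hat{r}_k(s,a) + b_k(s,a)
      + \sum_{s'} \hat{p}_k(s' \mid s,a)\, \bar{V}_{h+1}^k(s').
\]
% If b_k(s,a) is at least as large as the estimation error of \hat{r}_k and
% \hat{p}_k (which holds on a high-probability event), then
% \bar{Q}_h^k(s,a) \ge Q_h^*(s,a); i.e., the algorithm is optimistic with high
% probability rather than deterministically.
```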

Reviewer 3


This paper considers the interesting question of whether model-based reinforcement learning can achieve good performance by using a myopic policy instead of doing full planning at each step, and presents an affirmative answer. Specifically, it is shown that UCRL2 and EULER can be adapted to use a myopic policy that does 1-step planning, preserving the regret bounds and improving the time complexity by a factor of S. How large are the constants in the complexity notation, and how do they compare with their full-planning counterparts? Some experiments comparing the actual running time of the greedy algorithms with their full-planning counterparts would be helpful. Overall, the paper is well written and the results are very interesting.

* Update after rebuttal: The authors have clearly answered my questions about the constants in the complexity notation and promised to provide an experimental comparison between the greedy algorithms and their full-planning counterparts. I still vote for acceptance.
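To illustrate the factor-S question, here is a minimal Python sketch (not the authors' code; all sizes, names, and the bonus are illustrative assumptions) contrasting the cost of full optimistic planning with a greedy one-step update: full planning sweeps over all S states at every stage, while the greedy variant only recomputes the visited state.

```python
import numpy as np

# Hypothetical tabular finite-horizon MDP sizes; values are illustrative only.
S, A, H = 20, 4, 10
rng = np.random.default_rng(0)
r_hat = rng.random((S, A))                    # empirical rewards
p_hat = rng.dirichlet(np.ones(S), (S, A))     # empirical transitions, shape (S, A, S)
bonus = 0.1 * np.ones((S, A))                 # stand-in confidence bonus

def full_planning(r_hat, p_hat, bonus):
    """Optimistic backward induction over the whole model: O(H * S^2 * A) work."""
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = r_hat + bonus + p_hat @ V[h + 1]   # sweep over all S states
        V[h] = Q[h].max(axis=1)
    return Q, V

def greedy_step_update(Q, V, s, h, r_hat, p_hat, bonus):
    """One-step greedy variant: update only the visited state s at stage h,
    an O(S * A) operation, then act greedily w.r.t. the updated Q."""
    Q[h, s] = r_hat[s] + bonus[s] + p_hat[s] @ V[h + 1]
    V[h, s] = Q[h, s].max()
    return int(Q[h, s].argmax())

# Full planning re-solves the whole model whenever it changes.
Q_full, V_full = full_planning(r_hat, p_hat, bonus)

# Greedy variant: start from an optimistic initialization and only update the
# states actually visited, which is roughly a factor S cheaper per update.
Q = np.full((H, S, A), float(H))
V = np.concatenate([np.full((H, S), float(H)), np.zeros((1, S))])
a = greedy_step_update(Q, V, s=0, h=0, r_hat=r_hat, p_hat=p_hat, bonus=bonus)
```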