Reviews: Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning

Originality: Though it seems that the proof for bounds obtained in the paper follows along the lines of [Srikant & Ying, 2019], I don't think the extension is trivial. The key difference being the construction of a Lyapunov function that is motivated by singular perturbation theory. The second contribution on an algorithm that adaptively schedules the learning rate is novel, and not something I have seen before. Quality and Clarity: Except for minor typos, the paper was well-written, and easy to read. Specifically, the intuition for Lemma 1 by studying the associated ODE's was very useful. Significance: As stated above, the paper provides good foundations as well as directions for future research. ** After reading the rebuttal ** The authors addressed my concerns well, and I would like to still recommend acceptance of the paper.

Reviewer 2

Nice paper, and very clear presentation of the main results. Here are few suggestions for expanding your presentation: 1. The literature review needs to be broadened. In particular, you should discuss the work of Liu et al., JAIR 2019 (Proximal Gradient TD Learning), which analyzes two time-scale algorithms that include a proximal gradient step. The results in that paper show improved finite sample bounds over classic gradient TD methods, like GTD2. How do your results compare with those in that paper, and in particular, can your analysis be extended to GTD2-MP (the mirror-prox variant of GTD2, which has an improved finite sample convergence rate compared to GTD2. 2. Two time scale algorithms are somewhat more complex than the standard TD method, and Sutton et al. and others have developed a variant of TD called emphatic TD (JMLR 2016: "An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning") that is stable under off-policy training. How does your analysis relate to emphatic TD methods? 3. Your analysis is largely set in the context of linear function approximation, but of course, all the recent excitement in RL is over deep nonlinear function approximation networks. Does your learning rate adaptation scheme apply to nonlinear deep neural networks and have you done any experiments on such networks? 4. The presentation can be improved. Some of the main theoretical results (e.g., Theorem 1) would benefit from some simpler exposition. Rather than just state the exact theorem, it would help to add a sentence or two distilling the main implication into easier to parse language for those who want to get a gist of the main result.

Reviewer 3

**Things to like** I like this paper. 1. Quality. First of all, it is well written and pleasant to read. I couldn’t find any grammatical mistakes. 2. Significance. The authors use their analysis to produce something useful: a heuristic for learning rate selection. Then they provide some experimental validation that this heuristic is useful. I appreciate seeing this in a theory paper. 3. Quality. The technical aspects of the experiments are solid. **Concerns** Take these concerns with a grain of salt. I set my confidence score to 2, as I have a Sutton & Barto-level knowledge of reinforcement learning theory. 1. Clarity. I am worried about how the amount of mathematical notation affects clarity. The authors are not obfuscating anything; I believe this topic simply involves a lot of math. However, I doubt if the majority of NeurIPS attendees will be able to understand this paper. Since there are previous RL theory papers accepted to NeurIPS with this density of math [2], I don’t believe this is a fatal flaw and I did not take this into account in my overall score. 2. Significance. As a practitioner of RL, I’d like the authors to provide a more convincing argument about why I should care about this analysis. I’m not saying I don’t care (I’m quite interested), but I would love to hear the authors make a case for why and how this work affects RL “experimentalists.” [1] Dalal, Gal, et al. "Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning." arXiv preprint arXiv:1703.05376 (2017). [2] Hasselt, Hado V. "Double Q-learning." Advances in Neural Information Processing Systems. 2010.

Paper ID:	2626
Title:	Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning

Reviewer 1

Reviewer 2

Reviewer 3