Summary and Contributions: This work considers off-policy policy evaluation via a gradient temporal difference (GTD) algorithm in which the function parameterization is a neural network (NN). The algorithm is designed to minimize the mean squared Bellman error (MSBE). Through a saddle-point reformulation, a stochastic primal-dual method is developed in which both the primal and dual functions are parameterized by over-parameterized neural networks. Global convergence of the proposed algorithm is established for a two-layer ReLU architecture with m neurons, in the limit as m goes to infinity.
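To fix ideas, here is a minimal sketch of how I read the proposed scheme. This is not the paper's exact algorithm: the toy two-state chain, all names, and all hyperparameters are my own, and it uses a Fenchel-dual form of the MSBE with primal (value) and dual functions each given by a two-layer ReLU network in which only the hidden-layer weights are trained.

```python
# Hypothetical sketch (not the paper's algorithm): stochastic primal-dual
# GTD-style updates on a saddle-point form of the MSBE, with both the
# primal (value) and dual functions as two-layer ReLU networks.
import numpy as np

rng = np.random.default_rng(0)

def init_net(m, d):
    # two-layer ReLU net f(s) = b @ relu(W s) / sqrt(m); only W is trained
    return {"W": rng.normal(size=(m, d)), "b": rng.choice([-1.0, 1.0], size=m)}

def forward(net, s):
    return net["b"] @ np.maximum(net["W"] @ s, 0.0) / np.sqrt(len(net["b"]))

def grad_W(net, s):
    # gradient of the scalar output with respect to the hidden weights W
    m = len(net["b"])
    active = (net["W"] @ s > 0).astype(float)
    return np.outer(net["b"] * active, s) / np.sqrt(m)

def pd_step(primal, dual, s, s_next, r, gamma, eta):
    # Fenchel dual of the square, delta^2 = max_nu (2 nu delta - nu^2),
    # makes the objective bilinear in the (primal, dual) pair.
    delta = r + gamma * forward(primal, s_next) - forward(primal, s)
    nu = forward(dual, s)
    # primal descent on the delta-dependent part of 2 nu delta
    primal["W"] -= eta * 2.0 * nu * (gamma * grad_W(primal, s_next) - grad_W(primal, s))
    # dual ascent on 2 nu delta - nu^2
    dual["W"] += eta * 2.0 * (delta - nu) * grad_W(dual, s)
    return delta

# toy 2-state chain: deterministic alternation, reward 1, true V = 1/(1-gamma)
d, m = 2, 64
primal, dual = init_net(m, d), init_net(m, d)
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
deltas = []
for _ in range(2000):
    i = int(rng.integers(2))
    deltas.append(pd_step(primal, dual, states[i], states[1 - i],
                          r=1.0, gamma=0.5, eta=0.05))
```

On this toy chain the TD error shrinks over training without any Hessian or second-order information, which is consistent with the first-order primal-dual structure described in the paper.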
Strengths: The work considers the clearly important setting of off-policy evaluation, a canonical task in reinforcement learning that underlies imitation learning and the reuse of prior experience for future policy improvement. Since neural parameterizations are ubiquitous in DRL today, the authors analyze how a gradient temporal difference scheme converges under various technical conditions on the neural parameterization, and establish its convergence to global optimality. This is a substantial technical result that requires challenging convergence theory to be developed, which in my view is the main merit of the work.
Weaknesses: The philosophy of establishing convergence guarantees for neural networks in terms of the number of neurons alone is strange, because the number of neurons is a very coarse description of a network, and comparable guarantees can already be established for nonparametric estimators; see Cho, Youngmin, and Lawrence K. Saul, "Kernel methods for deep learning," Advances in Neural Information Processing Systems, 2009, and numerous follow-up works. Therefore, if the neural-network analysis is to refine this approach, it must also specify the *inter-layer* relationships and broader architectural choices to actually be useful to practitioners. As is, I don't see how the m of Lemma 4.1 can inform the choice of a neural architecture any more sharply than, e.g., a single-layer RBF network. Also, reformulating Bellman's equations as saddle-point problems has been studied previously:
Shapiro, A., "Analysis of stochastic dual dynamic programming method," European Journal of Operational Research, 209(1): 63-72, 2011.
Dai, Bo, Niao He, Yunpeng Pan, Byron Boots, and Le Song, "Learning from conditional distributions via dual embeddings," in Artificial Intelligence and Statistics, pp. 1458-1467, 2017.
Dai, B., A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, "SBEED: Convergent reinforcement learning with nonlinear function approximation," in International Conference on Machine Learning, pp. 1125-1134, 2018.
The authors need to do a better job of explaining the technical similarities and differences between these existing saddle-point reformulations of Bellman's equations and the one considered here. In my understanding, the only point of departure is the explicit theoretical analysis of how the algorithms for solving the saddle-point problem interact with the neural parameterization, and the establishment of global rather than local convergence.
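To make the comparison concrete: my understanding is that these reformulations (including, as far as I can tell, the present paper's) rest on the same convex-conjugate identity applied pointwise to the squared Bellman error,

\[
\mathrm{MSBE}(\theta)
= \mathbb{E}_{\mu}\!\left[\big(\mathcal{T}^{\pi} V_\theta(s) - V_\theta(s)\big)^2\right]
= \max_{\nu}\; \mathbb{E}_{\mu}\!\left[\,2\,\nu(s)\big(\mathcal{T}^{\pi} V_\theta(s) - V_\theta(s)\big) - \nu(s)^2\right],
\]

which trades the double-sampling problem for a bilinear saddle-point problem in the pair $(\theta, \nu)$. The cited works and this submission then differ mainly in the function class chosen for $\nu$ and in the analysis of the resulting primal-dual algorithm, and this is exactly the comparison I would like the authors to spell out.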
Given this aspect, I would be curious to see experimental corroboration of the phenomenon that enlarging the parameterization actually improves algorithm stability. I would expect the opposite to be true: as the number of parameters increases, there are more parameters to estimate, and algorithm convergence should therefore *slow down.* These tradeoffs are not well discussed in this work. For instance, I would be curious to see how H1 contrasts with a standard CNN or RNN architecture commonly used in DRL. Also, the importance and meaning of the algorithm being off-policy rather than on-policy is not discussed. Are there technical challenges or experimental merits specific to this setting? Why is it important? Convergence guarantees for the setting are given, but its inherent meaning is not discussed.
Correctness: The claims and convergence theory seem well-justified and connected to recent trends in global-optimality characterizations of algorithms for deep RL. Since there is no experimentation, I decline to comment on empirical methodology.
Clarity: The paper is clear enough, although extremely densely packed, so it is difficult to glean the key importance of the details. Moreover, in the appendices, there could be more connective discussion of how the various lemmas, propositions, etc. fit into the proofs, or a sketch at the outset of each proof of the logic pursued.
Relation to Prior Work: A reference to the finite-time performance of GTD is missing (Section 5): Kumar, Harshat, Alec Koppel, and Alejandro Ribeiro, "On the sample complexity of actor-critic method for reinforcement learning with function approximation," arXiv preprint arXiv:1910.08412, 2019.
Additional Feedback: I have read the rebuttal and the comments of the other reviewers. I would like to point out that the other reviewers may have interpreted as novel some aspects of the manuscript that were not, especially the global-optimality results for neural parameterizations in policy gradient and the saddle-point reformulation of Bellman's equations, as the authors acknowledged in the rebuttal. The main technical novelties, then, are that (i) they consider the off-policy setting, and (ii) they establish convergence that alleviates the rate's dependence on the number of neurons by measuring convergence in the L^2 metric. This is interesting, but is it enough to warrant acceptance? Is there any experimental importance or manifestation of alleviating this dependence? The paper has not honestly presented the merit of its contribution in this light, and it is abstrusely written, so this is very difficult to discern. I feel the authors are trying to bombard the reader into submission with mathematical details rather than honestly contextualize the importance of their convergence results in theory and practice. Regarding experimental importance, it is unacceptable that a paper about convergence results for neural parameterizations does not demonstrate the tradeoffs of architectural choices on algorithm performance, as this is the most difficult and critical aspect of making DRL work in new application domains. For these reasons, I will keep my opinion at marginally below the acceptance threshold.
Summary and Contributions: The paper proposes a gradient temporal difference algorithm that minimizes the mean squared Bellman error for policy evaluation. The authors analyze its convergence with a two-layer neural network as the function approximator and show that an approximately globally optimal solution is reached as the network width tends to infinity.
Strengths: The key strength of the work, and its difference from its predecessors, is that the proposed algorithm does not involve computing Hessians of the NN, which should make it scale better in practice.
Weaknesses: Please see my comment below.
Correctness: Yes, it seems to be the case, although I have not gone over all the proofs in the appendix.
Clarity: Yes, the paper is written clearly and the authors have made a significant effort to structure it in a way which would make it easy to read and understand.
Relation to Prior Work: Yes
Additional Feedback: I understand that this is primarily a theory paper, but whenever an algorithm is proposed for a task, it is good to see some experiments demonstrating its effectiveness against baselines. It would be great if some experiments could be conducted to showcase the strengths of the algorithm better. --- Post-rebuttal comments --- After reading the author response and the other reviews, and after discussing with the other reviewers, I think there is definitely some merit to the points R1 brings up. However, though the paper's novelty may be more limited than it appeared at first glance, the paper is still very interesting, and technically the results are correct and contribute well to the literature. The authors have also made a reasonable effort to address the reviewers' concerns. I believe the paper can still be accepted with changes to the final draft that: (a) address R1's points about the paper's contribution and novelty, and (b) include the new experiment from the rebuttal. So I am keeping my original score of 7.
Summary and Contributions: This is a theoretical paper which describes a GTD algorithm using two neural networks and off-policy samples which converges to a global optimum of the mean-squared Bellman error loss function given assumptions which are reasonable for a theoretical paper.
Strengths:
- The paper is well presented, and the complex mathematical results are augmented with effective explanatory text.
- The result is important and interesting, and the approach is creative. The derivation of the loss function as a mini-max problem was thought-provoking.
Weaknesses:
- Some readers will dislike that this is a theoretical paper in which an algorithm is presented without being tested or demonstrated. Without any application, it is not clear whether the approach works at all, or remains relevant, once the assumptions are relaxed to a realistic setting or the single-hidden-layer neural networks of the theorems are replaced with something more empirically successful.
Correctness:
- I believe so. The derivation is dense and theoretical, and I cannot claim to have walked through or understood every step of the proofs; in parts, I leaned on the exposition, which made each step seem reasonable.
Clarity: Yes, the paper is quite well written and organized.
Relation to Prior Work: The paper nicely describes its influences in terms of neural network analysis and reinforcement learning. I would appreciate more exposition and less of a listing, but I'm not sure what I would cut to make room.
Additional Feedback: AFTER AUTHOR COMMENTS AND REVIEWER DISCUSSION: Some of Reviewer 1's comments are well taken, and improved citations are needed. I urge the authors to state some of this context more clearly. This lessens my enthusiasm somewhat, but I still feel this is an excellent paper.