__ Summary and Contributions__: This paper studies learning the Q-function in reinforcement learning problems with continuous state and action spaces, where the optimal Q-function is known to be low rank (or approximately so). The main contribution is an algorithm for learning an epsilon-approximation of the optimal Q-function by leveraging a low-rank factorization, which enjoys a sample complexity of O(\epsilon^{-(\max(d_1,d_2)+2)}), where d_1 and d_2 denote the dimensions of the state and action spaces, respectively. This improves over the lower bound of \Omega(\epsilon^{-(d_1+d_2+2)}) obtained by looking at the problem through the lens of non-parametric regression. Moreover, when applied to RL problems with finite state and action spaces, the presented algorithm yields a sample complexity scaling with max(S,A), as opposed to the lower bound for the unstructured case, which scales as SA.
The presented algorithm relies on a novel low-rank Matrix Estimation method, whose design is key to obtaining the desired sample complexity.

__ Strengths__: RL with structured state and action spaces is an interesting yet challenging topic. This paper, to the best of my knowledge, is among the few works studying how to learn the Q-function in continuous MDPs where some structure is present. The considered low-rank structure in the Q-function makes sense in applications of RL and is well-motivated.
The presented algorithm achieves a genuine gain from exploiting structure in the sample complexity, namely replacing the sum of the space dimensions in the exponent with their maximum. Also, when applied to problems with finite state and action spaces, the algorithm is shown to achieve a clear gain in sample complexity over structure-oblivious counterparts.
To perform the low-rank factorization, the algorithm crucially relies on a novel Matrix Estimation method tailored to this problem. Overall, the various steps of the presented algorithm (Algorithm 1) make intuitive sense to me. Moreover, the algorithm seems easy and straightforward to implement.
Finally, the presented algorithm is examined through extensive numerical experiments, and is shown to be empirically sound.

__ Weaknesses__: Below I list my concerns regarding the setup and reported results.
- The generative setup: Perhaps the most salient weakness of the paper is the restriction to the generative setup. In the finite case, devising an algorithm for the online setup poses more serious challenges than for the generative setup, and restricting the results to the generative setup hides the price to pay for the need to navigate the MDP. Could you at least elaborate on the potential difficulties and challenges involved in extending the results to the online case? Could one hope for a similar gain in sample complexity (over structure-oblivious algorithms)?
- Existence of anchor points: The proposed Matrix Estimation method seems to crucially rely on the existence (and availability) of a set of appropriate anchor states and actions. Although it is empirically motivated that such anchors exist in several RL applications, this sounds to me like a critical assumption made implicitly by the authors. It is not clear to me whether the existing assumptions directly imply the existence of such anchor pairs. Please clarify.
- In the warm-up example in Sec. 5.1, c_me is proportional to R_max/R_min. There it is assumed that R(s,a) \ge R_min > 0, which is a very restrictive assumption. Even if the result for r=1 remained valid with R_min redefined as the minimum non-zero reward, the corresponding c_me could be arbitrarily large, thus resulting in a prohibitive sample complexity. Therefore, unless I am missing something, the constant c_me could be very large in some problems. This then implies both a prohibitive sample complexity (in view of (4)) and a very limited range of \gamma for which the result is valid (in view of \gamma < 1/(2c_me)). Could you please clarify?
- In the finite unstructured case, existing results indicate that the sample complexity has to scale as \Omega((1-\gamma)^{-3}). The paper does not clarify why this is no longer the case in the studied structured setting (for both finite and infinite spaces). Could you please explain or motivate this?
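To make the R_min concern concrete: in the rank-1 case Q(s,a) = f(s)g(a), any entry can be recovered from one anchor row and one anchor column via Q(s,a) = Q(s,a_0) Q(s_0,a) / Q(s_0,a_0), and the division by the anchor entry is exactly where a small R_min hurts. A toy numpy sketch of this amplification (my own illustration, not the authors' algorithm):

```python
import numpy as np

# Rank-1 Q-matrix over a discretized grid: Q[i, j] = f(s_i) * g(a_j).
f = np.array([1.0, 0.5, 2.0])        # f(s) for three states
g = np.array([0.2, 1.0, 0.4, 3.0])   # g(a) for four actions
Q = np.outer(f, g)

# Recover an arbitrary entry from the anchor row s_0 and anchor column a_0:
# Q[i, j] = Q[i, 0] * Q[0, j] / Q[0, 0].
i, j = 2, 3
recovered = Q[i, 0] * Q[0, j] / Q[0, 0]

# With noisy observations, the error is amplified by roughly 1/Q[0, 0]:
# the smaller the anchor entry (i.e. R_min), the larger the blow-up.
eps = 1e-3
noisy = (Q[i, 0] + eps) * (Q[0, j] + eps) / (Q[0, 0] + eps)
```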

__ Correctness__: The considered low-rank structure (and its approximate notion) for the Q-function sounds technically valid to me. I believe the studied structured RL problem is interesting and significant.
The design of the algorithm makes perfect sense to me. I was unable to check all the proofs given the limited review time. The results, however, appear correct to me.
I also believe that the empirical evaluation of the presented algorithm is conducted correctly.

__ Clarity__: The paper is overall very well-written and well-organized. In particular, the various steps of the algorithm are described well, making the paper easy to follow.

__ Relation to Prior Work__: The authors cited the most relevant papers (on the RL side) that I am aware of. In particular, they clarify the relation between their contributions and existing results.
Regarding finite MDPs, I would like to add that [3] presents an algorithm whose sample complexity matches the lower bound. If space permits, it would be nice and informative to briefly survey structured RL algorithms in the regret minimization setup for the case where some low-rank structure exists in the MDP.

__ Reproducibility__: Yes

__ Additional Feedback__: The paper is very well-polished. However, I found a few typos reported below:
l. 78: space ==> spaces
In several places: literature ==> the literature
Both \ell_\infty and L^\infty are used to denote the same notion. For consistency, only one of them should be used.
==== AFTER REBUTTAL ====
Thanks for your effort. I have read the other reviews and the authors' response, and I will keep my current score. In agreement with the other reviewers, I believe that the paper presents novel algorithmic ideas as well as novel technical tools that could be useful for other RL setups too. It is, however, a pity that the current analysis excludes the interesting regime of \gamma ~ 1. I would like to ask the authors to clearly highlight this fact in the main text, and to provide an additional remark explaining the reason behind this limitation.

__ Summary and Contributions__: This work proposes a new, sample-efficient RL algorithm to estimate the optimal Q-function for continuous state and action spaces. To this end, it introduces a representation of Q* based on a generalized singular value decomposition that decomposes each Lipschitz continuous Q-function into an infinite sum of basis functions. Based on this decomposition, the authors propose an algorithm to estimate Q* that proceeds in 4 steps: (i) discretize the S-A space, (ii) build an estimate of Q* at a subset of the points in the discretization, (iii) obtain an accurate estimate of Q* over the whole discretization through matrix estimation, (iv) generalize to the continuous S-A space via interpolation. The authors establish probabilistic convergence results when a matrix estimation algorithm with bounded approximation error in the l_infinity norm, in the presence of observations with arbitrary bounded noise, is used. Subsequently, they present such an ME algorithm for Q* with exact or approximate low rank. Finally, they show their method is effective in continuous control tasks even when using discount factors much higher than those dictated by the theory.
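To check my understanding of the four steps, here is a minimal numpy sketch on a toy problem. The rank-2 target, the choice of anchor rows/columns, the noiseless estimates, and the skeleton-style completion step are my own simplifying illustration, not the authors' exact ME subroutine:

```python
import numpy as np

# (i) Discretize the state and action spaces [0, 1] x [0, 1].
n = 20
states = np.linspace(0.0, 1.0, n)
actions = np.linspace(0.0, 1.0, n)

# Toy rank-2 "optimal Q" on the grid: Q*(s, a) = s*a + sin(s)*cos(a).
Q_star = np.outer(states, actions) + np.outer(np.sin(states), np.cos(actions))

# (ii) Estimate Q* only at a few anchor rows/columns (noiseless here for
# clarity; in the paper these are Monte-Carlo estimates from a simulator).
anchor_s = [0, n // 2, n - 1]
anchor_a = [0, n // 2, n - 1]
Q_hat = np.full((n, n), np.nan)
Q_hat[anchor_s, :] = Q_star[anchor_s, :]
Q_hat[:, anchor_a] = Q_star[:, anchor_a]

# (iii) Matrix estimation: skeleton (CUR-style) completion from the anchor
# rows/columns, exact when the intersection submatrix has full rank r.
C = Q_hat[:, anchor_a]                  # anchor columns
R = Q_hat[anchor_s, :]                  # anchor rows
U = Q_hat[np.ix_(anchor_s, anchor_a)]   # intersection submatrix
Q_full = C @ np.linalg.pinv(U) @ R

# (iv) Generalize to the continuous space by nearest neighbor.
def q_continuous(s, a):
    i = np.abs(states - s).argmin()
    j = np.abs(actions - a).argmin()
    return Q_full[i, j]
```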

__ Strengths__: - Sound analysis of an important problem in RL: sample efficiency of RL in continuous spaces.
- Introduction of a novel ME algorithm that can be applied outside of RL scenarios.

__ Weaknesses__: - The method requires a simulator to be available to collect samples at desired target state-action pairs. While I believe this does not hinder the contribution, as the sample complexity result is interesting in its own right, it would be beneficial to mention this earlier in the work (e.g. in the introduction) to clarify the conditions under which this method is applicable.
- This is not really a weakness, rather a clarification. In step 4, you say that you use the nearest neighbor to generalize from the discretization to the continuous space. How does that influence the guarantees for the continuous space given in thm2? Could they be improved by using more sophisticated interpolation?
- The experimental section could be a bit clearer. For example, it would be interesting to see some quantitative evaluation supporting your claim that your method is “computationally much more efficient”. Moreover, it would be important to spend a few more words on the baselines.
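On the interpolation question in my second point above: for merely Lipschitz targets, nearest neighbor already achieves the right rate up to constants, but smoother interpolation does shrink the constant when the target is smoother. A quick 1-D comparison (my own toy example, not from the paper):

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 11)    # coarse discretization
vals = np.sin(3.0 * grid)           # "Q*" known on the grid
query = np.linspace(0.0, 1.0, 1001)
truth = np.sin(3.0 * query)

# Nearest-neighbor generalization (as in step 4 of the algorithm).
nn = vals[np.abs(query[:, None] - grid[None, :]).argmin(axis=1)]

# Piecewise-linear interpolation as a drop-in alternative.
lin = np.interp(query, grid, vals)

nn_err = np.max(np.abs(nn - truth))
lin_err = np.max(np.abs(lin - truth))
# Linear interpolation has a noticeably smaller worst-case error here,
# but both errors vanish as the grid is refined.
```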
AFTER REBUTTAL
After reading the other reviews and the author response, I am convinced the contribution of this paper is relevant. Therefore, I confirm my score.

__ Correctness__: The claims appear to be correct.

__ Clarity__: In general, the paper is well written and easy to follow.

__ Relation to Prior Work__: The relation to prior work is clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: In this paper, the authors focus on reinforcement learning (RL) problems in which the environment is modeled as a Markov decision process (MDP) with continuous state and continuous action spaces. Assuming the availability of a generative model, they develop a novel method to learn the optimal $Q$-function when the function is smooth and has low rank. Under the compact domain, bounded reward, and smoothness assumptions, they first show that, for a given confidence level $\delta$, the optimal $Q$-function can be `approximately' expressed as a finite sum of $r(\delta)$ functions. Using the derived $Q$-function representation, they develop an RL algorithm that utilizes a novel matrix estimation algorithm as a subroutine. Finally, they prove the sample complexity of the developed RL algorithm and demonstrate its performance numerically on various stochastic control problems.
The key contributions of the paper are the following. First, the authors provide a spectral representation of the optimal $Q$-function, which allows them to approach the RL problem from a matrix estimation perspective. Second, they develop a novel matrix estimation method, which is used as a subroutine in the proposed RL algorithm, and rigorously analyze its performance under various rank conditions. Third, they analyze the sample complexity of the proposed RL algorithm and show that, when the optimal $Q$-function has low rank, the proposed method yields an exponential improvement in sample complexity over existing methods.

__ Strengths__: The paper introduces the idea of using a spectral representation of the optimal $Q$-function to improve the sample complexity of RL algorithms. I believe this new idea is highly promising and might inspire researchers to develop efficient RL algorithms that perform well on special classes of learning problems. Other strengths of the paper include the clarity of presentation, the rigorousness of the analysis, and the detailed explanations of the weaknesses of the proposed method.

__ Weaknesses__: As the authors mentioned in the paper, the sample complexity results apply only to scenarios in which the discount factor is small. In most practical scenarios, the required condition $\gamma < 1/c_{\text{me}}$ on the discount factor is likely to be violated. Another weakness of the paper concerns the scenarios considered in the experiments section. For comparison purposes, the authors provide empirical results on stochastic control problems such as inverted pendulum, double integrator, and cart-pole. However, there are no numerical examples that illustrate the performance of the proposed methods on more realistic and practical planning scenarios.

__ Correctness__: The authors provide proofs for the presented results, but they are highly dense and technical. Even though the techniques used in the proofs make sense, I cannot vouch for their correctness since I only skimmed through them.

__ Clarity__: The paper is very well-written.

__ Relation to Prior Work__: The authors provide a fairly detailed literature review and emphasize the significance of the presented results by comparing them with the existing work. One of the main contributions of the paper is to exponentially improve the sample complexity bound $\Omega(\frac{1}{\epsilon^{d_1+d_2+2}})$, which, according to the authors' claim, is a classical minimax-theory result. Unfortunately, I am not familiar with this complexity bound. The authors provide two references, one of which is a book on nonparametric estimation, for this result. Even though I skimmed through those references, I was not able to find the mentioned complexity bound. Since the exponential complexity improvement is one of the main contributions of the paper, I suggest the authors either provide a reference to the theorem that states this result or explicitly show that this bound can be derived from existing results.

__ Reproducibility__: Yes

__ Additional Feedback__: The authors mention that the sample complexity results apply to RL problems in which the discount factor is small. If the authors briefly explained in plain English how the discount factor appears in their proofs and why small discount factors are needed to prove their claim, that would help other researchers improve the results of this paper.
In line 43, $Q$ $\rightarrow$ $Q^{\star}$.
Post author response: The response was satisfactory, and my positive evaluation of the paper stands.

__ Summary and Contributions__: This paper develops a spectral representation of the Q-function and designs a data-efficient learning algorithm for when the Q-function has a low-dimensional structure. A novel component is a new low-rank matrix estimation method, which is shown to achieve a theoretical error bound in the max-norm. Equipped with this matrix estimation method, the proposed algorithm achieves improved sample complexity for learning a near-optimal Q-function.

__ Strengths__: 1. This paper can be treated as a theoretical justification of the empirical success in Yang et al. (2020), which investigates low-rank Q-functions with matrix estimation. This result further pushes the boundary of our understanding of the low-rank structure of the Q-function.
Yuzhe Yang, Guo Zhang, Zhi Xu, and Dina Katabi. Harnessing structures for value-based planning and reinforcement learning. In International Conference on Learning Representations (ICLR), 2020.
2. A by-product of the theoretical analysis is a new matrix estimation method with a max-norm error bound. This is new in the matrix estimation literature when the data are adaptively collected.

__ Weaknesses__: 1. The proposed matrix estimation method requires a careful selection of anchor states and actions, which must contain sufficient information to recover the low-rank Q-function. In the matrix estimation algorithm, the authors assume that these anchor states and actions are given, and in the experiments they pick a few states and actions that are far from each other in their respective metric spaces. It would be more convincing if the authors could rigorously state an anchor selection algorithm and prove that it indeed yields valid anchor states and actions. Otherwise, the challenge of low-rank matrix estimation merely transfers to the problem of finding anchor states and actions.
2. In the main theory, the discount factor $\gamma$ needs to be very small. For example, in Theorem 2, $\gamma < 1/(2 c_{me})$, and in Proposition 3 (rank-1 case) $c_{me} = 7 R_{max} / R_{min} > 7$. This implies that $\gamma < 1/14$, which seems unreasonably small. I notice that the authors used $\gamma = 0.9$ in all the experiments. More discussion of this inconsistency is needed.
3. As shown in all five experiments, nuclear norm matrix estimation works very well in practice; it is comparable to or better than the proposed method. Given that the nuclear norm is a common routine for low-rank matrix estimation, it would be helpful to provide more justification for why the proposed new matrix estimation approach is needed. For example, some additional experiments might be added.
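For context on the nuclear norm baseline: it fills missing entries by iteratively soft-thresholding singular values. A minimal soft-impute-style sketch (my own simplified version, not the implementation used in the paper):

```python
import numpy as np

def soft_impute(M, mask, lam=0.05, iters=500):
    """Nuclear-norm-style completion: alternately fill the missing entries
    (mask == False) with the current estimate and soft-threshold the
    singular values by lam."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(np.where(mask, M, X), full_matrices=False)
        X = (U * np.maximum(s - lam, 0.0)) @ Vt
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))  # exact rank 2
mask = rng.random(A.shape) < 0.6                                 # ~60% observed
A_hat = soft_impute(A, mask)
# On this easy instance the completion is close to A on the missing entries
# as well, up to the small shrinkage bias introduced by lam.
```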
#########
I have read the rebuttal. My concerns have been addressed.

__ Correctness__: yes

__ Clarity__: yes

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__: