__ Summary and Contributions__: This work describes a Q-learning algorithm with function approximation and
provides an upper bound on the number of samples required to converge to the
optimal policy in deterministic episodic MDPs.
Specifically, assuming Q\* can be approximated up to some error \delta by an
approximation family F, the proposed algorithm converges to the optimal policy
in a number of trajectories linear in the eluder dimension of F, O(dim_E). The
proof depends on an assumption on the size of the optimality gap \rho, which
(informally) is the difference in return between the optimal action and the
second best. The upper bound holds for approximation errors \delta below a
threshold that depends on \rho.
The proof complements a lower bound found in previous work, yielding a tight bound.

__ Strengths__: Despite significant advances in RL thanks to function approximation, little is
known or proven about the theoretical properties of such methods. A better
theoretical understanding and stronger guarantees are crucial to both the
further development and the application of RL. The authors provide an upper
bound on the sample efficiency of arbitrary approximation functions, which
should therefore be of interest to the majority of the RL community.
The paper is well written and provides both a specific example for linear
function approximation and the more general case of arbitrary functions. Even a
reader unfamiliar with the topic (e.g. myself) was able to follow most of the
key arguments in the derivation and proof.
Lastly, this work is closely related to the current literature, which, to the
best of my knowledge, is well documented and described.

__ Weaknesses__: The proof, as the authors themselves note, depends on the optimality-gap
assumption. The relationship between the approximation error and this
optimality gap is crucial: a larger approximation error requires a larger gap
to ensure the favorable properties. It is not entirely clear whether these
bounds are meaningful in practice.
Secondly, the algorithm for the general case requires an oracle that returns
the most uncertain action in a given state for the approximation family F.
While it is argued that a similar oracle is used in previous work, it is not
clear whether this is more realistic than the "Know-What-It-Knows" oracle of
prior work (Li et al. 2011) that the authors dismiss in the related-work
section.
The proof applies only to deterministic systems, restricting its applicability
significantly. Note, however, that similar proofs for more general stochastic
MDPs seem out of reach at the moment.

__ Correctness__: Unfortunately, I do not have the background to guarantee the correctness.
However, the proofs and argumentation look rigorous and thorough, and I found
no red flags.

__ Clarity__: The paper is well written and, especially given the math-heavy content, is easy
to follow.

__ Relation to Prior Work__: This work is nicely situated within the current literature, including insights
into how it answers open questions posed in Wen and Van Roy 2013 and
complements a proof by Du et al. 2002.

__ Reproducibility__: Yes

__ Additional Feedback__:
* The range of the return, as described in the background section, is assumed
  to lie in [0,1]. This puts a strong bound on the values \rho can
  realistically take. Does this have any implication for the proposed method?
* The paper gives little detail on how to design the required oracle described
  in equation (1). This oracle seems to depend on the chosen F; is there a
  realistic scenario in which it is implementable?
---- Post response ----
Thank you for addressing my questions in the response; they clarified some of my misunderstandings.

__ Summary and Contributions__: This paper makes significant contributions to theoretical reinforcement learning by giving a nearly complete characterization of the setting where the environment is deterministic and the optimal $Q$-function can be approximated by a function in a function class with bounded Eluder dimension. The optimality is measured in terms of the approximation error and the sample complexity. This paper gives a tight phase transition on the approximation error: if the approximation error is smaller than a threshold depending on the optimality gap and the Eluder dimension, then the agent can learn the optimal policy efficiently; otherwise, the agent requires an exponential number of samples. The algorithm proposed in this paper also achieves a near-optimal sample complexity bound. These results resolve an open problem raised in [Wen and Van Roy, 2013].

__ Strengths__: 1. Theoretical understanding of RL with function approximation is a very important topic. The setting studied in this paper is natural as many RL environments are deterministic, and bounded Eluder dimension is a standard assumption in the bandits/RL literature.
2. The theoretical results in this paper are technically strong. A complete theoretical characterization is rarely seen in the literature. This paper gives nearly complete characterizations in terms of the approximation error and the sample complexity. The results in this paper are conceptually interesting as there is a phase transition.
3. The recursion-based algorithm in this paper is intriguing and may inspire new algorithms.

__ Weaknesses__: 1. There is a log-factor gap in the bounds.
2. No experiments are given in this paper, though I think that is fine for a theory paper.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper studies exploration in reinforcement learning with minimal assumptions on the Q\* values, allowing misspecification on the order of the gap. It focuses on best-policy identification as opposed to finding an approximate policy. It establishes a connection with the Eluder dimension.
I find the paper interesting, but there are several weaknesses that may limit the interest of the result.

__ Strengths__: - connection with Eluder dimension
- based on reasonable oracles (maximum uncertainty)
- tight analysis for the setting they consider

__ Weaknesses__: - Deterministic systems seem highly restrictive
- The algorithm can only do best-policy identification (it cannot find a near-optimal policy)
- The model has to be very nearly correct, i.e., the misspecification must be smaller than roughly the gap
- It cannot operate in the fairly common scenario where more than one optimal policy exists.

__ Correctness__: I believe the result is correct.

__ Clarity__: The paper is very well written indeed

__ Relation to Prior Work__: The paper certainly cites most of the related literature. However, the authors should also state the weaknesses of their approach when comparing to this literature.

__ Reproducibility__: Yes

__ Additional Feedback__: