__ Summary and Contributions__: This paper analyzes the sample complexity of two time-scale actor-critic (AC) algorithms in the finite-time setting. The analysis focuses on bounding the number of iterations an AC algorithm needs to reach a first-order stationary point (i.e., a point where the actor's gradient is smaller than ɛ). The proof technique characterizes bounds on all estimation errors during the training process (shown in the Appendix) and connects those bounds to the actor's gradient. The primary contribution is the sample complexity of actor-critic algorithms, which could be used to estimate the training time of a reinforcement learning algorithm in several applications.

__ Strengths__: Pros:
1. The finite-time analysis is closer to practical scenarios than asymptotic (infinite-time) analysis.
2. This paper does not rely on an i.i.d. assumption on the data distribution.

__ Weaknesses__: Cons:
1. This paper is hard for me to follow. I would kindly suggest that the authors provide more background knowledge; otherwise, it is difficult for readers from other fields to understand the jargon in this paper. For example, what is Markovian noise? I searched for it on the Internet and got a few different answers, so I guess it probably should not be presumed to be common knowledge. Another example is the sentence in Line 205, “It is known that the TD limiting point satisfies:”, where I think there should be a reference to support this statement.
2. The derivation of the theorems would benefit from an accessible outline or proof sketch in the main text. Currently, the authors direct the readers to the Appendix without explaining the techniques used in the proofs. For example, in Theorem 4.5, there is no explanation, even a brief one, of the first three terms. I do read the Appendix when checking mathematical proofs, but there should be a rough explanation in the text; otherwise, it hurts the reading experience, as I have to search for the explanation in the Appendix before proceeding. Moreover, the references to the Appendix should be clearer: the current version just refers to “the Appendix” without pointing out the specific section number.
3. The discussion of related work should be elaborated. Without a detailed comparison with related works, it is difficult to properly position this paper in the literature. I do understand that the authors have mentioned many relevant works, but as a reader not familiar with the history of the analysis of actor-critic algorithms, the limitations or drawbacks of previous works are unclear to me.

__ Correctness__: The following are my questions:
1. Why does this proof not require the “decouple actor and critic” assumption? It is not obvious to me, but the authors claim it as a strength of this paper, so I think it is worth more explanation.
2. Why is Equation 4.1 defined in this way?
3. Remark 4.6 should be elaborated in the following aspects: (i) How does the method relax the i.i.d. data assumption? (ii) Which term corresponds to the Markovian noise? Or, what is the definition of Markovian noise?
4. In Remark 4.8, how is “iterative refinement” used?
5. It would be nice to have a reference to how Corollary 4.9 is derived in the Appendix.
6. In Lemma B.3 (line 555), where is k on the right-hand side?
7. How is Lemma C.2 derived? To me, it’s not obvious.

__ Clarity__: This paper is very hard to follow. I would kindly suggest that the authors reorganize the paper and improve the clarity of the notation. For example, the notation used in Appendix C is confusing to me.
Minors:
1. r(s, a)/r(θ) and ɛ_app/ɛ_app(θ) are overloaded.
2. In Appendix C, the forms of h(O, η, ω, θ) and h′(O, θ) are similar. I do understand their roles in the proof, but it would be nice to simplify the notation used in the proof; that would definitely improve readability.
3. Line 225: “such that” is repeated.

__ Relation to Prior Work__: Overall, I think the analysis presented in this paper might have a great impact on the theory of actor-critic algorithms. However, the paper is difficult to follow and thus the contribution is obscure to me. Thus, at this review phase, I recommend a “Borderline, reject” decision for this paper.

__ Reproducibility__: No

__ Additional Feedback__: Overall, I think the analysis presented in this paper could have a great impact on the theory of actor-critic algorithms. However, the paper is difficult to follow and thus the contribution is obscure to me.

__ Summary and Contributions__: This paper considers a two time-scale actor-critic algorithm, in which the actor and critic update simultaneously but with different learning rates. The authors provide a non-asymptotic convergence rate under Markovian sampling, in contrast to the i.i.d. samples considered in prior work.
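To make the two time-scale structure concrete, here is a minimal sketch of the simultaneous-update pattern described above, with the critic on the faster time scale. This is an illustration only, not the paper's algorithm: the features, TD errors, and step-size exponents are stand-ins I chose for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4
w = np.zeros(d)      # critic parameters (fast time scale)
theta = np.zeros(d)  # actor parameters (slow time scale)

for t in range(1, 1001):
    alpha_t = 0.1 * t ** -0.6  # critic step size: decays more slowly (faster time scale)
    beta_t = 0.1 * t ** -0.9   # actor step size: decays faster (slower time scale)
    phi = rng.normal(size=d)         # stand-in feature vector for the current state
    delta = rng.normal() - phi @ w   # stand-in TD error
    w = w + alpha_t * delta * phi           # critic: TD(0)-style update
    theta = theta + beta_t * delta * phi    # actor: policy-gradient-style update
```

The key point the paper analyzes is that both updates happen on every sample, rather than running the critic to convergence between actor steps.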

__ Strengths__: The actor-critic algorithm analyzed in this paper is the classic version, which is easy to implement and sample-efficient compared to other variants with theoretical performance guarantees. In particular, a finite sample analysis is provided for the Markovian sampling setting. As far as I know, this constitutes a clear improvement over the state of the art.

__ Weaknesses__: The contribution section in the Introduction emphasizes the technical novelty without explanation. Remark 4.8 mainly provides a comparison of the results rather than the ‘new technique’ itself; it is still unclear to me what the technique is. Also, the analysis appears to be similar to [Y]. It would be good if the authors commented on the difference.
[Y] Kaledin, M. , Moulines, E. , Naumov, A. , Tadic, V. and Wai, H.-T. (2020). Finite time analysis of linear two-timescale stochastic approximation with Markovian noise.

__ Correctness__: The result appears interesting and seems to be correct (although I did not go through all the proofs in the appendix).

__ Clarity__: Overall the paper is easy to follow.

__ Relation to Prior Work__: The authors should improve the way related work is reported. For instance, on stochastic bias characterization, [23] uses the technique of Lyapunov analysis, which is quite different from [3]. There are two other papers that should be cited: [X] also studied the convergence of two-timescale AC algorithm; [Y] provides non-asymptotic analysis of two time-scale linear SA under Markovian noise.
[X] Zhang, S. , Liu, B. , Yao, H. and Whiteson, S. (2019). Provably convergent two-timescale off-policy actor-critic with function approximation.

__ Reproducibility__: Yes

__ Additional Feedback__: It would be good if the authors could comment on where the first terms in (4.3) and (4.4) come from.
Authors are suggested to add a proof outline for the main results.
====Post-rebuttal update==== Overall the results are interesting and valuable, so I would suggest accepting the paper, though the authors did not fully address my concern about the technical novelty.
p3, 130: ‘function r(s,a)’ should be ‘function r’
p6, 216: typo “TD(0) limiting point w satisfies Algorithm 1 contains a …”.
p7, 248-249: the description is inconsistent with the statement in Theorem 4.5.

__ Summary and Contributions__: The paper provides the first finite sample complexity analysis for two time-scale on-policy actor-critic with non-i.i.d. samples.
The algorithm analyzed in the paper is more closely related to the algorithms practitioners have been using than those analyzed in previous works.

__ Strengths__: The paper is very well written and easy to follow. I checked all the proofs in the appendix and I believe they are correct.
The paper makes a solid theoretical contribution by providing the first sample complexity analysis for two time-scale on-policy actor-critic with non-i.i.d. samples, which, as far as I know, is novel.
The results are significant and relevant to the RL community. I can imagine many possible extensions based on this work.

__ Weaknesses__: 1. I think it is a bit overstated (Line 10 and Line 673) to use the term \epsilon-approximate stationary point of J: there is still a function approximation error, as in Theorem 4.5. I think the existence of this function approximation error should be explicitly acknowledged whenever the conclusion about sample complexity is stated. Otherwise, readers may get the impression that compatible features (Konda, 2002; Sutton et al., 2000) are used to deal with these errors, which is not the case.
2. As shown by Konda (2002) and Sutton et al. (2000), compatible features are useful tools for addressing the function approximation error of the critic. I'm wondering if it is possible to introduce compatible features and a TD(1) critic into the finite sample complexity analysis in this paper to eliminate the \epsilon_app term.
3. I feel the analysis in the paper depends heavily on the property of the stationary distribution (e.g., Line 757). I'm wondering if it's possible to conduct a similar analysis for the discounted setting (instead of the average reward setting).
Although a discounted problem can be solved by methods for the average reward problem (e.g., discarding each transition w.p. 1 - \gamma, see Konda 2002), solving the discounted problem directly is more common in the RL community. It would be beneficial to have a discussion w.r.t. the discounted objective.
4. Although using the advantage instead of the q value is more common in practice, I'm wondering if there is any other technical consideration behind conducting the analysis with the advantage instead of the q value.
5. The assumption about maximum eigenvalues in Line 215 seems artificial. I understand that this assumption, as well as the projection in Line 8 of Algorithm 1, is mainly used to ensure the boundedness of the critic. However, as shown in Line 219, R_w indeed depends on \lambda, which we do not know in practice, so we cannot implement the exact Algorithm 1 in practice. Instead of using this assumption and the projection, is it possible to use regularization (e.g., ridge) on the critic to ensure it is bounded, as done in the asymptotic analysis of Zhang et al. (2020)? Also, Line 216 is a bit misleading: only the first half (negative definiteness) is used to ensure solvability, and as far as I know, in the policy evaluation setting, we do not need the second half (the maximum eigenvalue bound).
6. Some typos:
Line 463 should include \epsilon_app and replace the first + with \leq
\epsilon_app (the last term of Line 587) is missing in Line 585 and 586
There shouldn't be (1 - \gamma) in Line 589
In Line 618, there should be no need to introduce the summation from k=0 to t - \tau_t, as the summation from k=\tau_t to t is still used in Line 624.
In Line 625, it should be \tau_t instead of \tau
In Line 640, I personally think it is not proper to cite [25] (the S & B book) for this -- that book covers too much; referring to the definition of w^* would be easier to follow.
In Line 658, it should be ||z_k||^2
In Line 672, \epsilon_app is missing
In Line 692, it should be E[....] = 0
In Line 708, there shouldn't be \theta_1, \theta_2, \eta_1, \eta_2
In Line 774, I think expectation is missing in the LHS
Konda, V. R. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
Zhang, S., Liu, B., Yao, H., & Whiteson. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation. ICML 2020.
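To make point 5 above concrete, here is a hypothetical sketch contrasting the projection step of Algorithm 1 with a ridge-regularized TD(0) critic step. The function names and the exact form of the regularizer are my own assumptions for illustration, not taken from the paper.

```python
import numpy as np

def project_ball(w, radius):
    """Projection onto {w : ||w|| <= radius}, as in Line 8 of Algorithm 1.
    Requires knowing the radius R_w in advance, which depends on the unknown lambda."""
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

def td0_ridge_step(w, phi, phi_next, reward, alpha, reg):
    """Hypothetical alternative: a ridge term -reg * w shrinks the iterate toward
    the origin at every step, keeping it bounded without an explicit radius."""
    delta = reward + phi_next @ w - phi @ w  # TD error (average-reward offset omitted)
    return w + alpha * (delta * phi - reg * w)
```

The question raised in point 5 is whether the analysis could be carried out with the second update in place of the projection, so that the implemented algorithm no longer depends on the unknown R_w.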

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: I read the author response and would like to keep my score.

__ Summary and Contributions__: The paper under review performs a mathematical analysis of the loss of the A2C algorithm in the case where the actor and critic are updated simultaneously. The analysis is based on a number of assumptions, including the assumption that the critic uses linear function approximation.
The main results are Theorem 4.5, Theorem 4.7, and Corollary 4.9. Theorem 4.5 decomposes the gradient of the loss function as a sum of 4 components. Theorem 4.7 characterizes the convergence of the critic. Corollary 4.9 puts the previous two theorems together and bounds the gradient of the loss by epsilon under the assumption that A2C performed T steps, where T is Õ(epsilon^{-2.5}). In Remark 4.12, the authors argue that the estimate in Corollary 4.9 is the best one can achieve under the assumptions considered in the paper.

__ Strengths__: A well-written mathematical analysis of A2C.

__ Weaknesses__: It would be good to include a brief description of how the authors arrived at the 4 summands in the formulation of Theorem 4.5 -- I am under the impression that currently one has to dig into the proof in the appendix to get any intuition.
The paper is interesting, though I am not sure to what extent it is a good fit for a conference presentation. The proofs extend over more than 20 pages, and they are marked as “sketches”.

__ Correctness__: The appendix contains sketches of proofs. The paper does not contain any experiments.

__ Clarity__: The paper is well-written, though I am afraid that the current presentation will satisfy only the needs of readers interested in precise mathematical reasoning. Those interested in applications of RL will likely have a hard time understanding the merits of the paper.

__ Relation to Prior Work__: The prior work is carefully discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: A minor improvement to the presentation would be to include a more abstract version of A2C (without the specifics of the approximators used by the authors) side by side with the version considered in the mathematical analysis.
In practice (e.g., in https://github.com/openai/baselines/blob/master/baselines/a2c/a2c.py#L68), one also considers a joint actor-critic loss. I wonder whether the analysis holds in this case.
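For reference, the joint objective in implementations of that style combines the actor and critic terms into a single scalar loss, roughly of the following form. This is a simplified numpy sketch: the function name and the arrays standing in for batch quantities are my own, though the coefficient names follow the baselines convention.

```python
import numpy as np

def joint_a2c_loss(log_probs, values, returns, entropies,
                   vf_coef=0.5, ent_coef=0.01):
    """Single combined A2C-style loss: policy-gradient term plus a weighted
    value loss, minus an entropy bonus, all minimized together."""
    advantages = returns - values              # treated as constants in the actor term
    pg_loss = -np.mean(log_probs * advantages)  # actor (policy gradient) term
    vf_loss = np.mean((returns - values) ** 2)  # critic (value regression) term
    entropy = np.mean(entropies)                # exploration bonus
    return pg_loss + vf_coef * vf_loss - ent_coef * entropy
```

The question above is whether the paper's two time-scale analysis, which keeps the actor and critic updates separate, extends to optimizing such a single coupled objective.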
The rebuttal does not change my opinion about the paper.