__ Summary and Contributions__: The paper presents three theoretical results, focusing on sample-efficient learning in two-player zero-sum model-free games with discrete states and discrete actions:
- Optimistic Nash Q-Learning, that maintains an upper and lower bound on the Q-Value of every state-actionA-actionB tuple. This algorithms is proven to have a sample complexity of O(H^5 SAB)
- Optimistic Nash V-Learning, that learns an upper bound on the V-function of every state, and iterates on the actions of the A agent to compute a loss, used to train a policy. This algorithm has a complexity of O(H^6 S(A+B)).
- A theoretical result that learning a policy against an opponent with no regret cannot be done with a polynomial complexity.

__ Strengths__: The theoretical results are strong and well-explained. They seem novel, and largely improve compared to the current state of the art. The two algorithms proposed in the paper allow to choose which ones suits a problem the best, depending on the length of an episode and the number of actions that the A and B agents have. The Appendix contains many proofs and additional details, useful to understand the main results presented in the paper.

__ Weaknesses__: A "weakness" of the paper is that it introduces three contributions at once, and therefore is very compact. While the paper is well-written and easy to read for the amount of information it contains, many parts of the paper lack intuition (like Equation 5 and the whole meaning of alpha, and the impact on how it is computed), and many references to the Appendix make reading the paper a bit difficult at time. This is really a paper that the reader must accept to read while not understanding everything, even though the Appendix answers many questions after the paper is read.

__ Correctness__: The proposed algorithms are sound, and attentive reading (even though I'm not an expert in the field of the paper) allows to understand the proofs, and see that they have a high probability of being correct.

__ Clarity__: The paper is well-written, and the gist of it can be understood by a non-expert. However, parts of the paper lack intuition, and the paper does not really try/manage to teach the reader things he/she may not know yet.

__ Relation to Prior Work__: Related work is both nicely presented in a dedicated section, and discussed in the paper while relevant. The authors take care to compare two-player MDPs with single-agent MDPs when needed, and when a single-agent result does not apply to the two-players setting.

__ Reproducibility__: Yes

__ Additional Feedback__: The "broader impact" section at the end of the paper, and the general discussion of potential future work and applications of this theoretical paper, is generally a bit lacking. One or two paragraphs, discussing how the results presented in this paper would translate to real-world applications (examples of applications with discrete states), or to continuous-action settings (how would we do it? What would still hold and what wouldn't?) would have been interesting, even if only very high-level.
Author response: the authors acknowledge that the paper is fully theoretical. Because they do not propose a discussion about possible applications of this paper (which I understand, as the paper is already quite dense), my remark here above remains.

__ Summary and Contributions__: The authors study the self play RL setting and analyze algorithms with provable guarantees (when the the control is centralized, i.e, a single algorithm controls the actions of both players). Furthermore, the authors also give some hardness results which suggest that decentralized learning is generally hard for this setting.
** I read the author's response and decided to keep my score. Thanks for the clarifications.

__ Strengths__: *) The setting is interesting, and the fact the algorithm achieves performance nearly as the lower bound (up to factors of the horizon) is very impressive.

__ Weaknesses__: *) Memory complexity. As far as I understood, the algorithms need to store K policies. This makes the algorithm impractical. I also suspect that in practice using only the last policies would improve the algorithm.
*) The sampling of the policies. The authors suggest a policy certification algorithm from which they can read the final policies on which they prove the PAC guarantees. I felt that this procedure is not explained well enough (it is discussed in half a page, however, it seems to me there is not much of a discussion on why this procedure works). For example, there is no discussion on the question whether the policy certification procedure is necessary or why there is no 'natural' way to obtain a policy beside of performing this procedure. I believe that the lack of clarity on this subject harms the quality of the paper.
*) What is the reason to include algorithm 1 in the paper? it seems it results in worse performance relatively to algorithm 2 and does not improve it in any other way.

__ Correctness__: *) I've went over the proofs and they seem correct. I suggest the authors to add some references when using standard regret analysis techniques, or explain it in a bit more clarity (e.g., stating that \sum_i^N 1/\sqrt{i}\leq \sqrt{N}, and reference to the q-learning paper in 439 when changing the summation indices.) Think it will ease the reading for people that are less familiar with these techniques.

__ Clarity__: The paper is clearly written.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: *) Is there a reason to mention algorithm 1? it seems algorithm 2 gives improved performance relatively to it. If so, why presenting the two algorithms and not just algorithm 2?
*) Although equation 9 can be thought of as a set of n+m linear constraints, why the optimization problem is always feasible? meaning, for any P and Q.
*) I think the authors should elaborate more on the policy certification procedure. Although the authors devoted half a page to explain on this procedure, I feel it is not well explained. Most of the discussion is not devoted to explaining the policy certification procedure.
Small comments.
*) Line 151. Why for a fixed \mu the best response is not markovian? isn't it true that when mu is fixed minimizing over nu means solving an MDP (for which exists an optimal markovian policy)?
*) Line 213. Where are \hat mu and \hat nu defined? I only see reference to \mu_k \nu_k.
*) Line 239. Why using D_kl(\mu | \mu_0) instead of the entropy? I think the relation to FTRL is then a bit more apparent.
*) Line 449. 'na'-> an?
*) Line 455. 'hostory'-> history.
*) Why the last equality relation in page 15 holds?

__ Summary and Contributions__: This work proves that for two player zero sum games, an optimistic variant of Nash Q learning can approximate nash equilibrium with sample complexity O(SAB), and an optimistic variant of Nash V learning with sample complexity O(S(A+B)), improved from previous results O(S^2AB). Additionally, achieving sublinear regret when
playing against adversarial opponents in Markov games is proved to be hard and non-polynomial.

__ Strengths__: This is the first time that sample complexity in two player zero-sum game to reduce to O(S(A+B)), which matches theoretical lower bound expect polynomial of episode length. The algorithm itself is described clearly, which is built upon Coarse Correlated Equilibrium subroutine and Follow-the-Regularized-Leader algorithm. A novel approach to compute the certified policies given the near-optimal Nash equilibrium values are also provided, for both Nash Q learning and Nash V learning. The new bound is a significant contribution in terms of sample complexity.

__ Weaknesses__: A major drawback is that not a single empirical experiment is done, in contract to a few prior works. e.g. Nash Q-Learning for General-Sum Stochastic Games, J. Hu and M. P. Wellman. At least experiments in matrix games or grid world should be conducted, to verify the correctness of the proof. It will be great to see that sample complexity increase linearly with S, A, B with detailed ablation studies, and what percentage of runs reach eps-nash with the new algorithms.
It is also not clear that why the algorithm can bring the sample complexity down. What is the intuition that the bound can be reduced? What is the intuition to pick the a_t and b_t and other hyperparameters in the formulas?
Since the algorithm is built upon Q learning, how can this algorithm extend to other more general cases? Is there any intuitions that can help with real world algorithms in terms of sample complexity?
===
Post rebuttal:
Some of the intuitions are explained. After the discussion with other reviewers I still think this paper needs to be improved with presentation and preliminary experiments if possible.

__ Correctness__: See weaknesses.

__ Clarity__: The presentation of this paper is slightly unclear, many details of in the proofs and not explained clearly about the intuitions in the main text. Also Sec. 5 seems to be slightly disconnected to the rest of the paper.

__ Relation to Prior Work__: Yes, it compares the new algorithm with prior related works in terms of sample complexity of S, A, B. The authors build upon existing algorithms CCE and FTRL, and with the proposed new algorithm the sample complexity can be reduced.

__ Reproducibility__: Yes

__ Additional Feedback__: