__ Summary and Contributions__: This paper takes a theoretical look at the link between episodic RL and RL in continuing environments. It is demonstrated that, with a slight perturbation to the conventions of the MDP, episodic RL problems can be proven to be ergodic. The implications for the policy gradient over the steady-state distribution are analyzed. Additionally, the appendix empirically analyzes the mixing-time properties of environments from the literature and the quality of the gradient estimates.

__ Strengths__: I found this paper to be an enjoyable read and quite interesting. The conclusion that episodic RL tasks are nearly always ergodic is, in my opinion, very insightful, opening the door to new viewpoints on results in this setting from a steady-state perspective. The authors provide a good example in the appendix where they analyze the mixing-time properties of some environments from the literature. The paper is clearly written and does a good job of taking the reader step by step through each result. I also found the closing point eye-opening: continuing environments are actually less likely to be ergodic than episodic ones.

__ Weaknesses__: I personally find this topic very interesting, but admittedly ergodicity is quite theoretical and the paper will have a limited audience as a result. The insights do not directly lead to an approach that can be empirically validated. As such, the paper could be improved by spending more time discussing the potential future impact and implications of the derived results, so that they are clear to a wider audience.

__ Correctness__: The empirical analysis is contained only in the appendix; however, I found the results there quite relevant and interesting. Results appear to have been averaged over many runs to promote reproducibility. I have not thoroughly gone through the proofs of all five theorems, but the proof ideas did make sense to me.

__ Clarity__: I thought that this paper was very well written and clear as a whole. It takes on a pretty dense subject, but really goes through the analysis step by step in a way that I found quite straightforward to follow.

__ Relation to Prior Work__: Yes. While prior work has considered episodic RL a special case of continuing RL, showing that episodic problems are nearly always ergodic, and thus admit a unique steady-state distribution, is a novel result.

__ Reproducibility__: Yes

__ Additional Feedback__: After reading the other reviews and author feedback, I still feel that my initial score is appropriate and that this paper should make a nice contribution to the conference.

__ Summary and Contributions__: The main contribution of this paper is that it proves a unique stationary distribution exists in every episodic learning process. Based on that, a formal relation between the stationary reward objective and the episode return objective is presented, and a steady-state policy gradient theorem for episodic learning processes is provided.

__ Strengths__: This paper makes an interesting theoretical contribution to the understanding of reinforcement learning task specification.
The paper is well written and should be easy to follow for a broad audience.

__ Weaknesses__: In general, I think this is a good theoretical paper (but my confidence is low as I am not very familiar with this field).
I only have some minor comments about its significance. While the contributions are primarily theoretical, are the results also of practical interest? The paper claims that, given the existence of a steady-state distribution, one can first examine the properties of episodic learning tasks under the steady state and then derive dedicated corollaries. However, beyond the proposed Theorem 5, I do not clearly see how this helps. The significance could be further improved if the paper addressed how the theorems shed light on the design of new learning algorithms, point out what is missing in existing algorithms, etc.
---------------------- Updates after the rebuttal ----------------------
The authors' rebuttal shows that their theoretical results may also have some practical impact, which addresses my concern to some degree. I suggest the authors make room for this kind of discussion in the revised version; it would make the work interesting and relevant to a broader readership.

__ Correctness__: I didn't check the proofs, but the theoretical claims look ok.

__ Clarity__: Yes. The paper is well written and easy to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors consider a transformation from an episodic problem (finite horizon) to a continuing problem (infinite horizon) by extending the ideas of unifying task specification [25] and the product-form value functions [20], and they define the transformed continuing problem as the episodic learning process. Since the episodic learning process is a proper infinite-horizon MDP, they can almost directly derive results known for infinite-horizon MDPs.
Although the presented results might be implicitly known to RL experts, they have not been discussed in detail so far. I think this type of contribution about task unification or simplification is important because there are many RL task variations; this work helps us use and select RL algorithms more comfortably and properly.
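To make the transformation concrete, here is a minimal sketch on a hypothetical three-state chain of my own (the states, probabilities, and initial distribution are illustrative choices, not taken from the paper): redirecting the terminal state back to the initial distribution yields a recurrent chain with a unique, strictly positive stationary distribution.

```python
import numpy as np

# Hypothetical 3-state episodic chain under a fixed policy:
# states 0 and 1 are regular, state 2 is terminal (absorbing).
P_episodic = np.array([
    [0.0, 0.7, 0.3],   # from state 0
    [0.4, 0.0, 0.6],   # from state 1
    [0.0, 0.0, 1.0],   # terminal state absorbs
])
rho0 = np.array([1.0, 0.0, 0.0])  # initial-state distribution

# The transformation: redirect the terminal state to the initial
# distribution, turning the episodic chain into a continuing one.
P_continuing = P_episodic.copy()
P_continuing[2] = rho0

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P_continuing.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

assert np.allclose(pi @ P_continuing, pi)  # pi is stationary
assert (pi > 0).all()  # every state recurrent: unique steady state
```

Under the original `P_episodic` the absorbing terminal state makes the stationary distribution degenerate; after the reset transition, the chain is irreducible and aperiodic, which is the property the paper exploits.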

__ Strengths__: I found the following strengths in the paper.
* Gives a useful transformation from an episodic process (finite horizon) to a continuing process (infinite horizon).
* Gives useful properties of the transformed process, corresponding to those known in infinite-horizon RL.
* Gives a perturbed MDP model for ensuring aperiodicity.

__ Weaknesses__: The following are the weaknesses that I have noticed.
* The presented results seem to be straightforward and are probably implicitly known to RL experts (though they have not been discussed in detail so far).
* It is still unclear how the results can contribute to the development of the algorithm.

__ Correctness__: I did not find any seriously incorrect parts.

__ Clarity__: The paper is well written, and I could follow most of it easily.

__ Relation to Prior Work__: To the best of my knowledge, the authors make a proper reference to the prior work and clearly discuss how their work extends the prior work.

__ Reproducibility__: Yes

__ Additional Feedback__: I have the following minor comments and questions.
* The introduction seems biased toward episodic RL, which might be misleading, especially for non-experts in RL. I believe that episodic learning and continuing learning are equally important; the appropriate type depends on the task.
* Line 111: E_{\zeta\sim\rho_0,…} should be E_{s_0\sim\rho_0,…}?
* It is helpful to explicitly define the episodic gamma-discounting in Line 294.
* Is aperiodicity really needed for the policy gradient theorem of the episodic learning process, etc.? As the authors describe in Lines 275-276, the perturbation for aperiodicity could be omitted in practice.

__ Summary and Contributions__: The paper investigates the learning dynamics of episodic reinforcement learning (RL). The authors define a class of Markov decision processes (MDPs) called episodic learning processes which models the usual episodic formulation of RL. They then proceed to show that this class of MDPs is ergodic, which has several positive implications for learning. First, since every policy now has a stationary distribution, this provides a clearer, unified definition of the term “on-policy distribution” (and also connects it with the definition used in continuing RL). Second, the authors show that the usual practice of estimating the policy gradient from a single sample transition is sound under mild assumptions. Third, their analysis also sheds light on a well-known discrepancy between theory and practice in policy gradient algorithms: namely, that these methods use the discounted value function but compute the gradient based on the undiscounted distribution.

__ Strengths__: Modelling and analyzing the learning dynamics of episodic RL is a fundamental question that is of much interest to the NeurIPS community.

__ Weaknesses__: As discussed below, the significance and implications of the theoretical findings is obfuscated by the excessive use of technical jargon and a slightly convoluted presentation.

__ Correctness__: The theoretical results seem to be correct; although I have not checked the proofs in the supplementary material, the description of the “proof ideas” in the main paper look sensible (and, more generally, the claims make intuitive sense).
The only empirical evaluation is in the supplementary material, and I did not check it carefully.

__ Clarity__: In my opinion the main weakness of the paper is its presentation. First, there is a lack of clear, direct, explanations of what the paper is trying to accomplish. Several crucial points are either only implied or mentioned in passing without the proper emphasis.
This is true for the positioning of the paper itself. The analysis seems to be mostly concerned with policy gradient methods, but this is never explicitly stated. Value-based methods are neither excluded nor directly discussed.
There is also a lack of clarity regarding the implications of the theoretical results. Take Theorem 4, for example. In Section 4 it is shown that an episodic learning process is always ergodic. This means it automatically inherits the property in Proposition 3, namely, that the average return per episode approaches the expected policy return in the limit. But then, in Theorem 4, it is shown that in episodic learning processes the mean return divided by the mean episode length equals the expected policy return. This seems like an interesting result, but the text never explicitly discusses its implications or why it is important. A similar observation applies to Theorem 5, though to a lesser extent. In both cases the text focuses on pinpointing differences with previous results or algorithms, leaving to the reader the burden of figuring out the consequences of those differences.
I believe the observation above reflects the style of the narrative more broadly. I have the impression that most of the information is there, but conclusions are sometimes hidden behind technical observations whose implications are assumed to be obvious to the reader. In my opinion this paper would greatly benefit from some rewriting to make sure the conclusions are clearly spelled out.
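For what it is worth, the Theorem 4 relation can be checked numerically on a toy example. Below is my own hypothetical three-state chain (transition probabilities and rewards are illustrative choices, not from the paper), where the steady-state reward rate of the reset chain matches the mean episode return divided by the mean cycle length, assuming the reset transition counts as one step of the continuing chain:

```python
import numpy as np

# Hypothetical 3-state chain: states 0, 1 transient, state 2 terminal.
P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.0, 0.0, 1.0],
])
r = np.array([1.0, 2.0, 0.0])     # per-step rewards (none at the terminal state)
rho0 = np.array([1.0, 0.0, 0.0])  # initial-state distribution

# Episodic quantities via the fundamental matrix N = (I - Q)^{-1}.
Q = P[:2, :2]                     # transient-to-transient block
N = np.linalg.inv(np.eye(2) - Q)
ep_return = rho0[:2] @ N @ r[:2]       # E[G]: expected return per episode
ep_length = rho0[:2] @ N @ np.ones(2)  # E[T]: expected steps per episode

# Continuing chain: terminal state resets to the initial distribution.
P_cont = P.copy()
P_cont[2] = rho0
evals, evecs = np.linalg.eig(P_cont.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Renewal-reward identity: steady-state reward rate equals the mean
# episode return over the mean cycle length (episode + one reset step).
assert np.isclose(pi @ r, ep_return / (ep_length + 1.0))
```

This is only a sanity check of the renewal-reward flavour of the result, but it is the kind of concrete illustration whose absence I am pointing to above.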

__ Relation to Prior Work__: Yes. The most relevant reference seems to be White’s “Unifying task specification in reinforcement learning”, which is discussed in the paragraph starting on line 352.

__ Reproducibility__: Yes

__ Additional Feedback__: POST-REBUTTAL COMMENTS:
------------------
Thank you for the clarifying rebuttal! After reading it and the other reviews, as well as revisiting the paper and looking at the supplement more carefully, I decided to increase my score to 6.
The idea of "connecting" the terminal states of an episodic MDP with the initial distribution is not new (see, for example, White, 2017). But, as the authors point out, the paper does take this idea one step further, analyzing the properties of the induced MDP. In doing so they arrive at some interesting insights, which provide theoretical support for some algorithms used in practice (Section E of the supplementary material) and also suggest some new tricks (Section F). I suspect the former will have more of an impact as a contribution than the latter.
As I said in my review, my main concern regarding this paper is its presentation. As is, it comes across as a compressed version of a longer manuscript, and as a reader I could see that some material had to be cut. Looking at the supplement and the rebuttal, I think the authors have enough material to put together a good paper, but that will, in my opinion, require some restructuring. For example, the authors spend quite some time explaining the basics of Markov chains and MDPs in Section 2, including different formulations of the latter, at the expense of explaining some of their own insights in more detail (which are deferred to the appendix). I urge the authors to take this comment seriously and to put some effort into making the next version of the paper clearer; otherwise an interesting contribution might end up obfuscated by the presentation.