__ Summary and Contributions__: The paper proposes a sampling method to tackle the problem of training Reinforcement Learning (RL) policies that are robust with respect to varying process dynamics. The method is a direct extension to the RL setting of the one proposed in [12], which uses SGLD (Stochastic Gradient Langevin Dynamics) to find a mixed Nash Equilibrium (NE) in the space of the parameters of the neural networks.
The authors show that, even for simple 2-dimensional non-convex-concave problems, sampling algorithms can achieve better tracking of the NEs than optimization-based methods. They also describe a variant of DDPG that uses SGLD to train robust policies, and provide numerical experiments.
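For readers less familiar with SGLD, the update the summary refers to can be sketched as follows. This is a minimal one-dimensional illustration on a hypothetical quadratic, not the paper's objective; the step size `eta` and inverse temperature `beta` are assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # gradient of a stand-in objective f(x) = 0.5 * x^2
    return x

eta, beta = 0.1, 10.0  # step size and inverse temperature (assumed values)
x_gd, x_sgld = 5.0, 5.0

for _ in range(1000):
    # plain gradient descent: deterministic, converges to the minimizer
    x_gd -= eta * grad(x_gd)
    # SGLD: the same gradient step plus explicitly injected Gaussian noise
    # scaled by sqrt(2 * eta / beta); the iterates sample from a distribution
    # concentrated around the minimizer rather than converging to a point
    x_sgld += -eta * grad(x_sgld) + np.sqrt(2 * eta / beta) * rng.normal()
```

The injected noise is what lets the dynamics sample a distribution over parameters (a mixed strategy) instead of committing to a single point (a pure strategy).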

__ Strengths__: The paper tackles an important problem, namely the training of RL policies that are robust to varying dynamics of the underlying process. Previous work [9,10] focused on two different approaches, both with a cumbersome training process. Adapting the algorithm in [12] to the RL setting is the main novelty of the contribution (accompanied by numerical experiments). There is also a brief discussion on convergence to stationary points or NEs in a simple 2-dimensional example.

__ Weaknesses__: I believe the work has some weaknesses in its current state:
>> Main Comments <<
- A related work section is missing, making it difficult to understand what the contribution of the work is and what previous research focused on.
- Theoretically speaking, the contribution is very limited. The work builds on the foundations of [12], and is therefore just a simple extension to the RL case.
- There is a discussion on simple non-convex-concave problems that is, in my opinion, marginal to the problem of training robust RL policies (which is the main objective of the paper).
- Experiments-wise: (i) in Figure 4, what is the shaded area? (ii) How many simulations were run per seed? If only 1 simulation per seed, I believe this number is relatively small; (iii) In some experiments the MixedNE-LD algorithm seems to perform worse; what is the reason behind this? (iv) Why is a comparison with RARL not included? (v) What about simulations where changes in the transition dynamics cannot be simulated via changes in the action?
>> Minor comments <<
- (Minor comment) The extension of the algorithm in [12] is given only for DDPG; other algorithms are not taken into consideration, though this could likely be done quite easily.
- (Minor comment) A more detailed discussion of Algorithm 1 would be a plus (to clarify what the various parameters are and the reasoning behind the algorithm).
- The appendix is a bit hard to follow in its current state.

__ Correctness__: The claims seem correct, and the theoretical work in [12] seems sound.

__ Clarity__: The paper is clearly written and easy to follow. I have not found any evident grammar mistakes.

__ Relation to Prior Work__: A reader who is not familiar with the topic of the paper may find it difficult to understand the difference from previous work, and what previous work focused on. A related work section is missing; the related work is instead mixed into the introduction. A comparison with the work in [9] (RARL) is not given in the experiments.
Furthermore, it would be nice if you can compare your approach with the one in [*].
[*] Smirnova, Elena, Elvis Dohmatob, and Jérémie Mary. "Distributionally robust reinforcement learning", 2019.

__ Reproducibility__: Yes

__ Additional Feedback__: I suggest the authors address the comments in the Weaknesses section.
--- Post rebuttal ---
I have read the author response, but the answers did not exhaustively address my concerns. In general, I still think the theoretical contribution is limited, and that the discussion on simple non-convex-concave problems is marginal to the problem of training robust RL policies. Overall, I think the work in its current state can still be improved, especially in the presentation of the experimental part.

__ Summary and Contributions__: This paper proposes a method for robust RL based on finding mixed Nash equilibria, rather than pure Nash equilibria, in a two player zero sum game between an agent and an adversary. Stochastic gradient Langevin dynamics is the proposed tool for this approach, and the authors demonstrate that their method has better theoretical and empirical properties in a toy domain. This is then empirically extended to several MuJoCo tasks with systematically varying mass and friction parameters, where the proposed method is demonstrated to be generally more robust than a recent prior robust RL method.

__ Strengths__: Generally, the content of the paper is strong, and though I am not an expert in this area, I believe that the contributions are both significant and novel. The theoretical claims in the toy domain are well supported by empirical evidence, though I did not thoroughly check the proofs themselves. The additional experiments also demonstrate the practicality of the method in more realistic domains, and there is also an extensive set of experiments that are further presented in the appendix.

__ Weaknesses__: The main weakness of the work is its presentation. For a reader who is not intimately familiar with the background material, Section 2 is not self-contained, and the significance of the concept of mixed NE vs. pure NE is not explained. But perhaps the area of the paper that would benefit most from additional discussion is the experiments section, which currently features a very large Figure 4 (consider cutting it down to half its current size) and little discussion of the results themselves. When does the proposed method work better or worse than the baselines, and is there intuition for why? How about other considerations, such as computation time? Some videos of the learned policies would also nicely supplement the results.
It is also unclear if the best or most significant results are presented in the main paper. Unless I am misinterpreting, it seems like TD3 (Figures 8 and 9 in the appendix) vastly outperforms DDPG, as evidenced by the rewards on the y axis. This would not be surprising. I understand that the baseline in prior work used DDPG, but barring any stronger reasoning, it seems reasonable to "upgrade" all baselines and methods to use TD3 instead, and then it would be sensible to instead include this as the main result.
It also seems, however, that the margin of improvement for the proposed method over the baselines is somewhat smaller. But this is hard to eyeball, and including a simple metric such as average improvement over baselines across domains, both in the main paper and the appendix, would be useful to better gauge this. But most importantly, I would like to hear the authors' thoughts on why they chose certain experiments for the main paper vs the appendix and whether they would agree that the TD3 results are more significant.

__ Correctness__: The claims and method seem correct, and the empirical methodology is generally sound.

__ Clarity__: Except for the concerns listed above, the paper is otherwise well written.

__ Relation to Prior Work__: I am not an expert on the prior work in this domain, but there is some discussion of the relationship to recent prior work in robust RL. There is, however, no explicit related work section.

__ Reproducibility__: No

__ Additional Feedback__: Regarding reproducibility, the method appears to be relatively complicated and would require a great deal of expertise and familiarity with prior work to reimplement. But to the authors' credit, they have included code, which I have not tried to run.
*Edit after reading the author response and the other reviews*
---
My score remains unchanged. I appreciate the authors addressing my concerns about adding the table of average improvement and discussing the experimental results in greater detail. I also appreciate the response with respect to the TD3 results, and I understand the authors' reasoning about maintaining DDPG in the main paper though I would still prefer the TD3 results in the main paper. Thus I will maintain my weak accept.

__ Summary and Contributions__: In this paper, the authors propose an adversarial training method with Langevin dynamics to tackle the problems in robust reinforcement learning.

__ Strengths__: The authors demonstrate theoretically, with a stylized example, that the (continuous-time) training dynamics of GAD and EG will either get trapped by non-equilibrium stationary points or converge to an NE. In contrast, the proposed MixedNE-LD algorithm is always able to escape non-equilibrium stationary points in expectation. They also show the promising performance of their algorithm compared to other state-of-the-art robust RL algorithms.
The sampling perspective of tackling robust RL could potentially contribute to the theoretical RL literature if the theorems could be extended to a more general set-up.
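To make the escape claim concrete, here is a minimal sketch on a hypothetical quadratic min-max problem of my own, not the paper's stylized example. For f(x, y) = x^2 + y^2, with x the descent player and y the ascent player, the origin is a stationary point of the dynamics but not an NE (the maximizing player prefers |y| large). GAD initialized on the line y = 0 never leaves it, while a Langevin-perturbed version does:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, noise = 0.05, 0.01  # step size and noise scale (assumed values)

# f(x, y) = x^2 + y^2: grad_x = 2x (descent player), grad_y = 2y (ascent player).
# GAD from (1, 0): the y-gradient vanishes identically on the line y = 0,
# so the iterates converge to the non-equilibrium stationary point (0, 0).
x, y = 1.0, 0.0
for _ in range(200):
    x, y = x - eta * 2 * x, y + eta * 2 * y
gad_x, gad_y = x, y

# Langevin-perturbed ascent-descent: injected Gaussian noise kicks y off the
# trap set y = 0, after which the ascent dynamics amplify it and escape.
x, y = 1.0, 0.0
for _ in range(200):
    x = x - eta * 2 * x + noise * rng.normal()
    y = y + eta * 2 * y + noise * rng.normal()
ld_y = y
```

The deterministic iterates end at gad_y = 0 exactly, while the perturbed run drifts away from the non-equilibrium point, mirroring the behavior the theorems establish for the stylized example.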

__ Weaknesses__: I have some concerns in terms of the set-up (assumptions) and theoretical contributions of this work:
1) The introduction of Stochastic Gradient Langevin Dynamics is somewhat misleading. The randomness of stochastic gradient descent comes from sampling a subset of the data to compute the gradient term, not from adding random noise to the update as illustrated around Eqn. (1). In addition, it is well known that the noise in the stochastic gradient update is Gaussian for convex optimization problems, whereas the noise has a heavier tail for nonconvex-nonconcave problems. The authors should clarify.
2) In some RL set-ups, the distribution of the reward is unknown to the agent, who can only observe a realized sample of the reward at the end of each round after taking an action. Hence, the agent is not able to calculate the gradient of J (notation for the objective function in Section 3), f (notation for the objective function in Section), or h (notation for the objective function in Algorithm 1). There are zero-order optimization methods that approximate the gradient of the objective function using samples of the rewards, which are more suitable for this scenario. In this paper, however, it seems that the distribution of the reward is given to the agent, so the gradient of the utility function can be calculated.
3) Theoretical contributions: the conclusions in Theorem 1 and Theorem 2 are interesting, but the objective functions considered are too stylized to fit into the framework of general RL problems.
4) Experiment: if the benchmark algorithms assume no knowledge of the distribution of the reward function whereas the proposed algorithm utilizes it, the comparison seems unfair. Please clarify.
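As an aside on point 2), the zero-order approach alluded to can be sketched as follows. This is a generic two-point Gaussian-smoothing gradient estimator applied to a hypothetical black-box objective of my own choosing, not the paper's J, f, or h:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # black-box objective: only function evaluations are available,
    # standing in for sampled returns; here f(x) = ||x||^2
    return float(np.sum(x ** 2))

def zo_grad(f, x, mu=1e-3, n=256):
    # two-point Gaussian-smoothing gradient estimator:
    # E[(f(x + mu*u) - f(x - mu*u)) / (2*mu) * u] ~= grad f(x) for u ~ N(0, I),
    # so averaging n such terms approximates the gradient from evaluations only
    g = np.zeros_like(x)
    for _ in range(n):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n

x = np.array([1.0, -2.0])
est = zo_grad(f, x)  # approximates the true gradient 2*x = [2, -4]
```

Such an estimator would let the gradient terms in the update rules be replaced by sample-based approximations when only realized rewards are observable.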

__ Correctness__: Yes.

__ Clarity__: Yes the paper is well-written and easy to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: [Further feedback]
=======================
I appreciate the authors' clarifications on the stochastic gradient Langevin dynamics, the deterministic reward, and the contribution of the main theorems. However, I still think it would make the paper stronger if the theoretical results could be derived under a more general set-up and stochastic rewards could be incorporated in the experiments. Therefore I will keep my original assessment.

__ Summary and Contributions__: This paper proposes an improved robust RL algorithm that finds mixed Nash equilibria via Langevin dynamics. The paper first constructs a toy example where, analytically, non-sampling-based methods fail to find the Nash equilibrium if initialized poorly, while the proposed algorithm consistently finds the Nash equilibria. Then, it shows that even an adaptive non-sampling method cannot find the Nash equilibria. Finally, the paper presents empirical results where the proposed method consistently outperforms a prior baseline for robust RL.

__ Strengths__: This is a solid paper, and the modification made is very straightforward. The theoretical analysis on the toy problem is sound and nicely gets the point across. The technique cleverly draws on the large body of work on stabilizing GANs and uses it to improve the training of two-player RL games, which indeed bear a lot of similarity to GANs. The paper also proposes a practical algorithm that efficiently performs Langevin dynamics with a 2-player DDPG variant and demonstrates consistent improvement over existing approaches. I believe the contribution of the paper is important for robust RL and could be presented at NeurIPS, since robust decision making is essential for high-stakes automated decision making.

__ Weaknesses__: I am overall happy with the quality of the work, but I have a few minor concerns.
- In section 4.4, I don’t think it’s correct to state that exploring the distribution in parameter space is the kind of exploration that reinforcement learning is concerned with. At least, it’s not the most important one since we are more interested in exploration in the state space rather than the parameter space.
- I think the empirical evaluation could be improved. Specifically, if I understand correctly, only the mass is changed in all the experiments. It would be much more convincing if the authors presented experimental results with other adversaries, such as changing the friction or the length of the agent’s appendages. While I think this work is solid, I also think it is incremental (in a good way) over previous methods. As such, it would be nice to see a larger range of experiments.
- Comparisons to previous robust RL methods seem to be missing even though they are cited (e.g. RARL). If either GAD or EG already includes these prior works, please be explicit.

__ Correctness__: The claims and method are correct. Experiments could be improved.

__ Clarity__: The paper is well-written and easy to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: ==================Update==================
After extensive discussion with other reviewers, I think I will keep my original evaluation. I would recommend including the TD3 results in the main text as well.