Review for NeurIPS paper: Robust Reinforcement Learning via Adversarial training with Langevin Dynamics

NeurIPS 2020

Meta Review

The paper focuses on robust RL in an adversarial setting in which two policies are optimized simultaneously and one is used to perturb the other. In this setting, the paper proposes a parameter sampling approach and provides theoretical and empirical evidence for why parameter sampling is more effective than deterministic optimization. The reviewers agree that the paper is well motivated and addresses a topic that is relevant to the community, and some reviewers appreciate the combination of simple conceptual examples and larger-scale empirical evaluation. Although the majority of the reviewers look favorably on the paper, they highlight a number of areas for improvement, especially the presentation (e.g. a more explicit discussion of contributions and prior work, R1; a more self-contained presentation and a discussion of the significance of concepts such as Nash equilibrium, R2); the experimental evaluation (presentation and analysis of results, R1/R2/R3; use of TD3 instead of DDPG results in the main text, all reviewers); the limited novelty compared to [12] (R1); and limitations in the scope of the theoretical results and of the problem formulation (e.g. use of deterministic rewards, R3). The authors have provided responses to the main criticisms, and the paper and the response were extensively discussed by the reviewers. All reviewers agree that the stronger TD3 results should be included in the main text. Several reviewers note that the improvement achieved by the proposed method is considerably smaller when TD3 is used as the base algorithm than when DDPG is used. A point of contention remains: whether the conceptual example in section 4 is adequately tied to the larger-scale results in section 5, and whether the improvement from sampling demonstrated for the toy example in section 4 is indeed the same effect that is responsible for the improvement in section 5.
After discussion, R1 still recommends rejection; R2 and R3 continue to weakly endorse the paper; and R4 recommends acceptance. Based on the reviews and discussion, the meta-reviewer believes that the paper is likely to be of interest to the community but that both the presentation and the empirical evaluation could be improved. More specifically, the paper could benefit from a revision that (a) promotes the TD3 results to the main text; (b) provides a more detailed description, discussion, and analysis of the empirical results; and (c) discusses some of the necessary background in more depth and better connects the simple conceptual example to the full RL setup. The meta-reviewer would also like to suggest that the authors provide, in the final version, additional baselines that include, for instance, simple stochastic policies (possibly entropy regularized) trained with a state-of-the-art algorithm (SAC, MPO), as well as a discussion of and comparison to approaches that perform exploration in parameter space (e.g. https://arxiv.org/abs/1706.01905). This would (a) help to calibrate the reader’s understanding of the difficulty of the robust RL problem studied empirically in section 5 and of the quality of the base algorithms, and (b) help to further disentangle effects that arise specifically from the adversarial setting from those that arise from parameter sampling in the non-adversarial setting (also touched upon in the discussion in Appendix B). It would also provide a better connection to the broader RL literature. The meta-reviewer would further like to reinforce a point raised by the reviewers: the presentation of results is not ideal; for instance, it would be really helpful to use the same y-scale for all results obtained for the same domain (where rewards are directly comparable).