__ Summary and Contributions__: The authors describe a framework to study adversarial observation perturbations against Deep RL agents. To this end, they define the state-adversarial Markov decision process (SA-MDP) as a formal and theoretical framework.
In essence, the authors propose a policy regularization that enables an agent to approximate an original policy under attack from an optimal adversary, and give approximations of this method for PPO, DDPG, and DQN.
They propose two novel attack methods against RL agents. The authors show experimental results of their policy regularization method for different algorithms and attacks.

__ Strengths__: - The paper is well written and the reasoning seems sound
- The general experimental design is sound, and covers all immediate cases.
- The direction of research is relevant and the contributions are significant.

__ Weaknesses__: The paper is too long (important aspects have been moved to the appendix, and the conclusion has been rephrased as the broader impact section and put on the 9th page; the broader impact section itself is missing). The paper is not self-contained.
Experimental design: the presentation of the experimental results is not clear in general. More points:
- the authors don’t give a std deviation across different training runs. This would be interesting to judge whether the small improvements for some models wrt. performance or robustness are significant.
- The adversary bounds chosen are not explained or varied.
- The baseline adversarial methods Critic and Random sometimes improve agent performance?
- The adversaries implemented are quite weak.
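The request for a standard deviation across training runs can be made concrete with a small sketch. The numbers below are hypothetical returns from five seeds per method and are not taken from the paper; the helper names `summarize_runs` and `welch_t` are likewise illustrative assumptions, and Welch's t statistic is only one possible way to compare two sets of runs.

```python
import math
import statistics

def summarize_runs(returns):
    """Return (mean, sample standard deviation) of returns across training seeds."""
    return statistics.mean(returns), statistics.stdev(returns)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, std_a = summarize_runs(a)
    mean_b, std_b = summarize_runs(b)
    standard_error = math.sqrt(std_a ** 2 / len(a) + std_b ** 2 / len(b))
    return (mean_a - mean_b) / standard_error

# Hypothetical returns from five seeds per method (illustrative numbers only).
baseline_returns = [100.0, 110.0, 90.0, 105.0, 95.0]
robust_returns = [104.0, 112.0, 96.0, 108.0, 100.0]

for name, runs in (("baseline", baseline_returns), ("robust", robust_returns)):
    mean, std = summarize_runs(runs)
    print(f"{name}: {mean:.1f} +/- {std:.1f}")
print(f"Welch t = {welch_t(robust_returns, baseline_returns):.2f}")
```

When the gap between the means is small compared with the spread across seeds, the t statistic stays close to zero, which is the situation in which a reported improvement would be hard to call significant.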
(Bit of speculation on my part): As far as I understand it, the regularization essentially constrains the policy to be smooth over the observation. Are the proposed algorithms (and rather heavy optimization methods) the right way to achieve this objective?
The authors propose to use expensive second-order optimization for their method. I suppose the proposed method produces a huge computational workload. The authors do not analyze this in the experimental section.

__ Correctness__: In general, the claims stated by the authors are reasonable and seem correct to me. They are supported by the experimental results.

__ Clarity__: *** Derivations in Section 3 ***
While the theorems across Section 3.1 seem reasonable, I would have liked a more self-contained presentation of the theorems together with their proofs.
P4, 150-171 quickly jump over theorems without giving at least rough sketches of the proofs.
Assumption 2 (Bounded adversary power) is a bit strange, and while the experimental implementation (with the norm ball around s) seems reasonable for many environments, this should probably be defined in a better way.
The authors refer to the Appendix a lot and in my opinion such derivations are necessary for the reader to follow along.
- However, I was curious about Theorem 3 – a proof is missing.
- Moreover, Theorem 5 was left too unexplained. I cannot really follow how the authors get there.
- How do you arrive at Theorem 6?
Theoretical Results:
*** Experiments in Section 4 ***
I see a few issues with the (presentation of the) experimental results:
- apart from Figure 2, the experiment results are presented in dense tables without a clear way to judge the significance of the results. Add Plots (similar to Appendix I, Figure 12).
- Table 3, gray markings: These are applied inconsistently. For Pong, SA-DQN PGD and convex, both should be marked (both 21.0). Also, I would like to note that Vanilla DQN seems to perform comparably, within limits of the evaluation, with the authors' models.
- Although it presents essentially the same type of results data, Table 1 employs a different structure than Tables 2 and 3.
- Table 2 does not report which attack was the strongest. Given that against PPO, the novel RS attack massively outperforms the baseline attack methods, this would be interesting to see.
- Across the board, adversarial training (adv. 50 / 100%) performs very badly. To save space, these results could be omitted.
- Numbers (decimal points) are not aligned, which makes it even harder to quickly parse the results.
Moreover, I think the authors should add a few more experiments:
- A runtime assessment of their method; 2nd order optimization sounds computationally expensive. It would be great to see this in comparison to the performance of the baseline methods.
- It would be interesting to see an ablation study that analyzes the effect of the perturbation budget on the robustness of the policy.
- I think the authors should also add a paragraph that outlines the limitations of their method.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: The abstract states that how to improve the robustness of DRL under adversarial settings has not been well studied. I think this claim is too strong as this has been subject to research in a lot of work.
(Starting from line 138:) Why do you put $\tilde{V}^{\pi}_{\nu}$ ? I would rather use $V^{\pi \circ \nu}$ . This would seem more natural to me. Same for $Q$
Line 198: “neuron networks” should be “neural networks”.
Line 204: \texttt{auto_LiRPA} is not well-known, at least not to me.
Line 255+ and 267+ (the equations): please do not reduce font sizes. Add label.
Lines 281-286 contains important information that cannot be moved to the appendix.
####################### Update #######################
I would like to thank the authors for their feedback. Indeed, your feedback together with the other reviews clarified a lot (and I had understood a few things a bit differently, especially details in the experimental section). Most of my confusion came from the denseness of the paper and the missing running example that R2 proposes to add. The authors also outline a few suggestions for improvement. With such improvements I think the paper is much more readable and becomes a good fit for the conference.

__ Summary and Contributions__: The authors formulate a state-adversarial MDP, where an adversary can change the agent's perception to be a nearby state. They show basic results of policy optimization (including the existence of SA-MDPs where there is no optimal Markovian policy). They show a bound on policy performance and use it to define a policy regularizer to defend against attacks. In effect, their regularizer enforces that the policy action not change too much over a set of neighboring states. They run experiments on their approach and show that it improves performance in adversarial settings and some non-adversarial settings.

__ Strengths__: The ideas behind the work are clean and potentially highly influential. Adversarial ML is a huge area of interest, and while adversarial RL has gotten some attention there is not a clear uniting formalism. I could see SA-MDPs filling that gap. The methods proposed look sound, and the policy regularizer seems like it may be useful in non-adversarial settings by enforcing a type of smoothness prior on the policy.

__ Weaknesses__: The idea behind the work is good, and the results presented look sound. The primary weakness of the paper is that it is quite dense and detail oriented. See the section on clarity for details.

__ Correctness__: The claims and methods look correct and the empirical methodology is appropriate.

__ Clarity__: The paper is clear, but somewhat hard to follow. There's room for improvement to make it more engaging and easier to follow. In particular, especially when introducing a new framework, it is helpful to include a running example and provide intuitive groundings of the theorems. For example, consider describing the impact of the policy regularizer on a toy problem and/or walking through an example (Fig. 1 tries to do this, but isn't connected to the SA-MDP formalism directly).
The paper often leaves important conclusions unstated for the reader to infer. The clearest example of this is on the bottom of page 4. The paper presents several theorems that are described in the text as negative results, but the only statement of the results is the theorem itself. Thms 3-5 should be described informally in English concisely in addition to the formal theorem statement. I also think it is important to describe something of the proof here --- it is ok to move detailed proofs to the appendix, but the paper should stand on its own. In this case, I think that means including an example or proof sketch in the main text. This also shows up in that the paper does not have a discussion or conclusion section to interpret the experimental results.
I realize that a lot of these choices are driven by space constraints. I recommend that the authors consider the most important parts of the paper in order to tell a full story and then move additional sections to the appendix. For example, instead of covering everything briefly (e.g., all of the robust optimization approaches and adversarial attacks) consider covering one in detail, the others briefly, and then referring readers to the appendix.

__ Relation to Prior Work__: Prior work is discussed very well, but the paper does not do a good job of describing the relationship between this work and prior work. The related work section is a good literature review, but does not situate these contributions with respect to that work.

__ Reproducibility__: Yes

__ Additional Feedback__: I am curious if there are any environments (or types of optimal policy) where this method would fail. For example, if the policy really does need to change sharply in a particular neighborhood, then it seems like this would substantially harm natural performance. Can the authors discuss this somewhat?
Can the authors comment more on the choice to sample states from B(s), rather than doing a maximization? I understand that this may be useful, but the properties of the max may be quite different than the sum.
Typos:
l.137 "to those of regular MDP"
l.152 "the known results in MDP" --- while the literal acronym works, this reads strangely to me. Consider 'known results in MDP theory'
l.163 "not all hopes are lost" --- phrase is 'not all hope is lost' but consider replacing with something simpler
Update after the discussion phase: I appreciate the authors' response and the description of limitations they included. I am not changing my (quite high) evaluation.

__ Summary and Contributions__: Existing techniques for improving robustness of policies are ineffective for many RL tasks, so the authors propose a new formulation of MDPs called the state-adversarial Markov decision process (SA-MDP). They show that while there does not always exist an optimal policy under an optimal adversary in this SA-MDP formulation (as there does in classical MDPs), the loss in performance can be bounded under certain assumptions. Under this formulation, they develop a policy regularization technique that is robust to noise and adversarial attacks on state observations and improves the robustness of DQN, PPO, and DDPG in both discrete and continuous action spaces. The authors further evaluate their algorithm under 2 new attacks, the robust SARSA attack and the maximal action difference (MAD) attack. They show that their method outperforms several baselines on 11 environments in both discrete and continuous action spaces to support their claims.

__ Strengths__: - With more robust RL, we can more safely apply RL in real world settings, such as autonomous driving and situations where small amounts of noise can cause huge consequences. One weakness of deep RL is that it often fails to generalize at test time, and with a more robust method, perhaps RL will be able to be used in other applications where only other methods are currently used.
- The authors give a clear and easy to follow presentation of theorems which lead to the motivation behind the policy regularization technique.
- Experiments are consistent with their claims and do show that their SA-DQN outperforms several baselines against the strongest adversaries
- RL robustness is very relevant in the realistic settings mentioned, including sensor noise and measurement errors
- The method trains Humanoid more robustly, which is great!

__ Weaknesses__: - In the related work section: Adversarial Attacks on State Observations in DRL, would be nice to know how these attacks relate to the ones they looked at (did they use these or what’s different about the attacks they used)
- It might be interesting to evaluate why the agent performs worse under the robust SARSA than under the MAD attack, since the MAD attack is designed to maximize exactly what the proposed regularizer tries to minimize
- small typo in line 810 of appendix
- Figure 1 is shown, but there are no experiments on this environment. Also, safety is mentioned throughout the introduction, but none of the experiments use a setting where safety is critical (e.g., not resetting the humanoid / ant when it falls over, or other cliff-falling / disastrous outcome states).
Other minor comments:
- Unclear difference between partial observability and state adversarial.
- Would be interesting to see how practical implementations of TD3 or SAC, which use target Q functions / other hacks for stability, are affected by the theory. It could be possible that the noisiness of state estimation is already accounted for in these hacks, so having more robust policies shows no significant improvement.
- Model based methods? If this does not work on something like MPC, then it’s possible that it is merely addressing attacks on the networks parameterizing the policies, rather than providing some inherent robustness to attacks on state observations.
- Table 1 would be nicer if it was flush bottom, rather than having a small paragraph of text between the table and the bottom of the page.

__ Correctness__: Nothing pops out as incorrect (but I also didn’t read through the proofs in the appendix very thoroughly)

__ Clarity__: Paper is structured very clearly and is easy to follow. The theorems provide motivation for why minimizing the variation distance between the policies under two MDP formulations will make the policy more robust, and follows with examples in 3 widely used RL algorithms.

__ Relation to Prior Work__: Overall, looks good.

__ Reproducibility__: Yes

__ Additional Feedback__: I have read the author response and the reviews from the other reviewers. I'm satisfied with the response and stick to my original score of Accept.

__ Summary and Contributions__: This paper tackles adversarial perturbations on state observations in RL. It describes an approach to train robust policies with a modified loss for standard RL algorithms such as DDPG.
The main idea consists in smoothing the randomized policy, in such a way that for each state, in the neighborhood of that state, the distribution of actions does not change too much.
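The smoothing idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a discrete action space, an l_inf ball of radius `eps` around the state, and plain random sampling in place of the maximization over the ball, so it only lower-bounds the true worst case. The names `smoothness_penalty` and `toy_policy` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def smoothness_penalty(policy, state, eps, num_samples=32, rng=None):
    """Largest TV distance between policy(state) and policy(s_hat) over
    randomly sampled s_hat in the l_inf ball of radius eps around state.

    Sampling is an assumption of this sketch; it only gives a lower bound
    on the maximum over the whole ball.
    """
    rng = rng or random.Random(0)
    base = policy(state)
    worst = 0.0
    for _ in range(num_samples):
        perturbed = [x + rng.uniform(-eps, eps) for x in state]
        worst = max(worst, total_variation(base, policy(perturbed)))
    return worst

# Hypothetical toy policy: three actions, logits linear in a 2-D state.
def toy_policy(state, sharpness=1.0):
    x, y = state
    return softmax([sharpness * x, sharpness * y, 0.0])

smooth = smoothness_penalty(lambda s: toy_policy(s, 1.0), [0.0, 0.0], eps=0.1)
sharp = smoothness_penalty(lambda s: toy_policy(s, 10.0), [0.0, 0.0], eps=0.1)
print(smooth, sharp)
```

Adding such a penalty to the training loss pushes the action distribution to stay nearly constant inside the ball, which is why a policy that changes sharply with the state is penalized more than a flat one.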

__ Strengths__: - The problem addressed in the paper (attacks on state observations) is highly relevant for NeurIPS and very interesting.
- The method proposed in the paper (a regularizer entailing smoothness) is sound
- The method is implemented in several algorithms
- The theory part is sound. In particular, theorem 5 relates the true objective (robustness) to the regularizer nicely.

__ Weaknesses__: - The idea of the paper is that smooth policies are also robust policies. In the supervised setting, this fact is well known, well understood, and many smoothing techniques are available, including the one presented in this paper [1]. The paper would have been more interesting if many smoothing techniques were compared in the RL setting.

__ Correctness__: Theorems are correct as far as I can tell, although their proofs are probably lengthier than needed.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: The authors should refer to the literature on smoothing techniques for defending against adversarial attacks in the supervised learning setting. In particular, virtual adversarial training (VAT), described in [1], is exactly the same regularizer as described in this paper, and should absolutely be cited. The same idea, although a little different, is presented in [2].
[1] T Miyato, S. Maeda, M. Koyama, K. Nakae, S. Ishii, Distributional Smoothing with Virtual Adversarial Training
[2] H. Zhang, Y.Yu, J.Jiao, E.Xing, L.E.Ghaoui, M.Jordan, Theoretically Principled Trade-off between Robustness and Accuracy

__ Reproducibility__: Yes

__ Additional Feedback__: - the fact that optimal policies are randomized is not particularly surprising here, because we are in a zero-sum two player game, where randomizing improves the value of the game in general.
- because \tilde{V}_{\nu}^{\pi}=V^{\nu\circ\pi}, theorem 1 is absolutely trivial.
- Theorem 5, in the definition of \alpha, is the \gamma/(1-\gamma)^2 term really tight ? Can't we replace it by \gamma/(1-\gamma) ?
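To give a sense of how much the last question matters, the two candidate factors can be compared numerically. This is only arithmetic on the two expressions named above, not a statement about which one is correct for Theorem 5; the function name `horizon_factors` is illustrative.

```python
def horizon_factors(gamma):
    """Return (gamma/(1-gamma)^2, gamma/(1-gamma)) for a discount factor in [0, 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    linear = gamma / (1.0 - gamma)
    quadratic = gamma / (1.0 - gamma) ** 2
    return quadratic, linear

for gamma in (0.9, 0.99, 0.999):
    quadratic, linear = horizon_factors(gamma)
    print(f"gamma={gamma}: quadratic={quadratic:.1f}, "
          f"linear={linear:.1f}, ratio={quadratic / linear:.1f}")
```

The ratio of the two factors is 1/(1-\gamma), so for the discount factors commonly used in deep RL the difference spans orders of magnitude.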