Review for NeurIPS paper: Guided Adversarial Attack for Evaluating and Enhancing Adversarial Defenses

NeurIPS 2020

Review 1

Summary and Contributions: The paper proposes to add a relaxation term to the losses (cross-entropy, margin loss) used for standard adversarial attacks to improve their effectiveness. Moreover, such attacks can be used at training time to produce, via adversarial training, robust models, especially with single-step methods.

Strengths:
- The suggested additional term is simple and shown to be effective in practice.
- The new loss makes it possible to train, with single-step first-order attacks, classifiers that are more robust than those obtained in previous work.
- The evaluation of the proposed defenses includes many attacks and seems properly conducted.
- The proposed attack, GAMA, is effective in evaluating the robustness of many defenses, outperforming existing individual attacks.

Weaknesses:
- The paper proposes many different variations of the methods (different losses, training schemes, optimization schemes) which are used for different tasks, each with different parameters. Thus, it is sometimes difficult to follow which exact setup is used for each task. This also makes the method less general.
- Some of the variations introduced seem a bit hacky, e.g. changing the loss for adversarial training in alternate iterations.

Correctness: The claims and method seem correct, and the experiments properly done.

Clarity: The paper is well written, although in some parts, especially in the appendix, some details could be slightly more clearly presented.

Relation to Prior Work: The paper properly introduces prior work.

Reproducibility: Yes

Additional Feedback:
- In Eq. (1), shouldn't the margin loss have the opposite sign? In the current formulation, maximizing L as in Eq. (1) would increase the logit of the true class relative to the others (or am I missing something?).
- From my point of view, the main contribution of the paper is training robust models with 1-step methods that achieve better robustness than Wong et al. [31] (as acknowledged by the authors, for multi-step adversarial training the improvement is minimal and, in my opinion, not significant), and this could be further emphasized. In [31] the models also suffer from catastrophic overfitting, i.e. the robust training fails depending on the random seed. How stable is 1-step GAT with respect to different seeds? Does it always yield similar results, or is there high variance?
- Following on from my previous point, I think it would be helpful and interesting to analyse in more detail why the proposed regularization helps 1-step adversarial training (my guess is that the loss landscape is made smoother, so that FGSM is almost as effective as PGD).
- According to the Appendix, GAMA-PGD uses an initial step size of 2\epsilon, which is later reduced. This seems similar to what is proposed in APGD. Have you tried using the proposed loss within the APGD framework?

Overall, I think the paper proposes an effective method. In my opinion, the presentation could be improved since, as said above, the many variations give the impression that many methods are proposed for different tasks.

### Update post rebuttal ###
I thank the authors for the detailed response. My opinion about the paper remains unchanged. I invite the authors to improve the clarity of the presentation and the discussion of prior work (in particular, I think a comment about the loss function of [29] could be useful, as it shares some similarities with the proposed one) for the benefit of the reader.
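
The sign question about Eq. (1) raised in the first point above can be checked on toy numbers. The following is a minimal sketch, assuming the common untargeted convention in which the margin is the largest wrong-class logit minus the true-class logit; `margin_loss` and the logit values are hypothetical and are not taken from the paper or its code.

```python
def margin_loss(logits, true_class):
    """Largest wrong-class logit minus the true-class logit (assumed convention)."""
    best_other = max(z for j, z in enumerate(logits) if j != true_class)
    return best_other - logits[true_class]


# Hypothetical logits for a 3-class problem; class 0 is the true class.
logits_clean = [3.0, 1.0, 0.5]  # true class has the largest logit
logits_adv = [1.0, 2.5, 0.5]    # class 1 now has the largest logit

loss_clean = margin_loss(logits_clean, 0)
loss_adv = margin_loss(logits_adv, 0)

# With this sign, a larger loss means a more successful attack, so the
# attacker maximizes it. The negated loss is larger at the clean point,
# so maximizing the negated loss would favour correct classification.
print(loss_clean, loss_adv)
print(-loss_clean > -loss_adv)
```

Under this convention the loss is negative whenever the true class still has the strictly largest logit, which is why maximizing a loss with the opposite sign would push towards correct classification rather than away from it.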

Review 2

Summary and Contributions: The authors propose adversarial attacks that incorporate an additional relaxation term to the standard loss as used in existing attacks. The attacks can be used to generate adversarial examples to train robust models. Experimental results show that the proposed method outperforms other single-step training methods.

Strengths:
+ This paper rethinks the optimization procedure of PGD and proposes a new method that outperforms PGD in certain scenarios.
+ Extensive experiments have been performed for testing the proposed method.

Weaknesses: I'm confused about the objective function of the proposed attacks in Eqn (1). Let us consider generating an adversarial example using a clean image for initialization. In the first step, the third term in Eqn (1) is zero, so the first two terms dominate the loss, and the first optimization step minimizes the confidence score of the label class and maximizes the confidence score of one other class. Then, in the second step, the third term in Eqn (1) plays a role similar to that of the maximum margin loss. Table 2 in the paper also demonstrates that the proposed l2 loss helps only to a very limited extent, and it seems that the maximum margin loss developed in CW's attack [1] contributes much more. Can the authors provide more discussion of how the proposed l2 loss helps, theoretically and empirically?

I have checked the submitted code and found that the l2 loss is seemingly encouraged to be minimized in the code implementation, rather than maximized as described in the paper.

[1] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP). IEEE, 2017.

-------------------Post rebuttal--------------------------
I would like to thank the authors for their response to my comments. My concerns have been partially addressed and thus I'm happy to raise the score to accept.
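
The observation above that the third term vanishes at a clean initialization can be illustrated numerically. The following is a minimal sketch, assuming the third term is the squared l2 distance between the softmax outputs at the perturbed and clean inputs; the tiny linear model `toy_logits`, its weights, and the helper names are hypothetical and are not the authors' code.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(x):
    # Hypothetical 2-input, 3-class linear model (illustration only).
    weights = [[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relaxation(x_clean, delta):
    """Assumed third term: squared l2 distance between softmax outputs."""
    p_clean = softmax(toy_logits(x_clean))
    x_adv = [a + d for a, d in zip(x_clean, delta)]
    p_adv = softmax(toy_logits(x_adv))
    return sum((a - b) ** 2 for a, b in zip(p_adv, p_clean))


x = [1.0, 0.5]
value_at_clean = relaxation(x, [0.0, 0.0])

# Central finite-difference gradient of the term at delta = 0.
h = 1e-4
grad_at_clean = []
for i in range(len(x)):
    plus = [h if k == i else 0.0 for k in range(len(x))]
    minus = [-h if k == i else 0.0 for k in range(len(x))]
    grad_at_clean.append((relaxation(x, plus) - relaxation(x, minus)) / (2 * h))

value_after_step = relaxation(x, [0.1, -0.1])
print(value_at_clean, grad_at_clean, value_after_step)
```

In this toy setting both the value and the gradient of the term are (numerically) zero at the clean point, so it cannot influence the first step of an attack started from the clean image; it only becomes active once the perturbation is non-zero, for example after a random initialization or a first step.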

Correctness: See the box above.

Clarity: Yes, the paper is well written.

Relation to Prior Work: The connection between the proposed method and CW's attack should be further discussed.

Reproducibility: No

Additional Feedback:

Review 3

Summary and Contributions: This manuscript introduces a modified loss function for PGD adversarial attacks that finds more suitable gradient directions, increases attack effectiveness, and leads to more effective adversarial training.

Strengths: The method is extensively compared to several other SOTA attacks and adversarial training methods on a battery of different neural network models. The results are promising, especially taking into account the much lower computational overhead of the method compared to standard PGD. I am not aware of a similar loss function proposed in the literature.

Weaknesses: It would be good to get a better theoretical understanding of the regularization term and its effect on the optimisation. Under what conditions can you expect that the modified gradients are more closely aligned with the "optimal" descent direction? Why is this regularization term better than e.g. the cross-entropy term (which also takes into account all latents and should be more biased towards latents with more sensitive probability scores)?

Correctness: The results look sensible and the reported values for competing methods match the literature, except maybe FAB on Madry et al for which [1] reports 45.37% instead of 50.67% as reported here. Maybe the authors can comment on this deviation. [1] Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks, https://arxiv.org/abs/2003.01690

Clarity: The paper is overall well written and the story line is easy to follow.

Relation to Prior Work: Overall, the manuscript reviews most related work, although it could cover a wider range of methods, such as SPSA [1], L2 attacks like DDN [2], or the Brendel & Bethge attack [3]. The latter two also tend to be quite a bit stronger than C&W in practice, which could strengthen the L2 evaluation in Table 3.
[1] https://arxiv.org/pdf/1802.05666.pdf
[2] https://arxiv.org/pdf/1811.09600.pdf
[3] https://papers.nips.cc/paper/9446-accurate-reliable-and-fast-robustness-evaluation.pdf

Reproducibility: Yes

Additional Feedback: * What is the initial value for lambda? How sensitive is the method to different choices?

-- Post-Rebuttal Statement --
The rebuttal addressed many concerns raised by me and others, and I am happy to keep my positive assessment of the work.

Review 4

Summary and Contributions:
*********************Update***********************
I want to thank the authors for their response, much of which sufficiently addresses my concerns. I am changing my score to a 6. The reason I am not raising it to a 7 is that, for an empirical paper with little theory, I would have expected more analysis (not just single-example loss landscapes or ablation studies): more detailed analysis and carefully designed experiments that shed light on what the regulariser might actually be doing to the neural network. Either way, the regulariser is easy to try and the one-step results are very good, so I will change my score to a 6.
******************************************************************
This paper adds a regularisation term to the objectives used for adversarial attacks and demonstrates its efficacy in both defence and attack scenarios. Crucially, the authors show that this objective can be used for a single-step defence, a setting that has previously been shown to suffer from gradient obfuscation.

Strengths: Empirically, it seems to achieve good adversarial accuracy with a single step of PGD, namely 49% adversarial accuracy on CIFAR-10. This seems like a scalable method for tasks such as ImageNet.

Weaknesses: I'm a little concerned about the strength of the adversarial attack introduced; the key reason for my worry is given in the Correctness section below.

For the one-step defense, the nominal accuracy (80%) on CIFAR-10 seems too low for WRN-34-10. I thought this should be above 83% at the very least. Do the authors have an intuition as to why the nominal accuracy is so low?

One axis the paper has missed is the optimiser used for the objective. How does this adversarial attack compare to the margin loss with Adam optimisation, or even to using the gradient rather than the sign of the gradient?

There are no ablations on how the amount of regularisation added to the loss affects the attack, and very little theoretical or empirical analysis of how this regularisation term guides the attack. In other words, there is no theoretical grounding for why this would be a better objective for optimisation; it seems to be an empirical argument, which is fine, but then more analysis is needed to back up the intuition. I see that Figure 1 gives an example of when this regularisation is useful, but does this example hold true in practice?
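
The distinction drawn above between stepping along the sign of the gradient and stepping along the gradient itself can be made concrete. The following is a minimal sketch of one l-infinity-constrained ascent step under each rule; the helper names, the max-norm normalisation in `step_raw`, and the numbers are assumptions for illustration and do not reproduce the authors' implementation or any particular library.

```python
def clip(value, low, high):
    return max(low, min(high, value))


def sign(g):
    return (g > 0) - (g < 0)


def step_sign(delta, grad, alpha, eps):
    """Sign-gradient step (FGSM/PGD style): every coordinate moves by alpha."""
    return [clip(d + alpha * sign(g), -eps, eps) for d, g in zip(delta, grad)]


def step_raw(delta, grad, alpha, eps):
    """Step along the gradient itself, scaled so the largest move is alpha."""
    scale = max(abs(g) for g in grad) or 1.0  # avoid division by zero
    return [clip(d + alpha * g / scale, -eps, eps) for d, g in zip(delta, grad)]


eps = 8 / 255
alpha = 2 / 255
delta = [0.0, 0.0, 0.0]
grad = [0.5, -0.001, 0.0]  # hypothetical loss gradient w.r.t. three pixels

signed = step_sign(delta, grad, alpha, eps)
raw = step_raw(delta, grad, alpha, eps)
print(signed)
print(raw)
```

The sign step moves the second coordinate as far as the first even though its gradient is 500 times smaller, whereas the raw step preserves the relative magnitudes. An adaptive optimiser such as Adam sits somewhere in between, which is one reason the choice of update rule can change how strong a baseline attack appears.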

Correctness: What was most striking in the results section is the adversarial accuracy obtained with their attack on Madry et al.'s network. For Multi-Targeted attacks and AutoAttack, the accuracy obtained is around 44%. But with their "strong" attack, they can only obtain 49.81 - 50%? This is worrying: even in their Table 3 they show that the adversarial training obtains 44, so why is the adversarial accuracy reported for their attack on Madry et al.'s network so high in Table 1?

For the Multi-Targeted attack in Figure 2, the paper shows that with 1 random restart it only reaches 66% adversarial accuracy on CIFAR-10 at 8/255. In my experience, Multi-Targeted definitely gets below 60%; it should be around 55% or lower. This might be because the authors used only the sign of the gradient and the wrong hyperparameters; note that in the Multi-Targeted paper many of the experiments were done with Adam optimisation. Another comment is that for the Multi-Targeted attack comparison in Table 1, the authors chose 5 random targets, whereas in the MT paper the targets are the classes with the second-highest, third-highest, etc. logits. This should be changed.

I think Eq. 1 has some signs wrong. The objective is maximised (as explained in the paper and shown in the algorithm), but as written it maximises the probability of the true label while minimising the probability of the other labels, which is wrong, unless the objective is in fact a targeted attack, in which case the text doesn't match the equation. Given this is the only equation in the entire paper, the authors should not have got this wrong.

The loss surface plots do seem to show that GAT produces a non-obfuscated loss surface, but it is not specified whether these plots are for single-step or multi-step GAT. The claim that the single-step method does not cause gradient obfuscation should be backed up with loss surface plots as well.
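
The two target-selection schemes contrasted above (random targets versus the highest-scoring wrong classes) can be sketched as follows. This is only an illustration of the distinction as described in this review; the function names and logit values are hypothetical, and the code is not taken from the Multi-Targeted paper or from the submission.

```python
import random


def top_logit_targets(logits, true_class, k):
    """Pick the k wrong classes with the highest logits (second-highest, third-highest, ...)."""
    others = [j for j in range(len(logits)) if j != true_class]
    others.sort(key=lambda j: logits[j], reverse=True)
    return others[:k]


def random_targets(num_classes, true_class, k, rng):
    """Pick k wrong classes uniformly at random, without replacement."""
    others = [j for j in range(num_classes) if j != true_class]
    return rng.sample(others, k)


# Hypothetical logits for a 6-class problem; class 0 is the true class.
logits = [4.0, 0.1, 2.5, -1.0, 3.1, 0.7]

ranked = top_logit_targets(logits, true_class=0, k=3)
sampled = random_targets(len(logits), true_class=0, k=3, rng=random.Random(0))
print(ranked)   # classes ordered by decreasing logit
print(sampled)  # depends on the seed
```

Ranking by logit concentrates the attack budget on the classes closest to the decision boundary, whereas random selection may spend restarts on targets that are far away, which could partly explain a weaker baseline.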

Clarity: The introduction and related work sections are well written. The tables are formatted in a way that makes it hard to distill what exactly the advantage of this method is. Maybe some of these tables can be moved to the appendix, with a smaller version in the main text that makes the delta in performance much easier to read and understand.

Relation to Prior Work: Regarding single-step methods, I think this paper also needs to be compared with methods that are similarly cheap computationally, such as CURE [1] and LLR [2], both of which need just 2/3 gradient calculations to avoid gradient obfuscation.
[1] Moosavi-Dezfooli, Seyed-Mohsen, et al. "Robustness via curvature regularization, and vice versa." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[2] Qin, Chongli, et al. "Adversarial robustness through local linearization." Advances in Neural Information Processing Systems. 2019.

Reproducibility: Yes

Additional Feedback: Some of my comments concern the baselines for the adversarial attack, but the discrepancies might be due to the authors using only the sign of the gradient. I think it is crucial for the authors to revisit optimisers such as Adam to test whether their adversarial attack can be even stronger. It is also very important to make sure that the baselines are what we should expect; currently this doesn't seem to be the case. I hope the authors make sure this is fully addressed.