Review for NeurIPS paper: Perturbing Across the Feature Hierarchy to Improve Standard and Strict Blackbox Attack Transferability

NeurIPS 2020

Review 1

Summary and Contributions:
- This paper proposes a black-box, transfer-based adversarial attack method that utilizes the intermediate features of a white-box model. The proposed method uses multiple layers for a stronger attack.
- The proposed method has two components: (1) cross-entropy terms for enhancing targeted attacks, and (2) aggregation of multiple layers. Both components improve attack performance (in terms of error / tSuc).
- This paper provides extensive experimental results, including cross-distribution scenarios.
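As a rough illustration of the two components summarized above, the following minimal sketch sums one targeted term per intermediate layer and adds a cross-entropy term on the white-box model's output. It is not the paper's exact objective: the function name `aggregated_targeted_loss`, the use of negative log-probabilities as the per-layer terms, and the `layer_weights` and `ce_weight` parameters are all assumptions made for illustration. Each entry of `layer_probs` stands for the class probabilities that a layer's auxiliary model assigns to the perturbed input's features.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def aggregated_targeted_loss(layer_probs, final_logits, target,
                             layer_weights=None, ce_weight=1.0):
    """Hypothetical multi-layer targeted loss (lower is better for the attacker).

    layer_probs:   list of 1-D arrays; entry l holds the class probabilities
                   produced by an auxiliary model on layer l's features
                   (assumed to be already normalized).
    final_logits:  1-D array of the white-box model's output logits.
    target:        index of the target class.
    layer_weights: optional per-layer weights (assumption; defaults to 1.0 each).
    ce_weight:     weight of the cross-entropy term (assumption).
    """
    if layer_weights is None:
        layer_weights = [1.0] * len(layer_probs)
    eps = 1e-12  # avoids log(0)

    # Component (2): aggregate (sum) one targeted term per intermediate layer.
    layer_term = sum(
        w * -np.log(p[target] + eps)
        for w, p in zip(layer_weights, layer_probs)
    )

    # Component (1): cross-entropy of the final output with the target class.
    ce_term = -np.log(softmax(final_logits)[target] + eps)

    return layer_term + ce_weight * ce_term


if __name__ == "__main__":
    probs = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
    logits = np.array([0.0, 0.0])
    print(aggregated_targeted_loss(probs, logits, target=1))
```

In an actual attack, a loss of this kind would be minimized with respect to the perturbation under a norm constraint; the sketch only shows how the per-layer terms and the cross-entropy term combine.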

Strengths:
- The proposed method is simple yet effective in various black-box attack scenarios.
- This paper considers more realistic attack scenarios, e.g., cross-distribution.

Weaknesses:
- The first major concern is the limited methodological contribution compared to FDA. The proposed method just aggregates (i.e., sums) the FDA objectives of multiple layers and adds a cross-entropy term like other attack methods; in other words, these approaches are straightforward. Although the improvements of the proposed method are meaningful, the results are neither surprising nor interesting.
- Secondly, the comparison between TMIM/SGM and the FDA-based frameworks seems to be unfair. TMIM/SGM methods do not use the white-box model's training data, while FDA-based frameworks use that data to train the auxiliary functions g. In my opinion, access to only pre-trained white-box models differs greatly from access to the whole training data, and thus the latter setting uses more knowledge than the former. So the improvements over the baselines seem to be somewhat overclaimed, especially when the white-box and black-box models are trained on the same dataset. If using the intermediate features is crucial in adversarial attacks, then how can the features be utilized without the training data? The authors partially cover this issue in the "cross-distribution" scenarios (Section 4.2), but in that case the source's and target's label spaces largely overlap. I think a harder case should be considered; for example, all labels are exclusive, or the number of available training samples is small.
- Is the greedy layer optimization important? How about selecting layers heuristically, for example, the feature maps right before the pooling layers?

=========
I generally agree with the authors' response to my concerns.

Correctness: Yes.

Clarity: This paper is generally well written. The following are some typos and suggestions:
- Clarification is required for general readers who are not familiar with adversarial attacks. In Equations (1)-(3), using g_{l,y}(f_l(x+delta)) instead of p(y|f_l(x+delta)) would be better for understanding, because the latter can be mistaken for the softmax output of the model f.
- typo: Line 18, "the the".
- typo: Line 52, "i.e." -> "i.e.,"

Relation to Prior Work: While I'm not familiar with the adversarial attack literature, I think this paper discusses prior work adequately.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper proposes a targeted transfer-based blackbox attack by allowing perturbations in the intermediate layers. As it builds on an already existing framework (FDA), the method might be lacking in novelty; however, the empirical evaluation is quite extensive.

Strengths: The paper is well written and the experiment section seems thorough.

Weaknesses:
1. It might be worth mentioning a connection between adversarial examples and counterfactuals. Can the proposed framework be used in interpreting/explaining a model's predictions or behaviour?
2. Can you include some visuals of the generated adversarial examples?
3. From Equation 3, we see that each layer contributes differently. Is there any intuition about which layers' perturbations might have a bigger impact (more specifically, are the deeper layers more important)?
4. Would it be possible to have a human sample evaluation for the attack?
5. Could the authors give an opinion on whether the success of the proposed attack can be explained by "robust vs. non-robust features" (Adversarial Examples Are Not Bugs, They Are Features https://arxiv.org/abs/1905.02175)?
6. Will the code be made available?

Correctness: To the best of my knowledge the method seems correct.

Clarity: Yes.

Relation to Prior Work: A discussion of other existing feature-space attacks could be included:
- "Towards Feature Space Adversarial Attack" (https://arxiv.org/abs/2004.12385)
- "Constructing Unrestricted Adversarial Examples with Generative Models" (https://arxiv.org/pdf/1805.07894.pdf)

Reproducibility: No

Additional Feedback:
------------------------------------------------------------
Update after Author Feedback and Discussion
------------------------------------------------------------
Thank you to the authors for their detailed feedback. My questions and concerns have been mostly addressed/answered, so I am raising my score.

Review 3

Summary and Contributions: This paper discusses an approach for blackbox transfer-based adversarial attacks on DNNs. The approach is a straightforward extension of prior work based on layer-wise perturbation of the feature maps of a white-box model, extended to incorporate multiple layers. The authors show that the proposed extension yields significantly better targeted attack accuracy on a number of combinations of source-target models and under a variety of conditions in terms of overlap between classes and training data.

Strengths: The results obtained by the authors on various combinations of white-box and black-box models for transfer of adversarial attacks are very impressive. Furthermore, the improvements are quite consistent even in the extended study where the training data and labels do not match for the two models. The paper is clearly written and explains its contributions well. The experiments are quite detailed and cover a large number of practical conditions.

Weaknesses: Some of the multi-intermediate-layer analysis presented in Section 4.1.4 is not clear. For example, which layers were used to perturb the feature maps in FDA(1) and the other multi-layer models isn't shown. Will these plots change depending on which layers were chosen by the perturbation algorithm, or are they independent of the chosen layers? I also couldn't find any discussion of the time complexity involved in extending the attack generation process to multiple layers. Although the authors do talk about such costs in Section 4.3 for the query-based extension, it would be good to comment on their own method as well, compared to simple FDA.

Correctness: This is mostly an empirical paper and the experiments performed look fine to me.

Clarity: The paper is well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I have read the authors' rebuttal and I am fine with that.