Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Over the past few years, adversarial examples have received a significant amount of attention in the deep learning community. This paper approaches and addresses this important problem in a unique way by disentangling robust and non-robust features in a standard dataset.

I have a few queries:
1. By selecting robust or non-robust features for standard or adversarial training, how do you avoid over-fitting of models to these features?
2. In your proposed method, you randomly sample clean images from the distribution as the starting point of optimisation. Do the obtained images look similar to the source images or to the target images (the images which provide robust features in the optimisation)? If they are similar to the source images, doesn't that mean the robust features are not robust?
3. Why do the authors use distance in robust feature space as the optimisation objective? Is there a specific reason or motivation for this?

UPDATE AFTER REBUTTAL: I thank the authors for addressing my comments. The contribution is original, and the overall research problem has been nicely presented and addressed. I am increasing my score by 1 for this paper.
Quality: The paper is technically sound and tackles a non-trivial and important issue.
Significance: good (see comment on the contributions).
Originality: The contributions are original (but it would be better to position the paper in the main body, not in the appendices).
Clarity: While the English is good and the sentences are in general pleasant to read, the paper lacks clarity, mainly because it is poorly organized. Important aspects of the paper can only be found in the appendices, making the main body difficult to fully understand and not self-contained. The most important aspects in question are:
- the review of prior art allowing to position the paper;
- the algorithms allowing to generate the modified inputs of sections 3.1 and 3.2.
Clarity would also be greatly improved by exemplifying things. The experiments are made in a deep learning context; it would be good to explain the introduced concepts in this setting progressively throughout the paper.

Comment: The utility and robustness analyses are made feature by feature, yet the most discriminative information is often obtained by combining features. For example, take a two-concentric-circles dataset (such as those generated by the make_circles function from sklearn). Each raw feature achieves zero usefulness, whereas the feature corresponding to the distance from the circles' origin achieves maximal utility and is obtained by mixing the raw features. If the function space in which features live is extremely general, it can be argued that any combination of the raw features belongs to it, and the definition therefore matches its purpose. Incidentally, I believe that a feature with maximal utility is any non-rescaled version of p(y|x). I don't think this challenges the validity of the authors' contributions, but maybe they should comment on it, because the definition might at first sight seem too simplistic.

Remarks:
- Following line 83, the investigated classifiers are perceptron-like. To what extent is this a limit of the authors' analysis?
- Is this general definition useful for other parts of the paper?
- Fig. 2, right side: the ordering of methods in the caption and in the figure is inconsistent. What adversarial technique is specifically used to obtain this figure? I later found out that this is explained in appendix C, but it could be briefly mentioned in the caption.
- Line 84: it's unclear to me that the features are learned by the classifier; the parameters w_f are learned, but not the mappings f.
- Line 107: "vulnerability is caused by non-robust features and is not inherently tied to the standard training framework". Aren't there training frameworks prone to producing non-robust features? (This sounds like a chicken-or-egg causality dilemma.)
- Lines 124-135: very unclear. Some explanations from appendix C need to be brought back here.
- Line 159: the definition of the deterministically modified class labels t is ambiguous. Do we have t = -y, or something more subtle? (Again, this is clarified in appendix C.)
- Fig. 3: how is transfer performed, and what is the transfer success rate? A short recap would be appreciated (it can be in the appendices).
- The theoretical analysis of section 4 is interesting and illustrative, yet based on a particular (and very simple) setting. To what extent can these results be generalized to more general situations?

UPDATE AFTER REBUTTAL: I thank the authors for their additional comments. I am confident that they can improve clarity as they committed to in their feedback, and that their work can lead to interesting further developments. score += 1
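The two-concentric-circles point raised above (each raw feature is useless on its own, while their combination is maximally useful) can be checked with a small numpy sketch. This is an illustrative toy, not code from the paper; "usefulness" is approximated here as the correlation E[y f(x)] with f standardised, and the dataset is generated directly rather than via sklearn's make_circles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-concentric-circles dataset (in the spirit of sklearn's
# make_circles): inner ring -> y = +1, outer ring -> y = -1.
n = 2000
theta = rng.uniform(0, 2 * np.pi, n)
ring = np.where(rng.random(n) < 0.5, 0.5, 1.0)  # inner vs outer radius
y = np.where(ring < 0.75, 1.0, -1.0)
ring = ring + rng.normal(0, 0.02, n)            # small radial noise
x1, x2 = ring * np.cos(theta), ring * np.sin(theta)

def usefulness(f, y):
    """Correlation-style usefulness E[y * f(x)] after standardising f."""
    f = (f - f.mean()) / f.std()
    return float(np.mean(y * f))

# Mixed feature: distance from the circles' origin.
r = np.sqrt(x1**2 + x2**2)

print(abs(usefulness(x1, y)))  # each raw coordinate: ~0
print(abs(usefulness(x2, y)))
print(abs(usefulness(r, y)))   # radial combination: ~1
```

By rotational symmetry, each raw coordinate is uncorrelated with the label, while the radius separates the classes almost perfectly, which is exactly the reviewer's objection to purely feature-by-feature analysis.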
This paper offers a new perspective on the phenomenon of adversarial examples: imperceptibly small feature perturbations that cause state-of-the-art models to output incorrect predictions. While previous work related this phenomenon to peculiarities of high-dimensional spaces or to the local linearity of the models, this paper posits that adversarial examples are in fact due to the sensitivity of models to *well-generalizing* features. To present their results, the authors define *robust* features as features that remain useful under adversarial manipulation. The central hypothesis is that there exist both robust and non-robust features for image classification, and the authors present empirical evidence for this premise by explicitly disentangling the two sets of features.

First, they construct a “robustified” dataset by leveraging a pre-trained robust classifier to explicitly remove non-robust features. More specifically, they construct robust features by optimizing the input (with gradient descent) to match the penultimate layer of the pre-trained robust classifier. Next, they train a standard classifier on these robust features and show that the resulting model achieves strong standard classification and *robust* classification (i.e., it is resistant to adversarial examples).

Second, the paper introduces a “non-robust” dataset by exploiting a pre-trained classifier to add adversarial perturbations to existing inputs (and relabeling the corresponding outputs, either uniformly at random or deterministically according to the original class label). As a consequence, these examples appear mislabeled to humans. They then train a standard classifier on these non-robust features and show that such models still generalize on the original test set. In other words, even though the robust features are distracting the training signal, these models learn to pick up the non-robust features (which still achieve strong performance).
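The “robustified” dataset construction summarized above (gradient descent in input space until the penultimate-layer representation matches that of a target image) can be sketched in miniature. This is a hedged toy, not the paper's implementation: a one-layer tanh map g stands in for the penultimate layer of a hypothetical pre-trained robust classifier, the gradient is written out by hand, and all sizes and learning rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the penultimate layer of a (hypothetical) pre-trained
# robust classifier; in the paper this is a deep adversarially trained net.
d, k = 32, 16
W = rng.normal(0, 1.0 / np.sqrt(d), (k, d))

def g(x):
    return np.tanh(W @ x)

def robustify(x_start, x_target, lr=0.3, steps=1000):
    """Gradient-descend in input space so that g(x) matches g(x_target)."""
    x = x_start.copy()
    h_t = g(x_target)
    for _ in range(steps):
        h = g(x)
        # gradient of 0.5 * ||tanh(W x) - h_t||^2 with respect to x
        grad = W.T @ ((h - h_t) * (1.0 - h**2))
        x -= lr * grad
    return x

x_start = rng.normal(size=d)   # "random clean image" as the starting point
x_target = rng.normal(size=d)  # image supplying the robust features
x_r = robustify(x_start, x_target)

loss0 = float(np.sum((g(x_start) - g(x_target)) ** 2))
loss1 = float(np.sum((g(x_r) - g(x_target)) ** 2))
print(loss0, loss1)  # feature-space distance shrinks substantially
```

The resulting x_r carries (approximately) only the features that g encodes, which is the mechanism the paper uses to strip non-robust features from the dataset.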
Finally, the paper also studies this phenomenon for maximum likelihood classification of two Gaussian distributions. They show 1) that adversarial vulnerability is essentially explained by the difference between the data-induced metric and the l2 metric, and 2) that the gradients of more robust models are better aligned with the adversary's metric (which supports some recent empirical observations).

Strengths:
1. A very well-written paper with a new, thought-provoking perspective on adversarial examples.
2. A very creative and thorough set of experiments to support the claims.

Questions:
1. Looking at the examples of the robustified dataset, they appear to be more prototypical examples of the classes. Did you observe that the produced examples are less diverse than the original dataset?
2. The classifier trained on robust features appears to be more adversarially robust on ImageNet than on CIFAR-10 (comparing Fig. 2 with Fig. 12 in the supplementary material). Is there a good explanation for this difference?
3. I'm intrigued by the experiments on the non-robustified dataset, as the robust features now confuse the training signal for the classifier. Although the experiments show that the classifier mostly relies on the non-robust features, I'm wondering how sensitive this result is to the epsilon parameter in the adversarial generation process?

UPDATE AFTER REBUTTAL: Thanks for addressing my questions. I'm keeping my score.
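The metric-mismatch point in the Gaussian setting above can be illustrated numerically. This is an illustrative sketch, not the paper's analysis: for classes N(+mu, Sigma) and N(-mu, Sigma), the maximum-likelihood decision boundary is linear with normal w = Sigma^{-1} mu, so an l2-bounded adversary perturbs along w; when Sigma = I the data-induced and l2 metrics coincide and w aligns with mu, while an anisotropic Sigma (chosen arbitrarily here) tilts w away from mu.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two direction vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

mu = np.array([1.0, 1.0])

# Isotropic data: the two metrics coincide; w = Sigma^{-1} mu points along mu.
w_iso = np.linalg.solve(np.eye(2), mu)

# Anisotropic data: the metrics differ, and w tilts toward the
# low-variance (hence highly discriminative but fragile) direction.
Sigma = np.diag([0.05, 1.0])
w_aniso = np.linalg.solve(Sigma, mu)

print(cosine(w_iso, mu))    # 1.0: gradient aligned with the mean direction
print(cosine(w_aniso, mu))  # noticeably below 1: misaligned metrics
```

The misalignment in the anisotropic case is the toy analogue of result 1) above: the l2 adversary exploits directions where the data-induced metric and the l2 metric disagree.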