Paper ID: | 7666 |
---|---|

Title: | On Robustness to Adversarial Examples and Polynomial Optimization |

The paper studies the robustness of machine learning classifiers against adversarial examples. The authors use semi definite programming (SDP) in order to obtain provable polynomial-time adversarial attacks against hypothesis classes that are degree-1 and degree-2 polynomial threshold functions. With the same techniques the authors design robust classifiers for these two hypothesis classes. Furthermore, the authors show that learning polynomial threshold functions of degree 2 or higher in a robust sense (against adversarial examples) is computationally a hard problem. The authors also experiment with 2-layer neural networks and with an SDP method they either find adversarial examples or certify that none exist within a certain budget delta available to the adversary. First of all, there are parts of the paper that are very well written and other parts that I assume were rushed for the NeurIPS submission deadline. I admit guilty of not having read the supplementary material, but it appears that in the current form of the paper, significant information is laid in the appendix; there are cases where important information, problems and equations are referred to from the main text and appear in the appendix. Having said that this is not the only issue with the paper. Regarding adversarial examples, the authors define as success the criterion f(x+delta) != f(x), where f is the learned classifier. On the other hand, the authors also want to discuss and have claims in the context of an adversarial loss, where success is now defined to be f(x+delta) != y, where y = f(x); i.e., y is the true label of the original instance x. It appears that the authors know about these distinctions due to the subtle sentence in line 36 of the paper, but somehow do not present clearly the context in which their results hold. In fact, they even cite the paper [11] which is explicitly about distinctions on such notions. Along these lines, the Bubeck et al papers mentioned in line 105 are using a yet, third definition, where now f(x+delta) is compared with the true label of the ground truth function at the instance x+delta. In this sense, comparing the work of this paper with the line of work along the Bubeck et al papers, is simply wrong. I think a journal version of the paper is probably more appropriate as it appears that the authors have several results and lots of content, but somehow there is still work that needs to be done on the presentation. For this reason, I think the 8 pages allowed by NeurIPS, are somehow too little for good presentation of the material, or at least, this is the feeling that one gets after reading the paper. --------------------------------------------------- Comments after rebuttal: I am happy with the response that the authors gave to my part. However, I have one more remark, which is related to the response of the authors to comments of other reviewers. In particular I would like to see the authors add 2-3 sentences and clarify that the results that they have regarding the robustness of the learned classifiers hold under conditions -- if misclassification is the notion against which they measure the robustness. In particular, the two conditions are: (i) initially f(x) is equal to the true label c(x) of the ground truth c, (ii) the label of the ground truth does not change in the regime of the perturbation delta; i.e., c(x+delta) = c(x). > I trust that the authors will provide such a clarification in the final version of the paper and for this reason I am raising the score from 5 to 6. Why such a clarification is important? Some explanations below: o The definition that the authors use for adversarial examples is in Lines 118-119 and completely ignores the ground truth that provides the true labels. This definition is referred to as "prediction change" in [11]. o A successful adversarial perturbation delta yields f(x+delta) != f(x). As such, it can be the case that an initially incorrectly classified instance x has to be pushed to an x+delta (i.e., delta is strictly nonzero) and in fact now for x+delta it may even hold that f is predicting the correct label at that point with respect to the ground truth! This is of course contrary to the aim that we have for adversarial perturbations where the adversary aims to make the learner make a mistake and of course no push was necessary. o As another example, consider the case where the learner comes up with the hypothesis f that is precisely the ground truth c (this is common, e.g., for Boolean functions). In such a case, for non-constant functions c, one can always find an adversarial perturbation under the "prediction change" definition, when in fact f can never err on any instance in the instance space. In other words, again "prediction change" does not capture fully our aim for misclassification when dealing with adversarial examples. Thank you for a very interesting paper!

Contribution 1: The main contribution here in my opinion is the conceptual connection between polynomial optimization and adversarial attacks of PTFs, and showing that finding adversarial examples can be used to robustly learn these PTFs via a convex program. The actual algorithmic result is not super interesting as it only holds for degree-1 and degree-2 PTFs---and learning degree-1 PTFs (in other words just a linear classifier) robustly is easy as it is just a max-margin problem. Contribution 2: I find the lower bounds to be more interesting that the above upper bounds. As also mentioned by the authors, previous hardness results by Bubeck et al. were for a worst-case cryptographic instance, whereas the bounds in the paper are for a much more natural problem of learning a simple PTF. The only drawback here is that the result only holds for proper learning, whereas the result in Bubeck et al. held for any learner, but this is still a good contribution. Contribution 3: The experiments show that the SDP formulation to find attacks on neural networks could be a promising direction of future work. The main bottleneck here is the computational cost. Maybe using non-convex methods such as Burer-Monteiro could help? Clarity: The paper is extremely well-written (even the parts I read in the supplementary read very well). ------Post author response------- I thank the authors for their detailed response. The response seems adequate, hence I am keeping my previous score of 7. As suggested by Reviewer 1, the authors should clarify the definition of adversarial examples in the next revision of the paper. I have one small suggestion about the 2-layer neural network results which could help clarify its significance to the readers. The experimental result here could have potential given that it seems to succeed where PGD fails, though the main concern here is the computational cost. It would be good if the authors also compared with other SMT/MIP based methods for exact verification, as these usually do better than PGD but are also slower.

Summary of Contributions: This work considers the problem of adversarial robustness, where the goal of the learner is to minimize adversarial or robust loss (as defined in Madry et al. 2017, the population risk modified to account for worst-case attacks in a L_infty ball of radius delta around the distribution). The focus here is on provably efficient robustness, and to this end the authors introduce a notion of approximately robust learning, where an algorithm gamma-approximately robustly learns the class if the robust loss is at most epsilon for radius delta/gamma, when given a labeled sample of size polynomial in 1/eps and the VC dimension of the class. The definition assumes that the sample is labeled by a function in the class and that function is gamma-robust with error 0. Results given for this framework: It is computationally hard to approximately robustly learn the class of delta-robust realizable degree-2 PTFs with gamma = O(log^c n) for some constant c. (It is also NP-hard for gamma=1 for even constant eps.) Degree-2 PTFs can be approximately robustly learned in polynomial time with gamma=O(sqrt{log n}) and halfspaces with gamma=1. Polynomial time algorithms for finding an adversarial example with l_infty distance at most delta * O(sqrt{log n}) when one exists at distance delta, or proof that no such example exists. An algorithm for finding adversarial examples for 2-layer neural networks (with some additional constraint on the last layer of the network or on the resulting SDP), with gamma~sqrt{log n} delta / m or a certifying that no such example exists with l_infty ball of radius delta around x and has margin > m from the decision boundary. Experiments for 2-layer neural networks using the algorithm from 4 on a network trained for MNIST. The experiments suggest that the assumption underlying 4 may hold in practice and the performance of the attack is compared to PGD from Madry et al. In this particular experiment, the attack seems to have advantage over PGD both when delta=.3 and .01, although it is computationally more expensive. Strengths/Weaknesses The impact of computational efficiency on the tasks of adversarially robust training and adversarial attacks is increasingly important and not yet fully understood; this paper proposes a connection to polynomial optimization and a model for understanding this impact. While it shows several non-trivial results in this model, due to the definition of the model, I’m not sure I understand the implications to the original problem, especially regarding the results for deg-2 PTFs given here. One limiting aspect of the definition (mostly for the lower bound) is that the learner should also output a deg-2 PTF. For the upper bounds, I’m not sure I understand the requirement that the target PTF must have robust error =0. Please correct me if I’ve misunderstood, but it seems like when learning LTFs over {0,1}^n, the learning algorithm need not learn decision lists, since there would be no deg-1 LTF with robust error=0 for even small values of delta. If this is not an unnatural requirement (or if this isn’t actually a requirement for the definition), perhaps the authors can clarify why. When coupled with the fully realizable/proper requirement, Deg-2 LTFs feels like a fairly restrictive class that may be of less interest to the adversarial attack community. The lower bound has an additional restriction on the behavior of learning algorithm (see last paragraph below). The results, especially the lower bound, are still interesting from a theoretical point of view (and we don’t expect to achieve computational hardness without some restriction on the hypothesis class or cryptographic assumption). The results and experiments for 2-layer networks should be of broader interest, but I feel unable to judge how reasonable the required assumptions are (also, there is no robustness result for these networks) or how meaningful the experimental results are. Overall, the quality and clarity of the work are good. I found the clarity in section 6 could easily be improved, for example it references a problem (9) that is not shown in the section. Basically this section is unreadable without referring to the supplementary material. Regarding Theorem 5.1, Remark 5.2, I didn’t see that this is a requirement from the statement? Unless I misunderstand, you should state this requirement in the Theorem statement, as it is a limitation of the result. It also is not mentioned in the supplementary material, and seems relevant to the discussion justifying corollary G.2. To summarize, I think this is interesting theoretical work with some limitations -- the results for 2-layer networks may make a significant difference in this regard, but I am not in a position to assess this.