NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6459
Title:Learning to Confuse: Generating Training Time Adversarial Data with Auto-Encoder

Reviewer 1

1) The problem formulation is correct and well supported by the experimental results. However, the authors do not report any baseline approaches and do not compare their framework to other approaches for this problem. 2) The writeup is good and easy to follow. The experiments are described in enough detail for others to reproduce them. 3) The algorithm is novel and seems significant enough for others to build on.

Reviewer 2

[Edit after the author feedback]: I thank the authors for addressing my comments during the author feedback. I have read the authors' response as well as the other reviews. The authors' response addresses my concern regarding the motivation of the proposed method. Overall, I think this submission provides a new 'adversarial' setting, and I have updated my overall score to "6: Marginally above the acceptance threshold".

==========================================================

Summary: This paper proposes a training time attack that is able to manipulate the behavior of trained classifiers, including deep models and classical models. The paper develops a heuristic algorithm to solve the expensive bilevel problem in Eq 4. The experimental results demonstrate the effectiveness of the proposed algorithm.

Pros:
- The proposed heuristic algorithm is efficient and able to find good solutions for the expensive bilevel problem.
- The adversarial training examples generated by the proposed algorithm can manipulate the predictions of the classifiers and transfer between different types of classifiers.
- The generated training time adversarial examples can transfer between different models, including deep and non-deep models.

Limitation & Questions:
- Is there any comparable baseline method for this training time adversarial attack?
- As shown in Figure 5, for the MNIST dataset, the model is 'robust' against the generated training adversarial examples when $\epsilon \leq 0.3$ and the percentage of adversaries is less than 60%. As 60% is already a large percentage, this makes the proposed attack less effective.

Typo:
- L230, should be $f_{\theta}(g_{\xi}(x))$.

There is an omission in the related work on the bilevel problem in data poison attack:

Reviewer 3

Post Response Comment:
==========================================
I think the authors have addressed my initial concerns; I therefore maintain my initial stance and am inclined to accept the paper.

Originality
=========================================
The setting is new as far as I can tell. Previous work such as "Certified Defenses for Data Poisoning Attacks" considers contaminated instances within a feasible set, but modifying each training point by a small amount for an offline learner is new to me. I saw a backdoor attack in the references ([5]), but it is not referred to in the main body. I think the difference between this attack and the backdoor attack is that this one doesn't require the backdoor pattern to be present at test time to activate.

Quality
=========================================
The paper is technically sound overall. The most interesting part of the algorithm is using an encoder-decoder net, instead of directly doing gradient ascent on the clean inputs, to generate attack instances.

Clarity
=========================================
The writing can be more compact. There are run-on sentences here and there. Despite these flaws, the ideas are clearly conveyed.

Significance
=======================================
My main concern about the paper is the practicality of the setting. The authors propose a (hypothetical) setting in the introduction, but I'd like to see more explanation of its purpose. Is it to protect privacy? If so, how is it different from using synthetic data, say generated by a GAN?

I also have two suggestions. First, since SVM is also undermined by the attack, is it possible to visualize the SVM weights? Maybe that would allow us to identify the pattern injected into the data. (Instead of a digit classifier, it might now detect whether a horizontal grey stroke is present.) Second, again for a convex model such as SVM or LR, the optimal model should satisfy the KKT conditions (e.g. gradient=0), which can be written as a function of the training data. Maybe these conditions can shed light on how and why the training data are perturbed the way they are.
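To make the two suggestions concrete, here is a minimal sketch of what such a diagnostic could look like. It is an illustration only, not the authors' method: it assumes L2-regularized logistic regression (rather than an SVM), uses random toy data as a stand-in for the perturbed training set, and the names `fit_logreg`, `grad`, `lam`, and `weight_image` are hypothetical. At the optimum the gradient vanishes, so the weight vector can be written as a weighted sum of the training points, which is the link between the learned weights and the (perturbed) data.

```python
import numpy as np

def grad(w, X, y, lam):
    """Gradient of the mean logistic loss plus (lam/2)*||w||^2, labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y) + lam * w

def fit_logreg(X, y, lam=0.1, lr=0.5, steps=5000):
    """Plain gradient descent; the objective is strongly convex, so it converges."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, X, y, lam)
    return w

# Toy stand-in for a perturbed training set (assumption: 16 "pixels" per image).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(float)  # the label depends only on feature 0

lam = 0.1
w = fit_logreg(X, y, lam=lam)

# Suggestion 2: stationarity (gradient = 0) ties the weights to the training data:
#   lam * w = (1/n) * sum_i (y_i - p_i) * x_i
g = grad(w, X, y, lam)
p = 1.0 / (1.0 + np.exp(-X @ w))
coeffs = (y - p) / (lam * len(y))   # per-example contribution to the weights
w_from_data = X.T @ coeffs

# Suggestion 1: reshape the weights into image form for visual inspection.
weight_image = w.reshape(4, 4)
```

Plotting `weight_image` (for example with a heatmap) would show which input locations the model relies on, and sorting `coeffs` by magnitude would show which training points contribute most to the learned weights.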