NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1557
Title:Knowledge Extraction with No Observable Data

Reviewer 1

Originality: The proposed approach is a novel application and combination of GAN-style generative models, knowledge distillation and model compression.
Quality: The proposed approach seems technically sound and a sensible way to solve the stated problem. I'm a little confused by Section 4.3, where during knowledge distillation the target class is soft-sampled from a uniform distribution and fed un-normalized into the generator, especially considering that the generator was not trained in this way. Perhaps a more consistent approach would be to do the same during both training and inference. Alternatively, you could sample categorical distributions from a carefully specified Dirichlet prior and, in equation 8, minimise the KL-divergence between the sampled categorical distribution and the categorical distribution predicted by M. This would be a less ad hoc and more consistent approach to using soft targets during training and inference. Since KegNet-trained networks are initialised using SVD instead of Tucker-2, I'm concerned that you do not provide baseline numbers for networks compressed using SVD vs. Tucker-2. It would also be good to see KegNet used to train a student network that is initialised randomly, without compression.
Clarity: This submission is clearly written and informative. I could reproduce the results based on this paper (even without the provided code). However, the notation is a little overloaded. It would be good to specify exactly with respect to which parameters the losses in equations 7-10 are optimised (this can still be inferred, but it is good to be explicit). It is not entirely clear why you initialised the KegNet-trained student networks with SVD rather than Tucker-2. Minor points: It is not entirely clear which parts of the student network are NOT compressed and simply carried over from the teacher. Also, in line 255 you refer to ResNet20, while in Table 3 you refer to ResNet14. Can you please clarify this discrepancy?
Significance: This paper seems to provide a sensible solution to a novel problem. However, I have some concerns about the experimental results. The most complex model/dataset combination was ResNet14 on FashionMNIST and SVHN. It would be good to see some results on CIFAR-10/CIFAR-100, as MNIST/SVHN/FashionMNIST are still very simple datasets. Also, as described in the Quality section, I'm concerned about the lack of KegNet + random init, SVD and SVD + finetuning numbers.
---POST-REBUTTAL COMMENTS---
The claimed bad performance on CIFAR-10 and CIFAR-100 undermines the significance of the method. However, I am happy to vote to accept this as a stepping stone to more advanced methods, as long as the authors are very honest and explicit about the limitations of this approach. Regarding sampling unnormalised vectors: I am happy that this approach does improve performance, and the ablation study is quite useful. However, the mismatch between training and test time could be hampering the method, and eliminating it could yield further improvements in performance. Furthermore, I would be happier on a conceptual level with an approach that is a little more principled; I find the lack of normalisation confusing and unnecessary. The large standard deviation of idea 2 seems to support this. At the same time, I had no issues with an ensemble of generators; I thought that was very sensible. Finally, it would be ideal if the authors could include the detailed description of how their method integrates with SVD and T2 in the main body of the paper, and point out which parts of the network are retained and which are compressed. This added clarity would strengthen the paper.
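The Dirichlet-based alternative suggested in the Quality section could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the concentration value, the KL direction (sampled target relative to prediction), the function names, and the random logits standing in for the output of M on generated data are all hypothetical choices made for this example.

```python
import numpy as np


def sample_soft_labels(rng, batch_size, num_classes, concentration=0.5):
    """Draw normalised soft labels from a symmetric Dirichlet prior.

    The concentration value is an assumed hyperparameter; values below 1
    favour peaked (nearly one-hot) distributions.
    """
    alpha = np.full(num_classes, concentration)
    return rng.dirichlet(alpha, size=batch_size)


def softmax(logits):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) between categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=1)


rng = np.random.default_rng(0)
targets = sample_soft_labels(rng, batch_size=4, num_classes=10)

# Hypothetical stand-in for the logits of M applied to generated samples.
teacher_logits = rng.normal(size=(4, 10))

# Candidate replacement for the classification term: mean KL to the targets.
loss = kl_divergence(targets, softmax(teacher_logits)).mean()
```

Because the sampled targets are already normalised, the same sampling routine could be used at both training and distillation time, which is the consistency the comment above asks for.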

Reviewer 2

The problem presented in the paper is very interesting and challenging, as it considers the case where no observable data is present, and the proposed approach of generating artificial data points is in some sense very reasonable. The theoretical explanation for this approach follows from a dependence assumption and simple approximations and decompositions of the probability p_x. The weakest part of this paper is the experimental part. Although the framework is new and lacks fair benchmarks, several details and experiments needed to understand the benefits and limitations of this approach seem to be missing:
1. Why was the size of z chosen to be 10? How does it affect the results?
2. Why is z assumed to be low dimensional?
3. How does your method scale to more complex and challenging datasets?

Reviewer 3

novelty: the paper addresses a new problem with a new approach. I think the problem is interesting. Originally I was not convinced by the approach, given that it could lead to learning adversarial examples as well, but the results seem to confirm the validity of the proposed method (though I still have some doubts in point 5).
quality: the proposed method is in general clear and should be easy to implement given the paper. Experimental results confirm the proposed approach.
clarity: the paper was in general easy to follow.
significance: I am unaware of this problem having been addressed before. Extracting knowledge from a network without access to data is potentially impactful.