Review for NeurIPS paper: Self-Supervised Generative Adversarial Compression

NeurIPS 2020

Review 1

Summary and Contributions: The paper proposes an algorithm to train sparse generative networks. This is done with an adversarial learning framework in which a sparse network is trained jointly with a dense network, such that the outputs of the sparse network are close to those of the dense network.
Contributions:
1. The paper shows that existing pruning/sparsifying algorithms fail when applied to training generative models.
2. The proposed algorithm achieves 90% sparsification on certain image processing tasks.
--------------------------- Edit after author feedback ---------------------------
I have read the author feedback and the other reviews. I am raising my score to a 6, as my concern about the quality of the baselines was satisfactorily addressed in the author feedback: there are no other algorithms for compressing generative models, and hence the baselines considered are OK. However, my previous concerns about the vagueness of the discussion and the strong unverified claims remain.

Strengths:
+ The algorithm has good empirical performance, as it can achieve 90% sparsification without loss in image quality.
+ The algorithm is easy to implement.
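For readers unfamiliar with the setup, the two ideas mentioned in this review (90% sparsity, and a sparse network whose outputs stay close to those of the dense network) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes plain magnitude pruning and a toy one-unit linear "generator", and the names `magnitude_prune`, `toy_generator`, and `output_gap` are hypothetical.

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude weights so that a `sparsity` fraction is zero."""
    k = int(round(sparsity * len(weights)))  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def toy_generator(weights, z):
    """Stand-in for a generator: a single linear unit (assumption, not StarGAN)."""
    return sum(w * x for w, x in zip(weights, z))

def output_gap(dense_w, sparse_w, latents):
    """Mean squared difference between dense and sparse outputs on the same inputs."""
    diffs = [(toy_generator(dense_w, z) - toy_generator(sparse_w, z)) ** 2
             for z in latents]
    return sum(diffs) / len(diffs)

random.seed(0)
dense_w = [random.gauss(0.0, 1.0) for _ in range(1000)]
sparse_w = magnitude_prune(dense_w, sparsity=0.9)
latents = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(8)]

achieved = sum(1 for w in sparse_w if w == 0.0) / len(sparse_w)
gap = output_gap(dense_w, sparse_w, latents)
print(f"sparsity={achieved:.2f}, output gap={gap:.3f}")
```

In the paper's setting, a quantity like `output_gap` would presumably be reduced by further training of the sparse network; the sketch only measures it directly after pruning.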

Weaknesses:
- While the algorithm achieves good performance, the baselines are not convincing. The authors borrow sparsification algorithms designed for image classification networks, and report the performance of these algorithms when used for training GANs. As GANs are trained in an adversarial fashion, it seems natural that the proposed adversarial learning framework will produce better results than algorithms meant for image classification networks.
- The algorithm is a simple modification of the StarGAN algorithm, where a sparse/compressed generative network is trained jointly with the dense network.
- Some of the discussions about distributional entropies and the difficulties of using baseline algorithms do not have valid arguments to back them up. Right now they sound more like wild conjectures than intuition.

Correctness: The proposed claims and methods are correct, up to some minor issues outlined under "Weaknesses".

Clarity: - Section 4 needs more elaboration on all the loss functions. The notation is extremely confusing, and most of the symbols are not defined. Those that are defined are described only verbally. The authors should not assume that a reader is familiar with what may be standard loss functions in the training of StarGAN, and should explicitly state what these functions are.

Relation to Prior Work: The relationship to prior work can be improved. Currently, the paper gives the impression that all the considered baselines are established algorithms for generative modelling, whereas they are algorithms for training classifiers that have been repurposed to train generative models. A better discussion of existing algorithms for generative modelling would be beneficial.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper deals with the problem of compressing the generator component of GANs. The authors outline why previous approaches to compression (either pruning or GAN-targeted approaches) don’t work, both qualitatively and quantitatively, and show a whole host of results of their compression technique on a variety of GAN tasks such as image synthesis, domain translation, and super resolution.

Strengths:
1. The paper is well-motivated and well-written, and compressing generators for GANs is a practically useful problem.
2. The idea is simple.
3. As far as I know, the proposed idea is novel as it applies to compression of GANs.
4. I like the speculation about why existing compression techniques work for models trained on classification tasks but not for GANs. While the authors do not explicitly test or verify any of the hypothesized reasons, it motivates the proposed approach and gives direction for future work.

Weaknesses:
1. Quantitative experiments motivating the use of this method over existing compression methods are limited to StarGAN on CelebA. While there is a wealth of additional experiments for their method specifically, it would be nice to provide the same level of thorough quantitative and qualitative results comparing to the baseline techniques for at least one more GAN architecture and dataset, so that we can be confident that the method introduced isn’t overfit specifically to StarGAN and CelebA.
2. The authors may be overloading the term “self-supervised”. The explanation for using this term is “Since the original discriminator is used as a proxy for a human’s subjective evaluation, we refer to this as ‘self-supervised’ compression”; however (and perhaps there is a precedent for this that I am unaware of), self-supervision is usually a property of the dataset and not of the models used. For example, image colorization (https://arxiv.org/pdf/1603.08511.pdf), inpainting (https://arxiv.org/pdf/1604.07379.pdf), and predicting image rotations (https://arxiv.org/pdf/1803.07728.pdf) are all examples of self-supervised learning tasks. Maybe there is a better term out there to describe this approach?

Correctness: The claims in the paper look to be correct and the paper doesn’t over-claim anything notable. The empirical methodology seems reliable because wherever possible the authors use publicly available implementations of different GAN training schemes that are open source on Github.

Clarity: This paper is very clear and readable. A few small notes to make it even easier to read:
- On line 78, for “dense baseline”, it would be nice to write “dense (i.e. uncompressed) baseline” for those who might not know what “dense baseline” means.
- Table 1: state in the caption that these results are for StarGAN with the CelebA dataset. This helps readers who like to skim figures before deciding to read a paper, since they can see which dataset the FID, etc. is measured on.
- Line 189: “there is no need to prune it to” → “there is no need to prune it too”

Relation to Prior Work: Prior work is discussed in Section 3 and differences between these and the proposed method are discussed.

Reproducibility: Yes

Additional Feedback: Overall I think this paper is well-written, the method described seems to work well, and the approach solves a problem that is practically useful (compression of generators in GANs). Aside from my reservations about using the term ‘self-supervised’ for this approach, I think this paper would make a nice addition to the NeurIPS paper lineup.
------------------- Post Rebuttal Feedback -------------------
In contrast to what some other reviewers thought, I don't think that it is unconvincing to use baselines from image classification if those are the only techniques that currently exist. I also don't necessarily agree that the paper requires a more insightful understanding of why the method works in order to be considered a good paper: while more understanding would of course always be great to have, if a method is practically useful, then it should be used. Additionally, it is extremely common for GAN literature to be evaluated qualitatively with image samples. Because of this I will stick to my score.

Review 3

Summary and Contributions: This paper builds on observations of performance degradation in existing work on compressing or distilling generative adversarial networks, and proposes some tricks to overcome the resulting training issues.

Strengths: The experimental results look promising.

Weaknesses: For this type of paper, I expect a more insightful dive into the problem rather than some generated images accompanied by general comments. Without that insight, the proposed solution does not really convince me and seems to be a set of tricks rather than a novel scientific discovery. The keyword “self-supervised” in the title also misled me, because I cannot see any pretext task that helps to learn and expose new features of the data for downstream tasks. What this paper does is closer to model distillation for generative adversarial networks with some further tricks, including: i) starting from a well pre-trained discriminator rather than training from scratch and ii) new losses in Eqs. (5) and (7) that allow the full generator to be copied more faithfully.

Correctness: It seems to have no problem with theory and experiments.

Clarity: The writing is hard to follow, because abstract or generic wording is used when describing the drawbacks of previous approaches. For example:
- In Line 35: “In some cases, this result is masked by loss curves that look identical to the original training”. Why is it problematic?
- In Line 166: “the new discriminator quickly falls into a low-entropy solution and cannot escape”. Again, why is it problematic? It means that the fake examples from a compressed generator are easily distinguishable from those of the full generator, doesn't it?
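One possible reading of “low-entropy solution” is sketched below, as an illustration rather than the authors' definition: it assumes the entropy in question is that of the discriminator's real/fake output probability, averaged over a batch. The function names and the example probabilities are hypothetical. Under this reading, a discriminator that separates the two generators with near certainty has low entropy, which matches the interpretation in the question above.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with P(real) = p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mean_output_entropy(probs):
    """Average entropy of a batch of discriminator output probabilities."""
    return sum(binary_entropy(p) for p in probs) / len(probs)

# Hypothetical discriminator outputs on a small batch.
uncertain = [0.45, 0.50, 0.55, 0.60]      # discriminator cannot tell samples apart
saturated = [0.001, 0.999, 0.002, 0.998]  # discriminator is almost always certain

print(mean_output_entropy(uncertain), mean_output_entropy(saturated))
```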

Relation to Prior Work: The prior work is clearly discussed in this paper.

Reproducibility: No

Additional Feedback: The idea is reasonable and applicable. However, for this paper I expected a more insightful view and understanding of how existing approaches to GAN compression behave during training. It would be better if you showed clear evidence for, and an explanation of, what you claim: that loss curves which look identical to the original training are not a good sign, and that the new discriminator quickly falls into a low-entropy solution. In Section 3, the authors run experiments on too many existing models without a specific focus on what they want to improve. This makes the discussion less concise and not deep enough.
----------------------- Post rebuttal -----------------------
Thanks for your responses to my questions. However, I am keeping my current score. The reasons are:
i) Although the experiments are comprehensive and the paper shows promising results, the discussion of the paper's motivation is vague and less convincing to me, without any notable supporting evidence. For example, the statement in the feedback, 'the discriminator falls into a low-entropy solution that will cause mode collapse', is very misleading and vague. It seems that you are talking about the low entropy of the distribution induced by the generator (the distribution over fake examples).
ii) Furthermore, you argue that fine-tuning a pre-trained discriminator is better than training a new discriminator from scratch. However, what was provided was only another misleading explanation: 'In some cases, this result is masked by loss curves that look identical to the original training as shown in (h) and (m)'. It is not clear to me why this is problematic.
iii) Finally, the experimental section should include experiments comparing two cases: (a) training a new discriminator from scratch and (b) fine-tuning the pre-trained discriminator. However, there is no such experiment, and only the performance of the proposed method is shown.
iv) Last but not least, the optimization problems in (5) and (6) have the form A/B (where B relates to the discriminator), which is usually hard to train. Probably a stop gradient was applied so that B is treated as a constant during training, but this is not mentioned at all in the paper.
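The concern in item iv) can be written out as a short worked derivation. The symbols $A(\theta)$ and $B(\theta)$ are placeholders for the numerator and denominator in the A/B description above, with $\theta$ the parameters being trained and $\mathrm{sg}(\cdot)$ denoting a stop-gradient operator; they are assumptions for illustration and not the paper's notation.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \frac{A(\theta)}{B(\theta)}, \\
\nabla_\theta \mathcal{L} &= \frac{\nabla_\theta A}{B} - \frac{A}{B^{2}}\,\nabla_\theta B
  && \text{(full gradient)}, \\
\nabla_\theta \frac{A(\theta)}{\mathrm{sg}\big(B(\theta)\big)} &= \frac{\nabla_\theta A}{B}
  && \text{(stop gradient on } B\text{)}.
\end{aligned}
```

With the stop gradient, B only rescales the gradient of A, and the second term, which can become large when B is small, is dropped. Whether the paper does this is exactly what the review asks the authors to state.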

Review 4

Summary and Contributions: The authors propose a new network compression method for GANs, specifically for pruning the generator of a GAN. Through extensive experiments on different tasks and different networks, the authors demonstrate that the proposed method outperforms existing work both qualitatively and quantitatively. The method also achieves a considerable speedup after pruning while still generating data of good quality.

Strengths:
1. The authors use the trained discriminator to guide the compression of the generator, which both outperforms existing methods and takes much less time to train than the original training process.
2. The empirical evaluation of the proposed method is quite comprehensive. The authors performed extensive experiments to demonstrate the effectiveness of the proposed method on different tasks with different networks. The methods were compared both qualitatively and quantitatively. The authors also provide additional discussion of the effects of different compression granularities and rates on the proposed method.
3. The authors clearly explain the motivation for the proposed method and provide a detailed discussion of the shortcomings of existing approaches to generator compression in complex GAN tasks.

Weaknesses:
1. One main contribution of the paper is to use the trained discriminator to guide the network pruning process. While the authors demonstrate the effectiveness of this approach through different experiments, it would be helpful if they could provide a more rigorous analysis or discussion of why using the trained discriminator from the original GAN substantially boosts performance compared to previous work.
2. To show the effectiveness of the compression approach, it would be helpful to compare against training a small, dense network from scratch as in (c), but with the discriminator initialized as the trained discriminator. The authors could include this comparison in both the qualitative and quantitative results, along with the training time for this setup. This would further strengthen the argument for the effectiveness of the proposed method.

Correctness: The claims and methods in this paper are correct. The methodology is correct and intuitive.

Clarity: The paper is well written. Overall the methods are well explained, and the results are presented clearly. I appreciate the authors' discussion of the potential reasons why existing approaches fail in complex GAN pruning tasks. However, it would be helpful if the authors could elaborate more mathematically on the details of the compression process in the proposed method during training.

Relation to Prior Work: The differences between the proposed method and existing work are clearly explained. The authors provide a detailed discussion in Section 3 of the reasons why existing methods produce unsatisfactory results when compressing the generator of a GAN in complex tasks. The main difference of the proposed method is its use of the trained discriminator.

Reproducibility: Yes

Additional Feedback: