Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper addresses an important problem: measuring the quality of conditional generative models. The authors propose the Classification Accuracy Score (CAS), a metric based on the performance of a discriminative model trained on samples drawn from the conditional generative model. The paper also discusses the pros and cons of the proposed metric. The empirical study shows that a number of state-of-the-art deep generative models fail to match the target distribution.

Pros: While the idea has been proposed before in Shmelkov2018, it has not been widely used in the field. The current paper points out limitations of deep generative models as well as limitations of currently used metrics, so it delivers a significant contribution. The paper is clearly written, and the experiments look thoughtfully designed.

Cons: The original work Shmelkov2018 is not cited. The proposed metric has a range of limitations, listed in Section 3 (e.g., the metric does not penalize memorization).

Despite the questionable novelty of the proposed metric, the paper provides a nice and interesting empirical evaluation of state-of-the-art generative models, which is definitely of interest to the community. The growing popularity of the proposed metric also gives it significant potential impact. I recommend accepting the paper.

Comments: 0) The original work Shmelkov2018 should definitely be cited. 1) The experimental evaluation suggests that IS/FID tend to over-penalize likelihood-based generative models and to over-penalize nearly ideal reconstructions from VAE-like models. While the issue is discussed in Section 4.3 (lines 270-280), it is still not clear why this happens. I suspect there are underlying reasons that would be interesting and useful to the community. 2) The paper may also benefit from adding a comparison on the MNIST dataset as well as more extensive evaluations on the CIFAR datasets.
First, it would be interesting to know whether the proposed metric is already saturated on a simpler dataset (perhaps for most generative models). Second, many groups lack the resources to reproduce the ImageNet experiments, so additional MNIST/CIFAR experiments might accelerate future research. 3) One more case where the proposed metric might produce wrong results is Dataset Distillation (https://arxiv.org/pdf/1811.10959.pdf); however, this case looks extremely unlikely. 4) Formatting recommendation: it is common practice to use a vector format for plots, e.g., pylab.savefig('name.pdf').

(Shmelkov2018) K. Shmelkov et al. "How good is my GAN?" Proceedings of the European Conference on Computer Vision (ECCV), 2018.
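For concreteness, the evaluation protocol discussed above can be sketched in a few lines. This is a toy illustration only: the 1-D "real" distribution, the stand-in generator, and the nearest-centroid classifier are all made up here for brevity, whereas the paper trains real image classifiers on samples from deep generative models.

```python
import random

random.seed(0)

# Hypothetical stand-ins: a "real" per-class data distribution and a
# conditional generator under evaluation, both emitting a 1-D feature.
def real_sample(label):
    return random.gauss(2.0 * label, 0.5)

def generator_sample(label):  # the model being scored
    return random.gauss(2.0 * label, 0.5)

def nearest_centroid_fit(data):
    # Trivial classifier: one centroid per class.
    return {label: sum(xs) / len(xs) for label, xs in data.items()}

def accuracy(centroids, test_set):
    correct = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == label
        for label, x in test_set
    )
    return correct / len(test_set)

labels, n = [0, 1, 2], 200
real_train = {c: [real_sample(c) for _ in range(n)] for c in labels}
fake_train = {c: [generator_sample(c) for _ in range(n)] for c in labels}
real_test = [(c, real_sample(c)) for c in labels for _ in range(n)]

# Baseline: classifier trained on real data; CAS: trained on generated
# samples. Both are evaluated on the same held-out real test set.
baseline = accuracy(nearest_centroid_fit(real_train), real_test)
cas = accuracy(nearest_centroid_fit(fake_train), real_test)
print(f"real-data accuracy: {baseline:.2f}, CAS: {cas:.2f}")
```

A generator that matches the target distribution yields a CAS close to the real-data baseline; dropped or degenerate classes show up as a gap.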
The paper presents a new metric to evaluate generative models by training a classifier using only sampled images. Properly evaluating generative models is an important task, and this is an interesting novel idea; however, I think this tests the conditional part of conditional generative models more than the generative part, and the results should be seen as such.

Detailed remarks: 1) Conditional generative models have a hard time capturing the class: for example, in "Are generative classifiers more robust to adversarial attacks?" the authors get poor classification on CIFAR-10 using a conditional generative model. This was also discussed in "Conditional Generative Models are not Robust" (concurrent work, so no need to cite, but it might be of interest). It seems that there is a big difference between a generative model and a conditional generative model; the metric evaluates the latter and should be described as such. Some discussion of the matter is needed. 2) Along the same lines, it would be important to know how well the evaluated models capture the class: what is the accuracy of using p(x|y) as a classifier on real data? What is the accuracy of samples from p(x,y) as judged by a pretrained classifier (trained on real data)?
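The first question in remark 2 can be made concrete with a small sketch: given a class-conditional model p(x|y), classify by Bayes rule, predicting argmax_y p(x|y)p(y). Everything below is a hypothetical illustration (1-D Gaussian class-conditionals with a uniform prior), not the models evaluated in the paper.

```python
import math
import random

random.seed(1)

# Hypothetical class-conditional densities p(x|y): one 1-D Gaussian
# per class, stored as label -> (mean, std).
params = {0: (0.0, 0.5), 1: (2.0, 0.5), 2: (4.0, 0.5)}

def log_p_x_given_y(x, y):
    mu, sigma = params[y]
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def classify(x):
    # Bayes rule with a uniform prior p(y): argmax_y p(x|y).
    return max(params, key=lambda y: log_p_x_given_y(x, y))

# Held-out "real" data, drawn here from the same family for illustration.
test = [(y, random.gauss(mu, sigma))
        for y, (mu, sigma) in params.items() for _ in range(200)]
acc = sum(classify(x) == y for y, x in test) / len(test)
print(f"accuracy of p(x|y) as a classifier: {acc:.2f}")
```

Reporting this number alongside CAS would separate how well the model captures the class from how well it captures the rest of the distribution.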
The lack of novelty is quite problematic. Essentially, the central idea of the paper ("regenerate the training set using a generative model, train a classifier on this data, and compare it with one trained on real data, thus evaluating the generative model") was described in "How good is my GAN?" by K. Shmelkov et al., ECCV'18 (which is not mentioned in the related work at all). The remaining contributions look pretty weak without the main one: conditional GANs learn some classes worse than others; FID (and IS as well), being a single number, fails to capture this aspect, especially for a dataset as rich as ImageNet; recent likelihood-based generative models actually perform quite well and have no completely degenerate classes (as their latent space contains reconstructions as well); IS is extremely problematic and does not really account for diversity; the truncation trick inflates IS by directly restricting the diversity of BigGAN outputs and thus degrades CAS. I'd say we already know all of this, or at least it is unsurprising. The writing is good, though: the paper is well-structured and easy to read, and the main idea is clearly explained and easy to reproduce. But it is not novel, unfortunately.