Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The authors estimate the manifolds of the real images and of the generated images separately in feature space. For the real images, they compute the pairwise Euclidean distances between all feature vectors and, around each feature vector, form a hypersphere whose radius is the distance to its k-th nearest neighbor. Together, these hyperspheres define a volume in feature space that serves as an estimate of the true manifold. The manifold of the generated images is estimated in the same way. Precision is then estimated as the fraction of generated images that lie in the manifold of the real images, and recall as the fraction of real images that lie in the manifold of the generated images. Originality: I believe the proposed method of estimating precision and recall is not novel. Given its simplicity, it must have been used in some previous works, although in a different context (not for high-dimensional images). Quality: The paper is well written and the results are clearly explained. Significance: Given the numerical experiments, the proposed method is significant for evaluating the quality and variation of samples generated by GANs.
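The estimator summarized in this review can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`knn_radii`, `fraction_inside`, `precision_recall`) and the default `k=3` are hypothetical choices made here, and the feature vectors are assumed to have been extracted beforehand and passed in as arrays of shape (n, d).

```python
import numpy as np

def knn_radii(feats, k):
    """Distance from each feature vector to its k-th nearest neighbor in the same set."""
    # Pairwise Euclidean distances within the set, shape (n, n).
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(d, axis=1)[:, k]

def fraction_inside(queries, refs, k):
    """Fraction of `queries` lying inside at least one hypersphere around `refs`."""
    radii = knn_radii(refs, k)
    # Distances from every query to every reference point, shape (m, n).
    d = np.linalg.norm(queries[:, None, :] - refs[None, :, :], axis=-1)
    inside = (d <= radii[None, :]).any(axis=1)
    return float(inside.mean())

def precision_recall(real, fake, k=3):
    """Precision: generated samples inside the real manifold estimate.
    Recall: real samples inside the generated manifold estimate."""
    precision = fraction_inside(fake, real, k)
    recall = fraction_inside(real, fake, k)
    return precision, recall

# Toy 1-D example (hypothetical data): one generated point is far from the real ones.
real = np.array([[0.0], [1.0], [2.0]])
fake = np.array([[0.5], [10.0]])
print(precision_recall(real, fake, k=1))  # (0.5, 1.0)
```

In the toy example, the distant generated point lowers precision to 0.5, but it also stretches the hyperspheres of the generated set to radius 9.5, so every real point counts as covered and recall is 1.0. The sketch builds full pairwise distance matrices, so memory grows quadratically with the number of samples.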
I found this submission really interesting and influential. It represents a great step forward compared to the prior work on estimating precision and recall for GANs. The authors show examples where the previous way of estimating precision and recall fails to find any difference despite the obviously varying quality of the samples. The paper is very well written and contains thorough experiments (for instance, one experiment semi-accidentally pushes the StyleGAN state of the art in FID). As a side note, the underlying idea of growing hyperspheres around data points reminded me of persistent homology. It is a concept from computational topology that can be used to estimate the topology of a manifold when the manifold is described by a point cloud of samples. The authors may benefit from exploring this connection. For more details, please see for example "Towards topological analysis of high-dimensional feature spaces" by Hubert Wagner and Paweł Dłotko (Computer Vision and Image Understanding, Volume 121, April 2014, Pages 21-26).
Originality: This paper uses similar intuition as . Precision should represent the fraction of generated images that are captured by the real images, and recall the fraction of real images that are captured by the generated images. Instead of a PR curve, the authors use a two-value definition, as in information-retrieval metrics, and claim it is better by showing a counterexample with StyleGAN and the truncation trick. I think the main contribution is the empirical evaluation on large-scale GANs. They evaluate StyleGAN and BigGAN and show the tradeoff between precision and recall by controlling the truncation trick. They also propose a method for evaluating a single image and demonstrate its effectiveness qualitatively. The authors demonstrate the potential of using these evaluation metrics to improve the training of GANs. Clarity: The paper is generally well written and clearly structured. However, it would be better if the authors could state precision and recall first as a definition on probability measures and then as an estimator on data samples. If I understood correctly, the definitions in (2), line 88, are actually estimators of precision, defined as P(supp(Q)), and recall, defined as Q(supp(P)). Quality: In theory, I do not think P(supp(Q)) and Q(supp(P)) are a good measure of the distance between two probability distributions. One could easily come up with a counterexample: suppose two continuous probability distributions have the same support but different densities, e.g. two skewed Gaussians. Then this method cannot distinguish them, since P(supp(Q)) and Q(supp(P)) are both 1. This method only works if the two probability distributions are uniform. I believe this is the reason why we need a PR curve like  instead of two values. In practice, this problem might be mitigated by carefully choosing the features of the pre-trained classifier and k. The authors mention these two choices in line 99 and line 104.
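The counterexample above can be written out in the same notation. This is only a sketch of the argument; the symbol S is introduced here as a name for the common support and does not appear in the paper.

```latex
\[
\operatorname{supp}(P) = \operatorname{supp}(Q) = S
\;\Longrightarrow\;
P\bigl(\operatorname{supp}(Q)\bigr) = P(S) = 1
\quad\text{and}\quad
Q\bigl(\operatorname{supp}(P)\bigr) = Q(S) = 1 .
\]
```

Both values equal 1 for any pair of distributions with a common support, however different their densities are on S, because every probability measure assigns mass 1 to its own support.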
It would be better if the authors could provide results for different choices of features and k, together with the method for choosing them and an analysis. If the number of samples, the training data, or even the generative model changes, how should we adjust these settings to obtain consistent evaluations? As the authors explain in lines 138-140, the bad clustering performance in the feature space is probably the reason why  failed, so it would be critical to choose a good feature space and k. For the experimental sections, it would be better if the authors could also provide results on some simple or synthetic datasets like  did, so that we can have a fair comparison of the two methods. The authors claim (lines 57-58) that one of the drawbacks of  is that a curve representation is ambiguous. However, in Figure 4 (c), the authors actually use F-scores as proxies for precision and recall, which is proposed in . I do not agree that a curve representation is ambiguous, as curves can also be summarized by F-scores. They also claim that another drawback of  is that the estimation algorithm is not practical. However, it is discussed in  that the definition can be extended to continuous distributions. A practical estimation algorithm based on a binary classifier and binary hypothesis testing is also proposed in . The motivation for using a PR curve/ROC curve and the connection to binary hypothesis testing are discussed in [2, 3]. Significance: This paper proposes evaluation metrics for GANs and applies them to StyleGAN and BigGAN. The experiments are thoroughly conducted and discussed. I think it is a good contribution to the ML community.
Simon, L., Webster, R., & Rabin, J. (2019). Revisiting Precision and Recall Definition for Generative Model Evaluation. arXiv preprint arXiv:1905.05441.
Lin, Z., Khetan, A., Fanti, G., & Oh, S. (2018). PacGAN: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems (pp. 1498-1507).