{"title": "Improved Precision and Recall Metric for Assessing Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3927, "page_last": 3936, "abstract": "The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research. We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. We demonstrate the effectiveness of our metric in StyleGAN and BigGAN by providing several illustrative examples where existing metrics yield uninformative or contradictory results. Furthermore, we analyze multiple design variants of StyleGAN to better understand the relationships between the model architecture, training methods, and the properties of the resulting sample distribution. In the process, we identify new variants that improve the state-of-the-art. We also perform the first principled analysis of truncation methods and identify an improved method. 
Finally, we extend our metric to estimate the perceptual quality of individual samples, and use this to study latent space interpolations.", "full_text": "Improved Precision and Recall Metric for Assessing\n\nGenerative Models\n\nTuomas Kynk\u00e4\u00e4nniemi\u2217\n\nAalto University\n\nNVIDIA\n\ntuomas.kynkaanniemi@aalto.fi\n\nJaakko Lehtinen\nAalto University\n\nNVIDIA\n\njlehtinen@nvidia.com\n\nTero Karras\n\nNVIDIA\n\ntkarras@nvidia.com\n\nSamuli Laine\n\nNVIDIA\n\nslaine@nvidia.com\n\nTimo Aila\nNVIDIA\n\ntaila@nvidia.com\n\nAbstract\n\nThe ability to automatically estimate the quality and coverage of the samples\nproduced by a generative model is a vital requirement for driving algorithm research.\nWe present an evaluation metric that can separately and reliably measure both\nof these aspects in image generation tasks by forming explicit, non-parametric\nrepresentations of the manifolds of real and generated data. We demonstrate\nthe effectiveness of our metric in StyleGAN and BigGAN by providing several\nillustrative examples where existing metrics yield uninformative or contradictory\nresults. Furthermore, we analyze multiple design variants of StyleGAN to better\nunderstand the relationships between the model architecture, training methods,\nand the properties of the resulting sample distribution. In the process, we identify\nnew variants that improve the state-of-the-art. We also perform the \ufb01rst principled\nanalysis of truncation methods and identify an improved method. Finally, we\nextend our metric to estimate the perceptual quality of individual samples, and use\nthis to study latent space interpolations.\n\n1\n\nIntroduction\n\nThe goal of generative methods is to learn the manifold of the training data so that we can subsequently\ngenerate novel samples that are indistinguishable from the training set. 
While the quality of results\nfrom generative adversarial networks (GAN) [7], variational autoencoders (VAE) [14], autoregressive\nmodels [29, 30], and likelihood-based models [6, 13] has seen rapid improvement recently [11, 8,\n28, 20, 4, 12], the automatic evaluation of these results continues to be challenging.\nWhen modeling a complex manifold for sampling purposes, two separate goals emerge: individual\nsamples drawn from the model should be faithful to the examples (they should be of \u201chigh quality\u201d),\nand their variation should match that observed in the training set. The most widely used metrics, such\nas Fr\u00e9chet Inception Distance (FID) [9], Inception Score (IS) [25], and Kernel Inception Distance\n(KID) [2], group these two aspects into a single value without a clear tradeoff. We illustrate by examples\nthat this makes diagnosis of model performance difficult. For instance, it is notable that while\nrecent state-of-the-art generative methods [4, 13, 12] claim to optimize FID, in the end the (uncurated)\nresults are almost always produced using another model that explicitly sacrifices variation, and often\nFID, in favor of higher-quality samples from a truncated subset of the domain [17].\n\n\u2217This work was done during an internship at NVIDIA.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Definition of precision and recall for distributions [24]. (a) Denote the distribution of real\nimages with Pr (blue) and the distribution of generated images with Pg (red). (b) Precision is the\nprobability that a random image from Pg falls within the support of Pr.
(c) Recall is the probability\nthat a random image from Pr falls within the support of Pg.\n\nMeanwhile, insuf\ufb01cient coverage of the underlying manifold continues to be a challenge for GANs.\nVarious improvements to network architectures and training procedures tackle this issue directly\n[25, 19, 11, 15]. While metrics have been proposed to estimate the degree of variation, these have not\nseen widespread use as they are subjective [1], domain speci\ufb01c [19], or not reliable enough [23].\nRecently, Sajjadi et al. [24] proposed a novel metric that expresses the quality of the generated\nsamples using two separate components: precision and recall. Informally, these correspond to the\naverage sample quality and the coverage of the sample distribution, respectively. We discuss their\nmetric (Section 1.1) and characterize its weaknesses that we later demonstrate experimentally. Our\nprimary contribution is an improved precision and recall metric (Section 2) which provides explicit\nvisibility of the tradeoff between sample quality and variety. Source code of our metric is available at\nhttps://github.com/kynkaat/improved-precision-and-recall-metric.\nWe demonstrate the effectiveness of our metric using two recent generative models (Section 3),\nStyleGAN [12] and BigGAN [4]. We then use our metric to analyze several variants of StyleGAN\n(Section 4) to better understand the design decisions that determine result quality, and identify new\nvariants that improve the state-of-the-art. We also perform the \ufb01rst principled analysis of truncation\nmethods [17, 13, 4, 12]. Finally, we extend our metric to estimate the quality of individual generated\nsamples (Section 5), offering a way to measure the quality of latent space interpolations.\n\n1.1 Background\n\nSajjadi et al. 
[24] introduce the classic concepts of precision and recall to the study of generative\nmodels, motivated by the observation that FID and related density metrics cannot be used for drawing\nconclusions about precision and recall: a low FID may indicate high precision (realistic images), high\nrecall (a large amount of variation), or anything in between. We share this motivation.\nFrom the classic viewpoint, precision denotes the fraction of generated images that are realistic,\nand recall measures the fraction of the training data manifold covered by the generator (Figure 1).\nBoth are computed as expectations of binary set membership over a distribution, i.e., by measuring\nhow likely it is that an image drawn from one distribution is classified as falling under the support\nof the other distribution. In contrast, Sajjadi et al. [24] formulate precision and recall through the\nrelative probability densities of the two distributions. The choice of modeling the relative densities\nstems from an ambiguity: should the differences between the two distributions be attributed\nto the generator covering the real distribution inadequately, or to the generator producing unrealistic\nsamples? The authors resolve this ambiguity by modeling a continuum of precision/recall\nvalues whose extrema correspond to the classic definitions. In addition to raising the question of\nwhich value to use, their practical algorithm cannot reliably estimate the extrema due to its reliance\non relative densities: it cannot, for instance, correctly interpret situations where large numbers of\nsamples are packed together, e.g., as a result of mode collapse or truncation. The k-nearest-neighbors-based\ntwo-sample test by Lopez-Paz et al. [16] suffers from the same problem. In parallel with our\nwork, Simon et al.
[26] extend Sajjadi\u2019s formulation to arbitrary probability distributions and provide\na practical algorithm that estimates precision and recall by training a post hoc classifier.\n\nFigure 2: (a) An example manifold in a feature space. (b) Estimate of the manifold obtained by\nsampling a set of points and surrounding each with a hypersphere that reaches its kth nearest neighbor.\n\nWe argue that the classic definition of precision and recall is sufficient for disentangling the effects of\nsample quality and manifold coverage. This can be partially justified by observing that precision and\nrecall correspond to the vertical and horizontal extremal cases in Lin et al.\u2019s [15] theoretically founded\nanalysis of mode collapse regions. In order to approximate these quantities directly, we construct\nadaptive-resolution finite approximations to the real and generated manifolds that are able to answer\nbinary membership queries: \u201cdoes sample x lie in the support of distribution P?\u201d Together with\nexisting density-based metrics, such as FID, our precision and recall scores paint a highly informative\npicture of the distributions produced by generative image models. In particular, they make effects in\nthe \u201cnull space\u201d of FID clearly visible.\n\n2 Improved precision and recall metric using k-nearest neighbors\n\nWe will now describe our improved precision and recall metric that does not suffer from the weaknesses\nlisted in Section 1.1. The key idea is to form explicit non-parametric representations of the\nmanifolds of real and generated data, from which precision and recall can be estimated.\nSimilar to Sajjadi et al. [24], we draw real and generated samples from Xr \u223c Pr and Xg \u223c Pg,\nrespectively, and embed them into a high-dimensional feature space using a pre-trained classifier\nnetwork.
We denote feature vectors of the real and generated images by \u03c6r and \u03c6g, respectively, and\nthe corresponding sets of feature vectors by \u03a6r and \u03a6g. We take an equal number of samples from\neach distribution, i.e., |\u03a6r| = |\u03a6g|.\nFor each set of feature vectors \u03a6 \u2208 {\u03a6r, \u03a6g}, we estimate the corresponding manifold in the feature\nspace as illustrated in Figure 2. We obtain the estimate by calculating pairwise Euclidean distances\nbetween all feature vectors in the set and, for each feature vector, forming a hypersphere with radius\nequal to the distance to its kth nearest neighbor. Together, these hyperspheres define a volume in the\nfeature space that serves as an estimate of the true manifold. To determine whether a given sample \u03c6\nis located within this volume, we define a binary function\n\nf(\phi, \Phi) = \begin{cases} 1, & \text{if } \|\phi - \phi'\|_2 \le \|\phi' - \mathrm{NN}_k(\phi', \Phi)\|_2 \text{ for at least one } \phi' \in \Phi \\ 0, & \text{otherwise,} \end{cases} \quad (1)\n\nwhere \mathrm{NN}_k(\phi', \Phi) returns the kth nearest feature vector of \phi' from set \Phi. In essence, f(\u03c6, \u03a6r)\nprovides a way to determine whether a given image looks realistic, whereas f(\u03c6, \u03a6g) provides a way\nto determine whether it could be reproduced by the generator. We can now define our metric as\n\n\mathrm{precision}(\Phi_r, \Phi_g) = \frac{1}{|\Phi_g|} \sum_{\phi_g \in \Phi_g} f(\phi_g, \Phi_r), \qquad \mathrm{recall}(\Phi_r, \Phi_g) = \frac{1}{|\Phi_r|} \sum_{\phi_r \in \Phi_r} f(\phi_r, \Phi_g) \quad (2)\n\nIn Equation (2), precision is quantified by querying for each generated image whether the image is\nwithin the estimated manifold of real images. Symmetrically, recall is calculated by querying for each\nreal image whether the image is within the estimated manifold of generated images.
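The estimator defined by Equations (1) and (2) can be written in a few lines of NumPy. This is an illustrative sketch, not the authors' released implementation; the array names `feats_real` and `feats_gen` and the brute-force pairwise distance computation are our own simplifications:

```python
import numpy as np

def knn_radii(feats, k=3):
    """Radius of each feature vector's hypersphere: the distance to its
    kth nearest neighbor (the zero self-distance is skipped)."""
    # Pairwise Euclidean distances, shape (N, N).
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # Column 0 of each sorted row is the self-distance, so column k is the kth NN.
    return np.sort(d, axis=1)[:, k]

def manifold_membership(queries, feats, radii):
    """f(phi, Phi): True if the query lands inside at least one hypersphere."""
    d = np.linalg.norm(queries[:, None, :] - feats[None, :, :], axis=-1)
    return (d <= radii[None, :]).any(axis=1)

def precision_recall(feats_real, feats_gen, k=3):
    r_real = knn_radii(feats_real, k)
    r_gen = knn_radii(feats_gen, k)
    precision = manifold_membership(feats_gen, feats_real, r_real).mean()
    recall = manifold_membership(feats_real, feats_gen, r_gen).mean()
    return precision, recall
```

By construction, identical real and generated sets yield precision = recall = 1, and well-separated sets yield 0 for both. Note the O(N\u00b2) memory of the dense distance matrix; at the 50k-sample design point used in the paper, a batched or tree-based nearest-neighbor query would be required instead.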
See Appendix A in\nthe supplement for pseudocode.\nIn practice, we compute the feature vector \u03c6 for a given image by feeding it to a pre-trained VGG-16\nclassifier [27] and extracting the corresponding activation vector after the second fully connected layer.\nBrock et al. [4] show that the nearest neighbors in this feature space are meaningful in the sense that\nthey correspond to semantically similar images. Meanwhile, Zhang et al. [32] use the intermediate\nactivations of multiple convolutional layers of VGG-16 to define a perceptual metric, which they\nshow correlates well with human judgment for image corruptions. We have tested both approaches\nand found that the feature space used by Brock et al. works considerably better for the purposes of\nour metric, presumably because it places less emphasis on the exact spatial arrangement \u2014 sparsely\nsampled manifolds rarely include near-exact matches in terms of spatial structure.\n\nFigure 3: (a) Varying |\u03a6|, VGG-16: our metric behaves similarly to FID in terms of varying sample count.\n(b) Varying k, VGG-16: precision (blue) and recall (orange) for several neighborhood sizes k. Larger k\nincreases both numbers. Here a trained model (\u03c8 = 1) was expanded to a family of models by artificially\nlimiting the variation in the results. We would expect the precision and recall to reach 1.0 and 0.0,\nrespectively, when \u03c8 \u2192 0. (c) Varying k, Inception-v3: using Inception-v3 features instead of VGG-16\nyields a substantially similar result.\n\nLike FID, our metric is weakly affected by the number of samples taken (Figure 3a). Since it is\nstandard practice to quote FIDs with 50k samples, we adopt the same design point for our metric\nas well.
The size of the neighborhood, k, is a compromise between covering the entire manifold\n(large values) and overestimating its volume as little as possible (small values). In practice, we have\nfound that higher values of k increase the precision and recall estimates in a fairly consistent fashion,\nand lower values of k decrease them, until they start saturating at 1.0 or 0.0 (Figure 3b). Tests with\nvarious datasets and GANs showed that k = 3 is a robust choice that avoids saturating the values\nmost of the time. Thus we use k = 3 and |\u03a6| = 50000 in all our experiments unless stated otherwise.\nFigure 3c further shows that the qualitative behavior of our metric is not limited to VGG-16 \u2013 which\nwe use in all tests \u2013 as Inception-v3 features lead to very similar results.\n\n3 Precision and recall of state-of-the-art generative models\n\nIn this section, we demonstrate that precision and recall computed using our method correlate well\nwith the perceived quality and variation of generated distributions, and compare our metric with\nSajjadi et al.\u2019s method [24] as well as the widely used FID metric [9]. For Sajjadi et al.\u2019s method,\nwe use 20 clusters and report F1/8 and F8 as proxies for precision and recall, respectively, as\nrecommended by the authors. We examine two state-of-the-art generative models, StyleGAN [12]\ntrained with the FFHQ dataset, and BigGAN [4] trained on ImageNet [5].\n\nStyleGAN Figure 4 shows the results of various metrics in four StyleGAN setups. These setups\nexhibit different amounts of truncation and training time, and have been selected to illustrate how\nthe metrics behave with varying output image distributions. Setup A is heavily truncated, and\nthe generated images are of high quality but very similar to each other in terms of color, pose,\nbackground, etc. This leads to high precision and low recall, as one would expect. 
Moving to setup B increases variation, which improves recall, while the image quality and thus precision is somewhat\ncompromised. Setup C is the FID-optimized configuration in [12]. It has even more variation in terms\nof color schemes and accessories such as hats and sunglasses, further improving recall. However,\nsome of the faces start to become distorted, which reduces precision. Finally, setup D preserves\nvariation and recall, but nearly all of the generated images have low quality, indicated by much lower\nprecision as expected.\nIn contrast, the method of Sajjadi et al. [24] indicates that setups B, C and D are all essentially perfect,\nand incorrectly assigns setup A the lowest precision. Looking at FID, setups B and D appear almost\nequally good, illustrating how much weight FID places on variation compared to image quality, also\nevidenced by the high FID of setup A. Setup C is ranked as clearly the best by FID despite the obvious\nimage artifacts. The ideal tradeoff between quality and variation depends on the intended application,\nbut it is unclear which application might favor setup D, where practically all images are broken, over\nsetup B, which produces high-quality samples at a lower variation. Our metric provides explicit visibility\non this tradeoff and allows quantifying the suitability of a given model for a particular application.\n\nFigure 4: Comparison of our method (black dots), Sajjadi et al.\u2019s method [24] (red triangles), and\nFID for four StyleGAN setups: A (FID = 91.7), B (FID = 16.9), C (FID = 4.5), and D (FID = 16.7).\nWe recommend zooming in to better assess the quality of images.\n\nFigure 5: (a) Example images produced by StyleGAN [12] trained using the FFHQ dataset, with\ntruncation \u03c8 = 0.0, 0.3, 0.7, and 1.0. It is generally agreed [17, 4, 13, 12] that truncation provides a\ntradeoff between perceptual quality and variation. (b) With our method, the maximally truncated setup\n(\u03c8 = 0) has zero recall but high precision. As truncation is gradually removed, precision drops and\nrecall increases as expected. The final recall value approximates the fraction of the training set the\ngenerator can reproduce (generally well below 100%). (c) The method of Sajjadi et al. reports both\nprecision and recall increasing as truncation is removed, contrary to the expected behavior, and the\nfinal numerical values of both precision and recall seem excessively high.\n\nFigure 5 examines the effect of gradually stronger truncation [17, 4, 13, 12] on precision and recall using a single\nStyleGAN generator. Our method again works as expected, while the method of Sajjadi et al. does not.\nWe hypothesize that their difficulties are a result of truncation packing a large number of generated\nimages into a small region in the embedding space. This may result in clusters that contain no real\nimages in that region, and ultimately causes the metric to incorrectly report low precision. The\ntendency to underestimate precision can be alleviated by using fewer clusters, but doing so leads to\noverestimation of recall. Our metric does not suffer from this problem because the manifolds of real\nand generated images are estimated separately, and the distributions are never mixed together.\n\nBigGAN Brock et al. recently presented BigGAN [4], a high-quality generative network able to\nsynthesize images for ImageNet [5]. ImageNet is a diverse dataset containing 1000 classes with\n\u223c1300 training images for each class.
Due to the large amount of variation within and between\nclasses, generative modeling of ImageNet has proven to be a challenging problem [21, 31, 4]. Brock\net al. [4] list several ImageNet classes that are particularly easy or difficult for their method. The difficult\nclasses often contain precise global structure or unaligned human faces, or they are underrepresented\nin the dataset. The easy classes are largely textural, lack exact global structure, and are common in\nthe dataset. Dogs are a noteworthy special case in ImageNet: with almost a hundred different dog\nbreeds listed as separate classes, there is much more training data for dogs than for any other class,\nmaking them artificially easy. To a lesser extent, the same applies to cats, which occupy \u223c10 classes.\n\nFigure 6: Our precision and recall for four easy (a) and four difficult (b) ImageNet classes using\nBigGAN. Easy: Great Pyrenees (FID = 30.0), Broccoli (FID = 40.2), Egyptian cat (FID = 39.2), and\nLemon (FID = 46.4). Difficult: Bubble (FID = 63.5), Baseball player (FID = 49.2), Trumpet\n(FID = 100.4), and Park bench (FID = 80.3). For each class we sweep the truncation parameter \u03c8\nlinearly from 0.3 to 1.0, left-to-right. The FIDs refer to a non-truncated model, i.e., \u03c8 = 1.0. The\nper-class metrics were computed using all available training images of the class and an equal number\nof generated images, while the curve for the entire dataset was computed using 50k real and generated\nimages.\n\nFigure 6 illustrates the precision and recall for some of these classes over a range of truncation\nvalues. We notice that precision is invariably high for the suspected easy classes, including cats and\ndogs, and clearly lower for the difficult ones. Brock et al.
state that the quality of generated samples\nincreases as more truncation is applied, and the precision as reported by our method is in line with\nthis observation. Recall paints a more detailed picture. It is very low for classes such as \u201cLemon\u201d or\n\u201cBroccoli\u201d, implying much of the variation has been missed, but FID is nevertheless quite good for\nboth. Since FID corresponds to a Wasserstein-2 distance in the feature space, low intrinsic variation\nimplies low FID even when much of that variation is missed. Correspondingly, recall is clearly higher\nfor the dif\ufb01cult classes. Based on visual inspection, these classes have a lot of intra-class variation\nthat BigGAN training has successfully modeled. Dogs and cats show recall similar to the dif\ufb01cult\nclasses, and their image quality and thus precision is likely boosted by the additional training data.\n\n4 Using precision and recall to analyze and improve StyleGAN\n\nGenerative models have seen rapid improvements recently, and FID has risen as the de facto standard\nfor determining whether a proposed technique is considered bene\ufb01cial or not. However, as we have\nshown in Section 3, relying on FID alone may hide important qualitative differences in the results and\nit may inadvertently favor a particular tradeoff between precision and recall that is not necessarily\naligned with the actual goals. In this section, we use our metric to shed light onto some of the\ndesign decisions associated with the model itself. Appendix C in the supplement performs a similar,\nprincipled analysis for truncation methods. We use StyleGAN [12] in all experiments, trained with\nFFHQ at 1024 \u00d7 1024.\n\n4.1 Network architectures and training con\ufb01gurations\n\nTo avoid drawing false conclusions when comparing different training runs, we must properly account\nfor the stochastic nature of the training process. 
For example, we have observed that FID can often\nvary by up to \u00b114% between consecutive training iterations with StyleGAN. The common approach\nis to amortize this variation by taking multiple snapshots of the model at regular intervals and selecting\nthe best one for further analysis [12]. With our metric, however, we are faced with the problem of\nmultiobjective optimization [3]: the snapshots represent a wide range of different tradeoffs between\nprecision and recall, as illustrated in Figure 7a. To avoid making assumptions about the desired\ntradeoff, we identify the Pareto frontier, i.e., the minimal subset of snapshots that is guaranteed to\ncontain the optimal choice for any given tradeoff.\n\nFigure 7: (a) Precision and recall for different snapshots of StyleGAN taken during the training,\nalong with their corresponding Pareto frontier. We use the standard training configuration by\nKarras et al. [12] with FFHQ and \u03c8 = 1. (b) Different training configurations lead to vastly different\ntradeoffs between precision and recall. (c) Best FID obtained for each configuration (lower is better):\nA) StyleGAN 4.43, B) No mb. std. 8.58, C) B + low \u03b3 10.34, D) No growing 6.14, E) Rand. trans. 4.27,\nF) No inst. norm. 4.16.\n\nFigure 7b shows the Pareto frontiers for several variants of StyleGAN. The baseline configuration (A)\nhas a dedicated minibatch standard deviation layer that aims to increase variation in the generated\nimages [11, 15]. Using our metric, we can confirm that this is indeed the case: removing the\nlayer shifts the tradeoff considerably in favor of precision over recall (B).
We observe that R1\nregularization [18] has a similar effect: reducing the \u03b3 parameter by 100\u00d7 shifts the balance even\nfurther (C). Karras et al. [11] argue that their progressive growing technique improves both quality\nand variation, and indeed, disabling it reduces both aspects (D). Moreover, we see that randomly\ntranslating the inputs of the discriminator by \u221216 . . . 16 pixels improves precision (E), whereas\ndisabling instance normalization in the AdaIN operation [10], unexpectedly, improves recall (F).\nFigure 7c shows the best FID obtained for each con\ufb01guration; the corresponding snapshots are\nhighlighted in Figure 7a,b. We see that FID favors con\ufb01gurations with high recall (A, F) over\nthe ones with high precision (B, C), and the same is also true for the individual snapshots. The\nbest con\ufb01guration in terms of recall (F) yields a new state-of-the-art FID for this dataset. Random\ntranslation (E) is an exceptional case: it improves precision at the cost of recall, similar to (B), but\nalso manages to slightly improve FID at the same time. We leave an in-depth study of these effects\nfor future work.\n\n5 Estimating the quality of individual samples\n\nWhile our precision metric provides a way to assess the overall quality of a population of generated\nimages, it yields only a binary result for an individual sample and therefore is not suitable for ranking\nimages by their quality. Here, we present an extension of the classi\ufb01cation function f (Equation 1)\nthat provides a continuous estimate of how close a given sample is to the manifold of real images.\nWe de\ufb01ne a realism score R that increases the closer an image is to the manifold and decreases the\nfurther an image is from the manifold. Let \u03c6g be a feature vector of a generated image and \u03c6r a\nfeature vector of a real image from set \u03a6r. 
The realism score of \u03c6g is calculated as\n\nR(\phi_g, \Phi_r) = \max_{\phi_r} \left\{ \frac{\|\phi_r - \mathrm{NN}_k(\phi_r, \Phi_r)\|_2}{\|\phi_g - \phi_r\|_2} \right\} \quad (3)\n\nThis is a continuous extension of f(\u03c6g, \u03a6r) with the simple relation that f(\u03c6g, \u03a6r) = 1 iff\nR(\u03c6g, \u03a6r) \u2265 1. In other words, when R \u2265 1, the feature vector \u03c6g is inside the (k-NN induced)\nhypersphere of at least one \u03c6r.\n\nFigure 8: Quality of individual samples of BigGAN from eight classes (Red wine, Alp, Golden\nRetriever, Ladybug, Lighthouse, Tabby cat, Monarch butterfly, Cocker Spaniel). Top: Images with\nhigh realism. Bottom: Images with low realism. We show two images with the highest and lowest\nrealism score selected from 1000 non-truncated images.\n\nWith any finite training set, the k-NN hyperspheres become larger in regions where the training\nsamples are sparse, i.e., regions with low representation. When measuring the quality of a large\npopulation of generated images, these underrepresented regions have little impact as it is unlikely\nthat too many generated samples land there \u2014 even though the hyperspheres may be large, they are\nsparsely located and cover a small volume of space in total. However, when computing the realism\nscore for a single image, a sample that happens to land in such a fringe hypersphere may obtain\na wildly inaccurate score. Large errors, even if they are rare, would undermine the usefulness of\nthe metric.
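A minimal NumPy sketch of Equation 3, under the same illustrative assumptions as before (`feats_real` stands for the embedded set \u03a6r and `phi_g` for a single generated feature vector; the robustness modification discussed next is omitted):

```python
import numpy as np

def realism_score(phi_g, feats_real, k=3):
    """R(phi_g, Phi_r) from Equation 3: the maximum over real feature vectors
    of (kth-NN radius of phi_r) / (distance from phi_g to phi_r)."""
    # kth-nearest-neighbor distance (hypersphere radius) of every real vector.
    d_rr = np.linalg.norm(feats_real[:, None, :] - feats_real[None, :, :], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]
    # Distance from the query to every real feature vector.
    d_gr = np.linalg.norm(feats_real - phi_g[None, :], axis=-1)
    # Guard against division by zero when phi_g coincides with some phi_r.
    return float(np.max(radii / np.maximum(d_gr, 1e-12)))
```

Consistent with the relation stated above, R \u2265 1 exactly when phi_g lies inside at least one hypersphere, i.e., when f(\u03c6g, \u03a6r) = 1.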
We tackle this problem by discarding half of the hyperspheres with the largest radii. In\nother words, the maximum in Equation 3 is not taken over all \u03c6r \u2208 \u03a6r but only over those \u03c6r whose\nassociated hypersphere radius is smaller than the median. This pruning yields an overconservative estimate\nof the real manifold, but it leads to more consistent realism scores. Note that we use this approach\nonly with R, not with f.\nFigure 8 shows example images from BigGAN with high and low realism. In general, the samples\nwith high realism display a clear object from the given class, whereas the object is often distorted\nbeyond recognition in the low-realism images. Appendix D in the supplement provides more examples.\n\n5.1 Quality of interpolations\n\nAn interesting application for the realism score is to evaluate the quality of interpolations. We do this\nwith StyleGAN using linear interpolation in the intermediate latent space W as suggested by Karras\net al. [12]. Figure 9 shows four example interpolation paths with randomly sampled latent vectors\nas endpoints. Path A appears to be located completely inside the real manifold, path D completely\noutside it, and paths B and C have one endpoint inside the real manifold and one outside it. The\nrealism scores assigned to paths A\u2013D correlate well with the perceived image quality: images with\nlow scores contain multiple artifacts and can be judged to be outside the real manifold, and vice versa\nfor high-scoring images. See Appendix D in the supplement for additional examples.\nWe can use interpolations to investigate the shape of the subset of W that produces realistic-looking\nimages. In this experiment, we sampled without truncation 1M latent vectors in W for which\nR \u2265 1, giving rise to 500k interpolation paths with both endpoints on the real manifold.
It would\nbe unrealistic to expect all intermediate images on these paths to also have R \u2265 1, so we chose to\nconsider an interpolation path where more than 25% of the intermediate images have R < 0.9 as\nstraying too far from the real manifold. Somewhat surprisingly, we found that only 2.4% of the\npaths crossed unrealistic parts of W under this definition, suggesting that the subset of W on the real\nmanifold is highly convex. We see potential in using the realism score for measuring the shape of this\nregion in W with greater accuracy, possibly allowing the exclusion of unrealistic images in a more\nrefined manner than with truncation-like methods.\n\nFigure 9: Realism score for four interpolation paths as a function of the linear interpolation parameter t,\nand corresponding images from paths A\u2013D. We did not use truncation when generating the images.\n\n6 Conclusion\n\nWe have demonstrated through several experiments that the separate assessment of precision and recall\ncan reveal interesting insights about generative models and can help to improve them further. We\nbelieve that the separate quantification of precision can also be useful in the context of image-to-image\ntranslation [33], where the quality of individual images is of great interest.\nUsing our metric, we have identified previously unknown training configuration-related effects in\nSection 4.1, raising the question of whether truncation is really necessary if similar tradeoffs can be\nachieved by modifying the training configuration appropriately. We leave the in-depth study of these\neffects for future work.\nFinally, it has recently emerged that density models can be incapable of assessing whether a given\nexample belongs to the training distribution [22].
By explicitly modeling the real manifold, our metrics may provide an alternative way of estimating this.

7 Acknowledgements

We thank David Luebke for helpful comments, and Janne Hellsten and Tero Kuosmanen for compute infrastructure.

References

[1] S. Arora and Y. Zhang. Do GANs actually learn the distribution? An empirical study. CoRR, abs/1706.08224, 2017.
[2] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. CoRR, abs/1801.01401, 2018.
[3] J. Branke, J. Branke, K. Deb, K. Miettinen, and R. Słowiński. Multiobjective optimization: Interactive and evolutionary approaches, volume 5252. Springer Science & Business Media, 2008.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proc. ICLR, 2019.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
[6] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. CoRR, abs/1605.08803, 2016.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In NIPS, 2014.
[8] A. Grover, M. Dhar, and S. Ermon. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In Proc. AAAI, 2018.
[9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626–6637, 2017.
[10] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.
[11] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
[12] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, 2019.
[13] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. CoRR, abs/1807.03039, 2018.
[14] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-supervised learning with deep generative models. In Proc. NIPS, 2014.
[15] Z. Lin, A. Khetan, G. Fanti, and S. Oh. PacGAN: The power of two samples in generative adversarial networks. CoRR, abs/1712.04086, 2017.
[16] D. Lopez-Paz and M. Oquab. Revisiting classifier two-sample tests. In Proc. ICLR, 2017.
[17] M. Marchesi. Megapixel size image creation using generative adversarial networks. CoRR, abs/1706.00082, 2017.
[18] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? CoRR, abs/1801.04406, 2018.
[19] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016.
[20] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. CoRR, abs/1802.05957, 2018.
[21] T. Miyato and M. Koyama. cGANs with projection discriminator. CoRR, abs/1802.05637, 2018.
[22] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don't know? In Proc. ICLR, 2019.
[23] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
[24] M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. CoRR, abs/1806.00035, 2018.
[25] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[26] L. Simon, R. Webster, and J. Rabin. Revisiting precision and recall definition for generative model evaluation. CoRR, abs/1905.05441, 2019.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[28] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. In Proc. ICLR, 2018.
[29] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, pages 1747–1756, 2016.
[30] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016.
[31] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. CoRR, abs/1805.08318, 2018.
[32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. CVPR, 2018.
[33] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.