{"title": "Are GANs Created Equal? A Large-Scale Study", "book": "Advances in Neural Information Processing Systems", "page_first": 700, "page_last": 709, "abstract": "Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced in \\cite{goodfellow2014generative}.", "full_text": "Are GANs Created Equal? A Large-Scale Study

Mario Lucic* Karol Kurach* Marcin Michalski

Google Brain

Olivier Bousquet

Sylvain Gelly

Abstract

Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. 
We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced in [9].

1 Introduction

Generative adversarial networks (GAN) are a powerful subclass of generative models and were successfully applied to image generation and editing, semi-supervised learning, and domain adaptation [22, 27]. In the GAN framework the model learns a deterministic transformation G of a simple distribution pz, with the goal of matching the data distribution pd. This learning problem may be viewed as a two-player game between the generator, which learns how to generate samples which resemble real data, and a discriminator, which learns how to discriminate between real and fake data. Both players aim to minimize their own cost and the solution to the game is the Nash equilibrium where neither player can improve their cost unilaterally [9].

Various flavors of GANs have been recently proposed, both purely unsupervised [9, 1, 10, 5] as well as conditional [20, 21]. While these models achieve compelling results in specific domains, there is still no clear consensus on which GAN algorithm(s) perform objectively better than others. This is partially due to the lack of a robust and consistent metric, as well as limited comparisons which put all algorithms on an equal footing, including the computational budget to search over all hyperparameters.

Why is it important? 
Firstly, to help the practitioner choose a better algorithm from a very large set. Secondly, to make progress towards better algorithms and their understanding, it is useful to clearly assess which modifications are critical, and which ones are only good on paper but do not make a significant difference in practice.

The main issue with evaluation stems from the fact that one cannot explicitly compute the probability pg(x). As a result, classic measures, such as log-likelihood on the test set, cannot be evaluated. Consequently, many researchers focused on qualitative comparison, such as comparing the visual quality of samples. Unfortunately, such approaches are subjective and possibly misleading [8]. As a remedy, two evaluation metrics were proposed to quantitatively assess the performance of GANs. Both assume access to a pre-trained classifier. Inception Score (IS) [24] is based on the fact that a good model should generate samples for which, when evaluated by the classifier, the class distribution has low entropy. At the same time, it should produce diverse samples covering all classes. In contrast, Fréchet Inception Distance is computed by considering the difference in embedding of true and fake data [11]. Assuming that the coding layer follows a multivariate Gaussian distribution, the distance between the distributions is reduced to the Fréchet distance between the corresponding Gaussians.

*Indicates equal authorship. Correspondence to {lucic,kkurach}@google.com.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our main contributions: (1) We provide a fair and comprehensive comparison of the state-of-the-art GANs, and empirically demonstrate that nearly all of them can reach similar values of FID, given a high enough computational budget. 
(2) We provide strong empirical evidence² that to compare GANs it is necessary to report a summary of the distribution of results, rather than the best result achieved, due to the randomness of the optimization process and model instability. (3) We assess the robustness of FID to mode dropping and to the use of a different encoding network, and provide estimates of the best FID achievable on classic data sets. (4) We introduce a series of tasks of increasing difficulty for which undisputed measures, such as precision and recall, can be approximately computed. (5) We open-sourced our experimental setup and model implementations at goo.gl/G8kf5J.

2 Background and Related Work

There are several ongoing challenges in the study of GANs, including their convergence and generalization properties [2, 19], and optimization stability [24, 1]. Arguably, the most critical challenge is their quantitative evaluation. The classic approach towards evaluating generative models is based on model likelihood, which is often intractable. While the log-likelihood can be approximated for distributions on low-dimensional vectors, in the context of complex high-dimensional data the task becomes extremely challenging. Wu et al. [26] suggest an annealed importance sampling algorithm to estimate the hold-out log-likelihood. The key drawback of the proposed approach is the assumption of the Gaussian observation model, which carries over all issues of kernel density estimation in high-dimensional spaces. Theis et al. [25] provide an analysis of common failure modes and demonstrate that it is possible to achieve high likelihood but low visual quality, and vice-versa. Furthermore, they argue against using Parzen window density estimates, as the likelihood estimate is often incorrect. In addition, ranking models based on these estimates is discouraged [4]. For a discussion of other drawbacks of likelihood-based training and evaluation, consult Huszár [12].

Inception Score (IS). Proposed by Salimans et al. [24], IS offers a way to quantitatively evaluate the quality of generated samples. The score was motivated by the following considerations: (i) the conditional label distribution of samples containing meaningful objects should have low entropy, and (ii) the variability of the samples should be high, or equivalently, the marginal ∫z p(y | x = G(z)) dz should have high entropy. Finally, these desiderata are combined into one score, IS(G) = exp(Ex∼G[dKL(p(y | x), p(y))]). The classifier is Inception Net trained on ImageNet. The authors found that this score is well-correlated with scores from human annotators [24]. Drawbacks include insensitivity to the prior distribution over labels and not being a proper distance.

Fréchet Inception Distance (FID). Proposed by Heusel et al. [11], FID provides an alternative approach. To quantify the quality of generated samples, they are first embedded into a feature space given by a specific layer of Inception Net. Then, viewing the embedding layer as a continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data. The Fréchet distance between these two Gaussians is then used to quantify the quality of the samples, i.e. FID(x, g) = ||µx − µg||₂² + Tr(Σx + Σg − 2(ΣxΣg)^(1/2)), where (µx, Σx) and (µg, Σg) are the mean and covariance of the sample embeddings from the data distribution and model distribution, respectively. The authors show that the score is consistent with human judgment and more robust to noise than IS [11]. Furthermore, the authors present compelling results showing negative correlation between the FID and visual quality of generated samples. Unlike IS, FID can detect intra-class mode dropping, i.e. a model that generates only one image per class can score a perfect IS, but will have a bad FID. 
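As a concrete reference for the two definitions above, both scores reduce to a few lines once the classifier probabilities and embedding moments are available. The following is a minimal NumPy/SciPy sketch (illustrative only, not the implementation open-sourced with this paper); `p_yx` and the Gaussian moments are assumed to be precomputed from the Inception network.

```python
import numpy as np
from scipy import linalg

def inception_score(p_yx):
    """IS(G) = exp(E_x[ d_KL(p(y|x), p(y)) ]) from an (n_samples, n_classes)
    matrix of classifier probabilities for generated samples."""
    p_y = p_yx.mean(axis=0)                                    # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)   # per-sample KL divergence
    return float(np.exp(kl.mean()))

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID(x, g) = ||mu_x - mu_g||^2 + Tr(S_x + S_g - 2 (S_x S_g)^(1/2))."""
    covmean = linalg.sqrtm(sigma_x @ sigma_g).real   # drop tiny imaginary parts
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```

A uniform `p_yx` gives IS = 1 (no meaningful objects), and identical moments give FID = 0; `linalg.sqrtm` can return small imaginary components due to floating-point error, hence the `.real`.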
We provide a thorough empirical analysis of FID in Section 5. A significant drawback of both measures is the inability to detect overfitting. A “memory GAN” which stores all training samples would score perfectly. Finally, as the FID estimator is consistent, relative model comparisons for large sample sizes are sound.

A very recent study comparing several GANs using IS has been presented by Fedus et al. [7]. The authors focus on IS and consider a smaller subset of GANs. In contrast, our focus is on providing a fair assessment of the current state-of-the-art GANs using FID, as well as precision and recall, and also verifying the robustness of these models in a large-scale empirical evaluation.

²Reproducing these experiments requires approximately 6.85 GPU years (NVIDIA P100).

Table 1: Generator and discriminator loss functions. The main differences are whether the discriminator outputs a probability (MM GAN, NS GAN, DRAGAN) or its output is unbounded (WGAN, WGAN GP, LS GAN, BEGAN), whether a gradient penalty is present (WGAN GP, DRAGAN), and where it is evaluated.

GAN | DISCRIMINATOR LOSS | GENERATOR LOSS
MM GAN | L_D = −Ex∼pd[log(D(x))] − Ex̂∼pg[log(1 − D(x̂))] | L_G = Ex̂∼pg[log(1 − D(x̂))]
NS GAN | L_D = −Ex∼pd[log(D(x))] − Ex̂∼pg[log(1 − D(x̂))] | L_G = −Ex̂∼pg[log(D(x̂))]
WGAN | L_D = −Ex∼pd[D(x)] + Ex̂∼pg[D(x̂)] | L_G = −Ex̂∼pg[D(x̂)]
WGAN GP | L_D = L_D^WGAN + λEx̂∼pg[(||∇D(αx + (1 − α)x̂)||₂ − 1)²] | L_G = −Ex̂∼pg[D(x̂)]
LS GAN | L_D = −Ex∼pd[(D(x) − 1)²] + Ex̂∼pg[D(x̂)²] | L_G = −Ex̂∼pg[(D(x̂) − 1)²]
DRAGAN | L_D = L_D^GAN + λEx̂∼pd+N(0,c)[(||∇D(x̂)||₂ − 1)²] | L_G = Ex̂∼pg[log(1 − D(x̂))]
BEGAN | L_D = Ex∼pd[||x − AE(x)||₁] − kt Ex̂∼pg[||x̂ − AE(x̂)||₁] | L_G = Ex̂∼pg[||x̂ − AE(x̂)||₁]

3 Flavors of Generative Adversarial Networks

In this work we focus on unconditional generative adversarial networks. In this setting, only unlabeled data is available for learning. The optimization problems arising from existing approaches differ in (i) the constraint on the discriminator's output and the corresponding loss, and (ii) the presence and application of a gradient norm penalty.

In the original GAN formulation [9] two loss functions were proposed. In the minimax GAN the discriminator outputs a probability and the loss function is the negative log-likelihood of a binary classification task (MM GAN in Table 1). Here the generator learns to generate samples that have a low probability of being fake. To improve the gradient signal, the authors also propose the non-saturating loss (NS GAN in Table 1), where the generator instead aims to maximize the probability of generated samples being real. In Wasserstein GAN [1] the discriminator is allowed to output a real number and the objective function is equivalent to the MM GAN loss without the sigmoid (WGAN in Table 1). The authors prove that, under an optimal (Lipschitz smooth) discriminator, minimizing the value function with respect to the generator minimizes the Wasserstein distance between model and data distributions. Weights of the discriminator are clipped to a small absolute value to enforce smoothness. To improve on the stability of the training, Gulrajani et al. [10] instead add a soft constraint on the norm of the gradient which encourages the discriminator to be 1-Lipschitz. 
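This soft gradient penalty can be sketched numerically. The helper below is illustrative (not from the paper's released code): `grad_d` stands in for automatic differentiation and maps a point to the critic's input gradient, and the interpolation points follow the WGAN GP construction in Table 1.

```python
import numpy as np

def gradient_penalty(grad_d, real, fake, lam=10.0, seed=0):
    """WGAN GP-style penalty: the critic's gradient norm, evaluated at random
    interpolates between real and generated points, is pushed towards 1."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(size=(real.shape[0], 1))   # one interpolation factor per pair
    interp = alpha * real + (1.0 - alpha) * fake   # points between data and samples
    norms = np.array([np.linalg.norm(grad_d(p)) for p in interp])
    return float(lam * np.mean((norms - 1.0) ** 2))
```

For a linear critic D(x) = w·x, the input gradient is w everywhere, so the penalty is simply λ(||w|| − 1)² regardless of where the interpolates land.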
The gradient norm is evaluated on points obtained by linear interpolation between data points and generated samples, where the optimal discriminator should have unit gradient norm [10]. A gradient norm penalty can also be added to both MM GAN and NS GAN and evaluated around the data manifold (DRAGAN [15] in Table 1, based on NS GAN). This encourages the discriminator to be piecewise linear around the data manifold. Note that the gradient norm can also be evaluated between fake and real points, similarly to WGAN GP, and added to either MM GAN or NS GAN [7]. Mao et al. [18] propose a least-squares loss for the discriminator and show that minimizing the corresponding objective (LS GAN in Table 1) implicitly minimizes the Pearson χ² divergence. The idea is to provide a smooth loss which saturates slower than the sigmoid cross-entropy loss of the original MM GAN. Finally, Berthelot et al. [5] propose to use an autoencoder as a discriminator and optimize a lower bound of the Wasserstein distance between auto-encoder loss distributions on real and fake data. They introduce an additional hyperparameter γ to control the equilibrium between the generator and discriminator.

4 Challenges of a Fair Comparison

There are several interesting dimensions to this problem, and there is no single right way to compare these models (i.e. the loss function used in each GAN). Unfortunately, due to the combinatorial explosion in the number of choices and their ordering, not all relevant options can be explored. While there is no definite answer on how to best compare two models, in this work we have made several pragmatic choices which were motivated by two practical concerns: providing a neutral and fair comparison, and a hard limit on the computational budget.

Which metric to use? Comparing models implies access to some metric. As discussed in Section 2, classic measures, such as model likelihood, cannot be applied. 
We will argue for and study two sets of evaluation metrics in Section 5: FID, which can be computed on all data sets, and precision, recall, and F1, which we can compute for the proposed tasks.

DATA SET | AVG. FID | DEV. FID
CELEBA | 2.27 | 0.02
CIFAR10 | 5.19 | 0.02
FASHION-MNIST | 2.60 | 0.03
MNIST | 1.25 | 0.02

(a) Bias and variance (b) Mode dropping (c) VGG vs Inception

Figure 1: Figure (a) shows that FID has a slight bias, but low variance on samples of size 10000. Figure (b) shows that FID is extremely sensitive to mode dropping. Figure (c) shows the high rank correlation (Spearman's ρ = 0.9) between FID computed on InceptionNet vs FID computed using VGG for the CELEBA data set (for the interesting range: FID < 200).

How to compare models? Even when the metric is fixed, a given algorithm can achieve very different scores when varying the architecture, hyperparameters, random initialization (i.e. random seed for initial network weights), or the data set. Sensible targets include the best score across all dimensions (e.g. to claim the best performance on a fixed data set), the average or median score (rewarding models which are good in expectation), or even the worst score (rewarding models with worst-case robustness). These choices can even be combined: for example, one might train the model multiple times using the best hyperparameters, and average the score over random initializations.

For each of these dimensions, we took several pragmatic choices to reduce the number of possible configurations, while still exploring the most relevant options.

1. Architecture: We use the same architecture for all models. We note that this architecture suffices to achieve good performance on the considered data sets.

2. Hyperparameters: For both training hyperparameters (e.g. the learning rate), as well as model-specific ones (e.g. 
gradient penalty multiplier), there are two valid approaches: (i) perform the hyperparameter optimization for each data set, or (ii) perform the hyperparameter optimization on one data set and infer a good range of hyperparameters to use on other data sets. We explore both avenues in Section 6.

3. Random seed: Even with everything else being fixed, varying the random seed may influence the results. We study this effect and report the corresponding confidence intervals.

4. Data set: We chose four popular data sets from the GAN literature.

5. Computational budget: Depending on the budget to optimize the parameters, different algorithms can achieve the best results. We explore how the results vary depending on the budget k, where k is the number of hyperparameter settings for a fixed model.

In practice, one can either use hyperparameter values suggested by the respective authors, or try to optimize them. Figure 4 and in particular Figure 14 show that optimization is necessary. Hence, we optimize the hyperparameters for each model and data set by performing a random search. While we present the results which were obtained by a random search, we have also investigated sequential Bayesian optimization, which resulted in comparable results. We acknowledge that models with fewer hyperparameters have an advantage over models with many hyperparameters, but consider this fair as it reflects the experience of practitioners searching for good hyperparameters for their setting.

5 Metrics

In this work we focus on two sets of metrics. We first analyze the recently proposed FID in terms of robustness (of the metric itself), and conclude that it has desirable properties and can be used in practice. 
Nevertheless, this metric, as well as Inception Score, is incapable of detecting overfitting: a memory GAN which simply stores all training samples would score perfectly under both measures. Based on these shortcomings, we propose an approximation to precision and recall for GANs and show that it can be used to quantify the degree of overfitting. We stress that the proposed method should be viewed as complementary to IS or FID, rather than a replacement.

Figure 2: Samples from models with (a) high precision and high recall, (b) high precision, but low recall (lacking in diversity), (c) low precision, but high recall (can decently reproduce triangles, but fails to capture convexity), and (d) low precision and low recall.

Fréchet Inception Distance. FID was shown to be robust to noise [11]. Here we quantify the bias and variance of FID, its sensitivity to the encoding network, and its sensitivity to mode dropping. To this end, we partition the data set into two groups, i.e. X = X1 ∪ X2. Then, we define the data distribution pd as the empirical distribution on a random subsample of X1 and the model distribution pg to be the empirical distribution on a random subsample from X2. For a random partition this “model distribution” should follow the data distribution.

We evaluate the bias and variance of FID on four data sets from the GAN literature. We start by using the default train vs. test partition and compute the FID between the test set (limited to N = 10000 samples for CelebA) and a sample of size N from the train set. Sampling from the train set is performed M = 50 times. 
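The protocol just described can be sketched as follows (a hypothetical NumPy/SciPy helper operating on precomputed embeddings, not the paper's experimental code): fit Gaussians to repeated subsamples of the two halves of the partition and summarize the resulting FID values.

```python
import numpy as np
from scipy import linalg

def fid_from_samples(a, b):
    """FID between Gaussians fitted to two sets of embeddings."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def fid_bias_variance(x1, x2, n, m=50, seed=0):
    """Compute FID m times between size-n subsamples of the two halves of a
    partition X = X1 u X2; the mean reflects the bias and the spread the
    variance of the estimator."""
    rng = np.random.default_rng(seed)
    fids = [fid_from_samples(x1[rng.choice(len(x1), n, replace=False)],
                             x2[rng.choice(len(x2), n, replace=False)])
            for _ in range(m)]
    return float(np.mean(fids)), float(np.std(fids))
```

For two subsamples of the same distribution the estimate stays near zero with small spread, while a distribution shift inflates it sharply.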
The optimistic estimates of FID are reported in the table of Figure 1(a). We observe that FID has a high bias, but small variance. From this perspective, estimating the full covariance matrix might be unnecessary and counter-productive, and a constrained version might suffice. To test the sensitivity to train vs. test partitioning, we consider 50 random partitions (keeping the relative sizes fixed, i.e. 6 : 1 for MNIST) and compute the FID with M = 1 sample. We observe results similar to Figure 1(a), which is expected as both training and testing data sets are sampled from the same distribution. Furthermore, we evaluate the sensitivity to mode dropping as follows: we fix a partition X = X1 ∪ X2 and subsample X2 while keeping only samples from the first k classes, increasing k from 1 to 10. For each k, we consider 50 random subsamples from X2. Figure 1 shows that FID is heavily influenced by the missing modes. Finally, we estimate the sensitivity to the choice of the encoding network by computing FID using the 4096-dimensional FC7 layer of the VGG network trained on ImageNet. Figure 1 shows the resulting distribution. We observe a high Spearman's rank correlation (ρ = 0.9), which encourages the use of the coding layer suggested by the authors.

Precision, recall and F1 score. Precision, recall and F1 score are proven and widely adopted techniques for quantitatively evaluating the quality of discriminative models. Precision measures the fraction of relevant instances among the retrieved instances, while recall measures the fraction of the relevant instances that were retrieved. F1 score is the harmonic average of precision and recall. Notice that IS mainly captures precision: it will not penalize the model for not producing all modes of the data distribution; it will only penalize the model for not producing all classes. On the other hand, FID captures both precision and recall. 
Indeed, a model which fails to recover different modes of the data distribution will suffer in terms of FID.

We propose a simple and effective data set for evaluating (and comparing) generative models. Our main motivation is that the currently used data sets are either too simple (e.g. simple mixtures of Gaussians, or MNIST) or too complex (e.g. ImageNet). We argue that it is critical to be able to increase the complexity of the task in a relatively smooth and controlled fashion. To this end, we present a set of tasks for which we can approximate the precision and recall of each model. As a result, we can compare different models based on established metrics. The main idea is to construct a data manifold such that the distances from samples to the manifold can be computed efficiently. As a result, the problem of evaluating the quality of the generative model is effectively transformed into a problem of computing the distance to the manifold. This enables an intuitive approach for defining the quality of the model. Namely, if the samples from the model distribution pg are (on average) close to the manifold, its precision is high. Similarly, high recall implies that the generator can recover (i.e. generate something close to) any sample from the manifold.

Figure 3: How does the minimum FID behave as a function of the budget? The plot shows the distribution of the minimum FID achievable for a fixed budget along with one standard deviation interval. For each budget, we estimate the mean and variance using 5000 bootstrap resamples out of 100 runs. We observe that, given a relatively low budget, all models achieve a similar minimum FID. Furthermore, for a fixed FID, “bad” models can outperform “good” models given enough computational budget. 
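The bootstrap estimate described in this caption can be sketched as follows (a hypothetical helper; `run_fids` would hold the per-run FIDs collected during the hyperparameter search):

```python
import numpy as np

def min_fid_vs_budget(run_fids, budgets, n_boot=5000, seed=0):
    """For each budget k, estimate the mean and standard deviation of the
    minimum FID among k runs, via bootstrap resampling of the observed runs."""
    rng = np.random.default_rng(seed)
    run_fids = np.asarray(run_fids, dtype=float)
    out = {}
    for k in budgets:
        # Draw n_boot simulated searches of k runs each, keep the best of each.
        mins = rng.choice(run_fids, size=(n_boot, k), replace=True).min(axis=1)
        out[k] = (float(mins.mean()), float(mins.std()))
    return out
```

The mean of the minimum decreases monotonically with the budget, which is exactly the effect visible in Figure 3.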
We argue that the computational budget to search over hyperparameters is an important aspect of the comparison between algorithms.

For general data sets, this reduction is impractical as one has to compute the distance to the manifold which we are trying to learn. However, if we construct a manifold such that this distance is efficiently computable, the precision and recall can be efficiently evaluated. To this end, we propose a set of toy data sets for which such computation can be performed efficiently: the manifold of convex polygons. As the simplest example, let us focus on gray-scale triangles represented as one-channel images as in Figure 2. These triangles belong to a low-dimensional manifold C3 embedded in R^(d×d). Intuitively, the coordinate system of this manifold represents the axes of variation (e.g. rotation, translation, minimum angle size, etc.). A good generative model should be able to capture these factors of variation and recover the training samples. Furthermore, it should recover any sample from this manifold, from which we can efficiently sample, as illustrated in Figure 2.

Computing the distance to the manifold. Let us consider the simplest case: single-channel gray-scale images represented as vectors x ∈ R^(d²). The distance of a sample x̂ ∈ R^(d²) to the manifold is defined as the squared Euclidean distance to the closest sample from the manifold C3, i.e. min_{x∈C3} ℓ(x, x̂) = Σ_{i=1..d²} ||xi − x̂i||₂². This is a non-convex optimization problem. We find an approximate solution by gradient descent on the vertices of the triangle (more generally, a convex polygon), ensuring that each iterate is a valid triangle (more generally, a convex polygon). To reduce the false-negative rate we repeat the algorithm 5 times from random initial solutions. To compute the latent representation of a sample x̂ ∈ R^(d×d) we invert the generator, i.e. we solve z* = argmin_{z∈R^(dz)} ||x̂ − G(z)||₂², using gradient descent on z while keeping G fixed [17].

6 Large-scale Experimental Evaluation

We consider two budget-constrained experimental setups: (i) the wide one-shot setup, where one may select 100 samples of hyperparameters per model and the range for each hyperparameter is wide, and (ii) the narrow two-shot setup, where one is allowed to select 50 samples from narrower ranges which were manually selected by first performing the wide hyperparameter search over a specific data set. For the exact ranges and hyperparameter search details we refer the reader to Appendix A. In the second set of experiments we evaluate the models based on the "novel" metric: F1 score on the proposed data set. Finally, we included the Variational Autoencoder [14] in the experiments as a popular alternative.

Experimental setup. To ensure a fair comparison, we made the following choices: (i) we use the generator and discriminator architecture from INFO GAN [6], as the resulting function space is rich enough and none of the considered GANs was originally designed for this architecture. Furthermore, it is similar to a proven architecture used in DCGAN [22]. The exception is BEGAN, where an autoencoder is used as the discriminator. We maintain similar expressive power to INFO GAN by using identical convolutional layers in the encoder and approximately matching the total number of parameters.

For all experiments we fix the latent code size to 64 and the prior distribution over the latent space to be uniform on [−1, 1]⁶⁴, except for VAE where it is Gaussian N(0, I). We choose Adam [13] as the optimization algorithm as it was the most popular choice in the GAN literature (cf. Appendix F for an empirical comparison to RMSProp). 
We apply the same learning rate for both the generator and discriminator. We set the batch size to 64 and perform optimization for 20 epochs on MNIST and FASHION-MNIST, 40 on CELEBA, and 100 on CIFAR. These data sets are a popular choice for generative modeling and range from simple to medium complexity, which makes it possible to run many experiments as well as to obtain decent results.

Figure 4: A wide-range hyperparameter search (100 hyperparameter samples per model). Black stars indicate the performance of suggested hyperparameter settings. We observe that GAN training is extremely sensitive to hyperparameter settings and there is no model which is significantly more stable than others.

Finally, we allow for recent suggestions, such as batch normalization in the discriminator, and imbalanced update frequencies of the generator and discriminator. We explore these possibilities, together with the learning rate, the parameter β1 for ADAM, and the hyperparameters of each model. We report the hyperparameter ranges and other details in Appendix A.

A large hyperparameter search. We perform hyperparameter optimization and, for each run, look for the best FID across the training run (simulating early stopping). To choose the best model, every 5 epochs we compute the FID between 10k samples generated by the model and the 10k samples from the test set. We have performed this computationally expensive search for each data set. We present the sensitivity of models to the hyperparameters in Figure 4 and the best FID achieved by each model in Table 2. 
We compute the best FID, in two phases: We \ufb01rst run a large-scale search\non a wide range of hyper-parameters, and select the best model. Then, we re-run the training of the\nselected model 50 times with different initialization seeds, to estimate the stability of the training and\nreport the mean FID and standard deviation, excluding outliers.\nFurthermore, we consider the mean FID as the computational budget increases which is shown\nin Figure 3. There are three important observations. Firstly, there is no algorithm which clearly\ndominates others. Secondly, for an interesting range of FIDs, a \u201cbad\u201d model trained on a large budget\ncan out perform a \u201cgood\u201d model trained on a small budget. Finally, when the budget is limited, any\nstatistically signi\ufb01cant comparison of the models is unattainable.\nImpact of limited computational budget. In some cases, the computational budget available to a\npractitioner is too small to perform such a large-scale hyperparameter search. Instead, one can tune\nthe range of hyperparameters on one data set and interpolate the good hyperparameter ranges for\nother data sets. We now consider this setting in which we allow only 50 samples from a set of narrow\nranges, which were selected based on the wide hyperparameter search on the FASHION-MNIST data\nset. We report the narrow hyperparameter ranges in Appendix A. Figure 14 shows the variance of\nFID per model, where the hyperparameters were selected from narrow ranges. From the practical\npoint of view, there are signi\ufb01cant differences between the models: in some cases the hyperparameter\n\nTable 2: Best FID obtained in a large-scale hyperparameter search for each data set. The asterisk (*) on some\ncombinations of models and data sets indicates the presence of signi\ufb01cant outlier runs, usually severe mode\ncollapses or training failures (** indicates up to 20% failures). 
We observe that the performance of each model heavily depends on the data set and no model strictly dominates the others. Note that these results are not "state-of-the-art": (i) larger architectures could improve all models, and (ii) authors often report the best FID, which opens the door for random seed optimization.

Model      MNIST         FASHION       CIFAR           CELEBA
MM GAN     9.8 ± 0.9     29.6 ± 1.6    72.7 ± 3.6      65.6 ± 4.2
NS GAN     6.8 ± 0.5     26.5 ± 1.6    58.5 ± 1.9      55.0 ± 3.3
LSGAN      7.8 ± 0.6*    30.7 ± 2.2    87.1 ± 47.5     53.9 ± 2.8*
WGAN       6.7 ± 0.4     21.5 ± 1.6    55.2 ± 2.3      41.3 ± 2.0
WGAN GP    20.3 ± 5.0    24.5 ± 2.1    55.8 ± 0.9      30.0 ± 1.0
DRAGAN     7.6 ± 0.4     27.7 ± 1.2    69.8 ± 2.0      42.3 ± 3.0
BEGAN      13.1 ± 1.0    22.9 ± 0.9    71.4 ± 1.6      38.9 ± 0.9
VAE        23.8 ± 0.6    58.7 ± 1.2    155.7 ± 11.6    85.7 ± 3.8

Figure 5: How does the F1 score vary with the computational budget? The plot shows the distribution of the maximum F1 score achievable for a fixed budget, with a 95% confidence interval. For each budget, we estimate the mean and the confidence interval (of the mean) using 5000 bootstrap resamples out of 100 runs. When optimizing for F1 score, both NS GAN and WGAN enjoy high precision and recall. The underwhelming performance of BEGAN and VAE on this particular data set merits further investigation.

ranges transfer from one data set to the others (e.g. NS GAN), while others are more sensitive to this choice (e.g. WGAN).
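The bootstrap estimate behind the budget curves (Figure 5) can be sketched as follows. The function name and the exact resampling scheme are our assumptions; the idea is to resample `budget` scores (with replacement) from the pool of completed runs and take the best score of each resample.

```python
import numpy as np

def bootstrap_best_for_budget(run_scores, budget, n_boot=5000,
                              higher_is_better=True, seed=0):
    """Estimate the mean and a 95% interval of the best score achievable
    with `budget` runs: draw `budget` scores with replacement from the pool
    of completed runs, take the best per resample, repeat `n_boot` times."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(run_scores, dtype=float)
    reduce = np.max if higher_is_better else np.min  # F1 up, FID down
    resamples = rng.choice(scores, size=(n_boot, budget), replace=True)
    best = reduce(resamples, axis=1)                 # best score per resample
    low, high = np.percentile(best, [2.5, 97.5])
    return best.mean(), (low, high)
```

Setting `higher_is_better=False` gives the analogous curves for FID, where lower is better.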
We note that better scores can be obtained by a wider hyperparameter search. These results support the conclusion that discussing the best score obtained by a model on a data set is not a meaningful way to discern between these models. One should instead discuss the distribution of the obtained scores.

Robustness to random initialization. For a fixed model, hyperparameters, training algorithm, and order in which the data is presented to the model, one would expect similar model performance. To test this hypothesis we re-train the best models from the limited hyperparameter range considered in the previous section, while changing the initial weights of the generator and discriminator networks (i.e. by varying the random seed). Table 2 and Figure 15 show the results for each data set. Most models are relatively robust to random initialization, except LSGAN, although for all of them the variance is significant and should be taken into account when comparing models.

Precision, recall, and F1. We perform a search over the wide range of hyperparameters and compute precision and recall by considering n = 1024 samples. In particular, we compute the precision of the model as the fraction of generated samples whose distance to the data manifold is below a threshold δ = 0.75. We then consider n samples from the test set, invert each sample x to compute z* = G^{-1}(x), and compute the squared Euclidean distance between x and G(z*). We define the recall as the fraction of samples with squared Euclidean distance below δ. Figure 5 shows the results, where we select the best F1 score for a fixed model and hyperparameters and vary the budget. We observe that even for this seemingly simple task, many models struggle to achieve a high F1 score.
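Once the distances are precomputed, the thresholding reduces to a few lines. The helper below is an illustrative sketch; its name and the precomputed-distance inputs are our assumptions, since obtaining the manifold distances and the inversions G^{-1}(x) requires the trained generator.

```python
import numpy as np

def precision_recall_f1(gen_manifold_sq_dist, test_recon_sq_dist, delta=0.75):
    """Precision: fraction of generated samples whose squared distance to the
    data manifold falls below `delta`. Recall: fraction of test samples whose
    inversion error ||x - G(z*)||^2 falls below `delta`. Both inputs are
    assumed precomputed (n = 1024 samples each in the paper's setup)."""
    precision = float(np.mean(np.asarray(gen_manifold_sq_dist) < delta))
    recall = float(np.mean(np.asarray(test_recon_sq_dist) < delta))
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A model that copies a single training example scores high precision but near-zero recall, while a model that covers the manifold only coarsely shows the opposite pattern; F1 penalizes both failure modes.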
Analogous plots where we instead maximize precision or recall for various thresholds are presented in Appendix E.

7 Limitations of the Study

Data sets, neural architectures, and optimization issues. While we consider classic data sets from GAN research, unconditional generation was recently applied to data sets of higher resolution and arguably higher complexity. In this study we use one neural network architecture which suffices to achieve good results in terms of FID on all considered data sets. However, given data sets of higher complexity and higher resolution, it might be necessary to significantly increase the number of parameters, which in turn might lead to larger quantitative differences between different methods. Furthermore, different objective functions might become sensitive to the choice of the optimization method, the number of training steps, and possibly other optimization hyperparameters. These effects should be systematically studied in future work.

Metrics. It remains to be examined whether FID is stable under a more radical change of the encoding, e.g. using a network trained on a different task. Furthermore, it might be possible to "fool" FID by introducing artifacts specialized to the encoding network. From the classic machine learning point of view, a major drawback of FID is that it cannot detect overfitting to the training data set: an algorithm that outputs only the training examples would have an excellent score. As such, developing quantitative evaluation metrics is a critical research direction [3, 23].

Exploring the space of hyperparameters.
Ideally, hyperparameter values suggested by the authors should transfer across data sets. As such, exploring the hyperparameters "close" to the suggested ones is a natural and valid approach. However, Figure 4 and in particular Figure 14 show that optimization is necessary. In addition, such an approach has several drawbacks: (a) no recommended hyperparameters are available for a given data set, (b) the parameters differ for each data set, and (c) several popular models have been tuned by the community, which might imply an unfair comparison. Finally, instead of random search it might be beneficial to apply (carefully tuned) sequential Bayesian optimization, which is computationally beyond the scope of this study but nevertheless a great candidate for future work [16].

8 Conclusion

In this paper we have started a discussion on how to neutrally and fairly compare GANs. We focus on two sets of evaluation metrics: (i) the Fréchet Inception Distance, and (ii) precision, recall, and F1. We provide empirical evidence that FID is a reasonable metric due to its robustness with respect to mode dropping and encoding network choices. Our main insight is that to compare models it is meaningless to report the minimum FID achieved. Instead, we propose to compare distributions of the minimum achievable FID for a fixed computational budget. Indeed, the empirical evidence presented herein implies that algorithmic differences in state-of-the-art GANs become less relevant as the computational budget increases. Furthermore, given a limited budget (say, a month of compute time), a "good" algorithm might be outperformed by a "bad" algorithm.

As discussed in Section 4, many dimensions have to be taken into account for model comparison, and this work only explores a subset of the options. We cannot exclude the possibility that some models significantly outperform others under currently unexplored conditions.
Nevertheless, notwithstanding the limitations discussed in Section 7, this work strongly suggests that future GAN research should be more experimentally systematic and that model comparison should be performed on neutral ground.

Acknowledgments

We would like to acknowledge Tomas Angles for advocating convex polygons as a benchmark data set. We would like to thank Ian Goodfellow, Michaela Rosca, Ishaan Gulrajani, David Berthelot, and Xiaohua Zhai for useful discussions and remarks.

References

[1] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.

[2] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning (ICML), 2017.

[3] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do GANs learn the distribution? Some theory and empirics. In International Conference on Learning Representations (ICLR), 2018.

[4] Philip Bachman and Doina Precup. Variational generative stochastic networks with collaborative shaping. In International Conference on Machine Learning (ICML), 2015.

[5] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2016.

[7] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step.
In International Conference on Learning Representations (ICLR), 2018.

[8] Holly E Gerhard, Felix A Wichmann, and Matthias Bethge. How sensitive is the human visual system to the local statistics of natural images? PLoS Computational Biology, 9(1), 2013.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS), 2017.

[12] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.

[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[14] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[15] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.

[16] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.

[17] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them.
In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[18] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In International Conference on Computer Vision (ICCV), 2017.

[19] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

[20] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[21] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (ICML), 2017.

[22] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[23] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems (NIPS), 2018.

[24] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), 2016.

[25] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[26] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations (ICLR), 2017.

[27] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks.
In International Conference on Computer Vision (ICCV), 2017.