{"title": "GILBO: One Metric to Measure Them All", "book": "Advances in Neural Information Processing Systems", "page_first": 7037, "page_last": 7046, "abstract": "We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs each trained on four datasets (MNIST, FashionMNIST, CIFAR-10 and CelebA) and discuss the results.", "full_text": "GILBO: One Metric to Measure Them All

Alexander A. Alemi*, Ian Fischer*

Google AI

{alemi,iansf}@google.com

Abstract

We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs each trained on four datasets (MNIST, FashionMNIST, CIFAR-10 and CelebA) and discuss the results.

1 Introduction

GANs (Goodfellow et al., 2014) and VAEs (Kingma & Welling, 2014) are the most popular latent variable generative models because of their relative ease of training and high expressivity. However, quantitative comparisons across different algorithms and architectures remain a challenge. VAEs are generally measured using the ELBO, which measures their fit to data.
Many metrics have been proposed for GANs, including the INCEPTION score (Gao et al., 2017), the FID score (Heusel et al., 2017), independent Wasserstein critics (Danihelka et al., 2017), birthday paradox testing (Arora & Zhang, 2017), and using Annealed Importance Sampling to evaluate log-likelihoods (Wu et al., 2017), among others.

Instead of focusing on metrics tied to the data distribution, we believe a useful additional independent metric worth consideration is the complexity of the trained generative model. Such a metric would help answer questions related to overfitting and memorization, and may also correlate well with sample quality. To work with both GANs and VAEs our metric should not require a tractable joint density p(x, z). To address these desiderata, we propose the GILBO.

2 GILBO: Generative Information Lower BOund

A symmetric, non-negative, reparameterization-independent measure of the information shared between two random variables is given by the mutual information:

I(X; Z) = ∫∫ dx dz p(x, z) log [p(x, z) / (p(x) p(z))] = ∫ dz p(z) ∫ dx p(x|z) log [p(z|x) / p(z)] ≥ 0.   (1)

I(X; Z) measures how much information (in nats) is learned about one variable given the other. As such it is a measure of the complexity of the generative model. It can be interpreted (when converted to bits) as the reduction in the number of yes-no questions needed to guess X = x if you observe Z = z and know p(x), or vice-versa. It gives the log of the effective description length of the generative model. This is roughly the log of the number of distinct sample pairs (Tishby & Zaslavsky, 2015). I(X; Z) is well-defined even for continuous distributions.
This contrasts with the continuous entropy H(X) of the marginal distribution, which is not reparameterization independent (Marsh, 2013).

*Authors contributed equally.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

I(X; Z) is intractable due to the presence of p(x) = ∫ dz p(z) p(x|z), but we can derive a tractable variational lower bound (Agakov, 2006):

I(X; Z) = ∫∫ dx dz p(x, z) log [p(x, z) / (p(x) p(z))]   (2)
        = ∫∫ dx dz p(x, z) log [p(z|x) / p(z)]   (3)
        ≥ ∫ dz p(z) ∫ dx p(x|z) log [e(z|x) / p(z)]   (4)
        = E_{p(x,z)}[log e(z|x) / p(z)] ≡ GILBO ≤ I(X; Z)   (5)

where the inequality in (4) holds because the gap is E_{p(x)} KL[p(z|x) || e(z|x)] ≥ 0.

We call this bound the GILBO for Generative Information Lower BOund. It requires learning a tractable variational approximation to the intractable posterior p(z|x) = p(x, z)/p(x), termed e(z|x) since it acts as an encoder mapping from data to a prediction of its associated latent variables.2 As a variational approximation, e(z|x) depends on some parameters, θ, which we elide in the notation. The encoder e(z|x) performs a regression for the inverse of the GAN or VAE generative model, approximating the latents that gave rise to an observed sample. This encoder should be a tractable distribution, and must respect the domain of the latent variables, but does not need to be reparameterizable as no sampling from e(z|x) is needed during training. We suggest the use of (−1, 1) remapped Beta distributions in the case of uniform latents, and Gaussians in the case of Gaussian latents.
In either case, training the variational encoder consists simply of generating (x, z) pairs from the trained generative model and maximizing the log-likelihood the encoder assigns to the observed z conditioned on its paired x, minus the log-likelihood of the observed z under the generative model's prior, p(z). For the GANs in this study, the prior was a fixed uniform distribution, so the log p(z) term contributes a constant offset to the variational encoder's likelihood. Optimizing the GILBO for the parameters of the encoder gives a lower bound on the true generative mutual information in the GAN or VAE. Any failure to converge, or of the approximate encoder to match the true distribution, does not invalidate the bound; it simply makes the bound looser.

The GILBO contrasts with the representational mutual information of VAEs, defined by the data and encoder, which motivates VAE objectives (Alemi et al., 2017). For VAEs, both lower and upper variational bounds can be defined on the representational joint distribution (p(x)e(z|x)). These have demonstrated their utility for cross-model comparisons. However, they require a tractable posterior, preventing their use with most GANs. The GILBO provides a theoretically-justified and dataset-independent metric that allows direct comparison of VAEs and GANs.

The GILBO is entirely independent of the true data, being purely a function of the generative joint distribution. This makes it distinct from other proposed metrics like estimated marginal log-likelihoods (often reported for VAEs and very expensive to estimate for GANs) (Wu et al., 2017)3, an independent Wasserstein critic (Danihelka et al., 2017), or the common INCEPTION (Gao et al., 2017) and FID (Heusel et al., 2017) scores, which attempt to measure how well the generated samples match the observed true data samples.
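The training recipe described above (sample (x, z) pairs from the generative joint, then maximize E[log e(z|x) − log p(z)]) can be sketched on a toy linear-Gaussian model where the true mutual information is available in closed form. Everything here (the linear generator, the least-squares diagonal-Gaussian encoder, all dimensions and noise levels) is an illustrative assumption for this sketch, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, N, sigma = 4, 16, 50_000, 0.5
A = rng.normal(size=(D, d))            # toy linear generator: x = A z + noise

# sample from the generative joint: z ~ p(z) = N(0, I), x ~ p(x|z)
z = rng.normal(size=(N, d))
x = z @ A.T + sigma * rng.normal(size=(N, D))

# variational encoder e(z|x): diagonal Gaussian whose mean is a linear
# least-squares regression from x back to z, with per-dimension variance
W, *_ = np.linalg.lstsq(x, z, rcond=None)
mu = x @ W
s2 = ((z - mu) ** 2).mean(axis=0)

# GILBO = E_{p(x,z)}[log e(z|x) - log p(z)], estimated by Monte Carlo
log_e = -0.5 * (np.log(2 * np.pi * s2) + (z - mu) ** 2 / s2).sum(axis=1)
log_p = -0.5 * (np.log(2 * np.pi) + z ** 2).sum(axis=1)
gilbo = (log_e - log_p).mean()

# analytic mutual information for this linear-Gaussian model
true_mi = 0.5 * np.linalg.slogdet(np.eye(d) + A.T @ A / sigma**2)[1]
```

Because the encoder is a diagonal Gaussian while the true posterior is correlated, the estimate lands somewhat below the analytic I(X; Z): an imperfect encoder only loosens the bound, it never invalidates it.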
Being independent of data, the GILBO does not directly measure sample quality, but extreme values (either low or high) correlate with poor sample quality, as demonstrated in the experiments below.

Similarly, in Im et al. (2018), the authors propose using various GAN training objectives to quantitatively measure the performance of GANs on their own generated data. Interestingly, they find that evaluating GANs on the same metric they were trained on gives paradoxically weaker performance: an LS-GAN appears to perform worse than a Wasserstein GAN when evaluated with the least-squares metric, for example, even though the LS-GAN otherwise outperforms the WGAN. If this result holds in general, it would indicate that using the GILBO during training might result in less-interpretable evaluation GILBOs. We do not investigate this hypothesis here.

2Note that a new e(z|x) is trained for both GANs and VAEs. VAEs do not use their own e(z|x), which would also give a valid lower bound. In this work, we train a new e(z|x) for both to treat both model classes uniformly. We don't know if using a new e(z|x) or the original would tend to result in a tighter bound.

3Note that Wu et al. (2017) is complementary to our work, providing both upper and lower bounds on the log-likelihood. It is our opinion that their estimates should also become standard practice when measuring GANs and VAEs.

Figure 1: (a) Recreation of Figure 5 (left) from Lucic et al. (2017) showing the distribution of FID scores for each model on MNIST. Points are jittered to give a sense of density. (b) The distribution of GILBO scores. (c) FID vs GILBO.

Although the GILBO doesn't directly reference the dataset, the dataset provides useful signposts. First is at log C, the number of distinguishable classes in the data.
If the GILBO is lower than that, the model has almost certainly failed to learn a reasonable model of the data. Another is at log N, the number of training points. A GILBO near this value may indicate that the model has largely memorized the training set, or that the model's capacity happens to be constrained near the size of the training set. At the other end is the entropy of the data itself (H(X)), taken either from a rough estimate, or from the best achieved data log-likelihood of any known generative model on the data. Any reasonable generative model should have a GILBO no higher than this value.

Unlike other metrics, GILBO does not monotonically map to quality of the generated output. Both extremes indicate failures. A vanishing GILBO denotes a generative model with vanishing complexity, either due to independence of the latents and samples, or a collapse to a small number of possible outputs. A diverging GILBO suggests over-sensitivity to the latent variables.

In this work, we focus on variational approximations to the generative information. However, other means of estimating the GILBO are also valid. In Section 4.3 we explore a computationally-expensive method to find a very tight bound. Other possibilities exist as well, including the recently proposed Mutual Information Neural Estimation (Belghazi et al., 2018) and Contrastive Predictive Coding (Oord et al., 2018). We do not explore these possibilities here, but any valid estimator of the mutual information can be used for the same purpose.

3 Experiments

We computed the GILBO for each of the 700 GANs and 100 VAEs tested in Lucic et al. (2017) on the MNIST, FashionMNIST, CIFAR and CelebA datasets in their wide-range hyperparameter search. This allowed us to compare FID scores and GILBO scores for a large set of different GAN objectives on the same architecture.
For our encoder network, we duplicated the discriminator, but adjusted the final output to be a linear layer predicting the 64 × 2 = 128 parameters defining a (−1, 1) remapped Beta distribution (or Gaussian in the case of the VAE) over the latent space. We used a Beta since all of the GANs were trained with a (−1, 1) 64-dimensional uniform distribution. The parameters of the encoder were optimized for up to 500k steps with ADAM (Kingma & Ba, 2015) using a scheduled multiplicative learning rate decay. We used the same batch size (64) as in the original training. Training time for estimating GILBO is comparable to doing FID evaluations (a few minutes) on the small datasets (MNIST, FashionMNIST, CIFAR), or over 10 minutes for larger datasets and models (CelebA).

In Figure 1 we show the distributions of FID and GILBO scores for all 800 models as well as their scatter plot for MNIST. We can immediately see that each of the GAN objectives collapses to GILBO ∼ 0 for some hyperparameter settings, but none of the VAEs do. In Figure 2 we show generated samples from all of the models, split into relevant regions. A GILBO near zero signals a failure of the model to make any use of its latent space (Figure 2a).

Figure 2: Samples from all models sorted by increasing GILBO in raster order and broken up into representative ranges: (a) GILBO ≤ log C; (b) log C < GILBO ≤ log N; (c) log N < GILBO ≤ 2 log N; (d) 2 log N < GILBO ≤ 80.2 (∼ H(X)); (e) 80.2 (∼ H(X)) < GILBO; (f) legend. The background colors correspond to the model family (Figure 2f). Note that all of the VAE samples are in (d), indicating that the VAEs achieved a non-trivial amount of complexity.
Also note that most of the GANs in (d) have poor sample quality, further underscoring the apparent difficulty these GANs have maintaining high visual quality without indications of training set memorization.

The best performing models by FID all sit at a GILBO ∼ 11 nats. An MNIST model that simply memorized the training set and partitioned the latent space into 50,000 unique outputs would have a GILBO of log 50,000 = 10.8 nats, so the cluster around 11 nats is suspicious. Since mutual information is invariant to any invertible transformation, a model that partitioned the latent space into 50,000 bins, associated each with a training point, and then performed some random elastic transformation with a magnitude low enough to not turn one training point into another would still have a generative mutual information of 10.8 nats. Larger elastic transformations that could confuse one training point for another would only act to lower the generative information. Among a large set of hyperparameters and across 7 different GAN objectives, we notice a conspicuous increase in FID score as GILBO moves away from ∼ 11 nats to either side. This demonstrates the failure of these GANs to achieve a meaningful range of complexities while maintaining visual quality. Most striking is the distinct separation in GILBOs between GANs and VAEs. These GANs learn less complex joint densities than a vanilla VAE on MNIST at the same FID score.

Figures 3 to 5 show the same plots as in Figure 1 but for the FashionMNIST, CIFAR-10 and CelebA datasets respectively. The best performing models as measured by FID on FashionMNIST continue to have GILBOs near log N. However, on the more complex CIFAR-10 and CelebA datasets we see nontrivial variation in the complexities of the trained GANs with competitive FID. On these more complex datasets, the visual performance (e.g. 
Figure 8) of the models leaves much to be desired. We speculate that the models' inability to achieve high visual quality is due to insufficient model capacity for the dataset.

Figure 3: A recreation of Figure 1 for the FashionMNIST dataset. (a) FID. (b) GILBO. (c) GILBO vs FID.

Figure 4: A recreation of Figure 1 for the CIFAR dataset. (a) FID. (b) GILBO. (c) GILBO vs FID.

4 Discussion

4.1 Reproducibility

While the GILBO is a valid lower bound regardless of the accuracy of the learned encoder, its utility as a metric naturally requires it to be comparable across models. The first worry is whether it is reproducible in its values. To address this, in Figure 6 we show the result of 128 different training runs to independently compute the GILBO for three models on CelebA. In each case the error in the measurement was below 2% of the mean GILBO and much smaller in variation than the variations between models, suggesting comparisons between models are valid if we use the same encoder architecture (e(z|x)) for each.

Figure 5: A recreation of Figure 1 for the CelebA dataset. (a) FID. (b) GILBO. (c) GILBO vs FID.

Figure 6: Measure of the reproducibility of the GILBO for the three models visualized in Figure 8. For each model we independently measured the GILBO 128 times. (a) GILBO ∼ 41.1 ± 0.8 nats. (b) GILBO ∼ 69.5 ± 0.9 nats. (c) GILBO ∼ 104 ± 1 nats.

Figure 7: Simulation-based calibration (Talts et al., 2018) of the variational encoder for the same three models as in Figure 6 ((a) ∼ 41 nats, (b) ∼ 70 nats, (c) ∼ 104 nats). Shown are histograms of the ranking statistic for how many of 128 samples from the encoder are less than the true z used to generate the figure, aggregated over the 64-dimensional latent vector for 1270 batches of 64 samples each.
Shown in red is the 99% confidence interval for a uniform distribution, the expected result if e(z|x) was the true p(z|x). The systematic ∩-shape denotes overdispersion in the approximation.

4.2 Tightness

Another concern would be whether the learned variational encoder was a good match to the true posterior of the generative model (e(z|x) ∼ p(z|x)). Perhaps the model with a measured GILBO of 41 nats simply had a harder-to-capture p(z|x) than the GILBO ∼ 104 nat model. Even if the values were reproducible between runs, maybe there is a systematic bias in the approximation that differs between different models.

To test this, we used the Simulation-Based Calibration (SBC) technique of Talts et al. (2018). If one were to implement a cycle, wherein a single draw from the prior z′ ∼ p(z) is decoded into an image x′ ∼ p(x|z′) and then inverted back to its corresponding latents zi ∼ p(z|x′), the rank statistic Σi I[zi < z′] should be uniformly distributed. Replacing the true p(z|x′) with the approximate e(z|x) gives a visual test for the accuracy of the approximation. Figure 7 shows a histogram of the rank statistic for 128 draws from e(z|x) for each of 1270 batches of 64 elements each drawn from the 64-dimensional prior p(z) for the same three GANs as in Figure 6. The red line denotes the 99% confidence interval for the corresponding uniform distribution. All three GANs show a systematic ∩-shaped distribution denoting overdispersion in e(z|x) relative to the true p(z|x). This is to be expected from a variational approximation, but importantly the degree of mismatch seems to correlate with the scores, not anticorrelate. It is likely that the 41 nat GILBO is a more accurate lower bound than the 104 nat GILBO.
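The SBC cycle can be sketched on a toy one-dimensional model where the exact posterior p(z|x) is known in closed form, so the ranks come out uniform; substituting an overdispersed encoder would instead pile ranks near the middle, producing the ∩-shape. The Gaussian prior, toy generator, and all constants here are illustrative assumptions, not the paper's GANs:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 128           # posterior draws per cycle, as in the paper's Figure 7
noise_var = 0.01  # toy observation noise for the "generator"
ranks = []
for _ in range(2000):
    z_true = rng.normal()                            # z' ~ p(z) = N(0, 1)
    x = z_true + np.sqrt(noise_var) * rng.normal()   # x' ~ p(x|z') (toy generator)
    # exact Gaussian posterior p(z|x') for this linear-Gaussian toy model
    post_var = 1.0 / (1.0 + 1.0 / noise_var)
    post_mean = post_var * x / noise_var
    z_i = rng.normal(post_mean, np.sqrt(post_var), size=L)
    ranks.append(int((z_i < z_true).sum()))          # rank statistic sum_i I[z_i < z']
ranks = np.array(ranks)
# with the exact posterior, ranks are uniform on {0, ..., L}
```

An encoder with too-large variance pushes z_true toward the middle of its own samples, concentrating ranks around L/2, which is exactly the overdispersion signature reported for all three GANs.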
This further reinforces the utility of the GILBO for cross-model comparisons.

4.3 Precision of the GILBO

While comparisons between models seem well-motivated, the SBC results in Section 4.2 highlight some mismatch in the variational approximation. How well can we trust the absolute numbers computed by the GILBO? While they are guaranteed to be valid lower bounds, how tight are those bounds?

To answer these questions, note that the GILBO is a valid lower bound even if we learn separate per-instance variational encoders. Here we replicate the results of Lipton & Tripathi (2017) and attempt to learn the precise z that gave rise to an image by minimizing the L2 distance between the produced image and the target (|x − g(z)|²). We can then define a distribution centered on z and adjust the magnitude of the variance to get the best GILBO possible. In other words, by minimizing the L2 distance between an image x sampled from the generative model and some other x′ sampled from the same model, we can directly recover some z′ equivalent to the z that generated x. We can then do a simple optimization to find the variance that maximizes the GILBO, allowing us to compute a very tight GILBO in a very computationally-expensive manner.

Doing this procedure on the same three models as in Figures 6 and 7 gives (87, 111, 155) nats respectively for the (41, 70, 104) GILBO models, when trained for 150k steps to minimize the L2 distance. These approximations are also valid lower bounds, and demonstrate that our amortized GILBO calculations above might be off by as much as a factor of 2 from the true generative information, but they again highlight that the comparisons between different models appear to be real. Also note that these per-image bounds are finite.
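The per-image recovery step (minimizing |x − g(z)|² with respect to z) can be sketched with plain gradient descent. The paper runs this for 150k steps against real GAN generators; this sketch assumes a toy linear generator, for which convergence is easy to verify:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 8))       # toy deterministic generator g(z) = A z
z_true = rng.uniform(-1, 1, 8)     # latent that produced the target image
x_target = A @ z_true
z = np.zeros(8)                    # initial guess for the latent
lr = 0.005
for _ in range(2000):
    grad = 2.0 * A.T @ (A @ z - x_target)  # gradient of |x_target - g(z)|^2
    z -= lr * grad
# z now recovers z_true; a distribution centered on this z with a fitted
# variance then yields the (very tight) per-image bound described above
```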
We discuss the finiteness of the generative information in more detail in Section 4.6.

Naturally, learning a single parametric amortized variational encoder is much less computationally expensive than doing an independent optimization for each image, and still seems to allow for comparative measurements. However, we caution against directly comparing GILBO scores from different encoder architectures or optimization procedures. Fair comparison between models requires holding the encoder architecture and training procedure fixed.

4.4 Consistency

The GILBO offers a signal distinct from data-based metrics like FID. In Figure 8, we visually demonstrate the nature of the retained information for the same three models as above in Figures 6 and 7. All three checkpoints for CelebA have the same FID score of 49, making them each competitive amongst the GANs studied; however, they have GILBO values that span a range of 63 nats (91 bits), which indicates a massive difference in model complexity. In each figure, the left-most column shows a set of independent generated samples from the GAN. Each of these generated images is then sent through the variational encoder e(z|x), from which 15 independent samples of the corresponding z are drawn. These latent codes are then sent back through the GAN's generator to form the remaining 15 columns.

The images in Figure 8 show the type of information that is retained in the mapping from image to latent and back to image space. On the right in Figure 8c with a GILBO of 104 nats, practically all of the human-perceptible information is retained by doing this cycle. In contrast, on the left in Figure 8a with a GILBO of only 41 nats, there is a good degree of variation in the synthesized images, although they generally retain the overall gross attributes of the faces.
In the middle, at 70 nats, the variation in the synthesized images is small but noticeable, such as the sunglasses that appear and disappear 6 rows from the top.

4.5 Overfitting of the GILBO Encoder

Since the GILBO is trained on generated samples, the dataset is limited only by the number of unique samples the generative model can produce. Consequently, it should not be possible for the encoder, e(z|x), to overfit to the training data. Regardless, when we actually evaluate the GILBO, it is always on newly generated data.

Likewise, given that the GILBO is trained on the "true" generative model p(z)p(x|z), we do not expect regularization to be necessary. The encoders we trained are unregularized. However, we note that any regularization procedure on the encoder could be thought of as a modification of the variational family used in the variational approximation.

The same argument is true about architectural choices. We used a convolutional encoder, as we expect it to be a good match with the deconvolutional generative models under study, but the GILBO would still be valid if we used an MLP or any other architecture. The computed GILBO may be more or less tight depending on such choices, though: the architectural choices for the encoder are a form of

Figure 8: Visual demonstration of consistency. (a) GILBO ∼ 41 nats. (b) GILBO ∼ 70 nats. (c) GILBO ∼ 104 nats. The left-most column of each image shows a sampled image from the GAN.
The next 15 columns show images generated from 15 independent samples of the latent code suggested for the left-most image by the trained encoder used in estimating the GILBO. All three of these GANs had an FID of 49 on CelebA, but have qualitatively different behavior.

inductive bias and should be made in a problem-dependent manner just like any other architectural choice.

4.6 Finiteness of the Generative Information

The generative mutual information is only infinite if the generator network is not only deterministic, but also invertible. Deterministic many-to-one functions can have finite mutual information between their inputs and outputs. Take for instance the following: p(z) = U[−1, 1], the prior being uniform from −1 to 1, and a generator x = G(z) = sign(z), the sign function (which is C∞ almost everywhere), for which p(x|z) = δ(x − sign(z)), the conditional distribution of x given z, is the delta function concentrated on the sign of z:

p(x, z) = p(x|z) p(z) = (1/2) δ(x − sign(z)),    p(x) = ∫ dz p(x, z) = (1/2) δ(x − 1) + (1/2) δ(x + 1)   (6)

I(X; Z) = ∫ dx ∫_{−1}^{1} dz p(x, z) log [p(x, z) / (p(x) p(z))]   (7)
        = ∫ dx ∫_{−1}^{1} dz (1/2) δ(x − sign(z)) log [δ(x − sign(z)) / ((1/2) δ(x − 1) + (1/2) δ(x + 1))]   (8)
        = [(1/2) log 2]_{z<0} + [(1/2) log 2]_{z>0} = log 2 = 1 bit   (9)

In other words, the deterministic function x = sign(z) induces a mutual information of 1 bit between X and Z. This makes sense when interpreting the mutual information as the reduction in the number of yes-no questions needed to specify the value.
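The 1-bit result can also be checked numerically with a plug-in estimate of the mutual information after finely discretizing z; since x = sign(z) is deterministic, I(X; Z) = H(X) = 1 bit, and the estimate matches to within sampling error. The bin count and sample size below are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, 200_000)
x = (z > 0).astype(int)                             # x = sign(z), encoded as {0, 1}
edges = np.linspace(-1, 1, 65)[1:-1]                # 64 equal bins over [-1, 1]
zb = np.digitize(z, edges)

# plug-in joint distribution over (x, binned z)
joint = np.zeros((2, 64))
np.add.at(joint, (x, zb), 1)
joint /= joint.sum()
px = joint.sum(axis=1, keepdims=True)
pz = joint.sum(axis=0, keepdims=True)
nz = joint > 0
mi_bits = (joint[nz] * np.log2(joint[nz] / (px @ pz)[nz])).sum()
```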
It takes an infinite number of yes-no questions to precisely determine a real number in the range [−1, 1], but if we observe the sign of the value, it takes one fewer question (while still being infinite) to determine.

Even if we take Z to be a continuous real-valued random variable on the range [−1, 1], if we consider a function x = float(z) which casts that number to a float, for a 32-bit float on the range [−1, 1] the mutual information that results is I(X; Z) = 26 bits (we verified this numerically). In any chain Z → float(Z) → X, by the data processing inequality, the mutual information I(X; Z) is limited by I(Z; float(Z)) = 26 bits (per dimension). Given that we train neural networks with limited-precision arithmetic, this ensures that there is always some finite mutual information in the representations, since our random variables are actually discrete, albeit discretized on a very fine grid.

5 Conclusion

We've defined a new metric for evaluating generative models, the GILBO, and measured its value on over 3200 models. We've investigated and discussed strengths and potential limitations of the metric. We've observed that GILBO gives us different information than is currently available in sample-quality based metrics like FID, both signifying a qualitative difference in the performance of most GANs on MNIST versus richer datasets, as well as being able to distinguish between GANs with qualitatively different latent representations even if they have the same FID score.

On simple datasets, in an information-theoretic sense we cannot distinguish what GANs with the best FIDs are doing from models that are limited to making some local deformations of the training set. On more complicated datasets, GANs show a wider array of complexities in their trained generative models.
These complexities cannot be discerned by existing sample-quality based metrics, but would have important implications for any use of the trained generative models for auxiliary tasks, such as compression or representation learning.

A truly invertible continuous map from the latent space to the image space would have a divergent mutual information. Since GANs are implemented as feed-forward neural networks, the fact that we can measure finite and distinct values for the GILBO for different architectures suggests not only that they are fundamentally not perfectly invertible, but that the degree of invertibility is an interesting signal of the complexity of the learned generative model. Given that GANs are implemented as deterministic feed-forward maps, they naturally want to live at high generative mutual information.

Humans seem to extract only roughly a dozen bits (∼8 nats) from natural images into long-term memory (Landauer, 1986). This calls into question the utility of the usual qualitative visual comparisons of highly complex generative models. We might also be interested in trying to train models that learn much more compressed representations. VAEs can naturally target a wide range of mutual informations (Alemi et al., 2017). GANs are harder to steer. One approach to make GANs steerable is to modify the GAN objective and specifically designate a subset of the full latent space as the informative subspace, as in Chen et al. (2016), where the maximum complexity can be controlled by limiting the dimensionality of a discrete categorical latent. The remaining stochasticity in the latent can be used for novelty in the conditional generations. Alternatively, one could imagine adding the GILBO as an auxiliary objective to ordinary GAN training, though as a lower bound, it may not prove useful for helping to keep the generative information low.
Regardless, we believe it is important to consider the complexity in information-theoretic terms of the generative models we train, and the GILBO offers a relatively cheap comparative measure.

We believe using GILBO for further comparisons across architectures, datasets, and GAN and VAE variants will illuminate the strengths and weaknesses of each. The GILBO should be measured and reported when evaluating any latent variable model. To that end, our implementation is available at https://github.com/google/compare_gan.

Acknowledgements

We would like to thank Mario Lucic, Karol Kurach, and Marcin Michalski for the use of their 3200 previously-trained GANs and VAEs and their codebase (described in Lucic et al. (2017)), without which this paper would have had much weaker experiments, as well as for their help adding our GILBO code to their public repository. We would also like to thank our anonymous reviewers for substantial helpful feedback.

References

Felix Vsevolodovich Agakov. Variational Information Maximization in Stochastic Environments. PhD thesis, University of Edinburgh, 2006.

Alex Alemi, Ben Poole, Ian Fischer, Josh Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a Broken ELBO. ICML, 2017. URL https://arxiv.org/abs/1711.00464.

Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. CoRR, abs/1706.08224, 2017. URL http://arxiv.org/abs/1706.08224.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. Mutual Information Neural Estimation. In International Conference on Machine Learning, pp. 530–539, 2018.

Xi Chen, Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In NIPS, 2016. URL https://arxiv.org/pdf/1606.03657.pdf.

I. Danihelka, B. Lakshminarayanan, B. Uria, D.
Wierstra, and P. Dayan. Comparison of Maximum Likelihood and GAN-based training of Real NVPs. arXiv 1705.05263, 2017. URL https://arxiv.org/abs/1705.05263.

Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating Mutual Information for Discrete-Continuous Mixtures. In Neural Information Processing Systems, 2017. URL http://papers.nips.cc/paper/7180-estimating-mutual-information-for-discrete-continuous-mixtures.pdf.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Neural Information Processing Systems, 2014. URL https://arxiv.org/abs/1406.2661.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 1706.08500, 2017. URL https://arxiv.org/abs/1706.08500.

Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating GANs with divergences proposed for training. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1803.01045.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412.6980.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6114.

Thomas K. Landauer. How much Do People Remember? Some Estimates of the Quantity of Learned Information in Long-term Memory. Cognitive Science, 10(4):477–493, 1986. doi: 10.1207/s15516709cog1004_4. URL https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1004_4.

Zachary C Lipton and Subarna Tripathi. Precise Recovery of Latent Vectors From Generative Adversarial Networks, 2017. URL https://arxiv.org/abs/1702.04782.

M.
Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs Created Equal? A Large-Scale Study. arXiv 1711.10337, 2017. URL https://arxiv.org/abs/1711.10337.

Charles Marsh. Introduction to Continuous Entropy, 2013. URL http://www.crmarsh.com/static/pdf/Charles_Marsh_Continuous_Entropy.pdf.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

S. Talts, M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv 1804.06788, April 2018. URL https://arxiv.org/abs/1804.06788.

N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. arXiv 1503.02406, 2015. URL https://arxiv.org/abs/1503.02406.

Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations, 2017. URL https://arxiv.org/abs/1611.04273.
", "award": [], "sourceid": 3497, "authors": [{"given_name": "Alexander", "family_name": "Alemi", "institution": "Google"}, {"given_name": "Ian", "family_name": "Fischer", "institution": "Google"}]}