{"title": "The continuous Bernoulli: fixing a pervasive error in variational autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 13287, "page_last": 13297, "abstract": "Variational autoencoders (VAE) have quickly become a central tool in machine learning, applicable to a broad range of data types and latent variable models. By far the most common first step, taken by seminal papers and by core software libraries alike, is to model MNIST data using a deep network parameterizing a Bernoulli likelihood. This practice contains what appears to be and what is often set aside as a minor inconvenience: the pixel data is [0,1] valued, not {0,1} as supported by the Bernoulli likelihood. Here we show that, far from being a triviality or nuisance that is convenient to ignore, this error has profound importance to VAE, both qualitative and quantitative. We introduce and fully characterize a new [0,1]-supported, single parameter distribution: the continuous Bernoulli, which patches this pervasive bug in VAE. This distribution is not nitpicking; it produces meaningful performance improvements across a range of metrics and datasets, including sharper image samples, and suggests a broader class of performant VAE.", "full_text": "The continuous Bernoulli: \ufb01xing a pervasive error in\n\nvariational autoencoders\n\nGabriel Loaiza-Ganem\nDepartment of Statistics\n\nColumbia University\n\ngl2480@columbia.edu\n\nJohn P. Cunningham\nDepartment of Statistics\n\nColumbia University\n\njpc2181@columbia.edu\n\nAbstract\n\nVariational autoencoders (VAE) have quickly become a central tool in machine\nlearning, applicable to a broad range of data types and latent variable models. By far\nthe most common \ufb01rst step, taken by seminal papers and by core software libraries\nalike, is to model MNIST data using a deep network parameterizing a Bernoulli\nlikelihood. 
This practice contains what appears to be and what is often set aside as a\nminor inconvenience: the pixel data is [0, 1] valued, not {0, 1} as supported by the\nBernoulli likelihood. Here we show that, far from being a triviality or nuisance that\nis convenient to ignore, this error has profound importance to VAE, both qualitative\nand quantitative. We introduce and fully characterize a new [0, 1]-supported, single\nparameter distribution: the continuous Bernoulli, which patches this pervasive bug\nin VAE. This distribution is not nitpicking; it produces meaningful performance\nimprovements across a range of metrics and datasets, including sharper image\nsamples, and suggests a broader class of performant VAE.1\n\n1\n\nIntroduction\n\nVariational autoencoders (VAE) have become a central tool for probabilistic modeling of complex,\nhigh dimensional data, and have been applied across image generation [10], text generation [14],\nneuroscience [8], chemistry [9], and more. One critical choice in the design of any VAE is the choice\nof likelihood (decoder) distribution, which stipulates the stochastic relationship between latents and\nobservables. Consider then using a VAE to model the MNIST dataset, by far the most common \ufb01rst\nstep for introducing and implementing VAE. An apparently innocuous practice is to use a Bernoulli\nlikelihood to model this [0, 1]-valued data (grayscale pixel values), in disagreement with the {0, 1}\nsupport of the Bernoulli distribution. 
Though doing so will not throw an obvious type error, the\nimplied object is no longer a coherent probabilistic model, due to a neglected normalizing constant.\nThis practice is extremely pervasive in the VAE literature, including the seminal work of Kingma\nand Welling [20] (who, while aware of it, set it aside as an inconvenience), highly-cited follow up\nwork (for example [25, 37, 17, 6] to name but a few), VAE tutorials [7, 1], including those in hugely\npopular deep learning frameworks such as PyTorch [32] and Keras [3], and more.\nHere we introduce and fully characterize the continuous Bernoulli distribution (\u00a73), both as a means\nto study the impact of this widespread modeling error, and to provide a proper VAE for [0, 1]-valued\ndata. Before these details, let us ask the central question: who cares?\nFirst, theoretically, ignoring normalizing constants is unthinkable throughout most of probabilistic\nmachine learning: these objects serve a central role in restricted Boltzmann machines [36, 13],\ngraphical models [23, 33, 31, 38], maximum entropy modeling [16, 29, 26], the \u201cOccam\u2019s razor\u201d\nnature of Bayesian models [27], and much more.\n\n1Our code is available at https://github.com/cunningham-lab/cb.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fSecond, one might suppose this error can be interpreted or \ufb01xed via data augmentation, binarizing\ndata (which is also a common practice), stipulating a different lower bound, or as a nonprobabilistic\nmodel with a \u201cnegative binary cross-entropy\u201d objective. \u00a74 explores these possibilities and \ufb01nds them\nwanting. Also, one might be tempted to call the Bernoulli VAE a toy model or a minor point. 
Let us\navoid that trap: MNIST is likely the single most widely used dataset in machine learning, and VAE is\nquickly becoming one of our most popular probabilistic models.\nThird, and most importantly, empiricism; \u00a75 shows three key results: (i) as a result of this error,\nwe show that the Bernoulli VAE signi\ufb01cantly underperforms the continuous Bernoulli VAE across\na range of evaluation metrics, models, and datasets; (ii) a further unexpected \ufb01nding is that this\nperformance loss is signi\ufb01cant even when the data is close to binary, a result that becomes clear by\nconsideration of continuous Bernoulli limits; and (iii) we further compare the continuous Bernoulli\nto beta likelihood and Gaussian likelihood VAE, again \ufb01nding the continuous Bernoulli performant.\nAll together this work suggests that careful treatment of data type \u2013 neither ignoring normalizing\nconstants nor defaulting immediately to a Gaussian likelihood \u2013 can produce optimal results when\nmodeling some of the most core datasets in machine learning.\n\n2 Variational autoencoders\n\nAutoencoding variational Bayes [20] is a technique to perform inference in the model:\n\nZn \u223c p0(z)\n\nand Xn|Zn \u223c p\u03b8(x|zn) , for n = 1, . . . , N,\n\n(1)\nwhere each Zn \u2208 RM is a local hidden variable, and \u03b8 are parameters for the likelihood of observables\nXn. The prior p0(z) is conventionally a Gaussian N (0, IM ). When the data is binary, i.e. xn \u2208\n{0, 1}D, the conditional likelihood p\u03b8(xn|zn) is chosen to be B(\u03bb\u03b8(zn)), where \u03bb\u03b8 : RM \u2192 RD is a\nneural network with parameters \u03b8. B(\u03bb) denotes the product of D independent Bernoulli distributions,\nwith parameters \u03bb \u2208 [0, 1]D (in the standard way we overload B(\u00b7) to be the univariate Bernoulli or the\nproduct of independent Bernoullis). 
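As a concrete illustration of the generative model just described — a standard normal prior pushed through a decoder network into a product of independent Bernoullis — the following is a minimal, hypothetical NumPy sketch; the single-layer sigmoid decoder stands in for λ_θ and is not the architecture used in the paper:

```python
import numpy as np

def sample_bernoulli_vae(theta, n, m, rng):
    """Ancestral sampling from equation 1: z_n ~ N(0, I_M),
    x_n | z_n ~ B(lambda_theta(z_n)). lambda_theta is a hypothetical
    single-layer decoder (weight matrix theta, shape (M, D)) with a sigmoid."""
    z = rng.standard_normal((n, m))               # latents, shape (N, M)
    lam = 1.0 / (1.0 + np.exp(-z @ theta))        # decoder outputs in (0, 1)
    x = (rng.uniform(size=lam.shape) < lam).astype(float)  # Bernoulli draws
    return x, z
```

Note that the sampled x is {0, 1}-valued by construction; the pervasive error discussed next arises when this likelihood is evaluated on [0, 1]-valued data instead.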
Since direct maximum likelihood estimation of θ is intractable, variational autoencoders use VBEM [18], first positing a now-standard variational posterior family:

q_φ((z_1, ..., z_N)|(x_1, ..., x_N)) = ∏_{n=1}^N q_φ(z_n|x_n), with q_φ(z_n|x_n) = N(m_φ(x_n), diag(s²_φ(x_n)))   (2)

where m_φ : R^D → R^M, s_φ : R^D → R^M_+ are neural networks parameterized by φ. Then, using stochastic gradient descent and the reparameterization trick, the evidence lower bound (ELBO) E(θ, φ) is maximized over both generative and posterior (decoder and encoder) parameters (θ, φ):

E(θ, φ) = ∑_{n=1}^N E_{q_φ(z_n|x_n)}[log p_θ(x_n|z_n)] − KL(q_φ(z_n|x_n) || p_0(z_n)) ≤ log p_θ((x_1, . . . , x_N)).   (3)

2.1 The pervasive error in Bernoulli VAE

In the Bernoulli case, the reconstruction term in the ELBO is:

E_{q_φ(z_n|x_n)}[log p_θ(x_n|z_n)] = E_{q_φ(z_n|x_n)}[ ∑_{d=1}^D x_{n,d} log λ_{θ,d}(z_n) + (1 − x_{n,d}) log(1 − λ_{θ,d}(z_n)) ]   (4)

where x_{n,d} and λ_{θ,d}(z_n) are the d-th coordinates of x_n and λ_θ(z_n), respectively. However, critically, Bernoulli likelihoods and the reconstruction term of equation 4 are commonly used for [0, 1]-valued data, which loses the interpretation of probabilistic inference. To clarify, hereafter we denote the Bernoulli distribution as p̃(x|λ) = λ^x (1 − λ)^(1−x) to emphasize the fact that it is an unnormalized distribution (when evaluated over [0, 1]). We will also make this explicit in the ELBO, writing E(p̃, θ, φ) to denote that the reconstruction term of equation 4 is being used.
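The reconstruction term of equation 4 is, up to sign, the familiar binary cross-entropy. A minimal NumPy sketch for a single (x_n, z_n) pair (the clipping constant `eps` is our own numerical-safety addition, not part of the paper):

```python
import numpy as np

def bernoulli_log_lik(x, lam, eps=1e-7):
    """Reconstruction term of equation 4 for one datapoint:
    sum_d x_d log(lam_d) + (1 - x_d) log(1 - lam_d).
    For [0,1]-valued x this is the *unnormalized* log p~(x | lam)."""
    lam = np.clip(lam, eps, 1.0 - eps)  # guard the logarithms
    return np.sum(x * np.log(lam) + (1.0 - x) * np.log(1.0 - lam))
```

Evaluating this on non-binary x is exactly the pervasive error: the value computed is log p̃, which is missing the normalizing constant introduced in §3.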
When analyzing [0, 1]-valued data such as MNIST, the Bernoulli VAE has optimal parameter values θ*(p̃) and φ*(p̃); that is:

(θ*(p̃), φ*(p̃)) = argmax_{(θ,φ)} E(p̃, θ, φ).   (5)

3 CB: the continuous Bernoulli distribution

In order to analyze the implications of this modeling error, we introduce the continuous Bernoulli, a novel distribution on [0, 1], which is parameterized by λ ∈ (0, 1) and defined by:

X ∼ CB(λ) ⟺ p(x|λ) ∝ p̃(x|λ) = λ^x (1 − λ)^(1−x).   (6)

We fully characterize this distribution, deferring proofs and secondary properties to appendices.

Proposition 1 (CB density and mean): The continuous Bernoulli distribution is well defined, that is, 0 < ∫₀¹ p̃(x|λ)dx < ∞ for every λ ∈ (0, 1). Furthermore, if X ∼ CB(λ), then the density function of X and its expected value are:

p(x|λ) = C(λ) λ^x (1 − λ)^(1−x), where C(λ) = 2 tanh⁻¹(1 − 2λ)/(1 − 2λ) if λ ≠ 0.5, and C(λ) = 2 otherwise;   (7)

µ(λ) := E[X] = λ/(2λ − 1) + 1/(2 tanh⁻¹(1 − 2λ)) if λ ≠ 0.5, and µ(λ) = 0.5 otherwise.   (8)

Figure 1 shows log C(λ), p(x|λ), and µ(λ).
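To make equations 7 and 8 concrete, here is a minimal NumPy sketch of the log normalizing constant, density, mean, and an inverse-CDF sampler. The Taylor fallback near λ = 0.5 and the closed-form inverse CDF are our own derivations from equation 7, offered as one reasonable implementation rather than the authors' code:

```python
import numpy as np

def cb_log_norm(lam):
    """log C(lam) of equation 7, with a Taylor expansion
    log 2 + (4/3)(lam - 1/2)^2 near lam = 0.5, where the
    closed form is numerically unstable (0/0)."""
    lam = np.asarray(lam, dtype=float)
    near = np.abs(lam - 0.5) < 1e-2
    safe = np.where(near, 0.4, lam)  # dummy value, masked out below
    exact = np.log(2.0 * np.arctanh(1.0 - 2.0 * safe) / (1.0 - 2.0 * safe))
    return np.where(near, np.log(2.0) + (4.0 / 3.0) * (lam - 0.5) ** 2, exact)

def cb_log_pdf(x, lam):
    """log p(x | lam) = log C(lam) + x log lam + (1 - x) log(1 - lam)."""
    return cb_log_norm(lam) + x * np.log(lam) + (1 - x) * np.log(1 - lam)

def cb_mean(lam):
    """mu(lam) of equation 8."""
    lam = np.asarray(lam, dtype=float)
    near = np.abs(lam - 0.5) < 1e-4
    safe = np.where(near, 0.4, lam)
    exact = safe / (2 * safe - 1) + 1 / (2 * np.arctanh(1 - 2 * safe))
    return np.where(near, 0.5, exact)

def cb_sample(lam, size, rng):
    """Reparameterized draws x = F^{-1}(u), u ~ Uniform(0, 1), using the
    closed-form inverse CDF; CB(0.5) is the uniform distribution."""
    lam = np.asarray(lam, dtype=float)
    u = rng.uniform(size=size)
    near = np.abs(lam - 0.5) < 1e-3
    safe = np.where(near, 0.4, lam)
    x = np.log1p(u * (2 * safe - 1) / (1 - safe)) / np.log(safe / (1 - safe))
    return np.where(near, u, x)
```

As a sanity check, cb_log_pdf integrates to one over [0, 1], and the empirical mean of cb_sample draws matches cb_mean. Note that adding cb_log_norm(λ) per pixel to the usual binary cross-entropy is exactly the correction the continuous Bernoulli VAE of §4 makes.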
Some notes warrant mention: (i) unlike the Bernoulli, the mean of the continuous Bernoulli is not λ; (ii) however, like for the Bernoulli, µ(λ) is increasing in λ and goes to 0 or 1 as λ goes to 0 or 1; (iii) the continuous Bernoulli is not a beta distribution (the main difference between these two distributions is how they concentrate mass around the extrema, see appendix 1 for details), nor any other [0, 1]-supported distribution we are aware of (including continuous relaxations such as the Gumbel-Softmax [28, 15], see appendix 1 for details); (iv) C(λ) and µ(λ) are continuous functions of λ; and (v) the log normalizing constant log C(λ) is well characterized but numerically unstable close to λ = 0.5, so our implementation uses a Taylor approximation near that critical point to calculate log C(λ) to high numerical precision. Proof: See appendix 3.

Figure 1: continuous Bernoulli log normalizing constant (left), pdf (middle) and mean (right).

Proposition 2 (CB additional properties): The continuous Bernoulli forms an exponential family, and has closed-form variance, CDF, inverse CDF (which importantly enables easy sampling and the use of the reparameterization trick), characteristic function (and thus moment generating function too), and entropy. The KL divergence between two continuous Bernoulli distributions also has closed form, and C(λ) is convex. Finally, the continuous Bernoulli admits a conjugate prior, which we call the C-Beta distribution. See appendix 2 for details. Proof: See appendix 3.

Proposition 3 (CB Bernoulli limit): CB(λ) → δ(0) as λ → 0 and CB(λ) → δ(1) as λ → 1 in distribution; that is, the continuous Bernoulli becomes a Bernoulli in the limit.
Proof: See appendix 3.

This proposition might at first glance suggest that, as long as the estimated parameters are close to 0 or 1 (which should happen when the data is close to binary), then the practice of erroneously applying a Bernoulli VAE should be of little consequence. However, the above reasoning is wrong, as it ignores the shape of log C(λ): as λ → 0 or λ → 1, log C(λ) → ∞ (Figure 1, left). Thus, if data is close to binary, the term neglected by the Bernoulli VAE becomes even more important, the exact opposite of the naive conclusion presented above.

Proposition 4 (CB normalizing constant bound): C(λ) ≥ 2, with equality if and only if λ = 0.5. And thus it follows that, for any x, λ, we have log p(x|λ) > log p̃(x|λ). Proof: See appendix 3.

This proposition allows us to interpret E(p̃, θ, φ) as a lower lower bound of the log likelihood (§4).

Proposition 5 (CB maximum likelihood): For an observed sample x_1, . . . , x_N ∼iid CB(λ), the maximum likelihood estimator λ̂ of λ is such that µ(λ̂) = (1/N) ∑_n x_n. Proof: See appendix 3.

Beyond characterizing a novel and interesting distribution, these propositions now allow full analysis of the error in applying a Bernoulli VAE to the wrong data type.

4 The continuous Bernoulli VAE, and why the normalizing constant matters

We define the continuous Bernoulli VAE analogously to the Bernoulli VAE:

Z_n ∼ N(0, I_M) and X_n|Z_n ∼ CB(λ_θ(z_n)), for n = 1, . . . , N   (9)

where again λ_θ : R^M → R^D is a neural network with parameters θ, and CB(λ) now denotes the product of D independent continuous Bernoulli distributions. Operationally, this modification results only in a change to the optimized objective; for clarity we compare the ELBO of the continuous Bernoulli VAE (top), E(p, θ, φ), to the Bernoulli VAE (bottom):

E(p, θ, φ) = ∑_{n=1}^N −KL(q_φ||p_0) + E_{q_φ}[ ∑_{d=1}^D x_{n,d} log λ_{θ,d}(z_n) + (1 − x_{n,d}) log(1 − λ_{θ,d}(z_n)) + log C(λ_{θ,d}(z_n)) ]

E(p̃, θ, φ) = ∑_{n=1}^N −KL(q_φ||p_0) + E_{q_φ}[ ∑_{d=1}^D x_{n,d} log λ_{θ,d}(z_n) + (1 − x_{n,d}) log(1 − λ_{θ,d}(z_n)) ].

Analogously, we denote θ*(p) and φ*(p) as the maximizers of the continuous Bernoulli ELBO:

(θ*(p), φ*(p)) = argmax_{(θ,φ)} E(p, θ, φ).   (10)

Immediately, a number of potential interpretations for the Bernoulli VAE come to mind, some of which have appeared in literature. We analyze each in turn.

4.1 Changing the data, model or training objective

One obvious workaround is to simply binarize any [0, 1]-valued data (MNIST pixel values or otherwise), so that it accords with the Bernoulli likelihood [24], a practice that is commonly followed (e.g. [34, 2, 28, 15, 12]). First, modifying data to fit a model, particularly an unsupervised model, is fundamentally problematic. Second, many [0, 1]-valued datasets are heavily degraded by binarization (see appendices for CIFAR-10 samples), indicating major practical limitations. Another workaround is to change the likelihood of the model to a proper [0, 1]-supported distribution, such as a beta or a truncated Gaussian.
In \u00a75 we include comparisons against a VAE with a beta distribution likelihood\n(we also made comparisons against a truncated Gaussian but found this to severely underperform all\nthe alternatives). Gulrajani et al. [11] use a discrete distribution over the 256 possible pixel values.\nKnoblauch et al. [22] study changing the reconstruction and/or the KL term in the ELBO. While\ntheir main focus is to obtain more robust inference, they provide a framework in which the Bernoulli\nVAE corresponds simply to a different (nonprobabilistic) loss. In this perspective, empirical results\nmust determine the adequacy of this objective; \u00a75 shows the Bernoulli VAE to underperform its\nproper probabilistic counterpart across a range of settings. Finally, note that none of these alternatives\nprovide a way to understand the effect of using Bernoulli likelihoods on [0, 1]-valued data.\n\n4.2 Data augmentation\n\nSince the expectation of a Bernoulli random variable is precisely its parameter, the Bernoulli VAE\nmight (erroneously) be assumed to be equivalent to a continuous Bernoulli VAE on an in\ufb01nitely\naugmented dataset, obtained by sampling binary data whose mean is given by the observed data;\nindeed this idea is suggested by Kingma and Welling [20]2. 
This interpretation does not hold³; it would result in a reconstruction term as in the first line of the equation below, while a correct Bernoulli VAE on the augmented dataset would have a reconstruction term given by the second line (not equal, as the order of expectation cannot be switched, since q_φ depends on X_n in the second line):

E_{z_n ∼ q_φ(z_n|x_n)}[ E_{X_n ∼ B(x_n)}[ ∑_{d=1}^D X_{n,d} log λ_{θ,d}(z_n) + (1 − X_{n,d}) log(1 − λ_{θ,d}(z_n)) ] ]

≠ E_{X_n ∼ B(x_n)}[ E_{z_n ∼ q_φ(z_n|X_n)}[ ∑_{d=1}^D X_{n,d} log λ_{θ,d}(z_n) + (1 − X_{n,d}) log(1 − λ_{θ,d}(z_n)) ] ].   (11)

² see the comments in https://openreview.net/forum?id=33X9fd2-9FyZd
³ see http://ruishu.io/2018/03/19/bernoulli-vae/ for a looser lower bound interpretation

4.3 Bernoulli VAE as a lower lower bound

Because the continuous Bernoulli ELBO and the Bernoulli ELBO are related by:

E(p̃, θ, φ) + ∑_{n=1}^N ∑_{d=1}^D E_{z_n ∼ q_φ(z_n|x_n)}[log C(λ_{θ,d}(z_n))] = E(p, θ, φ)   (12)

and recalling Proposition 4, since log 2 > 0, we get that E(p̃, θ, φ) < E(p, θ, φ). That is, using the Bernoulli VAE results in optimizing an even-lower bound of the log likelihood than using the continuous Bernoulli ELBO. Note however that, unlike the ELBO, E(p̃, θ, φ) is not tight even if the approximate posterior matches the true posterior.

4.4 Mean parameterization

The conventional maximum likelihood estimator for a Bernoulli, namely λ̂_B = (1/N) ∑_n x_n, maximizes p̃(x_1, ..., x_N|λ) regardless of whether the data is {0, 1}- or [0, 1]-valued. As a thought experiment, consider x_1, . . . , x_N ∼iid CB(λ).
Proposition 5 tells us that the correct maximum likelihood estimator, λ̂_CB, is such that µ(λ̂_CB) = (1/N) ∑_n x_n, where µ is the CB mean of equation 8. Thus, while using λ̂_B is incorrect, one can (surprisingly) still recover the correct maximum likelihood estimator via λ̂_CB = µ⁻¹(λ̂_B). One might then (wrongly) think that training a Bernoulli VAE, and then subsequently transforming the decoder parameters with µ⁻¹, would be equivalent to training a continuous Bernoulli VAE; that is, λ_θ*(p) might be equal to µ⁻¹(λ_θ*(p̃)). This reasoning is incorrect: the KL term in the ELBO implies that λ_θ*(p)(z_n) ≠ µ⁻¹(x_n), and so too λ_θ*(p̃)(z_n) ≠ x_n, and as such λ_θ*(p) ≠ µ⁻¹(λ_θ*(p̃)). In fact, our experiments will show that despite this flawed reasoning, applying this transformation can recover some, but not all, of the performance loss from using a Bernoulli VAE.

5 Experiments

We have introduced the continuous Bernoulli distribution to give a proper probabilistic VAE for [0, 1]-valued data. The essential question that we now address is how much, if any, improvement we achieve by making this choice.

5.1 MNIST

One frequently noted shortcoming of VAE (and Bernoulli VAE on MNIST in particular) is that samples from this model are blurry. As noted, the convexity of log C(λ) can be seen as regularizing sample values from the VAE to be more extremal; that is, sharper.
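The correction µ⁻¹ discussed above has no closed form, but since µ is continuous and increasing on (0, 1) it can be inverted numerically; the following sketch uses simple bisection, which is one possible root-finder, not necessarily the one used in the paper's code:

```python
import numpy as np

def cb_mu(lam):
    """CB mean mu(lam) of equation 8 (scalar version)."""
    if abs(lam - 0.5) < 1e-4:
        return 0.5
    return lam / (2 * lam - 1) + 1 / (2 * np.arctanh(1 - 2 * lam))

def cb_mu_inv(m, tol=1e-10):
    """mu^{-1}(m) by bisection: mu is continuous and increasing,
    so the parameter correction lam = mu^{-1}(mean) is well defined
    for any mean in (0, 1)."""
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cb_mu(mid) < m:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Applying cb_mu_inv elementwise to the decoder outputs of a trained Bernoulli VAE implements the post hoc µ⁻¹ correction evaluated in §5.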
As such we first compared samples from a trained continuous Bernoulli VAE against samples from the MNIST dataset itself, from a trained Bernoulli VAE, and from a trained Gaussian VAE, that is, the usual VAE model with a decoder likelihood p_θ(x|z) = N(η_θ(z), σ²_θ(z)), where we use η to avoid confusion with µ of equation 8. These samples are shown in Figure 2. In each case, as is standard, we show the parameter output by the generative/decoder network for a given latent draw: λ_θ*(p)(z) for the CB VAE, λ_θ*(p̃)(z) for the B VAE, and η_θ*(z) for the N VAE. Qualitatively, the continuous Bernoulli VAE achieves considerably superior samples vs the Bernoulli or Gaussian VAE, owing to the effect of log C(λ) pushing the decoder toward sharper images. Further samples are in the appendix. Dai and Wipf [4] consider why Gaussian VAE produce blurry images; we point out that our work (considering the likelihood) is orthogonal to theirs (considering the data manifold).

Figure 2: Samples from MNIST (data), continuous Bernoulli VAE (CB VAE), Bernoulli VAE (B VAE), and Gaussian VAE (N VAE).

5.2 Warped MNIST datasets

The most common justification for the Bernoulli VAE is that MNIST pixel values are 'close' to binary. An important study is thus to ask how the performance of the continuous Bernoulli VAE vs the Bernoulli VAE changes as a function of this 'closeness.' We formalize this concept by introducing a warping function f_γ(x) that, depending on the warping parameter γ, transforms individual pixel values to produce a dataset that is anywhere from fully binarized (every pixel becomes {0, 1}) to fully degraded (every pixel becomes 0.5).
Figure 3 shows f_γ for different values of γ, and the (rather uninformative) warping equation is given below. Importantly, γ = 0 corresponds to an unwarped dataset, i.e., the original MNIST dataset. Further, note that negative values of γ warp pixel values towards being more binarized, completely binarizing them in the case where γ = −0.5, whereas positive values of γ push the pixel values towards 0.5, recovering constant data at γ = 0.5. We then train our competing VAE models on the datasets induced by different values of γ and compare the difference in performance as γ changes. Note importantly that, because γ induces different datasets, performance values should primarily be compared between VAE models at each γ value; the trend as a function of γ is not of particular interest.

f_γ(x) = 1(x ≥ 0.5), if γ = −0.5; f_γ(x) = min(1, max(0, (x + γ)/(1 + 2γ))), if γ ∈ (−0.5, 0); f_γ(x) = γ + (1 − 2γ)x, if γ ∈ [0, 0.5].

Figure 3: f_γ for different γ values.

Figure 4 shows the results of various models applied to these different datasets (all values are an average of 10 separate training runs). The same neural network architectures are used across this figure, with architectural choices that are quite standard (detailed in appendix 4, along with training hyperparameters). The left panel shows ELBO values. In dark blue is the continuous Bernoulli VAE ELBO, namely E(p, θ*(p), φ*(p)). In light blue is the same ELBO when evaluated on a trained Bernoulli VAE: E(p, θ*(p̃), φ*(p̃)). Most importantly, note the γ = 0 values; the continuous Bernoulli VAE achieves a 300 nat improvement over the Bernoulli VAE.
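The warping function can be sketched as follows (a NumPy version of the piecewise definition: γ = −0.5 thresholds, γ ∈ (−0.5, 0) stretches and clips toward the extrema, γ ∈ [0, 0.5] shrinks toward 0.5):

```python
import numpy as np

def warp(x, gamma):
    """Pixel-warping function f_gamma: gamma = -0.5 binarizes,
    gamma = 0 is the identity, gamma = 0.5 collapses all pixels to 0.5."""
    x = np.asarray(x, dtype=float)
    if gamma == -0.5:
        return (x >= 0.5).astype(float)           # hard binarization
    if gamma < 0:
        return np.clip((x + gamma) / (1 + 2 * gamma), 0.0, 1.0)
    return gamma + (1 - 2 * gamma) * x            # shrink toward 0.5
```

For example, warp(images, 0.0) leaves MNIST unchanged, while warp(images, -0.25) pushes pixels toward {0, 1}.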
This finding supports the previous qualitative finding and the theoretical motivation for this work: significant quantitative gains are achieved via the continuous Bernoulli model. This finding remains true across a range of γ (dark blue above light blue in Figure 4), indicating that regardless of how 'close' to binary a dataset is, the continuous Bernoulli is a superior VAE model.

One might then wonder if the continuous Bernoulli is outperforming simply because the Bernoulli needs a mean correction. We thus apply µ⁻¹, namely the map from Bernoulli to continuous Bernoulli maximum likelihood estimators (equation 8 and §4.4), and evaluate the same ELBO using µ⁻¹(λ_θ*(p̃)) as the decoder, shown in light red (Figure 4, left). This result, which is only achieved via the introduction of the continuous Bernoulli, shows two important findings: first, that indeed some improvement over the Bernoulli VAE is achieved by post hoc correction to a continuous Bernoulli parameterization; but second, that this transformation is still inadequate to achieve the full performance of the continuous Bernoulli VAE.

Figure 4: Continuous Bernoulli comparisons against Bernoulli VAE. See text for details.

We also note that we ran the same experiment with log likelihood instead of ELBO (using importance weighted estimates with k = 100 samples; see Burda et al. [2]), and the same results held (up to small numerical differences; these traces are omitted for figure clarity).
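The importance-weighted log-likelihood estimate mentioned above can be sketched generically (a log-sum-exp implementation of the estimator of Burda et al. [2]; the caller supplies log p_θ(x, z_i) and log q_φ(z_i|x) for k posterior samples, and the function names are ours):

```python
import numpy as np

def iwae_estimate(log_joint, log_q):
    """Importance-weighted log-likelihood estimate:
    log (1/k) sum_i exp(log p(x, z_i) - log q(z_i | x)),
    for z_1..z_k ~ q(z|x), computed stably via log-sum-exp."""
    w = np.asarray(log_joint) - np.asarray(log_q)   # log importance weights
    m = w.max()                                     # stabilizing shift
    return m + np.log(np.mean(np.exp(w - m)))
```

With k = 100 samples per datapoint, as in the experiments, this lower-bounds log p_θ(x) and tightens as k grows.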
We also ran the same experiment\nfor the \u03b2-VAE [12], sweeping a range of \u03b2 values, and the same results held (see appendix 5.1).\nTo make sure that the discrepancy between the continuous Bernoulli and Bernoulli is not due to\nthe approximate posterior not being able to adequately approximate the true posterior, we ran the\nsame experiments with \ufb02exible posteriors using normalizing \ufb02ows [34] and found the discrepancy to\nbecome even larger (see appendix 5.1).\nIt is natural to then wonder if this performance is an artifact of ELBO and log likelihood; thus, we\nalso evaluated the same datasets and models using different evaluation metrics. In the middle panel\nof Figure 4, we use the inception score [35] to measure sample quality produced by the different\nmodels (higher is better). Once again, we see that including the normalizing constant produces\nbetter samples (dark traces / continuous Bernoulli lie above light traces / Bernoulli). We include\nthat comparison on both the decoder parameters \u03bb (dark and light green) and also samples from\ndistributions indexed by those parameters (dark and light orange). One can imagine a variety of other\nparameter transformations that may be of interest; we include several in appendix 5.1, where again\nwe \ufb01nd that the continuous Bernoulli VAE outperforms its Bernoulli counterpart.\nIn the right panel of Figure 4, to measure usefulness of the latent representations of these models, we\ncompute m\u03c6\u2217(p)(xn) and m\u03c6\u2217( \u02dcp)(xn) (note that m is the variational posterior mean from equation\n2 and not the continuous Bernoulli mean) for training data and use a 15-nearest neighbor classi\ufb01er\nto predict the test labels. The right panel of Figure 4 shows the accuracy of the classi\ufb01ers (denoted\nK(\u03c6)) as a function of \u03b3. 
Once again, the continuous Bernoulli VAE outperforms the Bernoulli VAE.

Now that the continuous Bernoulli VAE gives us a proper model on [0, 1], we can also propose other natural models. Here we introduce and compare against the beta distribution VAE (not β-VAE [12]); as the name implies, the generative likelihood is Beta(α_θ(z), β_θ(z)). We repeated the same warped MNIST experiments using Gaussian VAE and beta distribution VAE, both including and ignoring the normalizing constants of those distributions, as an analogy to the continuous Bernoulli and Bernoulli distributions. First, Figure 5 shows again that ignoring the normalizing constant hurts performance in every metric and model (dark above light). Second, interestingly, we find that the continuous Bernoulli VAE outperforms both the beta distribution VAE and the Gaussian VAE in terms of inception scores, and that the beta distribution VAE dominates in terms of ELBO. This figure clarifies that the continuous Bernoulli and beta distribution are to be preferred over the Gaussian for VAE applied to [0, 1] valued data, and that ignoring normalizing constants is indeed unwise.

A few additional notes warrant mention on Figure 5. Unlike with the continuous Bernoulli, we should not expect the Gaussian and beta normalizing constants to go to infinity as extrema are reached, so we do not observe the same patterns with respect to γ as we did with the continuous Bernoulli.
Note also that the lower lower bound property of ignoring normalizing constants does not hold in general, as it is a direct consequence of the continuous Bernoulli log normalizing constant being nonnegative.

Figure 5: Gaussian (top panels) and beta (bottom panels) distribution comparisons between including and excluding the normalizing constants. Left panels show ELBOs, middle panels inception scores, and right panels 15-nearest neighbors accuracy.

Table 1: Comparisons of training with and without normalizing constants for CIFAR-10. For connection to the panels in Figures 4 and 5, column headers are colored accordingly.

distribution | objective | map | E(p, θ*, φ*) | IS w/ samples | IS w/ parameters | K(φ*)
CB/B | E(p, θ, φ) | · | 1007 | 1.15 | 2.31 | 0.43
CB/B | E(p̃, θ, φ) | µ⁻¹ | 916 | 1.49 | 4.55 | 0.42
CB/B | E(p̃, θ, φ) | · | 475 | 1.41 | 1.39 | 0.42
Gaussian | E(p, θ, φ) | · | 1891 | 1.86 | 3.04 | 0.42
Gaussian | E(p̃, θ, φ) | · | -42411 | 1.24 | 1.00 | 0.1
beta | E(p, θ, φ) | · | 3121 | 2.98 | 4.07 | 0.47
beta | E(p̃, θ, φ) | · | -38913 | 1.39 | 1.00 | 0.1

5.3 CIFAR-10

We repeat the same experiments as in the previous section on the CIFAR-10 dataset (without common preprocessing that takes the data outside [0, 1] support), a dataset often considered to be a bad fit for Bernoulli VAE. For brevity we evaluated only the non-warped data (γ = 0), leading to the results shown in Table 1 (note the colored column headers, to connect to the panels in Figures 4, 5).
The top\nsection shows results for the continuous Bernoulli VAE (\ufb01rst row, top), the Bernoulli VAE (third row,\ntop), and the Bernoulli VAE with a continuous Bernoulli inverse map \u00b5\u22121 (second row, top). Across\nall metrics \u2013 ELBO, inception score with parameters \u03bb, inception score with continuous Bernoulli\nsamples given \u03bb, and k nearest neighbors \u2013 the Bernoulli VAE is suboptimal. Interestingly, unlike\nin MNIST, here we see that using the continuous Bernoulli parameter correction \u00b5\u22121 (\u00a74.4) to a\nBernoulli VAE is optimal under some metrics. Again we note that this is a result belonging to the\ncontinuous Bernoulli (the parameter correction is derived from equation 8), so even these results\nemphasize the importance of the continuous Bernoulli.\nWe then repeat the same set of experiments for Gaussian and beta distribution VAE (middle and\nbottom sections of Table 1). Again, ignoring normalizing constants produces signi\ufb01cant performance\nloss across all metrics. 
Comparing metrics across the continuous Bernoulli, Gaussian, and beta sections, we see that the Gaussian VAE is suboptimal across all metrics, with the optimal distribution being either the continuous Bernoulli or the beta distribution VAE, depending on the metric.

[Figure 5 panels: ELBO, inception scores, and knn accuracy for the Gaussian VAE (top) and the beta distribution VAE (bottom), each plotted against the warping γ.]

5.4 Parameter estimation with EM

Finally, one might wonder whether the performance improvements of the continuous Bernoulli VAE over the Bernoulli VAE and its µ−1-corrected version are merely an artifact of not having access to the log likelihood and having to optimize the ELBO instead. In this section we show, empirically, that the answer is no. We consider estimating the parameters of a mixture of continuous Bernoulli distributions, of which the VAE can be thought of as a generalization with infinitely many components. We proceed as follows: we randomly generate a mixture of continuous Bernoulli distributions, ptrue, with K components in 50 dimensions (the dimensions being independent of each other), and draw 10000 samples from this mixture to form a simulated dataset. 
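The simulated dataset just described can be generated by inverse-CDF sampling: integrating the density p(x|λ) ∝ λ^x (1−λ)^(1−x) gives F−1(u|λ) = log(1 + u(2λ−1)/(1−λ)) / log(λ/(1−λ)) for λ ≠ 1/2, and F−1(u|1/2) = u. A sketch of the setup (our own construction; K = 4 and the parameter range are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def cb_icdf(u, lam):
    """Inverse CDF of the continuous Bernoulli, elementwise over arrays."""
    near_half = np.abs(lam - 0.5) < 1e-6
    safe = np.where(near_half, 0.49, lam)  # avoid 0/0 at lam = 1/2
    out = (np.log1p(u * (2.0 * safe - 1.0) / (1.0 - safe))
           / np.log(safe / (1.0 - safe)))
    return np.where(near_half, u, out)

def sample_cb_mixture(weights, lams, n):
    """Draw n samples from a K-component mixture of D-dimensional
    (independent-coordinate) continuous Bernoullis.
    weights: (K,) mixture coefficients; lams: (K, D) parameters."""
    z = rng.choice(len(weights), size=n, p=weights)  # component labels
    u = rng.random((n, lams.shape[1]))               # uniforms per pixel
    return cb_icdf(u, lams[z])                       # (n, D) in [0, 1]

# Mimic the setup above: K components in 50 dimensions, 10000 samples.
K, D = 4, 50
weights = rng.dirichlet(np.ones(K))
lams = rng.uniform(0.05, 0.95, size=(K, D))
data = sample_cb_mixture(weights, lams, 10000)
```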
We then use the EM algorithm [5] to estimate the mixture coefficients and the corresponding continuous Bernoulli parameters, first using a continuous Bernoulli likelihood (correct), and second using a Bernoulli likelihood (incorrect). We then measure how close the estimated parameters are to the ground truth. To avoid a (hard) optimization problem over permutations, we measure closeness with the KL divergence between the true distribution ptrue and the estimated pest.
The results of performing this procedure 10 times and averaging the KL values over the 10 repetitions, along with standard errors, are shown in Figure 6. First, when using the correct continuous Bernoulli likelihood, the EM algorithm correctly recovers the true distribution. Second, as the number of mixture components K grows, ignoring the normalizing constant results in worse performance, even after correcting with µ−1, which helps but does not fully remedy the situation (except at K = 1, as noted in §4.4).

Figure 6: Bias of the EM algorithm when estimating CB parameters using a CB likelihood (dark blue), a B likelihood (light blue), and a B likelihood plus a µ−1 correction (light red).

6 Conclusions

In this paper we introduce and characterize a novel probability distribution – the continuous Bernoulli – to study the effect of using a Bernoulli VAE on [0, 1]-valued intensity data, a pervasive error in highly cited papers, publicly available implementations, and core software tutorials alike. We show that this practice is equivalent to ignoring the normalizing constant of a continuous Bernoulli, and that doing so results in significant performance decreases: in the qualitative appearance of samples from these models, in the ELBO (by approximately 300 nats), in the inception score, and in the latent representation (via k nearest neighbors). 
Several surprising findings are shown, including: (i) that some plausible interpretations of ignoring a normalizing constant are in fact wrong; (ii) the (possibly counterintuitive) fact that this normalizing constant is most critical when data is near binary; and (iii) that the Gaussian VAE often underperforms VAE models with the appropriate data type (continuous Bernoulli or beta distributions).
Taken together, these findings suggest an important potential role for the continuous Bernoulli distribution going forward. On this point, we note that our characterization of the continuous Bernoulli's properties (such as its ease of reparameterization, likelihood evaluation, and sampling) makes it compatible with the vast array of VAE improvements that have been proposed in the literature, including flexible posterior approximations [34, 21], disentangling [12], discrete codes [28, 15], variance control strategies [30], and more.

Acknowledgments

We thank Yixin Wang, Aaron Schein, Andy Miller, and Keyon Vafa for helpful conversations, and the Simons Foundation, Sloan Foundation, McKnight Endowment Fund, NIH NINDS 5R01NS100066, NSF 1707398, and the Gatsby Charitable Foundation for support.

References

[1] PyTorch VAE tutorial: https://github.com/pytorch/examples/tree/master/vae, Keras VAE tutorial: https://blog.keras.io/building-autoencoders-in-keras.html.
[2] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[3] F. Chollet et al. Keras. https://keras.io, 2015.
[4] B. Dai and D. Wipf. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. 
Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[6] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[7] C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
[8] Y. Gao, E. W. Archer, L. Paninski, and J. P. Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, pages 163–171, 2016.
[9] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
[10] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pages 1462–1471, 2015.
[11] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. PixelVAE: A latent variable model for natural images. In International Conference on Learning Representations, 2017.
[12] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
[13] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[14] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. 
In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1587–1596. JMLR.org, 2017.
[15] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
[16] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
[17] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1965–1972. AAAI Press, 2017.
[18] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[21] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
[22] J. Knoblauch, J. Jewson, and T. Damoulas. Generalized variational inference. arXiv preprint arXiv:1904.02063, 2019.
[23] D. Koller, N. Friedman, and F. Bach. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
[24] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
[25] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. 
In International Conference on Machine Learning, pages 1558–1566, 2016.
[26] G. Loaiza-Ganem, Y. Gao, and J. P. Cunningham. Maximum entropy flow networks. In International Conference on Learning Representations, 2017.
[27] D. J. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
[28] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
[29] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, pages 1–7. Association for Computational Linguistics, 2002.
[30] A. Miller, N. Foti, A. D'Amour, and R. P. Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, pages 3708–3718, 2017.
[31] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467–475. Morgan Kaufmann Publishers Inc., 1999.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[33] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, 1982.
[34] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
[35] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. 
In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[36] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Colorado Univ at Boulder Dept of Computer Science, 1986.
[37] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
[38] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.