{"title": "BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 6551, "page_last": 6562, "abstract": "With the introduction of the variational autoencoder (VAE), probabilistic latent variable models have received renewed attention as powerful generative models. However, their performance in terms of test likelihood and quality of generated samples has been surpassed by autoregressive models without stochastic units. Furthermore, flow-based models have recently been shown to be an attractive alternative that scales well to high-dimensional data. In this paper we close the performance gap by constructing VAE models that can effectively utilize a deep hierarchy of stochastic variables and model complex covariance structures. We introduce the Bidirectional-Inference Variational Autoencoder (BIVA), characterized by a skip-connected generative model and an inference network formed by a bidirectional stochastic inference path. We show that BIVA reaches state-of-the-art test likelihoods, generates sharp and coherent natural images, and uses the hierarchy of latent variables to capture different aspects of the data distribution. We observe that BIVA, in contrast to recent results, can be used for anomaly detection. We attribute this to the hierarchy of latent variables which is able to extract high-level semantic features. 
Finally, we extend BIVA to semi-supervised classification tasks and show that it performs comparably to state-of-the-art results by generative adversarial networks.", "full_text": "BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling

Lars Maaløe (Corti, Copenhagen, Denmark) lm@corti.ai
Marco Fraccaro (Unumed, Copenhagen, Denmark) mf@unumed.com
Valentin Liévin & Ole Winther (Technical University of Denmark, Copenhagen, Denmark) {valv,olwi}@dtu.dk

Abstract

With the introduction of the variational autoencoder (VAE), probabilistic latent variable models have received renewed attention as powerful generative models. However, their performance in terms of test likelihood and quality of generated samples has been surpassed by autoregressive models without stochastic units. Furthermore, flow-based models have recently been shown to be an attractive alternative that scales well to high-dimensional data. In this paper we close the performance gap by constructing VAE models that can effectively utilize a deep hierarchy of stochastic variables and model complex covariance structures. We introduce the Bidirectional-Inference Variational Autoencoder (BIVA), characterized by a skip-connected generative model and an inference network formed by a bidirectional stochastic inference path. We show that BIVA reaches state-of-the-art test likelihoods, generates sharp and coherent natural images, and uses the hierarchy of latent variables to capture different aspects of the data distribution. We observe that BIVA, in contrast to recent results, can be used for anomaly detection. We attribute this to the hierarchy of latent variables which is able to extract high-level semantic features. 
Finally, we extend BIVA to semi-supervised classification tasks and show that it performs comparably to state-of-the-art generative adversarial networks.

1 Introduction

One of the key aspirations in recent machine learning research is to build models that understand the world [24, 40, 11, 57]. Generative models provide the means to learn from a plethora of unlabeled data in order to model a complex data distribution, e.g. natural images, text, and audio. These models are evaluated by their ability to generate data similar to the data distribution they were trained on. The range of applications of generative models is vast, with audio synthesis [55] and semi-supervised classification [38, 31, 44] as two examples.

Generative models can be broadly divided into explicit and implicit density models. The generative adversarial network (GAN) [11] is an example of an implicit model, since it is not possible to obtain a likelihood estimate from this model framework. The focus of this research is instead on explicit density models, for which a tractable or approximate likelihood estimate can be computed. The three main classes of powerful explicit density models are autoregressive models [26, 57], flow-based models [8, 9, 21, 16], and probabilistic latent variable models [24, 40, 33]. In recent years autoregressive models, such as the PixelRNN and the PixelCNN [57, 45], have achieved superior likelihood performance, and flow-based models have proven effective on large-scale natural image generation tasks [21]. However, for autoregressive models the runtime of generation scales poorly with the complexity of the input distribution. 
Flow-based models do not possess this restriction and do indeed generate visually compelling natural images when sampling close to the mode of the distribution. However, generation from the actual learned distribution still does not outperform autoregressive models [21, 16].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Probabilistic latent variable models such as the variational auto-encoder (VAE) [24, 40] possess intriguing properties that are different from the other classes of explicit density models. They are characterized by a posterior distribution over the latent variables of the model, derived from Bayes' theorem, which is typically intractable and needs to be approximated. This distribution most commonly lies on a low-dimensional manifold that can provide insights into the internal representation of the data [1]. However, latent variable models have largely been disregarded as powerful generative models due to blurry generations and poor likelihood performance on natural image tasks. [27, 10], amongst others, attribute this tendency to the usage of a similarity metric in pixel space. Contrarily, we attribute it to the lack of overall model expressiveness for accurately modeling complex input distributions, as discussed in [59, 41].

There has been much research into explicitly defining and learning more expressive latent variable models. Here, the complementary research into learning a covariance structure through a framework of normalizing flows [39, 52, 23] and the stacking of a hierarchy of latent variables [4, 37, 31, 50] has shown promising results. However, despite significant improvements, the reported performance of these models has still been inferior to their autoregressive counterparts. This has spawned a new class of explicit density models that adds an autoregressive component to the generative process of a latent variable model [14, 5]. 
In this combination of model paradigms, the latent variables can be viewed as merely a lossy representation of the input data, and the model still suffers from the same issues as autoregressive models.

Contributions. In this research we argue that latent variable models that are defined in a sufficiently expressive way can compete with autoregressive and flow-based models in terms of test log-likelihood and quality of the generated samples. We introduce the Bidirectional-Inference Variational Autoencoder (BIVA), a model formed by a deep hierarchy of stochastic variables that uses skip-connections to enhance the flow of information and avoid inactive units. To define a flexible posterior approximation, we construct a bidirectional inference network using stochastic variables in a bottom-up and a top-down inference path. The inference model is reminiscent of the stochastic top-down path introduced in the Ladder VAE [50] and the IAF VAE [23], with the addition that the bottom-up pass is now also stochastic and there are no autoregressive components. 
We perform an in-depth analysis of BIVA and show (i) an ablation study that analyses the contributions of the individual novel components, (ii) that the model is able to improve on state-of-the-art results on benchmark image datasets, (iii) that a small extension of the model can be used for semi-supervised classification and performs comparably to current state-of-the-art models, and (iv) that the model, contrarily to other state-of-the-art explicit density models [34], can be utilized for anomaly detection on complex data distributions.

2 Variational Autoencoders

The VAE is a generative model parameterized by a neural network \theta and is defined by an observed variable x that depends on a hierarchy of stochastic latent variables z = z_1, ..., z_L so that:

    p_\theta(x, z) = p_\theta(x|z_1)\, p_\theta(z_L) \prod_{i=1}^{L-1} p_\theta(z_i|z_{i+1}) .

The posterior distribution over the latent variables of a VAE is commonly analytically intractable, and is approximated with a variational distribution which is factorized with a bottom-up structure,

    q_\phi(z|x) = q_\phi(z_1|x) \prod_{i=1}^{L-1} q_\phi(z_{i+1}|z_i) ,

so that each latent variable is conditioned on the variable below in the hierarchy. The parameters \theta and \phi can be optimized by maximizing the evidence lower bound (ELBO)

    \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] \equiv \mathcal{L}(\theta, \phi) .    (1)

A detailed introduction to VAEs can be found in Appendix A in the supplementary material. While a deep hierarchy of latent stochastic variables will result in a more expressive model, in practice the top stochastic latent variables of standard VAEs have a tendency to collapse into the prior. The Ladder VAE (LVAE) [50] is amongst the first attempts towards VAEs that can effectively leverage multiple layers of stochastic variables. 
This is achieved by parameterizing the variational approximation with a bottom-up deterministic path followed by a top-down inference path that shares parameters with the top-down structure of the generative model:

    q_{\phi,\theta}(z|x) = q_\phi(z_L|x) \prod_{i=1}^{L-1} q_{\phi,\theta}(z_i|z_{i+1}, x) .

See Appendix A for a graphical representation of the LVAE inference network. Thanks to the bottom-up path, all the latent variables in the hierarchy have a deterministic dependency on the observed variable x, which allows data-dependent information to skip all the stochastic variables lower in the hierarchy (Figure 5d in Appendix A). The stochastic latent variables that are higher in the hierarchy will therefore receive less noisy inputs, and will be empirically less likely to collapse. Despite the improvements obtained thanks to the more flexible inference network, in practice LVAEs with a very deep hierarchy of stochastic latent variables will still experience variable collapse. 

Figure 1: An L = 3 layered BIVA with (a) the generative model and (b) the inference model. Blue arrows indicate that the deterministic parameters are shared between the inference and generative models. See Appendix B for a detailed explanation and a graphical model that includes the deterministic variables.
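As a concrete illustration of eq. (1), the ELBO of a toy linear Gaussian model can be estimated with the reparameterization trick in a few lines of NumPy. This is only a minimal sketch: the encoder/decoder matrices, dimensions, and unit variances below are hypothetical stand-ins, not the neural networks used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def elbo_estimate(x, enc_w, dec_w, n_samples=64):
    """Monte Carlo estimate of the ELBO for a toy linear model.

    q(z|x) = N(enc_w @ x, I), p(x|z) = N(dec_w @ z, I), p(z) = N(0, I);
    the Gaussian log-likelihood is taken up to its additive constant."""
    mu = enc_w @ x
    log_var = np.zeros_like(mu)              # unit posterior variance, for simplicity
    log_px = []
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps  # reparameterization trick
        log_px.append(-0.5 * np.sum((x - dec_w @ z) ** 2))
    return np.mean(log_px) - gaussian_kl(mu, log_var)

x = np.array([0.5, -1.0, 0.25])
enc_w = np.eye(2, 3)                         # hypothetical 3 -> 2 "encoder"
dec_w = np.eye(3, 2)                         # hypothetical 2 -> 3 "decoder"
print(elbo_estimate(x, enc_w, dec_w))        # a finite, negative lower-bound estimate
```

In a real VAE the expectation over q is estimated with a single sample per data point and the gradient flows through z thanks to the reparameterization; the toy above just averages several samples to reduce the estimator's variance.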
In the next section we will introduce the Bidirectional-Inference Variational Autoencoder, which manages to avoid these issues by extending the LVAE in two ways: (i) adding a deterministic top-down path in the generative model and (ii) defining a factorization of the latent variables z_i at each level of the hierarchy that allows us to construct a bottom-up stochastic inference path.

3 Bidirectional-Inference Variational Autoencoder

In this section, we will first describe the architecture of the Bidirectional-Inference Variational Autoencoder (Figure 1), and then provide the motivation behind the main ideas of the model as well as some intuitions on the role of each of its novel components. Finally, we will show how this model can be used for a novel approach to detecting anomalous data.

3.1 Model architecture

Generative model. In BIVA, at each layer i = 1, ..., L - 1 of the hierarchy we split the latent variable in two components, z_i = (z_i^{BU}, z_i^{TD}), which belong to a bottom-up (BU) and top-down (TD) inference path, respectively. More details on this will be given when introducing the inference network. The generative model of BIVA is illustrated in Figure 1a. We introduce a deterministic top-down path d_{L-1}, ..., d_1 that is parameterized with neural networks and receives as input at each layer i of the hierarchy the latent variable z_{i+1}. In the case of a convolutional model, this is done by concatenating (z_{i+1}^{BU}, z_{i+1}^{TD}) and d_{i+1} along the features' dimension. d_i can therefore be seen as a deterministic variable that summarizes all the relevant information coming from the stochastic variables higher in the hierarchy, z_{>i}. The latent variables z_i^{BU} and z_i^{TD} are conditioned on all the information in the higher layers, and are conditionally independent given z_{>i}. 
The joint distribution of the model is then given by:

    p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z_L) \prod_{i=1}^{L-1} p_\theta(z_i^{BU}|z_{>i})\, p_\theta(z_i^{TD}|z_{>i}) ,

where \theta are the parameters of the generative model. The likelihood of the model p_\theta(x|z) directly depends on z_1, and depends on z_{>1} through the deterministic top-down path. Each stochastic latent variable i = 1, ..., L is parameterized by a Gaussian distribution with diagonal covariance, with one neural network \mu(\cdot) for the mean and another neural network \sigma(\cdot) for the variance. Since the z_{i+1}^{BU} and z_{i+1}^{TD} variables are on the same level in the generative model and of the same dimensionality, we share all the deterministic parameters going to the layer below. See Appendix B for details.

Bidirectional inference network. Due to the non-linearities in the neural networks that parameterize the generative model, the exact posterior distribution p_\theta(z|x) is intractable and needs to be approximated. As for VAEs, we therefore define a variational distribution, q_\phi(z|x), that needs to be flexible enough to approximate the true posterior distribution as closely as possible. We define a bottom-up (BU) and a top-down (TD) inference path, which are computed sequentially when constructing the posterior approximation for each data point x, see Figure 1b. The variational distribution over the BU latent variables depends on the data x and on all the BU variables lower in the hierarchy, i.e. q_\phi(z_i^{BU}|x, z_{<i}^{BU}). Each TD latent variable z_i^{TD} has a direct dependence on the data x, on the BU variables lower in the hierarchy, and on all the variables higher in the hierarchy, i.e. q_{\phi,\theta}(z_i^{TD}|x, z_{<i}^{BU}, z_{>i}^{BU}, z_{>i}^{TD}). Importantly, all the parameters of the TD path are shared with the generative model, and are therefore denoted as \theta. 
The overall inference network can be factorized as follows:

    q_{\phi,\theta}(z|x) = q_\phi(z_L|x, z_{<L}^{BU}) \prod_{i=1}^{L-1} q_\phi(z_i^{BU}|x, z_{<i}^{BU})\, q_{\phi,\theta}(z_i^{TD}|x, z_{<i}^{BU}, z_{>i}^{BU}, z_{>i}^{TD}) ,

where the variational distributions over the BU and TD latent variables are Gaussians whose mean and diagonal covariance are parameterized with neural networks that take as input the concatenation over the feature dimension of the conditioning variables. Training of BIVA is performed, as for VAEs, by maximizing the ELBO in eq. (1) with stochastic backpropagation and the reparameterization trick.

3.2 Motivation

BIVA can be seen as an extension of the LVAE in which we (i) add a deterministic top-down path and (ii) apply a bidirectional inference network. We will now provide the motivation and some intuitions on the role of these two novel components, which will then be empirically validated with the ablation study of Section 4.1.

Deterministic top-down path. Skip-connections represent one of the simplest yet most powerful advancements of deep learning in recent years. They allow constructing very deep neural networks by better propagating the information throughout the model and reducing the issue of vanishing gradients. Skip-connections form, for example, the backbone of deep neural networks such as ResNets [15], which have shown impressive performance on a wide range of classification tasks. Our goal in this paper is to build very deep latent variable models that are able to learn an expressive latent hierarchical representation of the data. In our experiments, however, we found that the LVAE still had difficulties in activating the top latent variables for deeper hierarchies. To limit this issue, we add skip-connections among the latent variables in the generative model by adding the deterministic top-down path, which makes each variable depend on all the variables above in the hierarchy (see Figure 1a for a graphical representation). 
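To make the structure of the generative hierarchy concrete, ancestral sampling from a BIVA-like prior with the deterministic top-down path can be sketched as follows. Everything here is illustrative: the "networks" are fixed random linear maps with hand-picked toy dimensions, not the trained convolutional networks of the paper, and all variances are set to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for an L = 3 hierarchy (illustrative only).
L, dim = 3, 4
W_d = [rng.standard_normal((dim, 3 * dim)) * 0.1 for _ in range(L - 1)]   # top-down path
W_mu = [rng.standard_normal((2 * dim, dim)) * 0.1 for _ in range(L - 1)]  # prior means

def sample_prior():
    """Ancestral sampling from a BIVA-like prior.

    z_L is sampled first; going down, each layer draws (z_bu, z_td) from
    priors whose means are computed from a deterministic top-down variable d
    that summarizes all stochastic variables above (z_{>i})."""
    z_top = rng.standard_normal(dim)
    d = z_top                                  # top of the deterministic path
    layers = [z_top]
    for i in reversed(range(L - 1)):           # layers L-1, ..., 1
        stats = W_mu[i] @ d                    # means of p(z_bu|z_{>i}) and p(z_td|z_{>i})
        z_bu = stats[:dim] + rng.standard_normal(dim)   # unit variance, for simplicity
        z_td = stats[dim:] + rng.standard_normal(dim)
        layers.append((z_bu, z_td))
        # d_i is computed from (z_bu, z_td, d_{i+1}) and feeds the layer below
        d = W_d[i] @ np.concatenate([z_bu, z_td, d])
    return layers, d                           # d_1 would condition p(x|z)

layers, d1 = sample_prior()
```

The deterministic variable d plays the role of the skip-connection: every latent variable below receives a summary of everything sampled above it, rather than only the layer immediately above.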
This allows a better flow of information in the model and thereby avoids the collapse of latent variables. A related idea was recently proposed by [7], who add skip-connections among the neural network layers parameterizing a shallow VAE with a single latent variable.

Bidirectional inference. The inspiration for the bidirectional inference network of BIVA comes from the work on Auxiliary VAEs (AVAE) by [37, 31]. An AVAE can be viewed as a shallow VAE with a single latent variable z and an auxiliary variable a that increases the expressiveness of the variational approximation q_\phi(z|x) = \int q_\phi(z|a, x)\, q_\phi(a|x)\, da. By making the inference network q_\phi(z|a, x) depend on the stochastic variable a, the AVAE adds covariance structure to the posterior approximation over the stochastic unit z, since it no longer factorizes over its components z^{(k)}, i.e. q_\phi(z|x) \neq \prod_k q_\phi(z^{(k)}|x). As discussed in the following, by factorizing the latent variables at each level of the hierarchy of BIVA we are able to achieve similar results without introducing additional auxiliary variables in the model. To see this, we can focus for example on the highest latent variable z_L. In BIVA, the presence of the z_i^{BU} variables makes the bottom-up inference path stochastic, as opposed to the deterministic BU path of the LVAE. While the conditional distribution q_\phi(z_L|x, z_{<L}^{BU}) is a factorized Gaussian, the posterior approximation over z_L no longer factorizes, i.e. q_\phi(z_L|x) = \int q_\phi(z_L|x, z_{<L}^{BU})\, q_\phi(z_{<L}^{BU}|x)\, dz_{<L}^{BU}.

3.3 Anomaly detection

For 0 \le k < L, consider the distribution p_\theta(z_{\le k}|z_{>k})\, q_\phi(z_{>k}|x, z_{\le k}^{BU}), in which only the top L - k latent variables are sampled from the variational approximation and the remaining ones from the conditional prior. In the computation of q_\phi(z_{>k}|x, z_{\le k}^{BU}) we use samples z_{\le k}^{BU} from the inference network. Using this alternative distribution instead of q_\phi(z|x) in the ELBO in eq. (1), we define the score function for anomaly detection as:

    \mathcal{L}_{>k} = \mathbb{E}_{p_\theta(z_{\le k}|z_{>k})\, q_\phi(z_{>k}|x, z_{\le k}^{BU})}\left[ \log \frac{p_\theta(x|z)\, p_\theta(z_{>k})}{q_\phi(z_{>k}|x, z_{\le k}^{BU})} \right] .    (2)

\mathcal{L}_{>0} = \mathcal{L} is the ELBO in eq. (1). 
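The score in eq. (2) can be estimated with Monte Carlo samples. A minimal one-dimensional sketch is shown below; every density and sampler is a hand-chosen unit-variance Gaussian placeholder standing in for the model's networks, so only the structure of the estimator, not its values, is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def l_greater_k(x, sample_q_top, sample_prior_bottom,
                log_p_x_given_z, log_p_top, log_q_top, n_samples=200):
    """Monte Carlo estimate of an L_{>k}-style score.

    Only the top variables are drawn from the inference network; the bottom
    ones come from the conditional prior, whose density cancels in the ratio.
    All callables are illustrative placeholders."""
    vals = []
    for _ in range(n_samples):
        z_top = sample_q_top(x)                 # ~ q(z_{>k} | x, z_{<=k}^BU)
        z_bottom = sample_prior_bottom(z_top)   # ~ p(z_{<=k} | z_{>k})
        vals.append(log_p_x_given_z(x, z_bottom)
                    + log_p_top(z_top) - log_q_top(z_top, x))
    return float(np.mean(vals))

# One-dimensional toy instantiation: every density is a unit-variance Gaussian.
log_norm = lambda v, m: -0.5 * ((v - m) ** 2 + np.log(2 * np.pi))
score = l_greater_k(
    x=0.3,
    sample_q_top=lambda x: x + rng.standard_normal(),         # q = N(x, 1)
    sample_prior_bottom=lambda zt: zt + rng.standard_normal(),
    log_p_x_given_z=lambda x, zb: log_norm(x, zb),            # p(x|z) = N(z_bottom, 1)
    log_p_top=lambda zt: log_norm(zt, 0.0),                   # p(z_top) = N(0, 1)
    log_q_top=lambda zt, x: log_norm(zt, x),
)
```

For anomaly detection one would compare such scores between in-distribution and out-of-distribution inputs, with larger k forcing the score to rely on the higher, more semantic variables.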
As for the ELBO, we approximate the computation of \mathcal{L}_{>k} with Monte Carlo integration. Sampling from p_\theta(z_{\le k}|z_{>k})\, q_\phi(z_{>k}|x, z_{\le k}^{BU}) can be easily performed by obtaining samples \hat{z}_{>k} from the inference network, which are then used to sample \hat{z}_{\le k} from the conditional prior p_\theta(z_{\le k}|\hat{z}_{>k}).

\mathcal{L}_{>k} with higher values of k represents a useful metric for anomaly detection, as shown empirically in the experiments of Section 4.4. By only sampling the top L - k variables from the variational approximation, in fact, we are forcing the model to only rely on the high-level semantics encoded in the highest variables of the hierarchy when evaluating this metric, and not on the low-level statistics encoded in the lower variables.

4 Experiments

BIVA is empirically evaluated by (i) an ablation study analyzing each novel component, (ii) likelihood and semi-supervised classification results on binary images, (iii) likelihood results on natural images, and (iv) an analysis of anomaly detection in complex data distributions. We employ a free bits strategy with \lambda = 2 [23] for all experiments to avoid latent variable collapse during the initial training epochs. Trained models are reported with 1 importance weighted sample, L_1, and 1000 importance weighted samples, L_1e3 [3]. We evaluate the natural image experiments in bits per dimension (bits/dim), L/(hwc \log 2), where h, w, c denote the height, width, and channels respectively. For a detailed

Figure 2: The log KL(q||p) for each stochastic latent variable as a function of the training epochs on CIFAR-10. (a) is an L = N = 15 stochastic latent layer LVAE with no skip-connections and no bottom-up inference. 
(b) is an L = N = 15 LVAE+ with skip-connections and no bottom-up inference. (c) is an L = 15 stochastic latent layer (N = 29 latent variables) BIVA, for which 1, 2, ..., N denotes the stochastic latent variables following the order z_1^{BU}, z_1^{TD}, z_2^{BU}, z_2^{TD}, ..., z_L.

Figure 3: (left) Images from the CelebA dataset preprocessed to 64x64 following [27]. (right) N(0, I) generations of BIVA with L = 20 layers, which achieves L_1 = 2.48 bits/dim on the test set.

description of the experimental setup see Appendix C and the source code (footnotes 1 and 2). In Appendix D we test BIVA on complex 2d densities, while Appendix E presents initial results for the model on text.

4.1 Ablation Study

BIVA can be viewed as an extension of the LVAE from [50] where we add (i) extra dependencies in the generative model (p_\theta(x|z_1) \to p_\theta(x|z) and p_\theta(z_i|z_{i+1}) \to p_\theta(z_i|z_{>i})) through the skip-connections obtained with the deterministic top-down path and (ii) a bottom-up (BU) path of stochastic latent variables in the inference model. In order to evaluate the effects of each added component we define an LVAE with the exact same architecture as BIVA, but without the BU variables and the deterministic top-down path. Next, we define the LVAE+, where we add the deterministic top-down path to the LVAE's generative model. It is therefore the same model as in Figure 1 but without the BU variables. Finally, we investigate an LVAE+ model with 2L - 1 stochastic layers. This corresponds to the depth of the hierarchy of the BIVA inference model, x \to z_1^{BU} \to \cdots \to z_{L-1}^{BU} \to z_L \to z_{L-1}^{TD} \to \cdots \to z_1^{TD}. If this model is competitive with BIVA then it is an indication that it is the depth that determines the performance. 
The ablation study is conducted on the CIFAR-10 dataset against the best reported BIVA with L = 15 layers (Section 4.3), which means 2L - 1 = 29 stochastic latent layers in the deep LVAE+.

Table 1: A comparison of the LVAE with no skip-connections and no bottom-up inference, the LVAE+ with skip-connections and no bottom-up inference, and BIVA. All models are trained on the CIFAR-10 dataset.

                        BITS/DIM    PARAM.
    LVAE  L=15, L_1     <= 3.60     72.36M
    LVAE+ L=15, L_1     <= 3.41     73.35M
    LVAE+ L=29, L_1     <= 3.45     119.71M
    BIVA  L=15, L_1     <= 3.12     102.95M

Table 1 presents a comparison of the different model architectures. The positive effect of adding the skip-connections in the generative model can be evaluated from the difference between the LVAE L = 15 and LVAE+ L = 15 results, for which there is close to a 0.2 bits/dim difference in the ELBO. Thanks to the more expressive posterior approximation obtained using its bidirectional inference network, BIVA improves the ELBO significantly w.r.t. the LVAE+, by more than 0.3 bits/dim. Notice that a deeper hierarchy of stochastic latent variables in the LVAE+ will not necessarily provide a better likelihood performance, since the LVAE+ L = 29 performs worse than the LVAE+ L = 15 despite having significantly more parameters. In Figure 2 we plot for the LVAE, LVAE+ and BIVA the KL divergence between the variational approximation over each latent variable

1 Source code (TensorFlow): https://github.com/larsmaaloee/BIVA.
2 Source code (PyTorch): https://github.com/vlievin/biva-pytorch.

Table 2: Test log-likelihood on statically binarized MNIST for different numbers of importance weighted samples. The finetuned models are trained for an additional number of epochs with no free bits, \lambda = 0. 
For testing resiliency we trained 4 models and evaluated the standard deviation to be ±0.031 for L_1.

                                     - log p(x)
    With autoregressive components
    PIXELCNN [57]                    = 81.30
    DRAW [13]                        < 80.97
    IAFVAE [23]                      <= 79.88
    PIXELVAE [14]                    <= 79.66
    PIXELRNN [57]                    = 79.20
    VLAE [5]                         <= 79.03
    Without autoregressive components
    DISCRETE VAE [42]                <= 81.01
    BIVA, L_1                        <= 81.20
    BIVA, L_1e3                      <= 78.67
    BIVA FINETUNED, L_1              <= 80.47
    BIVA FINETUNED, L_1e3            <= 78.59

Figure 4: Histograms and kernel density estimation of the L_{>k} for k = 13, 11, 0 evaluated in bits/dim by a model trained on the CIFAR-10 train dataset and evaluated on the CIFAR-10 and the SVHN test sets.

Table 3: Semi-supervised test error for BIVA on MNIST for 100 randomly chosen and evenly distributed labelled samples.

                      ERROR %
    M1+M2 [22]        3.33% (±0.14)
    VAT [32]          2.12%
    CATGAN [51]       1.91% (±0.10)
    SDGM [31]         1.32% (±0.07)
    LADDERNET [38]    1.06% (±0.37)
    ADGM [31]         0.96% (±0.02)
    IMPGAN [44]       0.93% (±0.07)
    TRIPLEGAN [29]    0.91% (±0.58)
    SSLGAN [6]        0.80% (±0.10)
    BIVA              0.83% (±0.02)

Table 4: Test log-likelihood on CIFAR-10 for different numbers of importance weighted samples. We evaluated two BIVA models with different numbers of layers (L). For testing resiliency we trained 3 models and evaluated the standard deviation to be ±0.013 for L_1 and L = 15.

                                     BITS/DIM
    With autoregressive components
    CONVDRAW [12]                    < 3.58
    IAFVAE L_1 [23]                  <= 3.15
    IAFVAE L_1e3 [23]                <= 3.12
    GATEDPIXELCNN [56]               = 3.03
    PIXELRNN [57]                    = 3.00
    VLAE [5]                         <= 2.95
    PIXELCNN++ [45]                  = 2.92
    Without autoregressive components
    NICE [8]                         = 4.48
    DEEPGMMS [58]                    = 4.00
    REALNVP [9]                      = 3.49
    DISCRETEVAE++ [54]               <= 3.38
    GLOW [21]                        = 3.35
    FLOW++ [16]                      = 3.08
    BIVA L=10, L_1                   <= 3.17
    BIVA L=15, L_1                   <= 3.12
    BIVA L=15, L_1e3                 <= 3.08

and its prior distribution, KL(q||p). 
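For diagonal Gaussians, this per-variable diagnostic has a closed form. A small self-contained sketch with illustrative dimensions:

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q||p) between diagonal Gaussians, summed over dimensions:

    KL = 0.5 * sum( logvar_p - logvar_q
                    + (exp(logvar_q) + (mu_q - mu_p)^2) / exp(logvar_p) - 1 )."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

# A collapsed variable: posterior equals the prior, so the KL is exactly 0.
z = np.zeros(8)
assert kl_diag_gauss(z, z, z, z) == 0.0

# An active variable: the posterior mean moved away from the prior, KL > 0.
mu_q = np.full(8, 0.5)
print(kl_diag_gauss(mu_q, z, z, z))   # 1.0
```

Tracking this quantity per latent variable over training, as in Figure 2, is how collapsed and active variables are told apart.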
This KL divergence is 0 when the two distributions match, in which case we say that the variable has collapsed, since its posterior approximation is not using any data-dependent information. We can see that while the LVAE is only able to utilize its lowest 7 stochastic variables, all variables in both the LVAE+ and BIVA are active. We attribute this tendency to the deterministic top-down path that is present in both models, which creates skip-connections between all latent variables that allow the information to propagate better throughout the model.

4.2 Binary Images

We evaluate BIVA L = 6 in terms of test log-likelihood on statically binarized MNIST [43], dynamically binarized MNIST [28] and dynamically binarized OMNIGLOT [25]. The model parameterization and optimization parameters have been kept identical for all binary image experiments (see Appendix C). For each experiment on binary image datasets, we finetune each model by setting the free bits to \lambda = 0 until convergence in order to test the tightness of the L_1 ELBO.

To the best of our knowledge, BIVA achieves state-of-the-art results on statically binarized MNIST, outperforming other latent variable models, autoregressive models, and flow-based models (see Table 2). Finetuning the model with \lambda = 0 improves the L_1 ELBO significantly and achieves slightly better performance for the 1000 importance weighted samples. For dynamically binarized MNIST 
For dynamically binarized MNIST\n\n7\n\n\fModel trained on CIFAR-10:\nCIFAR-10\nSVHN\nModel trained on FashionMNIST:\nFASHIONMNIST\nMNIST\n\nL>L\u22122\n\nL>L\u22124\n\nL>L\u22126 L>0\n\n79.36\n121.04\n\n228.38\n295.95\n\n35.34\n58.82\n\n20.93\n26.76\n\n3.12\n2.28\n\n107.07\n130.39\n\n-\n-\n\n94.05\n128.60\n\nTable 5: The test L>k\nfor different values of\nk and train/test dataset\ncombinations evaluated\nin bits/dim for natural\nimages and negative log-\nlikelihood for binary im-\nages (lower is better).\n\nand OMNIGLOT, BIVA achieves similar improvements with L1e3 = 78.41 (state-of-the-art) and\nL1e3 = 91.34 respectively, see Tables 10 and 11 in Appendix G.\n\nSemi-supervised learning. BIVA can be easily extended for semi-supervised classi\ufb01cation by\nadding a categorical variable y to represent the class, as done in [22]. We add a classi\ufb01cation\nmodel q\u03c6(y|x, zBU\n 0).\n\n8\n\n\f5 Conclusion\n\nIn this paper, we have introduced BIVA, that signi\ufb01cantly improves performances over previously\nintroduced probabilistic latent variable models and \ufb02ow-based models. BIVA is able to generate natu-\nral images that are both sharp and coherent, to improve on semi-supervised classi\ufb01cation benchmarks\nand, contrarily to other models, allows for anomaly detection using the extracted high-level semantics\nof the data.\n\n9\n\n\fReferences\n[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE\n\nTransactions on Pattern Analysis and Machine Intelligence, 35(8), 2013.\n\n[2] S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a\n\ncontinuous space. arXiv preprint arXiv:1511.06349, 2015.\n\n[3] Y. Burda, R. Grosse, and R. Salakhutdinov. Accurate and conservative estimates of mrf log-likelihood\nusing reverse annealing. In Proceedings of the International Conference on Arti\ufb01cial Intelligence and\nStatistics, 2015.\n\n[4] Y. Burda, R. Grosse, and R. 
Salakhutdinov. Importance Weighted Autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[5] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational Lossy Autoencoder. In International Conference on Learning Representations, 2017.
[6] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, 2017.
[7] A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei. Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863, 2018.
[8] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[9] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[10] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, 2016.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[12] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra. Towards conceptual compression. arXiv preprint arXiv:1604.08772, 2016.
[13] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[14] I. Gulrajani, K. Kumar, F. Ahmed, A. Ali Taiga, F. Visin, D. Vazquez, and A. Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[16] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. 
Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.
[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[18] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.
[19] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[20] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 2018.
[22] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-Supervised Learning with Deep Generative Models. In Proceedings of the International Conference on Machine Learning, 2014.
[23] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
[24] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[25] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, 2013.
[26] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011.
[27] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. 
In Proceedings of the International Conference on Machine Learning, 2016.\n\n[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.\nIn Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,\npages 2278\u20132324, 1998.\n\n[29] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291,\n\n2017.\n\n[30] L. Maal\u00f8e, M. Fraccaro, and O. Winther. Semi-supervised generation with cluster-aware generative models.\n\narXiv preprint arXiv:1704.00637, 2017.\n\n[31] L. Maal\u00f8e, C. K. S\u00f8nderby, S. K. S\u00f8nderby, and O. Winther. Auxiliary Deep Generative Models. In\n\nProceedings of the International Conference on Machine Learning, 2016.\n\n[32] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional Smoothing with Virtual\n\nAdversarial Training. arXiv preprint arXiv:1507.00677, 7 2015.\n\n[33] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In Proceedings of\n\nthe International Conference on Machine Learning, pages 1791\u20131799, 2014.\n\n[34] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models\n\nknow what they don\u2019t know? arXiv preprint arXiv:1810.09136, 2018.\n\n[35] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with\nunsupervised feature learning. In Deep Learning and Unsupervised Feature Learning, workshop at Neural\nInformation Processing Systems 2011, 2011.\n\n[36] J. Paisley, D. M. Blei, and M. I. Jordan. Variational bayesian inference with stochastic search.\n\nProceedings of the International Conference on Machine Learning, pages 1363\u20131370, 2012.\n\nIn\n\n[37] R. Ranganath, D. Tran, and D. M. Blei. Hierarchical variational models. In Proceedings of the International\n\nConference on Machine Learning, 2016.\n\n[38] A. Rasmus, M. Berglund, M. 
Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder\n\nnetworks. In Advances in Neural Information Processing Systems, 2015.\n\n[39] D. J. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the\n\nInternational Conference on Machine Learning, 2015.\n\n[40] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in\n\nDeep Generative Models. arXiv preprint arXiv:1401.4082, 04 2014.\n\n[41] D. J. Rezende and F. Viola. Taming vaes. arXiv preprint arXiv:1810.00597, 2018.\n\n[42] J. T. Rolfe. Discrete variational autoencoders. In Proceedings of the International Conference on Learning\n\nRepresentations, 2017.\n\n[43] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of\n\nthe International Conference on Machine Learning, 2008.\n\n[44] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for\n\ntraining gans. arXiv preprint arXiv:1606.03498, 2016.\n\n[45] T. Salimans, A. Karparthy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with\ndiscretized logistic mixture likelihood and other modi\ufb01cations. arXiv preprint:1701.05517, 2017, 2017.\n\n[46] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of\ndeep neural networks. In Proceedings of the International Conference on Neural Information Processing\nSystems, 2016.\n\n11\n\n\f[47] T. Salimans, D. P. Kingma, and M. Welling. Markov chain monte carlo and variational inference: Bridging\n\nthe gap. In Proceedings of the International Conference on Machine Learning, 2015.\n\n[48] S. Semeniuta, A. Severyn, and E. Barth. A hybrid convolutional variational autoencoder for text generation.\n\narXiv preprint arXiv:1702.02390, 2017.\n\n[49] H. Shah, B. Zheng, and D. Barber. Generating sentences using a dynamic canvas, 2018.\n\n[50] C. K. S\u00f8nderby, T. Raiko, L. 
Maal\u00f8e, S. K. S\u00f8nderby, and O. Winther. Ladder variational autoencoders. In\n\nAdvances in Neural Information Processing Systems 29. 2016.\n\n[51] J. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial\n\nnetworks. arXiv preprint arXiv:1511.06390, 2015.\n\n[52] J. M. Tomczak and M. Welling. Improving variational auto-encoders using householder \ufb02ow. arXiv\n\npreprint arXiv:1611.09630, 2016.\n\n[53] D. Tran, R. Ranganath, and D. M. Blei. Variational Gaussian process. In Proceedings of the International\n\nConference on Learning Representations, 2016.\n\n[54] A. Vahdat, W. G. Macready, Z. Bian, A. Khoshaman, and E. Andriyash. DVAE++: discrete variational\nautoencoders with overlapping transformations. In Proceedings of the International Conference on Machine\nLearning, 2018.\n\n[55] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior,\nand K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.\n\n[56] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional\n\nimage generation with pixelcnn decoders. arXiv preprint arXiv:1606.05328, 2016.\n\n[57] A. van den Oord, K. Nal, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint\n\narXiv:1601.06759, 01 2016.\n\n[58] A. van den Oord and B. Schrauwen. Factoring variations in natural images with deep gaussian mixture\n\nmodels. In Advances in Neural Information Processing Systems, 2014.\n\n[59] S. Zhao, J. Song, and S. Ermon. Towards deeper understanding of variational autoencoding models. arXiv\n\npreprint arXiv:1702.08658, 2017.\n\n[60] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and\nmovies: Towards story-like visual explanations by watching movies and reading books. 
In Proceedings of\nthe IEEE international conference on computer vision, pages 19\u201327, 2015.\n\n12\n\n\f", "award": [], "sourceid": 3529, "authors": [{"given_name": "Lars", "family_name": "Maal\u00f8e", "institution": "Corti"}, {"given_name": "Marco", "family_name": "Fraccaro", "institution": "Unumed"}, {"given_name": "Valentin", "family_name": "Li\u00e9vin", "institution": "DTU"}, {"given_name": "Ole", "family_name": "Winther", "institution": "Technical University of Denmark"}]}*