{"title": "A Prior of a Googol Gaussians: a Tensor Ring Induced Prior for Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4102, "page_last": 4112, "abstract": "Generative models produce realistic objects in many domains, including text, image, video, and audio synthesis. Most popular models\u2014Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs)\u2014usually employ a standard Gaussian distribution as a prior. Previous works show that the richer family of prior distributions may help to avoid the mode collapse problem in GANs and to improve the evidence lower bound in VAEs. We propose a new family of prior distributions\u2014Tensor Ring Induced Prior (TRIP)\u2014that packs an exponential number of Gaussians into a high-dimensional lattice with a relatively small number of parameters. We show that these priors improve Fr\u00e9chet Inception Distance for GANs and Evidence Lower Bound for VAEs. We also study generative models with TRIP in the conditional generation setup with missing conditions. Altogether, we propose a novel plug-and-play framework for generative models that can be utilized in any GAN and VAE-like architectures.", "full_text": "A Prior of a Googol Gaussians: a Tensor Ring\n\nInduced Prior for Generative Models\n\nMaksim Kuznetsov1,\u2217 Daniil Polykovskiy1,\u2217\n\nDmitry Vetrov2\n\nAlexander Zhebrak1\n\n1Insilico Medicine 2National Research University Higher School of Economics\n\n{kuznetsov,daniil,zhebrak}@insilico.com\n\nvetrovd@yandex.ru\n\nAbstract\n\nGenerative models produce realistic objects in many domains, including text, im-\nage, video, and audio synthesis. Most popular models\u2014Generative Adversarial\nNetworks (GANs) and Variational Autoencoders (VAEs)\u2014usually employ a stan-\ndard Gaussian distribution as a prior. 
Previous works show that the richer family\nof prior distributions may help to avoid the mode collapse problem in GANs and\nto improve the evidence lower bound in VAEs. We propose a new family of prior\ndistributions\u2014Tensor Ring Induced Prior (TRIP)\u2014that packs an exponential num-\nber of Gaussians into a high-dimensional lattice with a relatively small number\nof parameters. We show that these priors improve Fr\u00e9chet Inception Distance for\nGANs and Evidence Lower Bound for VAEs. We also study generative models\nwith TRIP in the conditional generation setup with missing conditions. Altogether,\nwe propose a novel plug-and-play framework for generative models that can be\nutilized in any GAN and VAE-like architectures.\n\n1\n\nIntroduction\n\nModern generative models are widely applied to the generation of realistic and diverse images, text,\nand audio \ufb01les [1\u20135]. Generative Adversarial Networks (GAN) [6], Variational Autoencoders (VAE)\n[7], and their variations are the most commonly used neural generative models. Both architectures\nlearn a mapping from some prior distribution p(z)\u2014usually a standard Gaussian\u2014to the data\ndistribution p(x). Previous works showed that richer prior distributions might improve the generative\nmodels\u2014reduce mode collapse for GANs [8, 9] and obtain a tighter Evidence Lower Bound (ELBO)\nfor VAEs [10].\nIf the prior p(z) lies in a parametric family, we can learn the most suitable distribution for it during\ntraining.\nIn this work, we investigate Gaussian Mixture Models as prior distributions with an\nexponential number of Gaussians in nodes of a multidimensional lattice. In our experiments, we\nused a prior with more than a googol (10100) Gaussians. To handle such complex distributions,\nwe represent p(z) using a Tensor Ring decomposition [11]\u2014a method for approximating high-\ndimensional tensors with a relatively small number of parameters. 
We call this family of distributions\na Tensor Ring Induced Prior (TRIP). For this distribution, we can compute marginal and conditional\nprobabilities and sample from them ef\ufb01ciently.\nWe also extend TRIP to conditional generation, where a generative model p(x | y) produces new\nobjects x with speci\ufb01ed attributes y. With TRIP, we can produce new objects conditioned only on a\nsubset of attributes, leaving some labels unspeci\ufb01ed during both training and inference.\nOur main contributions are summarized as follows:\n\n\u2217equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) 2D Tensor Ring Induced Prior.\n\n(b) An example Tensor Ring decomposition.\n\nFigure 1: (a) The TRIP distribution is a multidimensional Gaussian Mixture Model with an exponen-\n\ntially large number of modes located on the lattice nodes. (b) To compute the value (cid:98)P [0, 2, 1], one\nshould multiply the highlighted matrices and compute the trace (cid:98)P [0, 2, 1] = Tr(Q1[0]\u00b7 Q2[2]\u00b7 Q3[1]).\n\nimproves quality on sparsely labeled datasets.\n\nuse it as a prior for generative models\u2014VAE, GAN, and its variations.\n\n\u2022 We introduce a family of distributions that we call a Tensor Ring Induced Prior (TRIP) and\n\u2022 We investigate an application of TRIP to conditional generation and show that this prior\n\u2022 We evaluate TRIP models on the generation of CelebA faces for both conditional and\nunconditional setups. For GANs, we show improvement in Fr\u00e9chet Inception Distance (FID)\nand improved ELBO for VAEs. For the conditional generation, we show lower rates of\ncondition violation compared to standard conditional models.\n\n2 Tensor Ring Induced Prior\n\nIn this section, we introduce a Tensor Ring-induced distribution for both discrete and continuous\nvariables. 
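As a concrete illustration of the evaluation in Figure 1(b), the following NumPy sketch computes one entry of the decomposed tensor and normalizes it into a probability. The core counts, ranks, and values below are arbitrary toy choices of ours, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three non-negative cores Q_k of shape (N_k, m_k, m_{k+1}), with m_4 = m_1 (the "ring").
N = (3, 4, 2)   # number of values each discrete variable r_k can take
m = (2, 3, 5)   # core sizes; the last core maps back to m_1, closing the ring
cores = [rng.random((N[k], m[k], m[(k + 1) % 3])) for k in range(3)]

# P_hat[0, 2, 1] = Tr(Q_1[0] @ Q_2[2] @ Q_3[1]), as in Figure 1(b).
p_021 = np.trace(cores[0][0] @ cores[1][2] @ cores[2][1])

# Summing the trace over all index combinations gives the normalizer Z,
# so p(r_1 = 0, r_2 = 2, r_3 = 1) = P_hat[0, 2, 1] / Z.
Z = sum(np.trace(cores[0][a] @ cores[1][b] @ cores[2][c])
        for a in range(N[0]) for b in range(N[1]) for c in range(N[2]))
prob = p_021 / Z
```

Because the cores are non-negative, every entry of the decomposed tensor is non-negative, and the normalized values form a valid distribution over the 3 × 4 × 2 lattice.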
We also de\ufb01ne a Tensor Ring Induced Prior (TRIP) family of distributions.\n\n2.1 Tensor Ring decomposition\n\nTensor Ring decomposition [11] represents large high-dimensional tensors (such as discrete distribu-\ntions) with a relatively small number of parameters. Consider a joint distribution p(r1, r2, . . . rd) of\nd discrete random variables rk taking values from {0, 1, . . . Nk \u2212 1}. We write these probabilities as\nelements of a d-dimensional tensor P [r1, r2, . . . rd] = p(r1, r2, . . . rd). For the brevity of notation,\nwe use r1:d for (r1, . . . , rd). The number of elements in this tensor grows exponentially with the\nnumber of dimensions d, and for only 50 binary variables the tensor contains 250 \u2248 1015 real\nnumbers. Tensor Ring decomposition reduces the number of parameters by approximating tensor P\nwith low-rank non-negative tensors cores Qk \u2208 RNk\u00d7mk\u00d7mk+1\n, where m1, . . . , md+1 are core sizes,\nand md+1 = m1:\n\n+\n\np(r1:d) \u221d (cid:98)P [r1:d] = Tr\n\n(cid:16) d(cid:89)\n\n(cid:17)\n\nQj[rj]\n\n(1)\n\nj=1\n\n+\n\nTo compute P [r1:d], for each random variable rk, we slice a tensor Qk along the \ufb01rst dimension\nand obtain a matrix Qk[rk] \u2208 Rmk\u00d7mk+1\n. We multiply these matrices for all random variables and\ncompute the trace of the resulting matrix to get a scalar (see Figure 1(b) for an example). In Tensor\nRing decomposition, the number of parameters grows linearly with the number of dimensions. With\nlarger core sizes mk, Tensor Ring decomposition can approximate more complex distributions. Note\nthat the order of the variables matters: Tensor Ring decomposition better captures dependencies\nbetween closer variables than between the distant ones.\nWith Tensor Ring decomposition, we can compute marginal distributions without computing the\n\nwhole tensor (cid:98)P [r1:d]. 
To marginalize out the random variable rk, we replace cores Qk in Eq 1 with\n\n2\n\n\fmatrix (cid:101)Qk =(cid:80)Nk\u22121\n\nrk=0 Qk[rk]:\n\np(r1:k\u22121, rk+1:d) \u221d (cid:98)P [r1:k\u22121, rk+1:d] = Tr\n\n(cid:32) k\u22121(cid:89)\n\nj=1\n\nQj[rj] \u00b7 (cid:101)Qk \u00b7\n\nd(cid:89)\n\nj=k+1\n\n(cid:33)\n\nQj[rj]\n\n(2)\n\nIn Supplementary Materials, we show an Algorithm for computing marginal distributions. We\ncan also compute conditionals as a ratio between the joint and marginal probabilities p(A | B) =\np(A, B)/p(B); we sample from conditional or marginal distributions using the chain rule.\n\n2.2 Continuous Distributions parameterized with Tensor Ring Decomposition\n\nIn this section, we apply the Tensor Ring decomposition to continuous distributions over vectors\nz = [z1, . . . , zd]. In our Learnable Prior model, we assume that each component of zk is a Gaussian\nMixture Model with Nk fully factorized components. The joint distribution p(z) is a multidimensional\nGaussian Mixture Model with modes placed in the nodes of a multidimensional lattice (Figure 1(a)).\nThe latent discrete variables s1, . . . , sd indicate the index of mixture component for each dimension\n(sk corresponds to the k-th dimension of the latent code zk):\n\n(cid:88)\n\np(s1:d)p(z1:d | s1:d) \u221d(cid:88)\n\n(cid:98)P [s1:d]\n\nd(cid:89)\n\np(z1:d) =\n\nN (zj | \u00b5sj\n\nj , \u03c3sj\nj )\n\n(3)\n\nHere, p(s) is a discrete distribution of prior probabilities of mixture components, which we store as a\n\ntensor (cid:98)P [s] in a Tensor Ring decomposition. Note that p(s) is not a factorized distribution, and the\n\ns1:d\n\ns1:d\n\nj=1\n\n(cid:110)\n\nQ1, . . . , Qd, \u00b50\n\nlearnable prior p(z) may learn complex weightings of the mixture components. We call the family of\ndistributions parameterized in this form a Tensor Ring Induced Prior (TRIP) and denote its learnable\nparameters (cores, means, and standard deviations) as \u03c8:\n1, . . . 
, \u00b5Nd\u22121\n\n(4)\nTo highlight that the prior distribution is learnable, we further write it as p\u03c8(z). As we show later, we\ncan optimize \u03c8 directly using gradient descent for VAE models and REINFORCE [12] for GANs.\nAn important property of the proposed TRIP family is that we can derive its one-dimensional\nconditional distributions in a closed form. For example, to sample using a chain rule, we need\ndistributions p\u03c8(zk | z1:k\u22121):\np\u03c8(zk | z1:k\u22121) =\n\np\u03c8(sk | z1:k\u22121)p\u03c8(zk | sk, z1:k\u22121)\n\n1, . . . , \u03c3Nd\u22121\n\nNk\u22121(cid:88)\n\n\u03c8 =\n\n(cid:111)\n\n.\n\n, \u03c30\n\nd\n\nd\n\nNk\u22121(cid:88)\n\n=\n\nsk=0\n\np\u03c8(sk | z1:k\u22121)p\u03c8(zk | sk) =\n\nNk\u22121(cid:88)\n\np\u03c8(sk | z1:k\u22121)N (zk | \u00b5sk\n\nk , \u03c3sk\nk )\n\n(5)\n\nsk=0\n\nsk=0\n\nFrom Equation 5 we notice that one-dimensional conditional distributions are Gaussian Mixture\nModels with the same means and variances as priors, but with different weights p\u03c8(sk | z1:k\u22121) (see\nSupplementary Materials).\nComputations for marginal probabilities in the general case are shown in Algorithm 1; conditional\nprobabilities can be computed as a ratio between the joint and marginal probabilities. Note that we\ncompute a normalizing constant on-the-\ufb02y.\n\n3 Generative Models With Tensor Ring Induced Prior\n\nIn this section, we describe how popular generative models\u2014Variational Autoencoders (VAEs) and\nGenerative Adversarial Networks (GANs)\u2014can bene\ufb01t from using Tensor Ring Induced Prior.\n\n3.1 Variational Autoencoder\n\nVariational Autoencoder (VAE) [7, 13] is an autoencoder-based generative model that maps data\npoints x onto a latent space with a probabilistic encoder q\u03c6(z | x) and reconstructs objects with a\nprobabilistic decoder p\u03b8(x | z). 
We used a Gaussian encoder with the reparameterization trick:\n\nq\u03c6(z | x) = N (z | \u00b5\u03c6(x), \u03c3\u03c6(x)) = N (\u0001 | 0, I) \u00b7 \u03c3\u03c6(x) + \u00b5\u03c6(x).\n\n(6)\n\n3\n\n\fAlgorithm 1 Calculation of marginal probabilities in TRIP\n\nInput: A set M of variable indices for which we compute the probability, and values of these\nlatent codes zi for i \u2208 M\nOutput: Joint probability log p(zM ), where zM = {zi \u2200i \u2208 M}\nInitialize Qbuff = I \u2208 Rm1\u00d7m1, Qnorm = I \u2208 Rm1\u00d7m1\nfor j = 1 to d do\n\nif j is marginalized out (j /\u2208 M) then\n\nQbuff = Qbuff \u00b7(cid:16)(cid:80)Nj\u22121\nQbuff = Qbuff \u00b7(cid:16)(cid:80)Nj\u22121\nQnorm = Qnorm \u00b7(cid:16)(cid:80)Nj\u22121\n\nend if\n\nelse\n\n(cid:17)\n(cid:17)\n\nk=0 Qj[k]\n\nk=0 Qj[k] \u00b7 N(cid:0)zk | \u00b5sj\n\nj , \u03c3sj\n\nj\n\n(cid:1)(cid:17)\n\nk=0 Qj[k]\n\nend for\nlog p(zM ) = log Tr (Qbuff) \u2212 log Tr (Qnorm)\n\nThe most common choice for a prior distribution p\u03c8(z) in the latent space is a standard Gaussian\ndistribution N (0, I). VAEs are trained by maximizing the lower bound of the log marginal likelihood\nlog p(x), also known as the Evidence Lower Bound (ELBO):\n\nL(\u03b8, \u03c6, \u03c8) = Eq\u03c6(z|x)log p\u03b8(x | z) \u2212 KL(cid:0)q\u03c6(z | x) || p\u03c8(z)(cid:1),\n\n(7)\nwhere KL is a Kullback-Leibler divergence. We get an unbiased estimate of L(\u03b8, \u03c6, \u03c8) by sampling\n\u0001i \u223c N (0, I) and computing a Monte Carlo estimate\n\nl(cid:88)\n\ni=1\n\n(cid:18) p\u03b8(x | zi)p\u03c8(zi)\n\n(cid:19)\n\nq\u03c6(zi | x)\n\nL(\u03b8, \u03c6, \u03c8) \u2248 1\nl\n\nlog\n\nzi = \u0001i \u00b7 \u03c3\u03c6(x) + \u00b5\u03c6(x)\n\n,\n\n(8)\n\nWhen p\u03c8(z) is a standard Gaussian, the KL term can be computed analytically, reducing the\nestimation variance.\nFor VAEs, \ufb02exible priors give tighter evidence lower bound [10, 14] and can help with a problem\nof the decoder ignoring the latent codes [14, 15]. 
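Algorithm 1 can be written compactly in NumPy. The sketch below is our own minimal reading of it (the function name, array layout, and the explicit Gaussian density are assumptions): cores[j] holds the core Q_j with shape (N_j, m_j, m_{j+1}), and z_obs maps observed dimension indices to their latent values, with every other dimension marginalized out.

```python
import numpy as np

def trip_marginal_logprob(cores, mus, sigmas, z_obs):
    """Sketch of Algorithm 1: log p(z_M) under a TRIP model.

    cores[j]          : non-negative array of shape (N_j, m_j, m_{j+1}) -- core Q_j
    mus[j], sigmas[j] : arrays of shape (N_j,) -- per-component Gaussian parameters
    z_obs             : dict {j: z_j} of observed coordinates; j not in z_obs
                        is marginalized out
    """
    m1 = cores[0].shape[1]
    q_buff = np.eye(m1)   # accumulates density-weighted core products (numerator)
    q_norm = np.eye(m1)   # accumulates plain component sums (normalizing constant)
    for j, core in enumerate(cores):
        summed = core.sum(axis=0)            # sum_k Q_j[k], marginalizing s_j
        q_norm = q_norm @ summed
        if j in z_obs:
            # Gaussian densities N(z_j | mu_j^k, sigma_j^k) for every component k
            dens = (np.exp(-0.5 * ((z_obs[j] - mus[j]) / sigmas[j]) ** 2)
                    / (sigmas[j] * np.sqrt(2.0 * np.pi)))
            q_buff = q_buff @ np.einsum("k,kab->ab", dens, core)
        else:
            q_buff = q_buff @ summed
    return np.log(np.trace(q_buff)) - np.log(np.trace(q_norm))
```

Conditional log-probabilities then follow as differences of marginals, e.g. log p(z_2 | z_0) = log p(z_0, z_2) − log p(z_0).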
In this work, we parameterize the learnable prior\np\u03c8(z) as a Tensor Ring Induced Prior model and train its parameters \u03c8 jointly with encoder and\ndecoder (Figure 2). We call this model a Variational Autoencoder with Tensor Ring Induced Prior\n(VAE-TRIP). We initialize the means and the variances by \ufb01tting 1D Gaussian Mixture models for\neach component using samples from the latent codes and initialize cores with a Gaussian noise. We\nthen re-initialize means, variances and cores after the \ufb01rst epoch, and repeat such procedure every 5\nepochs.\n\nFigure 2: A Variational Autoencoder with a Tensor Ring Induced Prior (VAE-TRIP).\n\n3.2 Generative Adversarial Networks\n\nGenerative Adversarial Networks (GANs) [6] consist of two networks: a generator G(z) and a\ndiscriminator D(x). The discriminator is trying to distinguish real objects from objects produced\nby a generator. The generator, on the other hand, is trying to produce objects that the discriminator\nconsiders real. The optimization setup for all models from the GAN family is a min-max problem.\nFor the standard GAN, the learning procedure alternates between optimizing the generator and the\n\n4\n\n\fFigure 3: A Generative Adversarial Network with a Tensor Ring Induced Prior (GAN-TRIP).\n\ndiscriminator networks with a gradient descent/ascent:\n\nmin\nG,\u03c8\n\nmax\n\nD\n\nLGAN = Ex\u223cp(x) log D(x) + Ez\u223cp\u03c8(z) log\n\n(cid:16)\n\n1 \u2212 D(cid:0)G(z)(cid:1)(cid:17)\n\n(9)\n\nSimilar to VAE, the prior distribution p\u03c8(z) is usually a standard Gaussian, although Gaussian Mixture\nModels were also previously studied [16]. In this work, we use a TRIP family of distributions to\nparameterize a multimodal prior of GANs (Figure 3). 
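Sampling a latent code from p\u03c8(z) uses the chain rule over dimensions (Eq. 5). The NumPy sketch below is our own illustration (names and shapes are assumptions): cores[k] holds Q_k with shape (N_k, m_k, m_{k+1}), and the mixture weights p(s_k | z_{1:k-1}) come from traces of density-weighted core products:

```python
import numpy as np

def trip_sample(cores, mus, sigmas, rng=None):
    """Draw one z ~ p_psi(z) by the chain rule (Eq. 5).

    At step k, p(z_k | z_{1:k-1}) is a 1D Gaussian mixture whose weights
    p(s_k | z_{1:k-1}) are traces of (prefix @ Q_k[s] @ suffix_k) products.
    """
    rng = rng or np.random.default_rng()
    d = len(cores)
    # suffix_k = product of component-summed cores k+1 .. d (marginalizes the tail)
    suffixes = [np.eye(cores[-1].shape[2])]
    for core in reversed(cores[1:]):
        suffixes.append(core.sum(axis=0) @ suffixes[-1])
    suffixes.reverse()
    prefix = np.eye(cores[0].shape[1])  # density-weighted cores of sampled dims
    z = np.empty(d)
    for k, core in enumerate(cores):
        # unnormalized weights w_s = Tr(prefix @ Q_k[s] @ suffix_k)
        w = np.einsum("ab,sbc,ca->s", prefix, core, suffixes[k])
        w = w / w.sum()
        s = rng.choice(len(w), p=w)
        z[k] = rng.normal(mus[k][s], sigmas[k][s])
        # fold z_k in: weight every component of Q_k by its density at z_k
        dens = (np.exp(-0.5 * ((z[k] - mus[k]) / sigmas[k]) ** 2)
                / (sigmas[k] * np.sqrt(2.0 * np.pi)))
        prefix = prefix @ np.einsum("s,sab->ab", dens, core)
    return z
```

With a single component per dimension the weights are trivially one and the sampler reduces to drawing each z_k from its Gaussian, which makes the sketch easy to sanity-check.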
We expect that having multiple modes as the\nprior improves the overall quality of generation and helps to avoid anomalies during sampling, such\nas partially present eyeglasses.\nDuring training, we sample multiple latent codes from the prior p\u03c8(z) and use REINFORCE [12]\nto propagate the gradient through the parameters \u03c8. We reduce the variance by using average\ndiscriminator output as a baseline:\n\nl(cid:88)\n\ni=1\n\n\u2207\u03c8 log p\u03c8(zi)\n\n\uf8ee\uf8f0di \u2212 1\n\nl\n\n\uf8f9\uf8fb ,\n\ndj\n\nl(cid:88)\n\nj=1\n\n(10)\n\n\u2207\u03c8LGAN \u2248 1\nl\n\n(cid:16)\n\n1\u2212 D(cid:0)G(z)(cid:1)(cid:17)\n\nis the discriminator\u2019s output and zi are samples from the prior p\u03c8(z).\nwhere di = log\nWe call this model a Generative Adversarial Network with Tensor Ring Induced Prior (GAN-TRIP).\nWe initialize means uniformly in a range [\u22121, 1] and standard deviations as 1/Nk.\n\n4 Conditional Generation\n\n(cid:88)\n\n(cid:101)P [s1:d, y]\n\nd(cid:89)\n\np(z, y) =\n\nIn conditional generation problem, data objects x (for example, face images) are coupled with\nproperties y describing the objects (for example, sex and hair color). The goal of this model is to\nlearn a distribution p(x | y) that produces objects with speci\ufb01ed attributes. Some of the attributes\ny for a given x may be unknown (yun), and the model should learn solely from observed attributes\n(yob): p(x | yob).\nFor VAE-TRIP, we train a joint model p\u03c8(z, y) on all attributes y and latent codes z parameterized\nwith a Tensor Ring. For discrete conditions, the joint distribution is:\n\nN (zd | \u00b5sd\n\nd , \u03c3sd\n\nd ),\n\n(11)\n\nwhere tensor (cid:101)P [s1:d, y] is represented in a Tensor Ring decomposition. 
In this work, we focus on\n\ns1:d\n\nj=1\n\ndiscrete attributes, although we can extend the model to continuous attributes with Gaussian Mixture\nModels as we did for the latent codes.\nWith the proposed parameterization, we can marginalize out missing attributes and compute condi-\ntional probabilities. We can ef\ufb01ciently compute both probabilities similar to Algorithm 1.\nFor conditional VAE model, the lower bound on log p(x, yob) is:\n\n(cid:101)L(\u03b8, \u03c6, \u03c8) = Eq\u03c6(z|x,yob) log p\u03b8(x, yob | z) \u2212 KL(cid:0)q\u03c6(z | x, yob) || p\u03c8(z)(cid:1).\n\n(12)\n\n5\n\n\fFigure 4: Visualization of the \ufb01rst two dimensions of the learned prior p\u03c8(z1, z2). Left: VAE-TRIP,\nRight: WGAN-GP-TRIP.\n\nWe simplify the lower bound by making two restrictions. First, we assume that the conditions y are\nfully de\ufb01ned by the object x, which implies q\u03c6(z | x, yob) = q\u03c6(z | x). For example, an image with\na person wearing a hat de\ufb01nes the presence of a hat. The second restriction is that we can reconstruct\nan object directly from its latent code: p\u03b8(x | z, yob) = p\u03b8(x | z). This restriction also gives:\n\np\u03b8(x, yob | z) = p\u03b8(x | z, yob)p\u03c8(yob | z) = p\u03b8(x | z)p\u03c8(yob | z).\n\n(13)\n\nThe resulting Evidence Lower Bound is\n\n(cid:101)L(\u03b8, \u03c6, \u03c8) = Eq\u03c6(z|x)\n\n(cid:2) log p\u03b8(x | z) + log p\u03c8(yob | z)(cid:3) \u2212 KL(cid:0)q\u03c6(z | x) || p\u03c8(z)(cid:1).\n\n(14)\nIn the proposed model, an autoencoder learns to map objects onto a latent manifolds, while TRIP\nprior log p\u03c8(z | yob) \ufb01nds areas on the manifold corresponding to objects with the speci\ufb01ed attributes.\nThe quality of the model depends on the order of the latent codes and the conditions in p\u03c8(z, y), since\nthe Tensor Ring poorly captures dependence between variables that are far apart. 
In our experiments, we found that randomly permuting latent codes and conditions gives good results.\nWe can train the proposed model on partially labeled datasets and use it to draw conditional samples with partially specified constraints. For example, we can ask the model to generate images of men in hats without specifying hair color or the presence of glasses.\n\n5 Related Work\n\nThe most common generative models are based on Generative Adversarial Networks [6] or Variational Autoencoders [7]. Both GAN and VAE models usually use continuous unimodal distributions (such as a standard Gaussian) as a prior. The space of natural images, however, is multimodal: a person either wears glasses or not\u2014there are no intermediate states. Although generative models are flexible enough to transform unimodal distributions into multimodal ones, they tend to ignore some modes (mode collapse) or produce images with artifacts (half-present glasses).\nSeveral models with learnable prior distributions have been proposed. Tomczak and Welling [10] used a Gaussian mixture model based on encoder proposals as a prior on the latent space of a VAE. Chen et al. [14] and Rezende and Mohamed [17] applied normalizing flows [18\u201320] to transform a standard normal prior into a more complex latent distribution. Chen et al. [14] and van den Oord et al. [15] applied auto-regressive models to learn a better prior distribution over the latent variables. Bauer and Mnih [21] proposed to update the prior distribution of a trained VAE to avoid samples that have low marginal posterior but high prior probability.\nSimilar to Tensor Ring decomposition, the Tensor-Train decomposition [22] is used in machine learning and numerical methods to represent tensors with a small number of parameters. Tensor-Train has been applied to the compression of fully connected [23], convolutional [24], and recurrent [25] layers. 
In our models, we can use a Tensor-Train decomposition instead of a Tensor Ring, but it requires larger cores to achieve comparable results, as the first and last dimensions are farther apart.\nMost conditional models handle missing values by imputing them with a predictive model or by setting them to a special value. With this approach, we cannot sample objects with partially specified conditions. The VAE TELBO model [26] proposes to train a Product-of-Experts-based model in which the posterior on the latent codes is approximated as p\u03c8(z | yob) = \u220f_{yi \u2208 yob} p\u03c8(z | yi), requiring a separate posterior model for each condition. The JMVAE model [27] contains three encoders that take both an image and a condition, only a condition, or only an image.\n\n6 Experiments\n\nWe conducted experiments on the CelebFaces Attributes Dataset (CelebA) [28] of approximately 400,000 photos with a random train-test split. For conditional generation, we selected 14 binary image attributes, including sex, hair color, and the presence of a mustache or a beard. We compared both GAN and VAE models with and without TRIP. We also compared our best model with known approaches on the CIFAR-10 dataset [29] with a standard split. Model architecture and training details are provided in the Supplementary Materials.\n\n6.1 Generating Objects With VAE-TRIP and GAN-TRIP\n\nTable 1: FID for GAN and VAE-based architectures trained on the CelebA dataset, and ELBO for VAE. F = Fixed, L = Learnable. 
We also report the ELBO for an importance-weighted autoencoder with k = 100 points [30].\n\nMETRIC               MODEL     N(0, I)   GMM (F)   GMM (L)   TRIP (F)  TRIP (L)\nFID                  VAE         86.72     85.64     84.48     85.31     83.54\nFID                  WGAN        63.46     67.10     61.82     62.48     57.60\nFID                  WGAN-GP     54.71     57.82     62.10     63.06     52.86\nELBO                 VAE       -194.16   -201.60   -193.88   -202.04   -193.32\nIWAE ELBO (k = 100)  VAE       -185.09   -191.99   -184.73   -190.09   -184.43\n\nWe evaluate GAN-based models with and without the Tensor Ring Induced Prior by measuring the Fr\u00e9chet Inception Distance (FID). For the baseline models, we used the Wasserstein GAN (WGAN) [31] and the Wasserstein GAN with Gradient Penalty (WGAN-GP) [32] on the CelebA dataset. We also compared learnable priors with fixed, randomly initialized parameters \u03c8. The results in Table 1 (CelebA) and Table 2 (CIFAR-10) suggest that a TRIP prior improves quality over standard models and models with GMM priors. In some experiments, the GMM-based model performed worse than a standard Gaussian, since the KL term had to be estimated with Monte Carlo sampling, resulting in higher gradient variance.\n\nTable 2: FID for CIFAR-10 GAN-based models\n\nModel                           FID\nSN-GANs [33]                    21.7\nWGAN-GP + Two Time-Scale [34]   24.8\nWGAN-GP [32]                    29.3\nWGAN-GP-TRIP (ours)             16.72\n\n6.2 Visualization of TRIP\n\nIn Figure 4, we visualize the first two dimensions of the learned prior p\u03c8(z1, z2) in the VAE-TRIP and WGAN-GP-TRIP models. Both models use most of the mixture components to produce a complex distribution. Note also that the components learned different, non-uniform weights.\n\n6.3 Generated Images\n\nHere, we visualize the correspondence between modes and generated images with a procedure that we call mode hopping. We start by randomly sampling a latent code and producing the first image. After that, we randomly select five dimensions and sample them conditioned on the remaining dimensions. We repeat this procedure multiple times and obtain the sequence of sampled images shown in Figure 5. These results show that similar images are localized in the learned prior space, and that changing a few dimensions alters only a few fine-grained features.\n\nFigure 5: Mode hopping in WGAN-GP-TRIP. We start with a random sample from the prior and conditionally sample five random dimensions on each iteration. Each line shows a single trajectory.\n\n6.4 Generated Conditional Images\n\nIn this experiment, we generate images given a subset of attributes to estimate the diversity of the generated images. For example, if we specify \u2018Young man,\u2019 we would expect different images to have different hair colors and the presence or absence of glasses or a hat. The generated images in Table 3 indicate that the model learned to produce diverse images with multiple varying attributes.\n\nTable 3: Generated images with VAE-TRIP for different attributes.\n\nYoung man\nSmiling woman in eyeglasses\nSmiling woman with a hat\nBlond woman with eyeglasses\n\n7 Discussion\n\nWe designed our prior using Tensor Ring decomposition due to its higher representation capacity compared to other decompositions. For example, a Tensor Ring with core size m has the same capacity as a Tensor-Train with core size m\u00b2 [35]. Although the prior contains an exponential number of modes, in our experiments its learnable parameters accounted for less than 10% of the total weights, which did not cause overfitting. The results can be improved by increasing the core size m; however, the computational complexity grows cubically with the core size. We also implemented a conditional GAN but found the REINFORCE-based training of this model very unstable. Further research on variance reduction techniques might improve this approach.\n\n8 Acknowledgements\n\nImage generation for Section 6.3 was supported by the Russian Science Foundation grant no. 
17-71-\n20072.\n\nReferences\n[1] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of GANs for\nImproved Quality, Stability, and Variation. International Conference on Learning Representa-\ntions, 2018.\n\n[2] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan \u00d6mer Arik, Ajay Kannan, Sharan Narang,\nJonathan Raiman, and John Miller. Deep Voice 3: Scaling Text-to-Speech with Convolutional\nSequence Learning. International Conference on Learning Representations, 2018.\n\n[3] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray\nKavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Nor-\nman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner,\nHeiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel\nWaveNet: Fast High-Fidelity Speech Synthesis. 2018.\n\n[4] Daniil Polykovskiy, Alexander Zhebrak, Dmitry Vetrov, Yan Ivanenkov, Vladimir Aladinskiy,\nPolina Mamoshina, Marine Bozdaganyan, Alexander Aliper, Alex Zhavoronkov, and Artur\nKadurin. Entangled conditional adversarial autoencoder for de novo drug discovery. Mol.\nPharm., September 2018.\n\n[5] Alex Zhavoronkov, Yan A Ivanenkov, Alex Aliper, Mark S Veselov, Vladimir A Aladinskiy,\nAnastasiya V Aladinskaya, Victor A Terentiev, Daniil A Polykovskiy, Maksim D Kuznetsov,\nArip Asadulaev, et al. Deep learning enables rapid identi\ufb01cation of potent ddr1 kinase inhibitors.\nNature biotechnology, pages 1\u20134, 2019.\n\n[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Advances in neural\ninformation processing systems, pages 2672\u20132680, 2014.\n\n[7] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes.\n\nConference on Learning Representations, 2013.\n\nInternational\n\n[8] Matan Ben-Yosef and Daphna Weinshall. 
Gaussian Mixture Generative Adversarial Net-\nworks for Diverse Datasets, and the Unsupervised Clustering of Images. arXiv preprint\narXiv:1808.10356, 2018.\n\n[9] Lili Pan, Shen Cheng, Jian Liu, Yazhou Ren, and Zenglin Xu. Latent dirichlet allocation in\n\ngenerative adversarial networks. arXiv preprint arXiv:1812.06571, 2018.\n\n[10] Jakub M Tomczak and Max Welling. VAE with a VampPrior. International Conference on\n\nArti\ufb01cial Intelligence and Statistics, 2018.\n\n[11] Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, and Andrzej Cichocki. Tensor Ring\n\nDecomposition. arXiv preprint arXiv:1606.05535, 2016.\n\n[12] Ronald J Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Rein-\n\nforcement Learning. Machine learning, 8(3-4):229\u2013256, 1992.\n\n[13] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation\nand Approximate Inference in Deep Generative Models. International Conference on Machine\nLearning, 2014.\n\n[14] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman,\nIlya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. International Conference on\nLearning Representations, 2017.\n\n9\n\n\f[15] Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural Discrete Representation\n\nLearning. Advances in Neural Information Processing Systems, 2017.\n\n[16] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. DeLiGAN:\nGenerative Adversarial Networks for Diverse and Limited Data. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, pages 166\u2013174, 2017.\n\n[17] Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows.\n\nInternational Conference on Machine Learning, 2015.\n\n[18] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components\n\nEstimation. 
International Conference on Learning Representations Workshop, 2015.\n\n[19] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max\nWelling. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in\nNeural Information Processing Systems, pages 4743\u20134751, 2016.\n\n[20] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation Using Real NVP.\n\nInternational Conference on Learning Representations, 2017.\n\n[21] Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. 89:66\u201375,\n\n16\u201318 Apr 2019. URL http://proceedings.mlr.press/v89/bauer19a.html.\n\n[22] Ivan V Oseledets. Tensor-Train Decomposition. SIAM Journal on Scienti\ufb01c Computing, 33(5):\n\n2295\u20132317, 2011.\n\n[23] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing\n\nNeural Networks. Advances in Neural Information Processing Systems, 2015.\n\n[24] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate Ten-\nsorization: Compressing Convolutional and FC Layers Alike. Advances in Neural Information\nProcessing Systems, 2016.\n\n[25] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing Recurrent Neural Network\n\nwith Tensor Train. International Joint Conference on Neural Networks, 2017.\n\n[26] Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative Models\nof Visually Grounded Imagination. International Conference on Learning Representations,\n2018.\n\n[27] Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint Multimodal Learning with Deep\nGenerative Models. International Conference on Learning Representations Workshop, 2017.\n\n[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the\n\nWild. In Proceedings of International Conference on Computer Vision (ICCV), 12 2015.\n\n[29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 
Cifar-10 (canadian institute for advanced\n\nresearch). URL http://www.cs.toronto.edu/~kriz/cifar.html.\n\n[30] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders.\n\narXiv preprint arXiv:1509.00519, 2015.\n\n[31] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein Generative Adversarial\n\nNetworks. International Conference on Machine Learning, 2017.\n\n[32] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.\nImproved Training of Wasserstein Gans. In Advances in Neural Information Processing Systems,\npages 5767\u20135777, 2017.\n\n[33] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization\nfor generative adversarial networks. International Conference on Learning Representations,\n2018.\n\n10\n\n\f[34] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.\nGans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances\nin Neural Information Processing Systems, pages 6629\u20136640, 2017.\n\n[35] Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep learning and quantum\nentanglement: Fundamental connections with implications to network design. International\nConference on Learning Representations, 2018.\n\n11\n\n\f", "award": [], "sourceid": 2260, "authors": [{"given_name": "Maxim", "family_name": "Kuznetsov", "institution": "Insilico Medicine"}, {"given_name": "Daniil", "family_name": "Polykovskiy", "institution": "Insilico Medicine"}, {"given_name": "Dmitry", "family_name": "Vetrov", "institution": "Higher School of Economics, Samsung AI Center, Moscow"}, {"given_name": "Alex", "family_name": "Zhebrak", "institution": "Insilico Medicine"}]}