{"title": "DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 1864, "page_last": 1874, "abstract": "Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors. An implementation which reproduces these results is available.", "full_text": "DVAE#: Discrete Variational Autoencoders with\n\nRelaxed Boltzmann Priors\n\nArash Vahdat\u2217, Evgeny Andriyash\u2217, William G. Macready\n\n{arash,evgeny,bill}@quadrant.ai\n\nQuadrant.ai, D-Wave Systems Inc.\n\nBurnaby, BC, Canada\n\nAbstract\n\nBoltzmann machines are powerful distributions that have been shown to be an\neffective prior over binary latent variables in variational autoencoders (VAEs).\nHowever, previous methods for training discrete VAEs have used the evidence lower\nbound and not the tighter importance-weighted bound. We propose two approaches\nfor relaxing Boltzmann machines to continuous distributions that permit training\nwith importance-weighted bounds. These relaxations are based on generalized\noverlapping transformations and the Gaussian integral trick. Experiments on the\nMNIST and OMNIGLOT datasets show that these relaxations outperform previous\ndiscrete VAEs with Boltzmann priors. 
An implementation which reproduces these results is available at https://github.com/QuadrantAI/dvae.\n\n1 Introduction\n\nAdvances in amortized variational inference [1, 2, 3, 4] have enabled novel learning methods [4, 5, 6] and extended generative learning into complex domains such as molecule design [7, 8], music [9] and program [10] generation. These advances have been made using continuous latent variable models in spite of the computational efficiency and greater interpretability offered by discrete latent variables. Further, models such as clustering, semi-supervised learning, and variational memory addressing [11] all require discrete variables, which makes the training of discrete models an important challenge.\n\nPrior to the deep learning era, Boltzmann machines were widely used for learning with discrete latent variables. These powerful multivariate binary distributions can represent any distribution defined on a set of binary random variables [12], and have seen application in unsupervised learning [13], supervised learning [14, 15], reinforcement learning [16], dimensionality reduction [17], and collaborative filtering [18]. Recently, Boltzmann machines have been used as priors for variational autoencoders (VAEs) in the discrete variational autoencoder (DVAE) [19] and its successor DVAE++ [20]. It has been demonstrated that these VAE models can capture discrete aspects of data. However, both these models assume a particular variational bound, and tighter bounds such as the importance weighted (IW) bound [21] cannot be used for training.\n\nWe remove this constraint by introducing two continuous relaxations that convert a Boltzmann machine to a distribution over continuous random variables. These relaxations are based on overlapping transformations introduced in [20] and the Gaussian integral trick [22] (known as the Hubbard-Stratonovich transform [23] in physics). 
Our relaxations are made tunably sharp by using an inverse temperature parameter. VAEs with relaxed Boltzmann priors can be trained using standard techniques developed for continuous latent variable models. In this work, we train discrete VAEs using the same IW bound on the log-likelihood that has been shown to improve importance weighted autoencoders (IWAEs) [21].\n\n\u2217Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThis paper makes two contributions: i) We introduce two continuous relaxations of Boltzmann machines and use these relaxations to train a discrete VAE with a Boltzmann prior using the IW bound. ii) We generalize the overlapping transformations of [20] to any pair of distributions with computable probability density function (PDF) and cumulative distribution function (CDF). Using these more general overlapping transformations, we propose new smoothing transformations using mixtures of Gaussian and power-function [24] distributions. Power-function overlapping transformations provide lower-variance gradient estimates and improved test set log-likelihoods when the inverse temperature is large. We name our framework DVAE# because the best results are obtained when the power-function transformations are sharp.2\n\n1.1 Related Work\n\nPrevious work on training discrete latent variable models can be grouped into five main categories:\n\ni) Exhaustive approaches marginalize all discrete variables [25, 26] and are not scalable to more than a few discrete variables.\n\nii) Local expectation gradients [27] and reparameterization and marginalization [28] estimators compute low-variance estimates at the cost of multiple function evaluations per gradient. 
These approaches can be applied to problems with a moderate number of latent variables.\n\niii) Relaxed computation of discrete densities [29] replaces discrete variables with continuous relaxations for gradient computation. A variation of this approach, known as the straight-through technique, sets the gradient of binary variables to the gradient of their mean [30, 31].\n\niv) Continuous relaxations of discrete distributions [32] replace discrete distributions with continuous ones and optimize a consistent objective. This method cannot be applied directly to Boltzmann distributions. The DVAE [19] solves this problem by pairing each binary variable with an auxiliary continuous variable. This approach is described in Sec. 2.\n\nv) The REINFORCE estimator [33] (also known as the likelihood ratio [34] or score-function estimator) replaces the gradient of an expectation with the expectation of the gradient of the score function. This estimator has high variance, but many increasingly sophisticated methods provide lower-variance estimators. NVIL [3] uses an input-dependent baseline, and MuProp [35] uses a first-order Taylor approximation along with an input-dependent baseline to reduce noise. VIMCO [36] trains an IWAE with binary latent variables and uses a leave-one-out scheme to define the baseline for each sample. REBAR [37] and its generalization RELAX [38] use the reparameterization of continuous distributions to define baselines.\n\nThe method proposed here is of type iv) and differs from [19, 20] in the way that binary latent variables are marginalized. The resultant relaxed distribution allows for DVAE training with a tighter bound. Moreover, our proposal encompasses a wider variety of smoothing methods, and one of these empirically provides lower-variance gradient estimates.\n\n2 Background\n\nLet x represent observed random variables and \u03b6 continuous latent variables. 
We seek a generative model p(x, \u03b6) = p(\u03b6)p(x|\u03b6) where p(\u03b6) denotes the prior distribution and p(x|\u03b6) is a probabilistic decoder. In the VAE [1], training maximizes a variational lower bound on the marginal log-likelihood:\n\n  log p(x) \u2265 E_{q(\u03b6|x)}[log p(x|\u03b6)] \u2212 KL(q(\u03b6|x)||p(\u03b6)).\n\nA probabilistic encoder q(\u03b6|x) approximates the posterior over latent variables. For continuous \u03b6, the bound is maximized using the reparameterization trick. With reparameterization, expectations with respect to q(\u03b6|x) are replaced by expectations against a base distribution and a differentiable function that maps samples from the base distribution to q(\u03b6|x). This can always be accomplished when q(\u03b6|x) has an analytic inverse cumulative distribution function (CDF) by mapping uniform samples through the inverse CDF. However, reparameterization cannot be applied to binary latent variables because the CDF is not differentiable.\n\n2And not because our model is proposed after DVAE and DVAE++.\n\nThe DVAE [19] resolves this issue by pairing each binary latent variable with a continuous counterpart. Denoting a binary vector of length D by z \u2208 {0, 1}^D, the Boltzmann prior is p(z) = e^{\u2212E_\u03b8(z)}/Z_\u03b8 where E_\u03b8(z) = \u2212a^T z \u2212 (1/2) z^T W z is an energy function with parameters \u03b8 \u2261 {W, a} and partition function Z_\u03b8. The joint model over discrete and continuous variables is p(x, z, \u03b6) = p(z)r(\u03b6|z)p(x|\u03b6) where r(\u03b6|z) = \u220f_i r(\u03b6_i|z_i) is a smoothing transformation that maps each discrete z_i to its continuous analogue \u03b6_i.\n\nDVAE [19] and DVAE++ [20] differ in the type of smoothing transformation r(\u03b6|z): [19] uses the spike-and-exponential transformation (Eq. (1) left), while [20] uses two overlapping exponential distributions (Eq. (1) right). Here, \u03b4(\u03b6) is the (one-sided) Dirac delta distribution, \u03b6 \u2208 [0, 1], and Z_\u03b2 is the normalization constant:\n\n  r(\u03b6|z) = { \u03b4(\u03b6) if z = 0; e^{\u03b2(\u03b6\u22121)}/Z_\u03b2 otherwise },   r(\u03b6|z) = { e^{\u2212\u03b2\u03b6}/Z_\u03b2 if z = 0; e^{\u03b2(\u03b6\u22121)}/Z_\u03b2 otherwise }.   (1)\n\nThe variational bound for a factorial approximation to the posterior, where q(\u03b6|x) = \u220f_i q(\u03b6_i|x) and q(z|x) = \u220f_i q(z_i|x), is derived in [20] as\n\n  log p(x) \u2265 E_{q(\u03b6|x)}[log p(x|\u03b6)] + H(q(z|x)) + E_{q(\u03b6|x)}[E_{q(z|x,\u03b6)} log p(z)].   (2)\n\nHere q(\u03b6_i|x) = \u2211_{z_i} q(z_i|x) r(\u03b6_i|z_i) is a mixture distribution combining r(\u03b6_i|z_i = 0) and r(\u03b6_i|z_i = 1) with weights q(z_i|x). The probability of binary units conditioned on \u03b6, q(z|x, \u03b6) = \u220f_i q(z_i|x, \u03b6_i), can be computed analytically. H(q(z|x)) is the entropy of q(z|x). The second and third terms in Eq. (2) have analytic solutions (up to the log normalization constant) that can be differentiated easily with an automatic differentiation (AD) library. The expectation over q(\u03b6|x) is approximated with reparameterized sampling.\n\nWe extend [19, 20] to tighten the bound of Eq. (2) by importance weighting [21, 39]. These tighter bounds have been shown to improve VAEs. For continuous latent variables, the K-sample IW bound is\n\n  log p(x) \u2265 L_K(x) = E_{\u03b6^{(k)} \u223c q(\u03b6|x)}[ log (1/K) \u2211_{k=1}^{K} p(\u03b6^{(k)}) p(x|\u03b6^{(k)}) / q(\u03b6^{(k)}|x) ].   (3)\n\nThe tightness of the IW bound improves as K increases [21].\n\n3 Model\n\nWe introduce two relaxations of Boltzmann machines to define the continuous prior distribution p(\u03b6) in the IW bound of Eq. (3). These relaxations rely on either overlapping transformations (Sec. 3.1) or the Gaussian integral trick (Sec. 3.2). Sec. 3.3 then generalizes the class of overlapping transformations that can be used in the approximate posterior q(\u03b6|x).\n\n3.1 Overlapping Relaxations\n\nWe obtain a continuous relaxation of p(z) through the marginal p(\u03b6) = \u2211_z p(z)r(\u03b6|z), where r(\u03b6|z) is an overlapping smoothing transformation [20] that operates on each component of z and \u03b6 independently; i.e., r(\u03b6|z) = \u220f_i r(\u03b6_i|z_i). Overlapping transformations such as the mixture of exponentials in Eq. (1) may be used for r(\u03b6|z). These transformations are equipped with an inverse temperature hyperparameter \u03b2 to control the sharpness of the smoothing transformation. 
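Concretely, drawing from this relaxation only requires ancestral sampling: draw z from the Boltzmann prior, then \u03b6 from r(\u03b6|z). The following minimal sketch (ours, not from the paper) does this for a toy 2-variable Boltzmann machine with the mixture-of-exponentials smoothing of Eq. (1); all names are illustrative:

```python
import numpy as np

def sample_relaxation(a, W, beta, n, rng):
    """Ancestral sampling from p(zeta) = sum_z p(z) r(zeta|z) with the
    mixture-of-exponentials smoothing (Eq. (1), right)."""
    D = len(a)
    zs = np.array([[(i >> d) & 1 for d in range(D)] for i in range(2 ** D)])
    # Boltzmann prior p(z) ~ exp(a^T z + 0.5 z^T W z), enumerated (toy D only)
    logits = zs @ a + 0.5 * np.einsum('nd,de,ne->n', zs, W, zs)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    z = zs[rng.choice(len(zs), size=n, p=p)]            # z ~ p(z)
    rho = rng.uniform(size=z.shape)
    # inverse CDF of e^{-beta*zeta}/Z_beta truncated to [0, 1]  (z = 0 branch)
    zeta0 = -np.log1p(-rho * (1.0 - np.exp(-beta))) / beta
    return np.where(z == 1, 1.0 - zeta0, zeta0)         # z = 1 branch is mirrored

rng = np.random.default_rng(0)
zeta = sample_relaxation(np.array([0.5, -0.3]), np.array([[0.0, 1.0], [1.0, 0.0]]),
                         beta=50.0, n=2000, rng=rng)
# for large beta, samples concentrate near the hypercube vertices {0,1}^2
print(np.abs(zeta - np.round(zeta)).mean())             # roughly 1/beta
```

For the D used in real models p(z) cannot be enumerated like this; the sketch is only meant to make the relaxation, and its sharpening with \u03b2, concrete.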
As \u03b2 \u2192 \u221e, r(\u03b6|z) approaches \u03b4(\u03b6 \u2212 z), and p(\u03b6) = \u2211_z p(z)\u03b4(\u03b6 \u2212 z) becomes a mixture of 2^D delta-function distributions centered on the vertices of the hypercube in R^D. At finite \u03b2, p(\u03b6) provides a continuous relaxation of the Boltzmann machine.\n\nTo train an IWAE using Eq. (3) with p(\u03b6) as a prior, we must compute log p(\u03b6) and its gradient with respect to the parameters of the Boltzmann distribution and the approximate posterior. This computation involves marginalization over z, which is generally intractable. However, we show that this marginalization can be approximated accurately using a mean-field model.\n\n3.1.1 Computing log p(\u03b6) and its Gradient for Overlapping Relaxations\n\nSince overlapping transformations are factorial, the log marginal distribution of \u03b6 is\n\n  log p(\u03b6) = log( \u2211_z p(z)r(\u03b6|z) ) = log( \u2211_z e^{\u2212E_\u03b8(z) + b_\u03b2(\u03b6)^T z + c_\u03b2(\u03b6)} ) \u2212 log Z_\u03b8,   (4)\n\nwhere b_i^\u03b2(\u03b6) = log r(\u03b6_i|z_i = 1) \u2212 log r(\u03b6_i|z_i = 0) and c_i^\u03b2(\u03b6) = log r(\u03b6_i|z_i = 0). For the mixture-of-exponentials smoothing, b_i^\u03b2(\u03b6) = \u03b2(2\u03b6_i \u2212 1) and c_i^\u03b2(\u03b6) = \u2212\u03b2\u03b6_i \u2212 log Z_\u03b2.\n\nThe first term in Eq. (4) is the log partition function of the Boltzmann machine \u02c6p(z) with augmented energy function \u02c6E^\u03b2_{\u03b8,\u03b6}(z) := E_\u03b8(z) \u2212 b_\u03b2(\u03b6)^T z \u2212 c_\u03b2(\u03b6). Estimating the log partition function accurately can be expensive, particularly because it has to be done for each \u03b6. 
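The closed forms b_i^\u03b2(\u03b6) = \u03b2(2\u03b6_i \u2212 1) and c_i^\u03b2(\u03b6) = \u2212\u03b2\u03b6_i \u2212 log Z_\u03b2 quoted above can be checked directly against the log-densities of Eq. (1); a short sketch of our own (values are illustrative, not from the paper):

```python
import numpy as np

beta = 4.0
# Z_beta for the exponential density truncated to [0, 1]
logZ = np.log((1.0 - np.exp(-beta)) / beta)

def log_r(zeta, z):
    """log r(zeta|z) for the mixture-of-exponentials smoothing of Eq. (1)."""
    return (beta * (zeta - 1.0) if z == 1 else -beta * zeta) - logZ

zeta = 0.37
# b(zeta) = log r(zeta|1) - log r(zeta|0) and c(zeta) = log r(zeta|0), as in Eq. (4)
b = log_r(zeta, 1) - log_r(zeta, 0)
c = log_r(zeta, 0)
print(np.isclose(b, beta * (2.0 * zeta - 1.0)))   # matches beta*(2*zeta - 1)
print(np.isclose(c, -beta * zeta - logZ))         # matches -beta*zeta - log Z_beta
```

The log Z_\u03b2 terms cancel in b, which is why the bias grows linearly in \u03b2 while c only shifts the constant c_\u03b2(\u03b6) in the augmented energy.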
However, we note that each \u03b6_i comes from a bimodal distribution centered at zero and one, and that the bias b_\u03b2(\u03b6) is usually large for most components i (particularly for large \u03b2). In this case, mean field is likely to provide a good approximation of \u02c6p(z), a fact we demonstrate empirically in Sec. 4.\n\nTo compute log p(\u03b6) and its gradient, we first fit a mean-field distribution m(z) = \u220f_i m_i(z_i) by minimizing KL(m(z)||\u02c6p(z)) [40]. The gradient of log p(\u03b6) with respect to \u03b2, \u03b8 or \u03b6 is:\n\n  \u2207 log p(\u03b6) = \u2212E_{z\u223c\u02c6p(z)}[\u2207\u02c6E^\u03b2_{\u03b8,\u03b6}(z)] + E_{z\u223cp(z)}[\u2207E_\u03b8(z)]\n              \u2248 \u2212E_{z\u223cm(z)}[\u2207\u02c6E^\u03b2_{\u03b8,\u03b6}(z)] + E_{z\u223cp(z)}[\u2207E_\u03b8(z)]\n              = \u2212\u2207\u02c6E^\u03b2_{\u03b8,\u03b6}(m) + E_{z\u223cp(z)}[\u2207E_\u03b8(z)],   (5)\n\nwhere m^T = [m_1(z_1 = 1) \u00b7\u00b7\u00b7 m_D(z_D = 1)] is the mean-field solution and the gradient does not act on m. The first term in Eq. (5) is the result of computing the average energy under a factorial distribution.3 The second expectation corresponds to the negative phase in training Boltzmann machines and is approximated by Monte Carlo sampling from p(z).\n\nTo compute the importance weights for the IW bound of Eq. (3), we must compute the value of log p(\u03b6) up to the normalization; i.e., the first term in Eq. (4). Assuming that KL(m(z)||\u02c6p(z)) \u2248 0 and using\n\n  KL(m(z)||\u02c6p(z)) = \u02c6E^\u03b2_{\u03b8,\u03b6}(m) + log( \u2211_z e^{\u2212\u02c6E^\u03b2_{\u03b8,\u03b6}(z)} ) \u2212 H(m(z)),   (6)\n\nthe first term of Eq. (4) is approximated as H(m(z)) \u2212 \u02c6E^\u03b2_{\u03b8,\u03b6}(m).\n\n3.2 The Gaussian Integral Trick\n\nThe computational complexity of log p(\u03b6) arises from the pairwise interactions z^T W z present in E_\u03b8(z). Instead of applying mean field, we remove these interactions using the Gaussian integral trick [41]. This is achieved by defining the Gaussian smoothing\n\n  r(\u03b6|z) = N(\u03b6 | A(W + \u03b2I)z, A(W + \u03b2I)A^T)\n\nfor an invertible matrix A and a diagonal matrix \u03b2I with \u03b2 > 0. Here, \u03b2 must be large enough so that W + \u03b2I is positive definite. Common choices for A include A = I or A = \u039b^{\u22121/2} V^T, where V\u039bV^T is the eigendecomposition of W + \u03b2I [41]. However, neither of these choices places the modes of p(\u03b6) on the vertices of the hypercube in R^D. Instead, we take A = (W + \u03b2I)^{\u22121}, giving the smoothing transformation r(\u03b6|z) = N(\u03b6 | z, (W + \u03b2I)^{\u22121}). The joint density is then\n\n  p(z, \u03b6) \u221d e^{\u2212(1/2)\u03b6^T(W + \u03b2I)\u03b6 + z^T(W + \u03b2I)\u03b6 + (a \u2212 (1/2)\u03b21)^T z},\n\nwhere 1 is the D-vector of all ones. 
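The disappearance of the pairwise z^T W z coupling under this choice of smoothing can be verified numerically for small D: the log joint density minus the multilinear exponent above is constant in z. A self-contained sketch of our own (random illustrative values):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
D, beta = 3, 5.0
W = rng.normal(size=(D, D)); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
a = rng.normal(size=D)
P = W + beta * np.eye(D)                    # precision matrix of the smoothing
zeta = rng.normal(size=D)

def log_joint(z):
    """log p(z, zeta) up to Z_theta: -E_theta(z) + Gaussian exponent (const dropped)."""
    energy = -(a @ z) - 0.5 * z @ W @ z
    diff = zeta - z
    return -energy - 0.5 * diff @ P @ diff

def claimed(z):
    """Exponent of the joint density: -1/2 zeta^T P zeta + z^T P zeta + (a - beta/2)^T z."""
    return -0.5 * zeta @ P @ zeta + z @ P @ zeta + (a - 0.5 * beta) @ z

gaps = [log_joint(np.array(zt, float)) - claimed(np.array(zt, float))
        for zt in product([0, 1], repeat=D)]
print(np.allclose(gaps, gaps[0]))           # the gap is the same constant for every z
```

The check works because z_i^2 = z_i for binary z, so the \u2212(\u03b2/2)z^T z term produced by expanding the Gaussian collapses onto the linear bias, leaving no z-z interactions.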
Since p(z, \u03b6) no longer contains pairwise interactions in z, the variable z can be marginalized out, giving\n\n  p(\u03b6) = Z_\u03b8^{\u22121} |(W + \u03b2I)/(2\u03c0)|^{1/2} e^{\u2212(1/2)\u03b6^T(W + \u03b2I)\u03b6} \u220f_i [1 + e^{a_i + c_i \u2212 \u03b2/2}],   (7)\n\nwhere c_i is the ith element of (W + \u03b2I)\u03b6.\n\nThe marginal p(\u03b6) in Eq. (7) is a mixture of 2^D Gaussian distributions centered on the vertices of the hypercube in R^D with mixing weights given by p(z). Each mixture component has covariance \u03a3 = (W + \u03b2I)^{\u22121} and, as \u03b2 gets large, the precision matrix becomes diagonally dominant. As \u03b2 \u2192 \u221e, each mixture component becomes a delta function and p(\u03b6) approaches \u2211_z p(z)\u03b4(\u03b6 \u2212 z). This Gaussian smoothing allows for simple evaluation of log p(\u03b6) (up to Z_\u03b8), but we note that each mixture component has a nondiagonal covariance matrix, which should be accommodated when designing the approximate posterior q(\u03b6|x).\n\n3The augmented energy \u02c6E^\u03b2_{\u03b8,\u03b6}(z) is a multilinear function of {z_i}, and under the mean-field assumption each z_i is replaced by its average value m_i(z_i = 1).\n\nThe hyperparameter \u03b2 must be larger than the absolute value of the most negative eigenvalue of W to ensure that W + \u03b2I is positive definite. 
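For small D, the closed form of Eq. (7) (up to Z_\u03b8) can be validated against brute-force enumeration of \u2211_z e^{\u2212E_\u03b8(z)} N(\u03b6 | z, (W + \u03b2I)^{\u22121}); a sketch with illustrative random values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
D = 3
W = rng.uniform(-1, 1, size=(D, D)); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
a = rng.normal(size=D)
beta = np.abs(np.linalg.eigvalsh(W)).max() + 2.0  # keeps W + beta*I positive definite
P = W + beta * np.eye(D)
zeta = rng.normal(size=D)

# Closed form of Eq. (7), up to the z-independent partition function Z_theta
c = P @ zeta
log_closed = (0.5 * np.linalg.slogdet(P / (2.0 * np.pi))[1]
              - 0.5 * zeta @ P @ zeta
              + np.log1p(np.exp(a + c - beta / 2.0)).sum())

# Brute force: log sum_z exp(-E_theta(z)) * N(zeta | z, (W + beta*I)^{-1})
terms = []
for zt in product([0, 1], repeat=D):
    z = np.array(zt, float)
    diff = zeta - z
    log_gauss = 0.5 * np.linalg.slogdet(P / (2.0 * np.pi))[1] - 0.5 * diff @ P @ diff
    terms.append(a @ z + 0.5 * z @ W @ z + log_gauss)
log_brute = np.logaddexp.reduce(terms)
print(np.isclose(log_closed, log_brute))          # the two agree
```

The agreement is exact in the algebra; the check simply confirms it on a random instance.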
Setting \u03b2 to even larger values has the benefit of making the Gaussian mixture components more isotropic, but this comes at the cost of requiring a sharper approximate posterior with potentially noisier gradient estimates.\n\n3.3 Generalizing Overlapping Transformations\n\nThe previous sections developed two r(\u03b6|z) relaxations for Boltzmann priors. Depending on this choice, compatible q(\u03b6|x) parameterizations must be used. For example, if Gaussian smoothing is used, then a mixture of Gaussian smoothers should be used in the approximate posterior. Unfortunately, the overlapping transformations introduced in DVAE++ [20] are limited to mixtures of exponential or logistic distributions where the inverse CDF can be computed analytically. Here, we provide a general approach for reparameterizing overlapping transformations that does not require analytic inverse CDFs. Our approach is a special case of the reparameterization method for multivariate mixture distributions proposed in [42].\n\nAssume q(\u03b6|x) = (1 \u2212 q)r(\u03b6|z = 0) + q r(\u03b6|z = 1) is the mixture distribution resulting from an overlapping transformation defined for one-dimensional z and \u03b6, where q \u2261 q(z = 1|x). Ancestral sampling from q(\u03b6|x) is accomplished by first sampling from the binary distribution q(z|x) and then sampling \u03b6 from r(\u03b6|z). This process generates samples but is not differentiable with respect to q. To compute the gradient (with respect to q) of samples from q(\u03b6|x), we apply the implicit function theorem. The inverse CDF of q(\u03b6|x) at \u03c1 is obtained by solving\n\n  CDF(\u03b6) = (1 \u2212 q)R(\u03b6|z = 0) + qR(\u03b6|z = 1) = \u03c1,   (8)\n\nwhere \u03c1 \u2208 [0, 1] and R(\u03b6|z) is the CDF of r(\u03b6|z). Assuming that \u03b6 is a function of q but \u03c1 is not, we take the gradient of both sides of Eq. (8) with respect to q, giving\n\n  \u2202\u03b6/\u2202q = (R(\u03b6|z = 0) \u2212 R(\u03b6|z = 1)) / ((1 \u2212 q)r(\u03b6|z = 0) + q r(\u03b6|z = 1)),   (9)\n\nwhich can be easily computed for a sampled \u03b6 if the PDF and CDF of r(\u03b6|z) are known. This generalization allows us to compute gradients of samples generated from a wide range of overlapping transformations. Further, the gradient of \u03b6 with respect to the parameters of r(\u03b6|z) (e.g., \u03b2) is computed similarly as\n\n  \u2202\u03b6/\u2202\u03b2 = \u2212((1 \u2212 q)\u2202_\u03b2 R(\u03b6|z = 0) + q \u2202_\u03b2 R(\u03b6|z = 1)) / ((1 \u2212 q)r(\u03b6|z = 0) + q r(\u03b6|z = 1)).\n\nWith this method, we can apply overlapping transformations beyond the mixture of exponentials considered in [20]. The inverse CDF of exponential mixtures is shown in Fig. 1(a) for several \u03b2. As \u03b2 increases, the relaxation approaches the original binary variables, but this added fidelity comes at the cost of noisy gradients. Other overlapping transformations offer alternative tradeoffs:\n\nUniform+Exp Transformation: We ensure that the gradient remains finite as \u03b2 \u2192 \u221e by mixing the exponential with a uniform distribution. This is achieved by defining r'(\u03b6|z) = (1 \u2212 \u03f5)r(\u03b6|z) + \u03f5, where r(\u03b6|z) is the exponential smoothing and \u03b6 \u2208 [0, 1]. The inverse CDF resulting from this smoothing is shown in Fig. 1(b).\n\nPower-Function Transformation: Instead of adding a uniform distribution, we substitute the exponential distribution with one that has heavier tails. One choice is the power-function distribution [24]:\n\n  r(\u03b6|z) = { (1/\u03b2)\u03b6^{1/\u03b2 \u2212 1} if z = 0; (1/\u03b2)(1 \u2212 \u03b6)^{1/\u03b2 \u2212 1} otherwise }   for \u03b6 \u2208 [0, 1] and \u03b2 > 1.   (10)\n\nThe conditionals in Eq. 
(10) correspond to the Beta distributions B(1/\u03b2, 1) and B(1, 1/\u03b2), respectively. The inverse CDF resulting from this smoothing is visualized in Fig. 1(c).\n\nFigure 1: (a) Exponential Transformation, (b) Uniform+Exp Transformation, (c) Power-Function Transformation. In the first row, we visualize the inverse CDF of the mixture q(\u03b6) = \u2211_z q(z)r(\u03b6|z) for q = q(z = 1) = 0.5 as a function of the random noise \u03c1 \u2208 [0, 1]. In the second row, the gradient of the inverse CDF with respect to q is visualized. Each column corresponds to a different smoothing transformation. As the transition region sharpens with increasing \u03b2, a sampling-based estimate of the gradient becomes noisier; i.e., the variance of \u2202\u03b6/\u2202q increases. The uniform+exp transformation has a very similar inverse CDF (first row) to the exponential but has potentially lower variance (bottom row). In comparison, the power-function smoothing with \u03b2 = 40 provides a good relaxation of the discrete variables while its gradient noise is still moderate. See the supplementary material for a comparison of the gradient noise.\n\nGaussian Transformations: The transformations introduced above have support \u03b6 \u2208 [0, 1]. We also explore the Gaussian smoothing r(\u03b6|z) = N(\u03b6|z, 1/\u03b2) with support \u03b6 \u2208 R. None of these transformations has an analytic inverse CDF for q(\u03b6|x), so we use Eq. (9) to calculate gradients.\n\n4 Experiments\n\nIn this section we compare the various relaxations for training DVAEs with Boltzmann priors on statically binarized MNIST [43] and OMNIGLOT [44] datasets. For all experiments we use a generative model of the form p(x, \u03b6) = p(\u03b6)p(x|\u03b6), where p(\u03b6) is a continuous relaxation obtained from either the overlapping relaxation of Eq. 
(4) or the Gaussian integral trick of Eq. (7). The underlying Boltzmann distribution is a restricted Boltzmann machine (RBM) with bipartite connectivity, which allows for parallel Gibbs updates. We use a hierarchical autoregressively-structured approximate posterior q(\u03b6|x) = \u220f_{g=1}^{G} q(\u03b6_g|x, \u03b6_{<g}).\n\nFigure 3: Average distance between \u03b6 and its binarized z vs. variance of \u2202\u03b6/\u2202q, measured on 10^6 samples from q(\u03b6). For a given gradient variance, power-function smoothing provides a closer approximation to the binary variables.\n\n", "award": [], "sourceid": 931, "authors": [{"given_name": "Arash", "family_name": "Vahdat", "institution": "Quadrant.ai (D-Wave)"}, {"given_name": "Evgeny", "family_name": "Andriyash", "institution": "D-Wave"}, {"given_name": "William", "family_name": "Macready", "institution": "D-Wave"}]}