{"title": "DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 1864, "page_last": 1874, "abstract": "Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors. An implementation which reproduces these results is available.", "full_text": "DVAE#: Discrete Variational Autoencoders with\n\nRelaxed Boltzmann Priors\n\nArash Vahdat\u2217, Evgeny Andriyash\u2217, William G. Macready\n\n{arash,evgeny,bill}@quadrant.ai\n\nQuadrant.ai, D-Wave Systems Inc.\n\nBurnaby, BC, Canada\n\nAbstract\n\nBoltzmann machines are powerful distributions that have been shown to be an\neffective prior over binary latent variables in variational autoencoders (VAEs).\nHowever, previous methods for training discrete VAEs have used the evidence lower\nbound and not the tighter importance-weighted bound. We propose two approaches\nfor relaxing Boltzmann machines to continuous distributions that permit training\nwith importance-weighted bounds. These relaxations are based on generalized\noverlapping transformations and the Gaussian integral trick. Experiments on the\nMNIST and OMNIGLOT datasets show that these relaxations outperform previous\ndiscrete VAEs with Boltzmann priors. An implementation which reproduces these\nresults is available at https://github.com/QuadrantAI/dvae.\n\n1\n\nIntroduction\n\nAdvances in amortized variational inference [1, 2, 3, 4] have enabled novel learning methods [4, 5, 6]\nand extended generative learning into complex domains such as molecule design [7, 8], music [9] and\nprogram [10] generation. These advances have been made using continuous latent variable models in\nspite of the computational ef\ufb01ciency and greater interpretability offered by discrete latent variables.\nFurther, models such as clustering, semi-supervised learning, and variational memory addressing [11]\nall require discrete variables, which makes the training of discrete models an important challenge.\nPrior to the deep learning era, Boltzmann machines were widely used for learning with discrete latent\nvariables. These powerful multivariate binary distributions can represent any distribution de\ufb01ned on\na set of binary random variables [12], and have seen application in unsupervised learning [13], super-\nvised learning [14, 15], reinforcement learning [16], dimensionality reduction [17], and collaborative\n\ufb01ltering [18]. Recently, Boltzmann machines have been used as priors for variational autoencoders\n(VAEs) in the discrete variational autoencoder (DVAE) [19] and its successor DVAE++ [20]. It has\nbeen demonstrated that these VAE models can capture discrete aspects of data. However, both these\nmodels assume a particular variational bound and tighter bounds such as the importance weighted\n(IW) bound [21] cannot be used for training.\nWe remove this constraint by introducing two continuous relaxations that convert a Boltzmann ma-\nchine to a distribution over continuous random variables. These relaxations are based on overlapping\ntransformations introduced in [20] and the Gaussian integral trick [22] (known as the Hubbard-\nStratonovich transform [23] in physics). Our relaxations are made tunably sharp by using an inverse\ntemperature parameter.\nVAEs with relaxed Boltzmann priors can be trained using standard techniques developed for continu-\nous latent variable models. In this work, we train discrete VAEs using the same IW bound on the\nlog-likelihood that has been shown to improve importance weighted autoencoders (IWAEs) [21].\n\n\u2217Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThis paper makes two contributions: i) We introduce two continuous relaxations of Boltzmann\nmachines and use these relaxations to train a discrete VAE with a Boltzmann prior using the IW bound.\nii) We generalize the overlapping transformations of [20] to any pair of distributions with computable\nprobability density function (PDF) and cumulative density function (CDF). Using these more general\noverlapping transformations, we propose new smoothing transformations using mixtures of Gaussian\nand power-function [24] distributions. Power-function overlapping transformations provide lower\nvariance gradient estimates and improved test set log-likelihoods when the inverse temperature is\nlarge. We name our framework DVAE# because the best results are obtained when the power-function\ntransformations are sharp.2\n\n1.1 Related Work\n\nPrevious work on training discrete latent variable models can be grouped into \ufb01ve main categories:\n\ni) Exhaustive approaches marginalize all discrete variables [25, 26] and which are not scalable to\n\nmore than a few discrete variables.\n\nii) Local expectation gradients [27] and reparameterization and marginalization [28] estimators\ncompute low-variance estimates at the cost of multiple function evaluations per gradient. These\napproaches can be applied to problems with a moderate number of latent variables.\n\niii) Relaxed computation of discrete densities [29] replaces discrete variables with continuous\nrelaxations for gradient computation. A variation of this approach, known as the straight-through\ntechnique, sets the gradient of binary variables to the gradient of their mean [30, 31].\n\niv) Continuous relaxations of discrete distributions [32] replace discrete distributions with con-\ntinuous ones and optimize a consistent objective. This method cannot be applied directly to\nBoltzmann distributions. The DVAE [19] solves this problem by pairing each binary variable\nwith an auxiliary continuous variable. This approach is described in Sec. 2.\n\nv) The REINFORCE estimator [33] (also known as the likelihood ratio [34] or score-function\nestimator) replaces the gradient of an expectation with the expectation of the gradient of the\nscore function. This estimator has high variance, but many increasingly sophisticated methods\nprovide lower variance estimators. NVIL [3] uses an input-dependent baseline, and MuProp [35]\nuses a \ufb01rst-order Taylor approximation along with an input-dependent baseline to reduce noise.\nVIMCO [36] trains an IWAE with binary latent variables and uses a leave-one-out scheme to\nde\ufb01ne the baseline for each sample. REBAR [37] and its generalization RELAX [38] use the\nreparameterization of continuous distributions to de\ufb01ne baselines.\n\nThe method proposed here is of type iv) and differs from [19, 20] in the way that binary latent\nvariables are marginalized. The resultant relaxed distribution allows for DVAE training with a tighter\nbound. Moreover, our proposal encompasses a wider variety of smoothing methods and one of these\nempirically provides lower-variance gradient estimates.\n\n2 Background\n\nLet xxx represent observed random variables and \u03b6\u03b6\u03b6 continuous latent variables. We seek a generative\nmodel p(xxx, \u03b6\u03b6\u03b6) = p(\u03b6\u03b6\u03b6)p(xxx|\u03b6\u03b6\u03b6) where p(\u03b6\u03b6\u03b6) denotes the prior distribution and p(xxx|\u03b6\u03b6\u03b6) is a probabilistic\ndecoder. In the VAE [1], training maximizes a variational lower bound on the marginal log-likelihood:\n\n(cid:2)log p(xxx|\u03b6\u03b6\u03b6)(cid:3)\n\n\u2212 KL(cid:0)q(\u03b6\u03b6\u03b6|xxx)||p(\u03b6\u03b6\u03b6)(cid:1).\n\nlog p(xxx) \u2265 Eq(\u03b6\u03b6\u03b6|xxx)\n\nA probabilistic encoder q(\u03b6\u03b6\u03b6|xxx) approximates the posterior over latent variables. For continuous \u03b6\u03b6\u03b6,\nthe bound is maximized using the reparameterization trick. With reparameterization, expectations\nwith respect to q(\u03b6\u03b6\u03b6|xxx) are replaced by expectations against a base distribution and a differentiable\nfunction that maps samples from the base distribution to q(\u03b6\u03b6\u03b6|xxx). This can always be accomplished\nwhen q(\u03b6\u03b6\u03b6|xxx) has an analytic inverse cumulative distribution function (CDF) by mapping uniform\nsamples through the inverse CDF. However, reparameterization cannot be applied to binary latent\nvariables because the CDF is not differentiable.\n\n2And not because our model is proposed after DVAE and DVAE++.\n\n2\n\n\f(cid:26)\u03b4(\u03b6)\n\nwhere r(\u03b6\u03b6\u03b6|zzz) =(cid:81)\n\nThe DVAE [19] resolves this issue by pairing each binary latent variable with a continuous counterpart.\nDenoting a binary vector of length D by zzz \u2208 {0, 1}D, the Boltzmann prior is p(zzz) = e\u2212E\u03b8\u03b8\u03b8(zzz)/Z\u03b8\u03b8\u03b8\nwhere E\u03b8\u03b8\u03b8(zzz) = \u2212aaaT zzz \u2212 1\n2zzzT WWWzzz is an energy function with parameters \u03b8\u03b8\u03b8 \u2261 {WWW , aaa} and partition\nfunction Z\u03b8\u03b8\u03b8. The joint model over discrete and continuous variables is p(xxx, zzz, \u03b6\u03b6\u03b6) = p(zzz)r(\u03b6\u03b6\u03b6|zzz)p(xxx|\u03b6\u03b6\u03b6)\ni r(\u03b6i|zi) is a smoothing transformation that maps each discrete zi to its continuous\nanalogue \u03b6i.\nDVAE [19] and DVAE++ [20] differ in the type of smoothing transformations r(\u03b6|z): [19] uses\nspike-and-exponential transformation (Eq. (1) left), while [20] uses two overlapping exponential\ndistributions (Eq. (1) right). Here, \u03b4(\u03b6) is the (one-sided) Dirac delta distribution, \u03b6 \u2208 [0, 1], and Z\u03b2\nis the normalization constant:\n\n(cid:26)e\u2212\u03b2\u03b6/Z\u03b2\nThe variational bound for a factorial approximation to the posterior where q(\u03b6\u03b6\u03b6|xxx) =(cid:81)\nq(zzz|xxx) =(cid:81)\n(cid:2)Eq(zzz|xxx,\u03b6\u03b6\u03b6) log p(zzz))(cid:3) ,\nHere q(\u03b6i|xxx) =(cid:80)\n1) with weights q(zi|xxx). The probability of binary units conditioned on \u03b6i, q(zzz|xxx, \u03b6\u03b6\u03b6) =(cid:81)\n\ni q(\u03b6i|xxx) and\n(2)\nq(zi|xxx)r(\u03b6i|zi) is a mixture distribution combining r(\u03b6i|zi = 0) and r(\u03b6i|zi =\ni q(zi|xxx, \u03b6i),\ncan be computed analytically. H(q(zzz|xxx)) is the entropy of q(zzz|xxx). The second and third terms in\nEq. (2) have analytic solutions (up to the log normalization constant) that can be differentiated easily\nwith an automatic differentiation (AD) library. The expectation over q(\u03b6\u03b6\u03b6|xxx) is approximated with\nreparameterized sampling.\nWe extend [19, 20] to tighten the bound of Eq. (2) by importance weighting [21, 39]. These tighter\nbounds are shown to improve VAEs. For continuous latent variables, the K-sample IW bound is\n\ni q(zi|xxx) is derived in [20] as\nlog p(xxx) \u2265 Eq(\u03b6\u03b6\u03b6|xxx) [log p(xxx|\u03b6\u03b6\u03b6)] + H(q(zzz|xxx)) + Eq(\u03b6\u03b6\u03b6|xxx)\n\nif z = 0\notherwise ,\n\nr(\u03b6|z) =\n\ne\u03b2(\u03b6\u22121)/Z\u03b2\n\nr(\u03b6|z) =\n\ne\u03b2(\u03b6\u22121)/Z\u03b2\n\nif z = 0\notherwise .\n\n(1)\n\nzi\n\n(cid:33)(cid:35)\n\nlog p(xxx) \u2265 LK(xxx) = E\n\n\u03b6\u03b6\u03b6(k)\u223cq(\u03b6\u03b6\u03b6|xxx)\n\nlog\n\n1\nK\n\nThe tightness of the IW bound improves as K increases [21].\n\n3 Model\n\n(cid:34)\n\n(cid:32)\n\nK(cid:88)\n\nk=1\n\np(\u03b6\u03b6\u03b6 (k))p(xxx|\u03b6\u03b6\u03b6 (k))\n\nq(\u03b6\u03b6\u03b6 (k)|xxx)\n\n.\n\n(3)\n\nWe introduce two relaxations of Boltzmann machines to de\ufb01ne the continuous prior distribution\np(\u03b6\u03b6\u03b6) in the IW bound of Eq. (3). These relaxations rely on either overlapping transformations\n(Sec. 3.1) or the Gaussian integral trick (Sec. 3.2). Sec. 3.3 then generalizes the class of overlapping\ntransformations that can be used in the approximate posterior q(\u03b6\u03b6\u03b6|xxx).\n3.1 Overlapping Relaxations\n\nWe obtain a continuous relaxation of p(zzz) through the marginal p(\u03b6\u03b6\u03b6) = (cid:80)\nand \u03b6\u03b6\u03b6 independently; i.e., r(\u03b6\u03b6\u03b6|zzz) =(cid:81)\nr(\u03b6\u03b6\u03b6|zzz) approaches \u03b4(\u03b6\u03b6\u03b6 \u2212 zzz) and p(\u03b6\u03b6\u03b6) =(cid:80)\n\nz p(zzz)r(\u03b6\u03b6\u03b6|zzz) where\nr(\u03b6\u03b6\u03b6|zzz) is an overlapping smoothing transformation [20] that operates on each component of zzz\ni r(\u03b6i|zi). Overlapping transformations such as mixture of\nexponential in Eq. (1) may be used for r(\u03b6\u03b6\u03b6|zzz). These transformations are equipped with an inverse\ntemperature hyperparameter \u03b2 to control the sharpness of the smoothing transformation. As \u03b2 \u2192 \u221e,\nz p(zzz)\u03b4(\u03b6\u03b6\u03b6 \u2212 zzz) becomes a mixture of 2D delta function\ndistributions centered on the vertices of the hypercube in RD. At \ufb01nite \u03b2, p(\u03b6\u03b6\u03b6) provides a continuous\nrelaxation of the Boltzmann machine.\nTo train an IWAE using Eq. (3) with p(\u03b6\u03b6\u03b6) as a prior, we must compute log p(\u03b6\u03b6\u03b6) and its gradient\nwith respect to the parameters of the Boltzmann distribution and the approximate posterior. This\ncomputation involves marginalization over zzz, which is generally intractable. However, we show that\nthis marginalization can be approximated accurately using a mean-\ufb01eld model.\n\n3.1.1 Computing log p(\u03b6\u03b6\u03b6) and its Gradient for Overlapping Relaxations\n\nSince overlapping transformations are factorial, the log marginal distribution of \u03b6\u03b6\u03b6 is\n\n(cid:17)\n\n(cid:16)(cid:88)\n\ne\u2212E\u03b8\u03b8\u03b8(zzz)+bbb\u03b2 (\u03b6\u03b6\u03b6)T zzz+ccc\u03b2 (\u03b6\u03b6\u03b6)(cid:17)\n\nlog p(\u03b6\u03b6\u03b6) = log\n\np(zzz)r(\u03b6\u03b6\u03b6|zzz)\n\n= log\n\n\u2212 log Z\u03b8\u03b8\u03b8,\n\n(4)\n\n(cid:16)(cid:88)\n\nzzz\n\nzzz\n\n3\n\n\fi (\u03b6\u03b6\u03b6) = \u2212\u03b2\u03b6i \u2212 log Z\u03b2.\n\ni (\u03b6\u03b6\u03b6) = \u03b2(2\u03b6i \u2212 1) and c\u03b2\n\ni (\u03b6\u03b6\u03b6) = log r(\u03b6i|zi = 0). For the mixture of\n\ni (\u03b6\u03b6\u03b6) = log r(\u03b6i|zi = 1) \u2212 log r(\u03b6i|zi = 0) and c\u03b2\n\nwhere b\u03b2\nexponential smoothing b\u03b2\nThe \ufb01rst term in Eq. (4) is the log partition function of the Boltzmann machine \u02c6p(zzz) with augmented\nenergy function \u02c6E\u03b2\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(zzz) := E\u03b8\u03b8\u03b8(zzz)\u2212bbb\u03b2(\u03b6\u03b6\u03b6)T zzz\u2212ccc\u03b2(\u03b6\u03b6\u03b6). Estimating the log partition function accurately\ncan be expensive, particularly because it has to be done for each \u03b6\u03b6\u03b6. However, we note that each \u03b6i\ncomes from a bimodal distribution centered at zero and one, and that the bias bbb\u03b2(\u03b6\u03b6\u03b6) is usually large\nfor most components i (particularly for large \u03b2). In this case, mean \ufb01eld is likely to provide a good\napproximation of \u02c6p(zzz), a fact we demonstrate empirically in Sec. 4.\n\nTo compute log p(\u03b6\u03b6\u03b6) and its gradient, we \ufb01rst \ufb01t a mean-\ufb01eld distribution m(zzz) =(cid:81)\n\ni mi(zi) by\n\nminimizing KL(m(zzz)||\u02c6p(zzz)) [40]. The gradient of log p(\u03b6\u03b6\u03b6) with respect to \u03b2, \u03b8\u03b8\u03b8 or \u03b6\u03b6\u03b6 is:\n\n(cid:2)\n(cid:2)\n\u2207 log p(\u03b6\u03b6\u03b6) = \u2212Ezzz\u223c \u02c6p(zzz)\n\u2207 \u02c6E\u03b2\n\u2248 \u2212Ezzz\u223cm(zzz)\n\u2207 \u02c6E\u03b2\n= \u2212\u2207 \u02c6E\u03b2\n\n(cid:2)\n\u2207E\u03b8\u03b8\u03b8(zzz)(cid:3)\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(zzz)(cid:3) + Ezzz\u223cp(zzz)\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(zzz)(cid:3) + Ezzz\u223cp(zzz)\n\u2207E\u03b8\u03b8\u03b8(zzz)(cid:3)\n(cid:2)\n(cid:2)\n\u2207E\u03b8\u03b8\u03b8(zzz)(cid:3),\n\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(mmm) + Ezzz\u223cp(zzz)\n\n(5)\n\u00b7\u00b7\u00b7 mD(zD = 1)] is the mean-\ufb01eld solution and where the gradient\nwhere mmmT = [m1(z1 = 1)\ndoes not act on mmm. The \ufb01rst term in Eq. (5) is the result of computing the average energy under\na factorial distribution.3 The second expectation corresponds to the negative phase in training\nBoltzmann machines and is approximated by Monte Carlo sampling from p(zzz).\nTo compute the importance weights for the IW bound of Eq. (3) we must compute the value of\n\u2248 0\n(6)\n\nlog p(\u03b6\u03b6\u03b6) up to the normalization; i.e. the \ufb01rst term in Eq. (4). Assuming that KL(cid:0)m(zzz)||\u02c6p(zzz)(cid:1)\nthe \ufb01rst term of Eq. (4) is approximated as H(cid:0)m(zzz)(cid:1)\n\nKL(m(zzz)||\u02c6p(zzz)) = \u02c6E\u03b2\n\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6 (zzz)(cid:17)\n\n\u2212 H(m(zzz)),\n\n(cid:16)(cid:88)\n\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(mmm) + log\n\nand using\n\n\u2212 \u02c6E\u03b2\n\ne\n\nz\n\n\u2212 \u02c6E\u03b2\n\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(mmm).\n\n3.2 The Gaussian Integral Trick\n\nThe computational complexity of log p(\u03b6\u03b6\u03b6) arises from the pairwise interactions zzzT WWWzzz present in\nE\u03b8\u03b8\u03b8(zzz). Instead of applying mean \ufb01eld, we remove these interactions using the Gaussian integral\ntrick [41]. This is achieved by de\ufb01ning Gaussian smoothing:\n\nr(\u03b6\u03b6\u03b6|zzz) = N (\u03b6\u03b6\u03b6|AAA(WWW + \u03b2III)zzz, AAA(WWW + \u03b2III)AAAT )\n\nfor an invertible matrix AAA and a diagonal matrix \u03b2III with \u03b2 > 0. Here, \u03b2 must be large enough so that\nWWW + \u03b2III is positive de\ufb01nite. Common choices for AAA include AAA = III or AAA = \u039b\u039b\u039b\u2212 1\n2 VVV T where VVV \u039b\u039b\u039bVVV T is\nthe eigendecomposition of WWW + \u03b2III [41]. However, neither of these choices places the modes of p(\u03b6\u03b6\u03b6)\non the vertices of the hypercube in RD. Instead, we take AAA = (WWW + \u03b2III)\u22121 giving the smoothing\ntransformation r(\u03b6\u03b6\u03b6|zzz) = N (\u03b6\u03b6\u03b6|zzz, (WWW + \u03b2III)\u22121). The joint density is then\n2 \u03b2111)T zzz,\n\n2 \u03b6\u03b6\u03b6T (WWW +\u03b2III)\u03b6\u03b6\u03b6+zzzT (WWW +\u03b2III)\u03b6\u03b6\u03b6+(aaa\u2212 1\n\np(zzz, \u03b6\u03b6\u03b6) \u221d e\u2212 1\n\nwhere 111 is the D-vector of all ones. Since p(zzz, \u03b6\u03b6\u03b6) no longer contains pairwise interactions zzz can be\nmarginalized out giving\n\n(cid:12)(cid:12)(cid:12)(cid:12) 1\n\n2\u03c0\n\n(cid:12)(cid:12)(cid:12)(cid:12) 1\n\n2\n\n2 \u03b6\u03b6\u03b6T (WWW +\u03b2III)\u03b6\u03b6\u03b6(cid:89)\n\n(cid:104)\n\ne\u2212 1\n\ni\n\n(cid:105)\n\np(\u03b6\u03b6\u03b6) = Z\u22121\n\n\u03b8\u03b8\u03b8\n\n(WWW + \u03b2III)\n\n1 + eai+ci\u2212 \u03b2\n\n2\n\n,\n\n(7)\n\nwhere ci is the ith element of (WWW + \u03b2III)\u03b6\u03b6\u03b6.\nThe marginal p(\u03b6\u03b6\u03b6) in Eq. (7) is a mixture of 2D Gaussian distributions centered on the vertices of\nthe hypercube in RD with mixing weights given by p(zzz). Each mixture component has covariance\n\u03a3\u03a3\u03a3 = (WWW + \u03b2III)\u22121 and, as \u03b2 gets large, the precision matrix becomes diagonally dominant. As\n\u03b8\u03b8\u03b8,\u03b6\u03b6\u03b6(zzz) is a multi-linear function of {zi} and under the mean-\ufb01eld assumption each\n\n3The augmented energy \u02c6E\u03b2\n\nzi is replaced by its average value m(zi = 1).\n\n4\n\n\f\u03b2 \u2192 \u221e, each mixture component becomes a delta function and p(\u03b6\u03b6\u03b6) approaches(cid:80)\n\nz p(zzz)\u03b4(\u03b6\u03b6\u03b6 \u2212 zzz).\nThis Gaussian smoothing allows for simple evaluation of log p(\u03b6\u03b6\u03b6) (up to Z\u03b8\u03b8\u03b8), but we note that each\nmixture component has a nondiagonal covariance matrix, which should be accommodated when\ndesigning the approximate posterior q(\u03b6\u03b6\u03b6|xxx).\nThe hyperparameter \u03b2 must be larger than the absolute value of the most negative eigenvalue of WWW to\nensure that WWW + \u03b2III is positive de\ufb01nite. Setting \u03b2 to even larger values has the bene\ufb01t of making\nthe Gaussian mixture components more isotropic, but this comes at the cost of requiring a sharper\napproximate posterior with potentially noisier gradient estimates.\n\n3.3 Generalizing Overlapping Transformations\nThe previous sections developed two r(\u03b6\u03b6\u03b6|zzz) relaxations for Boltzmann priors. Depending on this\nchoice, compatible q(\u03b6\u03b6\u03b6|xxx) parameterizations must be used. For example, if Gaussian smoothing is\nused, then a mixture of Gaussian smoothers should be used in the approximate posterior. Unfor-\ntunately, the overlapping transformations introduced in DVAE++ [20] are limited to mixtures of\nexponential or logistic distributions where the inverse CDF can be computed analytically. Here, we\nprovide a general approach for reparameterizing overlapping transformations that does not require an-\nalytic inverse CDFs. Our approach is a special case of the reparameterization method for multivariate\nmixture distributions proposed in [42].\nAssume q(\u03b6|xxx) = (1 \u2212 q)r(\u03b6|z = 0) + qr(\u03b6|z = 1) is the mixture distribution resulting from an\noverlapping transformation de\ufb01ned for one-dimensional z and \u03b6 where q \u2261 q(z = 1|xxx). Ancestral\nsampling from q(\u03b6|xxx) is accomplished by \ufb01rst sampling from the binary distribution q(z|xxx) and then\nsampling \u03b6 from r(\u03b6|z). This process generates samples but is not differentiable with respect to q.\nTo compute the gradient (with respect to q) of samples from q(\u03b6|xxx), we apply the implicit function\ntheorem. The inverse CDF of q(\u03b6|xxx) at \u03c1 is obtained by solving:\n(8)\nwhere \u03c1 \u2208 [0, 1] and R(\u03b6|z) is the CDF for r(\u03b6|z). Assuming that \u03b6 is a function of q but \u03c1 is not,\nwe take the gradient from both sides of Eq. (8) with respect to q giving\n\nCDF(\u03b6) = (1 \u2212 q)R(\u03b6|z = 0) + qR(\u03b6|z = 1) = \u03c1,\n\n\u2202\u03b6\n\u2202q\n\n=\n\nR(\u03b6|z = 0) \u2212 R(\u03b6|z = 1)\n\n(1 \u2212 q)r(\u03b6|z = 0) + qr(\u03b6|z = 1)\n\n,\n\n(9)\n\nwhich can be easily computed for a sampled \u03b6 if the PDF and CDF of r(\u03b6|z) are known. This\ngeneralization allows us to compute gradients of samples generated from a wide range of overlapping\ntransformations. Further, the gradient of \u03b6 with respect to the parameters of r(\u03b6|z) (e.g. \u03b2) is\ncomputed similarly as\n\n\u2202\u03b6\n\u2202\u03b2\n\n= \u2212\n\n(1 \u2212 q) \u2202\u03b2R(\u03b6|z = 0) + q \u2202\u03b2R(\u03b6|z = 1)\n\n(1 \u2212 q)r(\u03b6|z = 0) + qr(\u03b6|z = 1)\n\n.\n\nWith this method, we can apply overlapping transformations beyond the mixture of exponentials\nconsidered in [20]. The inverse CDF of exponential mixtures is shown in Fig. 1(a) for several \u03b2. As\n\u03b2 increases, the relaxation approaches the original binary variables, but this added \ufb01delity comes at\nthe cost of noisy gradients. Other overlapping transformations offer alternative tradeoffs:\nUniform+Exp Transformation: We ensure that the gradient remains \ufb01nite as \u03b2 \u2192 \u221e by mixing\nthe exponential with a uniform distribution. This is achieved by de\ufb01ning r(cid:48)(\u03b6|z) = (1 \u2212 \u0001)r(\u03b6|z) + \u0001\nwhere r(\u03b6|z) is the exponential smoothing and \u03b6 \u2208 [0, 1]. The inverse CDF resulting from this\nsmoothing is shown in Fig. 1(b).\nPower-Function Transformation: Instead of adding a uniform distribution we substitute the expo-\nnential distribution for one with heavier tails. One choice is the power-function distribution [24]:\n\n(cid:40) 1\n\u03b2 \u22121)\n\u03b2 \u03b6 ( 1\n\u03b2 (1 \u2212 \u03b6)( 1\n\n1\n\nr(\u03b6|z) =\n\n\u03b2 \u22121)\n\nif z = 0\notherwise\n\nfor \u03b6 \u2208 [0, 1] and \u03b2 > 1.\n\n(10)\n\nThe conditionals in Eq. (10) correspond to the Beta distributions B(1/\u03b2, 1) and B(1, 1/\u03b2) respec-\ntively. The inverse CDF resulted from this smoothing is visualized in Fig. 1(c).\n\n5\n\n\f(a) Exponential Transformation\n\nFigure 1: In the \ufb01rst row, we visualize the inverse CDF of the mixture q(\u03b6) =(cid:80)\n\n(b) Uniform+Exp Transformation (c) Power-Function Transformation\nz q(z)r(\u03b6|z) for\nq = q(z = 1) = 0.5 as a function of the random noise \u03c1 \u2208 [0, 1]. In the second row, the gradient of\nthe inverse CDF with respect to q is visualized. Each column corresponds to a different smoothing\ntransformation. As the transition region sharpens with increasing \u03b2, a sampling based estimate of the\ngradient becomes noisier; i.e., the variance of \u2202\u03b6/\u2202q increases. The uniform+exp exponential has a\nvery similar inverse CDF (\ufb01rst row) to the exponential but has potentially lower variance (bottom\nrow). In comparison, the power-function smoothing with \u03b2 = 40 provides a good relaxation of the\ndiscrete variables while its gradient noise is still moderate. See the supplementary material for a\ncomparison of the gradient noise.\n\nGaussian Transformations: The transformations introduced above have support \u03b6 \u2208 [0, 1]. We also\nexplore Gaussian smoothing r(\u03b6|z) = N (\u03b6|z, 1\nNone of these transformations have an analytic inverse CDF for q(\u03b6|xxx) so we use Eq. (9) to calculate\ngradients.\n\n\u03b2 ) with support \u03b6 \u2208 R.\n\n4 Experiments\n\nq(\u03b6\u03b6\u03b6|xxx) =(cid:81)G\n\nIn this section we compare the various relaxations for training DVAEs with Boltzmann priors on\nstatically binarized MNIST [43] and OMNIGLOT [44] datasets. For all experiments we use a\ngenerative model of the form p(xxx, \u03b6\u03b6\u03b6) = p(\u03b6\u03b6\u03b6)p(xxx|\u03b6\u03b6\u03b6) where p(\u03b6\u03b6\u03b6) is a continuous relaxation obtained\nfrom either the overlapping relaxation of Eq. (4) or the Gaussian integral trick of Eq. (7). The\nunderlying Boltzmann distribution is a restricted Boltzmann machine (RBM) with bipartite connec-\ntivity which allows for parallel Gibbs updates. We use a hierarchical autoregressively-structured\ng=1 q(\u03b6\u03b6\u03b6 g|xxx, \u03b6\u03b6\u03b6 <g) to approximate the posterior distribution over \u03b6\u03b6\u03b6. This structure divides\nthe components of \u03b6\u03b6\u03b6 into G equally-sized groups and de\ufb01nes each conditional using a factorial\ndistribution conditioned on xxx and all \u03b6\u03b6\u03b6 from previous groups.\nThe smoothing transformation used in q(\u03b6\u03b6\u03b6|xxx) depends on the type of relaxation used in p(\u03b6\u03b6\u03b6). For\noverlapping relaxations, we compare exponential, uniform+exp, Gaussian, and power-function. With\nthe Gaussian integral trick, we use shifted Gaussian smoothing as described below. The decoder\np(xxx|\u03b6\u03b6\u03b6) and conditionals q(\u03b6\u03b6\u03b6 g|xxx, \u03b6\u03b6\u03b6 <g) are modeled with neural networks. Following [20], we consider\nboth linear (\u2014) and nonlinear (\u223c) versions of these networks. The linear models use a single linear\nlayer to predict the parameters of the distributions p(xxx|\u03b6\u03b6\u03b6) and q(\u03b6\u03b6\u03b6 g|xxx, \u03b6\u03b6\u03b6 <g) given their input. The\nnonlinear models use two deterministic hidden layers with 200 units, tanh activation and batch-\nnormalization. We use the same initialization scheme, batch-size, optimizer, number of training\niterations, schedule of learning rate, weight decay and KL warm-up for training that was used in [20]\n(See Sec. 7.2 in [20]). For the mean-\ufb01eld optimization, we use 5 iterations. To evaluate the trained\nmodels, we estimate the log-likelihood on the discrete graphical model using the importance-weighted\n\n6\n\n01\u03c101\u03b6Bernoulli\u03b2=14\u03b2=12\u03b2=801\u03c101\u03b6Bernoulli\u03b2=14\u03b2=12\u03b2=801\u03c101\u03b6Bernoulli\u03b2=40\u03b2=20\u03b2=1001\u03c1020406080\u2202\u03b6\u2202q\u03b2=14\u03b2=12\u03b2=801\u03c10481216\u2202\u03b6\u2202q\u03b2=14\u03b2=12\u03b2=801\u03c105101520\u2202\u03b6\u2202q\u03b2=40\u03b2=20\u03b2=10\f(cid:0)\u03b6i|\u00b5i + \u2206\u00b5i(\u03b6\u03b6\u03b6 <i), \u03c3i\n\n(cid:1) where \u2206\u00b5i(\u03b6\u03b6\u03b6 <i) is linear in \u03b6\u03b6\u03b6 <i. Motivated by this observation, we\n\nbound with 4000 samples [21]. At evaluation p(\u03b6\u03b6\u03b6) is replaced with the Boltzmann distribution p(zzz),\nand q(\u03b6\u03b6\u03b6|xxx) with q(zzz|xxx) (corresponding to \u03b2 = \u221e).\nFor DVAE, we use the original spike-and-exp smoothing. For DVAE++, in addition to exponential\nsmoothing, we use a mixture of power-functions. The DVAE# models are trained using the IW bound\nin Eq. (3) with K = 1, 5, 25 samples. To fairly compare DVAE# with DVAE and DVAE++ (which\ncan only be trained with the variational bound), we use the same number of samples K \u2265 1 when\nestimating the variational bound during DVAE and DVAE++ training.\nThe smoothing parameter \u03b2 is \ufb01xed throughout training (i.e. \u03b2 is not annealed). However, since \u03b2\nacts differently for each smoothing function r, its value is selected by cross validation per smoothing\nand structure. We select from \u03b2 \u2208 {4, 5, 6, 8} for spike-and-exp, \u03b2 \u2208 {8, 10, 12, 16} for exponential,\n\u03b2 \u2208 {16, 20, 30, 40} with \u0001 = 0.05 for uniform+exp, \u03b2 \u2208 {15, 20, 30, 40} for power-function, and\n\u03b2 \u2208 {20, 25, 30, 40} for Gaussian smoothing. For models other than the Gaussian integral trick, \u03b2 is\nset to the same value in q(\u03b6\u03b6\u03b6|xxx) and p(\u03b6\u03b6\u03b6). For the Gaussian integral case, \u03b2 in the encoder is trained\nas discussed next, but is selected in the prior from \u03b2 \u2208 {20, 25, 30, 40}.\nWith the Gaussian integral trick, each mixture component in the prior contains off-diagonal cor-\n(cid:81)\nrelations and the approximation of the posterior over \u03b6\u03b6\u03b6 should capture this. We recall that a mul-\ntivariate Gaussian N (\u03b6\u03b6\u03b6|\u00b5\u00b5\u00b5, \u03a3\u03a3\u03a3) can always be represented as a product of Gaussian conditionals\ni N\nprovide \ufb02exibility in the approximate posterior q(\u03b6\u03b6\u03b6|xxx) by using shifted Gaussian smoothing where\nr(\u03b6i|zi) = N (\u03b6i|zi + \u2206\u00b5i(\u03b6\u03b6\u03b6 <i), 1/\u03b2i), and \u2206\u00b5i(\u03b6\u03b6\u03b6 <i) is an additional parameter that shifts the\ndistribution. As the approximate posterior in our model is hierarchical, we generate \u2206\u00b5i(\u03b6\u03b6\u03b6 <g) for\nthe ith element in gth group as the output of the same neural network that generates the parameters\nof q(\u03b6\u03b6\u03b6 g|xxx, \u03b6\u03b6\u03b6 <g). The parameter \u03b2i for each component of \u03b6\u03b6\u03b6 g is a trainable parameter shared for all xxx.\nTraining also requires sampling from the discrete RBM to compute the \u03b8\u03b8\u03b8-gradient of log Z\u03b8\u03b8\u03b8. We\nhave used both population annealing [45] with 40 sweeps across variables per parameter update\nand persistent contrastive divergence [46] for sampling. Population annealing usually results in a\nbetter generative model (see the supplementary material for a comparison). We use QuPA4, a GPU\nimplementation of population annealing. To obtain test set log-likelihoods we require log Z\u03b8\u03b8\u03b8, which\nwe estimate with annealed importance sampling [47, 48]. We use 10,000 temperatures and 1,000\nsamples to ensure that the standard deviation of the log Z\u03b8\u03b8\u03b8 estimate is small (\u223c 0.01).\nWe compare the performance of DVAE# against DVAE and DVAE++ in Table 1. We consider four\nneural net structures when examining the various smoothing models. Each structure is denoted\n\u201cG \u2014/\u223c\u201d where G represent the number of groups in the approximate posterior and \u2014/\u223c indicates\nlinear/nonlinear conditionals. The RBM prior for the structures \u201c1 \u2014/\u223c\u201d is 100\u00d7 100 (i.e. D = 200)\nand for structures \u201c2/4 \u223c\u201d the RBM is 200 \u00d7 200 (i.e. D = 400).\nWe make several observations based on Table 1: i) Most baselines improve as K increases. The\nimprovements are generally larger for DVAE# as they optimize the IW bound. ii) Power-function\nsmoothing improves the performance of DVAE++ over the original exponential smoothing.\niii)\nDVAE# and DVAE++ both with power-function smoothing for K = 1 optimizes a similar variational\nbound with same smoothing transformation. The main difference here is that DVAE# uses the\nmarginal p(\u03b6\u03b6\u03b6) in the prior whereas DVAE++ has the joint p(zzz, \u03b6\u03b6\u03b6) = p(zzz)r(zzz|\u03b6\u03b6\u03b6). For this case, it can\nbe seen that DVAE# usually outperforms DVAE++ . iv) Among the DVAE# variants, the Gaussian\nintegral trick and Gaussian overlapping relaxation result in similar performance, and both are usually\ninferior to the other DVAE# relaxations. v) In DVAE#, the uniform+exp smoothing performs better\nthan exponential smoothing alone. vi) DVAE# with the power-function smoothing results in the best\ngenerative models, and in most cases outperforms both DVAE and DVAE++.\nGiven the superior performance of the models obtained using the mean-\ufb01eld approximation of\nSec. 3.1.1 to \u02c6p(\u03b6\u03b6\u03b6), we investigate the accuracy of this approximation. In Fig. 2(a), we show that the\nmean-\ufb01eld model converges quickly by plotting the KL divergence of Eq. (6) with the number of\nmean-\ufb01eld iterations for a single \u03b6\u03b6\u03b6. To assess the quality of the mean-\ufb01eld approximation, in Fig. 2(b)\nwe compute the KL divergence for randomly selected \u03b6\u03b6\u03b6s during training at different iterations for\nexponential and power-function smoothings with different \u03b2s. As it can be seen, throughout the\n\n4This library is publicly available at https://try.quadrant.ai/qupa\n\n7\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 2: (a) The KL divergence between the mean-\ufb01eld model and the augmented Boltzmann\nmachine \u02c6p(zzz) as a function of the number of optimization iterations of the mean-\ufb01eld. The mean-\ufb01eld\nmodel converges to KL = 0.007 in three iterations. (b) The KL value is computed for randomly\nselected \u03b6s during training at different iterations for exponential and power-function smoothings with\ndifferent \u03b2. (c) The variance of the gradient of the objective function with respect to the logit of q is\nvisualized for exponential and power-function smoothing transformations. Power-function smoothing\ntends to have lower variance than exponential smoothing. The artifact seen early in training is due to\nthe warm-up of KL. Models in (b) and (c) are trained for 100K iterations with batch size of 1,000.\n\nDVAE\n\nExp\n\nPower\n\n1 \u2014\n\n1 \u223c\n\n2 \u223c\n\n4 \u223c\n\n1 \u2014\n\nStruct.\n\nT\nS\nI\nN\nM\n\nDVAE++\n\nPower\n\n89.12\u00b10.05\n89.09\u00b10.05\n89.04\u00b10.07\n85.05\u00b10.02\n85.29\u00b10.10\n85.59\u00b10.10\n83.62\u00b10.04\n83.57\u00b10.07\n83.58\u00b10.15\n83.44\u00b10.05\n83.17\u00b10.09\n83.20\u00b10.08\n\n90.43\u00b10.06\n90.13\u00b10.03\n89.92\u00b10.07\n85.13\u00b10.06\n85.13\u00b10.09\n86.14\u00b10.18\n84.15\u00b10.07\n84.85\u00b10.13\n85.49\u00b10.12\n84.63\u00b10.11\n85.41\u00b10.09\n85.42\u00b10.07\n\nTable 1: The performance of DVAE# is compared against DVAE and DVAE++ on MNIST and\nOMNIGLOT. Mean\u00b1standard deviation of the negative log-likelihood for \ufb01ve runs are reported.\nGaussian\n89.35\u00b10.06\n91.33\u00b10.13\n88.25\u00b10.03\n90.15\u00b10.04\n87.67\u00b10.07\n89.55\u00b10.10\n84.93\u00b10.02\n86.24\u00b10.05\n84.21\u00b10.02\n84.91\u00b10.07\n83.93\u00b10.06\n84.30\u00b10.04\n83.37\u00b10.02\n84.35\u00b10.04\n82.99\u00b10.04\n83.61\u00b10.04\n82.85\u00b10.03\n83.26\u00b10.04\n83.18\u00b10.05\n84.81\u00b10.19\n82.95\u00b10.07\n84.20\u00b10.15\n83.80\u00b10.04\n82.82\u00b10.02\n106.81\u00b10.07 107.21\u00b10.14 105.89\u00b10.06 105.47\u00b10.09\n106.16\u00b10.11 106.86\u00b10.10 104.94\u00b10.05 104.42\u00b10.09\n105.75\u00b10.10 106.88\u00b10.09 104.49\u00b10.07 103.98\u00b10.05\n102.74\u00b10.08 102.23\u00b10.08 101.86\u00b10.06 101.70\u00b10.01\n102.00\u00b10.09 101.59\u00b10.06 101.22\u00b10.05 101.00\u00b10.02\n101.60\u00b10.09 101.48\u00b10.04 100.93\u00b10.07 100.60\u00b10.05\n99.75\u00b10.05\n102.84\u00b10.23 100.38\u00b10.09 99.84\u00b10.06\n101.43\u00b10.11 99.93\u00b10.07\n99.57\u00b10.06\n99.24\u00b10.05\n98.93\u00b10.05\n100.45\u00b10.08 100.10\u00b10.28 99.59\u00b10.16\n99.65\u00b10.09\n103.43\u00b10.10 100.85\u00b10.12 99.92\u00b10.11\n99.13\u00b10.10\n101.82\u00b10.13 100.32\u00b10.19 99.61\u00b10.07\n100.97\u00b10.21 99.92\u00b10.30\n99.36\u00b10.09\n98.88\u00b10.09\n\nGauss. Int\nSpike-Exp\n92.14\u00b10.12\n89.00\u00b10.09\n91.32\u00b10.09\n89.15\u00b10.12\n91.18\u00b10.21\n89.20\u00b10.13\n86.23\u00b10.05\n85.48\u00b10.06\n84.99\u00b10.03\n85.29\u00b10.03\n84.36\u00b10.04\n85.92\u00b10.10\n84.30\u00b10.05\n83.97\u00b10.04\n83.68\u00b10.02\n83.74\u00b10.03\n83.39\u00b10.04\n84.19\u00b10.21\n84.59\u00b10.06\n84.38\u00b10.03\n83.89\u00b10.09\n83.93\u00b10.07\n84.12\u00b10.07\n83.52\u00b10.06\n105.11\u00b10.11 106.71\u00b10.08 105.45\u00b10.08 110.81\u00b10.32\n104.68\u00b10.21 106.83\u00b10.09 105.34\u00b10.05 112.26\u00b10.70\n104.38\u00b10.15 106.85\u00b10.07 105.38\u00b10.14 111.92\u00b10.30\n102.95\u00b10.07 101.84\u00b10.08 101.88\u00b10.06 103.50\u00b10.06\n102.45\u00b10.08 102.13\u00b10.11 101.67\u00b10.07 102.15\u00b10.04\n102.74\u00b10.05 102.66\u00b10.09 101.80\u00b10.15 101.42\u00b10.04\n103.10\u00b10.31 101.34\u00b10.04 100.42\u00b10.03 102.07\u00b10.16\n100.88\u00b10.13 100.55\u00b10.09 99.51\u00b10.05\n100.85\u00b10.02\n100.55\u00b10.08 100.31\u00b10.15 99.49\u00b10.07\n100.20\u00b10.02\n104.63\u00b10.47 101.58\u00b10.22 100.42\u00b10.08 102.91\u00b10.25\n101.79\u00b10.25\n101.77\u00b10.20 101.01\u00b10.09 99.52\u00b10.09\n100.89\u00b10.13 100.37\u00b10.09 99.43\u00b10.14\n100.73\u00b10.08\n\nUn+Exp\n89.57\u00b10.08\n88.56\u00b10.04\n88.02\u00b10.04\n85.19\u00b10.05\n84.47\u00b10.02\n84.22\u00b10.01\n83.54\u00b10.06\n83.33\u00b10.04\n83.30\u00b10.04\n83.52\u00b10.06\n83.41\u00b10.04\n83.39\u00b10.04\n\nK\n1\n5\n25\n1\n5\n25\n1\n5\n25\n1\n5\n25\n1\n5\n25\n1\n5\n25\n1\n5\n25\n1\n5\n25\n\nDVAE#\n\nExp\n\n90.55\u00b10.11\n89.62\u00b10.08\n89.27\u00b10.09\n85.37\u00b10.05\n84.83\u00b10.03\n84.69\u00b10.08\n83.96\u00b10.06\n83.70\u00b10.04\n83.76\u00b10.04\n84.06\u00b10.06\n84.15\u00b10.05\n84.22\u00b10.13\n\nT\nO\nL\nG\n\nI\nN\nM\nO\n\n1 \u223c\n\n2 \u223c\n\n4 \u223c\n\ntraining the KL value is typically < 0.2. For larger \u03b2s, the KL value is smaller due to the stronger\nbias that bbb\u03b2(\u03b6\u03b6\u03b6) imposes on zzz.\nLastly, we demonstrate that the lower variance of power-function smoothing may contribute to its\nsuccess. As noted in Fig. 1, power-function smoothing potentially has moderate gradient noise while\nstill providing a good approximation of binary variables at large \u03b2. We validate this hypothesis in\nFig. 2(c) by measuring the variance of the derivative of the variational bound (with K = 1) with\nrespect to the logit of q during training of a 2-layer nonlinear model on MNIST. When comparing\nthe exponential (\u03b2 = 10) to power-function smoothing (\u03b2 = 30) at the \u03b2 that performs best for each\nsmoothing method, we \ufb01nd that power-function smoothing has signi\ufb01cantly lower variance.\n\n8\n\n1510Mean-\ufb01eldIterations10\u2212210\u22121100101KL(m(z)||\u02c6p(z))20406080100TrainingIterations(x1000)10\u2212510\u2212410\u2212310\u2212210\u22121100101102KL(m(z)||\u02c6p(z))power\u03b2=20power\u03b2=40exponential\u03b2=8exponential\u03b2=1220406080100TrainingIterations(x1000)10\u2212710\u2212610\u22125GradientVarianceexponentialDVAE#powerDVAE#exponentialDVAE++powerDVAE++\f5 Conclusions\n\nWe have introduced two approaches for relaxing Boltzmann machines to continuous distributions, and\nshown that the resulting distributions can be trained as priors in DVAEs using an importance-weighted\nbound. We have proposed a generalization of overlapping transformations that removes the need for\ncomputing the inverse CDF analytically. Using this generalization, the mixture of power-function\nsmoothing provides a good approximation of binary variables while the gradient noise remains\nmoderate. In the case of sharp power smoothing, our model outperforms previous discrete VAEs.\n\nReferences\n[1] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In The International Conference on\n\nLearning Representations (ICLR), 2014.\n\n[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approxi-\n\nmate inference in deep generative models. In International Conference on Machine Learning, 2014.\n\n[3] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Interna-\n\ntional Conference on Machine Learning, 2014.\n\n[4] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive\n\nnetworks. In International Conference on Machine Learning, 2014.\n\n[5] Yuchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. VAE learning via\n\nStein variational gradient descent. In Advances in Neural Information Processing Systems. 2017.\n\n[6] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference:\n\nBridging the gap. In International Conference on Machine Learning, pages 1218\u20131226, 2015.\n\n[7] Rafael G\u00f3mez-Bombarelli, Jennifer N Wei, David Duvenaud, Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Benjam\u00edn\nS\u00e1nchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams,\nand Al\u00e1n Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of\nmolecules. ACS Central Science, 2016.\n\n[8] Matt J Kusner, Brooks Paige, and Jos\u00e9 Miguel Hern\u00e1ndez-Lobato. Grammar variational autoencoder. In\n\nInternational Conference on Machine Learning, 2017.\n\n[9] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector\nmodel for learning long-term structure in music. In International Conference on Machine Learning, 2018.\n[10] Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for\n\nconditional program generation. In The International Conference on Learning Representations, 2018.\n\n[11] J\u00f6rg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo Jimenez Rezende. Variational memory addressing\n\nin generative models. In Advances in Neural Information Processing Systems, pages 3923\u20133932, 2017.\n\n[12] Nicolas Le Roux and Yoshua Bengio. Representational power of restricted Boltzmann machines and deep\n\nbelief networks. Neural computation, 2008.\n\n[13] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In International Conference on\n\nArti\ufb01cial Intelligence and Statistics, 2009.\n\n[14] Hugo Larochelle and Yoshua Bengio. Classi\ufb01cation using discriminative restricted Boltzmann machines.\n\nIn International Conference on Machine Learning (ICML), 2008.\n\n[15] Tu Dinh Nguyen, Dinh Phung, Viet Huynh, and Trung Le. Supervised restricted Boltzmann machines. In\n\nUAI, 2017.\n\n[16] Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. J. Mach.\n\nLearn. Res., 5:1063\u20131088, December 2004.\n\n[17] Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks.\n\nScience, 313(5786):504\u2013507, 2006.\n\n[18] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. Restricted Boltzmann machines for collab-\norative \ufb01ltering. In Proceedings of the 24th International Conference on Machine Learning, ICML \u201907,\npages 791\u2013798, New York, NY, USA, 2007. ACM.\n\n[19] Jason Tyler Rolfe. Discrete variational autoencoders. In International Conference on Learning Representa-\n\ntions (ICLR), 2017.\n\n[20] Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. DVAE++:\nIn International Conference on\n\nDiscrete variational autoencoders with overlapping transformations.\nMachine Learning (ICML), 2018.\n\n9\n\n\f[21] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov.\n\nImportance weighted autoencoders.\n\nInternational Conference on Learning Representations (ICLR), 2016.\n\nIn The\n\n[22] John Hertz, Richard Palmer, and Anders Krogh. Introduction to the theory of neural computation. 1991.\n[23] J Hubbard. Calculation of partition functions. Physical Review Letters, 3(2):77, 1959.\n[24] Zakkula Govindarajulu. Characterization of the exponential and power distributions. Scandinavian\n\nActuarial Journal, 1966(3-4):132\u2013136, 1966.\n\n[25] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\n\nlearning with deep generative models. In Advances in Neural Information Processing Systems, 2014.\n\n[26] Lars Maal\u00f8e, Marco Fraccaro, and Ole Winther. Semi-supervised generation with cluster-aware generative\n\nmodels. arXiv preprint arXiv:1704.00637, 2017.\n\n[27] Michalis Titsias RC AUEB and Miguel L\u00e1zaro-Gredilla. Local expectation gradients for black box\n\nvariational inference. In Advances in neural information processing systems, pages 2638\u20132646, 2015.\n\n[28] Seiya Tokui and Issei Sato. Evaluating the variance of likelihood-ratio gradient estimators. In International\n\nConference on Machine Learning, pages 3414\u20133423, 2017.\n\n[29] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumble-softmax. In Interna-\n\ntional Conference on Learning Representations, 2017.\n\n[30] Yoshua Bengio, Nicholas L\u00e9onard, and Aaron Courville. Estimating or propagating gradients through\n\nstochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.\n\n[31] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary\n\nstochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.\n\n[32] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of\n\ndiscrete random variables. In International Conference on Learning Representations (ICLR), 2017.\n\n[33] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement\n\nlearning. In Reinforcement Learning, pages 5\u201332. Springer, 1992.\n\n[34] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM,\n\n33(10):75\u201384, 1990.\n\n[35] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for\nstochastic neural networks. In The International Conference on Learning Representations (ICLR), 2016.\n[36] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International\n\nConference on Machine Learning, pages 2188\u20132196, 2016.\n\n[37] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:\nLow-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural\nInformation Processing Systems, pages 2624\u20132633, 2017.\n\n[38] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through\nthe void: Optimizing control variates for black-box gradient estimation. In International Conference on\nLearning Representations (ICLR), 2018.\n\n[39] Yingzhen Li and Richard E Turner. R\u00e9nyi divergence variational inference.\n\nInformation Processing Systems, pages 1073\u20131081, 2016.\n\nIn Advances in Neural\n\n[40] Max Welling and Geoffrey E Hinton. A new learning algorithm for mean \ufb01eld Boltzmann machines. In\n\nInternational Conference on Arti\ufb01cial Neural Networks, pages 351\u2013357. Springer, 2002.\n\n[41] Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles A Sutton. Continuous relaxations\nfor discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages\n3194\u20133202, 2012.\n\n[42] Alex Graves. Stochastic backpropagation through mixture density distributions.\n\narXiv:1607.05690, 2016.\n\narXiv preprint\n\n[43] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings\n\nof the 25th international conference on Machine learning, pages 872\u2013879. ACM, 2008.\n\n[44] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through\n\nprobabilistic program induction. Science, 350(6266):1332\u20131338, 2015.\n\n[45] K Hukushima and Y Iba. Population annealing and its application to a spin glass. In AIP Conference\n\nProceedings, volume 690, pages 200\u2013206. AIP, 2003.\n\n[46] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient.\nIn Proceedings of the 25th international conference on Machine learning, pages 1064\u20131071. ACM, 2008.\n\n[47] Radford M. Neal. Annealed importance sampling. Statistics and computing, 11(2):125\u2013139, 2001.\n[48] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings\n\nof the 25th international conference on Machine learning, pages 872\u2013879. ACM, 2008.\n\n10\n\n\fA Population Annealing vs. Persistence Contrastive Divergence\n\nIn this section, we compare population annealing (PA) to persistence contrastive divergence (PCD) for sampling\nin the negative phase. In Table 2, we train DVAE# with the power-function smoothing on the binarized MNIST\ndataset using PA and PCD. As shown, PA results in a comparable generative model when there is one group of\nlatent variables and better models in other cases.\n\nTable 2: The performance of DVAE# with power-function smoothing for binarized MNIST when\nPCD or PA is used in the negative phase.\n\nStruct.\n\n1 \u2014\n\n1 \u223c\n\n2 \u223c\n\n4 \u223c\n\nK\n1\n5\n25\n1\n5\n25\n1\n5\n25\n1\n5\n25\n\nPCD\n\n89.25\u00b10.04\n88.18\u00b10.08\n87.66\u00b10.09\n84.95\u00b10.05\n84.25\u00b10.04\n83.91\u00b10.05\n83.48\u00b10.04\n83.12\u00b10.04\n83.06\u00b10.03\n83.62\u00b10.06\n83.34\u00b10.06\n83.18\u00b10.05\n\nPA\n\n89.35\u00b10.06\n88.25\u00b10.03\n87.67\u00b10.07\n84.93\u00b10.02\n84.21\u00b10.02\n83.93\u00b10.06\n83.37\u00b10.02\n82.99\u00b10.04\n82.85\u00b10.03\n83.18\u00b10.05\n82.95\u00b10.07\n82.82\u00b10.02\n\nB On the Gradient Variance of the Power-function Smoothing\n\nOur experiments show that power-function smoothing performs best because it provides a better approximation of\nthe binary random variables. We demonstrate this qualitatively in Fig. 1 and quantitatively in Fig. 2(c) of the paper.\nThis is also visualized in Fig. 3. Here, we generate 106 samples from q(\u03b6) = (1 \u2212 q)r(\u03b6|z = 0) + qr(\u03b6|z = 1)\nfor q = 0.5 using both the exponential and power smoothings with different values of \u03b2 (\u03b2 \u2208 {8, 9, 10, . . . , 15}\nfor exponential, and \u03b2 \u2208 {10, 20, 30, . . . , 80} for power smoothing). The value of \u03b2 is increasing from left to\nright on each curve. The mean of |\u03b6i \u2212 zi| (for zi = 1[\u03b6i>0.5]) vs. the variance of \u2202\u03b6i/\u2202q is visualized in this\n\ufb01gure. For a given gradient variance, power function smoothing provides a closer approximation to the binary\nvariables.\n\nFigure 3: Average distance between \u03b6 and its binarized z vs. variance of \u2202\u03b6/\u2202q measured on 106\nsamples from q(\u03b6). For a given gradient variance, power function smoothing provides a closer\napproximation to the binary variables.\n\n11\n\n510152025GradientVariance0.020.040.060.080.100.12L1Distanceexponentialpower\f", "award": [], "sourceid": 931, "authors": [{"given_name": "Arash", "family_name": "Vahdat", "institution": "Quadrant.ai (D-Wave)"}, {"given_name": "Evgeny", "family_name": "Andriyash", "institution": "D-Wave"}, {"given_name": "William", "family_name": "Macready", "institution": "D-Wave"}]}