{"title": "GumBolt: Extending Gumbel trick to Boltzmann priors", "book": "Advances in Neural Information Processing Systems", "page_first": 4061, "page_last": 4070, "abstract": "Boltzmann machines (BMs) are appealing candidates for powerful priors in variational autoencoders (VAEs), as they are capable of capturing nontrivial and multi-modal distributions over discrete variables. However, non-differentiability of the discrete units prohibits using the reparameterization trick, essential for low-noise back propagation. The Gumbel trick resolves this problem in a consistent way by relaxing the variables and distributions, but it is incompatible with BM priors. Here, we propose the GumBolt, a model that extends the Gumbel trick to BM priors in VAEs. GumBolt is significantly simpler than the recently proposed methods with BM prior and outperforms them by a considerable margin. It achieves state-of-the-art performance on permutation invariant MNIST and OMNIGLOT datasets in the scope of models with only discrete latent variables.  Moreover, the performance can be further improved by allowing multi-sampled (importance-weighted) estimation of log-likelihood in training, which was not possible with previous models.", "full_text": "GumBolt: Extending Gumbel trick to Boltzmann\n\npriors\n\nAmir H. Khoshaman\nD-Wave Systems Inc.\u21e4\nkhoshaman@gmail.com\n\nMohammad H. Amin\nD-Wave Systems Inc.\nSimon Fraser University\nmhsamin@dwavesys.com\n\nAbstract\n\nBoltzmann machines (BMs) are appealing candidates for powerful priors in varia-\ntional autoencoders (VAEs), as they are capable of capturing nontrivial and multi-\nmodal distributions over discrete variables. However, non-differentiability of the\ndiscrete units prohibits using the reparameterization trick, essential for low-noise\nback propagation. The Gumbel trick resolves this problem in a consistent way by\nrelaxing the variables and distributions, but it is incompatible with BM priors. 
Here, we propose the GumBolt, a model that extends the Gumbel trick to BM priors in VAEs. GumBolt is significantly simpler than the recently proposed methods with BM priors and outperforms them by a considerable margin. It achieves state-of-the-art performance on permutation-invariant MNIST and OMNIGLOT datasets in the scope of models with only discrete latent variables. Moreover, the performance can be further improved by allowing multi-sampled (importance-weighted) estimation of the log-likelihood during training, which was not possible with previous models.

1 Introduction

Variational autoencoders (VAEs) are generative models with the useful feature of learning representations of input data in their latent space. A VAE comprises a prior (the probability distribution of the latent space), a decoder, and an encoder (also referred to as the approximating posterior or the inference network). There have been efforts devoted to making each of these components more powerful. The decoder can be made richer by using autoregressive methods such as pixelCNNs and pixelRNNs (Oord et al., 2016) and MADEs (Germain et al., 2015). However, VAEs tend to ignore the latent code (in the sense described by Yeung et al. (2017)) in the presence of powerful decoders (Chen et al., 2016; Gulrajani et al., 2016; Goyal et al., 2017). There is also a myriad of works strengthening the encoder distribution (Kingma et al., 2016; Rezende and Mohamed, 2015; Salimans et al., 2015). Improving the prior is manifestly appealing, since it directly translates into a more powerful generative model. Moreover, a rich structure in the latent space is one of the main purposes of VAEs. Chen et al.
(2016) observed that a more powerful autoregressive prior and a simple encoder is commensurate with a powerful inverse autoregressive approximating posterior and a simple prior.

Boltzmann machines (BMs) are known to represent intractable and multi-modal distributions (Le Roux and Bengio, 2008), ideal for priors in VAEs, since they can lead to a more expressive generative model. However, BMs contain discrete variables, which are incompatible with the reparameterization trick required for efficient propagation of gradients through stochastic units. Discrete latent variables are desirable in many applications, such as semi-supervised learning

*Currently at Borealis AI

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(Kingma et al., 2014), semi-supervised generation (Maaløe et al., 2017) and hard attention models (Serrà et al., 2018; Gregor et al., 2015), to name a few. Many operations, such as choosing between models or variables, are naturally expressed using discrete variables (Yeung et al., 2017).

Rolfe (2016) proposed the first model to use a BM in the prior of a VAE, i.e., a discrete VAE (dVAE). The main idea is to introduce an auxiliary continuous variable (Fig. 1(a)) for each discrete variable through a "smoothing distribution". The discrete variables are marginalized out of the autoencoding term by imposing certain constraints on the form of the relaxing distribution. However, the discrete variables cannot be marginalized out of the remaining term in the objective (the KL term). Their proposed approach relies on properties of the smoothing distribution to evaluate these terms. In Appendix B, we show that this approach is equivalent to REINFORCE when dealing with some parts of the KL term (i.e., the cross-entropy term). Vahdat et al.
(2018) proposed an improved version, dVAE++, that uses a modified distribution for the smoothing variables but has the same form for the autoencoding part (see Sec. 2.1). The qVAE (Khoshaman et al., 2018) expanded the dVAE to operate with a quantum Boltzmann machine (QBM) prior (Amin et al., 2016). A major shortcoming of these methods is that they are unable to use multi-sampled (importance-weighted) estimates of the objective function during training, which can improve the performance.

To use the reparameterization trick directly with discrete variables (without marginalization), a continuous and differentiable proxy is required. The Gumbel (reparameterization) trick, independently developed by Jang et al. (2016) and Maddison et al. (2016), achieves this by relaxing discrete distributions. However, BMs, and in general discrete Markov random fields (MRFs), are incompatible with this method. Relaxation of discrete variables (rather than distributions) for the case of a factorial categorical prior (Gumbel-Softmax) was also investigated in both works. It is not obvious whether such relaxation of discrete variables would work with BM priors.

The contributions of this work are as follows: we propose the GumBolt, which extends the Gumbel trick to BM and MRF priors and is significantly simpler than previous models that marginalize the discrete variables. We show that BMs are compatible with relaxation of discrete variables (rather than distributions) in the Gumbel trick. We propose an objective using such relaxation and show that the main limitations of previous models with BM priors can be circumvented; we do not need marginalization of the discrete variables, and we can have an importance-weighted objective. GumBolt considerably outperforms the previous works in a wide series of experiments on permutation-invariant MNIST and OMNIGLOT datasets, even without the importance-weighted objective (Sec. 5).
Increasing the number of importance weights can further improve the performance. We obtain state-of-the-art results on these datasets among models with only discrete latent variables.

2 Background

2.1 Variational autoencoders

Consider a generative model involving observable variables x and latent variables z. The joint probability distribution can be decomposed as p_θ(x, z) = p_θ(z) p_θ(x|z). The first and second terms on the right-hand side are the prior and decoder distributions, respectively, which are parametrized by θ. Calculating the marginal p_θ(x) involves performing intractable, high-dimensional sums or integrals. Assume an element x of the dataset X, comprising N independent samples from an unknown underlying distribution, is given. VAEs operate by introducing a family of approximating posteriors q_φ(z|x) and maximizing a lower bound (also known as the ELBO), L(x; θ, φ), on the log-likelihood log p_θ(x) (Kingma and Welling, 2013):

    log p_θ(x) ≥ L(x; θ, φ) = E_{q_φ(z|x)} [ log ( p_θ(x, z) / q_φ(z|x) ) ]
                            = E_{q_φ(z|x)} [ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p_θ(z) ),    (1)

where the first term on the right-hand side is the autoencoding term and D_KL is the Kullback-Leibler divergence (Bishop, 2011). In VAEs, the parameters of the distributions (such as the means in the case of Bernoulli variables) are calculated using neural nets. To backpropagate through the latent variables z, the reparameterization trick is used; z is reparametrized as a deterministic function f(φ, x, ρ), where the stochasticity of z is relegated to another random variable, ρ, drawn from a distribution that does not depend on φ.
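For a single binary unit, the relaxation reviewed in the next subsection replaces the hard threshold with a tempered sigmoid. A minimal NumPy sketch (the logit value and sample count are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reparam_binary(logit, rho, tau):
    """Relaxed reparameterization of a binary unit (Maddison et al., 2016).

    z    = H(logit + sigmoid^{-1}(rho)) is the exact, non-differentiable sample;
    zeta = sigmoid((logit + sigmoid^{-1}(rho)) / tau) is its smooth proxy.
    """
    noise = np.log(rho) - np.log(1.0 - rho)   # inverse sigmoid of uniform noise
    z = (logit + noise > 0).astype(float)     # Heaviside: exact discrete sample
    zeta = sigmoid((logit + noise) / tau)     # differentiable proxy, Eq. 2
    return z, zeta

rng = np.random.default_rng(0)
l = 0.7                                       # example logit, so E[z] = sigmoid(0.7)
rho = rng.uniform(size=100_000)
z, zeta = reparam_binary(l, rho, tau=0.01)
# As tau -> 0 the proxy zeta matches the discrete sample z almost everywhere.
```

The gradient of the loss flows through `zeta`, while `z` itself would block backpropagation.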
Note that it is impossible to backpropagate if z is discrete, since f is not differentiable.

2.2 Gumbel trick

The non-differentiability of f can be resolved by finding a relaxed proxy for the discrete variables. Assume a binary unit, z, with mean q̄ and logit l; i.e., p(z = 1) = q̄ = σ(l), where σ(l) ≡ 1/(1 + exp(−l)) is the sigmoid function. Since σ(l) is a monotonic function, we can reparametrize z as z = H(ρ − (1 − q̄)) = H(l + σ^{−1}(ρ)) (Maddison et al., 2016), where H is the Heaviside function, ρ ∼ U with U being a uniform distribution in the range [0, 1], and σ^{−1}(ρ) = log(ρ) − log(1 − ρ) is the inverse sigmoid or logit function. This transformation results in a non-differentiable reparameterization, but it can be smoothed by replacing the Heaviside function with a sigmoid function at temperature τ, i.e., H(·) → σ(·/τ). Thus, we introduce the continuous proxy (Maddison et al., 2016):

    f(φ, x, ρ) = ζ = σ( ( l(φ, x) + σ^{−1}(ρ) ) / τ ).    (2)

The continuous ζ is differentiable and is equal to the discrete z in the limit τ → 0.

2.3 Boltzmann machine

Our goal is to use a BM as the prior. A BM is a probabilistic energy model described by

    p_θ(z) = e^{−E_θ(z)} / Z_θ,    (3)

where E_θ(z) is the energy function, and Z_θ = Σ_{z} e^{−E_θ(z)} is the partition function; z is a vector of binary variables. Since finding p_θ(z) is typically intractable, it is common to use sampling techniques to estimate the gradients. To facilitate MCMC sampling using the Gibbs-block technique, the connectivity of the latent variables is assumed to be bipartite; i.e., z is decomposed as [z_1, z_2], giving

    E_θ(z) = a^T z_1 + b^T z_2 + z_2^T W z_1,    (4)

where a, b and W are the biases (on z_1 and z_2, respectively) and weights.
This bipartite structure is known as the restricted Boltzmann machine (RBM).

3 Proposed approach

(a) d(q)VAE(++)   (b) Concrete, Gumbel-Softmax   (c) GumBolt

Figure 1: Schematic of the discussed models with discrete variables in their latent space. The dashed red and solid blue arrows represent the inference network and the generative model, respectively. (a) dVAE, qVAE (Khoshaman et al., 2018) and dVAE++ have the same structure. They involve an auxiliary continuous variable, ζ, for each discrete variable, z, provided by the same conditional probability distribution, r(ζ|z), in both the generative and approximating-posterior networks. (b) Concrete and Gumbel-Softmax apply the Gumbel trick to the discrete variables to obtain the ζs that appear in both the inference and generative models. (c) GumBolt only involves discrete variables in the generative model, and the relaxed ζs are used in the inference model during training. Note that during evaluation, the temperature is set to zero, leading to ζ = z.

The importance-weighted or multi-sampled objective of a VAE with BM prior can be written as:

    log p_θ(x) ≥ L_k(x; θ, φ) = E_{Π_i q_φ(z_i|x)} [ log (1/k) Σ_{i=1}^{k} p_θ(z_i) p_θ(x|z_i) / q_φ(z_i|x) ]
                              = E_{Π_i q_φ(z_i|x)} [ log (1/k) Σ_{i=1}^{k} e^{−E_θ(z_i)} p_θ(x|z_i) / q_φ(z_i|x) ] − log Z_θ,    (5)

where k is the number of samples or importance weights over which the Monte Carlo objective is calculated (Mnih and Rezende, 2016). The z_i are independent vectors sampled from q_φ(z_i|x). Note that we have taken Z_θ out of the argument of the expectation value, since it is independent of z. The partition function is intractable, but its derivative can be estimated using sampling:

    ∇_θ log Z_θ = ∇_θ log Σ_{z} e^{−E_θ(z)} = − [ Σ_{z} ∇_θ E_θ(z) e^{−E_θ(z)} ] / [ Σ_{z} e^{−E_θ(z)} ] = − E_{p_θ(z)} [ ∇_θ E_θ(z) ].    (6)

Here, Σ_{z} involves summing over all possible configurations of the binary vector z. The objective L_k(x; θ, φ) cannot be used for training, since it involves non-differentiable discrete variables. This can be resolved by relaxing the distributions:

    log p_θ(x) ≈ L̃_k(x; θ, φ) = E_{Π_i q_φ(ζ_i|x)} [ log (1/k) Σ_{i=1}^{k} p_θ(ζ_i) p_θ(x|ζ_i) / q_φ(ζ_i|x) ]
                               = E_{Π_i q_φ(ζ_i|x)} [ log (1/k) Σ_{i=1}^{k} e^{−E_θ(ζ_i)} p_θ(x|ζ_i) / q_φ(ζ_i|x) ] − log Z̃_θ.

Here, ζ_i is a continuous variable sampled from Eq. 2, which is consistent with the Gumbel probability q_φ(ζ_i|x) defined in (Maddison et al., 2016), and p_θ(ζ) ≡ e^{−E_θ(ζ)} / Z̃_θ, where Z̃_θ ≡ ∫ dζ e^{−E_θ(ζ)}. The expectation distribution is the joint distribution over the independent ζ_i samples. Notice that log Z̃_θ is different from log Z_θ; therefore its derivatives cannot be estimated using discrete samples from a BM, making this method inapplicable to BM priors. The derivatives could be estimated using samples from a continuous distribution, which is very different from the BM distribution. Analytical calculation of the expectations, suggested for the Bernoulli prior by Maddison et al. (2016), is also infeasible for BMs, since it requires exhaustively summing over all possible configurations of the binary units.

3.1 GumBolt probability proxy

To replace log Z̃_θ with log Z_θ, we introduce a proxy probability distribution:

    p̆_θ(ζ) ≡ e^{−E_θ(ζ)} / Z_θ.    (7)

Note that p̆_θ(ζ) is not a true (normalized) probability density function, but p̆_θ(ζ) → p_θ(z) as τ → 0. Now consider the following theorems (see Appendix A for proofs):

Theorem 1.
For any polynomial function E_θ(z) of n_z binary variables z ∈ {0, 1}^{n_z}, the extrema of the relaxed function E_θ(ζ) with ζ ∈ [0, 1]^{n_z} reside on the vertices of the hypercube, i.e., ζ_extr ∈ {0, 1}^{n_z}.

Theorem 2. For any polynomial function E_θ(z) of n_z binary variables z ∈ {0, 1}^{n_z}, the proxy probability p̆_θ(ζ) ≡ e^{−E_θ(ζ)}/Z_θ, with ζ ∈ [0, 1]^{n_z}, is a lower bound on the true probability p_θ(ζ) ≡ e^{−E_θ(ζ)}/Z̃_θ, i.e., p̆_θ(ζ) ≤ p_θ(ζ), where Z_θ ≡ Σ_{z} e^{−E_θ(z)} and Z̃_θ ≡ ∫_{[0,1]^{n_z}} dζ e^{−E_θ(ζ)}.

Therefore, according to Theorem 2, replacing p_θ(ζ) with p̆_θ(ζ), we obtain a lower bound on L̃_k(x; θ, φ):

    L̆_k(x; θ, φ) = E_{Π_i q_φ(ζ_i|x)} [ log (1/k) Σ_{i=1}^{k} e^{−E_θ(ζ_i)} p_θ(x|ζ_i) / q_φ(ζ_i|x) ] − log Z_θ ≤ L̃_k(x; θ, φ).    (8)

This allows the reparameterization trick, while making it possible to use sampling to estimate the gradients. The structure of our model with a BM prior is portrayed in Fig. 1(c), where both continuous and discrete variables are used. Notice that in the limit τ → 0, p̆_θ(ζ_i) becomes a probability mass function (pmf) p_θ(z_i), while q_φ(ζ_i|x) remains a probability density function (pdf). To resolve this inconsistency, we replace q_φ(ζ_i|x) with the Bernoulli pmf form q̆_φ(ζ_i|x), with log q̆_φ(ζ_i|x) = ζ_i log q̄_i + (1 − ζ_i) log(1 − q̄_i), which approaches q_φ(z_i|x) when τ → 0. The training objective is then

    F_k(x; θ, φ) = E_{Π_i q_φ(ζ_i|x)} [ log (1/k) Σ_{i=1}^{k} e^{−E_θ(ζ_i)} p_θ(x|ζ_i) / q̆_φ(ζ_i|x) ] − log Z_θ,    (9)

which becomes L_k(x; θ, φ) at τ = 0, as desired (see Fig. 2(a) for the relationship among the different objectives). This is the analog of the Gumbel-Softmax trick (Jang et al., 2016) when applied to BMs.
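Theorem 2 is equivalent to the statement Z̃_θ ≤ Z_θ, which can be spot-checked numerically on a toy two-variable energy (the coefficients below are made up; this is a sketch, not the paper's code):

```python
import itertools
import numpy as np

# Toy relaxed energy E(zeta) = a . zeta + w * zeta_0 * zeta_1 on [0, 1]^2.
a = np.array([0.4, -0.7])
w = 0.9

# Discrete partition function Z: sum of exp(-E) over the vertices {0, 1}^2.
Z = sum(np.exp(-(a @ np.array(v) + w * v[0] * v[1]))
        for v in itertools.product([0, 1], repeat=2))

# Relaxed partition function Z~: midpoint-rule integral of exp(-E) over [0, 1]^2.
n = 400
grid = (np.arange(n) + 0.5) / n
X, Y = np.meshgrid(grid, grid)
Zt = np.exp(-(a[0] * X + a[1] * Y + w * X * Y)).mean()
# Since E is multilinear, exp(-E) is convex along each coordinate,
# so the integral never exceeds the vertex sum: Zt <= Z.
```

The same enumeration-versus-integration comparison works for any small multilinear energy; for a real RBM both quantities are of course intractable.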
During training, τ should be kept small for the continuous variables to stay close to the discrete variables, a common practice with Gumbel relaxation (Tucker et al., 2017). For evaluation, τ is set to zero, leading to an unbiased evaluation of the objective function. For generation, discrete samples from the BM are directly fed into the decoder to obtain the probabilities of each input feature.

3.2 Compatibility of BM with discrete variable relaxation

The term in the discrete objective that involves the prior distribution is E_{q_φ(z|x)}[log p_θ(z)]. When z is replaced with ζ, there is no guarantee that the parameters that optimize E_{q_φ(ζ|x)}[log p_θ(ζ)] would also optimize the discrete version. This happens naturally for the Bernoulli distribution used in (Jang et al., 2016), since the extrema of the prior term in the objective, log p(ζ) = ζ log p̄ + (1 − ζ) log(1 − p̄), occur at the boundaries (i.e., when ζ = 1 or ζ = 0). This means that throughout the training, the values of ζ are pushed towards the boundary points, consistent with the discrete objective.

In the case of a BM prior, according to Theorem 1 (proved in Appendix A), the extrema of log p̆_θ(ζ) ∝ −E_θ(ζ) also occur on the boundaries; this shows that having a BM rather than a factorial Bernoulli distribution does not exacerbate the training of GumBolt.

4 Related works

Several approaches have been devised to calculate the derivative of the expectation of a function with respect to the parameters of a Bernoulli distribution, I ≡ ∇_φ E_{q_φ(z)}[f(z)]:

1. Analytical method: for simple functions, e.g., f(z) = z, one can analytically calculate the expectation and obtain I = ∇_φ E_{q_φ(z)}[z] = ∇_φ q̄, where q̄ is the mean of the Bernoulli distribution. This is an unbiased estimator with zero variance, but it can only be applied to very simple functions.
This approach is frequently used in semi-supervised learning (Kingma et al., 2014) by summing over the different categories.

2. Straight-through method: continuous proxies are used in backpropagation to evaluate derivatives, but discrete units are used in forward propagation (Bengio et al., 2013; Raiko et al., 2014).

3. REINFORCE trick: I = E_{q_φ(z)}[f(z) ∇_φ log q_φ(z)]; it has high variance, which can be reduced by variance-reduction techniques (Williams, 1992).

4. Reparameterization trick: this method, as delineated in Secs. 2.1-2.2, is biased except in the limit where the proxies approach the discrete variables.

5. Marginalization: if possible, one can marginalize the discrete variables out of some parts of the loss function (Rolfe, 2016).

NVIL (Mnih and Gregor, 2014) and its importance-weighted successor VIMCO (Mnih and Rezende, 2016) use (3) with input-dependent signals obtained from neural networks to subtract from a baseline in order to reduce the variance of the estimator. REBAR (Tucker et al., 2017) and its generalization, RELAX (Grathwohl et al., 2017), use (3) and employ (4) in their control variates obtained using the Gumbel trick. DARN (Gregor et al., 2013) and MuProp (Gu et al., 2015) apply a Taylor expansion of the function f(z) to synthesize baselines. dVAE and dVAE++ (Fig. 1(a)), which are the only works with BM priors, operate primarily based on (5) in their autoencoding term and use a combination of (1-4) for their KL term. In Appendix B, we show that dVAE has elements of REINFORCE in calculating the derivative of the KL term. Our approach, GumBolt, exploits (4), and does not require marginalizing out the discrete units.

5 Experiments

In order to explore the effectiveness of GumBolt, we present the results of a wide set of experiments conducted on standard feed-forward structures that have been used to study models with discrete latent variables (Maddison et al., 2016; Tucker et al., 2017; Vahdat et al., 2018). First, we evaluate GumBolt against the dVAE and dVAE++ baselines, all in the same framework and structure. We also demonstrate empirically that the GumBolt objective, Eq. 9, faithfully follows the non-differentiable discrete objective throughout the training. We then note the relation between our model and other models that involve discrete variables. We also gauge the performance advantage GumBolt obtains from the BM by removing the couplings of the BM and re-evaluating the model.

5.1 Comparison against dVAE and dVAE++

We compare the models on the statically binarized MNIST (Salakhutdinov and Murray, 2008) and OMNIGLOT (Lake et al., 2015) datasets, with the usual compartmentalization into training, validation, and test sets. The 4000-sample estimates of the log-likelihood (Burda et al., 2015) of the models are reported in Table 1.

Table 1: Test-set log-likelihood of GumBolt compared against dVAE and dVAE++. k represents the number of samples used to calculate the objective during training. Note that dVAE and dVAE++ are only consistent with k = 1. See the main text for more details.

                 dVAE    dVAE++  GumBolt
                 k = 1   k = 1   k = 1    k = 5    k = 20
    MNIST    –   90.40   90.11   88.88    88.18    87.45
             ∼   85.41   85.72   84.86    84.31    83.87
             ––  87.35   85.71   85.42    84.65    84.46
             ∼∼  84.33   84.75   83.28    83.01    82.75
    OMNIGLOT –   106.83  106.01  105.00   103.99   103.69
             ∼   101.97  102.85  101.61   101.02   100.68
             ––  102.62  101.98  100.62   99.38    99.36
             ∼∼  101.75  100.70  99.82    99.32    98.81
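The multi-sample estimate referenced above (Burda et al., 2015) can be sketched on a toy model where log p(x) is known in closed form; a 1-D Gaussian latent stands in for the binary latent space, and the crude proposal q(z|x) = N(0, 1) is a made-up choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: z ~ N(0, 1), x|z ~ N(z, 1), so exactly p(x) = N(x; 0, 2).
x = 1.5
log_px_exact = -0.5 * (np.log(2 * np.pi * 2) + x**2 / 2)

def iw_bound(k, reps=3000):
    """Monte Carlo estimate of L_k = E[log (1/k) sum_i w_i] over `reps` draws."""
    z = rng.standard_normal((reps, k))          # proposal q(z|x) = N(0, 1) = prior
    # log w_i = log p(z) + log p(x|z) - log q(z|x); prior and proposal cancel.
    log_w = -0.5 * (np.log(2 * np.pi) + (x - z) ** 2)
    m = log_w.max(axis=1, keepdims=True)        # stable log-mean-exp over the k axis
    return (m.squeeze() + np.log(np.exp(log_w - m).mean(axis=1))).mean()

L_1, L_10 = iw_bound(1), iw_bound(10)
# In expectation L_1 <= L_10 <= log p(x): the bound tightens as k grows.
```

With a BM prior the same estimator applies, except that log p_θ(z) is only known up to log Z_θ, which is why Eq. 5 pulls the partition function out of the expectation.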
The structures used are the same as those of (Vahdat et al., 2018), which were in turn adopted from (Tucker et al., 2017) and (Maddison et al., 2016). We performed experiments with dVAE, dVAE++ and GumBolt on the same structure, and set the temperature to zero during evaluation (the results reported in (Vahdat et al., 2018) are calculated using non-zero temperatures). The inference network is chosen to be either factorial or to have two hierarchies (Fig. 1(c)). In the case of two hierarchies, we have

    q_φ(z|x) = q_φ(z_1, z_2|x) = q_φ(z_1|x) q_φ(z_2|z_1, x),

where z = [z_1, z_2].

The meaning of the symbols in Table 1 is as follows: – and ∼ represent linear and nonlinear layers in the encoder and decoder neural networks, and the number of stochastic layers (hierarchies) in the encoder is equal to the number of symbols. The dimensionality of the latent space is 200 times the number of symbols; e.g., ∼∼ means two stochastic layers (just as in Fig. 1(c)), with 2 hidden layers (each one containing 200 deterministic units) in the encoder. The dimensionality of each stochastic layer is equal to 200 in the encoder network; the generative network is a 200 × 200 RBM (a total of 400 stochastic units) for ∼∼ and ––, whereas for – and ∼ it is a 100 × 100 RBM. Note that in the case of ––, only one layer of 200 deterministic units is used in each of the two hierarchies. The decoder network receives the samples from the RBM and probabilistically maps them into the input space using one or two layers of deterministic units. Since the RBM has a bipartite structure, our model has two stochastic layers in the generative model.

The chosen hyper-parameters are as follows: 1M iterations of parameter updates were carried out using the ADAM algorithm (Kingma and Ba, 2014) with the default settings and a batch size of 100. The initial learning rate is 3 × 10^{−3} and is subsequently reduced by a factor of 0.3 at 60%, 75%, and 95% of the total iterations. KL annealing (Sønderby et al., 2016) was used via a linear schedule during 30% of the total iterations. The value of the temperature, τ, was set to 1/10 and 1/7 for all the experiments involving GumBolt, 1/5 for the experiments with dVAE, and 1/8 for dVAE++, on the MNIST and OMNIGLOT datasets, respectively; these values were cross-validated from {1/10, 1/9, ..., 1/5}. GumBolt shows the same average performance for temperatures in the range {1/9, 1/8, 1/7}. The reported results are the averages from performing the experiments 5 times. The standard deviations are in all cases less than 0.15; we avoid presenting them individually to keep the table less cluttered. We used the batch-normalization algorithm (Ioffe and Szegedy, 2015) along with tanh nonlinearities. Sampling the RBM was done by performing 200 steps of Gibbs updates for every mini-batch, in accordance with our baselines, using persistent contrastive divergence (PCD) (Tieleman, 2008). We observed that with 2 and 20 PCD steps instead, the performance of our best model on the MNIST dataset deteriorates by 0.35 and 0.12 nats on average, respectively. In order to estimate the log-partition function, log Z_θ, a GPU implementation of the parallel tempering algorithm with bridge sampling was used (Desjardins et al., 2010; Bennett, 1976; Shirts and Chodera, 2008), with parameters set to ensure that the variance of log Z_θ is less than 0.01: 20K burn-in steps were followed by 100K sweeps, repeated 20 times (runs), with a pilot run to determine the inverse temperatures (such that the replica-exchange rates are approximately 0.5).

We underscore several important points regarding Table 1. First, when one sample is used in the training objective (k = 1), GumBolt outperforms dVAE and dVAE++ in all cases.
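The block-Gibbs updates used for PCD exploit the bipartite energy of Eq. 4: given z_1, the units of z_2 are conditionally independent, and vice versa, and the resulting samples estimate the gradient of Eq. 6 (for the biases a, ∇_a log Z_θ = −E_p[z_1]). A minimal sketch with made-up parameters, comparing the Gibbs estimate of E_p[z_1] against exhaustive enumeration (feasible only for a toy RBM):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM: E(z) = a.z1 + b.z2 + z2^T W z1, with p(z) ∝ exp(-E(z)).
n1, n2 = 3, 3
a = rng.normal(scale=0.2, size=n1)
b = rng.normal(scale=0.2, size=n2)
W = rng.normal(scale=0.2, size=(n2, n1))

# Exact E_p[z1] by enumeration; -E_p[z1] equals grad_a log Z (Eq. 6).
states = [(np.array(s1), np.array(s2))
          for s1 in itertools.product([0, 1], repeat=n1)
          for s2 in itertools.product([0, 1], repeat=n2)]
wts = np.array([np.exp(-(a @ z1 + b @ z2 + z2 @ W @ z1)) for z1, z2 in states])
p = wts / wts.sum()
exact = sum(pi * z1 for pi, (z1, _) in zip(p, states))

# Persistent block-Gibbs chain estimating the same expectation.
z1 = (rng.uniform(size=n1) < 0.5).astype(float)
total = np.zeros(n1)
n_sweeps = 20_000
for _ in range(n_sweeps):
    z2 = (rng.uniform(size=n2) < sigmoid(-(b + W @ z1))).astype(float)
    z1 = (rng.uniform(size=n1) < sigmoid(-(a + W.T @ z2))).astype(float)
    total += z1
estimate = total / n_sweeps
```

In training, the chain state is kept across parameter updates (the "persistent" in PCD) instead of being re-burned for every mini-batch.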
This can be due to the efficient use of the reparameterization trick and the absence of REINFORCE elements in the structure of GumBolt, as opposed to dVAE (Appendix B). Second, the previous models do not apply when k > 1, whereas GumBolt allows importance-weighted objectives according to Eq. 9. We see that in all cases, adding more samples to the training objective enhances the performance of the model.

Fig. 2(b) depicts the k = 20 estimates of the GumBolt and discrete objectives on the training and validation sets during training. It can be seen that overfitting does not occur, since all the objectives improve throughout the training. Also, note that the differentiable GumBolt proxy closely follows the non-differentiable discrete objective. The kinks in the learning curves are caused by our stepwise changes of the learning rate and are not an artifact of the model.

Figure 2: (a) Relationship between the different objectives. The functional dependence on θ and φ has been suppressed for brevity. (b) The values of the discrete (L) and GumBolt (F) objectives (with k = 20 for all objectives) throughout the training on the training and validation sets (of the MNIST dataset) for a GumBolt with the ∼∼ structure during 10^6 iterations of training. The subscripts "val" and "tr" correspond to the validation and training sets, respectively. The abrupt changes are caused by the stepwise annealing of the learning rate. This figure signifies that the differentiable GumBolt objective faithfully follows the non-differentiable discrete objective, leading to no overfitting caused by following a wrong objective.
We did not use early stopping in our experiments.

5.2 Comparison with other discrete models and the importance of powerful priors

If the BM prior is replaced with a factorial Bernoulli distribution, GumBolt transforms into CONCRETE (Maddison et al., 2016) (when continuous variables are used inside discrete pdfs) and Gumbel-Softmax (Jang et al., 2016). This can be achieved by setting the couplings (W) of the BM to zero and keeping the biases. Since the performance of CONCRETE and Gumbel-Softmax has been extensively compared against other models (Maddison et al., 2016; Jang et al., 2016; Tucker et al., 2017; Grathwohl et al., 2017), we do not repeat these experiments here; we note, however, that CONCRETE performs favorably compared to other discrete latent-variable models in most cases.

Table 2: Performance of GumBolt (test-set log-likelihood) in the presence and absence of coupling weights. -nW in the second column signifies that the elements of the coupling matrix W are set to zero throughout the training, rather than just during evaluation. Removing the weights significantly degrades the performance of GumBolt.

                 GumBolt  GumBolt-nW
                 k = 20   k = 20
    MNIST    –   87.45    99.94
             ∼   83.87    93.50
             ∼∼  82.75    88.01
    OMNIGLOT –   103.69   112.21
             ∼   100.68   107.22
             ∼∼  98.81    105.01

An interesting question is how much of the performance advantage of GumBolt is caused by the powerful BM prior. We have studied this in Table 2 by setting the couplings of the BM to 0 throughout the training (denoted by GumBolt-nW). GumBolt with couplings significantly outperforms GumBolt-nW. It was shown in (Vahdat et al., 2018) that dVAE and dVAE++ outperform other models with discrete latent variables (REBAR, RELAX, VIMCO, CONCRETE and Gumbel-Softmax) on the same structure.
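Setting W = 0, as in GumBolt-nW, makes the prior factorial: the RBM then carries no more structure than independent Bernoulli units, which is easy to verify by enumeration (a sketch with made-up biases):

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a = np.array([0.5, -1.0, 0.2])   # biases; the couplings W are set to zero

# BM probabilities with W = 0: p(z) = exp(-a.z) / Z, enumerated exactly.
states = [np.array(s) for s in itertools.product([0, 1], repeat=3)]
unnorm = np.array([np.exp(-(a @ z)) for z in states])
p_bm = unnorm / unnorm.sum()

# Product of independent Bernoullis with means sigmoid(-a):
# p(z_i = 1) / p(z_i = 0) = exp(-a_i), so the two distributions coincide.
q = sigmoid(-a)
p_factorial = np.array([np.prod(np.where(z == 1, q, 1 - q)) for z in states])
```

This is the sense in which GumBolt-nW matches the CONCRETE / Gumbel-Softmax setting: all the multi-modal structure of the prior lives in the couplings.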
By outperforming the previous models with BM priors, our model achieves state-of-the-art performance in the scope of models with discrete latent variables.

Another important question is whether some of the improved performance in the presence of BMs can be salvaged in GumBolt-nW by having more powerful neural nets in the decoder. We observed that making the decoder's neural nets wider and deeper does not improve the performance of GumBolt-nW. This predictably suggests that the increased probabilistic capability of the prior cannot be obtained by simply having a more deterministically powerful decoder.

6 Conclusion

In this work, we have proposed the GumBolt, which extends the Gumbel trick to Markov random fields and BMs. We have shown that this approach is effective and, on the entirety of a wide host of structures, outperforms the other models that use BMs in their priors. GumBolt is much simpler than previous models, which require marginalization of the discrete variables, and it achieves state-of-the-art performance on the MNIST and OMNIGLOT datasets in the context of models with only discrete variables.

References

Amin, M. H., Andriyash, E., Rolfe, J., Kulchytskyy, B., and Melko, R. (2016). Quantum Boltzmann machine.

Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Bennett, C. H. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245-268.

Bishop, C. M. (2011). Pattern Recognition and Machine Learning. Springer, New York, 1st ed. 2006, corr. 2nd printing 2011 edition.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.

Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P.
(2016). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.\n\nDesjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered markov\nchain monte carlo for training of restricted boltzmann machines. In Proceedings of the thirteenth\ninternational conference on arti\ufb01cial intelligence and statistics, pages 145\u2013152.\n\nGermain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). Made: masked autoencoder for\ndistribution estimation. In Proceedings of the 32nd International Conference on Machine Learning\n(ICML-15), pages 881\u2013889.\n\nGoyal, A. G. A. P., Sordoni, A., C\u00f4t\u00e9, M.-A., Ke, N., and Bengio, Y. (2017). Z-forcing: Training\nstochastic recurrent networks. In Advances in Neural Information Processing Systems, pages\n6716\u20136726.\n\nGrathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. (2017). Backpropagation\nthrough the void: Optimizing control variates for black-box gradient estimation. arXiv preprint\narXiv:1711.00123.\n\nGregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. (2015). Draw: A recurrent\n\nneural network for image generation. arXiv preprint arXiv:1502.04623.\n\nGregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2013). Deep autoregressive\n\nnetworks. arXiv preprint arXiv:1310.8499.\n\nGu, S., Levine, S., Sutskever, I., and Mnih, A. (2015). Muprop: Unbiased backpropagation for\n\nstochastic neural networks. arXiv preprint arXiv:1511.05176.\n\nGulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. (2016).\n\nPixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013.\n\nIoffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by\nreducing internal covariate shift. In International Conference on Machine Learning, pages 448\u2013\n456.\n\nJang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with gumbel-softmax. 
arXiv preprint arXiv:1611.01144.

Khoshaman, A., Vinci, W., Denis, B., Andriyash, E., and Amin, M. H. (2018). Quantum variational autoencoder. Quantum Science and Technology, 4(1):014001.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649.

Maaløe, L., Fraccaro, M., and Winther, O. (2017). Semi-supervised generation with cluster-aware generative models. arXiv preprint arXiv:1704.00637.

Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.

Mnih, A. and Rezende, D. (2016). Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks.
arXiv preprint arXiv:1601.06759.

Raiko, T., Berglund, M., Alain, G., and Dinh, L. (2014). Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Rolfe, J. T. (2016). Discrete variational autoencoders. arXiv preprint arXiv:1609.02200.

Ross, S. M. (2013). Applied Probability Models with Optimization Applications. Courier Corporation.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM.

Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226.

Serrà, J., Surís, D., Miron, M., and Karatzoglou, A. (2018). Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423.

Shirts, M. R. and Chodera, J. D. (2008). Statistically optimal analysis of samples from multiple equilibrium states. The Journal of Chemical Physics, 129(12):124105.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM.

Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633.

Vahdat, A., Macready, W. G., Bian, Z., and Khoshaman, A.
(2018). DVAE++: Discrete variational autoencoders with overlapping transformations. arXiv preprint arXiv:1802.04920.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.

Yeung, S., Kannan, A., Dauphin, Y., and Fei-Fei, L. (2017). Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643.