{"title": "The Thermodynamic Variational Objective", "book": "Advances in Neural Information Processing Systems", "page_first": 11525, "page_last": 11534, "abstract": "We introduce the thermodynamic variational objective (TVO) for learning in both continuous and discrete deep generative models. The TVO arises from a key connection between variational inference and thermodynamic integration that results in a tighter lower bound to the log marginal likelihood than the standard variational evidence lower bound (ELBO) while remaining as broadly applicable. We provide a computationally efficient gradient estimator for the TVO that applies to continuous, discrete, and non-reparameterizable distributions and show that the objective functions used in variational inference, variational autoencoders, wake sleep, and inference compilation are all special cases of the TVO. We use the TVO to learn both discrete and continuous deep generative models and empirically demonstrate state of the art model and inference network learning.", "full_text": "The Thermodynamic Variational Objective\n\nVaden Masrani1, Tuan Anh Le2, Frank Wood1\n\n1Department of Computer Science, University of British Columbia\n\n2Department of Brain and Cognitive Sciences, MIT\n\nAbstract\n\nWe introduce the thermodynamic variational objective (TVO) for learning in both\ncontinuous and discrete deep generative models. The TVO arises from a key\nconnection between variational inference and thermodynamic integration that\nresults in a tighter lower bound to the log marginal likelihood than the standard\nvariational evidence lower bound (ELBO) while remaining as broadly applicable.\nWe provide a computationally ef\ufb01cient gradient estimator for the TVO that applies\nto continuous, discrete, and non-reparameterizable distributions and show that the\nobjective functions used in variational inference, variational autoencoders, wake\nsleep, and inference compilation are all special cases of the TVO. 
We use the\nTVO to learn both discrete and continuous deep generative models and empirically\ndemonstrate state-of-the-art model and inference network learning.\n\n1 Introduction\n\nUnsupervised learning in richly structured deep latent variable models [1, 2] remains challenging.\nFundamental research directions include low-variance gradient estimation for discrete and continuous\nlatent variable models [3\u20137], tightening variational bounds in order to obtain better model learning [8\u201311],\nand alleviation of the associated detrimental effects on learning of inference networks [12].\nWe present the thermodynamic variational objective (TVO), which is based on a key connection we\nestablish between thermodynamic integration (TI) and amortized variational inference (VI), namely\nthat by forming a geometric path between the model and inference network, the \u201cinstantaneous\nELBO\u201d [13] that appears in VI is equivalent to the first derivative of the potential function that appears\nin TI [14, 15]. This allows us to formulate the log evidence as a one-dimensional integral of the instantaneous\nELBO over the unit interval, which we then approximate to form the TVO.\nWe demonstrate that optimizing the TVO leads to improved learning of both discrete and continuous\nlatent-variable deep generative models. The gradient estimator we derive for optimizing the TVO has\nempirically lower variance than the REINFORCE [16] estimator, and unlike the reparameterization\ntrick (which is only applicable to a limited family of continuous latent variables), applies to both\ncontinuous and discrete latent variable models.\nThe TVO is a lower bound to the log evidence which can be made arbitrarily tight. We empirically\nshow that optimizing the TVO results in better inference networks than optimizing the importance\nweighted autoencoder (IWAE) objective [8] for which tightening of the bound is known to make\ninference network learning worse [12].\n
While this problem can be ameliorated by reducing the\nvariance of the gradient estimator in the case of reparameterizable latent variables [17], resolving it\nin the case of non-reparameterizable latent variables currently involves alternating optimization of\nmodel and inference networks [18\u201320].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: The thermodynamic variational objective (TVO) (center) is a finite sum numerical approximation\nto $\\log p_\\theta(x)$, defined by the thermodynamic variational identity (TVI) (right). The ELBO (left)\nis a single partition approximation of the same integral. $\\pi_\\beta$ is given in (7).\n(Panels, left to right: $\\mathrm{ELBO}(\\theta, \\phi, x) \\le \\mathrm{TVO}(\\theta, \\phi, x) \\le \\log p_\\theta(x)$; horizontal axis: $\\beta$ from 0 to 1; vertical axis: $\\mathbb{E}_{\\pi_\\beta}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right]$.)\n\n2 The Thermodynamic Variational Objective\n\nThe evidence lower bound (ELBO), used in learning variational autoencoders (VAEs), lower bounds\nthe log evidence of a generative model $p_\\theta(x, z)$ parameterized by $\\theta$ of a latent variable $z$ and data $x$.\nIt can be written as the log evidence minus a Kullback-Leibler (KL) divergence\n\n$$\\mathrm{ELBO}(\\theta, \\phi, x) := \\log p_\\theta(x) - \\mathrm{KL}\\left(q_\\phi(z|x) \\,\\|\\, p_\\theta(z|x)\\right), \\tag{1}$$\n\nwhere $q_\\phi(z|x)$ is an inference network parameterized by $\\phi$.\n
As illustrated in Figure 1, the TVO\n\n$$\\underbrace{\\frac{1}{K}\\left[\\mathrm{ELBO}(\\theta, \\phi, x) + \\sum_{k=1}^{K-1} \\mathbb{E}_{\\pi_{\\beta_k}}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right]\\right]}_{\\mathrm{TVO}(\\theta, \\phi, x)} \\le \\underbrace{\\int_0^1 \\mathbb{E}_{\\pi_\\beta}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right] d\\beta = \\log p_\\theta(x)}_{\\text{Thermodynamic Variational Identity}} \\tag{2}$$\n\nlower bounds the log evidence by using a Riemann sum approximation to the TVI, a one-dimensional\nintegral over a scalar $\\beta$ in a unit interval which evaluates to the log model evidence $\\log p_\\theta(x)$.\nThe integrand, which is a function of $\\beta$, is an expectation of the so-called \u201cinstantaneous ELBO\u201d [13]\nunder $\\pi_\\beta(z)$, a geometric combination of $p_\\theta(x, z)$ and $q_\\phi(z|x)$ which we formally define in \u00a73.\nRemarkably, at $\\beta = 0$, the integrand equals the ELBO. This therefore allows us to view the ELBO\nas a single-term left Riemann sum of the TVI. At $\\beta = 1$, the integrand equals the evidence upper\nbound (EUBO). This offers a new unifying perspective on the VAE and wake-sleep objectives, which\nwe explore in detail in \u00a75 and Appendix G.\n\n3 Connecting Thermodynamic Integration and Variational Inference\n\nSuppose there are two unnormalized densities $\\tilde{\\pi}_i(z)$ ($i = 0, 1$) and corresponding normalizing\nconstants $Z_i := \\int \\tilde{\\pi}_i(z) \\, dz$, which together define the normalized densities $\\pi_i(z) := \\tilde{\\pi}_i(z)/Z_i$.\n
We\ncan typically evaluate the unnormalized densities but cannot evaluate the normalizing constants.\nWhile calculating the normalizing constants individually is usually intractable, thermodynamic\nintegration [14, 15] allows us to compute the log of the ratio of the normalizing constants, $\\log (Z_1/Z_0)$.\nTo do so, we first form a family of unnormalized densities (or a \u201cpath\u201d) parameterized by $\\beta \\in [0, 1]$\nbetween the two distributions of interest\n\n$$\\tilde{\\pi}_\\beta(z) := \\tilde{\\pi}_1(z)^{\\beta} \\, \\tilde{\\pi}_0(z)^{1-\\beta} \\tag{3}$$\n\nwith the corresponding normalizing constants and normalized densities\n\n$$Z_\\beta := \\int \\tilde{\\pi}_\\beta(z) \\, dz, \\quad \\text{and} \\quad \\pi_\\beta(z) := \\tilde{\\pi}_\\beta(z)/Z_\\beta. \\tag{4}$$\n\nFollowing Neal [15], we will find it useful to define a potential energy function $U_\\beta(z) := \\log \\tilde{\\pi}_\\beta(z)$\nalong with its first derivative $U'_\\beta(z) = \\frac{dU_\\beta(z)}{d\\beta}$. We can then estimate the log of the ratio of the\nnormalizing constants via the identity central to TI, derived in Appendix A,\n\n$$\\log Z_1 - \\log Z_0 = \\int_0^1 \\mathbb{E}_{\\pi_\\beta}\\left[U'_\\beta(z)\\right] d\\beta. \\tag{5}$$\n\nOur key insight connecting TI and VI is the following.\n
If we set\n\n$$\\begin{aligned} \\tilde{\\pi}_0(z) &:= q_\\phi(z|x), & Z_0 &= \\int q_\\phi(z|x) \\, dz = 1, \\\\ \\tilde{\\pi}_1(z) &:= p_\\theta(x, z), & Z_1 &= \\int p_\\theta(x, z) \\, dz = p_\\theta(x), \\end{aligned} \\tag{6}$$\n\nthis results in a geometric path between the variational distribution $q_\\phi(z|x)$ and the model $p_\\theta(x, z)$\n\n$$\\tilde{\\pi}_\\beta(z) := p_\\theta(x, z)^{\\beta} \\, q_\\phi(z|x)^{1-\\beta}, \\quad \\text{and} \\quad \\pi_\\beta(z) := \\frac{\\tilde{\\pi}_\\beta(z)}{Z_\\beta}, \\tag{7}$$\n\nwhere the first derivative of the potential is equal to the \u201cinstantaneous ELBO\u201d [13]\n\n$$U'_\\beta(z) = \\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}. \\tag{8}$$\n\nSubstituting (8) and $Z_0 = 1$ and $Z_1 = p_\\theta(x)$ into (5) results in the thermodynamic variational\nidentity\n\n$$\\log p_\\theta(x) = \\int_0^1 \\mathbb{E}_{\\pi_\\beta}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right] d\\beta. \\tag{9}$$\n\nThis means that $\\log p_\\theta(x)$ can be expressed as a one-dimensional integral of an expectation of the\ninstantaneous ELBO under $\\pi_\\beta$ from $\\beta = 0$ to $\\beta = 1$ (see Figure 1 (right)).\nTo obtain the thermodynamic variational objective (TVO) defined in (2), we lower bound the integral\nin (9) using a left Riemann sum. That this is in fact a lower bound follows from the observation that the\nintegrand is monotonically increasing, as shown in Appendix B. This is a result of our choice of path\nin (7), which allows us to show the derivative of the integrand is equal to the variance of $U'_\\beta(z)$ under\n$\\pi_\\beta(z)$ and is therefore non-negative. For equal spacing of the partitions, where $\\beta_k = k/K$, we arrive\nat the TVO in (2), illustrated in Figure 1 (middle). We present a generalized variant with non-equal\nspacing in Appendix C.\nMaximizing the $\\mathrm{ELBO}(\\theta, \\phi, x)$ can be seen as a special case of the TVO, since for $\\beta = 0$, $\\pi_\\beta(z) = q_\\phi(z|x)$,\nand so the integrand in (9) becomes $\\mathbb{E}_{q_\\phi(z|x)}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right]$, which is equivalent to the\ndefinition of ELBO in (1).\n
Because the integrand is increasing, we have\n\n$$\\mathrm{ELBO}(\\theta, \\phi, x) \\le \\mathrm{TVO}(\\theta, \\phi, x) \\le \\log p_\\theta(x), \\tag{10}$$\n\nwhich means that the TVO is an alternative to IWAE for tightening the variational bounds.\nIn Appendix D we show maximizing the TVO is equivalent to minimizing a divergence between the\nvariational distribution and the true posterior $p_\\theta(z|x)$.\nThe integrand in (9) is typically estimated by long-running Markov chain Monte Carlo chains\ncomputed at different values of $\\pi_\\beta(z)$ [21, 22]. Instead, we propose a simple importance sampling\nmechanism that allows us to reuse samples across an arbitrary number of discretizations and which is\ncompatible with gradient-based learning.\n\n4 Optimizing the TVO\n\nWe now provide a novel score-function based gradient estimator for the TVO which does not require\nthe reparameterization trick.\n\nGradients To use the TVO as a variational objective we must be able to differentiate through terms\nof the form $\\nabla_\\lambda \\mathbb{E}_{\\pi_{\\lambda,\\beta}}[f_\\lambda(z)]$, where both $\\pi_{\\lambda,\\beta}(z)$ and $f_\\lambda(z)$ are parameterized by $\\lambda$, and $\\pi_{\\lambda,\\beta}(z)$\ncontains an intractable normalizing constant. In the TVO, $f_\\lambda(z)$ is the instantaneous ELBO and\n$\\lambda := \\{\\theta, \\phi\\}$, but our method is applicable for generic $f_\\lambda(z) : \\mathbb{R}^M \\to \\mathbb{R}$.\n
We can compute such terms using the covariance gradient estimator (derived in Appendix E)\n\n$$\\nabla_\\lambda \\mathbb{E}_{\\pi_{\\lambda,\\beta}}[f_\\lambda(z)] = \\mathbb{E}_{\\pi_{\\lambda,\\beta}}[\\nabla_\\lambda f_\\lambda(z)] + \\mathrm{Cov}_{\\pi_{\\lambda,\\beta}}\\left[\\nabla_\\lambda \\log \\tilde{\\pi}_{\\lambda,\\beta}(z), f_\\lambda(z)\\right]. \\tag{11}$$\n\nWe emphasize that, like REINFORCE, our estimator relies on the log-derivative trick, but crucially, unlike\nREINFORCE, doesn\u2019t require differentiating through the normalizing constant $Z_\\beta = \\int \\tilde{\\pi}_{\\lambda,\\beta}(z) \\, dz$.\nWe clarify the relationship between our estimator and REINFORCE in Appendix F.\nThe covariance in (11) has the same dimensionality as $\\lambda \\in \\mathbb{R}^D$ because it is between\n$\\nabla_\\lambda \\log \\tilde{\\pi}_{\\lambda,\\beta}(z) \\in \\mathbb{R}^D$ and $f_\\lambda(z) \\in \\mathbb{R}$ and is defined as\n\n$$\\mathrm{Cov}_{\\pi_{\\lambda,\\beta}}(a, b) := \\mathbb{E}_{\\pi_{\\lambda,\\beta}}\\left[(a - \\mathbb{E}_{\\pi_{\\lambda,\\beta}}[a])(b - \\mathbb{E}_{\\pi_{\\lambda,\\beta}}[b])\\right]. \\tag{12}$$\n\nTo estimate this, we first estimate the inner expectations which are then used in estimating the outer\nexpectation. Thus, estimating the gradient in (11) requires estimating expectations under $\\pi_\\beta$.\nExpectations By using $q_\\phi(z|x)$ as the proposal distribution in $S$-sample importance sampling,\nwe can estimate an expectation of a general function $f(z)$ under any $\\pi_\\beta(z)$ by simply raising each\nunnormalized importance weight to the power $\\beta$ and normalizing:\n\n$$\\mathbb{E}_{\\pi_\\beta}[f(z)] \\approx \\sum_{s=1}^{S} \\bar{w}_s^{\\beta} f(z_s), \\tag{13}$$\n\nwhere $z_s \\sim q_\\phi(z|x)$, $\\bar{w}_s^{\\beta} := w_s^{\\beta} / \\sum_{s'=1}^{S} w_{s'}^{\\beta}$ and $w_s := \\frac{p_\\theta(x, z_s)}{q_\\phi(z_s|x)}$. This follows because each\nunnormalized importance weight can be expressed as\n\n$$\\frac{\\tilde{\\pi}_\\beta(z_s)}{q_\\phi(z_s|x)} = \\frac{p_\\theta(x, z_s)^{\\beta} \\, q_\\phi(z_s|x)^{1-\\beta}}{q_\\phi(z_s|x)} = \\frac{p_\\theta(x, z_s)^{\\beta}}{q_\\phi(z_s|x)^{\\beta}} = \\left(\\frac{p_\\theta(x, z_s)}{q_\\phi(z_s|x)}\\right)^{\\beta} = w_s^{\\beta}. \\tag{14}$$\n\nInstead of sampling $SK$ times, we can reuse $S$ samples $z_s \\sim q_\\phi(z|x)$ across an arbitrary number\nof terms, since evaluating the normalized weight $\\bar{w}_s^{\\beta_k}$ only requires raising each weight to the\ncorresponding power $\\beta_k$ before normalizing.\n
Reusing samples in this way is a use of the method known as\n\u201ccommon random numbers\u201d [23] and we include experimental results showing it reduces the variance of\nthe covariance estimator in Appendix F.\nThe covariance estimator does not require $z$ to be reparameterizable, which means it can be used\nin the cases of both non-reparameterizable continuous latent variables and discrete latent variables\n(without modifying the model using continuous relaxations [24, 25]).\n\n5 Generalizing Variational Objectives\n\nAs previously observed, the single-term left Riemann approximation of the TVI equals the ELBO, while the\nright endpoint ($\\beta = 1$) is equal to the EUBO. The EUBO is analogous to the ELBO but under the true\nposterior and is defined as\n\n$$\\mathrm{EUBO}(\\theta, \\phi, x) := \\mathbb{E}_{p_\\theta(z|x)}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right]. \\tag{15}$$\n\nWe also have the following identity\n\n$$\\mathrm{EUBO}(\\theta, \\phi, x) = \\log p_\\theta(x) + \\mathrm{KL}\\left(p_\\theta(z|x) \\,\\|\\, q_\\phi(z|x)\\right), \\tag{16}$$\n\nwhich should be contrasted against (1). We define an upper-bound variant of the TVO using the right\n(rather than left) Riemann sum. Setting $\\beta_k = k/K$,\n\n$$\\mathrm{TVO}^U_K(\\theta, \\phi, x) := \\frac{1}{K}\\left[\\mathrm{EUBO}(\\theta, \\phi, x) + \\sum_{k=1}^{K-1} \\mathbb{E}_{\\pi_{\\beta_k}}\\left[\\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)}\\right]\\right] \\ge \\log p_\\theta(x). \\tag{17}$$\n\nThe wake-sleep (WS) [18] and reweighted wake-sleep (RWS) [19] algorithms have traditionally\nbeen viewed as using different objectives during the wake and sleep phases. The endpoints of the\nTVI, which the TVO approximates, correspond to the two objectives used in wake-sleep. We can\ntherefore view WS as alternating between $\\mathrm{TVO}^L_1$ and $\\mathrm{TVO}^U_1$, i.e. a left and right single term\nRiemann approximation to the TVI. We show this algebraically in Appendix G and additionally show\nhow the objectives used in variational inference [26], variational autoencoders [1, 2], and inference\ncompilation [27] are all special cases of $\\mathrm{TVO}^L_1$ and $\\mathrm{TVO}^U_1$.\n
We refer the reader to [20] for a further\ndiscussion of the wake-sleep algorithm.\n\nFigure 2: Investigation of how number of particles $S$, number of partitions $K$, and $\\beta_1$ affect learning\nof the generative model. In the first three plots (a-c), we vary $S$ and $K$ for different values of $\\beta_1$ and\nobserve that while $S$ should be as high as possible, there is an optimal value for $K$, beyond which\nperformance begins to degrade. Assuming $\\beta_1$ is well-chosen, we see that as few as $K = 2$ partitions\ncan result in good model learning, as seen in the last plot (d).\n\n6 Related Work\n\nThermodynamic integration was originally developed in physics to calculate the difference in\nfree energy of two molecular systems [28]. Neal [15] and Gelman and Meng [14] then introduced\nTI into the statistics community to calculate the ratios of normalizing constants of general\nprobability models. TI is now commonly used in phylogenetics to calculate the Bayes\nfactor $B = p(x|M_1)/p(x|M_0)$, where $M_0, M_1$ are two models specifying (for instance) tree\ntopologies and branch lengths [22, 29, 30]. We took inspiration from Fan et al. [31] who replaced\nthe \u201cpower posterior\u201d $p(\\theta|x, M, \\beta) = p(x|\\theta, M)^{\\beta} p(\\theta, M)/Z_\\beta$ of Xie et al. [29] with\n$p(\\theta|x, M, \\beta) = [p(x|\\theta, M) p(\\theta|M)]^{\\beta} [p_0(\\theta|M)]^{1-\\beta}/Z_\\beta$, where $p_0(\\theta|M)$ is a tractable reference\ndistribution chosen to facilitate sampling. That the integrand in (9) is strictly increasing was observed\nby Lartillot and Philippe [22].\nWe refer the reader to Titsias and Ruiz [32] for a summary of the numerous advances in variational\nmethods over recent years. The method most similar to our own was proposed by Bornschein et al.\n[33], who introduced another way of improving deep generative modeling through geometrically\ninterpolating between distributions and using importance sampling to estimate gradients.\n
Unlike\nthe TVO, they define a lower bound on the marginal likelihood of a modified model defined as\n$(p_\\theta(x, z) q_\\phi(z|x) q(x))^{1/2}/Z$ where $q(x)$ is an auxiliary distribution.\nGrosse et al. [34] studied annealed importance sampling (AIS), a related technique that estimates\npartition functions using a sequence of intermediate distributions to form a product of ratios of\nimportance weights. They observe the geometric path taken in AIS is equivalent to minimizing a\nweighted sum of KL divergences, and use this insight to motivate an alternative path. To the best of\nour knowledge, our work is the first to explicitly connect TI and VI.\n\n7 Experiments\n\n7.1 Discrete Deep Generative Models\n\nWe use the TVO to learn the parameters of a deep generative model with discrete latent variables.$^1$\nWe use the binarized MNIST dataset with the standard train/validation/test split of 50k/10k/10k [35].\nWe train a sigmoid belief network, described in detail in Appendix I, using the TVO with the Adam\noptimizer. In the first set of experiments we investigate the effect of the discretization $\\beta_{0:K}$, number\nof partitions $K$ and number of particles $S$. We then compare against variational inference for Monte\nCarlo objectives (VIMCO) and RWS (with the wake-$\\phi$ objective), state-of-the-art IWAE-based methods\nfor learning discrete latent variable models [20]. All figures have been smoothed for clarity.\nThe effect of $S$, $K$, and $\\beta$ locations We expect that increasing the number of partitions $K$ makes the\nRiemann sum approximate the integral over $\\beta$ more tightly. However, with each additional term, we\nadd noise due to the use of importance sampling to estimate the expectation $\\mathbb{E}_{\\pi_\\beta}[\\log p/q]$. Importance\nsampling estimates of points on the curve further to the right are likely to be more biased because $\\pi_\\beta$\ngets further from $q$ as we increase $\\beta$.\n
We found the combination of these two effects means that there\nis a \u201csweet spot,\u201d or an optimal number of partitions beyond which adding more partitions becomes\ndetrimental to performance.\n\n$^1$Code to reproduce all experiments is available at: https://github.com/vmasrani/tvo.\n\nFigure 3: Comparisons with baselines on a held out test set. (Left) Learning curves for different\nmethods. The TVO outperforms other methods both in terms of speed of convergence and the learned\nmodel for $S < 50$. At $S = 50$ VIMCO achieves a higher test log evidence but takes longer to converge\nthan the TVO. (Right) KL divergence between current $q$ and $p$ (which measures how well $q$ \u201ctracks\u201d\n$p$) is lowest for TVO.\n\nWe have empirically observed that the curve in Figure 1 is often rising sharply from $\\beta = 0$ until a\npoint of maximum curvature $\\beta^*$, after which it is almost flat until $\\beta = 1$, as seen in Figure 4. We\nhypothesized that if $\\beta_1$ is located far before $\\beta^*$ (the point of maximum curvature), a large number\nof additional partitions would be needed to capture additional area, while if $\\beta_1$ is located after $\\beta^*$,\nadditional partitions would simply incur a high cost of bias without significantly tightening the\nbound. To investigate this, we choose small ($10^{-10}$), medium ($0.1$) and large ($0.9$) values of $\\beta_1$, and\nlogarithmically space the remaining $\\beta_{2:K}$ between $\\beta_1$ and $1$. For each value of $\\beta_1$ we train the discrete\ngenerative model for $K \\in \\{2, 5, 10, \\dots, 50\\}$ and $S \\in \\{2, 5, 10, 50\\}$, and show the test log evidence\nat the last iteration of each trial, approximated by evaluating the IWAE loss with 5000 samples.\nOur hypothesis is corroborated in Figure 2, where we observe in Figure 2a\nthat for $\\beta_1 = 10^{-10}$ a large number of partitions are needed\nto approximate the integral.\n
In Figure 2b we increase $\\beta_1$ to $10^{-1}$ and\nobserve only a few partitions are needed to improve performance, after\nwhich adding additional partitions becomes detrimental to model\nlearning.\nFrom Figure 2c we can see that if $\\beta_1$ is chosen to be well beyond $\\beta^*$,\nthe Riemann sum cannot recover the \u201clost\u201d area even if the number of\npartitions is increased. That the performance does not degrade in this\ncase is due to the fact that for sufficiently high $\\beta_k$, the curve in Figure 1\nis flat and therefore $\\pi_{\\beta_k} \\approx \\pi_{\\beta_{k+1}} \\approx p_\\theta(z|x)$. We also observe that increasing\nthe number of samples $S$, which decreases importance sampling\nbias per partition, improves performance in all cases.\nIn our second experiment, shown in Figure 2d, we fix $K = 2$ and\ninvestigate the quality of the learned generative model for different $\\beta_1$. This plot clearly shows $\\beta^*$ is\nsomewhere near $0.3$, as model learning improves as $\\beta_1$ approaches this point then begins to degrade.\nGiven these results, we recommend using as many particles $S$ as possible and performing a hyperparameter\nsearch over $\\beta_1$ (with $K = 2$) when using the TVO. We leave finding the optimal\nplacement of discretization points to future work.\nPerformance In Figure 3 (left), we compare the TVO against VIMCO and RWS with the wake-$\\phi$\nobjective, the state-of-the-art IWAE-based methods for learning discrete latent variable models [20].\nFor $S < 50$, the TVO outperforms both methods in terms of speed of convergence and the final test\nlog evidence $\\log p_\\theta(x)$, estimated using 5000 IWAE particles as before. At $S = 50$ VIMCO achieves a\nhigher test log evidence but converges more slowly.\n\nFigure 4: The location of $\\beta^*$, the point of maximum curvature. (Horizontal axis: $\\beta$ from 0 to 1, with $\\beta^*$ marked.)\n\nFigure 5: Computational and gradient estimator efficiency.\n
(Left) Time and memory efficiency of the\nTVO with increasing number of partitions vs baselines, measured for 100 iterations of optimization.\nIncreasing the number of partitions is much cheaper than increasing the number of particles. (Right)\nStandard deviation of the gradient estimator for each objective. TVO is lowest variance, VIMCO is\nhighest variance, RWS is in the middle.\n\nWe also investigate the quality of the learned inference network by plotting the KL divergence\n(averaged over the test set) between the current q and current p as training progresses (Figure 3\n(right)). This indicates how well q \u201ctracks\u201d p. This is estimated as log evidence minus ELBO where\nthe former is estimated as before and the latter is estimated using 5000 Monte Carlo samples. The KL\nis lowest for TVO.\nSomewhat surprisingly, for all methods, increasing the number of particles makes the KL worse. We\nspeculate that this is due to the \u201ctighter bounds\u201d effect of Rainforth et al. [12], who showed that\nincreasing the number of samples can positively affect model learning but adversely affect inference\nnetwork learning, thereby increasing the KL between the two.\nEfficiency Since we use K = 2 partitions for the same number of particles S, the time and memory\ncomplexity of TVO is double that of other methods. While this is true, in both time and memory cases,\nthe constant factor for increasing S is much higher than for increasing K. As shown in Figure 5 (left),\nit is virtually free to increase the number of partitions. This is because for each new particle, we must\nadditionally sample from the inference network and score the sample under both p and q to obtain\na weight. On the other hand, we can reuse the S samples and corresponding weights in estimating\nvalues for the K + 1 terms in the Riemann sum.\n
Thus, the region of the computation graph that\nis dependent on $K$ is after the expensive sampling and scoring, and only involves performing basic\noperations on additional matrices of size $S \\times K$.\nVariance In Figure 5 (right), we plot the standard deviation of the gradient estimator for each method,\nwhere we compute the standard deviation for the $d$th element of the gradient estimated over 10\nsamples and take the average across all $D$.\nThe gradient estimator of the TVO has lower variance than both VIMCO, which uses REINFORCE\nwith a control variate as a gradient estimator, and RWS, which can calculate the gradient without\nreparameterizing or using the log-derivative trick. At $S = 5$, RWS has lower gradient variance but its\nperformance is worse in terms of both model and inference learning.\n\n7.2 Continuous Deep Generative Models\n\nUsing the binarized MNIST dataset and experimental design described above, we also evaluated our\nmethod on a deep generative model with continuous latent variables. The model is described in detail\nin Appendix I. For each $S \\in \\{5, 10, 50\\}$ we sweep over $K \\in \\{2, \\dots, 6\\}$ and 20 $\\beta_1$ values linearly\nspaced between $10^{-2}$ and $0.9$. We optimize the objectives using the Adam optimizer with default\nparameters.\nPerformance In Figure 6 (left), we train the model using the TVO and compare against the same\nmodel trained using the single sample VAE objective and multisample IWAE objective. The TVO\noutperforms the VAE and performs competitively with IWAE at 50 samples, despite not using the\nreparameterization trick. IWAE is the top performing objective in all cases. As in the discrete case,\nincreasing the number of particles $S$ improves model learning for all methods, but the improvement\nis most significant for the TVO. Interestingly, VAE performance actually decreases when the number\nof samples increases from 10 to 50. A similar effect was noticed by Burda et al. [8] on the Omniglot\ndataset.\n\nFigure 6: Learning curves for learning continuous deep generative models using different objectives.\n(Left) Despite not using the reparameterization trick, TVO outperforms VAEs and is competitive with\nIWAE at 50 samples. For all $S$, IWAE > TVO > VAE. (Right) Standard deviation of the gradient\nestimator for each objective. The TVO has lower variance than IWAE but higher than VAE.\n\n
Variance In Figure 6 (right), we plot the standard deviation of each method\u2019s gradient estimator. The\nstandard deviation of the TVO estimator falls squarely between VAE (best) and IWAE (worst). The\nvariance of each method improves as the number of samples increases, and as in the discrete model,\nthe improvement is most significant in the case of TVO. Unlike in the discrete case, the variance\ndoes not decrease as the optimization proceeds, but plateaus early and then gradually increases. In\nAppendix F we include additional experiments to evaluate the properties of the covariance gradient\nestimator when used on the ELBO.\nFor both IWAE and the TVO, increasing the number of samples leads to decreased gradient variance\nand improved model learning. However, IWAE has the best performance but the highest variance\nacross the three objectives. These results lend support to the conclusions of Rainforth et al. [12] who\nobserve that the variance of a gradient estimator \u201cis not always a good barometer for the effectiveness\nof a gradient estimation scheme.\u201d\n\n8 Conclusions\n\nThe thermodynamic variational objective represents a new way to tighten evidence bounds and\nis based on a tight connection between variational inference and thermodynamic integration. We\ndemonstrated that optimizing the TVO can have a positive impact on the learning of discrete deep\ngenerative models and can perform as well as using the reparameterization trick to learn continuous\ndeep generative models.\nThe weakness of our method lies in choosing the discretization points.\n
This does, however, point\nout opportunities for future work wherein we adaptively select optimal positions of the $\\beta_{1:K}$ points,\nperhaps using techniques from the Bayesian numerical quadrature literature [36\u201338].\nThe approximate path integration perspective provided by our development of the TVO also sheds\nlight on the connection between otherwise disparate deep generative model learning techniques. In\nparticular, the TVO integration perspective points to ways to improve wake-sleep via tightening the\nEUBO using similar integral upper-bounding techniques. Further experimentation is warranted to\nexplore how TVO insights can be applied to all special cases of the TVO including non-amortized\nvariational inference and to the use of the TVO as a complement to annealed importance sampling for\nfinal model evidence evaluation.\n\nAcknowledgments\n\nWe would like to thank Trevor Campbell, Adam \u015acibior, Boyan Beronov, and Saifuddin Syed for their\nhelpful comments on early drafts of this manuscript. Tuan Anh Le\u2019s research leading to these results\nis supported by EPSRC DTA and Google (project code DF6700) studentships. We acknowledge\nthe support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the\nCanada CIFAR AI Chairs Program, Compute Canada, Intel, and DARPA under its D3M and LWLL\nprograms.\n\nReferences\n\n[1] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.\n\n[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.\n\n[3] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791\u20131799, 2014.\n\n[4] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives.\n
In International Conference on Machine Learning, pages 2188\u20132196, 2016.\n\n[5] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624\u20132633, 2017.\n\n[6] Christian Naesseth, Francisco Ruiz, Scott Linderman, and David Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics, pages 489\u2013498, 2017.\n\n[7] Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441\u2013452, 2018.\n\n[8] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.\n\n[9] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6576\u20136586, 2017.\n\n[10] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-encoding sequential Monte Carlo. In International Conference on Learning Representations, 2018.\n\n[11] Christian Naesseth, Scott Linderman, Rajesh Ranganath, and David Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, 2018.\n\n[12] Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In ICML, 2018.\n\n[13] David Blei. Variational inference: Foundations and innovations. URL http://www.cs.columbia.edu/~blei/talks/Blei_VI_tutorial.pdf.\n\n[14] Andrew Gelman and Xiao-Li Meng.\n
Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical science, pages 163\u2013185, 1998.\n\n[15] Radford M Neal. Probabilistic inference using Markov chain Monte Carlo methods. 1993.\n\n[16] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229\u2013256, 1992.\n\n[17] George Tucker, Dieterich Lawson, Shixiang Gu, and Chris J Maddison. Doubly reparameterized gradient estimators for Monte Carlo objectives. arXiv preprint arXiv:1810.04152, 2018.\n\n[18] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The \u201cwake-sleep\u201d algorithm for unsupervised neural networks. Science, 268(5214):1158\u20131161, 1995.\n\n[19] J\u00f6rg Bornschein and Yoshua Bengio. Reweighted wake-sleep. In International Conference on Learning Representations, 2015.\n\n[20] Tuan Anh Le, Adam R Kosiorek, N Siddharth, Yee Whye Teh, and Frank Wood. Revisiting reweighted wake-sleep. arXiv preprint arXiv:1805.10469, 2018.\n\n[21] Nial Friel and Anthony N Pettitt. Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):589\u2013607, 2008.\n\n[22] Nicolas Lartillot and Herv\u00e9 Philippe. Computing Bayes factors using thermodynamic integration. Systematic biology, 55(2):195\u2013207, 2006.\n\n[23] Art B. Owen. Monte Carlo theory, methods and examples. 2013.\n\n[24] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.\n\n[25] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.\n\n[26] David M Blei, Alp Kucukelbir, and Jon D McAuliffe.\n
Variational inference: A review for statisticians.\nJournal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\n[27] Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic\nprogramming. arXiv preprint arXiv:1610.09900, 2016.\n\n[28] DJ Evans. Molecular dynamics simulation of statistical mechanical systems. International School of\nPhysics, \u201cEnrico Fermi\u201d (July 22-August 2, 1985), to be published, 1986.\n\n[29] Wangang Xie, Paul O Lewis, Yu Fan, Lynn Kuo, and Ming-Hui Chen. Improving marginal likelihood\nestimation for Bayesian phylogenetic model selection. Systematic biology, 60(2):150\u2013160, 2010.\n\n[30] Nicolas Rodrigue and St\u00e9phane Aris-Brosou. Fast Bayesian choice of phylogenetic models: Prospecting\ndata augmentation\u2013based thermodynamic integration. Systematic biology, 60(6):881\u2013887, 2011.\n\n[31] Yu Fan, Rui Wu, Ming-Hui Chen, Lynn Kuo, and Paul O Lewis. Choosing among partition models in\nBayesian phylogenetics. Molecular biology and evolution, 28(1):523\u2013532, 2010.\n\n[32] Michalis K. Titsias and Francisco J. R. Ruiz. Unbiased implicit variational inference, 2018.\n\n[33] J\u00f6rg Bornschein, Samira Shabanian, Asja Fischer, and Yoshua Bengio. Bidirectional Helmholtz machines.\nIn International Conference on Machine Learning, pages 2511\u20132519, 2016.\n\n[34] Roger B Grosse, Chris J Maddison, and Ruslan R Salakhutdinov. Annealing between distributions by\naveraging moments. In Advances in Neural Information Processing Systems, pages 2769\u20132777, 2013.\n\n[35] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings\nof the 25th international conference on Machine learning, pages 872\u2013879. ACM, 2008.\n\n[36] Anthony O\u2019Hagan. Bayes\u2013Hermite quadrature. Journal of statistical planning and inference,\n29(3):245\u2013260, 1991.\n\n[37] Carl Edward Rasmussen and Zoubin Ghahramani. 
Bayesian Monte Carlo. Advances in neural information\nprocessing systems, pages 505\u2013512, 2003.\n\n[38] Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K Duvenaud, Stephen J Roberts, and\nCarl E Rasmussen. Active learning of model evidence using Bayesian quadrature. In Advances in neural\ninformation processing systems, pages 46\u201354, 2012.\n\n[39] Shinto Eguchi et al. A differential geometric approach to statistical inference on the basis of contrast\nfunctionals. Hiroshima mathematical journal, 15(2):341\u2013391, 1985.\n\n[40] P L\u2019Ecuyer. On the interchange of derivative and expectation. Management Science, to appear, 1993.\n\n[41] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient\nestimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471\u20131530, 2004.\n\n[42] Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In\nProceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 538\u2013545. Morgan\nKaufmann Publishers Inc., 2001.\n\n[43] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through\nthe void: Optimizing control variates for black-box gradient estimation. In International Conference on\nLearning Representations, 2018.", "award": [], "sourceid": 6139, "authors": [{"given_name": "Vaden", "family_name": "Masrani", "institution": "University of British Columbia"}, {"given_name": "Tuan Anh", "family_name": "Le", "institution": "MIT"}, {"given_name": "Frank", "family_name": "Wood", "institution": "University of British Columbia"}]}