{"title": "Annealing between distributions by averaging moments", "book": "Advances in Neural Information Processing Systems", "page_first": 2769, "page_last": 2777, "abstract": "Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and an intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families: averaging the moments of the initial and target distributions. We derive an asymptotically optimal piecewise linear schedule for the moments path and show that it performs at least as well as geometric averages with a linear schedule. Moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models, including Deep Belief Networks and Deep Boltzmann Machines.", "full_text": "Annealing Between Distributions by\n\nAveraging Moments\n\nRoger Grosse\n\nComp. Sci. & AI Lab\n\nMIT\n\nCambridge, MA 02139\n\nChris J. Maddison\n\nRuslan Salakhutdinov\n\nDept. of Computer Science\n\nDepts. of Statistics and Comp. Sci.,\n\nUniversity of Toronto\nToronto, ON M5S 3G4\n\nUniversity of Toronto\n\nToronto, ON M5S 3G4, Canada\n\nAbstract\n\nMany powerful Monte Carlo techniques for estimating partition functions, such\nas annealed importance sampling (AIS), are based on sampling from a sequence\nof intermediate distributions which interpolate between a tractable initial distribu-\ntion and the intractable target distribution. The near-universal practice is to use\ngeometric averages of the initial and target distributions, but alternative paths can\nperform substantially better. 
We present a novel sequence of intermediate distribu-\ntions for exponential families de\ufb01ned by averaging the moments of the initial and\ntarget distributions. We analyze the asymptotic performance of both the geomet-\nric and moment averages paths and derive an asymptotically optimal piecewise\nlinear schedule. AIS with moment averaging performs well empirically at esti-\nmating partition functions of restricted Boltzmann machines (RBMs), which form\nthe building blocks of many deep learning models.\n\nIntroduction\n\n1\nMany generative models are de\ufb01ned in terms of an unnormalized probability distribution, and com-\nputing the probability of a state requires computing the (usually intractable) partition function. This\nis problematic for model selection, since one often wishes to compute the probability assigned to\nheld-out test data. While partition function estimation is intractable in general, there has been ex-\ntensive research on variational [1, 2, 3] and sampling-based [4, 5, 6] approximations. In the context\nof model comparison, annealed importance sampling (AIS) [4] is especially widely used because\ngiven enough computing time, it can provide high-accuracy estimates. AIS has enabled precise\nquantitative comparisons of powerful generative models in image statistics [7, 8] and deep learning\n[9, 10, 11]. Unfortunately, applying AIS in practice can be computationally expensive and require\nlaborious hand-tuning of annealing schedules. Because of this, many generative models still have\nnot been quantitatively compared in terms of held-out likelihood.\nAIS requires de\ufb01ning a sequence of intermediate distributions which interpolate between a tractable\ninitial distribution and the intractable target distribution. Typically, one uses geometric averages of\nthe initial and target distributions. 
Tantalizingly, [12] derived the optimal paths for some toy mod-\nels in the context of path sampling and showed that they vastly outperformed geometric averages.\nHowever, as choosing an optimal path is generally intractable, geometric averages still predominate.\nIn this paper, we present a theoretical framework for evaluating alternative paths. We propose a novel\npath de\ufb01ned by averaging moments of the initial and target distributions. We show that geometric\naverages and moment averages optimize different variational objectives, derive an asymptotically\noptimal piecewise linear schedule, and analyze the asymptotic performance of both paths. Our\nproposed path often outperforms geometric averages at estimating partition functions of restricted\nBoltzmann machines (RBMs).\n\n1\n\n\f2 Estimating Partition Functions\nSuppose we have a probability distribution pb(x) = fb(x)/Zb de\ufb01ned on a space X , where fb(x)\ncan be computed ef\ufb01ciently for a given x \u2208 X , and we are interested in estimating the partition\nfunction Zb. Annealed importance sampling (AIS) is an algorithm which estimates Zb by gradu-\nally changing, or \u201cannealing,\u201d a distribution. In particular, one must specify a sequence of K + 1\nintermediate distributions pk(x) = fk(x)/Zk for k = 0, . . . K, where pa(x) = p0(x) is a tractable\ninitial distribution, and pb(x) = pK(x) is the intractable target distribution. For simplicity, assume\nall distributions are strictly positive on X . For each pk, one must also specify an MCMC transi-\ntion operator Tk (e.g. Gibbs sampling) which leaves pk invariant. AIS alternates between MCMC\ntransitions and importance sampling updates, as shown in Alg 1.\nThe output of AIS is an unbiased estimate \u02c6Zb\nof Zb. Remarkably, unbiasedness holds even in\nthe context of non-equilibrium samples along\nthe chain [4, 13]. 
However, unless the intermediate\ndistributions and transition operators are carefully chosen, \u1e90b may have high variance\nand be far from Zb with high probability.\n\nAlgorithm 1 Annealed Importance Sampling\n\nfor i = 1 to M do\n    x0 \u2190 sample from p0(x)\n    w(i) \u2190 Za\n    for k = 1 to K do\n        w(i) \u2190 w(i) fk(xk\u22121) / fk\u22121(xk\u22121)\n        xk \u2190 sample from Tk(x | xk\u22121)\n    end for\nend for\nreturn \u1e90b = \u2211_{i=1}^{M} w(i) / M\n\nThe mathematical formulation of AIS leaves much flexibility for choosing intermediate distributions.\nHowever, one typically defines a path \u03b3 : [0, 1] \u2192 P through some family P of distributions.\nThe intermediate distributions pk are chosen to be points along this path corresponding to a schedule\n0 = \u03b20 < \u03b21 < . . . < \u03b2K = 1.\nOne typically uses the geometric path \u03b3GA, defined in terms of geometric averages of pa and pb:\n\n$$p_\\beta(x) = f_\\beta(x)/Z(\\beta) = f_a(x)^{1-\\beta} f_b(x)^{\\beta}/Z(\\beta). \\qquad (1)$$\n\nCommonly, fa is the uniform distribution, and (1) reduces to $p_\\beta(x) = f_b(x)^{\\beta}/Z(\\beta)$. This motivates\nthe term \u201cannealing,\u201d and \u03b2 resembles an inverse temperature parameter. As in simulated annealing,\nthe \u201chotter\u201d distributions often allow faster mixing between modes which are isolated in pb.\nAIS is closely related to a broader family of techniques for posterior inference and partition function\nestimation, all based on the following identity from statistical physics:\n\n$$\\log Z_b - \\log Z_a = \\int_0^1 \\mathbb{E}_{x \\sim p_\\beta}\\left[ \\frac{d}{d\\beta} \\log f_\\beta(x) \\right] d\\beta. \\qquad (2)$$\n\nThermodynamic integration [14] estimates (2) using numerical quadrature, and path sampling [12]\ndoes so with Monte Carlo integration. The weight update in AIS can be seen as a finite difference\napproximation. 
Tempered transitions [15] is a Metropolis-Hastings proposal operator which heats\nup and cools down the distribution, and computes an acceptance ratio by approximating (2).\nThe choices of a path and a schedule are central to all of these methods. Most work on adapting paths\nhas focused on tuning schedules along a geometric path [15, 16, 17]. [15] showed that the geometric\nschedule was optimal for annealing the scale parameter of a Gaussian, and [16] extended this result\nmore broadly. The aim of this paper is to propose, analyze, and evaluate a novel alternative to \u03b3GA\nbased on averaging moments of the initial and target distributions.\n\n3 Analyzing AIS Paths\nWhen analyzing AIS, it is common to assume perfect transitions, i.e. that each transition operator\nTk returns an independent and exact sample from the distribution pk [4]. This corresponds to\nthe (somewhat idealized) situation where the Markov chains mix quickly. As Neal [4] pointed out,\nassuming perfect transitions, the Central Limit Theorem shows that the samples w(i) are approximately\nlog-normally distributed. In this case, the variances var(w(i)) and var(log w(i)) are both\nmonotonically related to E[log w(i)]. Therefore, our analysis focuses on E[log w(i)].\nAssuming perfect transitions, the expected log weights are given by:\n\n$$\\mathbb{E}[\\log w^{(i)}] = \\log Z_a + \\sum_{k=0}^{K-1} \\mathbb{E}_{p_k}[\\log f_{k+1}(x) - \\log f_k(x)] = \\log Z_b - \\sum_{k=0}^{K-1} D_{\\mathrm{KL}}(p_k \\| p_{k+1}). \\qquad (3)$$\n\nIn other words, each log w(i) can be seen as a biased estimator of log Zb, where the bias \u03b4 =\nlog Zb \u2212 E[log w(i)] is given by the sum of KL divergences $\\sum_{k=0}^{K-1} D_{\\mathrm{KL}}(p_k \\| p_{k+1})$.\nSuppose P is a family of probability distributions parameterized by \u03b8 \u2208 \u0398, and the K + 1 distributions\np0, . . . , pK are chosen to be linearly spaced along a path \u03b3 : [0, 1] \u2192 P. Let \u03b8(\u03b2) represent\nthe parameters of the distribution \u03b3(\u03b2). 
As K is increased, the bias \u03b4 decays like 1/K, and the\nasymptotic behavior is determined by a functional F(\u03b3).\nTheorem 1. Suppose K + 1 distributions pk are linearly spaced along a path \u03b3. Assuming perfect\ntransitions, if \u03b8(\u03b2) and the Fisher information matrix $G_\\theta(\\beta) = \\mathrm{cov}_{x \\sim p_\\theta}(\\nabla_\\theta \\log p_\\theta(x))$ are\ncontinuous and piecewise smooth, then as K \u2192 \u221e the bias \u03b4 behaves as follows:\n\n$$K\\delta = K \\sum_{k=0}^{K-1} D_{\\mathrm{KL}}(p_k \\| p_{k+1}) \\to \\mathcal{F}(\\gamma) \\equiv \\frac{1}{2} \\int_0^1 \\dot{\\theta}(\\beta)^T G_\\theta(\\beta) \\dot{\\theta}(\\beta) \\, d\\beta, \\qquad (4)$$\n\nwhere $\\dot{\\theta}(\\beta)$ represents the derivative of \u03b8 with respect to \u03b2. [See supplementary material for proof.]\n\nThis result reveals a relationship with path sampling, as [12] showed that the variance of the path\nsampling estimator is proportional to the same functional. One useful result from their analysis is\na derivation of the optimal schedule along a given path. In particular, the value of F(\u03b3) using the\noptimal schedule is given by \u2113(\u03b3)^2/2, where \u2113 is the Riemannian path length defined by\n\n$$\\ell(\\gamma) = \\int_0^1 \\sqrt{\\dot{\\theta}(\\beta)^T G_\\theta(\\beta) \\dot{\\theta}(\\beta)} \\, d\\beta. \\qquad (5)$$\n\nIntuitively, the optimal schedule allocates more distributions to regions where p\u03b2 changes quickly.\nWhile [12] derived the optimal paths and schedules for some simple examples, they observed that\nthis is intractable in most cases and recommended using geometric paths in practice.\nThe above analysis assumes perfect transitions, which can be unrealistic in practice because many\ndistributions have separated modes between which mixing is difficult. As Neal [4] observed, in\nsuch cases, AIS can be viewed as having two sources of variance: that caused by variability within\na mode, and that caused by misallocation of samples between modes. 
The former source of variance\nis well modeled by the perfect transitions analysis and can be made small by adding more\nintermediate distributions. The latter, however, can persist even with large numbers of intermediate\ndistributions. While our theoretical analysis assumes perfect transitions, our proposed method often\ngave substantial improvement empirically in situations with poor mixing.\n\n4 Moment Averaging\nAs discussed in Section 2, the typical choice of intermediate distributions for AIS is the geometric\naverages path \u03b3GA given by (1). In this section, we propose an alternative path for exponential\nfamily models. An exponential family model is defined as\n\n$$p(x) = \\frac{1}{Z(\\eta)} h(x) \\exp\\left(\\eta^T g(x)\\right), \\qquad (6)$$\n\nwhere \u03b7 are the natural parameters and g are the sufficient statistics. Exponential families include a\nwide variety of statistical models, including Markov random fields.\nIn exponential families, geometric averages correspond to averaging the natural parameters:\n\n$$\\eta(\\beta) = (1 - \\beta)\\eta(0) + \\beta\\eta(1). \\qquad (7)$$\n\nExponential families can also be parameterized in terms of their moments s = E[g(x)]. For any\nminimal exponential family (i.e. one whose sufficient statistics are linearly independent), there is a\none-to-one mapping between moments and natural parameters [18, p. 64]. We propose an alternative\nto \u03b3GA called the moment averages path, denoted \u03b3MA, and defined by averaging the moments of\nthe initial and target distributions:\n\n$$s(\\beta) = (1 - \\beta)s(0) + \\beta s(1). \\qquad (8)$$\n\nThis path exists for any exponential family model, since the set of realizable moments is convex\n[18]. It is unique, since g is unique up to affine transformation.\n\nAs an illustrative example, consider a multivariate Gaussian distribution parameterized by the mean\n\u00b5 and covariance \u03a3. 
The moments are $\\mathbb{E}[x] = \\mu$ and $-\\frac{1}{2}\\mathbb{E}[xx^T] = -\\frac{1}{2}(\\Sigma + \\mu\\mu^T)$. By plugging\nthese into (8), we find that \u03b3MA is given by:\n\n$$\\mu(\\beta) = (1 - \\beta)\\mu(0) + \\beta\\mu(1) \\qquad (9)$$\n\n$$\\Sigma(\\beta) = (1 - \\beta)\\Sigma(0) + \\beta\\Sigma(1) + \\beta(1 - \\beta)(\\mu(1) - \\mu(0))(\\mu(1) - \\mu(0))^T. \\qquad (10)$$\n\nIn other words, the means are linearly interpolated, and the covariances are linearly interpolated\nand stretched in the direction connecting the two means. Intuitively, this stretching is a useful\nproperty, because it increases the overlap between successive distributions with different means. A\ncomparison of the two paths is shown in Figure 1.\n\nNext consider the example of a restricted Boltzmann machine (RBM),\na widely used model in deep learning. A binary RBM is a Markov\nrandom field over binary vectors v (the visible units) and h (the hidden\nunits), with the distribution\n\n$$p(v, h) \\propto \\exp\\left(a^T v + b^T h + v^T W h\\right). \\qquad (11)$$\n\nThe parameters of the model are the visible biases a, the hidden biases\nb, and the weights W. Since these parameters are also the natural\nparameters in the exponential family representation, \u03b3GA reduces to\nlinearly averaging the biases and the weights. The sufficient statistics\nof the model are the visible activations v, the hidden activations h, and\nthe products vhT. 
Therefore, \u03b3M A is de\ufb01ned by:\nE[v]\u03b2 = (1 \u2212 \u03b2)E[v]0 + \u03b2E[v]1\nE[h]\u03b2 = (1 \u2212 \u03b2)E[h]0 + \u03b2E[h]1\n\nE[vhT ]\u03b2 = (1 \u2212 \u03b2)E[vhT ]0 + \u03b2E[vhT ]1\n\n(12)\n(13)\n(14)\n\nFigure 1: Comparison of\n\u03b3GA and \u03b3M A for multivari-\nate Gaussians:\nintermediate\ndistribution for \u03b2 = 0.5,\nand \u00b5(\u03b2) for \u03b2 evenly spaced\nfrom 0 to 1.\n\nFor many models of interest, including RBMs, it is infeasible to de-\ntermine \u03b3M A exactly, as it requires solving two often intractable prob-\nlems: (1) computing the moments of pb, and (2) solving for model\nparameters which match the averaged moments s(\u03b2). However, much work has been devoted to\npractical approximations [19, 20], some of which we use in our experiments with intractable mod-\nels. Since it would be infeasible to moment match every \u03b2k even approximately, we introduce the\nmoment averages spline (MAS) path, denoted \u03b3M AS. We choose a set of R values \u03b21, . . . , \u03b2R called\nknots, and solve for the natural parameters \u03b7(\u03b2j) to match the moments s(\u03b2j) for each knot. We\nthen interpolate between the knots using geometric averages. The analysis of Section 4.2 shows that,\nunder the assumption of perfect transitions, using \u03b3M AS in place of \u03b3M A does not affect the cost\nfunctional F de\ufb01ned in Theorem 1.\n4.1 Variational Interpretation\nBy interpreting \u03b3GA and \u03b3M A as optimizing different variational objectives, we gain additional\ninsight into their behavior. 
For geometric averages, the intermediate distribution \u03b3GA(\u03b2) minimizes\na weighted sum of KL divergences to the initial and target distributions:\n\n$$p^{(GA)}_\\beta = \\arg\\min_q \\; (1 - \\beta) D_{\\mathrm{KL}}(q \\| p_a) + \\beta D_{\\mathrm{KL}}(q \\| p_b). \\qquad (15)$$\n\nOn the other hand, \u03b3MA minimizes the weighted sum of KL divergences in the reverse direction:\n\n$$p^{(MA)}_\\beta = \\arg\\min_q \\; (1 - \\beta) D_{\\mathrm{KL}}(p_a \\| q) + \\beta D_{\\mathrm{KL}}(p_b \\| q). \\qquad (16)$$\n\nSee the supplementary material for the derivations. The objective function (15) is minimized by a\ndistribution which puts significant mass only in the \u201cintersection\u201d of pa and pb, i.e. those regions\nwhich are likely under both distributions. By contrast, (16) encourages the distribution to be spread\nout in order to capture all high probability regions of both pa and pb. This interpretation helps\nexplain why the intermediate distributions in the Gaussian example of Figure 1 take the shape that\nthey do. In our experiments, we found that \u03b3MA often gave more accurate results than \u03b3GA because\nthe intermediate distributions captured regions of the target distribution which were missed by \u03b3GA.\n\n4.2 Asymptotics under Perfect Transitions\nIn general, we found that \u03b3GA and \u03b3MA can look very different. Intriguingly, both paths always\nresult in the same value of the cost functional F(\u03b3) of Theorem 1 for any exponential family model.\nFurthermore, nothing is lost by using the spline approximation \u03b3MAS in place of \u03b3MA:\nTheorem 2. For any exponential family model with natural parameters \u03b7 and moments s, all three\npaths share the same value of the cost functional:\n\n$$\\mathcal{F}(\\gamma_{GA}) = \\mathcal{F}(\\gamma_{MA}) = \\mathcal{F}(\\gamma_{MAS}) = \\frac{1}{2}(\\eta(1) - \\eta(0))^T (s(1) - s(0)). \\qquad (17)$$\n\nProof. The two parameterizations of exponential families satisfy the relationship $G_\\eta \\dot{\\eta} = \\dot{s}$ [21,\nsec. 3.3]. 
Therefore, F(\u03b3) can be rewritten as $\\frac{1}{2}\\int_0^1 \\dot{\\eta}(\\beta)^T \\dot{s}(\\beta) \\, d\\beta$. Because \u03b3GA and \u03b3MA linearly\ninterpolate the natural parameters and moments respectively,\n\n$$\\mathcal{F}(\\gamma_{GA}) = \\frac{1}{2}(\\eta(1) - \\eta(0))^T \\int_0^1 \\dot{s}(\\beta) \\, d\\beta = \\frac{1}{2}(\\eta(1) - \\eta(0))^T (s(1) - s(0)) \\qquad (18)$$\n\n$$\\mathcal{F}(\\gamma_{MA}) = \\frac{1}{2}(s(1) - s(0))^T \\int_0^1 \\dot{\\eta}(\\beta) \\, d\\beta = \\frac{1}{2}(s(1) - s(0))^T (\\eta(1) - \\eta(0)). \\qquad (19)$$\n\nFinally, to show that F(\u03b3MAS) = F(\u03b3MA), observe that \u03b3MAS uses the geometric path between\neach pair of knots \u03b3(\u03b2j) and \u03b3(\u03b2j+1), while \u03b3MA uses the moments path. The above analysis shows\nthe costs must be equal for each segment, and therefore equal for the entire path.\n\nThis analysis shows that all three paths result in the same expected log weights asymptotically,\nassuming perfect transitions. There are several caveats, however. First, we have noticed experimentally\nthat \u03b3MA often yields substantially more accurate estimates of Zb than \u03b3GA even when the\naverage log weights are comparable. Second, the two paths can have very different mixing properties,\nwhich can strongly affect the results. Third, Theorem 2 assumes linear schedules, and in\nprinciple there is room for improvement if one is allowed to tune the schedule.\nFor instance, consider annealing between two Gaussians pa = N(\u00b5a, \u03c3) and pb = N(\u00b5b, \u03c3). The\noptimal schedule for \u03b3GA is a linear schedule with cost F(\u03b3GA) = O(d^2), where d = |\u00b5b \u2212 \u00b5a|/\u03c3.\nUsing a linear schedule, the moment path also has cost O(d^2), consistent with Theorem 2. However,\nmost of the cost of the path results from instability near the endpoints, where the variance changes\nsuddenly. 
Using an optimal schedule, which allocates more distributions near the endpoints, the cost\nfunctional falls to O((log d)^2), which is within a constant factor of the optimal path derived by [12].\n(See the supplementary material for the derivations.) In other words, while F(\u03b3GA) = F(\u03b3MA),\nthey achieve this value for different reasons: \u03b3GA follows an optimal schedule along a bad path,\nwhile \u03b3MA follows a bad schedule along a near-optimal path. We speculate that, combined with the\nprocedure of Section 4.3 for choosing a schedule, moment averages may result in large reductions\nin the cost functional for some models.\n4.3 Optimal Binned Schedules\nIn general, it is hard to choose a good schedule for a given path. However, consider the set of binned\nschedules, where the path is divided into segments, some number Kj of intermediate distributions\nare allocated to each segment, and the distributions are spaced linearly within each segment. Under\nthe assumption of perfect transitions, there is a simple formula for an asymptotically optimal binned\nschedule which requires only the parameters obtained through moment averaging:\nTheorem 3. Let \u03b3 be any path for an exponential family model defined by a set of knots \u03b2j, each with\nnatural parameters \u03b7j and moments sj, connected by segments of either \u03b3GA or \u03b3MA paths. Then,\nunder the assumption of perfect transitions, an asymptotically optimal allocation of intermediate\ndistributions to segments is given by:\n\n$$K_j \\propto \\sqrt{(\\eta_{j+1} - \\eta_j)^T (s_{j+1} - s_j)}. \\qquad (20)$$\n\nProof. By Theorem 2, the cost functional for segment j is $F_j = \\frac{1}{2}(\\eta_{j+1} - \\eta_j)^T (s_{j+1} - s_j)$. Hence,\nwith Kj distributions allocated to it, it contributes Fj/Kj to the total cost. 
The values of Kj which\nminimize $\\sum_j F_j/K_j$ subject to $\\sum_j K_j = K$ and Kj \u2265 0 are given by $K_j \\propto \\sqrt{F_j}$.\n\nFigure 2: Estimates of log Zb for a normalized Gaussian as K, the number of intermediate distributions, is\nvaried. True value: log Zb = 0. Error bars show bootstrap 95% confidence intervals. (Best viewed in color.)\n\n5 Experimental Results\nIn order to compare our proposed path with geometric averages, we ran AIS using each path to estimate\npartition functions of several probability distributions. For all of our experiments, we report\ntwo sets of results. First, we show the estimates of log Z as a function of K, the number of intermediate\ndistributions, in order to visualize the amount of computation necessary to obtain reasonable\naccuracy. Second, as recommended by [4], we report the effective sample size (ESS) of the weights\nfor a large K. This statistic roughly measures how many independent samples one obtains using\nAIS (see footnote 1). All results are based on 5,000 independent AIS runs, so the maximum possible ESS is 5,000.\n5.1 Annealing Between Two Distant Gaussians\nIn our first experiment, the initial and target distributions were the two Gaussians shown in Fig. 1,\nwhose parameters are $N\\left(\\begin{pmatrix} -10 \\\\ 0 \\end{pmatrix}, \\begin{pmatrix} 1 & -0.85 \\\\ -0.85 & 1 \\end{pmatrix}\\right)$ and $N\\left(\\begin{pmatrix} 10 \\\\ 0 \\end{pmatrix}, \\begin{pmatrix} 1 & 0.85 \\\\ 0.85 & 1 \\end{pmatrix}\\right)$. As both distributions\nare normalized, Za = Zb = 1. We compared \u03b3GA and \u03b3MA both under perfect transitions, and\nusing the Gibbs transition operator. We also compared linear schedules with the optimal binned\nschedules of Section 4.3, using 10 segments evenly spaced from 0 to 1.\nFigure 2 shows the estimates of log Zb for K ranging from 10 to 1,000. Observe that with 1,000\nintermediate distributions, all paths yielded accurate estimates of log Zb. 
However, \u03b3M A needed\nfewer intermediate distributions to achieve accurate estimates. For example, with K = 25, \u03b3M A\nresulted in an estimate within one nat of log Zb, while the estimate based on \u03b3GA was off by 27 nats.\nThis result may seem surprising in light of Theorem 2, which implies that F(\u03b3GA) = F(\u03b3M A) for\nlinear schedules. In fact, the average log weights for \u03b3GA and \u03b3M A were similar for all values of K,\nas the theorem would suggest; e.g., with K = 25, the average was -27.15 for \u03b3M A and -28.04 for\n\u03b3GA. However, because the \u03b3M A intermediate distributions were broader, enough samples landed\nin high probability regions to yield reasonable estimates of log Zb.\n5.2 Partition Function Estimation for RBMs\nOur next set of experiments focused on restricted Boltzmann machines (RBMs), a building block of\nmany deep learning models (see Section 4). We considered RBMs trained with three different meth-\nods: contrastive divergence (CD) [19] with one step (CD1), CD with 25 steps (CD25), and persistent\ncontrastive divergence (PCD) [20]. All of the RBMs were trained on the MNIST handwritten digits\ndataset [22], which has long served as a benchmark for deep learning algorithms. We experimented\nboth with small, tractable RBMs and with full-size, intractable RBMs.\nSince it is hard to compute \u03b3M A exactly for RBMs, we used the moments spline path \u03b3M AS of\nSection 4 with the 9 knot locations 0.1, 0.2, . . . , 0.9. We considered the two initial distributions\ndiscussed by [9]: (1) the uniform distribution, equivalent to an RBM where all the weights and\nbiases are set to 0, and (2) the base rate RBM, where the weights and hidden biases are set to 0, and\nthe visible biases are set to match the average pixel values over the MNIST training set.\nSmall, Tractable RBMs: To better understand the behavior of \u03b3GA and \u03b3M AS, we \ufb01rst evaluated\nthe paths on RBMs with only 20 hidden units. 
\n\nFootnote 1: The ESS is defined as $M/(1 + s^2(w_*^{(i)}))$ where $s^2(w_*^{(i)})$ is the sample variance of the normalized weights\n[4]. In general, one should regard ESS estimates cautiously, as they can give misleading results in cases where\nan algorithm completely misses an important mode of the distribution. However, as we report the ESS in cases\nwhere the estimated partition functions are close to the true value (when known) or agree closely with each\nother, we believe the statistic is meaningful in our comparisons.\n\nFigure 3: Estimates of log Zb for the tractable PCD(20) RBM as K, the number of intermediate distributions,\nis varied. Error bars indicate bootstrap 95% confidence intervals. (Best viewed in color.)\n\npa(v) | path & schedule | PCD(20) log Zb | PCD(20) log \u1e90b | PCD(20) ESS | CD1(20) log Zb | CD1(20) log \u1e90b | CD1(20) ESS\nuniform | GA linear | 178.06 | 177.99 | 204 | 279.59 | 279.60 | 248\nuniform | GA optimal binned | 178.06 | 177.92 | 142 | 279.59 | 279.51 | 124\nuniform | MAS linear | 178.06 | 178.09 | 289 | 279.59 | 279.59 | 2686\nuniform | MAS optimal binned | 178.06 | 178.08 | 934 | 279.59 | 279.60 | 2619\n\nTable 1: Comparing estimates of log Zb and effective sample size (ESS) for tractable RBMs. Results are shown\nfor K = 100,000 intermediate distributions, with 5,000 chains and Gibbs transitions. Bolded values indicate\nESS estimates that are not significantly different from the largest value (bootstrap hypothesis test with 1,000\nsamples at \u03b1 = 0.05 significance level). The maximum possible ESS is 5,000.\n\nFigure 4: Visible activations for samples from the PCD(500) RBM. (left) base rate RBM, \u03b2 = 0 (top) geometric\npath (bottom) MAS path (right) target RBM, \u03b2 = 1.\n\nIn this setting, it is feasible to exactly compute the\npartition function and moments and to generate exact samples by exhaustively summing over all\n2^20 hidden configurations. 
The moments of the target RBMs were computed exactly, and moment\nmatching was performed with conjugate gradient using the exact gradients.\nThe results are shown in Figure 3 and Table 1. Under perfect transitions, \u03b3GA and \u03b3M AS were both\nable to accurately estimate log Zb using as few as 100 intermediate distributions. However, using\nthe Gibbs transition operator, \u03b3M AS gave accurate estimates using fewer intermediate distributions\nand achieved a higher ESS at K = 100,000. To check that the improved performance didn\u2019t rely on\naccurate moments of pb, we repeated the experiment with highly biased moments.2 The differences\nin log \u02c6Zb and ESS compared to the exact moments condition were not statistically signi\ufb01cant.\nFull-size, Intractable RBMs: For intractable RBMs, moment averaging required approximately\nsolving two intractable problems: moment estimation for the target RBM, and moment matching.\nWe estimated the moments from 1,000 independent Gibbs chains, using 10,000 Gibbs steps with\n1,000 steps of burn-in. The moment averaged RBMs were trained using PCD. (We used 50,000 up-\ndates with a \ufb01xed learning rate of 0.01 and no momentum.) In addition, we ran a cheap, inaccurate\nmoment matching scheme (denoted MAS cheap) where visible moments were estimated from the\nempirical MNIST base rate and the hidden moments from the conditional distributions of the hidden\nunits given the MNIST digits. Intermediate RBMs were \ufb01t using 1,000 PCD updates and 100 par-\nticles, for a total computational cost far smaller than that of AIS itself. Results of both methods are\n\n2In particular, we computed the biased moments from the conditional distributions of the hidden units given\n\nthe MNIST training examples, where each example of digit class i was counted i + 1 times.\n\n7\n\n\fFigure 5: Estimates of log Zb for intractable RBMs. 
Error bars indicate bootstrap 95% confidence intervals.\n(Best viewed in color.)\n\npa(v) | path | CD1(500) log \u1e90b | CD1(500) ESS | PCD(500) log \u1e90b | PCD(500) ESS | CD25(500) log \u1e90b | CD25(500) ESS\nuniform | GA linear | 341.53 | 4 | 417.91 | 169 | 451.34 | 13\nuniform | MAS linear | 359.09 | 3076 | 418.27 | 620 | 449.22 | 12\nuniform | MAS cheap linear | 359.09 | 3773 | 418.33 | 5 | 450.90 | 30\nbase rate | GA linear | 359.10 | 4924 | 418.20 | 159 | 451.27 | 2888\nbase rate | MAS linear | 359.07 | 2203 | 418.26 | 1460 | 451.31 | 304\nbase rate | MAS cheap linear | 359.09 | 2465 | 418.25 | 359 | 451.14 | 244\n\nTable 2: Comparing estimates of log Zb and effective sample size (ESS) for intractable RBMs. Results are\nshown for K = 100,000 intermediate distributions, with 5,000 chains and Gibbs transitions. Bolded values\nindicate ESS estimates that are not significantly different from the largest value (bootstrap hypothesis test with\n1,000 samples at \u03b1 = 0.05 significance level). The maximum possible ESS is 5,000.\n\nshown in Figure 5 and Table 2. Overall, the MAS results compare favorably with those of GA on\nboth of our metrics. Performance was comparable under MAS cheap, suggesting that \u03b3MAS can be\napproximated cheaply and effectively. As with the tractable RBMs, we found that optimal binned\nschedules made little difference in performance, so we focus here on linear schedules.\nThe most serious failure was \u03b3GA for CD1(500) with uniform initialization, which underestimated\nour best estimates of the log partition function (and hence overestimated held-out likelihood) by\nnearly 20 nats. The geometric path from uniform to PCD(500) and the moments path from uniform\nto CD1(500) also resulted in underestimates, though less drastic. The rest of the paths agreed\nclosely with each other on their partition function estimates, although some methods achieved substantially\nhigher ESS values on some RBMs. 
One conclusion is that it\u2019s worth exploring multiple\ninitializations and paths for a given RBM in order to ensure accurate results.\nFigure 4 compares samples along \u03b3GA and \u03b3M AS for the PCD(500) RBM using the base rate ini-\ntialization. For a wide range of \u03b2 values, the \u03b3GA RBMs assigned most of their probability mass\nto blank images. As discussed in Section 4.1, \u03b3GA prefers con\ufb01gurations which are probable under\nboth the initial and target distributions. In this case, the hidden activations were closer to uniform\nconditioned on a blank image than on a digit, so \u03b3GA preferred blank images. By contrast, \u03b3M AS\nyielded diverse, blurry digits which gradually coalesced into crisper ones.\n\n6 Conclusion\n\nWe presented a theoretical analysis of the performance of AIS paths and proposed a novel path\nfor exponential families based on averaging moments. We gave a variational interpretation of this\npath and derived an asymptotically optimal piecewise linear schedule. Moment averages performed\nwell empirically at estimating partition functions of RBMs. We hope moment averaging can also\nimprove other path-based sampling algorithms which typically use geometric averages, such as path\nsampling [12], parallel tempering [23], and tempered transitions [15].\n\nAcknowledgments\n\nThis research was supported by NSERC and Quanta Computer. We thank Geoffrey Hinton for\nhelpful discussions. We also thank the anonymous reviewers for thorough and helpful feedback.\n\n8\n\n\fReferences\n\n[1] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and gen-\n\neralized belief propagation algorithms. IEEE Trans. on Inf. Theory, 51(7):2282\u20132312, 2005.\n\n[2] Martin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky. A new class of upper bounds on\nthe log partition function. IEEE Transactions on Information Theory, 51(7):2313\u20132335, 2005.\n[3] Amir Globerson and Tommi Jaakkola. 
Approximate Inference Using Conditional Entropy Decompositions. In 11th International Workshop on AI and Statistics (AISTATS\u20192007), 2007.\n[4] Radford Neal. Annealed importance sampling. Statistics and Computing, 11:125\u2013139, 2001.\n[5] John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833\u2013859, 2006.\n[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Methodology), 68(3):411\u2013436, 2006.\n[7] Jascha Sohl-Dickstein and Benjamin J. Culpepper. Hamiltonian annealed importance sampling for partition function estimation. Technical report, Redwood Center, UC Berkeley, 2012.\n[8] Lucas Theis, Sebastian Gerwinn, Fabian Sinz, and Matthias Bethge. In all likelihood, deep belief is not enough. Journal of Machine Learning Research, 12:3071\u20133096, 2011.\n[9] Ruslan Salakhutdinov and Ian Murray. On the quantitative analysis of deep belief networks. In Int\u2019l Conf. on Machine Learning, pages 872\u2013879, 2008.\n[10] Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. On tracking the partition function. In NIPS 24. MIT Press, 2011.\n[11] Graham Taylor and Geoffrey Hinton. Products of hidden Markov models: It takes N > 1 to tango. In Uncertainty in Artificial Intelligence, 2009.\n[12] Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13(2):163\u2013186, 1998.\n[13] Christopher Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56:5018\u20135035, 1997.\n[14] Daan Frenkel and Berend Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, 2nd edition, 2002.\n[15] Radford Neal. Sampling from multimodal distributions using tempered transitions. 
Statistics and Computing, 6:353\u2013366, 1996.\n[16] Gundula Behrens, Nial Friel, and Merrilee Hurn. Tuning tempered transitions. Statistics and Computing, 22:65\u201378, 2012.\n[17] Ben Calderhead and Mark Girolami. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53(12):4028\u20134045, 2009.\n[18] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n[19] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771\u20131800, 2002.\n[20] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Int\u2019l Conf. on Machine Learning, 2008.\n[21] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.\n[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[23] Y. Iba. Extended ensemble Monte Carlo. International Journal of Modern Physics C, 12(5):623\u2013656, 2001.\n", "award": [], "sourceid": 1276, "authors": [{"given_name": "Roger", "family_name": "Grosse", "institution": "MIT"}, {"given_name": "Chris", "family_name": "Maddison", "institution": "University of Toronto"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}]}