{"title": "On Tracking The Partition Function", "book": "Advances in Neural Information Processing Systems", "page_first": 2501, "page_last": 2509, "abstract": "Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function $Z$. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to {\\bf track} the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change $\\Delta Z$ incurred by each gradient update, (2) estimating the difference in $Z$ over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate $Z$ using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone.", "full_text": "On Tracking The Partition Function\n\nGuillaume Desjardins, Aaron Courville, Yoshua Bengio\n\n{desjagui,courvila,bengioy}@iro.umontreal.ca\n\nD\u00e9partement d\u2019informatique et de recherche op\u00e9rationnelle\n\nUniversit\u00e9 de Montr\u00e9al\n\nAbstract\n\nMarkov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. 
In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change $\\Delta Z$ incurred by each gradient update, (2) estimating the difference in $Z$ over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate $Z$ using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone.\n\n1 Introduction\n\nIn many areas of application, problems are naturally expressed as a Gibbs measure, where the distribution over the domain $\\mathcal{X}$ is given by, for $x \\in \\mathcal{X}$:\n\n$$q(x) = \\frac{\\exp\\{-\\beta E(x)\\}}{Z(\\beta)} = \\frac{\\tilde{q}(x)}{Z(\\beta)}, \\quad \\text{with } Z(\\beta) = \\sum_{x} \\tilde{q}(x). \\qquad (1)$$\n\n$E(x)$ is referred to as the \u201cenergy\u201d of configuration $x$, $\\beta$ is a free parameter known as the inverse temperature and $Z(\\beta)$ is the normalization factor commonly referred to as the partition function. Under certain general conditions on the form of $E$, these models are known as Markov Random Fields (MRF), and have been very popular within the vision and natural language processing communities. MRFs with latent variables \u2013 in particular restricted Boltzmann machines (RBMs) [9] \u2013 are among the most popular building blocks for deep architectures [1], being used in the unsupervised initialization of both Deep Belief Networks [9] and Deep Boltzmann Machines [22].\n\nAs illustrated in Eq. 1, the partition function is computed by summing over all variable configurations. Since the number of configurations scales exponentially with the number of variables, exact calculation of the partition function is generally computationally intractable. Without the partition function, probabilities under the model can only be determined up to a multiplicative constant, which seriously limits the model\u2019s utility. One method recently proposed for estimating $Z(\\beta)$ is annealed importance sampling (AIS) [18, 23]. In AIS, $Z(\\beta)$ is approximated by the sum of a set of importance-weighted samples drawn from the model distribution. With a large number of variables, drawing a set of importance-weighted samples is generally subject to extreme variance in the importance weights. AIS alleviates this issue by annealing the model distribution through a series of slowly changing distributions that link the target model distribution to one where the log partition function is tractable. While AIS is quite successful, it generally requires the use of tens of thousands of annealing distributions in order to achieve accurate results. This computationally intensive requirement renders AIS inappropriate as a means of maintaining a running estimate of the log partition function throughout training. Yet, having ready access to this quantity throughout learning opens the door to a range of possibilities. Likelihood could be used as a basis for model comparison throughout training; early-stopping could be accomplished by monitoring an estimate of the likelihood of a validation set. 
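To make the computational barrier concrete, the sum in Eq. 1 can be evaluated exactly only for toy models. The following sketch (our illustration, not code from the paper; numpy-based, all names hypothetical) computes log Z(beta) by brute-force enumeration over binary configurations:

```python
import itertools

import numpy as np

def log_partition(energy_fn, n_vars, beta=1.0):
    """Exact log Z(beta) = log sum_x exp(-beta * E(x)) (Eq. 1) for a Gibbs
    measure over x in {0,1}^n_vars, by O(2^n_vars) enumeration."""
    log_terms = np.array([-beta * energy_fn(np.array(x, dtype=float))
                          for x in itertools.product([0, 1], repeat=n_vars)])
    # Accumulate in the log domain for numerical stability.
    return np.logaddexp.reduce(log_terms)

# Example: a random Ising-style pairwise energy on 10 binary variables.
rng = np.random.default_rng(0)
J = rng.normal(size=(10, 10))
energy = lambda x: -x @ J @ x
log_z = log_partition(energy, 10)   # feasible at n = 10; hopeless at n = 784
```

At beta = 0 the result reduces to n_vars * log 2, a useful sanity check; each extra variable doubles the enumeration cost, which is why sampling-based estimators such as AIS are needed for realistic models.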
Another important application is in Bayesian inference in MRFs [17], where we require the partition function for each value of the parameters in the region of support. Tracking the log partition function would also enable simultaneous estimation of all the parameters of a heterogeneous model, for example an extended directed graphical model with Gibbs distributions forming some of the model components.\n\nIn this work, we consider a method of tracking the log partition function during training, which builds upon the parallel tempering (PT) framework [7, 10, 15]. Our method relies on two basic observations. First, when using stochastic gradient descent\u00b9, parameters tend to change slowly during training; consequently, the partition function $Z(\\beta)$ also tends to evolve slowly. We exploit this property of the learning process by using importance sampling to estimate changes in the log partition function from one learning iteration to the next. If the changes in the distribution from time-step $t$ to $t + 1$ are small, the importance sampling estimate can be very accurate, even with relatively few samples. This is the same basic strategy employed in AIS, but while with AIS one constructs a path of close distributions through an annealing schedule, in our procedure we simply rely on the path of distributions that emerges from the learning process. Second, parallel tempering (PT) relies on simulating an extended system, consisting of multiple models each running at their own temperature. These temperatures are chosen such that neighboring models overlap sufficiently to allow for frequent cross-temperature state swaps. This is an ideal operating regime for bridge sampling [2, 19], which can thus serve to estimate the difference in log partition functions between neighboring models. 
While each method on its own, with relatively few samples, tends not to provide reliable estimates, we propose to combine these measurements using a variation of the well-known Kalman filter (KF), allowing us to accurately track the evolution of the log partition function throughout learning. The efficiency of our method stems from the fact that our estimator makes use of the samples generated in the course of training, thus incurring relatively little additional computational cost.\n\nThis paper is structured as follows. In Section 2, we provide a brief overview of RBMs and the SML-PT training algorithm, which serves as the basis of our tracking algorithm. Sections 3.1-3.3 cover the details of the importance and bridge sampling estimates, while Section 3.4 provides a comprehensive look at our filtering procedure and the tracking algorithm as a whole. Experimental results are presented in Section 4.\n\n2 Stochastic Maximum Likelihood with Parallel Tempering\n\nOur proposed log partition function tracking strategy is applicable to any Gibbs distribution model that is undergoing relatively smooth changes in the partition function. However, we concentrate on its application to the RBM since it has become a model of choice for learning unsupervised features for use in deep feed-forward architectures [9, 1] as well as for modeling complex, high-dimensional distributions [27, 24, 12].\n\nRBMs are bipartite graphical models where visible units $v \\in \\{0, 1\\}^{n_v}$ interact with hidden units $h \\in \\{0, 1\\}^{n_h}$ through the energy function $E(v, h) = -h^T W v - c^T h - b^T v$. The model parameters $\\theta = [W, c, b]$ consist of the weight matrix $W \\in \\mathbb{R}^{n_h \\times n_v}$, whose entries $W_{ij}$ connect units $(v_i, h_j)$, and offset vectors $b$ and $c$. RBMs can be trained through a stochastic approximation to the negative log-likelihood gradient $\\frac{\\partial F(v)}{\\partial \\theta} - E_p[\\frac{\\partial F(v)}{\\partial \\theta}]$, where $F(v)$ is the free-energy function defined as $F(v) = -\\log \\sum_h \\exp(-E(v, h))$. In Stochastic Maximum Likelihood (SML) [25], we replace the expectation by a sample average, where approximate samples are drawn from a persistent Markov chain, updated through $k$ steps of Gibbs sampling between parameter updates. Other algorithms improve upon this default formulation by replacing Gibbs sampling with more powerful sampling algorithms [26, 7, 21, 20]. By increasing the mixing rate of the underlying Markov chain, these methods can lead to lower variance estimates of the maximum likelihood gradient and faster convergence. However, from the perspective of tracking the log partition function, we will see in Section 3 that the SML-PT scheme [7] presents a rather unique advantage.\n\n\u00b9 Stochastic gradient descent is one of the most popular methods for training MRFs, precisely because second order optimization methods typically require a deterministic gradient, whereas sampling-based estimators are the only practical option for models with an intractable partition function.\n\nThroughout training, parallel tempering draws samples from an extended system $\\mathcal{M}_t = \\{q_{i,t};\\ i \\in [1, M]\\}$, where $q_{i,t}$ denotes the model with inverse temperature $\\beta_i \\in [0, 1]$ obtained after $t$ steps of gradient descent. Each model $q_{i,t}$ (associated with a unique partition function $Z_{i,t}$) represents a smoothed version of the target distribution $q_{1,t}$ (with $\\beta_1 = 1$). The inverse temperature $\\beta_i = 1/T_i \\in [0, 1]$ controls the degree of smoothing, with smaller values of $\\beta_i$ leading to distributions which are easier to sample from. To leverage these fast-mixing chains, PT alternates $k$ steps of Gibbs sampling (performed independently at each temperature) with cross-temperature state swaps. 
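Because the RBM is bipartite, the conditionals p(h | v) and p(v | h) factorize over units, which is what makes the k-step Gibbs updates of SML cheap. A minimal sketch of one such transition (our illustration, assuming the energy of Section 2 with beta scaling the whole energy; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, beta=1.0, rng=rng):
    """One block Gibbs transition for an RBM with
    E(v, h) = -h^T W v - c^T h - b^T v, tempered by beta.
    v has shape (batch, n_v); W has shape (n_h, n_v)."""
    ph = sigmoid(beta * (v @ W.T + c))        # p(h_j = 1 | v), factorized
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(beta * (h @ W + b))          # p(v_i = 1 | h), factorized
    v = (rng.random(pv.shape) < pv).astype(float)
    return v, h
```

At beta = 0 both conditionals reduce to fair coin flips, which is why low-temperature chains mix so easily; note that the tempered family used in the paper's experiments (Section 4) leaves the visible offset term unscaled, a variant this sketch does not implement.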
These are proposed between neighboring chains using a Metropolis-Hastings-based acceptance criterion. If we denote the particle obtained by each model $q_{i,t}$ after $k$ steps of Gibbs sampling as $x_{i,t}$, then the swap acceptance ratio $r_{i,t}$ for chains $(i, i + 1)$ is given by:\n\n$$r_{i,t} = \\min\\left(1,\\ \\frac{\\tilde{q}_{i,t}(x_{i+1,t})\\,\\tilde{q}_{i+1,t}(x_{i,t})}{\\tilde{q}_{i,t}(x_{i,t})\\,\\tilde{q}_{i+1,t}(x_{i+1,t})}\\right) \\qquad (2)$$\n\nThese swaps ensure that samples from highly ergodic chains are gradually swapped into lower temperature chains. Our swapping schedule is the deterministic even-odd algorithm [14], which proposes swaps between all pairs $(q_{i,t}, q_{i+1,t})$ with even $i$\u2019s, followed by those with odd $i$\u2019s. The gradient is then estimated by using the sample which was last swapped into temperature $\\beta_1$. To reduce the variance on our estimate, we run multiple Markov chains per temperature, yielding a mini-batch of model samples $X_{i,t} = \\{x^{(n)}_{i,t} \\sim q_{i,t}(x);\\ 1 \\leq n \\leq N\\}$ at each time-step and temperature.\n\nSML with Adaptive Parallel Tempering (SML-APT) [6] further improves upon SML-PT by automating the choice of temperatures. It does so by maximizing the flow of particles between extremal temperatures, yielding better ergodicity and more robust sampling in the negative phase of training.\n\n3 Tracking the Partition Function\n\nUnrolling in time (learning iterations) the $M$ models being simulated by PT, we can envision a two-dimensional lattice of RBMs indexed by $(i, t)$. As previously mentioned, gradient descent learning causes $q_{i,t}$, the model with inverse temperature $\\beta_i$ obtained at time-step $t$, to be close to $q_{i,t-1}$. We can thus apply importance sampling between adjacent temporal models\u00b2 to obtain an estimate of $\\zeta_{i,t} - \\zeta_{i,t-1}$ (writing $\\zeta_{i,t}$ for $\\log Z_{i,t}$), denoted as $O^{\\Delta t}_{i,t}$. Inspired by the annealing distributions used in AIS, one could think to iterate this process from a known quantity $\\zeta_{i,1}$, in order to estimate $\\zeta_{i,t}$. 
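For Gibbs measures q~_i(x) = exp(-beta_i E(x)), the log of the ratio in Eq. 2 collapses to (beta_{i+1} - beta_i)(E(x_{i+1,t}) - E(x_{i,t})), so the even-odd schedule can be sketched as follows (our illustration; all names are hypothetical):

```python
import math
import random

def even_odd_swaps(energies, betas, rng=random.Random(0)):
    """Deterministic even-odd swap schedule for parallel tempering.
    energies[p] is E(x_p) for particle p; order[i] is the particle currently
    at inverse temperature betas[i]. Swaps use the Metropolis-Hastings
    criterion of Eq. 2, evaluated in the log domain."""
    order = list(range(len(betas)))
    for parity in (0, 1):                     # even pairs first, then odd pairs
        for i in range(parity, len(betas) - 1, 2):
            a, b = order[i], order[i + 1]
            # log of Eq. 2 for q~_i(x) = exp(-beta_i * E(x))
            log_r = (betas[i + 1] - betas[i]) * (energies[b] - energies[a])
            if math.log(rng.random()) < min(0.0, log_r):
                order[i], order[i + 1] = b, a
    return order
```

With betas decreasing in i (beta_1 = 1 first), a swap that moves a lower-energy particle toward beta_1 has log r >= 0 and is always accepted, which is how well-mixed high-temperature states percolate down to the target chain.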
Unfortunately, the variance of such an estimate would grow quickly with $t$.\n\nPT provides an interesting solution to this problem, by simulating an extended system $\\mathcal{M}_t$ where the $\\beta_i$\u2019s are selected such that $q_{i,t}$ and $q_{i+1,t}$ have enough overlap to allow for frequent cross-temperature state swaps. This motivates using bridge sampling [2] to provide an estimate of $\\zeta_{i+1,t} - \\zeta_{i,t}$, the difference in log partitions between temperatures $\\beta_{i+1}$ and $\\beta_i$. We denote this estimate as $O^{\\Delta\\beta}_{i,t}$. Additionally, we can treat $\\zeta_{M,t}$ as a known quantity during training, by setting $\\beta_M = 0$\u00b3. Beginning with $\\zeta_{M,t}$ (see definition in Fig. 1), repeated application of bridge sampling alone could in principle arrive at an accurate estimate of $\\{\\zeta_{i,t};\\ i \\in [1, M], t \\in [1, T]\\}$. However, reducing the variance sufficiently to provide useful estimates of the log partition function would require using a relatively large number of samples at each temperature. Within the context of RBM training, the required number of samples at each of the parallel chains would have an excessive computational cost. Nonetheless, even with relatively few samples, the bridge sampling estimate provides an additional source of information regarding the log partition function.\n\nOur strategy is to combine these two high-variance estimates $O^{\\Delta t}_{i,t}$ and $O^{\\Delta\\beta}_{i,t}$ by treating the unknown log partition functions as a latent state to be tracked by a Kalman filter. In this framework, we consider $O^{\\Delta t}_{i,t}$ and $O^{\\Delta\\beta}_{i,t}$ as observed quantities, used to iteratively refine the joint distribution over the latent state at each learning iteration. Formally, we define this latent state to be $\\zeta_t = [\\zeta_{1,t}, \\ldots, \\zeta_{M,t}, b_t]$, where $b_t$ is an extra term to account for a systematic bias in $O^{\\Delta t}_{1,t}$ (see Sec. 3.2 for details). 
The corresponding graphical model is shown in Figure 1.\n\n\u00b2 This same technique was recently used in [5], in the context of learning rate adaptation.\n\u00b3 The visible units of an RBM with zero weights are marginally independent. Its log partition function is thus given as $\\sum_i \\log(1 + \\exp(b_i)) + n_h \\cdot \\log(2)$.\n\nSystem Equations:\n$p(\\zeta_0) = \\mathcal{N}(\\mu_0, \\Sigma_0)$\n$p(\\zeta_t \\mid \\zeta_{t-1}) = \\mathcal{N}(\\zeta_{t-1}, \\Sigma_\\zeta)$\n$p(O^{(\\Delta t)}_t \\mid \\zeta_t, \\zeta_{t-1}) = \\mathcal{N}(C[\\zeta_t, \\zeta_{t-1}]^T, \\Sigma_{\\Delta t})$\n$p(O^{(\\Delta\\beta)}_t \\mid \\zeta_t) = \\mathcal{N}(H\\zeta_t, \\Sigma_{\\Delta\\beta})$\n\n$$C = \\left[\\ I_M \\ \\middle|\\ \\begin{matrix} 1 \\\\ 0 \\\\ \\vdots \\\\ 0 \\end{matrix} \\ \\middle|\\ -I_M \\ \\middle|\\ 0\\ \\right], \\qquad H = \\begin{bmatrix} -1 & +1 & 0 & \\cdots & 0 & 0 \\\\ 0 & -1 & +1 & \\ddots & \\vdots & \\vdots \\\\ \\vdots & & \\ddots & \\ddots & 0 & 0 \\\\ 0 & \\cdots & 0 & -1 & +1 & 0 \\end{bmatrix}$$\n\nFigure 1: A directed graphical model for log partition function tracking. The shaded nodes represent observed variables, and the double-walled nodes represent the tractable $\\zeta_{M,:}$ with $\\beta_M = 0$. For clarity of presentation, we show the bias term as distinct from the other $\\zeta_{i,t}$ (recall $b_t = \\zeta_{M+1,t}$).\n\n3.1 Model Dynamics\n\nThe first step is to specify how we expect the log partition function to change over training iterations, i.e. our prior over the model dynamics. SML training of the RBM model parameters is a stochastic gradient descent algorithm (typically over a mini-batch of $N$ examples) where the parameters change by small increments specified by an approximation to the likelihood gradient. This implies that both the model distribution and the partition function change relatively slowly over learning increments, with a rate of change being a function of the SML learning rate; i.e. we expect $q_{i,t}$ and $\\zeta_{i,t}$ to be close to $q_{i,t-1}$ and $\\zeta_{i,t-1}$ respectively.\n\nOur model dynamics are thus simple and capture the fact that the log partition function is slowly changing. 
Characterizing the evolution of the log partition functions as independent Gaussian processes, we model the probability of $\\zeta_t$ conditioned on $\\zeta_{t-1}$ as $p(\\zeta_t \\mid \\zeta_{t-1}) = \\mathcal{N}(\\zeta_{t-1}, \\Sigma_\\zeta)$, a normal distribution with mean $\\zeta_{t-1}$ and fixed diagonal covariance $\\Sigma_\\zeta = \\mathrm{Diag}[\\sigma^2_Z, \\ldots, \\sigma^2_Z, \\sigma^2_b]$. $\\sigma^2_Z$ and $\\sigma^2_b$ are hyper-parameters controlling how quickly the latent states $\\zeta_{i,t}$ and $b_t$ are expected to change between learning iterations.\n\n3.2 Importance Sampling Between Learning Iterations\n\nThe observation distribution $p(O^{(\\Delta t)}_t \\mid \\zeta_t, \\zeta_{t-1}) = \\mathcal{N}(C[\\zeta_t, \\zeta_{t-1}]^T, \\Sigma_{\\Delta t})$ models the relationship between the evolution of the latent log partitions and the statistical measurements $O^{(\\Delta t)}_t = [O^{(\\Delta t)}_{1,t}, \\ldots, O^{(\\Delta t)}_{M,t}]$ given by importance sampling, with $O^{\\Delta t}_{i,t}$ defined as:\n\n$$O^{\\Delta t}_{i,t} = \\log\\left\\{\\frac{1}{N}\\sum_{n=1}^{N} w^{(n)}_{i,t}\\right\\} \\quad \\text{with}\\quad w^{(n)}_{i,t} = \\frac{\\tilde{q}_{i,t}(x^{(n)}_{i,t-1})}{\\tilde{q}_{i,t-1}(x^{(n)}_{i,t-1})}. \\qquad (3)$$\n\nIn the above distribution, the matrix $C$ encodes the fact that the average importance weights estimate $\\zeta_{i,t} - \\zeta_{i,t-1} + b_t \\cdot \\mathbb{I}_{i=1}$, where $\\mathbb{I}$ is the indicator function. It is formally defined in Fig. 1. $\\Sigma_{\\Delta t}$ is a diagonal covariance matrix, whose elements are updated online from the estimated variances of the log-importance weights. At time-step $t$, the $i$-th entry of its diagonal is thus given by $\\mathrm{Var}[w_{i,t}] / \\left[\\sum_n w^{(n)}_{i,t}\\right]^2$.\n\nThe term $b_t$ accounts for a systematic bias in $O^{(\\Delta t)}_{1,t}$. It stems from the reuse of samples $X_{1,t-1}$: first, for estimating the negative phase gradient at time-step $t - 1$ (i.e. the gradient applied between $q_{i,t-1}$ and $q_{i,t}$) and second, to compute the importance weights of Eq. 3. Since the SML gradient acts to lower the probability of negative particles, $w^{(n)}_{i,t}$ is biased.\n\n3.3 Bridging the Parallel Tempering Temperature Gaps\n\nConsider now the other dimension of our parallel tempered lattice of RBMs: temperature. As previously mentioned, neighboring distributions in PT are designed to have significant overlap in their densities in order to permit particle swaps. However, the intermediate distributions $q_{i,t}(v, h)$ are not so close to one another that we can use them as the intermediate distributions of AIS. AIS typically requires thousands of intermediate chains, and maintaining that number of parallel chains would carry a prohibitive computational burden. On the other hand, the parallel tempering strategy of spacing the temperatures to ensure moderately frequent swapping nicely matches the ideal operating regime of bridge sampling [2].\n\nWe thus consider a second observation model as $p(O^{(\\Delta\\beta)}_t \\mid \\zeta_t) = \\mathcal{N}(H\\zeta_t, \\Sigma_{\\Delta\\beta})$, with $H$ defined in Fig. 1. The quantities $O^{(\\Delta\\beta)}_t = [O^{\\Delta\\beta}_{1,t}, \\ldots, O^{\\Delta\\beta}_{M-1,t}]$ are obtained via bridge sampling as estimates of $\\zeta_{i+1,t} - \\zeta_{i,t}$. Entries $O^{\\Delta\\beta}_{i,t}$ are given by:\n\n$$O^{\\Delta\\beta}_{i,t} = \\log\\sum_{n=1}^{N} u^{(n)}_{i,t} - \\log\\sum_{n=1}^{N} v^{(n)}_{i,t}, \\quad \\text{where}\\quad u^{(n)}_{i,t} = \\frac{q^*_{i,t}(x^{(n)}_{i,t})}{\\tilde{q}_{i,t}(x^{(n)}_{i,t})},\\ \\ v^{(n)}_{i,t} = \\frac{q^*_{i,t}(x^{(n)}_{i+1,t})}{\\tilde{q}_{i+1,t}(x^{(n)}_{i+1,t})}. \\qquad (4)$$\n\nThe bridging distribution [2, 19] $q^*_{i,t}$ is chosen such that it has large support with both $q_{i,t}$ and $q_{i+1,t}$. For all $i \\in [1, M - 1]$, we choose the approximately optimal distribution $q^{(opt)}_{i,t}(x) = \\frac{\\tilde{q}_{i,t}(x)\\,\\tilde{q}_{i+1,t}(x)}{s_{i,t}\\,\\tilde{q}_{i,t}(x) + \\tilde{q}_{i+1,t}(x)}$, where $s_{i,t} \\approx Z_{i+1,t}/Z_{i,t}$. Since the $Z_{i,t}$\u2019s are the very quantities we are trying to estimate, this definition may seem problematic. However, it is possible to start with a coarse estimate of $s_{i,1}$ and refine it in subsequent iterations by using the output of our tracking algorithm. $\\Sigma_{\\Delta\\beta}$ is once again a diagonal covariance matrix, updated online from the variance of the log-importance weights $u$ and $v$ [19]. The $i$-th entry is given by $\\mathrm{Var}[u_{i,t}] / \\left[\\sum_n u^{(n)}_{i,t}\\right]^2 + \\mathrm{Var}[v_{i,t}] / \\left[\\sum_n v^{(n)}_{i,t}\\right]^2$.\n\n3.4 Kalman Filtering of the Log-Partition Function\n\nIn the above we have described two sources of information regarding the log partition function for each of the RBMs in the lattice. In this section we describe a method to fuse all available information to improve the overall accuracy of the estimate of every log partition function. We now consider the steps involved in the inference process in moving from an estimate of the posterior over the latent state at time $t - 1$ to an estimate of the posterior at time $t$. 
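In the log domain, Eq. 4 with the near-optimal bridge needs only the two unnormalized log densities evaluated at each other's samples. A minimal numpy sketch (our illustration; argument names are hypothetical):

```python
import numpy as np

def bridge_log_ratio(log_qi_xi, log_qj_xi, log_qi_xj, log_qj_xj, s):
    """Bridge-sampling estimate of log(Z_j / Z_i) (Eq. 4) using the
    approximately optimal bridge q*(x) = q~_i(x) q~_j(x) / (s q~_i(x) + q~_j(x)),
    with s ~ Z_j / Z_i. `log_qa_xb` holds log q~_a evaluated at samples x ~ q_b.
    Then log u = log q*(x_i) - log q~_i(x_i) and log v = log q*(x_j) - log q~_j(x_j)."""
    log_u = log_qj_xi - np.logaddexp(np.log(s) + log_qi_xi, log_qj_xi)
    log_v = log_qi_xj - np.logaddexp(np.log(s) + log_qi_xj, log_qj_xj)
    # Eq. 4: log sum_n u_n - log sum_n v_n, accumulated in the log domain.
    return np.logaddexp.reduce(log_u) - np.logaddexp.reduce(log_v)
```

As a sanity check, when q~_j = r * q~_i and s = r, the estimator recovers log r exactly; in the tracker, s_{i,t} is seeded coarsely and refined from the previous filtering iterate, as described above.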
We begin by assuming we know the posterior $p(\\zeta_{t-1} \\mid O^{(\\Delta t)}_{t-1:0}, O^{(\\Delta\\beta)}_{t-1:0})$, where $O^{(\\cdot)}_{t-1:0} = [O^{(\\cdot)}_1, \\ldots, O^{(\\cdot)}_{t-1}]$. We follow the treatment of Neal [18] in characterizing our uncertainty regarding $\\zeta_{i,t}$ as a Gaussian distribution and define $p(\\zeta_{t-1} \\mid O^{(\\Delta t)}_{t-1:0}, O^{(\\Delta\\beta)}_{t-1:0}) \\sim \\mathcal{N}(\\mu_{t-1,t-1}, P_{t-1,t-1})$, a multivariate Gaussian with mean $\\mu_{t-1,t-1}$ and covariance $P_{t-1,t-1}$. The double index notation is used to indicate which is the latest observation being conditioned on for each of the two types of observations: e.g. $\\mu_{t,t-1}$ represents the posterior mean given $O^{(\\Delta t)}_{t:0}$ and $O^{(\\Delta\\beta)}_{t-1:0}$.\n\nDeparting from the typical Kalman filter setting, $O^{(\\Delta t)}_t$ depends on both $\\zeta_t$ and $\\zeta_{t-1}$. In order to incorporate this observation into our estimate of the latent state, we first need to specify the prior joint distribution $p(\\zeta_{t-1}, \\zeta_t \\mid O^{(\\Delta t)}_{t-1:0}, O^{(\\Delta\\beta)}_{t-1:0}) = p(\\zeta_t \\mid \\zeta_{t-1})\\,p(\\zeta_{t-1} \\mid O^{(\\Delta t)}_{t-1:0}, O^{(\\Delta\\beta)}_{t-1:0})$, with $p(\\zeta_t \\mid \\zeta_{t-1})$ as defined in Sec. 3.1. Observation $O^{(\\Delta t)}_t$ is then incorporated through Bayes rule, yielding $p(\\zeta_{t-1}, \\zeta_t \\mid O^{(\\Delta t)}_{t:0}, O^{(\\Delta\\beta)}_{t-1:0})$. Having incorporated the importance sampling estimate into the model, we can then marginalize over $\\zeta_{t-1}$ (which is no longer required), to yield $p(\\zeta_t \\mid O^{(\\Delta t)}_{t:0}, O^{(\\Delta\\beta)}_{t-1:0})$. Finally, it remains only to incorporate the bridge sampler estimate $O^{(\\Delta\\beta)}_t$ by a second application of Bayes rule, which gives us $p(\\zeta_t \\mid O^{(\\Delta t)}_{t:0}, O^{(\\Delta\\beta)}_{t:0})$, the updated posterior over the latent state at time-step $t$. The detailed inference equations are provided in Fig. 2 and can be derived easily from standard textbook equations on products and marginals of normal distributions [4].\n\nInference Equations:\n\n(i) $p(\\zeta_{t-1}, \\zeta_t \\mid O^{(\\Delta t)}_{t-1:0}, O^{(\\Delta\\beta)}_{t-1:0}) = \\mathcal{N}(\\eta_{t-1,t-1}, V_{t-1,t-1})$, with $\\eta_{t-1,t-1} = \\begin{bmatrix} \\mu_{t-1,t-1} \\\\ \\mu_{t-1,t-1} \\end{bmatrix}$ and $V_{t-1,t-1} = \\begin{bmatrix} P_{t-1,t-1} & P_{t-1,t-1} \\\\ P_{t-1,t-1} & \\Sigma_\\zeta + P_{t-1,t-1} \\end{bmatrix}$\n\n(ii) $p(\\zeta_{t-1}, \\zeta_t \\mid O^{(\\Delta t)}_{t:0}, O^{(\\Delta\\beta)}_{t-1:0}) = \\mathcal{N}(\\eta_{t,t-1}, V_{t,t-1})$, with $V_{t,t-1} = (V^{-1}_{t-1,t-1} + C^T \\Sigma^{-1}_{\\Delta t} C)^{-1}$ and $\\eta_{t,t-1} = V_{t,t-1}(C^T \\Sigma^{-1}_{\\Delta t} O^{(\\Delta t)}_t + V^{-1}_{t-1,t-1}\\,\\eta_{t-1,t-1})$\n\n(iii) $p(\\zeta_t \\mid O^{(\\Delta t)}_{t:0}, O^{(\\Delta\\beta)}_{t-1:0}) = \\mathcal{N}(\\mu_{t,t-1}, P_{t,t-1})$, with $\\mu_{t,t-1} = [\\eta_{t,t-1}]_2$ and $P_{t,t-1} = [V_{t,t-1}]_{2,2}$\n\n(iv) $p(\\zeta_t \\mid O^{(\\Delta t)}_{t:0}, O^{(\\Delta\\beta)}_{t:0}) = \\mathcal{N}(\\mu_{t,t}, P_{t,t})$, with $P_{t,t} = (P^{-1}_{t,t-1} + H^T \\Sigma^{-1}_{\\Delta\\beta} H)^{-1}$ and $\\mu_{t,t} = P_{t,t}(H^T \\Sigma^{-1}_{\\Delta\\beta} O^{(\\Delta\\beta)}_t + P^{-1}_{t,t-1}\\,\\mu_{t,t-1})$\n\nFigure 2: Inference equations for our log partition tracking algorithm, a variant on the Kalman filter. For any vector $v$ and matrix $V$, we use the notation $[v]_2$ to denote the vector obtained by preserving the bottom half elements of $v$ and $[V]_{2,2}$ to indicate the lower right-hand quadrant of $V$.\n\n4 Experimental Results\n\nFor the following experiments, SML was performed using either constant or decreasing learning rates. 
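Steps (i)-(iv) of Fig. 2 are two Gaussian Bayes updates separated by a marginalization. A dense-matrix sketch of one tracking step (our illustration; here `c_mat` acts on the stacked vector [zeta_{t-1}; zeta_t], and all names are hypothetical):

```python
import numpy as np

def kf_track_step(mu, P, c_mat, h_mat, sigma_z, sigma_dt, sigma_db, o_dt, o_db):
    """One update of the Kalman-style tracker (eqs. (i)-(iv) of Fig. 2).
    mu, P: posterior over zeta_{t-1}; c_mat maps [zeta_{t-1}; zeta_t] to the
    importance-sampling observation; h_mat maps zeta_t to the bridge one."""
    d = len(mu)
    # (i) joint prior over [zeta_{t-1}; zeta_t]
    eta = np.concatenate([mu, mu])
    V = np.block([[P, P], [P, sigma_z + P]])
    # (ii) condition on o_dt via Bayes rule in information form
    Vi = np.linalg.inv(np.linalg.inv(V) + c_mat.T @ np.linalg.inv(sigma_dt) @ c_mat)
    eta_i = Vi @ (c_mat.T @ np.linalg.inv(sigma_dt) @ o_dt + np.linalg.inv(V) @ eta)
    # (iii) marginalize out zeta_{t-1}: keep the bottom half of the joint
    mu_p, P_p = eta_i[d:], Vi[d:, d:]
    # (iv) condition on o_db
    P_new = np.linalg.inv(np.linalg.inv(P_p) + h_mat.T @ np.linalg.inv(sigma_db) @ h_mat)
    mu_new = P_new @ (h_mat.T @ np.linalg.inv(sigma_db) @ o_db + np.linalg.inv(P_p) @ mu_p)
    return mu_new, P_new
```

This sketch pays no attention to numerical conditioning (it inverts covariances directly); the point is only to make the structure of the two-observation fusion explicit.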
We used the decreasing schedule $\\epsilon_t = \\min(\\alpha \\cdot \\epsilon_{init} / (t + 1), \\epsilon_{init})$, where $\\epsilon_t$ is the learning rate at time-step $t$, $\\epsilon_{init}$ is the initial or base learning rate and $\\alpha$ is the decrease constant. Entries of $\\Sigma_\\zeta$ (see Section 3.1) were set as follows. We set $\\sigma^2_Z = +\\infty$, which is to say that we did not exploit the smoothness prior when estimating the prior distribution over the joint $p(\\zeta_{t-1}, \\zeta_t \\mid O^{(\\Delta t)}_{t-1:0}, O^{(\\Delta\\beta)}_{t-1:0})$. $\\sigma^2_b$ was set to $10^{-3} \\cdot \\epsilon_t$, allowing the estimated bias on $O^{(\\Delta t)}_{1,t}$ to change faster for large learning rates. When initializing the RBM visible offsets\u2074 as proposed in [8], the intermediate distributions of Eq. 1 lead to sub-optimal swap rates between adjacent chains early in training, with a direct impact on the quality of tracking. In our experiments, we avoid this issue by using the intermediate distributions $q_{i,t}(x) \\propto \\exp[\\beta_i \\cdot (-h^T W v - c^T h) - b^T v]$. We tested mini-batch sizes $N \\in [10, 20]$.\n\nComparing to Exact Likelihood. We start by comparing the performance of our tracking algorithm to the exact likelihood, obtained by marginalizing over both visible and hidden units. We chose 25 hidden units and trained on the ubiquitous MNIST [13] dataset for 300k updates, using both fixed and adaptive learning rates. The main results are shown in Figure 3.\n\nIn Figure 3(a), we can see that our tracker provides a very good fit to the likelihood with $\\epsilon_{init} = 0.001$ and decrease constants $\\alpha$ in $\\{10^3, 10^4, 10^5\\}$. Increasing the base learning rate to $\\epsilon_{init} = 0.01$ in Figure 3(b), we maintain a good fit up to $\\alpha = 10^4$, with a small dip in performance at 50k updates. Our tracker fails, however, to capture the oscillatory behavior engendered by too high a learning rate ($\\epsilon_{init} = 0.01$, $\\alpha = 10^5$). 
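The annealed schedule above amounts to a constant learning rate for roughly the first alpha updates, followed by 1/t decay; a one-line sketch (ours; names hypothetical):

```python
def lr_schedule(t, eps_init, alpha):
    """eps_t = min(alpha * eps_init / (t + 1), eps_init): constant while
    t + 1 <= alpha, then decaying as 1/t."""
    return min(alpha * eps_init / (t + 1), eps_init)
```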
It is interesting to note that the failure mode of our algorithm seems to coincide with an unstable optimization process.\n\nComparing to AIS for Large-Scale Models. In evaluating the performance of our tracking algorithm on larger models, exact computation of the likelihood is no longer possible, so we use AIS as our baseline.\u2075 Our models consisted of RBMs with 500 hidden units, trained using SML-APT [6] on the MNIST and Caltech Silhouettes [16] datasets. We performed 200k updates, with learning rate parameters $\\epsilon_{init} \\in \\{.01, .001\\}$ and $\\alpha \\in \\{10^3, 10^4, 10^5\\}$.\n\nOn MNIST, AIS estimated the test-likelihood of our best model at $-94.34 \\pm 3.08$ (where $\\pm$ indicates the $3\\sigma$ confidence interval), while our tracking algorithm reported a value $-114.31$... On MNIST our tracker reported $-89.96$. On Caltech Silhouettes, our model reached $-134.23 \\pm 21.14$ according to AIS, while our tracker reported $-114.31$. To put these numbers in perspective, Salakhutdinov and Murray [23] report values of $-125.53$, $-105.50$ and $-86.34$ for 500 hidden unit RBMs trained with CD{1,3,25} respectively. Marlin et al. [16] report around $-120$ for Caltech Silhouettes, again using 500 hidden units.\n\n\u2074 Each $b_k$ is initialized to $\\log \\frac{\\bar{x}_k}{1 - \\bar{x}_k}$, where $\\bar{x}_k$ is the mean of the $k$-th dimension on the training set.\n\u2075 Our base AIS config. was $10^3$ intermediate distributions spaced linearly between $\\beta = [0, 0.5]$, $10^4$ distributions for the interval $[0.5, 0.9]$ and $10^4$ for $[0.9, 1.0]$. Estimates of $\\log Z$ are averaged over 100 annealed importance weights.\n\nFigure 3: Comparison of exact test-set likelihood and estimated likelihood as given by AIS and our tracking algorithm. We trained a 25-hidden unit RBM for 300k updates using SML, with a learning rate schedule $\\epsilon_t = \\min(\\alpha \\cdot \\epsilon_{init}/(t+1), \\epsilon_{init})$, with (left) $\\epsilon_{init} = 0.001$ and (right) $\\epsilon_{init} = 0.01$, varying $\\alpha \\in \\{10^3, 10^4, 10^5\\}$.\n\nFigure 4: (left) Plotted on the left y-axis are the Kalman filter measurements $O^{(\\Delta\\beta)}_t$, our log partition estimate of $\\zeta_{1,t}$ and point estimates of $\\zeta_{1,t}$ obtained by AIS. On the right y-axis, measurement $O^{(\\Delta t)}_t$ is plotted, along with the estimated bias $b_t$. Note how $b_t$ becomes progressively less pronounced as $\\epsilon_t$ decreases and the model converges. Also of interest, the variance on $O^{(\\Delta\\beta)}_t$ increases with $t$ but is compensated by a decreasing variance on $O^{(\\Delta t)}_t$, yielding a relatively smooth estimate $\\zeta_{1,t}$. (not shown) The $\\pm 3\\sigma$ confidence interval of the AIS estimate at 200k updates was measured to be 3.08. (right) Example of early-stopping on the dna dataset.\n\nFigure 4(left) shows a detailed view of the Kalman filter measurements and its output, for the best performing MNIST model. We can see that the variance on $O^{(\\Delta\\beta)}_t$ (plotted on the left y-axis) grows slowly over time, which is mitigated by a decreasing variance on $O^{(\\Delta t)}_t$ (plotted on the right y-axis). As the model converges and the learning rate decreases, $q_{i,t-1}$ and $q_{i,t}$ become progressively closer and the importance sampling estimates become more robust. The estimated bias term $b_t$ also converges to zero.\n\nAn important point to note is that a naive linear spacing of temperatures yielded low exchange rates between neighboring temperatures, with adverse effects on the quality of our bridge sampling estimates. As a result, we observed a drop in performance, both in likelihood as well as tracking performance. Adaptive tempering [6] (with a fixed number of chains $M$) proved crucial in getting good tracking for these experiments.\n\nEarly-Stopping Experiments. Our final set of experiments highlights the performance of our method on a wide variety of datasets [11]. 
In these experiments, we use our estimate of the log partition to monitor model performance on a held-out validation set. When the onset of over-fitting is detected, we store the model parameters and report the associated test-set likelihood, as estimated by both AIS and our tracking algorithm.\n\nDataset     | Kalman  | RBM (AIS)         | RBM-25  | NADE\nadult       | -15.24  | -15.70 (\u00b1 0.50)  | -16.29  | -13.19\nconnect4    | -15.77  | -16.81 (\u00b1 0.67)  | -22.66  | -11.99\ndna         | -87.97  | -88.51 (\u00b1 0.97)  | -96.90  | -84.81\nmushrooms   | -10.49  | -14.68 (\u00b1 30.75) | -15.15  | -9.81\nnips        | -270.10 | -271.23 (\u00b1 0.58) | -277.37 | -273.08\nocr letters | -33.87  | -31.45 (\u00b1 2.70)  | -43.05  | -27.22\nrcv1        | -46.89  | -48.61 (\u00b1 0.69)  | -48.88  | -46.66\nweb         | -28.95  | -29.91 (\u00b1 0.74)  | -29.38  | -28.39\n\nTable 1: Test set likelihood on various datasets. Models were trained using SML-PT. Early-stopping was performed by monitoring likelihood on a hold-out validation set, using our KF estimate of the log partition function. Best models (i.e. the choice of hyper-parameters) were then chosen according to the AIS likelihood estimate. Results for 25-hidden unit RBMs and NADE are taken from [11]. $\\pm$ indicates a confidence interval of three standard deviations. 
The advantage of such an early-stopping procedure is shown in Figure 4(right), where the training log-likelihood increases throughout training while validation performance starts to decrease around 250 epochs. Detecting over-fitting without tracking the log partition would require a dense grid of AIS runs, which would prove computationally prohibitive.
We tested parameters in the following ranges: number of hidden units in {100, 200, 500, 1000} (depending on dataset size); learning rates in {10^−2, 10^−3, 10^−4}, either held constant during training or annealed with constants α ∈ {10^3, 10^4, 10^5}. For tempering, we used 10 fixed temperatures, spaced linearly over β ∈ [0, 1]. SGD was performed using mini-batches of size {10, 100} when estimating the gradient, and mini-batches of size {10, 20} for our set of tempered chains (we thus simulate 10 × {10, 20} tempered chains in total). As can be seen in Table 1, our tracker performs very well compared to the AIS estimates, across all datasets. Efforts to lower the variance of the AIS estimate proved unsuccessful, even with as many as 10^5 intermediate distributions.

5 Discussion

In this paper, we have shown that while exact calculation of the partition function of RBMs may be intractable, one can exploit the smoothness of gradient descent learning to approximately track the evolution of the log partition function during learning. Treating the ζ_{i,t}'s as latent variables, the graphical model of Figure 1 allowed us to combine multiple sources of information to achieve good tracking of the log partition function throughout training, on a variety of datasets. We note, however, that good tracking performance is contingent on the ergodicity of the negative phase sampler.
Unsurprisingly, this is the same condition required by SML for accurate estimation of the negative phase gradient.
The method presented in this paper is also computationally attractive, with only a small computational overhead relative to SML-PT training. The added cost lies in the computation of the importance weights for importance sampling and bridge sampling. However, this boils down to computing free energies, which are mostly pre-computed in the course of gradient updates; the sole exception is the computation of q̃_{i,t}(x_{i,t−1}) in the importance sampling step. In comparison to AIS, our method allows us to fairly accurately track the log partition function, at a per-point estimate cost well below that of AIS. Having a reliable and accurate online estimate of the log partition function opens the door to a wide range of new research directions.

Acknowledgments

The authors acknowledge the financial support of NSERC and CIFAR, and Calcul Québec for computational resources. We also thank Hugo Larochelle for access to the datasets of Sec. 4; Hannes Schulz, Andreas Mueller, Olivier Delalleau and David Warde-Farley for feedback on the paper and algorithm; along with the developers of Theano [3].

References
[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book, Now Publishers, 2009.
[2] Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2), 245–268.
[3] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral.
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
Springer.
[5] Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 105–112, New York, NY, USA. ACM.
[6] Desjardins, G., Courville, A., and Bengio, Y. (2010a). Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.
[7] Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010b). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 145–152.
[8] Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Technical Report 2010-003, University of Toronto. Version 1.
[9] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[10] Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics, C12, 623–656.
[11] Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), volume 15 of JMLR: W&CP.
[12] Larochelle, H., Bengio, Y., and Turian, J. (2010). Tractable multivariate binary density estimation and the restricted Boltzmann forest. Neural Computation, 22(9), 2285–2307.
[13] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[14] Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. (2009).
Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1–3), 80–84.
[15] Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6), 451.
[16] Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2009). Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS'10), volume 9, pages 509–516.
[17] Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: approximate MCMC algorithms.
[18] Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
[19] Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling.
[20] Salakhutdinov, R. (2010a). Learning deep Boltzmann machines using adaptive MCMC. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), volume 1, pages 943–950. ACM.
[21] Salakhutdinov, R. (2010b). Learning in Markov random fields using tempered transitions. In NIPS'09.
[22] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS'2009, volume 5, pages 448–455.
[23] Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages 872–879. ACM.
[24] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman, editors, ICML 2009, pages 1025–1032. ACM.
[25] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064–1071.
ACM.
[26] Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In L. Bottou and M. Littman, editors, ICML 2009, pages 1033–1040. ACM.
[27] Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In NIPS'04, volume 17, Cambridge, MA. MIT Press.