{"title": "Training Restricted Boltzmann Machine via the \ufffcThouless-Anderson-Palmer free energy", "book": "Advances in Neural Information Processing Systems", "page_first": 640, "page_last": 648, "abstract": "Restricted Boltzmann machines are undirected neural networks which have been shown tobe effective in many applications, including serving as initializations fortraining deep multi-layer neural networks. One of the main reasons for their success is theexistence of efficient and practical stochastic algorithms, such as contrastive divergence,for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategycan be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machineswith hidden units.", "full_text": "Training Restricted Boltzmann Machines via the\n\nThouless-Anderson-Palmer Free Energy\n\nMarylou Gabri\u00b4e\n\nEric W. Tramel\n\nFlorent Krzakala\n\nLaboratoire de Physique Statistique, UMR 8550 CNRS\n\n\u00b4Ecole Normale Sup\u00b4erieure & Universit\u00b4e Pierre et Marie Curie\n\n75005 Paris, France\n\n{marylou.gabrie, eric.tramel}@lps.ens.fr, florent.krzakala@ens.fr\n\nAbstract\n\nRestricted Boltzmann machines are undirected neural networks which have been\nshown to be effective in many applications, including serving as initializations\nfor training deep multi-layer neural networks. One of the main reasons for their\nsuccess is the existence of ef\ufb01cient and practical stochastic algorithms, such as\ncontrastive divergence, for unsupervised training. 
We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\n\n1 Introduction\n\nA restricted Boltzmann machine (RBM) [1, 2] is a type of undirected neural network with surprisingly many applications. This model has been used in problems as diverse as dimensionality reduction [3], classification [4], collaborative filtering [5], feature learning [6], and topic modeling [7]. Also, quite remarkably, it has been shown that generative RBMs can be stacked into multi-layer neural networks, forming an initialization for deep network architectures [8, 9]. Such deep architectures are believed to be crucial for learning high-order representations and concepts. Although the amount of training data available in practice has made pretraining of deep nets dispensable for supervised tasks, RBMs remain at the core of unsupervised learning, a key area for future developments in machine intelligence [10].\n\nWhile the training procedure for RBMs can be written as a log-likelihood maximization, an exact implementation of this approach is computationally intractable for all but the smallest models. However, fast stochastic Monte Carlo methods, specifically contrastive divergence (CD) [2] and persistent CD (PCD) [11, 12], have made large-scale RBM training both practical and efficient.
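As a point of reference for this stochastic approach, a single CD-1 update can be sketched as follows. This is a minimal numpy illustration for a binary RBM with toy shapes (6 visible, 4 hidden units), not the implementation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v0, W, a, b, rng):
    """One CD-1 step: sample h|v0, reconstruct v1|h, recompute h|v1,
    and return the approximate log-likelihood gradients."""
    p_h0 = sigmoid(b + v0 @ W)            # P(h=1 | v0), positive phase
    h0 = (rng.random(p_h0.shape) < p_h0)  # sample hidden units
    p_v1 = sigmoid(a + h0 @ W.T)          # P(v=1 | h0)
    v1 = (rng.random(p_v1.shape) < p_v1)  # one-step reconstruction
    p_h1 = sigmoid(b + v1 @ W)            # negative phase statistics
    dW = np.outer(v0, p_h0) - np.outer(v1, p_h1)
    da = v0 - v1
    db = p_h0 - p_h1
    return dW, da, db

# toy dimensions: 6 visible units, 4 hidden units
W = 0.01 * rng.standard_normal((6, 4))
a, b = np.zeros(6), np.zeros(4)
v0 = (rng.random(6) < 0.5).astype(float)
dW, da, db = cd1_gradient(v0, W, a, b, rng)
```

PCD differs only in that the negative-phase chain state is carried over between parameter updates instead of being re-initialized at the data point.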
These methods have popularized RBMs even though it is not entirely clear why such approximate methods should work as well as they do.\n\nIn this paper, we propose an alternative deterministic strategy for training RBMs, and neural networks with hidden units in general, based on the so-called mean-field, and extended mean-field, methods of statistical mechanics. This strategy has been used to train neural networks in a number of earlier works [13, 14, 15, 16, 17]. In fact, for entirely visible networks, the use of adaptive cluster expansion mean-field methods has led to spectacular results in learning Boltzmann machine representations [18, 19].\n\nHowever, unlike these fully visible models, the hidden units of the RBM must be taken into account during the training procedure. In 2002, Welling and Hinton [17] presented a similar deterministic mean-field learning algorithm for general Boltzmann machines with hidden units, considering it a priori as a potentially efficient extension of CD. In 2008, Tieleman [12] tested the method in detail for RBMs and found it provided poor performance when compared to both CD and PCD. In the wake of these two papers, little inquiry has been made in this direction, with the apparent consensus being that the deterministic mean-field approach is ineffective for RBM training.\n\nOur goal is to challenge this consensus by going beyond naïve mean field, a mere first-order approximation, by introducing second-, and possibly third-, order terms. In principle, it is even possible to extend the approach to arbitrary order. Using this extended mean-field approximation, commonly known as the Thouless-Anderson-Palmer [20] approach in statistical physics, we find that RBM training performance is significantly improved over the naïve mean-field approximation and is even comparable to PCD.
The clear and easy to evaluate objective function, along with the extensible nature of the approximation, paves the way for systematic improvements in learning efficiency.\n\n2 Training restricted Boltzmann machines\n\nA restricted Boltzmann machine, which can be viewed as a two layer undirected bipartite neural network, is a specific case of an energy based model wherein a layer of visible units is fully connected to a layer of hidden units. Let us denote the binary visible and hidden units, indexed by i and j respectively, as v_i and h_j. The energy of a given state, v = {v_i}, h = {h_j}, of the RBM is given by\n\nE(v, h) = −∑_i a_i v_i − ∑_j b_j h_j − ∑_{i,j} v_i W_ij h_j,   (1)\n\nwhere W_ij are the entries of the matrix specifying the weights, or couplings, between the visible and hidden units, and a_i and b_j are the biases, or the external fields in the language of statistical physics, of the visible and hidden units, respectively. Thus, the set of parameters {W_ij, a_i, b_j} defines the RBM model.\n\nThe joint probability distribution over the visible and hidden units is given by the Gibbs-Boltzmann measure P(v, h) = Z⁻¹ e^{−E(v,h)}, where Z = ∑_{v,h} e^{−E(v,h)} is the normalization constant known as the partition function in physics. For a given data point, represented by v, the marginal of the RBM is calculated as P(v) = ∑_h P(v, h). Writing this marginal of v in terms of its log-likelihood results in the difference\n\nL = ln P(v) = −F^c(v) + F,   (2)\n\nwhere F = −ln Z is the free energy of the RBM, and F^c(v) = −ln(∑_h e^{−E(v,h)}) can be interpreted as a free energy as well, but with visible units fixed to the training data point v. Hence, F^c is
Hence, F c is\nreferred to as the clamped free energy.\nOne of the most important features of the RBM model is that F c can be easily computed as h\nmay be summed out analytically since the hidden units are conditionally independent of the visible\nunits, owing to the RBM\u2019s bipartite structure. However, calculating F is computationally intractable\nsince the number of possible states to sum over scales combinatorially with the number of units in\nthe model. This complexity frustrates the exact computation of the gradients of the log-likelihood\nneeded in order to train the RBM parameters via gradient ascent. Monte Carlo methods for RBM\ntraining rely on the observation that\n= P (vi = 1, hj = 1), which can be simulated at a\nlower computational cost. Nevertheless, drawing independent samples from the model in order\nto approximate this derivative is itself computationally expensive and often approximate sampling\nalgorithms, such as CD or PCD, are used instead.\n\n\u2202Wij\n\n\u2202F\n\n3 Extended mean \ufb01eld theory of RBMs\n\nHere, we present a physics-inspired tractable estimation of the free energy F of the RBM. This\napproximation is based on a high temperature expansion of the free energy derived by Georges and\nYedidia in the context of spin glasses [21] following the pioneering works of [20, 22]. We refer the\nreader to [23] for a review of this topic.\n\n2\n\n\f(cid:80)\n\n(cid:80)\n\ni aisi \u2212\n\n(i,j) Wijsisj\n\nTo apply the Georges-Yedidia expansion to the RBM free energy, we start with a general energy\nbased model which possesses arbitrary couplings Wij between undifferentiated binary spins si \u2208\n{0, 1}, such that the energy of the Gibbs-Boltzmann measure on the con\ufb01guration s = {si} is\n1. 
We also restore the role of the temperature,\nde\ufb01ned by E(s) = \u2212\nusually considered constant and for simplicity set to 1 in most energy based models, by multiplying\nthe energy functional in the Boltzmann weight by the inverse temperature \u03b2.\nNext, we apply a Legendre transform to the free energy, a standard procedure in statistical physics,\nby \ufb01rst writing the free energy as a function of a newly introduced auxiliary external \ufb01eld q = {qi},\ni qisi. This external \ufb01eld will be eventually set to the value q = 0 in\norder to recover the true free energy. The Legendre transform \u0393 is then given as a function of the\nconjugate variable m = {mi} by maximizing over q,\n\n\u2212\u03b2F [q] = ln(cid:80)\n\ns e\u2212\u03b2E(s)+\u03b2(cid:80)\n\n\u2212\u03b2\u0393[m] = \u2212\u03b2 max\n\nq\n\n[F [q] +\n\nqimi] = \u2212\u03b2(F [q\u2217[m]] +\n\n\u2217\ni [m]mi),\nq\n\n(3)\n\n(cid:88)\n\ni\n\n(cid:88)\n\ni\n\ndq . Since the derivative dF\n\nwhere the maximizing auxiliary \ufb01eld q\u2217[m], a function of the conjugate variables, is the inverse\nfunction of m[q] \u2261 \u2212 dF\ndq is exactly equal to \u2212(cid:104)s(cid:105), where the operator (cid:104)\u00b7(cid:105)\nrefers to the average con\ufb01guration under the Boltzmann measure, the conjugate variable m is in fact\nthe equilibrium magnetization vector (cid:104)s(cid:105). Finally, we observe that the free energy is also the inverse\nLengendre transform of its Legendre transform at q = 0,\n(4)\n\u2212\u03b2F = \u2212\u03b2F [q = 0] = \u2212\u03b2 min\nwhere m\u2217 minimizes \u0393, which yields an expression of the free energy in terms of the magnetization\nvector. 
Following [22, 21], this formulation allows us to perform a high temperature expansion of A(β, m) ≡ −βΓ[m] around β = 0 at fixed m,\n\nA(β, m) = A(0, m) + β ∂A(β, m)/∂β|_{β=0} + (β²/2) ∂²A(β, m)/∂β²|_{β=0} + ···,   (5)\n\nwhere the dependence on β of the product βq must carefully be taken into account. At infinite temperature, β = 0, the spins decorrelate, causing the average value of an arbitrary product of spins to equal the product of their local magnetizations; a useful property. Accounting for binary spins taking values in {0, 1}, one obtains the following expansion\n\n−βΓ(m) = −∑_i [m_i ln m_i + (1 − m_i) ln(1 − m_i)] + β ∑_i a_i m_i + β ∑_(i,j) W_ij m_i m_j + (β²/2) ∑_(i,j) W_ij² (m_i − m_i²)(m_j − m_j²) + (2β³/3) ∑_(i,j) W_ij³ (m_i − m_i²)(1/2 − m_i)(1/2 − m_j)(m_j − m_j²) + β³ ∑_(i,j,k) W_ij W_jk W_ki (m_i − m_i²)(m_j − m_j²)(m_k − m_k²) + ···.¹   (6)\n\nThe zeroth-order term corresponds to the entropy of non-interacting spins with constrained magnetization values. Taking this expansion up to the first-order term, we recover the standard naïve mean-field theory. The second-order term is known as the Onsager reaction term in the TAP equations [20].
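To make the truncation concrete, the second-order (TAP) truncation of Eq. (6) can be evaluated numerically. The sketch below is our own numpy illustration (assuming a symmetric coupling matrix W with zero diagonal, so the pair sums become matrix quadratic forms, and β = 1 by default); it checks that at W = 0, where the spins are independent, evaluating the truncation at the minimizer m_i = sigm(a_i) recovers the exact free energy −F = ∑_i ln(1 + e^{a_i}):

```python
import numpy as np

def neg_beta_gamma(m, a, W, beta=1.0):
    """Second-order truncation of Eq. (6), i.e. -beta*Gamma(m), for binary
    0/1 spins with fields a and symmetric zero-diagonal couplings W.
    Quadratic forms count each distinct pair (i,j) once via the 0.5 factor."""
    entropy = -np.sum(m * np.log(m) + (1 - m) * np.log(1 - m))
    c = m - m**2                                      # spin variances
    naive_mf = beta * (a @ m + 0.5 * m @ W @ m)       # first-order terms
    onsager = (beta**2 / 2) * 0.5 * (c @ (W**2) @ c)  # Onsager reaction term
    return entropy + naive_mf + onsager

rng = np.random.default_rng(1)
a = rng.standard_normal(5)

# With W = 0 the minimizer is m_i = sigm(a_i) and the truncation is exact.
W0 = np.zeros((5, 5))
m_star = 1.0 / (1.0 + np.exp(-a))
exact = np.sum(np.log1p(np.exp(a)))       # -F for independent spins
approx = neg_beta_gamma(m_star, a, W0)
```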
The higher-order terms are systematic corrections which were first derived in [21].\n\n¹ The notation ∑_(i,j) and ∑_(i,j,k) refers to the sum over the distinct pairs and triplets of spins, respectively.\n\nReturning to the RBM notation and truncating the expansion at second-order for the remainder of the theoretical discussion, we have\n\nΓ(m^v, m^h) ≈ −S(m^v, m^h) − ∑_i a_i m^v_i − ∑_j b_j m^h_j − ∑_{i,j} [ W_ij m^v_i m^h_j + (W_ij²/2)(m^v_i − (m^v_i)²)(m^h_j − (m^h_j)²) ],   (7)\n\nwhere S is the entropy contribution, m^v and m^h are introduced to denote the magnetization of the visible and hidden units, and β is set equal to 1. Eq. (7) can be viewed as a weak coupling expansion in W_ij. To recover an estimate of the RBM free energy, Eq. (7) must be minimized with respect to its arguments, as in Eq. (4). Lastly, by writing the stationary condition dΓ/dm = 0, we obtain the self-consistency constraints on the magnetizations. At second-order we obtain the following constraint on the visible magnetizations,\n\nm^v_i ≈ sigm[ a_i + ∑_j ( W_ij m^h_j − W_ij² (m^v_i − 1/2)(m^h_j − (m^h_j)²) ) ],   (8)\n\nwhere sigm[x] = (1 + e^{−x})⁻¹ is a logistic sigmoid function. A similar constraint must be satisfied for the hidden units, as well. Clearly, the stationarity condition for Γ obtained at order n utilizes terms up to the nth order within the sigmoid argument of these consistency relations. Whatever the order of the approximation, the magnetizations are the solutions of a set of non-linear coupled equations of the same cardinality as the number of units in the model.
Finally, provided we can define a procedure to efficiently derive the value of the magnetizations satisfying these constraints, we obtain an extended mean-field approximation of the free energy which we denote as F^EMF.\n\n4 RBM evaluation and unsupervised training with EMF\n\n4.1 An iteration for calculating F^EMF\n\nRecalling the log-likelihood of the RBM, L = −F^c(v) + F, we have shown that a tractable approximation of F, F^EMF, is obtained via a weak coupling expansion so long as one can solve the coupled system of equations over the magnetizations shown in Eq. (8). In the spirit of iterative belief propagation [23], we propose that these self-consistency relations can serve as update rules for the magnetizations within an iterative algorithm. In fact, the convergence of this procedure has been rigorously demonstrated in the context of random spin glasses [24]. We expect that these convergence properties will remain present even for real data. The iteration over the self-consistency relations for both the hidden and visible magnetizations can be written using the time index t as\n\nm^h_j[t+1] ← sigm[ b_j + ∑_i ( W_ij m^v_i[t] − W_ij² (m^h_j[t] − 1/2)(m^v_i[t] − (m^v_i[t])²) ) ],   (9)\n\nm^v_i[t+1] ← sigm[ a_i + ∑_j ( W_ij m^h_j[t+1] − W_ij² (m^v_i[t] − 1/2)(m^h_j[t+1] − (m^h_j[t+1])²) ) ],   (10)\n\nwhere the time indexing follows from application of [24]. The values of m^v and m^h minimizing Γ(m^v, m^h), and thus providing the value of F^EMF, are obtained by running Eqs. (9, 10) until they converge to a fixed point.
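A direct transcription of Eqs. (9, 10) into an alternating update can be sketched as follows; this is our own numpy illustration with toy sizes and small random weights (the released package is in Julia), not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def emf_fixed_point(W, a, b, mv, mh, n_iter=50):
    """Iterate the second-order self-consistency relations, Eqs. (9, 10):
    the hidden update uses mv[t], the visible update uses the fresh mh[t+1]."""
    for _ in range(n_iter):
        mh = sigmoid(b + mv @ W
                     - ((mv - mv**2) @ W**2) * (mh - 0.5))   # Eq. (9)
        mv = sigmoid(a + W @ mh
                     - (W**2 @ (mh - mh**2)) * (mv - 0.5))   # Eq. (10)
    return mv, mh

rng = np.random.default_rng(0)
n_vis, n_hid = 8, 5
W = 0.1 * rng.standard_normal((n_vis, n_hid))   # weak couplings
a = 0.1 * rng.standard_normal(n_vis)
b = 0.1 * rng.standard_normal(n_hid)

mv, mh = emf_fixed_point(W, a, b, np.full(n_vis, 0.5), np.full(n_hid, 0.5),
                         n_iter=200)
# one extra sweep should barely move a converged fixed point
mv_next, mh_next = emf_fixed_point(W, a, b, mv, mh, n_iter=1)
residual = max(np.max(np.abs(mv_next - mv)), np.max(np.abs(mh_next - mh)))
```

For weak couplings the map is strongly contractive, so the residual after convergence is essentially at machine precision.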
We note that while we present an iteration to find F^EMF up to second-order above, third-order terms can easily be introduced into the procedure.\n\n4.2 Deterministic EMF training\n\nBy using the EMF estimation of F, and the iterative algorithm detailed in the previous section to calculate it, it is now possible to estimate the gradients of the log-likelihood used for unsupervised training of the RBM model by substituting F with F^EMF. We note that the deterministic iteration we propose for estimating F is in stark contrast with the stochastic sampling procedures utilized in CD and PCD to the same end. The gradient ascent update of weight W_ij is approximated as\n\nΔW_ij ∝ ∂L/∂W_ij ≈ −∂F^c/∂W_ij + ∂F^EMF/∂W_ij,   (11)\n\nwhere ∂F^EMF/∂W_ij can be computed by differentiating Eq. (7) at fixed m^v and m^h and computing the value of this derivative at the fixed points of Eqs. (9, 10) obtained from the iterative procedure. The gradients with respect to the visible and hidden biases can be derived similarly. Interestingly, ∂F^EMF/∂a_i and ∂F^EMF/∂b_j are merely the fixed-point magnetizations of the visible and hidden units, m^v_i and m^h_j, respectively.\n\nA priori, the training procedure sketched above can be used at any order of the weak coupling expansion. The training algorithm introduced in [17], which was shown to perform poorly for RBM training in [12], can be recovered by retaining only the first-order of the expansion when calculating F^EMF. Taking F^EMF to second-order, we expect that training efficiency and performance will be greatly improved over [17]. In fact, including the third-order term in the training algorithm is just as easy as including the second-order one, due to the fact that the particular structure of the RBM model does not admit triangles in its corresponding factor graphs.
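Differentiating Eq. (7) at fixed magnetizations gives ∂F^EMF/∂W_ij = −( m^v_i m^h_j + W_ij (m^v_i − (m^v_i)²)(m^h_j − (m^h_j)²) ), so the update of Eq. (11) can be sketched as below. This is our own numpy illustration (mv and mh are assumed to already be fixed points of Eqs. (9, 10)), not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def emf_gradients(v, W, a, b, mv, mh):
    """Eq. (11): clamped (data) term minus the second-order EMF model term
    evaluated at fixed-point magnetizations mv, mh."""
    h_clamped = sigmoid(b + v @ W)        # -dF^c/dW_ij = v_i * h_clamped_j
    cv, ch = mv - mv**2, mh - mh**2       # magnetization variances
    # dF^EMF/dW_ij = -(mv_i mh_j + W_ij cv_i ch_j) at the fixed point
    dW = np.outer(v, h_clamped) - (np.outer(mv, mh) + W * np.outer(cv, ch))
    da = v - mv                           # bias gradients: differences of
    db = h_clamped - mh                   # clamped and free magnetizations
    return dW, da, db

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 4))
a, b = np.zeros(6), np.zeros(4)
v = (rng.random(6) < 0.5).astype(float)
mv, mh = np.full(6, 0.5), np.full(4, 0.5)   # placeholder fixed points
dW, da, db = emf_gradients(v, W, a, b, mv, mh)
```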
Although the third-order term in Eq. (6) does include a sum over distinct pairs of units, as well as a sum over coupled triplets of units, such triplets are excluded by the bipartite structure of the RBM. However, coupled quadruplets do contribute to the fourth-order term and therefore fourth- and higher-order approximations require much more expensive computations [21], though it is possible to utilize adaptive procedures [19].\n\n5 Numerical experiments\n\n5.1 Experimental framework\n\nTo evaluate the performance of the proposed deterministic EMF RBM training algorithm¹, we perform a number of numerical experiments over two separate datasets and compare these results with both CD-1 and PCD. We first use the MNIST dataset of labeled handwritten digit images [25]. The dataset is split between 60 000 training images and 10 000 test images. Both subsets contain approximately the same fraction of the ten digit classes (0 to 9). Each image is comprised of 28 × 28 pixels taking values in the range [0, 255]. The MNIST dataset was binarized by setting all non-zero pixels to 1 in all experiments.\n\nSecond, we use the 28 × 28 pixel version of the Caltech 101 Silhouette dataset [26]. Constructed from the Caltech 101 image dataset, the silhouette dataset consists of black regions of the primary foreground scene objects on a white background. The images are labeled according to the object in the original picture, of which there are 101 unevenly represented object labels. The dataset is split between training (4 100 images), validation (2 264 images), and test (2 304 images) sets.\n\nFor both datasets, the RBM models require 784 visible units. Following previous studies evaluating RBMs on these datasets, we fix the number of RBM hidden units to 500 in all our experiments.
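The preprocessing just described amounts to thresholding pixels and drawing shuffled mini-batches; a small illustrative sketch (using a random stand-in array in place of the actual MNIST data):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the MNIST training matrix (rows are 784-pixel images, 0-255)
X = rng.integers(0, 256, size=(1000, 784))
Xb = (X > 0).astype(np.float64)   # binarize: all non-zero pixels set to 1

def minibatches(X, batch_size, rng):
    """Yield shuffled mini-batches; 100 training points per batch for MNIST."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        yield X[idx[start:start + batch_size]]

batches = list(minibatches(Xb, 100, rng))
```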
During training, we adopt the mini-batch learning procedure for gradient averaging, with 100 training points per batch for MNIST and 256 training points per batch for Caltech 101 Silhouette.\n\nWe test the EMF learning algorithm presented in Section 4.2 in various settings. First, we compare implementations utilizing the first-order (MF), second-order (TAP2), and third-order (TAP3) approximations of F. Higher orders were not considered due to their greater complexity. Next, we investigate training quality when the self-consistency relations on the magnetizations were not converged when calculating the derivatives of F^EMF, instead iterated for a small, fixed (3) number of times, an approach similar to CD. Furthermore, we also evaluate a "persistent" version of our algorithm, similar to [12]. As in PCD, the iterative EMF procedure possesses multiple initialization-dependent fixed-point magnetizations. Converging multiple chains allows us to collect proper statistics on these basins of attraction. In this implementation, the magnetizations of a set of points, dubbed fantasy particles, are updated and maintained throughout the training in order to estimate F. This persistent procedure takes advantage of the fact that the RBM-defined Boltzmann measure changes only slightly between parameter updates. Convergence to the new fixed point magnetizations at each minibatch should therefore be sped up by initializing with the converged state from the previous update. Our final experiments consist of persistent training algorithms using 3 iterations of the magnetization self-consistency relations (P-MF, P-TAP2 and P-TAP3) and one persistent training algorithm using 30 iterations (P-TAP2-30) for comparison.\n\nFor comparison, we also train RBM models using CD-1, following the prescriptions of [27], and PCD, as implemented in [12].
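The persistent bookkeeping can be sketched as keeping the fantasy-particle magnetizations alive across mini-batches and warm-starting a few sweeps of Eqs. (9, 10) from the previous state. This is our own numpy illustration of that idea, not the released Julia implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PersistentEMF:
    """Fantasy-particle magnetizations kept across parameter updates; each
    call runs a few TAP sweeps warm-started from the last converged state."""

    def __init__(self, n_vis, n_hid, n_particles, rng):
        self.mv = rng.uniform(0.1, 0.9, size=(n_particles, n_vis))
        self.mh = rng.uniform(0.1, 0.9, size=(n_particles, n_hid))

    def update(self, W, a, b, n_iter=3):
        # a small fixed number of iterations per batch, as in P-TAP2
        for _ in range(n_iter):
            self.mh = sigmoid(b + self.mv @ W
                              - ((self.mv - self.mv**2) @ W**2) * (self.mh - 0.5))
            self.mv = sigmoid(a + self.mh @ W.T
                              - ((self.mh - self.mh**2) @ (W**2).T) * (self.mv - 0.5))
        return self.mv, self.mh

rng = np.random.default_rng(0)
particles = PersistentEMF(n_vis=6, n_hid=4, n_particles=10, rng=rng)
W = 0.1 * rng.standard_normal((6, 4))
mv, mh = particles.update(W, np.zeros(6), np.zeros(4))
```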
Given that our goal is to compare RBM training approaches rather than achieving the best possible training across all free parameters, neither momentum nor adaptive learning rates were included in any of the implementations tested. However, we do employ a weight decay regularization in all our trainings to keep weights small; a necessity for the weak coupling expansion on which the EMF relies. When comparing learning procedures on the same plot, all free parameters of the training (e.g. learning rate, weight decay, etc.) were set identically. All results are presented as averages over 10 independent trainings with standard deviations reported as error bars.\n\n¹ Available as a Julia package at https://github.com/sphinxteam/Boltzmann.jl\n\nFigure 1: Estimates of the per-sample log-likelihood over the MNIST test set, normalized by the total number of units, as a function of the number of training epochs. The results for the different training algorithms are plotted in different colors with the same color code used for both panels. Left panel: Pseudo log-likelihood estimate. The difference between EMF algorithms and contrastive divergence algorithms is minimal. Right panel: EMF log-likelihood estimate at 2nd order. The improvement from MF to TAP is clear. Perhaps reasonably, TAP demonstrates an advantage over CD and PCD. Notice how the second-order EMF approximation of L provides less noisy estimates, at a lower computational cost.\n\n5.2 Relevance of the EMF log-likelihood\n\nOur first observation is that the implementations of the EMF training algorithms are not overly belabored. The free parameters relevant for the PCD and CD-1 procedures were found to be equally well suited for the EMF training algorithms. In fact, as shown in the left panel of Fig. 1, and the right inset of Fig.
3, the ascent of the pseudo log-likelihood over training epochs is very similar between the EMF training methods and both the CD-1 and PCD trainings.\n\nInterestingly, for the Caltech 101 Silhouettes dataset, it seems that the persistent algorithms tested have difficulties in ascending the pseudo-likelihood in the first epochs of training. This contradicts the common belief that persistence yields more accurate approximations of the likelihood gradients. The complexity of the training set, 101 classes unevenly represented over only 4 100 training points, might explain this unexpected behavior. The persistent fantasy particles all converge to similar non-informative blurs in the earliest training epochs, with many epochs being required to resolve the particles to a distribution of values which are informative about the pseudo log-likelihood.\n\nExamining the fantasy particles also gives an idea of the performance of the RBM as a generative model. In Fig. 2, 24 randomly chosen fantasy particles from the 50th epoch of training with PCD, P-MF, and P-TAP2 are displayed. The RBM trained with PCD generates recognizable digits, yet the model seems to have trouble generating several digit classes, such as 3, 8, and 9. The fantasy particles extracted from a P-MF training are of poorer quality, with half of the drawn particles featuring non-identifiable digits. The P-TAP2 algorithm, however, appears to provide qualitative improvements. All digits can be visually discerned, with visible defects found only in two of the particles. These particles seem to indicate that it is indeed possible to efficiently persistently train an RBM without converging on the fixed point of the magnetizations.\n\nThe relevance of the EMF log-likelihood for RBM training is further confirmed in the right panel of Fig.
1, where we observe that both CD-1 and PCD ascend the second-order EMF log-likelihood, even though they are not explicitly constructed to optimize over this objective. As expected, the persistent TAP2 algorithm with 30 iterations of the magnetizations (P-TAP2-30) achieves the best maximization of L^EMF. However, P-TAP2, with only 3 iterations of the magnetizations, achieves very similar performance, perhaps making it preferable when a faster training algorithm is desired.\n\nFigure 2: Fantasy particles generated by a 500 hidden unit RBM after 50 epochs of training on the MNIST dataset with PCD (top two rows), P-MF (middle two rows) and P-TAP2 (bottom two rows). These fantasy particles represent typical samples generated by the trained RBM when used as a generative prior for handwritten numbers. The samples generated by P-TAP2 are of similar subjective quality, and perhaps slightly preferable, to those generated by PCD, while certainly preferable to those generated by P-MF.\n\nMoreover, we note that although P-TAP2 demonstrates improvements with respect to the P-MF, the P-TAP3 does not yield significantly better results than P-TAP2. This is perhaps not surprising since the third order term of the EMF expansion consists of a sum over as many terms as the second order, but at a smaller order in {W_ij}.\n\nLastly, we note the computation times for each of these approaches. For a Julia implementation of the tested RBM training techniques running on a 3.2 GHz Intel i5 processor, we report the 10 trial average wall times for fitting a single 100-sample batch normalized against the model complexity. PCD, which uses only a single sampling step, required 14.10 ± 0.97 µs/batch/unit.
The three EMF techniques, P-MF, P-TAP2, and P-TAP3, each of which use 3 magnetization iterations, required 21.25 ± 0.22 µs/batch/unit, 37.22 ± 0.34 µs/batch/unit, and 64.88 ± 0.45 µs/batch/unit, respectively. If fewer magnetization iterations are required, as we have empirically observed in limited tests, then the run times of the P-MF and P-TAP2 approaches are commensurate with PCD.\n\n5.3 Classification task performance\n\nWe also evaluate these RBM training algorithms from the perspective of supervised classification. An RBM can be interpreted as a deterministic function mapping the binary visible unit values to the real-valued hidden unit magnetizations. In this case, the hidden unit magnetizations represent the contributions of some learned features. Although no supervised fine-tuning of the weights is implemented, we tested the quality of the features learned by the different training algorithms by their usefulness in classification tasks. For both datasets, a logistic regression classifier was calibrated with the hidden unit magnetizations mapped from the labeled training images using the scikit-learn toolbox [28]. We purposely avoid using more sophisticated classification algorithms in order to place emphasis on the quality of the RBM training, not the classification method.\n\nIn Fig. 3, we see that the MNIST classification accuracy of the RBMs trained with the P-TAP2 algorithms is roughly equivalent with that obtained when using PCD training, while CD-1 training yields markedly poorer classification accuracy. The slight decrease in performance of CD-1 and TAP2 as the training epochs increase might be emblematic of over-fitting by the non-persistent algorithms, although no decrease in the EMF test set log-likelihood was observed.\n\nFinally, for the Caltech 101 Silhouettes dataset, the classification task, shown in the right panel of Fig.
3, is much more difficult a priori. Interestingly, the persistent algorithms do not yield better results on this task. However, we observe that the performance of deterministic EMF RBM training is at least comparable with both CD-1 and PCD.\n\nFigure 3: Test set classification accuracy for the MNIST (left) and Caltech 101 Silhouette (right) datasets using logistic regression on the hidden-layer marginal probabilities as a function of the number of epochs. As a baseline comparison, the classification accuracy of logistic regression performed directly on the data is given as a black dashed line. The results for the different training algorithms are displayed in different colors, with the same color code being used in both panels. (Right inset:) Pseudo log-likelihood over training epochs for the Caltech 101 Silhouette dataset.\n\n6 Conclusion\n\nWe have presented a method for training RBMs based on an extended mean field approximation. Although a naïve mean field learning algorithm had already been designed for RBMs, and judged unsatisfactory [17, 12], we have shown that extending beyond the naïve mean field to include terms of second-order and above brings significant improvements over the first-order approach and allows for practical and efficient deterministic RBM training with performance comparable to the stochastic CD and PCD training algorithms.\n\nThe extended mean field theory also provides an estimate of the RBM log-likelihood which is easy to evaluate and thus enables practical monitoring of the progress of unsupervised learning throughout the training epochs. Furthermore, training on real-valued magnetizations is theoretically well-founded within the presented approach, paving the way for many possible extensions.
For instance, it would be quite straightforward to apply the same kind of expansion to Gauss-Bernoulli RBMs, as well as to multi-label RBMs.\n\nThe extended mean field approach might also be used to learn stacked RBMs jointly, rather than separately, as is done in both deep Boltzmann machine and deep belief network pre-training, a strategy that has shown some promise [29]. In fact, the approach can be generalized even to non-restricted Boltzmann machines with hidden variables with very little difficulty. Another interesting possibility would be to make use of higher-order terms in the series expansion using adaptive cluster methods such as those used in [19]. We believe our results show that the extended mean field approach, and in particular the Thouless-Anderson-Palmer one, may be a good starting point to theoretically analyze the performance of RBMs and deep belief networks.\n\nAcknowledgments\n\nWe would like to thank F. Caltagirone and A. Decelle for many insightful discussions. This research was funded by the European Research Council under the European Union's 7th Framework Programme (FP/2007-2013/ERC Grant Agreement 307087-SPARCS).\n\nReferences\n\n[1] P. Smolensky. Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, 1986.\n\n[2] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comp., 14:1771–1800, 2002.\n\n[3] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.\n\n[4] H. Larochelle and Y. Bengio.
Classification using discriminative restricted Boltzmann machines. In ICML, pages 536–543, 2008.

[5] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791–798, 2007.

[6] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Intl. Conf. on Artificial Intelligence and Statistics, pages 215–223, 2011.

[7] G. Hinton and R. Salakhutdinov. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614, 2009.

[8] R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Intl. Conf. on Artificial Intelligence and Statistics, pages 448–455, 2009.

[9] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comp., 18(7):1527–1554, 2006.

[10] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, May 2015.

[11] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.

[12] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008.

[13] C. Peterson and J. R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.

[14] G. Hinton. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comp., 1(1):143–150, 1989.

[15] C. C. Galland. The limitations of deterministic Boltzmann machine learning. Network, 4:355–379, 1993.

[16] H. J. Kappen and F. B. Rodríguez. Boltzmann machine learning using mean field theory and linear response correction. In NIPS, pages 280–286, 1998.

[17] M. Welling and G. Hinton. A new learning algorithm for mean field Boltzmann machines. In Intl.
Conf. on Artificial Neural Networks, pages 351–357, 2002.

[18] S. Cocco, S. Leibler, and R. Monasson. Neuronal couplings between retinal ganglion cells inferred by efficient inverse statistical physics methods. PNAS, 106(33):14058–14062, 2009.

[19] S. Cocco and R. Monasson. Adaptive cluster expansion for inferring Boltzmann machines with noisy data. Physical Review Letters, 106(9):090601, 2011.

[20] D. J. Thouless, P. W. Anderson, and R. G. Palmer. Solution of 'Solvable model of a spin glass'. Philosophical Magazine, 35(3):593–601, 1977.

[21] A. Georges and J. S. Yedidia. How to expand around mean-field theory using high-temperature expansions. Journal of Physics A: Mathematical and General, 24(9):2173–2192, 1991.

[22] T. Plefka. Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. Journal of Physics A: Mathematical and General, 15(6):1971–1978, 1982.

[23] M. Opper and D. Saad. Advanced Mean Field Methods: Theory and Practice. MIT Press, 2001.

[24] E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.

[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2323, 1998.

[26] B. M. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In Intl. Conf. on Artificial Intelligence and Statistics, pages 509–516, 2010.

[27] G. Hinton. A practical guide to training restricted Boltzmann machines. UTML Technical Report 2010-003, University of Toronto, 2010.

[28] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.

[29] I. J. Goodfellow, A. Courville, and Y. Bengio.
Joint training deep Boltzmann machines for classification. arXiv preprint arXiv:1301.3568, 2013.