{"title": "Boltzmann Machine Learning Using Mean Field Theory and Linear Response Correction", "book": "Advances in Neural Information Processing Systems", "page_first": 280, "page_last": 286, "abstract": "", "full_text": "Boltzmann Machine learning using mean field theory and linear response correction \n\nH.J. Kappen \nDepartment of Biophysics \nUniversity of Nijmegen, Geert Grooteplein 21 \nNL 6525 EZ Nijmegen, The Netherlands \n\nF. B. Rodriguez \nInstituto de Ingenieria del Conocimiento & Departamento de Ingenieria Informatica, \nUniversidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain \n\nAbstract \n\nWe present a new approximate learning algorithm for Boltzmann Machines, using a systematic expansion of the Gibbs free energy to second order in the weights. The linear response correction to the correlations is given by the Hessian of the Gibbs free energy. The computational complexity of the algorithm is cubic in the number of neurons. We compare the performance of the exact BM learning algorithm with first order (Weiss) mean field theory and second order (TAP) mean field theory. The learning task consists of a fully connected Ising spin glass model on 10 neurons. We conclude that 1) the method works well for paramagnetic problems, 2) the TAP correction gives a significant improvement over the Weiss mean field theory, both for paramagnetic and spin glass problems, and 3) the inclusion of diagonal weights improves the Weiss approximation for paramagnetic problems, but not for spin glass problems. \n\n1 Introduction \n\nBoltzmann Machines (BMs) [1] are networks of binary neurons with a stochastic neuron dynamics, known as Glauber dynamics. Assuming symmetric connections between neurons, the probability distribution over neuron states s becomes stationary and is given by the Boltzmann-Gibbs distribution p(s). 
The Boltzmann distribution is a known function of the weights and thresholds of the network. However, computation of p(s), or of any statistics involving p(s) such as mean firing rates or correlations, requires exponential time in the number of neurons. This is due to the fact that p(s) contains a normalization term Z, which involves a sum over all states in the network, of which there are exponentially many. This problem is particularly important for BM learning. \n\nUsing statistical sampling techniques [2], learning can be significantly improved [1]. However, the method has rather poor convergence and can only be applied to small networks. \n\nIn [3, 4], an acceleration method for learning in BMs is proposed using mean field theory, by replacing ⟨s_i s_j⟩ by m_i m_j in the learning rule. It can be shown [5] that such a naive mean field approximation of the learning rules does not converge in general. Furthermore, we argue that the correlations can be computed using the linear response theorem [6]. \n\nIn [7, 5] the mean field approximation is derived by making use of the properties of convex functions (Jensen's inequality and tangential bounds). In this paper we present an alternative derivation which uses a Legendre transformation and a small coupling expansion [8]. It has the advantage that higher order contributions (TAP and higher) can be computed in a systematic manner, and that it may be applicable to arbitrary graphical models. \n\n2 Boltzmann Machine learning \n\nThe Boltzmann Machine is defined as follows. The possible configurations of the network can be characterized by a vector s = (s_1, .., s_i, .., s_n), where s_i = ±1 is the state of neuron i, and n the total number of neurons. Neurons are updated using Glauber dynamics. Let us define the energy of a configuration s as \n\n-E(s) = (1/2) Σ_{i,j} w_ij s_i s_j + Σ_i s_i θ_i. \n\nAfter long times, the probability to find the network in a state s becomes independent of time (thermal equilibrium) and is given by the Boltzmann distribution \n\np(s) = (1/Z) exp{-E(s)}. (1) \n\nZ = Σ_s exp{-E(s)} is the partition function, which normalizes the probability distribution. \n\nLearning [1] consists of adjusting the weights and thresholds in such a way that the Boltzmann distribution approximates a target distribution q(s) as closely as possible. A suitable measure of the difference between the distributions p(s) and q(s) is the Kullback divergence [9] \n\nK = Σ_s q(s) log(q(s)/p(s)). (2) \n\nLearning consists of minimizing K using gradient descent [1] \n\nΔw_ij = η( ⟨s_i s_j⟩_c − ⟨s_i s_j⟩ ), Δθ_i = η( ⟨s_i⟩_c − ⟨s_i⟩ ). \n\nThe parameter η is the learning rate. The brackets ⟨·⟩ and ⟨·⟩_c denote the 'free' and 'clamped' expectation values, respectively. \n\nThe computation of both the free and the clamped expectation values is intractable, because it consists of a sum over all unclamped states. As a result, the BM learning algorithm can not be applied to practical problems. \n\n3 The mean field approximation \n\nWe derive the mean field free energy using the small coupling expansion as introduced by Plefka [8]. The energy of the network is given by \n\nE(s, w, θ, γ) = γ E_int(s) − Σ_i s_i θ_i, with E_int(s) = −(1/2) Σ_{i,j} w_ij s_i s_j, \n\nwhich reduces to the original energy for γ = 1. The free energy is given by \n\nF(w, θ, γ) = −log Tr_s e^{−E(s,w,θ,γ)} \n\nand is a function of the independent variables w_ij, θ_i and γ. We perform a Legendre transformation on the variables θ_i by introducing m_i = −∂F/∂θ_i. The Gibbs free energy \n\nG(w, m, γ) = F(w, θ, γ) + Σ_i θ_i m_i \n\nis now a function of the independent variables m_i and w_ij, and θ_i is implicitly given by ⟨s_i⟩_γ = m_i. The expectation ⟨·⟩_γ is with respect to the full model with interaction γ. 
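All expectation values above involve sums over the 2^n network states, which is feasible only for small n. As an illustration of these definitions (not the authors' original Matlab code; helper names are hypothetical), the exact Boltzmann statistics can be computed by brute-force enumeration in Python:

```python
# Exact enumeration of Boltzmann statistics for a small network.
# Illustrative sketch only: states s in {-1,+1}^n, symmetric weights w,
# thresholds theta, with -E(s) = 1/2 sum_ij w_ij s_i s_j + sum_i theta_i s_i.
import itertools
import numpy as np

def exact_stats(w, theta):
    n = len(theta)
    states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    # -E(s) for every state (rows of `states`)
    minus_E = 0.5 * np.einsum('ki,ij,kj->k', states, w, states) + states @ theta
    p = np.exp(minus_E)
    p /= p.sum()                                        # divide by partition function Z
    m = p @ states                                      # mean firing rates <s_i>
    corr = np.einsum('k,ki,kj->ij', p, states, states)  # correlations <s_i s_j>
    return p, m, corr
```

These exact quantities are what the learning rule of Section 2 requires in both the free and the clamped phase, and what makes exact BM learning exponential in n.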
\n\nWe expand \n\nG(γ) = G(0) + γ G'(0) + (1/2) γ² G''(0) + O(γ³). \n\nWe directly obtain from [8] \n\nG'(γ) = ⟨E_int⟩_γ, \nG''(γ) = ⟨E_int⟩_γ² − ⟨E_int²⟩_γ + ⟨E_int Σ_i (∂θ_i/∂γ)(s_i − m_i)⟩_γ. \n\nFor γ = 0 the expectation values ⟨·⟩_γ become the mean field expectations, which we can directly compute: \n\nG(0) = (1/2) Σ_i [ (1 + m_i) log((1 + m_i)/2) + (1 − m_i) log((1 − m_i)/2) ], \nG'(0) = −(1/2) Σ_{ij} w_ij m_i m_j, \nG''(0) = −(1/2) Σ_{ij} w_ij² (1 − m_i²)(1 − m_j²). \n\nThus \n\nG(1) = (1/2) Σ_i [ (1 + m_i) log((1 + m_i)/2) + (1 − m_i) log((1 − m_i)/2) ] − (1/2) Σ_{ij} w_ij m_i m_j − (1/4) Σ_{ij} w_ij² (1 − m_i²)(1 − m_j²) + O(w³ f(m)), (3) \n\nwhere f(m) is some unknown function of m. \nThe mean field equations are given by the inverse Legendre transformation \n\nθ_i = ∂G/∂m_i = tanh⁻¹(m_i) − Σ_j w_ij m_j + m_i Σ_j w_ij² (1 − m_j²), (4) \n\nwhich we recognize as the mean field equations with the second order (TAP) correction. \nThe correlations are given by \n\n⟨s_i s_j⟩ − ⟨s_i⟩⟨s_j⟩ = −∂²F/∂θ_i ∂θ_j = ∂m_i/∂θ_j = (∂θ/∂m)⁻¹_ij = (∂²G/∂m ∂m)⁻¹_ij. \n\nWe therefore obtain from Eq. 3 \n\n⟨s_i s_j⟩ − ⟨s_i⟩⟨s_j⟩ = A_ij, with (A⁻¹)_ij = δ_ij ( 1/(1 − m_i²) + Σ_k w_ik² (1 − m_k²) ) − w_ij − 2 m_i m_j w_ij². (5) \n\nThus, for given w_ij and θ_i, we obtain the approximate mean firing rates m_i by solving Eqs. 4, and the correlations by their linear response approximations, Eqs. 5. The inclusion of hidden units is straightforward: one applies the above approximations in the free and the clamped phase separately [5]. The complexity of the method is O(n³), due to the matrix inversion. \n\n4 Learning without hidden units \n\nWe will assess the accuracy of the above method for networks without hidden units. Let us define C_ij = ⟨s_i s_j⟩_c − ⟨s_i⟩_c ⟨s_j⟩_c, which can be directly computed from the data. The fixed point equation for Δθ_i gives \n\nΔθ_i = 0 ⟺ m_i = ⟨s_i⟩_c. (6) 
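Eqs. 4 and 5 can be solved numerically in the free phase. A minimal sketch follows; damped fixed-point iteration is one common choice for Eq. 4, not necessarily the scheme used in the paper, and all helper names are hypothetical:

```python
# Sketch: solve the TAP mean field equations (Eq. 4) for m by damped
# fixed-point iteration, then build the linear response matrix A (Eq. 5).
# Illustrative only; convergence is not guaranteed for strong couplings.
import numpy as np

def tap_means(w, theta, damping=0.5, tol=1e-10, max_iter=10000):
    n = len(theta)
    m = np.zeros(n)
    for _ in range(max_iter):
        # Rearranged Eq. 4: m_i = tanh( theta_i + sum_j w_ij m_j
        #                               - m_i sum_j w_ij^2 (1 - m_j^2) )
        field = theta + w @ m - m * (w**2 @ (1 - m**2))
        m_new = (1 - damping) * m + damping * np.tanh(field)
        if np.max(np.abs(m_new - m)) < tol:
            return m_new
        m = m_new
    return m

def linear_response(w, m):
    # (A^-1)_ij = delta_ij (1/(1-m_i^2) + sum_k w_ik^2 (1-m_k^2))
    #             - w_ij - 2 m_i m_j w_ij^2                  (Eq. 5)
    Ainv = (np.diag(1.0 / (1 - m**2) + w**2 @ (1 - m**2))
            - w - 2 * np.outer(m, m) * w**2)
    return np.linalg.inv(Ainv)  # A_ij approximates <s_i s_j> - <s_i><s_j>
```

For w = 0 this reduces to the factorized case, m_i = tanh(θ_i) and A = diag(1 − m_i²), which is a convenient sanity check.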
\nThe fixed point equation for Δw_ij gives \n\nΔw_ij = 0 ⟺ A_ij = C_ij, i ≠ j. (7) \n\nFrom Eq. 7 and Eq. 5 we can solve for w_ij, using a standard least squares method. In our case, we used fsolve from Matlab. Subsequently, we obtain θ_i from Eq. 4. We refer to this method as the TAP approximation. \n\nIn order to assess the effect of the TAP term, we also computed the weights and thresholds in the same way as described above, but without the terms of order w² in Eqs. 5 and 4. Since this is the standard Weiss mean field expression, we refer to this method as the Weiss approximation. \n\nThe fixed point equations are only imposed for the off-diagonal elements of Δw_ij, because the Boltzmann distribution, Eq. 1, does not depend on the diagonal elements w_ii. In [5], we explored a variant of the Weiss approximation in which diagonal weight terms are included. As is discussed there, if we were to impose Eq. 7 for i = j as well, we would have A = C. If C is invertible, we therefore have A⁻¹ = C⁻¹. However, we now have more constraints than variables. Therefore, we introduce diagonal weights w_ii by adding the term w_ii m_i to the right-hand side of Eq. 4 in the Weiss approximation. Thus, \n\nw_ij = δ_ij/(1 − m_i²) − (C⁻¹)_ij, \n\nand θ_i is given by Eq. 4 in the Weiss approximation. Clearly, this method is computationally simpler, because it gives an explicit expression for the solution of the weights involving only one matrix inversion. \n\n5 Numerical results \n\nFor the target distribution q(s) in Eq. 2 we chose a fully connected Ising spin glass model with equilibrium distribution \n\nq(s) = (1/Z) exp{ (1/2) Σ_{i,j} J_ij s_i s_j }, \n\nwith J_ij i.i.d. Gaussian variables with mean J₀/(n − 1) and variance J²/(n − 1). This model is known as the Sherrington-Kirkpatrick (SK) model [10]. 
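The SK couplings and the explicit Weiss solution with diagonal weights can be sketched as follows. This is an illustrative reconstruction with hypothetical helper names (the experiments in the paper used Matlab):

```python
# Sketch: (1) draw symmetric SK couplings J_ij with zero diagonal,
# mean J0/(n-1) and variance J^2/(n-1); (2) the explicit Weiss solution
# with diagonal weights, w_ij = delta_ij/(1 - m_i^2) - (C^-1)_ij.
# Illustrative only, not the authors' code.
import numpy as np

def sk_couplings(n, J0, J, rng=None):
    rng = np.random.default_rng(rng)
    upper = rng.normal(J0 / (n - 1), J / np.sqrt(n - 1), size=(n, n))
    Jmat = np.triu(upper, k=1)
    return Jmat + Jmat.T          # symmetric, zero diagonal

def weiss_diagonal_weights(m, C):
    # m: clamped means <s_i>_c ; C: clamped covariance C_ij (must be invertible)
    return np.diag(1.0 / (1 - m**2)) - np.linalg.inv(C)
```

Note that for a factorized target, C = diag(1 − m_i²), the formula yields w = 0, as it should; and for strongly coupled targets C may fail to be invertible, which is exactly the failure mode reported for J > 2 below.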
Depending on the values of J and J₀, the model displays a paramagnetic (unordered), a ferromagnetic (ordered) and a spin-glass (frustrated) phase. For J₀ = 0, the paramagnetic (spin-glass) phase is obtained for J < 1 (J > 1). We will assess the effectiveness of our approximations for finite n, for J₀ = 0 and for various values of J. Since this is a realizable task, the optimal KL divergence is zero, which is indeed observed in our simulations. \n\nWe measure the quality of the solutions by means of the Kullback divergence. Therefore, this comparison is only feasible for small networks. The reason is that the computation of the Kullback divergence requires the computation of the Boltzmann distribution, Eq. 1, which requires exponential time due to the partition function Z. \n\nWe present results for a network of n = 10 neurons. For J₀ = 0, we generated 10 random weight matrices J_ij for each value of 0.1 < J < 3. For each weight matrix, we computed q(s) on all 2ⁿ states. For each of the 10 problems, we applied the TAP method, the Weiss method and the Weiss method with diagonal weights. In addition, we applied the exact Boltzmann Machine learning algorithm using conjugate gradient descent and verified that it gives KL divergence equal to zero, as it should. We also applied a factorized model p(s) = Π_i (1/2)(1 + m_i s_i) with m_i = ⟨s_i⟩_c, to assess the importance of correlations in the target distribution. In Fig. 1a, we show the average KL divergence over the 10 problem instances as a function of J for the TAP method, the Weiss method, the Weiss method with diagonal weights and the factorized model. We observe that the TAP method gives the best results, but that its performance deteriorates in the spin-glass phase (J > 1). \n\nThe behaviour of all approximate methods is highly dependent on the individual problem instance. In Fig. 
1b, we show the mean value of the KL divergence of the TAP solution, together with the minimum and maximum values obtained on the 10 problem instances. \n\nDespite these large fluctuations, the quality of the TAP solution is consistently better than the Weiss solution. In Fig. 1c, we plot the difference between the TAP and Weiss solution, averaged over the 10 problem instances. \n\nIn [5] we concluded that the Weiss solution with diagonal weights is better than the standard Weiss solution when learning a finite number of randomly generated patterns. In Fig. 1d we plot the difference between the Weiss solution with and without diagonal weights. We observe again that the inclusion of diagonal weights leads to better results in the paramagnetic phase (J < 1), but to worse results in the spin-glass phase. For J > 2, we encountered problem instances for which either the matrix C is not invertible or the KL divergence is infinite. This problem becomes more and more severe with increasing J. We therefore have not presented results for the Weiss approximation with diagonal weights for J > 2. \n\n[Figure 1: four panels (a-d) plotting KL divergence as a function of J; see caption.] \n\nFigure 1: Mean field learning of paramagnetic (J < 1) and spin glass (J > 1) problems for a network of 10 neurons. a) Comparison of mean KL divergences for the factorized model (fact), the Weiss mean field approximation with and without diagonal weights (weiss+d and weiss), and the TAP approximation, as a function of J. The exact method yields zero KL divergence for all J. b) The mean, minimum and maximum KL divergence of the TAP approximation for the 10 problem instances, as a function of J. c) The mean difference between the KL divergence for the Weiss approximation and the TAP approximation, as a function of J. d) The mean difference between the KL divergence for the Weiss approximation with and without diagonal weights, as a function of J. \n\n6 Discussion \n\nWe have presented a derivation of mean field theory and the linear response correction based on a small coupling expansion of the Gibbs free energy. This expansion can in principle be computed to arbitrary order. However, one should expect that the resulting mean field and linear response equations will become more and more difficult to solve numerically. The small coupling expansion should be applicable to other network models, such as the sigmoid belief network, Potts networks and higher order Boltzmann Machines. \n\nThe numerical results show that the method is applicable to paramagnetic problems. This is intuitively clear, since paramagnetic problems have a unimodal probability distribution, which can be approximated by a mean and correlations around the mean. The method performs worse for spin glass problems. However, it still gives a useful approximation of the correlations when compared to the factorized model, which ignores all correlations. In this regime, the TAP approximation improves significantly on the Weiss approximation. One may therefore hope that higher order approximations will further improve the method for spin glass problems. We thus cannot conclude at this point whether mean field methods are restricted to unimodal distributions. In order to further investigate this issue, one should also study the ferromagnetic case (J₀ > 1, J > 1), which is multimodal as well but less challenging than the spin glass case. \n\nIt is interesting to note that the performance of the exact method is absolutely insensitive to the value of J. Naively, one might have thought that for highly multi-modal target distributions, any gradient based learning method would suffer from local minima. Apparently, this is not the case: the exact KL divergence has just one minimum, but the mean field approximations of the gradients may have multiple solutions. \n\nAcknowledgement \n\nThis research is supported by the Technology Foundation STW, applied science division of NWO, and the technology programme of the Ministry of Economic Affairs. \n\nReferences \n\n[1] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann Machines. Cognitive Science, 9:147-169, 1985. \n[2] C. Itzykson and J-M. Drouffe. Statistical Field Theory. Cambridge monographs on mathematical physics. Cambridge University Press, Cambridge, UK, 1989. \n[3] C. Peterson and J.R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987. \n[4] G.E. Hinton. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1:143-150, 1989. \n[5] H.J. Kappen and F.B. Rodriguez. Efficient learning in Boltzmann Machines using linear response theory. Neural Computation, 1997. In press. \n[6] G. Parisi. Statistical Field Theory. Frontiers in Physics. Addison-Wesley, 1988. \n[7] L.K. Saul, T. Jaakkola, and M.I. Jordan. 
Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996. \n[8] T. Plefka. Convergence condition of the TAP equation for the infinite-range Ising spin glass model. Journal of Physics A, 15:1971-1978, 1982. \n[9] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959. \n[10] D. Sherrington and S. Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35:1792-1796, 1975. \n", "award": [], "sourceid": 1412, "authors": [{"given_name": "Hilbert", "family_name": "Kappen", "institution": null}, {"given_name": "Francisco", "family_name": "de Borja Rodr\u00edguez Ortiz", "institution": null}]}