{"title": "Means, Correlations and Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 455, "page_last": 462, "abstract": null, "full_text": "Means. Correlations and Bounds \n\nM.A.R. Leisink and H.J. Kappen \n\nDepartment of Biophysics \n\nUniversity of Nijmegen , Geert Grooteplein 21 \n\nNL 6525 EZ Nijmegen, The Netherlands \n\n{martijn,bert}@mbfys.kun.nl \n\nAbstract \n\nThe partition function for a Boltzmann machine can be bounded \nfrom above and below. We can use this to bound the means and \nthe correlations. For networks with small weights, the values of \nthese statistics can be restricted to non-trivial regions (i.e. a subset \nof [-1 , 1]). Experimental results show that reasonable bounding \noccurs for weight sizes where mean field expansions generally give \ngood results. \n\n1 \n\nIntroduction \n\nOver the last decade, bounding techniques have become a popular tool to deal with \ngraphical models that are too complex for exact computation. A nice property of \nbounds is that they give at least some information you can rely on. For instance, \none may find that a correlation is definitely between 0.4 and 0.6. An ordinary ap(cid:173)\nproximation might be more accurate, but in practical situations there is absolutely \nno warranty for that. \n\nThe best known bound is probably the mean field bound , which has been described \nfor Boltzmann machines in [1] and later for sigmoid belief networks in [2]. Apart \nfrom its bounding properties, mean field theory is a commonly used approximation \ntechnique as well. Recently this first order bound was extended to a third order \napproximation for Boltzmann machines and sigmoid belief networks in [3] and [4], \nwhere it was shown that this particular third order expansion is still a bound. \n\nIn 1996 an upper bound for Boltzmann machines was described in [5] and [6]. 
In the same articles the authors derive an upper bound for a special case of sigmoid belief networks: two-layered networks. In this article we focus solely on Boltzmann machines, but an extension to sigmoid belief networks is quite straightforward.

This article is organized as follows: in section 2 we start with the general theory of bounding techniques. Later in that section the upper and lower bound are briefly described; for a full explanation we refer to the articles mentioned before. The section is concluded by explaining how these bounds on the partition function can be used to bound means and correlations. In section 3 results are shown for fully connected Boltzmann machines, where the size of weights and thresholds as well as the network size are varied. In section 4 we present our conclusions and outline possible extensions.

2 Theory

There exists a general method to create a class of polynomials of a certain order, which all bound a function of interest, f_0(x). Such a class of order 2n can be found if the 2n-th order derivative of f_0(x), written as f_{2n}(x), can be bounded by a constant. When this constant is zero, the class is actually of order 2n-1. It turns out that this class is parameterized by n free parameters.

Suppose we have a function b_{2k} for some integer k which bounds the function f_{2k} from below (an upper bound can be written as a lower bound by using the negative of both functions). Thus

    \forall x: f_{2k}(x) \geq b_{2k}(x)    (1)

Now construct the primitive functions f_{2k-1} and b_{2k-1} such that f_{2k-1}(p) = b_{2k-1}(p) for a freely chosen value p. This constraint can always be achieved by adding an appropriate constant to the primitive function b_{2k-1}. It is easy to prove that

    f_{2k-1}(x) \leq b_{2k-1}(x)  for x < p
    f_{2k-1}(x) \geq b_{2k-1}(x)  for x \geq p    (2)

or in shorthand notation f_{2k-1}(x) \lessgtr b_{2k-1}(x).

If we repeat this procedure and construct the primitive functions f_{2k-2} and b_{2k-2} such that f_{2k-2}(p) = b_{2k-2}(p) for the same p, one can verify that

    \forall x: f_{2k-2}(x) \geq b_{2k-2}(x)    (3)

Thus given a bound f_{2k}(x) \geq b_{2k}(x) we can construct a class of bounding functions for f_{2k-2} parameterized by p. Since we assumed f_{2n}(x) can be bounded from below by a constant, we can apply the procedure n times and finally find f_0(x) \geq b_0(x), where b_0(x) is parameterized by n free parameters. This procedure can be found in more detail in [4].

2.1 A third order lower bound for Boltzmann machines

Boltzmann machines are stochastic neural networks with N binary valued neurons s_i, which are connected by symmetric weights w_{ij}. Due to this symmetry the probability distribution is a Boltzmann-Gibbs distribution, which is given by (see also [7])

    p(s|\theta, w) = \frac{1}{Z} \exp\left( \tfrac{1}{2} \sum_{ij} w_{ij} s_i s_j + \sum_i \theta_i s_i \right) = \frac{1}{Z} \exp(-E(s, \theta, w))    (4)

where the \theta_i are threshold values and

    Z(\theta, w) = \sum_{\text{all } s} \exp(-E(s, \theta, w))    (5)

is the normalization known as the partition function.

This partition function is especially important, since statistical quantities such as means and correlations can be derived directly from it. For instance, the means can be computed as

    \langle s_n \rangle = \sum_{\text{all } s} p(s|\theta, w)\, s_n = \sum_{\text{all } s \setminus s_n} P(s, s_n{=}{+}1|\theta, w) - P(s, s_n{=}{-}1|\theta, w) = \frac{Z_+(\theta, w) - Z_-(\theta, w)}{Z(\theta, w)}    (6)

where Z_+ and Z_- are partition functions over a network with s_n clamped to +1 and -1, respectively.

This explains why the objective of almost any approximation method is the partition function given by equation 5. In [3] and [4] it is shown that the standard mean field lower bound can be obtained by applying the linear bound

    \forall x, \mu: e^x \geq e^\mu (1 + x - \mu)    (7)

to all exponentially many terms in the sum.
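As a quick numerical sanity check, the linear mean field bound e^x >= e^mu (1 + x - mu) can be verified on a grid. The sketch below is our own illustration (the function name is not from the paper); it also confirms the tangency at x = mu:

```python
import math

def linear_lower_bound(x, mu):
    # Mean field linear lower bound on exp(x); touches exp(x) at x = mu.
    return math.exp(mu) * (1.0 + x - mu)

# The bound should hold for every x and every choice of the free parameter mu.
for mu in (-2.0, 0.0, 1.5):
    for i in range(-40, 41):
        x = i / 10.0
        assert math.exp(x) >= linear_lower_bound(x, mu) - 1e-12
    # Equality (tangency) at x = mu.
    assert abs(math.exp(mu) - linear_lower_bound(mu, mu)) < 1e-12
```

Because the bound is tight at x = mu, optimizing mu per term is what produces the usual mean field equations.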
Since \mu may depend on s, one can choose \mu(s) = \mu_i s_i + \mu_0, which leads to the standard mean field equations, where the \mu_i turn out to be the local fields.

Moreover, the authors show that one can apply the procedure of 'upgrading bounds' (described briefly at the beginning of this section) to equation 7, which leads to the class of third order bounds for e^x. This is achieved in the following way:

    \forall x, \nu: f_2(x) = e^x \geq e^\nu (1 + x - \nu) = b_2(x)
    f_1(x) = e^x \lessgtr e^\mu + e^\nu \left( (1 + \mu - \nu)(x - \mu) + \tfrac{1}{2}(x - \mu)^2 \right) = b_1(x)
    \forall x, \mu, \lambda: f_0(x) = e^x \geq e^\mu \left\{ 1 + x - \mu + e^\lambda \left( \tfrac{1 - \lambda}{2}(x - \mu)^2 + \tfrac{1}{6}(x - \mu)^3 \right) \right\} = b_0(x)    (8)

with \lambda = \nu - \mu.

In principle, this third order bound could be maximized with respect to all the free parameters, but here we follow the suggestion made in [4] to use a mean field optimization, which is much faster and generally almost as good as a full optimization. For more details we refer to [4].

2.2 An upper bound

An upper bound for Boltzmann machines has been described in [5] and [6].^1 Basically, this method uses a quadratic upper bound on log cosh x, which can easily be obtained in the following way:

    f_2(x) = 1 - \tanh^2 x \leq 1 = b_2(x)
    f_1(x) = \tanh x \lessgtr x - \mu + \tanh\mu = b_1(x)
    f_0(x) = \log\cosh x \leq \tfrac{1}{2}(x - \mu)^2 + (x - \mu)\tanh\mu + \log\cosh\mu = b_0(x)    (9)

Using this bound, one can derive

    Z(\theta, w) = \sum_{\text{all } s} \exp\left( \tfrac{1}{2} \sum_{ij} w_{ij} s_i s_j + \sum_i \theta_i s_i \right)
    = \sum_{\text{all } s \setminus s_n} 2 \exp\left( \log\cosh\left( \sum_i w_{ni} s_i + \theta_n \right) \right) \exp\left( \tfrac{1}{2} \sum_{ij \neq n} w_{ij} s_i s_j + \sum_{i \neq n} \theta_i s_i \right)
    \leq \sum_{\text{all } s \setminus s_n} \exp\left( \tfrac{1}{2} \sum_{ij \neq n} w'_{ij} s_i s_j + \sum_{i \neq n} \theta'_i s_i + k \right) = e^k \cdot Z(\theta', w')    (10)

^1 Note: the articles referred to use s_i \in {0, 1} instead of the +1/-1 coding used here.
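Both bounding functions lend themselves to a quick numerical check. The sketch below is our own (function names are not from the paper): it evaluates the third order lower bound on e^x, parameterized by mu and lam = nu - mu, and the quadratic upper bound on log cosh x, and confirms the inequalities on a grid:

```python
import math

def third_order_lb(x, mu, lam):
    # Third order lower bound on exp(x), parameterized by mu and lam = nu - mu.
    d = x - mu
    return math.exp(mu) * (1.0 + d + math.exp(lam) * ((1.0 - lam) / 2.0 * d * d + d ** 3 / 6.0))

def log_cosh_ub(x, mu):
    # Quadratic upper bound on log cosh x, touching at x = mu.
    d = x - mu
    return 0.5 * d * d + d * math.tanh(mu) + math.log(math.cosh(mu))

# Verify both inequalities on a grid of x for several parameter choices.
for mu in (-1.0, 0.0, 0.7):
    for i in range(-50, 51):
        x = i / 10.0
        assert math.log(math.cosh(x)) <= log_cosh_ub(x, mu) + 1e-10
        for lam in (-0.5, 0.0, 0.5):
            assert math.exp(x) >= third_order_lb(x, mu, lam) - 1e-10
```

Note that for mu = lam = 0 the lower bound reduces to the third order Taylor polynomial of e^x, while the upper bound on log cosh follows because its second derivative, 1 - tanh^2 x, never exceeds 1.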
where k is a constant and \theta' and w' are thresholds and weights in a reduced network given by

    w'_{ij} = w_{ij} + w_{ni} w_{nj}
    \theta'_i = \theta_i + w_{ni} (\theta_n - \mu_n + \tanh\mu_n)
    k = \tfrac{1}{2} (\theta_n - \mu_n + \tanh\mu_n)^2 - \tfrac{1}{2} \tanh^2\mu_n + \log 2\cosh\mu_n    (11)

Hence, equation 10 defines a recursive relation, where each step reduces the network by one neuron. Finally, after N steps, an upper bound on the partition function is found.^2 We did a crude minimization with respect to the free parameters \mu. A more sophisticated method can probably be found, but this is not the main objective of this article.

2.3 Bounding means and correlations

The previous subsections showed very briefly how we can obtain a lower bound, Z^L, and an upper bound, Z^U, for any partition function. We can use this in combination with equation 6 to obtain a bound on the means:

    \langle s_n \rangle^L = \frac{Z_+^L - Z_-^U}{X} \leq \langle s_n \rangle \leq \frac{Z_+^U - Z_-^L}{Y} = \langle s_n \rangle^U    (12)

where X = Z^U if the numerator is positive and X = Z^L otherwise. For Y it is the opposite. The difference, \langle s_n \rangle^U - \langle s_n \rangle^L, is called the bandwidth.

Naively, we could compute the correlations similarly to the means using

    \langle s_n s_m \rangle = \frac{Z_{++} - Z_{+-} - Z_{-+} + Z_{--}}{Z}    (13)

where the partition functions are computed for all four clamped combinations of s_n and s_m. Generally, however, this gives poor results, since we have to add four bounds together, which leads to a bandwidth about twice as large as for the means. We can circumvent this by computing the correlations using

    \langle s_n s_m \rangle = \frac{Z_{s_m = s_n} - Z_{s_m = -s_n}}{Z}    (14)

where we allow the sum in the partition functions to be taken over s_n, but fix s_m either to s_n or its negative. Finally, the computation of the bounds \langle s_n s_m \rangle^L and \langle s_n s_m \rangle^U is analogous to equation 12.

There exists an alternative way to bound the means and correlations. One can write

    \langle s_n \rangle = \frac{Z_+ - Z_-}{Z_+ + Z_-} = \frac{Z_+/Z_- - 1}{Z_+/Z_- + 1} = \frac{z - 1}{z + 1} = f(z)    (15)

with z = Z_+/Z_-, which can be bounded by

    \frac{Z_+^L}{Z_-^U} \leq z \leq \frac{Z_+^U}{Z_-^L}    (16)

Since f(z) is a monotonically increasing function of z, the bounds on \langle s_n \rangle are given by applying this function to the left and right side of equation 16. The correlations can be bounded similarly. It is still unknown whether this algorithm would yield better results than the first one, which is the one explored in this article.

^2 The original articles show that it is not necessary to do all N steps. However, since this is based on mixing approximation techniques with exact calculations, it is not used here, as it would hide the real error the approximation makes.

Figure 1: Comparison of 1) the mean field lower bound, 2) the upper bound and 3) the third order lower bound with the exact log partition function. The network was a fully connected Boltzmann machine with 14 neurons and \sigma_\theta = 0.2. The size of the weights is varied on the x-axis. Each point was averaged over 100 networks.

3 Results

In all experiments we used fully connected Boltzmann machines of which the thresholds and weights were drawn from a Gaussian with zero mean and standard deviation \sigma_\theta and \sigma_w/\sqrt{N}, respectively, where N is the network size. This is the so-called SK model (see also [8]).
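For small N the exact quantities used as references in these experiments can be computed by brute force. The sketch below is our own illustration of the setup (function names and the choice N = 8 are ours, not the paper's): it draws an SK style network and enumerates all 2^N states to obtain the exact log partition function and the means of equation 6:

```python
import itertools
import math
import random

def make_sk_network(n, sigma_theta, sigma_w, rng):
    # Thresholds ~ N(0, sigma_theta); symmetric weights ~ N(0, sigma_w / sqrt(n)).
    theta = [rng.gauss(0.0, sigma_theta) for _ in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.gauss(0.0, sigma_w / math.sqrt(n))
    return theta, w

def exact_log_z_and_means(theta, w):
    # Enumerate all 2^n states s in {-1, +1}^n; feasible only for small n.
    n = len(theta)
    z = 0.0
    m = [0.0] * n
    for s in itertools.product((-1, 1), repeat=n):
        log_weight = (sum(theta[i] * s[i] for i in range(n))
                      + 0.5 * sum(w[i][j] * s[i] * s[j]
                                  for i in range(n) for j in range(n)))
        weight = math.exp(log_weight)
        z += weight
        for i in range(n):
            m[i] += weight * s[i]
    return math.log(z), [mi / z for mi in m]

rng = random.Random(0)
theta, w = make_sk_network(8, 0.2, 0.4, rng)
log_z, means = exact_log_z_and_means(theta, w)
```

By Jensen's inequality log Z is always at least N log 2 for this model (the energy averages to zero over the uniform distribution), which gives a cheap consistency check on the enumeration.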
Generally speaking, the mean field approximation breaks down for \sigma_\theta = 0 and \sigma_w > 0.5, whereas it can be proven that any expansion based approximation is inaccurate when \sigma_w > 1 (which is the radius of convergence as in [9]). If \sigma_\theta \neq 0 these maximum values are somewhat larger.

In figure 1 we show the logarithm of the exact partition function, the first order or mean field bound, the upper bound (which is roughly quadratic) and the third order lower bound. The weight size is varied along the horizontal axis. One can see clearly that the mean field bound is not able to capture the quadratic form of the exact partition function for small weights due to its linear behaviour. The error made by the upper and third order lower bound is small enough to make non-trivial bounds on the means and correlations.

An example of this bound is shown in figure 2 for the specific choice \sigma_\theta = \sigma_w = 0.4. For both the means and the correlations a histogram is plotted for the upper and lower bounds computed with equation 12. Both have an average bandwidth of 0.132, which is a clear subset of the whole possible interval of [-1, 1].

In figure 3 the average bandwidth is shown for several values of \sigma_\theta and \sigma_w. For bandwidths of 0.01, 0.1 and 1 a line is drawn. We conclude that almost everywhere the bandwidth is non-trivially reduced and reaches practically useful values for \sigma_w less than 0.5. This is more or less equivalent to the region where the mean field approximation performs well. That approximation, however, gives no information on how close it actually is to the exact value, whereas the bounding method limits it to a definite region.
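In code, turning bounds on the clamped and full partition functions into a bracket on a mean (the content of equation 12) is a small exercise in interval arithmetic. The helper below is our own illustrative sketch, not the authors' implementation; it takes lower/upper bounds on Z_+, Z_- and Z and returns the resulting interval:

```python
def mean_bounds(zp_lo, zp_hi, zm_lo, zm_hi, z_lo, z_hi):
    # Bracket <s_n> = (Z_+ - Z_-) / Z given two-sided bounds on each quantity.
    # The numerator lies in [zp_lo - zm_hi, zp_hi - zm_lo]; each end is divided
    # by whichever of z_hi / z_lo makes the quotient extreme (X and Y in eq. 12).
    num_lo = zp_lo - zm_hi
    num_hi = zp_hi - zm_lo
    lower = num_lo / (z_hi if num_lo > 0 else z_lo)
    upper = num_hi / (z_lo if num_hi > 0 else z_hi)
    # A mean always lies in [-1, 1], so clip to the trivial interval.
    return max(lower, -1.0), min(upper, 1.0)

# Toy numbers: suppose the exact values are Z_+ = 2.1, Z_- = 1.05, Z = 3.15,
# so <s_n> = 1/3, and each bound brackets its exact value.
lo, hi = mean_bounds(2.0, 2.2, 1.0, 1.1, 3.0, 3.3)
assert lo <= 1.05 / 3.15 <= hi  # the exact mean lies inside the bracket
```

The returned difference hi - lo is the bandwidth discussed above; the tighter the partition function bounds, the narrower it gets.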
Figure 2: For the specific choice \sigma_\theta = \sigma_w = 0.4 thirty fully connected Boltzmann machines with 14 neurons were initialized and the bounds were computed. The two left panels show the distance between the lower bound and the exact means (left) and similarly for the upper bound (right). The right two panels show the distances of both bounds for the correlations.

Figure 3: In the left panel the average bandwidth is colour coded for the means, where \sigma_\theta and \sigma_w are varied in ten steps along the axes. The right panel shows the same for the correlations. For each \sigma_\theta, \sigma_w thirty fully connected Boltzmann machines were initialized and the bounds on all the means and correlations were computed. For three specific bandwidths a line is drawn.

Figure 4: For \sigma_w = 0.1, 0.3 and 0.5 the bandwidth for the correlations is shown versus the network size. \sigma_\theta = 0.3 in all cases, but the plots are nearly the same for other values. Please note the different scales for the y-axis. A similar graph for the means is not shown here, but it is roughly the same. The solid line is the average bandwidth over all correlations, whereas the dashed lines indicate the minimum and maximum bandwidth found.
Unfortunately, the bounds have the unwanted property that the error scales badly with the size of the network. Although this makes the bounds unsuitable for very large networks, there is still a wide range of networks small enough to take advantage of the proposed method and still much too large to be treated exactly. The bandwidth versus network size is shown in figure 4 for three values of \sigma_w. Obviously, the threshold of practical usefulness is reached earlier for larger weights.

Finally, we remark that the computation time is O(N^4) for the upper bound and O(N^3) for the mean field and third order lower bounds. This is not shown here.

4 Conclusions

In this article we combined two existing bounds in such a way that not only the partition function of a Boltzmann machine is bounded from both sides, but also the means and correlations. This may seem superfluous, since several powerful approximation methods already exist. Our method, however, can be used independently of any approximation technique and gives at least some information you can rely on. Although approximation techniques might do a good job on your data, you can never be sure of that. The method outlined in this paper ensures that the quantities of interest, the means and correlations, are restricted to a definite region.

We have seen that, generally speaking, the results are useful for weight sizes where an ordinary mean field approximation performs well. This makes the method applicable to a large class of problems. Moreover, since many architectures are not fully connected, one can take advantage of that structure. At least for the upper bound it has already been shown that this can improve computation speed and tightness. This would partially cancel the unwanted scaling with the network size.

Finally, we would like to give some directions for further research.
First of all, an extension to sigmoid belief networks can easily be done, since both a lower and an upper bound are already described. The upper bound, however, is only applicable to two layer networks; a more general upper bound can probably be found. Secondly, one can obtain even better bounds (especially for larger weights) if the general constraint

    (17)

is taken into account. This might even be extended to similar constraints where three or more neurons are involved.

Acknowledgements

This research is supported by the Technology Foundation STW, applied science division of NWO, and the technology programme of the Ministry of Economic Affairs.

References

[1] C. Peterson and J. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987.

[2] L.K. Saul, T.S. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.

[3] Martijn A.R. Leisink and Hilbert J. Kappen. A tighter bound for graphical models. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 266-272. MIT Press, 2001.

[4] Martijn A.R. Leisink and Hilbert J. Kappen. A tighter bound for graphical models. Neural Computation, 13(9), 2001. To appear.

[5] T. Jaakkola and M.I. Jordan. Recursive algorithms for approximating probabilities in graphical models. MIT Comp. Cogn. Science Technical Report 9604, 1996.

[6] Tommi S. Jaakkola and Michael I. Jordan. Computing upper and lower bounds on likelihoods in intractable networks. In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 340-348, San Francisco, CA, 1996. Morgan Kaufmann Publishers.

[7] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines.
Cognitive Science, 9:147-169, 1985.

[8] D. Sherrington and S. Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35(26):1792-1796, 1975.

[9] T. Plefka. Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen., 15:1971-1978, 1982.
", "award": [], "sourceid": 1970, "authors": [{"given_name": "Martijn", "family_name": "Leisink", "institution": null}, {"given_name": "Bert", "family_name": "Kappen", "institution": null}]}