{"title": "Recursive Algorithms for Approximating Probabilities in Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 493, "abstract": null, "full_text": "Recursive algorithms for approximating probabilities in graphical models \n\nTommi S. Jaakkola and Michael I. Jordan \n{tommi,jordan}@psyche.mit.edu \nDepartment of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nAbstract \n\nWe develop a recursive node-elimination formalism for efficiently approximating large probabilistic networks. No constraints are set on the network topologies. Yet the formalism can be straightforwardly integrated with exact methods whenever they are or become applicable. The approximations we use are controlled: they consistently maintain upper and lower bounds on the desired quantities at all times. We show that Boltzmann machines, sigmoid belief networks, or any combination (i.e., chain graphs) can be handled within the same framework. The accuracy of the methods is verified experimentally. \n\n1 Introduction \n\nGraphical models (see, e.g., Lauritzen 1996) provide a medium for rigorously embedding domain knowledge into network models. The structure of a graphical model embodies the qualitative assumptions about the independence relationships in the domain, while the probability model attached to the graph permits a consistent computation of belief (or uncertainty) about the values of the variables in the network. The feasibility of performing this computation determines the ability to make inferences or to learn on the basis of observations. The standard framework for carrying out this computation consists of exact probabilistic methods (Lauritzen 1996). 
Such methods are nevertheless restricted to fairly small or sparsely connected networks, and the use of approximate techniques is likely to be the rule for the highly interconnected graphs studied in the neural network literature. \n\nThere are several desiderata for methods that calculate approximations to posterior probabilities on graphs. Besides being (1) reasonably accurate and fast to compute, such techniques should yield (2) rigorous estimates of confidence in the attained results; this is especially important in many real-world applications (e.g., in medicine). Furthermore, a considerable gain in accuracy could be obtained from (3) the ability to use the techniques in conjunction with exact calculations whenever feasible. These goals have been addressed in the literature with varying degrees of success. For inference and learning in Boltzmann machines, for example, classical mean field approximations (Peterson & Anderson, 1987) address only the first goal. In the case of sigmoid belief networks (Neal 1992), partial solutions have been provided for the first two goals (Dayan et al. 1995; Saul et al. 1996; Jaakkola & Jordan 1996). The integration of approximations with exact techniques has been introduced in the context of Boltzmann machines (Saul & Jordan 1996), but with the solution to our second goal left unattained. In this paper, we develop a recursive node-elimination formalism that meets all three objectives for a powerful class of networks known as chain graphs (see, e.g., Lauritzen 1996); the chain graphs we consider are of a restricted type, but they nevertheless include Boltzmann machines and sigmoid belief networks as special cases. \n\nWe start by deriving the recursive formalism for Boltzmann machines. The results are then generalized to sigmoid belief networks and chain graphs. 
\n\n2 Boltzmann machines \n\nWe begin by considering Boltzmann machines with binary (0/1) variables. We assume the joint probability distribution for the variables S = {S_1, ..., S_n} to be given by \n\nP_n(S|h, J) = (1/Z_n(h, J)) B_n(S|h, J) (1) \n\nwhere h and J are the vector of biases and the matrix of weights, respectively, and the Boltzmann factor B has the form \n\nB_n(S|h, J) = exp( sum_i h_i S_i + sum_{i<j} J_ij S_i S_j ) (2) \n\nThe partition function Z_n(h, J) = sum_S B_n(S|h, J) normalizes the distribution. The Boltzmann distribution defined in this manner is tractable insofar as we are able to compute the partition function; indeed, all marginal distributions can be reduced to ratios of partition functions in different settings. \n\nWe now turn to methods for computing the partition function. In special cases (e.g., trees or chains) the structure of the weight matrix J_ij may allow us to employ exact methods for calculating Z. Although exact methods are not feasible in more generic networks, selective approximations may nevertheless restore their utility. The recursive framework we develop provides a general and straightforward methodology for combining approximate and exact techniques. \n\nThe crux of our approach lies in obtaining variational bounds that allow the creation of recursive node-elimination formulas of the form(1) \n\nZ_n(h, J) <= or >= C(h, J) Z_{n-1}(h', J') (3) \n\nSuch formulas are attractive for three main reasons: (1) a variable (or several at the same time) can be eliminated by merely transforming the model parameters (h and J); (2) the approximations involved in the elimination are controlled, i.e., they \n\n(1) Related schemes in the physics literature (renormalization group) are unsuitable here as they generally do not provide strict upper/lower bounds. 
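The definitions above can be checked directly by enumeration. The following sketch (our own code, not from the paper; the function names are hypothetical) evaluates the Boltzmann factor of eq. (2) and the partition function by brute force, and illustrates how a marginal reduces to a ratio of partition functions:

```python
import itertools
import math

def boltzmann_factor(S, h, J):
    """B_n(S|h,J) = exp(sum_i h_i S_i + sum_{i<j} J_ij S_i S_j), binary S_i."""
    n = len(S)
    energy = sum(h[i] * S[i] for i in range(n))
    energy += sum(J[i][j] * S[i] * S[j] for i in range(n) for j in range(i + 1, n))
    return math.exp(energy)

def partition_function(h, J, clamp=None):
    """Z_n(h,J) = sum_S B_n(S|h,J); `clamp` optionally fixes {index: value}."""
    clamp = clamp or {}
    n = len(h)
    free = [i for i in range(n) if i not in clamp]
    total = 0.0
    for bits in itertools.product((0, 1), repeat=len(free)):
        S = [0] * n
        for i, v in clamp.items():
            S[i] = v
        for i, v in zip(free, bits):
            S[i] = v
        total += boltzmann_factor(S, h, J)
    return total

# A marginal is a ratio of partition functions:
# P(S_0 = 1) = partition_function(h, J, {0: 1}) / partition_function(h, J)
```

With all biases and weights zero, Z_n = 2^n and each marginal is 1/2; the exponential cost of this enumeration is exactly what the recursive bounds are designed to avoid.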
\nconsistently yield upper or lower bounds at each stage of the recursion; (3) most importantly, if the remaining (simplified) partition function Z_{n-1}(h', J') allows the use of exact methods, the corresponding model parameters h' and J' can simply be passed on to such routines. \n\nNext we consider how to obtain the bounds and outline their implications. Note that since the quantities of interest are predominantly ratios of partition functions, it is the combination of upper and lower bounds that is necessary to rigorously bound the target quantities. This applies to parameter estimation as well, even if only a lower bound on the likelihood of examples is used; such a likelihood bound relies on both upper and lower bounds on partition functions. \n\n2.1 Simple recursive factorizations \n\nWe start by developing a lower bound recursion. Consider eliminating the variable S_i: \n\nZ_n(h, J) = sum_S B_n(S|h, J) = sum_{S \ S_i} sum_{S_i} B_n(S|h, J) (4) \n= sum_{S \ S_i} (1 + e^{h_i + sum_j J_ij S_j}) B_{n-1}(S \ S_i | h, J) (5) \n>= sum_{S \ S_i} e^{mu_i (h_i + sum_j J_ij S_j) + H(mu_i)} B_{n-1}(S \ S_i | h, J) (6) \n= e^{mu_i h_i + H(mu_i)} sum_{S \ S_i} B_{n-1}(S \ S_i | h', J) (7) \n= e^{mu_i h_i + H(mu_i)} Z_{n-1}(h', J) (8) \n\nwhere h'_j = h_j + mu_i J_ij for j != i, H(.) is the binary entropy function, and the mu_i are free parameters that we will refer to as \"variational parameters.\" The variational bound introduced in eq. (6) can be verified by a direct maximization over mu_i, which recovers the original expression (it is the identity log(1 + e^x) = max_mu [mu x + H(mu)]). This lower bound recursion bears a connection to the mean field approximation, and in particular to the structured mean field approximation studied by Saul and Jordan (1996).(2) \n\nEach recursive elimination introduces an additional bound and therefore the approximation (lower bound) deteriorates with the number of such iterations. 
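The elimination step of eqs. (4)-(8) can be verified numerically. The sketch below is our own illustration (the helper names are hypothetical; J is stored as a full symmetric matrix): it eliminates one variable with the lower-bound formula and checks that the result never exceeds the exact partition function, whatever the value of mu_i.

```python
import itertools
import math

def Z(h, J):
    """Exact partition function; J is a symmetric matrix, S_i in {0,1}."""
    n = len(h)
    total = 0.0
    for S in itertools.product((0, 1), repeat=n):
        e = sum(h[i] * S[i] for i in range(n))
        e += sum(J[i][j] * S[i] * S[j] for i in range(n) for j in range(i + 1, n))
        total += math.exp(e)
    return total

def binary_entropy(mu):
    """H(mu) in nats."""
    return -mu * math.log(mu) - (1 - mu) * math.log(1 - mu)

def eliminate_lower(h, J, i, mu):
    """Lower-bound elimination of variable i (eqs. 7-8):
    returns (prefactor, h', J_sub) with h'_j = h_j + mu * J_ij for j != i."""
    idx = [k for k in range(len(h)) if k != i]
    h_new = [h[j] + mu * J[i][j] for j in idx]
    J_new = [[J[a][b] for b in idx] for a in idx]
    prefactor = math.exp(mu * h[i] + binary_entropy(mu))
    return prefactor, h_new, J_new

# The bound holds for every mu in (0,1), since 1 + e^x >= e^{mu x + H(mu)}:
h = [0.5, -0.3, 0.2]
J = [[0.0, 0.4, -0.6], [0.4, 0.0, 0.3], [-0.6, 0.3, 0.0]]
exact = Z(h, J)
for mu in [0.1, 0.3, 0.5, 0.7, 0.9]:
    c, h2, J2 = eliminate_lower(h, J, 0, mu)
    assert c * Z(h2, J2) <= exact + 1e-12
```

Maximizing over mu recovers the tightest (mean-field) version of the bound; here we only check the inequality itself.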
It is necessary, however, to continue the recursion only to the extent that the prevailing partition function remains intractable to exact methods. Consequently, the problem becomes that of finding the variables whose elimination would render the rest of the graph tractable. Figure 1 illustrates this objective. Note that the simple recursion does not change the connection matrix J for the remaining variables; graphically, the operation therefore translates into merely removing the variable. \n\nFigure 1: Enforcing tractable networks. Each variable in the graph can be removed (in any order) by adding the appropriate biases to the adjacent variables that remain. The elimination of the dotted nodes reveals a simplified graph underneath. \n\nThe above recursive procedure maintains a lower bound on the partition function that results from the variational representation introduced in eq. (6). For rigorous bounds we need an upper bound as well. In order to preserve the graphical interpretation of the lower bound, the upper bound should also be factorized. With this in mind, the bound of eq. (6) can be replaced with \n\n1 + e^{h_i + sum_j J_ij S_j} <= exp( h'_0 + sum_j h'_j S_j ) (9) \n\nwhere \n\nh'_j = q_j [ f(h_i + J_ij/q_j) - f(h_i) ] (10) \n\nfor j > 0, h'_0 = f(h_i), f(x) = log(1 + e^x), and the q_j are variational parameters such that sum_j q_j = 1. The derivation of this bound can be found in appendix A. \n\n(2) Each lower bound recursion can be shown to be equivalent to a mean field approximation of the eliminated variable(s). The structured mean field approach of Saul and Jordan (1996) suggests using exact methods for tractable substructures while applying mean field to the variables mediating these structures. Translated into our framework, this amounts to eliminating the mediating variables through the recursive lower bound formula with a subsequent appeal to exact methods. The connection is limited to the lower bound. 
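Our reading of the factorized upper bound of eqs. (9)-(10) can be verified in the same exhaustive way. In this sketch (hypothetical helper names; uniform q_j are assumed for simplicity) the bound is checked on every configuration of the neighbouring variables:

```python
import itertools
import math

def f(x):
    """f(x) = log(1 + e^x)."""
    return math.log1p(math.exp(x))

def factorized_upper_coeffs(h_i, J_row, q):
    """Coefficients of the factorized upper bound (our reading of eqs. 9-10):
    1 + e^{h_i + sum_j J_ij S_j} <= exp(h0 + sum_j hj[j] S_j),
    with h0 = f(h_i) and hj[j] = q_j * (f(h_i + J_ij/q_j) - f(h_i))."""
    h0 = f(h_i)
    hj = [qj * (f(h_i + Jij / qj) - f(h_i)) for Jij, qj in zip(J_row, q)]
    return h0, hj

# Check the bound on every binary configuration of the neighbours.
h_i = 0.3
J_row = [0.8, -0.5, 1.2]
q = [1.0 / len(J_row)] * len(J_row)
h0, hj = factorized_upper_coeffs(h_i, J_row, q)
for S in itertools.product((0, 1), repeat=len(J_row)):
    lhs = 1.0 + math.exp(h_i + sum(Jij * s for Jij, s in zip(J_row, S)))
    rhs = math.exp(h0 + sum(c * s for c, s in zip(hj, S)))
    assert lhs <= rhs + 1e-12  # Jensen's inequality on the convex f
```

With a single neighbour and q_1 = 1 the Jensen step is vacuous and the bound is exact.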
\n\n2.2 Refined recursive bound \n\nIf the (sub)network is densely or fully connected, the simple recursive methods presented earlier can hardly uncover any useful structure. A large number of recursive steps are then needed before exact methods can be relied upon, and the accuracy of the overall bound is compromised. To improve the accuracy, we introduce a more sophisticated variational (upper) bound to replace the one in eq. (6). Denoting x_i = h_i + sum_j J_ij S_j, we have: \n\n1 + e^{x_i} <= e^{x_i/2 + lambda(xbar_i) x_i^2 - F(lambda, xbar_i)} (11) \n\nThe derivation and the functional forms of lambda(xbar_i) and F(lambda, xbar_i) are presented in appendix B. We note here, however, that the bound is exact whenever x_i = xbar_i. In terms of the recursion we obtain \n\nZ_n(h, J) <= e^{h_i/2 + lambda(xbar_i) h_i^2 - F(lambda, xbar_i)} Z_{n-1}(h', J') (12) \n\nwhere \n\nh'_j = h_j + 2 h_i lambda(xbar_i) J_ij + J_ij/2 + lambda(xbar_i) J_ij^2 (13) \nJ'_jk = J_jk + 2 lambda(xbar_i) J_ij J_ik (14) \n\nfor j != k != i. Importantly, and as shown in figure 2a, this refined recursion imposes (qualitatively) the proper structural changes on the remaining network: the variables adjacent to the eliminated (or marginalized) variable become connected. In other words, if J_ij != 0 and J_ik != 0, then J'_jk != 0 after the recursion. \n\nTo substantiate the claim of improved accuracy we tested the refined upper bound recursion against the factorized lower bound recursion in random fully connected networks with 8 variables.(3) The weights in these networks were chosen uniformly in the range [-d, d] and all the initial biases were set to zero. Figure 3a plots the relative errors in the log-partition function estimates for the two recursions as a function of the scale d. \n\n(3) The small network size was chosen to facilitate comparisons with exact results. 
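The quadratic bound of eq. (11) can also be checked numerically. The sketch below assumes the appendix-B forms with g(x) = log(e^{-x/2} + e^{x/2}), which give lambda(xbar) = tanh(xbar/2) / (4 xbar) and F(lambda, xbar) = lambda(xbar) xbar^2 - g(xbar); the helper names are our own:

```python
import math

def g(x):
    """g(x) = log(e^{-x/2} + e^{x/2}), so that log(1 + e^x) = x/2 + g(x)."""
    return math.log(math.exp(-x / 2) + math.exp(x / 2))

def lam(xbar):
    """lambda(xbar): tangent slope of g(sqrt(y)) at y = xbar^2 (appendix B)."""
    return math.tanh(xbar / 2) / (4 * xbar)

def F(xbar):
    """F(lambda, xbar) = lambda(xbar) * xbar^2 - g(xbar)."""
    return lam(xbar) * xbar**2 - g(xbar)

# Tangent upper bound of eq. (11): 1 + e^x <= exp(x/2 + lam(xbar) x^2 - F(xbar)),
# valid for all x because g is concave in x^2, and exact at x = xbar.
xbar = 1.5
for x in [-3.0, -1.0, 0.5, 1.5, 4.0]:
    lhs = 1.0 + math.exp(x)
    rhs = math.exp(x / 2 + lam(xbar) * x**2 - F(xbar))
    assert lhs <= rhs + 1e-12
# Exactness at the point of tangency:
assert abs((1.0 + math.exp(xbar)) - math.exp(xbar / 2 + lam(xbar) * xbar**2 - F(xbar))) < 1e-9
```

The bound holds for every x and touches 1 + e^x exactly at x = xbar, which is what makes the choice of xbar a variational degree of freedom.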
Figure 2: a) The graphical changes in the network following the refined recursion match those of proper marginalization. b) Example of a chain graph. The dotted ovals indicate the undirected clusters. \n\nFigure 3: a) The mean relative errors in the log-partition function as a function of the scale of the random weights (uniform in [-d, d]). Solid line: factorized lower bound recursion; dashed line: refined upper bound recursion. b) Mean relative difference between the upper and lower bound recursions as a function of d sqrt(n)/8, where n is the network size. Solid: n = 8; dashed: n = 64; dot-dashed: n = 128. \n\nFigure 3b reveals how the relative difference between the two bounds is affected by the network size. On the illustrated scale the size has little effect on the difference. We note that the difference is mainly due to the factorized lower bound recursion, as is evident from Figure 3a. \n\n3 Chain graphs and sigmoid belief networks \n\nThe recursive bounds presented earlier can be carried over to chain graphs.(4) An example of a chain graph is given in figure 2b. The joint distribution for a chain graph can be written as a product of conditional distributions for clusters of variables: \n\nP_n(S|J) = prod_k p(S^k | pa[k], h^k, J^k) (15) \n\nwhere S^k = {S_i}_{i in C_k} is the set of variables in cluster k. 
In our case, the conditional probabilities for each cluster are conditional Boltzmann distributions given by \n\np(S^k | pa[k], h^k, J^k) = B(S^k | h^k_pa, J^k) / Z(h^k_pa, J^k) (16) \n\nwhere the added complexity beyond that of ordinary Boltzmann machines is that the Boltzmann factors now also include biases from outside the cluster: \n\n[h^k_pa]_i = h^k_i + sum_{j not in C_k} J^{k,out}_ij S_j (17) \n\nwhere the index i stays within the kth cluster. We note that sigmoid belief networks correspond to the special case where there is only a single binary variable in each cluster; Boltzmann machines, on the other hand, have only one cluster. \n\n(4) While Boltzmann machines are undirected networks (interactions defined through potentials), sigmoid networks are directed models (constructed from conditional probabilities). Chain graphs contain both directed and undirected interactions. \n\nWe now show that the recursive formalism can be extended to chain graphs. This is achieved by rewriting or bounding the conditional probabilities in terms of variational Boltzmann factors. Consequently, the joint distribution - being a product of the conditionals - will also be a Boltzmann factor. Computing likelihoods (marginals) from such a joint distribution amounts to calculating the value of a particular partition function and therefore reduces to the case considered earlier. \n\nIt suffices to find variational Boltzmann factors that bound (or, in some cases, re-represent) the cluster partition functions in the conditional probabilities. We observe first that in the factorized lower bound and refined upper bound recursions, the initial biases appear in the resulting expressions either linearly or quadratically in the exponent.(5) Since the initial biases for the clusters are of the form of eq. 
(17), the resulting expressions must be Boltzmann factors with respect to the variables outside the cluster. Thus, applying the recursive approximations to each cluster partition function yields an upper/lower bound in the form of a Boltzmann factor. Combining such bounds from each cluster finally gives upper/lower bounds on the joint distribution in terms of variational Boltzmann factors. \n\nWe note that for sigmoid belief networks the Boltzmann factors bounding the joint distribution are in fact exact variational translations of the true joint distribution. To see this, let us denote x_i = sum_j J_ij S_j + h_i and use the variational forms, for example, from eqs. (6) and (11): \n\nsigma(x_i) = (1 + e^{-x_i})^{-1} <= e^{mu_i x_i - H(mu_i)} (18) \nsigma(x_i) >= e^{x_i/2 - lambda(xbar_i) x_i^2 + F(lambda, xbar_i)} (19) \n\nwhere the sigmoid function sigma(.) is the inverse of the cluster partition function in this case. Both variational forms are Boltzmann factors (at most quadratic in x_i in the exponent) and are exact if minimized/maximized with respect to the variational parameters. \n\nIn sum, we have shown how the joint distribution for chain graphs can be bounded by (translated into) Boltzmann factors to which the recursive approximation formalism is again applicable. \n\n4 Conclusion \n\nTo reap the benefits of probabilistic formulations of network architectures, approximate methods are often unavoidable in real-world problems. We have developed a recursive node-elimination formalism for rigorously approximating intractable networks. The formalism applies to a large class of networks known as chain graphs and can be straightforwardly integrated with exact probabilistic calculations whenever they are applicable. Furthermore, the formalism provides rigorous upper and lower bounds on any desired quantity (e.g., the variable means), which is crucial in high-risk application domains such as medicine. 
\n\n(5) This follows from the linearity of the propagation rules for the biases, and the fact that the emerging prefactors are either linear or quadratic in the exponent. \n\nReferences \n\nP. Dayan, G. Hinton, R. Neal, and R. Zemel (1995). The Helmholtz machine. Neural Computation 7: 889-904. \nS. L. Lauritzen (1996). Graphical Models. Oxford: Oxford University Press. \nT. Jaakkola and M. Jordan (1996). Computing upper and lower bounds on likelihoods in intractable networks. To appear in Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. \nR. Neal (1992). Connectionist learning of belief networks. Artificial Intelligence 56: 71-113. \nC. Peterson and J. R. Anderson (1987). A mean field theory learning algorithm for neural networks. Complex Systems 1: 995-1019. \nL. K. Saul, T. Jaakkola, and M. I. Jordan (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research 4: 61-76. \nL. Saul and M. Jordan (1996). Exploiting tractable substructures in intractable networks. To appear in Advances in Neural Information Processing Systems 8. MIT Press. \n\nA Factorized upper bound \n\nThe bound follows from the convexity of f(x) = log(1 + e^x) and from an application of Jensen's inequality. Let f_k(x) = f(x + h_k) and note that f_k(x) has the same convexity properties as f. For any convex function f_k we then have (by Jensen's inequality) \n\nf_k( sum_i J_ki S_i ) = f_k( sum_i q_i (J_ki S_i / q_i) ) <= sum_i q_i f_k( J_ki S_i / q_i ) (20) \n\nBy rewriting f_k(J_kj S_j / q_j) = S_j [ f_k(J_kj / q_j) - f_k(0) ] + f_k(0), which holds for binary S_j, we get the desired result. \n\nB Refined upper bound \n\nTo derive the upper bound, consider first \n\nlog(1 + e^x) = x/2 + log(e^{-x/2} + e^{x/2}) (21) \n\nNow, g(x) = log(e^{-x/2} + e^{x/2}) is a symmetric function of x and also a concave function of x^2. Any tangent line to a concave function remains above the function, and so it also serves as an upper bound. 
Therefore we may bound g(x) by the tangents of g(sqrt(y)) (due to the concavity in x^2). Thus \n\nlog(e^{-x/2} + e^{x/2}) <= (d/dy) g(sqrt(y)) (x^2 - y) + g(sqrt(y)) (22) \n= lambda(y) x^2 - F(lambda, y) (23) \n\nwhere \n\nlambda(y) = (d/dy) g(sqrt(y)) (24) \nF(lambda, y) = lambda(y) y - g(sqrt(y)) (25) \n\nThe desired result now follows from the change of variables y = xbar_i^2. Note that the tangent bound is exact whenever x_i = xbar_i (the point at which the tangent is defined). \n", "award": [], "sourceid": 1316, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}