{"title": "Tractable Variational Structures for Approximating Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 183, "page_last": 189, "abstract": null, "full_text": "Tractable Variational Structures for \nApproximating Graphical Models \n\nDavid Barber \n\nWim Wiegerinck \n\n{davidb,wimw}@mbfys,kun,nl \n\nRWCP* Theoretical Foundation SNNt University of Nijmegen \n\n6525 EZ Nijmegen, The Netherlands. \n\nAbstract \n\nGraphical models provide a broad probabilistic framework with ap(cid:173)\nplications in speech recognition (Hidden Markov Models), medical \ndiagnosis (Belief networks) and artificial intelligence (Boltzmann \nMachines). However, the computing time is typically exponential \nin the number of nodes in the graph. Within the variational frame(cid:173)\nwork for approximating these models, we present two classes of dis(cid:173)\ntributions, decimatable Boltzmann Machines and Tractable Belief \nNetworks that go beyond the standard factorized approach. We \ngive generalised mean-field equations for both these directed and \nundirected approximations. Simulation results on a small bench(cid:173)\nmark problem suggest using these richer approximations compares \nfavorably against others previously reported in the literature. \n\n1 \n\nIntroduction \n\nGraphical models provide a powerful framework for probabilistic inference[l] but \nsuffer intractability when applied to large scale problems. Recently, variational ap(cid:173)\nproximations have been popular [2, 3, 4, 5], and have the advantage of providing \nrigorous bounds on quantities of interest, such as the data likelihood, in contrast to \nother approximate procedures such as Monte Carlo methods[l]. One of the original \nmodels in the neural networks community, the Boltzmann machine (BM), belongs \nto the class of undirected graphical models. The lack of a suitable algorithm has \nhindered its application to larger problems. 
The deterministic BM algorithm[6], a variational procedure using a factorized approximating distribution, speeds up the learning of BMs, although the simplicity of this approximation can lead to undesirable effects [7]. Factorized approximations have also been successfully applied to sigmoid belief networks[4]. One approach to producing a more accurate approximation is to go beyond the class of factorized approximating models by using, for example, mixtures of factorized models. However, it may be that very many mixture components are needed to obtain a significant improvement beyond using the factorized approximation[5]. In this paper, after describing the variational learning framework, we introduce two further classes of non-factorized approximations, one undirected (decimatable BMs in section (3)) and the other directed (Tractable Belief Networks in section (4)). To demonstrate the potential benefits of these methods, we include results on a toy benchmark problem in section (5) and discuss their relation to other methods in section (6). \n\n*Real World Computing Partnership \n\u2020Foundation for Neural Networks \n\n2 Variational Learning \n\nWe assume the existence of a graphical model P with known qualitative structure but for which the quantitative parameters of the structure remain to be learned from data. Given that the variables can be considered as either visible (V) or hidden (H), one approach to learning is to carry out maximum likelihood on the visible variables for each example in the dataset. 
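One route to such a maximum likelihood learning signal is variational: for any normalized distribution Q(H), the log likelihood ln P(V) is lower-bounded by an entropy term plus an energy term, as derived formally below. As a quick numerical illustration (ours, not from the paper; the tables are arbitrary made-up values), the sketch checks this bound for a random unnormalized joint P(H,V) with V clamped:

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical joint P(H, V) over 8 hidden states with V clamped: any positive table
PHV = rng.random(8)
PHV /= 4.0                      # arbitrary scale; P(V) = sum_H P(H, V)
Q = rng.random(8)
Q /= Q.sum()                    # any normalized variational distribution Q(H)

# entropy term plus energy term never exceeds the true log likelihood
lower = -(Q @ np.log(Q)) + Q @ np.log(PHV)
print(lower <= np.log(PHV.sum()))
```

The gap between the two sides is exactly the KL divergence between Q(H) and P(H|V), which is why the bound is tight when Q matches the true posterior.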
Considering the KL divergence between the true distribution P(H|V) and a distribution Q(H), \n\nKL(Q(H), P(H|V)) = \u2211_H Q(H) ln [Q(H)/P(H|V)] \u2265 0 \n\nand using P(H|V) = P(H,V)/P(V) gives the bound \n\nln P(V) \u2265 - \u2211_H Q(H) ln Q(H) + \u2211_H Q(H) ln P(H,V)   (1) \n\nBetraying the connection to statistical physics, the first term is termed the \"entropy\" and the second the \"energy\". One typically chooses a variational distribution Q so that the entropic term is \"tractable\". We assume that the energy E(Q) is similarly computable, perhaps with recourse to some extra variational bound (as in section (5)). By tractable, we mean that all necessary marginals and desired quantities are computationally feasible, regardless of the issue of the scaling of the computational effort with the graph size. Learning consists of two iterating steps: first optimize the bound (1) with respect to the parameters of Q, and then with respect to the parameters of P(H,V). We concentrate here on the first step. For clarity, we present our approach for the case of binary variables s_i \u2208 {0,1}, i = 1,...,N. We now consider two classes of approximating distributions Q. \n\n3 Undirected Q: Decimatable Boltzmann Machines \n\nBoltzmann machines describe probability distributions parameterized by a symmetric weight matrix J, \n\nQ(s) = (1/Z) exp \u03c6,   \u03c6 \u2261 \u2211_ij J_ij s_i s_j = s\u00b7Js   (2) \n\nwhere the normalization constant, or \"partition function\", is Z = \u2211_s exp \u03c6. For convenience we term the diagonals of J the \"biases\", h_i = J_ii. Since ln Z(J,h) is a generating function for the first and second order statistics of the variables s, the entropy is tractable provided that Z is tractable. For general connection structures J, computing Z is intractable as it involves a sum over 2^N states; however, not all Boltzmann machines are intractable. A class of tractable structures is described by a set of so-called decimation rules in which nodes from the graph can be removed one by one, fig(1). Provided that appropriate local changes are made to the BM parameters, the partition function of the reduced graph remains unaltered (see eg [2]). For example, node c in fig(1) can be removed, provided that the weight matrix J and bias h are transformed, J \u2192 J', h \u2192 h', with J'_ac = J'_bc = h'_c = 0 and \n\nJ'_ab = J_ab + (1/2) ln [ (1 + e^{h_c})(1 + e^{h_c + 2(J_ac + J_bc)}) / ((1 + e^{h_c + 2J_ac})(1 + e^{h_c + 2J_bc})) ],   h'_{a/b} = h_{a/b} + ln [ (1 + e^{h_c + 2J_{a/b,c}}) / (1 + e^{h_c}) ]   (3) \n\nFigure 1: A decimation rule for BMs. We can remove the upper node on the left so that the partition function of the reduced graph is the same. This requires a simple change in the parameters J, h coupling the two nodes on the right (see text). \n\nBy repeatedly applying such rules, Z is calculable in time linear in N. \n\n3.1 Fixed point (Mean Field) Equations \n\nUsing (2) in (1), the bound we wish to optimize with respect to the parameters \u03b8 = (J,h) of Q has the form (\u27e8...\u27e9 denotes averages with respect to Q) \n\nB(\u03b8) = -\u27e8\u03c6\u27e9 + ln Z + E(\u03b8)   (4) \n\nwhere E(\u03b8) is the energy. Differentiating (4) with respect to J_ij (i \u2260 j) gives \n\n\u2202B/\u2202J_ij = - \u2211_kl F_ij,kl J_kl + \u2202E/\u2202J_ij   (5) \n\nwhere F_ij,kl = \u27e8s_i s_j s_k s_l\u27e9 - \u27e8s_i s_j\u27e9\u27e8s_k s_l\u27e9 is the Fisher information matrix. A similar expression holds for the bias parameters, h, so that we can form a linear fixed point equation in the total parameter set \u03b8 where the derivatives of the bound vanish. This suggests the iterative solution, \u03b8^new = F^{-1} \u2207E, where the right hand side is evaluated at the current parameter values, \u03b8^old. 
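The decimation rule (3) can be checked numerically. The sketch below is our illustration, not code from the paper; the function names `log_Z` and `decimate` are ours, and we track the constant factor (1 + e^{h_c}) contributed by the summed-out node separately, so that the full log partition function equals that of the reduced graph plus this constant:

```python
import numpy as np
from itertools import product

def log_Z(J):
    """Brute-force log partition function, phi(s) = s.J.s (diagonal of J = biases h)."""
    N = J.shape[0]
    states = (np.array(s) for s in product([0.0, 1.0], repeat=N))
    return np.log(sum(np.exp(s @ J @ s) for s in states))

def decimate(J, c, a, b):
    """Remove node c (coupled only to a and b) using rule (3).
    Returns the reduced weight matrix and the log-factor ln(1 + e^{h_c})."""
    Jn = J.astype(float).copy()
    hc, Jac, Jbc = J[c, c], J[a, c], J[b, c]
    # coupling update of eq (3)
    Jn[a, b] += 0.5 * np.log((1 + np.exp(hc)) * (1 + np.exp(hc + 2 * (Jac + Jbc)))
                             / ((1 + np.exp(hc + 2 * Jac)) * (1 + np.exp(hc + 2 * Jbc))))
    Jn[b, a] = Jn[a, b]
    # bias updates of eq (3)
    Jn[a, a] += np.log((1 + np.exp(hc + 2 * Jac)) / (1 + np.exp(hc)))
    Jn[b, b] += np.log((1 + np.exp(hc + 2 * Jbc)) / (1 + np.exp(hc)))
    keep = [i for i in range(J.shape[0]) if i != c]
    return Jn[np.ix_(keep, keep)], np.log(1 + np.exp(hc))

# 3-node example: node 2 plays the role of c, coupled to a=0 and b=1
J = np.array([[0.3, 0.5, 0.7],
              [0.5, -0.2, -0.4],
              [0.7, -0.4, 0.1]])
Jr, logC = decimate(J, c=2, a=0, b=1)
print(np.isclose(log_Z(J), log_Z(Jr) + logC))
```

Applying such a rule repeatedly removes nodes one by one, which is why Z of a decimatable graph costs time linear in N rather than the 2^N of the brute-force sum.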
\n\n4 Directed Q: Tractable Belief Networks \n\nBelief networks are products of conditional probability distributions, \n\nQ(H) = \u220f_{i \u2208 H} Q(H_i|\u03c0_i)   (6) \n\nin which \u03c0_i denotes the parents of node i (see for example, [1]). The efficiency of computation depends on the underlying graphical structure of the model and is exponential in the maximal clique size (of the moralized triangulated graph [1]). We now assume that our model class consists of belief networks with a fixed, tractable graphical structure. The entropy can then be computed efficiently since it decouples into a sum of averaged entropies per site i (with the convention Q(\u03c0_i) \u2261 1 if \u03c0_i = \u2205), \n\n- \u2211_H Q(H) ln Q(H) = - \u2211_{i \u2208 H} \u2211_{\u03c0_i} Q(\u03c0_i) \u2211_{H_i} Q(H_i|\u03c0_i) ln Q(H_i|\u03c0_i)   (7) \n\nNote that the conditional entropy at each site i is trivial to compute since the values required can be read off directly from the definition of Q (6). By assumption, the marginals Q(\u03c0_i) are tractable, and can be found by standard methods, for example using the Junction Tree Algorithm[1]. \n\nTo optimize the bound (1), we parameterize Q via its conditional probabilities, q_i(\u03c0_i) \u2261 Q(H_i = 1|\u03c0_i). The remaining probability Q(H_i = 0|\u03c0_i) follows from normalization. We therefore have a set {q_i(\u03c0_i) | \u03c0_i = (0...0), ..., (1...1)} of variational parameters for each node in the graph. Setting the gradient of the bound with respect to the q_i(\u03c0_i)'s equal to zero yields fixed point equations of the form \n\nq_i(\u03c0_i) = \u03c3(...)   (8), (9) \n\nwhere \u03c3(z) = 1/(1 + e^{-z}) and the gradient \u2207_{i\u03c0_i} is with respect to q_i(\u03c0_i). The explicit evaluation of the gradients can be performed efficiently, since all that need to be differentiated are at most scalar functions of quantities that depend again only linearly on the parameters q_i(\u03c0_i). To optimize the bound, we iterate (8) until convergence, analogous to using factorized models[4]. 
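The decoupled entropy (7) is just the chain rule for entropy applied to the factorization (6). The following sketch (ours, with made-up conditional probabilities) verifies it for a three-node chain Q(h1)Q(h2|h1)Q(h3|h2): the sum of parent-averaged conditional entropies matches the brute-force entropy.

```python
import numpy as np
from itertools import product

# chain Q(h1) Q(h2|h1) Q(h3|h2); q_i(pi_i) = Q(h_i = 1 | parent state), made-up values
q1 = 0.3
q2 = {0: 0.6, 1: 0.2}
q3 = {0: 0.1, 1: 0.9}

def Q(h1, h2, h3):
    b = lambda p, s: p if s == 1 else 1.0 - p
    return b(q1, h1) * b(q2[h1], h2) * b(q3[h2], h3)

# brute-force entropy: -sum_H Q ln Q over all 2^3 states
H_brute = -sum(Q(*h) * np.log(Q(*h)) for h in product([0, 1], repeat=3))

def Hbin(p):  # entropy of one binary conditional
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# decomposed form (7): per-site conditional entropies weighted by the
# parent marginals Q(pi_i), which are cheap for a tractable structure
m1 = q1                                # marginal Q(h1 = 1)
m2 = (1 - m1) * q2[0] + m1 * q2[1]     # marginal Q(h2 = 1)
H_sites = (Hbin(q1)
           + (1 - m1) * Hbin(q2[0]) + m1 * Hbin(q2[1])
           + (1 - m2) * Hbin(q3[0]) + m2 * Hbin(q3[1]))
print(np.isclose(H_brute, H_sites))
```

Only the parent marginals Q(\u03c0_i) are needed, which is exactly what the Junction Tree Algorithm supplies for a tractable Q.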
However, the more powerful class of approximating distributions described by belief networks should enable a much tighter bound on the likelihood of the visible units. \n\n5 Application to Sigmoid Belief Networks \n\nWe now describe an application of these non-factorized approximations to a particular class of directed graphical models, sigmoid belief networks[8], for which the conditional distributions have the form \n\nP(s_i = 1|\u03c0_i) = \u03c3(z_i)   (10) \n\nwith W_ij = 0 if j \u2209 \u03c0_i. The joint distribution then has the form \n\nP(H,V) = \u220f_i exp [z_i s_i - ln(1 + e^{z_i})]   (11) \n\nwhere z_i = \u2211_j W_ij s_j + k_i. In (11) it is to be understood that the visible units are set to their observed values. In the lower bound (1), unfortunately, the average of ln P(H,V) is not tractable, since \u27e8ln [1 + e^{z_i}]\u27e9 does not decouple into a polynomial number of single site averages. Following [4] we use therefore the bound \n\n\u27e8ln [1 + e^{z_i}]\u27e9 \u2264 \u03be_i \u27e8z_i\u27e9 + ln \u27e8e^{-\u03be_i z_i} + e^{(1-\u03be_i) z_i}\u27e9   (12) \n\nwhere \u03be_i is a variational parameter in [0,1]. We can then define the energy function \n\nE(Q,\u03be) = \u2211_ij W_ij \u27e8s_i s_j\u27e9 + \u2211_i k'_i \u27e8s_i\u27e9 - \u2211_i k_i \u03be_i - \u2211_i ln \u27e8e^{-\u03be_i z_i} + e^{(1-\u03be_i) z_i}\u27e9   (13) \n\nwhere k'_i = k_i - \u2211_j \u03be_j W_ji. Except for the final term, the energy is a function of first or second order statistics of the variables. For using a BM as the variational distribution, the final terms of (13), \u27e8e^{-\u03be_i z_i}\u27e9 = \u2211_H e^{-\u03be_i z_i} e^{\u03c6}/Z, are simply the ratio of two partition functions, with the one in the numerator having a shifted bias. This is therefore tractable, provided that we use a tractable BM Q. \n\nSimilarly, if we are using a Belief Network as the variational distribution, all but the last term in (13) is trivially tractable, provided that Q is tractable. We write the terms \u27e8e^{-\u03be_i z_i}\u27e9 = e^{-\u03be_i k_i} \u2211_H R(H), where R(H) = \u220f_j R(H_j|\u03c0_j) and R(H_j|\u03c0_j) \u2261 Q(H_j|\u03c0_j) exp(-\u03be_i W_ij H_j). R and Q have the same graphical structure and we can therefore use message propagation techniques again to compute \u27e8e^{-\u03be_i z_i}\u27e9. \n\nFigure 2: (a) Sigmoid Belief Network toy problem for which we approximate ln P(V); hidden units are black. (b) Decimatable BM - 25 parameters, mean: 0.0020. (c) disconnected ('standard mean field') - 16 parameters, mean: 0.01571; max. clique size: 1. (d) chain - 19 parameters, mean: 0.01529; max. clique size: 2. (e) trees - 20 parameters, mean: 0.0089; max. clique size: 2. (f) network - 28 parameters, mean: 0.00183; max. clique size: 3. (c,d,e,f) are structures of the directed approximations on H. For each structure, a histogram of the relative error between the true log likelihood and the lower bound is plotted; the horizontal scale has been fixed to [0,0.05] in all plots. The maximum clique size refers to the complexity of computation for each approximation, which is exponential in this quantity. The number of parameters includes the vector \u03be. \n\nTo test our methods numerically, we generated 500 networks with parameters {W_ij, k_j} drawn randomly from the uniform distribution over [-1,1]. The lower bounds F_V for several approximating structures are compared with the true log likelihood, using the relative error \u03b5 = F_V / ln P(V) - 1, fig. 2. These show that considerable improvements can be obtained when non-factorized variational distributions are used. Note that a 5 component mixture model (\u2248 80 variational parameters) yields \u03b5 = 0.01139 on this problem [5]. These results suggest therefore that exploiting knowledge of the graphical structure of the model is useful. For instance, the chain (fig. 
2(d)) with no graphical overlap with the original graph shows hardly any improvement over the standard mean field approximation. On the other hand, the tree model (fig. 2(e)), which has about the same number of parameters, but a larger overlap with the original graph, does improve considerably over the mean field approximation (and even over the 5 component mixture model). By increasing the overlap, as in fig. 2(f), the improvement gained is even greater. \n\n6 Discussion \n\nIn this section, we briefly explain the relationship of the introduced methods to other, \"non-factorized\" methods in the literature, namely node-elimination[9] and substructure variation[10]. \n\n6.1 Graph Partitioning and Node Elimination \n\nA further class of approximating distributions Q that could be considered are those in which the nodes can be partitioned into clusters, with independencies between the clusters. For expositional clarity, consider two partitions, s = (s_1, s_2), and define Q to be factorized over these partitions2, Q = Q_1(s_1) Q_2(s_2). Using this Q in (1), we obtain (with obvious notational simplifications) \n\nln P(V) \u2265 - \u27e8ln Q_1\u27e9_1 - \u27e8ln Q_2\u27e9_2 + \u27e8ln P\u27e9_{1,2}   (14) \n\nA functional derivative with respect to Q_1 and Q_2 gives the optimal forms: \n\nQ_2 = exp \u27e8ln P\u27e9_1 / Z_2   (15) \n\nIf we substitute this form for Q_2 in (14) and use Z_2 = \u2211_{s_2} exp \u27e8ln P\u27e9_1, we obtain \n\nln P(V) \u2265 - \u27e8ln Q_1\u27e9_1 + ln \u2211_{s_2} exp \u27e8ln P\u27e9_1   (16) \n\nIn general, the final term may not have a simple form. 
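The partitioned bound (16), with the optimal Q_2 already substituted, can be checked directly on a toy table. The sketch below is our construction: an arbitrary unnormalized P over two small clusters and a random Q_1 over the first cluster, verifying that the right hand side of (16) never exceeds the true log normalizer:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 4)) + 0.1     # hypothetical unnormalized P(s1, s2), 4 states per cluster
Q1 = rng.random(4)
Q1 /= Q1.sum()                   # arbitrary distribution over cluster s1

entropy1 = -(Q1 @ np.log(Q1))             # -<ln Q1>_1
avg_logP = Q1 @ np.log(P)                 # <ln P>_1, one value per state of s2
rhs = entropy1 + np.log(np.exp(avg_logP).sum())   # right hand side of (16)
print(rhs <= np.log(P.sum()))
```

With Q_2 eliminated analytically, only the Q_1 parameters remain to be optimized, which is the practical appeal of the partitioned form.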
In the case of approximating a BM P, ln P = s_1\u00b7J_11 s_1 + 2 s_1\u00b7J_12 s_2 + s_2\u00b7J_22 s_2 - ln Z_P. Used in (16), we get: \n\nln P(V) \u2265 - \u27e8ln Q_1\u27e9_1 - ln Z_P + \u27e8s_1\u00b7J_11 s_1\u27e9_1 + ln \u2211_{s_2} exp (s_2\u00b7J_22 s_2 + 2 s_2\u00b7J_21 \u27e8s_1\u27e9_1)   (17) \n\nso that the final term of (17) is the normalizing constant of a BM with connection matrix J_22 and whose diagonals are shifted by 2 J_21 \u27e8s_1\u27e9_1. One can therefore identify a set of nodes s_1 which, when eliminated, reveal a tractable structure on the nodes s_2. The nodes that were removed are compensated for by using a variational distribution Q_1(s_1). If P is a BM, then the optimal Q_1 has its weights fixed to those of P restricted to variables s_1, but with variable biases shifted by 2 J_12 \u27e8s_2\u27e9_2. Restricting Q_1 to factorized models, we recover the node elimination bound [9], which can readily be improved by considering non-factorized distributions Q_1 (for example those introduced in this paper), see fig(3). Note, however, that there is no a priori guarantee that using such partitioned approximations will lead to a better approximation than that obtained from a tractable variational distribution defined on the whole graph, but which does not have such a product form. Using a product of conditional distributions over clusters of nodes is developed more fully in [11]. \n\n6.2 Substructure Variation \n\nThe process of using a Q defined on the whole graph but for which only a subset of the connections are adaptive is termed substructure variation [10]. In the context of BMs, Saul et al [2] identified weights in the original intractable distribution P that, if set to zero, would lead to a tractable graph Q(s) = P(s|h, J, J_intractable = 0). To compensate for these removed weights they allowed the biases in Q to vary such that the KL divergence between Q and P is minimized. 
In general, this is a weaker method than one in which potentially all the parameters in the approximating network are adaptive, such as using a decimatable BM. \n\n2In the case of fully connected BMs, for computing with a Q which is the product of K partitions (each of which is fully connected, say), the computing time reduces from 2^N for the \"intractable\" P to K 2^{N/K} for Q, which can be a considerable reduction. \n\nFigure 3: (a) A non-decimatable 5 node BM (intractable model). (b) The standard (\"naive\") factorized mean field approximation. (c) Node Elimination. (d) Partitioning, where a richer distribution is considered on the eliminated nodes. A solid line denotes a weight fixed to those in the original graph. A solid node is fixed, and an open node represents a variable bias. \n\n7 Conclusion \n\nFinding accurate, controllable approximations of graphical models is crucial if their application to large scale problems is to be realised. We have elucidated two general classes of tractable approximations, both based on the Kullback-Leibler divergence. Future interesting directions include extending the class of distributions to higher order Boltzmann Machines (for which the class of decimation rules is greater), and to mixtures of these approaches. Higher order perturbative approaches are considered in [12]. These techniques therefore extend the approximating power of tractable models, which can lead to a considerable improvement in performance. \n\n[1] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1997. \n\n[2] L. K. Saul and M. I. Jordan. Boltzmann Chains and Hidden Markov Models. In G. Tesauro, D. S. Touretzky, and T. K. 
Leen, editors, Advances in Neural Information Processing Systems 7, pages 435-442. MIT Press, 1995. \n\n[3] T. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Massachusetts Institute of Technology, 1997. \n\n[4] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean Field Theory for Sigmoid Belief Networks. Journal of Artificial Intelligence Research, 4:61-76, 1996. \n\n[5] C. M. Bishop, N. Lawrence, T. Jaakkola, and M. I. Jordan. Approximating Posterior Distributions in Belief Networks using Mixtures. In Advances in Neural Information Processing Systems 10. MIT Press, 1998. \n\n[6] C. Peterson and J. R. Anderson. A Mean Field Theory Learning Algorithm for Neural Networks. Complex Systems, 1:995-1019, 1987. \n\n[7] C. C. Galland. The limitations of deterministic Boltzmann machine learning. Network: Computation in Neural Systems, 4:355-379, 1993. \n\n[8] R. Neal. Connectionist learning of Belief Networks. Artificial Intelligence, 56:71-113, 1992. \n\n[9] T. S. Jaakkola and M. I. Jordan. Recursive Algorithms for Approximating Probabilities in Graphical Models. In Advances in Neural Information Processing Systems 9. MIT Press, 1996. \n\n[10] L. K. Saul and M. I. Jordan. Exploiting Tractable Substructures in Intractable Networks. In Advances in Neural Information Processing Systems 8. MIT Press, 1996. \n\n[11] W. Wiegerinck and D. Barber. Mean Field Theory based on Belief Networks for Approximate Inference. In ICANN 98, 1998. \n\n[12] D. Barber and P. van de Laar. Variational Cumulant Expansions for Intractable Distributions. Journal of Artificial Intelligence Research, 1998. Accepted. \n", "award": [], "sourceid": 1617, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Wim", "family_name": "Wiegerinck", "institution": null}]}