{"title": "Approximate Inference A lgorithms for Two-Layer Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 533, "page_last": 539, "abstract": null, "full_text": "Approximate inference algorithms for two-layer \n\nBayesian networks \n\nAndrewY. Ng \n\nComputer Science Division \n\nUC Berkeley \n\nBerkeley, CA 94720 \nang@cs.berkeley.edu \n\nMichael I. Jordan \n\nComputer Science Division and \n\nDepartment of Statistics \n\nUC Berkeley \n\nBerkeley, CA 94720 \n\njordan@cs.berkeley.edu \n\nAbstract \n\nWe present a class of approximate inference algorithms for graphical \nmodels of the QMR-DT type. We give convergence rates for these al(cid:173)\ngorithms and for the Jaakkola and Jordan (1999) algorithm, and verify \nthese theoretical predictions empirically. We also present empirical re(cid:173)\nsults on the difficult QMR-DT network problem, obtaining performance \nof the new algorithms roughly comparable to the Jaakkola and Jordan \nalgorithm. \n\n1 Introduction \n\nThe graphical models formalism provides an appealing framework for the design and anal(cid:173)\nysis of network-based learning and inference systems. The formalism endows graphs with \na joint probability distribution and interprets most queries of interest as marginal or con(cid:173)\nditional probabilities under this joint. For a fixed model one is generally interested in the \nconditional probability of an output given an input (for prediction), or an input conditional \non the output (for diagnosis or control). During learning the focus is usually on the like(cid:173)\nlihood (a marginal probability), on the conditional probability of unobserved nodes given \nobserved nodes (e.g., for an EM or gradient-based algorithm), or on the conditional proba(cid:173)\nbility of the parameters given the observed data (in a Bayesian setting). \n\nIn all of these cases the key computational operation is that of marginalization. 
There are several methods available for computing marginal probabilities in graphical models, most of which involve some form of message-passing on the graph. Exact methods, while viable in many interesting cases (involving sparse graphs), are infeasible in the dense graphs that we consider in the current paper. A number of approximation methods have evolved to treat such cases; these include search-based methods, loopy propagation, stochastic sampling, and variational methods. \n\nVariational methods, the focus of the current paper, have been applied successfully to a number of large-scale inference problems. In particular, Jaakkola and Jordan (1999) developed a variational inference method for the QMR-DT network, a benchmark network involving over 4,000 nodes (see below). The variational method provided accurate approximations to posterior probabilities within a second of computer time. For this difficult inference problem exact methods are entirely infeasible (see below), loopy propagation does not converge to correct posteriors (Murphy, Weiss, & Jordan, 1999), and stochastic sampling methods are slow and unreliable (Jaakkola & Jordan, 1999). \n\nA significant step forward in the understanding of variational inference was made by Kearns and Saul (1998), who used large deviation techniques to analyze the convergence rate of a simplified variational inference algorithm. Imposing conditions on the magnitude of the weights in the network, they established an O(√(log N / N)) rate of convergence for the error of their algorithm, where N is the fan-in. \n\nIn the current paper we utilize techniques similar to those of Kearns and Saul to derive a new set of variational inference algorithms with rates that are faster than O(√(log N / N)). Our techniques also allow us to analyze the convergence rate of the Jaakkola and Jordan (1999) algorithm. 
We test these algorithms on an idealized problem and verify that our analysis correctly predicts their rates of convergence. We then apply these algorithms to the difficult QMR-DT network problem. \n\n2 Background \n\n2.1 The QMR-DT network \n\nThe QMR-DT (Quick Medical Reference, Decision-Theoretic) network is a bipartite graph with approximately 600 top-level nodes d_i representing diseases and approximately 4000 lower-level nodes f_i representing findings (observed symptoms). All nodes are binary-valued. Each disease is given a prior probability P(d_i = 1), obtained from archival data, and each finding is parameterized as a \"noisy-OR\" model: \n\nP(f_i = 1 | d) = 1 - e^{-θ_{i0} - Σ_{j∈π_i} θ_{ij} d_j}, \n\nwhere π_i is the set of parent diseases for finding f_i and where the parameters θ_{ij} are obtained from assessments by medical experts (see Shwe et al., 1991). \n\nLetting z_i = θ_{i0} + Σ_{j∈π_i} θ_{ij} d_j, we have the following expression for the likelihood1: \n\nP(f) = Σ_d [ Π_{i: f_i=1} (1 - e^{-z_i}) Π_{i: f_i=0} e^{-z_i} Π_j P(d_j) ],  (1) \n\nwhere the sum is a sum across the approximately 2^600 configurations of the diseases. Note that the second product, a product over the negative findings, factorizes across the diseases d_j; these factors can be absorbed into the priors P(d_j) and have no significant effect on the complexity of inference. It is the positive findings which couple the diseases and prevent the sum from being distributed across the product. \n\nGeneric exact algorithms such as the junction tree algorithm scale exponentially in the size of the maximal clique in a moralized, triangulated graph. Jaakkola and Jordan (1999) found cliques of more than 150 nodes in QMR-DT; this rules out the junction tree algorithm. Heckerman (1989) discovered a factorization specific to QMR-DT that reduces the complexity substantially; however the resulting algorithm still scales exponentially in the number of positive findings and is only feasible for a small subset of the benchmark cases. 
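To make the likelihood sum of Eq. (1) concrete, the following is a minimal brute-force sketch (illustrative code, not from the paper; all function and variable names are ours) that evaluates P(f) for a tiny bipartite noisy-OR network by enumerating every disease configuration — exactly the sum that becomes intractable at QMR-DT scale, where it ranges over roughly 2^600 configurations:

```python
import itertools
import math

# Brute-force evaluation of Eq. (1) for a small bipartite noisy-OR network.
# theta0[i] is the leak weight theta_{i0}; theta[i][j] is theta_{ij};
# priors[j] is P(d_j = 1); findings[i] is the observed value of f_i.
def noisy_or_likelihood(theta0, theta, priors, findings):
    N, K = len(priors), len(findings)
    total = 0.0
    for d in itertools.product((0, 1), repeat=N):   # all 2^N disease configurations
        # prior probability of this disease configuration (diseases are parent-less)
        p_d = math.prod(priors[j] if d[j] else 1.0 - priors[j] for j in range(N))
        # finding probabilities given d, via z_i = theta_{i0} + sum_j theta_{ij} d_j
        p_f = 1.0
        for i in range(K):
            z = theta0[i] + sum(theta[i][j] * d[j] for j in range(N))
            p_pos = 1.0 - math.exp(-z)              # P(f_i = 1 | d), noisy-OR
            p_f *= p_pos if findings[i] else 1.0 - p_pos
        total += p_d * p_f                          # joint P(f, d), summed over d
    return total

# Toy example: 3 diseases, 2 findings (one positive, one negative).
theta0 = [0.1, 0.2]
theta = [[0.3, 0.2, 0.1], [0.1, 0.4, 0.2]]
priors = [0.3, 0.5, 0.2]
likelihood = noisy_or_likelihood(theta0, theta, priors, [1, 0])
```

Since the sum ranges over 2^N configurations, this is feasible only for small N; the algorithms discussed in this paper replace the sum with tractable approximations.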
\n\n1 In this expression, the factors P(d_j) are the probabilities associated with the (parent-less) disease nodes, the factors (1 - e^{-z_i}) are the probabilities of the (child) finding nodes that are observed to be in their positive state, and the factors e^{-z_i} are the probabilities of the negative findings. The resulting product is the joint probability P(f, d), which is marginalized to obtain the likelihood P(f). \n\n2.2 The Jaakkola and Jordan (JJ) algorithm \n\nJaakkola and Jordan (1999) proposed a variational algorithm for approximate inference in the QMR-DT setting. Briefly, their approach is to make use of the following variational inequality: \n\n1 - e^{-z_i} ≤ e^{λ_i z_i - c_i}, \n\nwhere c_i is a deterministic function of λ_i. This inequality holds for arbitrary values of the free \"variational parameter\" λ_i. Substituting these variational upper bounds for the probabilities of positive findings in Eq. (1), one obtains a factorizable upper bound on the likelihood. Because of the factorizability, the sum across diseases can be distributed across the joint probability, yielding a product of sums rather than a sum of products. One then minimizes the resulting expression with respect to the variational parameters to obtain the tightest possible variational bound. \n\n2.3 The Kearns and Saul (KS) algorithm \n\nA simplified variational algorithm was proposed by Kearns and Saul (1998), whose main goal was the theoretical analysis of the rates of convergence for variational algorithms. In their approach, the local conditional probability for the finding f_i is approximated by its value at a point a small distance ε_i above or below (depending on whether upper or lower bounds are desired) the mean input E[z_i]. This yields a variational algorithm in which the values ε_i are the variational parameters to be optimized. 
Under the assumption that the weights θ_ij are bounded in magnitude by τ/N, where τ is a constant and N is the number of parent (\"disease\") nodes, Kearns and Saul showed that the error in likelihood for their algorithm converges at a rate of O(√(log N / N)). \n\n3 Algorithms based on local expansions \n\nInspired by Kearns and Saul (1998), we describe the design of approximation algorithms for QMR-DT obtained by expansions around the mean input to the finding nodes. Rather than using point approximations as in the Kearns-Saul (KS) algorithm, we make use of Taylor expansions. (See also Plefka (1982), and Barber and van de Laar (1999) for other perturbational techniques.) \n\nConsider a generalized QMR-DT architecture in which the noisy-OR model is replaced by a general function ψ(z): ℝ → [0, 1] having uniformly bounded derivatives, i.e., |ψ^{(i)}(z)| ≤ B_i. Define \n\nF(z_1, ..., z_K) = Π_{i=1}^K (ψ(z_i))^{f_i} (1 - ψ(z_i))^{1-f_i} \n\nso that the likelihood can be written as \n\nP(f) = E_{{z_i}}[F(z_1, ..., z_K)].  (2) \n\nAlso define μ_i = E[z_i] = θ_{i0} + Σ_{j=1}^N θ_ij P(d_j = 1). \n\nA simple mean-field-like approximation can be obtained by evaluating F at the mean values: \n\nP(f) ≈ F(μ_1, ..., μ_K).  (3) \n\nWe refer to this approximation as \"MF(0).\" \n\nExpanding the function F to second order, and defining ε_i = z_i - μ_i, we have: \n\nP(f) = E_{{ε_i}}[ F(μ) + Σ_{i1=1}^K F_{i1}(μ) ε_{i1} + (1/2) Σ_{i1=1}^K Σ_{i2=1}^K F_{i1 i2}(μ) ε_{i1} ε_{i2} + R ],  (4) \n\nwhere the subscripts on F represent derivatives and R is the remainder term. Dropping the remainder term and bringing the expectation inside, we have the \"MF(2)\" approximation: \n\nP(f) ≈ F(μ) + (1/2) Σ_{i1=1}^K Σ_{i2=1}^K F_{i1 i2}(μ) E[ε_{i1} ε_{i2}]. \n\nMore generally, we obtain a \"MF(i)\" approximation by carrying out a Taylor expansion to i-th order. 
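As a concrete check of the expansions above, the following sketch (illustrative code, not the authors'; all names are ours) computes the MF(0) and MF(2) estimates for a small noisy-OR network. It uses the closed-form covariances E[ε_{i1} ε_{i2}] = Σ_j θ_{i1 j} θ_{i2 j} p_j (1 - p_j), which follow from the independence of the diseases under their priors:

```python
import itertools
import math

def factor(z, f):
    # Per-finding factor g_i and its first two derivatives for noisy-OR:
    # g = psi(z) = 1 - exp(-z) if f_i = 1, else g = 1 - psi(z) = exp(-z).
    e = math.exp(-z)
    return (1.0 - e, e, -e) if f else (e, -e, e)

def mf_estimates(theta0, theta, priors, findings):
    K, N = len(findings), len(priors)
    # mean inputs mu_i = theta_{i0} + sum_j theta_{ij} P(d_j = 1)
    mu = [theta0[i] + sum(theta[i][j] * priors[j] for j in range(N))
          for i in range(K)]
    g = [factor(mu[i], findings[i]) for i in range(K)]
    mf0 = math.prod(gi[0] for gi in g)                  # MF(0): F at the mean
    # covariances E[eps_i1 eps_i2] under independent diseases
    cov = [[sum(theta[i1][j] * theta[i2][j] * priors[j] * (1.0 - priors[j])
                for j in range(N)) for i2 in range(K)] for i1 in range(K)]
    # second-order correction (1/2) sum_{i1,i2} F_{i1 i2}(mu) E[eps_i1 eps_i2]
    corr = 0.0
    for i1 in range(K):
        for i2 in range(K):
            if i1 == i2:
                d2 = g[i1][2] * mf0 / g[i1][0]          # g_i'' times the other factors
            else:
                d2 = g[i1][1] * g[i2][1] * mf0 / (g[i1][0] * g[i2][0])
            corr += 0.5 * d2 * cov[i1][i2]
    return mf0, mf0 + corr                              # (MF(0), MF(2))
```

On small random networks of this kind, MF(2) typically lies much closer to the exact likelihood than MF(0), in line with the O(1/N^2) versus O(1/N) rates established in the next section.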
\n\n3.1 Analysis \n\nIn this section, we give two theorems establishing convergence rates for the MF(i) family of algorithms and for the Jaakkola and Jordan algorithm. As in Kearns and Saul (1998), our results are obtained under the assumption that the weights are of magnitude at most O(1/N) (recall that N is the number of disease nodes). For large N, this assumption of \"weak interactions\" implies that each z_i will be close to its mean value with high probability (by the law of large numbers), and thereby gives justification to the use of local expansions for the probabilities of the findings. \n\nDue to space constraints, the detailed proofs of the theorems given in this section are deferred to the long version of this paper, and we will instead only sketch the intuitions for the proofs here. \n\nTheorem 1 Let K (the number of findings) be fixed, and suppose |θ_ij| ≤ τ/N for all i, j for some fixed constant τ. Then the absolute error of the MF(k) approximation is O(1/N^{(k+1)/2}) for k odd and O(1/N^{(k/2+1)}) for k even. \n\nProof intuition. First consider the case of odd k. Since |θ_ij| ≤ τ/N, the quantity ε_i = z_i - μ_i = Σ_j θ_ij (d_j - E[d_j]) is like an average of N random variables, and hence has standard deviation on the order 1/√N. Since MF(k) matches F up to the k-th order derivatives, we find that when we take a Taylor expansion of MF(k)'s error, the leading non-zero term is the (k+1)-st order term, which contains quantities such as ε_i^{k+1}. Now because ε_i has standard deviation on the order 1/√N, it is unsurprising that E[ε_i^{k+1}] is on the order 1/N^{(k+1)/2}, which gives the error of MF(k) for odd k. \nFor k even, the leading non-zero term in the Taylor expansion of the error is a (k+1)-st order term with quantities such as ε_i^{k+1}. 
But if we think of ε_i as converging (via a central limit theorem effect) to a symmetric distribution, then since symmetric distributions have small odd central moments, E[ε_i^{k+1}] would be small. This means that for k even, we may look to the order k + 2 term for the error, which leads to MF(k) having the same big-O error as MF(k + 1). Note this is also consistent with how MF(0) and MF(1) always give the same estimates and hence have the same absolute error. □ \n\nA theorem may also be proved for the convergence rate of the Jaakkola and Jordan (JJ) algorithm. For simplicity, we state it here only for noisy-OR networks.2 A closely related result also holds for sigmoid networks with suitably modified assumptions; see the full paper. \n\nTheorem 2 Let K be fixed, and suppose ψ(z) = 1 - e^{-z} is the noisy-OR function. Suppose further that 0 ≤ θ_ij ≤ τ/N for all i, j for some fixed constant τ, and that μ_i ≥ μ_min for all i, for some fixed μ_min > 0. Then the absolute error of the JJ approximation is O(1/N). \n\n2 Note in any case that JJ can be applied only when ψ is log-concave, such as in noisy-OR networks (where incidentally all weights are non-negative). \n\nThe condition of some μ_min lower-bounding the μ_i's ensures that the findings are not too unlikely; for it to hold, it is sufficient that there be bias (\"leak\") nodes in the network with weights bounded away from zero. \n\nProof intuition. Neglecting negative findings (which, as discussed, do not need to be handled variationally), this result is proved for a \"simplified\" version of the JJ algorithm that always chooses the variational parameters so that for each i, the exponential upper bound on ψ(z_i) is tangent to ψ at z_i = μ_i. (The \"normal\" version of JJ can have error no worse than this simplified one.) 
Taking a Taylor expansion again of the approximation's error, we find that since the upper bound has matched zeroth and first derivatives with F, the error is a second-order term with quantities such as ε_i^2. As discussed in the MF(k) proof outline, this quantity has expectation on the order 1/N, and hence JJ's error is O(1/N). □ \n\nTo summarize our results in the most useful cases, we find that MF(0) has a convergence rate of O(1/N), both MF(2) and MF(3) have rates of O(1/N^2), and JJ has a convergence rate of O(1/N). \n\n4 Simulation results \n\n4.1 Artificial networks \n\nWe carried out a set of simulations that were intended to verify the theoretical results presented in the previous section. We used bipartite noisy-OR networks, with full connectivity between layers and with the weights θ_ij chosen uniformly in (0, 2/N). The number N of top-level (\"disease\") nodes ranged from 10 to 1000. Priors on the disease nodes were chosen uniformly in (0, 1). \n\nThe results are shown in Figure 1 for one and five positive findings (similar results were obtained for additional positive findings). \n\nFigure 1: Absolute error in likelihood (averaged over many randomly generated networks) as a function of the number of disease nodes for various algorithms. The short-dashed lines are the KS upper and lower bounds (these curves overlap in the left panel), the long-dashed line is the JJ algorithm and the solid lines are MF(0), MF(2) and MF(3) (the latter two curves overlap in the right panel). \n\nThe results are entirely consistent with the theoretical analysis, showing nearly exactly the expected slopes of -1/2, -1 and -2 on a loglog plot. 
Moreover, the asymptotic results are also predictive of overall performance: the MF(2) and MF(3) algorithms perform best in all cases, MF(0) and JJ are roughly equivalent, and KS is the least accurate.3 \n\n3 The anomalous behavior of the KS lower bound in the second panel is due to the fact that the algorithm generally finds a vacuous lower bound of 0 in this case, which yields an error which is essentially constant as a function of the number of diseases. \n\n4.2 QMR-DT network \n\nWe now present results for the QMR-DT network, in particular for the four benchmark CPC cases studied by Jaakkola and Jordan (1999). These cases all have fewer than 20 positive findings; thus it is possible to run the Heckerman (1989) \"Quickscore\" algorithm to obtain the true likelihood. \n\nFigure 2: Results for CPC cases 16 and 32, for different numbers of exactly treated findings. The horizontal line is the true likelihood, the dashed line is JJ's estimate, and the lower solid line is MF(3)'s estimate. \n\nFigure 3: Results for CPC cases 34 and 46. Same legend as above. 
\n\nIn Jaakkola and Jordan (1999), a hybrid methodology was proposed in which only a portion of the findings were treated approximately; exact methods were used to treat the remaining findings. Using this hybrid methodology, Figures 2 and 3 show the results of running JJ and MF(3) on these four cases.4 \n\n4 These experiments were run using a version of the JJ algorithm that optimizes the variational parameters just once without any findings treated exactly, and then uses these fixed values of the parameters thereafter. The order in which findings are chosen to be treated exactly is based on JJ's estimates, as described in Jaakkola and Jordan (1999). Missing points in the graphs for cases 16 and 34 correspond to runs where our implementation of the Quickscore algorithm encountered numerical problems. \n\nThe results show the MF algorithm yielding results that are comparable with the JJ algorithm. \n\n5 Conclusions and extension to multilayer networks \n\nThis paper has presented a class of approximate inference algorithms for graphical models of the QMR-DT type, supplied a theoretical analysis of convergence rates, verified the rates empirically, and presented promising empirical results for the difficult QMR-DT problem. \n\nAlthough the focus of this paper has been two-layer networks, the MF(k) family of algorithms can also be extended to multilayer networks. For example, consider a 3-layer network with nodes b_i being parents of nodes d_i being parents of nodes f_i. To approximate P(f) using (say) MF(2), we first write P(f) as an expectation of a function F of the z_i's, and approximate this function via a second-order Taylor expansion. To calculate the expectation of the Taylor approximation, we need to calculate terms in the expansion such as E[d_i], E[d_i d_j] and E[d_i^2]. When d_i had no parents, these quantities were easily derived in terms of the disease prior probabilities. Now, they instead depend on the joint distribution of d_i and d_j, which we approximate using our two-layer version of MF(k) applied to the first two (b_i and d_i) layers of the network. It is important future work to carefully study the performance of this algorithm in the multilayer setting. \n\nAcknowledgments \n\nWe wish to acknowledge the helpful advice of Tommi Jaakkola, Michael Kearns, Kevin Murphy, and Larry Saul. \n\nReferences \n\n[1] Barber, D., & van de Laar, P. (1999). Variational cumulant expansions for intractable distributions. Journal of Artificial Intelligence Research, 10, 435-455. \n\n[2] Heckerman, D. (1989). A tractable inference algorithm for diagnosing multiple diseases. In Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence. \n\n[3] Jaakkola, T. S., & Jordan, M. I. (1999). Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research, 10, 291-322. \n\n[4] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). An introduction to variational methods for graphical models. In Learning in Graphical Models. Cambridge: MIT Press. \n\n[5] Kearns, M. J., & Saul, L. K. (1998). Large deviation methods for approximate probabilistic inference, with rates of convergence. In G. F. Cooper & S. Moral (Eds.), Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann. \n\n[6] Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. \n\n[7] Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen., 15(6). \n\n[8] Shwe, M., Middleton, B., Heckerman, D., Henrion, M., Horvitz, E., Lehmann, H., & Cooper, G. (1991). Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base: I. The probabilistic model and inference algorithms. Methods of Information in Medicine, 30, 241-255. \n", "award": [], "sourceid": 1640, "authors": [{"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}