{"title": "On the Concentration of Expectation and Approximate Inference in Layered Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 400, "abstract": "", "full_text": "On the concentration of expectation and\napproximate inference in layered networks\n\nXuanLong Nguyen\n\nUniversity of California\n\nBerkeley, CA 94720\n\nxuanlong@cs.berkeley.edu\n\nMichael I. Jordan\n\nUniversity of California\n\nBerkeley, CA 94720\n\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present an analysis of concentration-of-expectation phenomena in\nlayered Bayesian networks that use generalized linear models as the local\nconditional probabilities. This framework encompasses a wide variety of\nprobability distributions, including both discrete and continuous random\nvariables. We utilize ideas from large deviation analysis and the delta\nmethod to devise and evaluate a class of approximate inference algo-\nrithms for layered Bayesian networks that have superior asymptotic error\nbounds and very fast computation time.\n\nIntroduction\n\n1\nThe methodology of variational inference has developed rapidly in recent years, with in-\ncreasingly rich classes of approximation being considered (see, e.g., Yedidia, et al., 2001,\nJordan et al., 1998). While such methods are intuitively reasonable and often perform well\nin practice, it is unfortunately not possible, except in very special cases, to provide error\nbounds for these inference algorithms. Thus the user has little a priori guidance in choosing\nan inference algorithm, and little a posteriori reassurance that the approximate marginals\nproduced by an algorithm are good approximations. The situation is somewhat better for\nsampling algorithms, but there the reassurance is only asymptotic.\n\nA line of research initiated by Kearns and Saul (1998) aimed at providing such error bounds\nfor certain classes of directed graphs. 
Analyzing the setting of two-layer networks with binary nodes with large fan-in, noisy-OR or logistic conditional probabilities, and parameters that scale as O(1/N), where N is the number of nodes in each layer, they used a simple large deviation analysis to design an approximate inference algorithm that provides error bounds. In later work they extended their algorithm to multi-layer networks (Kearns and Saul, 1999). The error bound provided by this approach is O(√(ln N / N)). Ng and Jordan (2000) pursued this line of work, obtaining an improved error bound of O(1/N^{(k+1)/2}), where k is the order of a Taylor expansion employed by their technique. Their approach was, however, restricted to two-layer graphs.

Layered graphs are problematic for many inference algorithms, including belief propagation and generalized belief propagation algorithms. These algorithms convert directed graphs to undirected graphs by moralization, which creates infeasibly large cliques when there are nodes with large fan-in. Thus the work initiated by Kearns and Saul is notable not only for its ability to provide error bounds, but also because it provides one of the few practical algorithms for general layered graphs. It is essential to develop algorithms that scale in this setting; e.g., a recent application at Google studied layered graphs involving more than a million nodes (Harik and Shazeer, personal communication).

In this paper, we design and analyze approximate inference algorithms for general multi-layer Bayesian networks with generalized linear models as the local conditional probability distributions. Generalized linear models include noisy-OR and logistic functions in the binary case, but go significantly further, allowing random variables from any distribution in the exponential family.
We show that in such layered graphical models, the concentration of expectations of any fixed number of nodes propagates from one layer to another according to a topological sort of the nodes. This concentration phenomenon can be exploited to devise efficient approximate inference algorithms that provide error bounds. Specifically, in a multi-layer network with N nodes in each layer and random variables in some exponential family of distributions, our algorithm has an O(((ln N)^3/N)^{(k+1)/2}) error bound and O(N^k) time complexity. We perform a large number of simulations to confirm this error bound and compare with Kearns and Saul's algorithm, which had not been empirically evaluated before.

The paper is organized as follows. In Section 2, we study the concentration of expectation in generalized linear models. Section 3 introduces the use of the delta method for approximating the expectations. Section 4 describes an approximate inference algorithm for a general directed graphical model, which is evaluated empirically in Section 5. Finally, Section 6 concludes the paper.

2 Generalized linear models

Consider a generalized linear model (GLIM; see McCullagh and Nelder, 1983, for details) consisting of N covariates (inputs) X_1, ..., X_N and a response (output) variable Y. A GLIM makes three assumptions regarding the form of the conditional probability distribution P(Y | X): (1) the inputs X_1, ..., X_N enter the model via a linear combination ξ = Σ_{i=1}^N θ_i X_i; (2) the conditional mean μ is represented as a function f(ξ), known as the response function; and (3) the output Y is characterized by an exponential family distribution (cf. Brown, 1986) with natural parameter η and conditional mean μ.
The conditional probability takes the following form:

P_{θ,φ}(Y | X) = h(y, φ) exp{(ηy − A(η)) / φ},    (1)

where φ is a scale parameter, h is a function reflecting the underlying measure, and A(η) is the log partition function.

In this section, for ease of exposition, we shall assume that the response function f is a canonical response function, which simply means that η = ξ = Σ_{i=1}^N θ_i X_i. As will soon be clear, however, our analysis is applicable to a general setting in which f is only required to have bounded derivatives on compact sets.

It is a well-known property of exponential family distributions that

E(Y | X) = μ = A'(η) = f(η) = f(Σ_{i=1}^N θ_i X_i)
Var(Y | X) = φ A''(η) = φ f'(η).

The exponential family includes not only the Bernoulli, multinomial, and Gaussian distributions, but many other useful distributions as well, including the Poisson, gamma, and Dirichlet. We will be studying GLIMs defined on layered graphical models, and thus X_1, ..., X_N are themselves taken to be random variables in the exponential family. We also make the key assumption that all parameters obey the bound |θ_i| ≤ τ/N for some constant τ, although this assumption shall be relaxed later on.

Under these assumptions, we can show that the linear combination η = Σ_{i=1}^N θ_i X_i is tightly concentrated around its mean with very high probability. Kearns and Saul (1998) proved this for binary random variables using large deviation analysis. This type of analysis can be used to prove general results for (bounded and unbounded) random variables in any standard exponential family.¹

Lemma 1 Assume that X_1, ..., X_N are independent random variables in a standard exponential family distribution.
Furthermore, assume EX_i ∈ [p_i − Δ_i, p_i + Δ_i]. Then there are absolute constants C and α such that, for any ε > Σ_{i=1}^N |θ_i| Δ_i:

P(|η − Σ_{i=1}^N θ_i p_i| > ε) ≤ C exp{−α (ε − Σ_{i=1}^N |θ_i| Δ_i)^{2/3} / (Σ_{i=1}^N θ_i^2)^{1/3}}
≤ C exp{−α N^{1/3} τ^{−2/3} (ε − Σ_{i=1}^N |θ_i| Δ_i)^{2/3}}.

We will study architectures that are strictly layered; that is, we require that there are no edges directly linking the parents of any node. In this setting the parents of each node are conditionally independent given all ancestor nodes (in the previous layers) in the graph. This will allow us to use Lemma 1 and iterated conditional expectation formulas to analyze concentration phenomena in these models. The next lemma shows that under certain assumptions about the response function f, the tight concentration of η also entails the concentration of E(Y | X) and Var(Y | X).

Lemma 2 Assume that the means of X_1, ..., X_N are bounded within some fixed interval [p_min, p_max] and f has bounded derivatives on compact sets. If η ∈ [Σ_{i=1}^N θ_i p_i − ε, Σ_{i=1}^N θ_i p_i + ε] with high probability, then: E(Y | X) = f(η) ∈ [f(Σ_{i=1}^N θ_i p_i) − O(ε), f(Σ_{i=1}^N θ_i p_i) + O(ε)]; and Var(Y | X) = f'(η) ∈ [f'(Σ_{i=1}^N θ_i p_i) − O(ε), f'(Σ_{i=1}^N θ_i p_i) + O(ε)] with high probability.

Lemmas 1 and 2 provide a mean-field-like basis for propagating the concentration of expectations from the input layer X_1, ..., X_N to the output layer Y. Specifically, if the E(X_i) are approximated by p_i (i = 1, ..., N), then E(Y) can be approximated by f(Σ_{i=1}^N θ_i p_i).

¹The proofs of this and all other theorems can be found in a longer version of this paper, available at www.cs.berkeley.edu/~xuanlong.

3 Higher order expansion (the delta method)

While Lemmas 1 and 2 already provide a procedure for approximating E(Y), one can use a higher-order (Taylor) expansion to obtain a significantly more accurate approximation. This approach, known in the statistics literature as the delta method, has been used in slightly different contexts for inference problems in the work of Plefka (1982), Barber and van de Laar (1999), and Ng and Jordan (2000). In our present setting, we will show that estimates based on Taylor expansion up to order k can be obtained by propagating the expectation of the product of up to k nodes from one layer to an offspring layer.

The delta method is based on the same assumptions as in Lemma 2; that is, the means of X_1, ..., X_N are assumed to be bounded within some fixed interval [p_min, p_max], and the response function f has bounded derivatives on compact sets. We have Σ_{i=1}^N θ_i p_i bounded within the fixed interval [τ p_min, τ p_max]. By Lemma 1, with high probability η = Σ_{i=1}^N θ_i p_i + ε, for some small ε. Using Taylor's expansion up to second order, we have that with high probability:

E(Y) = E_X E(Y | X) = E_X f(η) = f_η + (Σ_{i=1}^N θ_i EX_i − Σ_{i=1}^N θ_i p_i) f'_η + (1/2!) (Σ_{i,j} θ_i θ_j E[(X_i − p_i)(X_j − p_j)]) f''_η + O(ε^3),

where f_η and its derivatives are evaluated at Σ_{i=1}^N θ_i p_i.
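The second-order expansion can be checked numerically. Below is a minimal sketch (not from the paper; the logistic response, independent Bernoulli inputs, and all parameter values are illustrative assumptions) that compares the delta-method estimate of E(Y) = E[f(η)] against a Monte Carlo reference:

```python
import math
import random

# Second-order delta-method estimate of E[f(eta)], eta = sum_i theta_i X_i,
# compared against Monte Carlo. Illustrative choices: logistic response f,
# independent Bernoulli inputs X_i, and parameters bounded by tau/N.

def f(x):
    """Logistic response function."""
    return 1.0 / (1.0 + math.exp(-x))

def f2(x):
    """Second derivative of the logistic function."""
    s = f(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

rng = random.Random(0)
N, tau = 200, 2.0
theta = [rng.uniform(0.0, tau / N) for _ in range(N)]
p = [rng.uniform(0.2, 0.8) for _ in range(N)]       # p_i = E[X_i]

mu = sum(t * q for t, q in zip(theta, p))           # E[eta]
var = sum(t * t * q * (1.0 - q) for t, q in zip(theta, p))  # Var(eta)

# E[f(eta)] ~ f(mu) + (1/2) f''(mu) Var(eta); the first-order term vanishes
# because the expansion is taken around mu = E[eta].
delta2 = f(mu) + 0.5 * f2(mu) * var

# Monte Carlo reference over the Bernoulli inputs.
S = 20000
mc = 0.0
for _ in range(S):
    eta = sum(t for t, q in zip(theta, p) if rng.random() < q)
    mc += f(eta)
mc /= S

print(abs(delta2 - mc))
```

With parameters of size O(1/N) the weighted sum is tightly concentrated, so the second-order estimate and the Monte Carlo average agree to within the Monte Carlo noise.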
This gives us a method of approximating E(Y) by recursion: assuming that one can approximate all needed expectations of variables in the parent layer X with error O(ε^3), one can also obtain an approximation of E(Y) with error O(ε^3). Clearly, the error can be improved to O(ε^{k+1}) by using Taylor expansion to some order k (provided that the response function f(η) = A'(η) has bounded derivatives up to that order). In this case, the expectation of the product of up to k elements in the input layer, e.g., E[(X_1 − p_1) ... (X_k − p_k)], needs to be computed.

The variance of Y (as well as other higher-order expectations) can also be approximated in the same way:

Var(Y) = E_X(Var(Y | X)) + Var_X(E(Y | X)) = φ E_X f'(η) + E_X f(η)^2 − (E(Y))^2,

where each component can be approximated using the delta method.

4 Approximate inference for layered Bayesian networks

In this section, we shall harness the concentration-of-expectation phenomenon to design and analyze a family of approximate inference algorithms for multi-layer Bayesian networks that use GLIMs as local conditional probabilities. The recipe is clear by now. First, organize the graph into layers that respect the topological ordering of the graph. The algorithm is then comprised of two stages: (1) propagate the concentrated conditional expectations from ancestor layers to offspring layers; this results in a rough approximation of the expectation of individual nodes in the graph; (2) apply the delta method to obtain a more refined marginal expectation of the needed statistics, also proceeding from ancestor layers to offspring layers.

Consider a multi-layer network that has L layers, each of which has N random variables. We refer to the ith variable in layer l by X^l_i, where {X^1_i}_{i=1}^N is the input layer and {X^L_i}_{i=1}^N is the output layer. The expectations E(X^1_i) of the first layer are given. For each 2 ≤ l ≤ L, let θ^{l−1}_{ij} denote the parameter linking X^l_i and its parent X^{l−1}_j. Define the weighted sum of contributions from parents to a node X^l_i: η^l_i = Σ_{j=1}^N θ^{l−1}_{ij} X^{l−1}_j, where we assume that |θ^l_{ij}| ≤ τ/N for some constant τ.

We first consider the problem of estimating expectations of nodes in the output layer. For binary networks, this amounts to estimating marginal probabilities, say, P[X^L_1 = x_1, ..., X^L_m = x_m], for given observed values (x_1, ..., x_m), where m < N. We subsequently consider a more general inference problem involving marginal and conditional probabilities of nodes residing in different layers of the graph.

4.1 Algorithm stage 1: Propagating the concentrated expectation of single nodes

We establish a rough approximation of the expectations of all single nodes of the graph, starting from the input layer l = 1 to the output layer l = L in an inductive manner. For l = 1, let Δ^1_i = δ^1_i = 0 and p^1_i = EX^1_i for all i = 1, ..., N. For l > 1, let

μ^l_i = Σ_{j=1}^N θ^{l−1}_{ij} p^{l−1}_j    (2)

ε^l_i = Σ_{j=1}^N |θ^{l−1}_{ij}| Δ^{l−1}_j + τ √((γ ln N)^3 / N)    (3)

δ^l_i = C exp{−α N^{1/3} τ^{−2/3} (ε^l_i − Σ_{j=1}^N |θ^{l−1}_{ij}| Δ^{l−1}_j)^{2/3}}    (4)

p^l_i = (1/2) (sup_{x∈A^l_i} f(x) + inf_{x∈A^l_i} f(x))    (5)

Δ^l_i = (1/2) (sup_{x∈A^l_i} f(x) − inf_{x∈A^l_i} f(x))    (6)

where A^l_i = [μ^l_i − ε^l_i, μ^l_i + ε^l_i]. In the above updates, the constants α and C arise from Lemma 1, and γ is an arbitrary constant that is greater than 1/α. The following proposition, whose proof makes use of Lemma 1 combined with union bounds, provides the error bounds for our algorithm.

Proposition 3 With probability at least Π_{l=1}^L (1 − Σ_{i=1}^N δ^l_i) = (1 − C N^{1−αγ})^{L−1}, for any 1 ≤ i ≤ N, 1 ≤ l ≤ L we have: E[X^l_i | X^{l−1}_1, ..., X^{l−1}_N] = f(η^l_i) ∈ [p^l_i − Δ^l_i, p^l_i + Δ^l_i] and η^l_i ∈ [μ^l_i − ε^l_i, μ^l_i + ε^l_i]. Furthermore, ε^l_i = O(√((ln N)^3/N)) for all i, l.

For layered networks with only bounded and Gaussian variables, Lemma 1 can be tightened, and this results in an error bound of O(√((ln N)^2/N)). For layered networks with only bounded variables, the error bound can be tightened to O(√(ln N / N)). In addition, if we drop the condition that all parameters θ^l_{ij} are bounded by τ/N, Proposition 3 still goes through by replacing τ by √(N Σ_{j=1}^N (θ^{l−1}_{ij})^2) in the updating equations for ε^l_i and δ^l_i, for all i and l. The asymptotic error bound O(√((ln N)^3/N)) no longer holds, but it can be shown that there are absolute constants c_1 and c_2 such that for all i, l:

ε^l_i ≤ (c_1 ||ε^{l−1}|| + c_2 √((ln N)^3)) ||θ^{l−1}_i||,

where ||θ^{l−1}_i|| ≡ √(Σ_{j=1}^N (θ^{l−1}_{ij})^2) and ||ε^l|| ≡ √(Σ_{i=1}^N (ε^l_i)^2).

4.2 Algorithm stage 2: Approximating expectations by the recursive delta method

The next step is to apply the delta method presented in Section 3 in a recursive manner. Write:

E[X^L_1 ... X^L_m] = E_{X^{L−1}} E[X^L_1 ... X^L_m | X^{L−1}] = E_{X^{L−1}} Π_{i=1}^m f(η^L_i) = E_{X^{L−1}} F(η^L_1, ..., η^L_m),

where F(η^L_1, ..., η^L_m) := Π_{i=1}^m f(η^L_i).

Let β^l_i = η^l_i − μ^l_i. So, with probability (1 − C N^{1−αγ})^{L−1} we have |β^l_i| ≤ ε^l_i = O(√((ln N)^3/N)) for all l = 1, ..., L and i = 1, ..., N. Applying the delta method by expanding F around the vector μ = (μ^L_1, ..., μ^L_m) up to order k gives an approximation, which is denoted by MF(k), that depends on expectations of nodes in the previous layer. Continuing this approximation recursively on the previous layers, we obtain an approximate algorithm that has an error bound O(((ln N)^3/N)^{(k+1)/2}) (see the derivation in Section 3) with probability at least (1 − C N^{1−αγ})^{L−1}, and an error bound O(1) with the remaining probability.
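The stage-1 updates lend themselves directly to code when the response function is monotone (e.g., logistic), since the sup and inf over A^l_i in updates (5)-(6) are then attained at the interval endpoints. Below is a minimal sketch under illustrative assumptions: a logistic response, the large-deviation slack term computed with γ = 1 as a placeholder, and the failure-probability update (4) omitted since it does not affect the intervals themselves:

```python
import math
import random

# Sketch of algorithm stage 1: propagate interval approximations
# [p_i^l - Delta_i^l, p_i^l + Delta_i^l] of E[X_i^l] layer by layer.
# Illustrative assumptions: logistic response (monotone, so sup/inf of f
# over A_i^l are attained at the endpoints); slack stands in for the
# tau*sqrt((gamma ln N)^3 / N) term with gamma = 1; the failure-probability
# update for delta_i^l is omitted.

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def propagate(theta, p1, slack):
    """theta is a list of weight matrices, one per layer transition;
    row i of a matrix holds the weights from all parents to node i.
    p1 holds the exact input-layer means. Returns per-layer (p, Delta)."""
    p, Delta = [list(p1)], [[0.0] * len(p1)]
    for layer in theta:
        pl, Dl = [], []
        for row in layer:                      # one node i of the new layer
            mu = sum(t * q for t, q in zip(row, p[-1]))                    # (2)
            eps = sum(abs(t) * d for t, d in zip(row, Delta[-1])) + slack  # (3)
            lo, hi = f(mu - eps), f(mu + eps)  # monotone f: use endpoints
            pl.append(0.5 * (hi + lo))         # (5)
            Dl.append(0.5 * (hi - lo))         # (6)
        p.append(pl)
        Delta.append(Dl)
    return p, Delta

rng = random.Random(1)
N, L, tau = 50, 3, 2.0
theta = [[[rng.uniform(-tau / N, tau / N) for _ in range(N)]
          for _ in range(N)] for _ in range(L - 1)]
p1 = [rng.uniform(0.3, 0.7) for _ in range(N)]
slack = tau * math.sqrt(math.log(N) ** 3 / N)

p, Delta = propagate(theta, p1, slack)
print(max(Delta[-1]))    # half-width of the output-layer intervals
```

For small N the slack term dominates and the intervals are loose; the point of the sketch is the per-layer recursion, whose half-widths shrink as O(√((ln N)^3/N)) in the regime the analysis addresses.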
We conclude that:

Theorem 4 The absolute error of the MF(k) approximation is O(((ln N)^3/N)^{(k+1)/2}). For networks with bounded variables, the error bound can be tightened to O((ln N / N)^{(k+1)/2}).

It is straightforward to check that MF(k) takes O(N^{max{k,2}}) computational time. The asymptotic error bound O(((ln N)^3/N)^{(k+1)/2}) is guaranteed for the approximation of expectations of a fixed number m of nodes in the output layer. In principle, this implies that m has to be small compared to N for the approximation to be useful. For binary networks, for instance, the marginal probabilities of m nodes could be as small as O(1/2^m), so we need O(1/2^m) to be greater than O((ln N / N)^{(k+1)/2}). This implies that m < ln(1/c) + ((k+1)/2)(ln N − ln ln N) for some constant c. However, we shall see that our approximation is still useful for large m as long as the quantity it tries to approximate is not too small.

For two-layer networks, an algorithm by Ng and Jordan (2000) yields a better error rate of O(1/N^{(k+1)/2}) by exploiting the Central Limit Theorem. However, this result is restricted to networks with only two layers. Barber and Sollich (1999) were also motivated by the Central Limit Theorem, approximating η^l_i by a multivariate Gaussian distribution, resulting in a similar exploitation of the correlation between pairs of nodes in the parent layer as in our MF(2) approximation. Also related to Barber and Sollich's algorithm of using an approximating family of distributions is the assumed-density filtering approach (e.g., Minka, 2001). These approaches, however, do not provide an error bound guarantee.

4.3 Computing conditional expectations of nodes in different layers

For simplicity, in this subsection we shall consider binary layered networks.
First, we are interested in the marginal probability of a fixed number of nodes in different layers. This can be expressed in terms of a product of conditional probabilities of nodes in the same layer given values of nodes in the previous layer. As shown in the previous subsection, each of these conditional probabilities can be approximated with an error bound O((ln N / N)^{(k+1)/2}) as N → ∞, and the product can also be approximated with the same error bound.

Next, we consider approximating the probability of several nodes in the input layer conditioned on some nodes observed in the output layer L, i.e., P(X^1_1 = x^1_1, ..., X^1_m = x^1_m | X^L_1 = x^L_1, ..., X^L_n = x^L_n) for some fixed numbers m and n that are small compared to N. In a multi-layer network, when even one node in the output layer is observed, all nodes in the graph become dependent. Furthermore, the conditional probabilities of all nodes in the graph are generally not concentrated. Nevertheless, we can still approximate the conditional probability by approximating the two marginal probabilities P(X^1_1 = x^1_1, ..., X^1_m = x^1_m, X^L_1 = x^L_1, ..., X^L_n = x^L_n) and P(X^L_1 = x^L_1, ..., X^L_n = x^L_n) separately and taking the ratio. This boils down to the problem of computing the marginal probabilities of nodes residing in different layers of the graph. As discussed in the previous paragraph, since each marginal probability can be approximated with an asymptotic error bound O((ln N / N)^{(k+1)/2}) as N → ∞ (for binary networks), the same asymptotic error bound holds for the conditional probabilities of a fixed number of nodes.
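To make the ratio construction concrete, the following toy sketch computes a conditional probability as a ratio of two marginals in a tiny two-layer noisy-OR network. Exact enumeration stands in for the marginal approximations, and all sizes and parameter values are illustrative:

```python
from itertools import product

# P(X1 = 1 | Y = 1) computed as a ratio of two marginals,
# P(X1 = 1, Y = 1) / P(Y = 1), each obtained by exact enumeration in a
# toy two-layer noisy-OR network. All parameter values are illustrative.

p_in = [0.6, 0.3, 0.5]        # Bernoulli means of the input-layer nodes
w = [0.4, 0.2, 0.7]           # noisy-OR weights into the output node Y

def prior(x):
    """P(X = x) for independent Bernoulli inputs."""
    pr = 1.0
    for xi, pi in zip(x, p_in):
        pr *= pi if xi else 1.0 - pi
    return pr

def noisy_or(x):
    """P(Y = 1 | X = x) = 1 - prod_j (1 - w_j)^{x_j}."""
    pr = 1.0
    for xi, wi in zip(x, w):
        if xi:
            pr *= 1.0 - wi
    return 1.0 - pr

states = list(product((0, 1), repeat=len(p_in)))
pY = sum(prior(x) * noisy_or(x) for x in states)              # P(Y = 1)
pX1Y = sum(prior(x) * noisy_or(x) for x in states if x[0])    # P(X1=1, Y=1)

cond = pX1Y / pY              # the ratio gives P(X1 = 1 | Y = 1)
print(cond)
```

In the setting of the paper, each of the two marginals would instead come from the MF(k) approximation, and the ratio inherits their asymptotic error bound as long as the denominator is not too small.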
In the next section, we shall present empirical results showing that this approximation is still quite good even when a large number of nodes are conditioned on.

5 Simulation results

In our experiments, we consider a large number of randomly generated multi-layer Bayesian networks with L = 3, L = 4 or L = 5 layers, and with the number of nodes in each layer ranging from 10 to 100. The number of parents of each node is chosen uniformly at random in [2, N]. We use the noisy-OR function for the local conditional probabilities; this choice has the advantage that we can obtain exact marginal probabilities for single nodes by exploiting the special structure of the noisy-OR function (Heckerman, 1989).

Figure 1: The figures show the average error in the marginal probabilities of nodes in the output layer. The x-axis is the number of nodes in each layer (N = 10, ..., 100). The three curves (solid, dashed, dashdot) correspond to the different numbers of layers L = 3, 4, 5, respectively. Plot (a) corresponds to the case τ = 2 and plot (b) corresponds to τ = 4. In each pair of plots, the leftmost plot shows MF(1) and Kearns and Saul's algorithm (K-S) (with the latter being distinguished by black arrows), and the rightmost plot is MF(2). Note that the scale on the y-axis for the rightmost plot is 10^−3.

k              1       2       3       4       5       6       7       8
Network 1   0.0001  0.0041  0.0052  0.0085  0.0162  0.0360  0.0738  0.1562
            0.0007  0.0609  0.0912  0.1925  0.1862  0.3885  0.6262  1.6478
Network 2   0.0003  0.0040  0.0148  0.0331  0.0981  0.1629  0.1408  0.1391
            0.0018  0.0508  0.1431  0.3518  0.7605  0.7790  0.7118  0.9435
Network 3   0.0002  0.0031  0.0082  0.0501  0.1095  0.0890  0.0957  0.1022
            0.0008  0.0406  0.1150  0.6858  1.2392  0.6115  0.5703  0.7840

Table 1: The experiments were performed on 24-node networks (3 layers with N = 8 nodes in each layer). For each network, the first line shows the absolute error of our approximation of the conditional probabilities of nodes in the input layer given values of the first k nodes in the output layer; the second line shows the absolute error of the log likelihood of the k nodes. The numbers were obtained by averaging over k^2 random instances of the k nodes.

All parameters θ_ij are uniformly distributed in [0, τ/N], with τ = 2 and τ = 4. Figure 1 shows the error rates for computing the expectation of a single node in the output layer of the graph. The results for each N are obtained by averaging over many graphical models with the same value of N. Our approximate algorithm, which is denoted by MF(2), runs fast: the running time for the largest network (with L = 5, N = 100) is approximately one minute.

We compare our algorithm (with γ fixed to be 2/α) with that of Kearns and Saul (K-S). The MF(1) estimates are slightly worse than those of the K-S algorithm, but they have the same error curve O((ln N / N)^{1/2}). The MF(2) estimates, whose error curves were proven to be O((ln N / N)^{3/2}), are better than both by orders of magnitude.
The \ufb01gure also shows that the\nerror increases when we increase the size of the parameters (increase (cid:28) ).\nNext, we consider the inference problem of computing conditional probabilities of the in-\nput layer given that the \ufb01rst k nodes are observed in the output layer. We perform our\nexperiments on several randomly generated three-layer networks with N = 8. This size\nallows us to be able to compute the conditional probabilities exactly.2 For each value of\n\n2The amount of time spent on exact computation for each network is about 3 days, while our\n\napproximation routines take a few minutes.\n\n\fk, we generate k2 samples of the observed nodes generated uniformly at random from the\nnetwork and then compute the average of errors of conditional probability approximations.\nWe observe that while the error of conditional probabilities is higher than those of marginal\nprobabilities (see Table 1 and Figure 1), the error remains small despite the relatively large\nnumber of observed nodes k compared to N.\n\n6 Conclusions\nWe have presented a detailed analysis of concentration-of-expectation phenomena in lay-\nered Bayesian networks which use generalized linear models as local conditional probabil-\nities. Our analysis encompasses a wide variety of probability distributions, including both\ndiscrete and continuous random variables. We also performed a large number of simula-\ntions in multi-layer network models, showing that our approach not only provides a useful\ntheoretical analysis of concentration phenomena, but it also provides a fast and accurate\ninference algorithm for densely-connected multi-layer graphical models.\n\nIn the setting of Bayesian networks in which nodes have large in-degree, there are few vi-\nable options for probabilistic inference. Not only are junction tree algorithms infeasible,\nbut (loopy) belief propagation algorithms are infeasible as well, because of the need to\nmoralize. 
The mean-field algorithms that we have presented here are thus worthy of attention as one of the few viable methods for such graphs. As we have shown, the framework allows us to systematically trade time for accuracy with such algorithms, by accounting for interactions between neighboring nodes via the delta method.

Acknowledgement. We would like to thank Andrew Ng and Martin Wainwright for very useful discussions and feedback regarding this work.

References

D. Barber and P. van de Laar, Variational cumulant expansions for intractable distributions, Journal of Artificial Intelligence Research, 10, 435-455, 1999.

D. Barber and P. Sollich, Gaussian fields for approximate inference in layered sigmoid belief networks, NIPS 11, 1999.

L. Brown, Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, Institute of Mathematical Statistics, Hayward, CA, 1986.

D. Heckerman, A tractable inference algorithm for diagnosing multiple diseases, In Proc. UAI, 1989.

M.I. Jordan, Z. Ghahramani, T.S. Jaakkola and L.K. Saul, An introduction to variational methods for graphical models, In Learning in Graphical Models, Cambridge, MIT Press, 1998.

M.J. Kearns and L.K. Saul, Large deviation methods for approximate probabilistic inference, with rates of convergence, In Proc. UAI, 1998.

M.J. Kearns and L.K. Saul, Inference in multi-layer networks via large deviation bounds, NIPS 11, 1999.

P. McCullagh and J.A. Nelder, Generalized Linear Models, Chapman and Hall, London, 1983.

T. Minka, Expectation propagation for approximate Bayesian inference, In Proc. UAI, 2001.

A.Y. Ng and M.I. Jordan, Approximate inference algorithms for two-layer Bayesian networks, NIPS 12, 2000.

T. Plefka, Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model, J. Phys. A: Math. Gen., 15(6), 1982.

J.S. Yedidia, W.T. Freeman, and Y. Weiss, Generalized belief propagation,
NIPS 13, 2001.
", "award": [], "sourceid": 2411, "authors": [{"given_name": "XuanLong", "family_name": "Nguyen", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}