{"title": "Inference in Multilayer Networks via Large Deviation Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 266, "abstract": null, "full_text": "Inference in Multilayer Networks \n\n\u2022 VIa \n\nLarge Deviation Bounds \n\nMichael Kearns and Lawrence Saul \n\nAT&T Labs - Research \n\nShannon Laboratory \n\n180 Park A venue A-235 \nFlorham Park, NJ 07932 \n\n{mkearns ,lsaul}Oresearch.att. com \n\nAbstract \n\nWe study probabilistic inference in large, layered Bayesian net(cid:173)\nworks represented as directed acyclic graphs. We show that the \nintractability of exact inference in such networks does not preclude \ntheir effective use. We give algorithms for approximate probabilis(cid:173)\ntic inference that exploit averaging phenomena occurring at nodes \nwith large numbers of parents. We show that these algorithms \ncompute rigorous lower and upper bounds on marginal probabili(cid:173)\nties of interest, prove that these bounds become exact in the limit \nof large networks, and provide rates of convergence. \n\n1 \n\nIntroduction \n\nThe promise of neural computation lies in exploiting the information processing \nabilities of simple computing elements organized into large networks. Arguably one \nof the most important types of information processing is the capacity for proba(cid:173)\nbilistic reasoning. \n\nThe properties of undirectedproDabilistic models represented as symmetric networks \nhave been studied extensively using methods from statistical mechanics (Hertz et \naI, 1991). Detailed analyses of these models are possible by exploiting averaging \nphenomena that occur in the thermodynamic limit of large networks. \n\nIn this paper, we analyze the limit of large, multilayer networks for probabilistic \nmodels represented as directed acyclic graphs. 
These models are known as Bayesian networks (Pearl, 1988; Neal, 1992), and they have different probabilistic semantics than symmetric neural networks (such as Hopfield models or Boltzmann machines). We show that the intractability of exact inference in multilayer Bayesian networks does not preclude their effective use. Our work builds on earlier studies of variational methods (Jordan et al., 1997). We give algorithms for approximate probabilistic inference that exploit averaging phenomena occurring at nodes with N >> 1 parents. We show that these algorithms compute rigorous lower and upper bounds on marginal probabilities of interest, prove that these bounds become exact in the limit N -> infinity, and provide rates of convergence.

2 Definitions and Preliminaries

A Bayesian network is a directed graphical probabilistic model, in which the nodes represent random variables, and the links represent causal dependencies. The joint distribution of this model is obtained by composing the local conditional probability distributions (or tables), Pr[child|parents], specified at each node in the network. For networks of binary random variables, so-called transfer functions provide a convenient way to parameterize conditional probability tables (CPTs). A transfer function is a mapping f : (-infinity, infinity) -> [0, 1] that is everywhere differentiable and satisfies f'(x) >= 0 for all x (thus, f is nondecreasing). If f'(x) <= alpha for all x, we say that f has slope alpha. Common examples of transfer functions of bounded slope include the sigmoid f(x) = 1/(1 + e^{-x}), the cumulative gaussian f(x) = integral_{-infinity}^{x} dt e^{-t^2}/sqrt(pi), and the noisy-OR f(x) = 1 - e^{-x}. Because the value of a transfer function f is bounded between 0 and 1, it can be interpreted as the conditional probability that a binary random variable takes on a particular value.
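As a quick illustration, here is a minimal Python sketch of these three transfer functions (the function names are ours, not the paper's; the cumulative gaussian is expressed through the error function):

```python
import math

# Sketches of the three bounded-slope transfer functions named in the text.

def sigmoid(x):
    """f(x) = 1/(1 + e^{-x}); slope bounded by 1/4."""
    return 1.0 / (1.0 + math.exp(-x))

def cumulative_gaussian(x):
    """f(x) = integral_{-inf}^{x} dt e^{-t^2}/sqrt(pi), via the error function."""
    return 0.5 * (1.0 + math.erf(x))

def noisy_or(x):
    """f(x) = 1 - e^{-x}; used for nonnegative weighted sums."""
    return 1.0 - math.exp(-x)
```

Each is nondecreasing with values in [0, 1], so its output can serve directly as the conditional probability that a child node takes the value 1 given its parents.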
One use of transfer functions is to endow multilayer networks of soft-thresholding computing elements with probabilistic semantics. This motivates the following definition:

Definition 1 For a transfer function f, a layered probabilistic f-network has:
* Nodes representing binary variables {X_i^l}, l = 1, ..., L and i = 1, ..., N. Thus, L is the number of layers, and each layer contains N nodes.
* For every pair of nodes X_j^{l-1} and X_i^l in adjacent layers, a real-valued weight theta_{ij}^{l-1} from X_j^{l-1} to X_i^l.
* For every node X_i^1 in the first layer, a bias p_i.

We will sometimes refer to nodes in layer 1 as inputs, and to nodes in layer L as outputs. A layered probabilistic f-network defines a joint probability distribution over all of the variables {X_i^l} as follows: each input node X_i^1 is independently set to 1 with probability p_i, and to 0 with probability 1 - p_i. Inductively, given binary values X_j^{l-1} = x_j^{l-1} in {0, 1} for all of the nodes in layer l-1, the node X_i^l is set to 1 with probability f(sum_{j=1}^N theta_{ij}^{l-1} x_j^{l-1}).

Among other uses, multilayer networks of this form have been studied as hierarchical generative models of sensory data (Hinton et al., 1995). In such applications, the fundamental computational problem (known as inference) is that of estimating the marginal probability of evidence at some number of output nodes, say the first K <= N. (The computation of conditional probabilities, such as diagnostic queries, can be reduced to marginals via Bayes rule.) More precisely, one wishes to estimate Pr[X_1^L = x_1, ..., X_K^L = x_K] (where x_i in {0, 1}), a quantity whose exact computation involves an exponential sum over all the possible settings of the uninstantiated nodes in layers 1 through L - 1, and is known to be computationally intractable (Cooper, 1990).

3 Large Deviation and Union Bounds

One of our main weapons will be the theory of large deviations. As a first illustration of this theory, consider the input nodes {X_j^1} (which are independently set to 0 or 1 according to their biases p_j) and the weighted sum sum_{j=1}^N theta_{ij}^1 X_j^1 that feeds into the ith node X_i^2 in the second layer. A typical large deviation bound (Kearns & Saul, 1997) states that for all epsilon > 0, Pr[ |sum_{j=1}^N theta_{ij}^1 (X_j^1 - p_j)| > epsilon ] <= 2 e^{-2 epsilon^2 / (N Theta^2)}, where Theta is the largest weight in the network. If we make the scaling assumption that each weight theta_{ij}^1 is bounded by tau/N for some constant tau (thus, Theta <= tau/N), then we see that the probability of large (order 1) deviations of this weighted sum from its mean decays exponentially with N. (Our methods can also provide results under the weaker assumption that all weights are bounded by O(N^{-alpha}) for alpha > 1/2.)

How can we apply this observation to the problem of inference? Suppose we are interested in the marginal probability Pr[X_i^2 = 1]. Then the large deviation bound tells us that with probability at least 1 - delta (where we define delta = 2 e^{-2 N epsilon^2 / tau^2}), the weighted sum at node X_i^2 will be within epsilon of its mean value mu_i = sum_{j=1}^N theta_{ij}^1 p_j. Thus, with probability at least 1 - delta, we are assured that Pr[X_i^2 = 1] is at least f(mu_i - epsilon) and at most f(mu_i + epsilon). Of course, the flip side of the large deviation bound is that with probability at most delta, the weighted sum may fall more than epsilon away from mu_i. In this case we can make no guarantees on Pr[X_i^2 = 1] aside from the trivial lower and upper bounds of 0 and 1. Combining both eventualities, however, we obtain the overall bounds:

(1 - delta) f(mu_i - epsilon) <= Pr[X_i^2 = 1] <= (1 - delta) f(mu_i + epsilon) + delta.    (1)

Equation (1) is based on a simple two-point approximation to the distribution over the weighted sum of inputs, sum_{j=1}^N theta_{ij}^1 X_j^1.
This approximation places one point, with weight 1 - delta, at either epsilon above or below the mean mu_i (depending on whether we are deriving the upper or lower bound); and the other point, with weight delta, at either -infinity or +infinity. The value of delta depends on the choice of epsilon: in particular, as epsilon becomes smaller, we give more weight to the +/-infinity point, with the trade-off governed by the large deviation bound. We regard the weight given to the +/-infinity point as a throw-away probability, since with this weight we resort to the trivial bounds of 0 or 1 on the marginal probability Pr[X_i^2 = 1].

Note that the very simple bounds in Equation (1) already exhibit an interesting trade-off, governed by the choice of the parameter epsilon: as epsilon becomes smaller, the throw-away probability delta becomes larger, while the terms f(mu_i +/- epsilon) converge to the same value. Since the overall bounds involve products of f(mu_i +/- epsilon) and 1 - delta, the optimal value of epsilon is the one that balances this competition between probable explanations of the evidence and improbable deviations from the mean. This trade-off is reminiscent of that encountered between energy and entropy in mean-field approximations for symmetric networks (Hertz et al., 1991).

So far we have considered the marginal probability involving a single node in the second layer. We can also compute bounds on the marginal probabilities involving K > 1 nodes in this layer (which without loss of generality we take to be the nodes X_1^2 through X_K^2). This is done by considering the probability that one or more of the weighted sums entering these K nodes in the second layer deviate by more than epsilon from their means. We can upper bound this probability by K delta by appealing to the so-called union bound, which simply states that the probability of a union of events is bounded by the sum of their individual probabilities.
The union bound allows us to bound marginal probabilities involving multiple variables. For example, consider the marginal probability Pr[X_1^2 = 1, ..., X_K^2 = 1]. Combining the large deviation and union bounds, we find:

(1 - K delta) prod_{i=1}^K f(mu_i - epsilon) <= Pr[X_1^2 = 1, ..., X_K^2 = 1] <= (1 - K delta) prod_{i=1}^K f(mu_i + epsilon) + K delta.    (2)

A number of observations are in order here. First, Equation (2) directly leads to efficient algorithms for computing the upper and lower bounds. Second, although for simplicity we have considered epsilon-deviations of the same size at each node in the second layer, the same methods apply to different choices of epsilon_i (and therefore delta_i) at each node. Indeed, variations in epsilon_i can lead to significantly tighter bounds, and thus we exploit the freedom to choose different epsilon_i in the rest of the paper. This results, for example, in bounds of the form:

(1 - sum_{i=1}^K delta_i) prod_{i=1}^K f(mu_i - epsilon_i) <= Pr[X_1^2 = 1, ..., X_K^2 = 1], where delta_i = 2 e^{-2 N epsilon_i^2 / tau^2}.    (3)

The reader is invited to study the small but important differences between this lower bound and the one in Equation (2). Third, the arguments leading to bounds on the marginal probability Pr[X_1^2 = 1, ..., X_K^2 = 1] generalize in a straightforward manner to other patterns of evidence besides all 1's. For instance, again just considering the lower bound, we have:

(1 - sum_{i=1}^K delta_i) prod_{i: x_i = 0} [1 - f(mu_i + epsilon_i)] prod_{i: x_i = 1} f(mu_i - epsilon_i) <= Pr[X_1^2 = x_1, ..., X_K^2 = x_K]    (4)

where x_i in {0, 1} are arbitrary binary values. Thus together the large deviation and union bounds provide the means to compute upper and lower bounds on the marginal probabilities over nodes in the second layer.
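To make the algorithmic content concrete, the following Python sketch evaluates the lower bound of Equation (4) together with the analogous upper bound for second-layer evidence, using per-node deviations as in Equation (3). It assumes sigmoid transfer functions and weights bounded by tau/N; the function and parameter names are ours, not the paper's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def second_layer_bounds(theta, p, eps, evidence, tau, f=sigmoid):
    """Bounds on Pr[X^2_i = x_i for (i, x_i) in evidence], Equations (3)-(4).

    theta[i][j] : weight from input X^1_j to node X^2_i, with |theta[i][j]| <= tau/N
    p[j]        : bias of input node X^1_j
    eps[i]      : deviation epsilon_i chosen for node X^2_i
    evidence    : dict mapping node index i -> observed value x_i in {0, 1}
    """
    N = len(p)
    # Throw-away probability: sum of delta_i = 2 exp(-2 N eps_i^2 / tau^2).
    throwaway = min(1.0, sum(2.0 * math.exp(-2.0 * N * eps[i] ** 2 / tau ** 2)
                             for i in evidence))
    lo_prod, hi_prod = 1.0, 1.0
    for i, x in evidence.items():
        mu = sum(theta[i][j] * p[j] for j in range(N))  # mean of the weighted sum
        if x == 1:
            lo_prod *= f(mu - eps[i])
            hi_prod *= f(mu + eps[i])
        else:
            lo_prod *= 1.0 - f(mu + eps[i])
            hi_prod *= 1.0 - f(mu - eps[i])
    lower = (1.0 - throwaway) * lo_prod
    upper = (1.0 - throwaway) * hi_prod + throwaway
    return lower, upper
```

For a single node with x_1 = 1 this reduces to Equation (1); tightening the result then amounts to searching over the eps[i], trading the throw-away probability against the width of the intervals.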
Further details and consequences of these bounds for the special case of two-layer networks are given in a companion paper (Kearns & Saul, 1997); our interest here, however, is in the more challenging generalization to multilayer networks.

4 Multilayer Networks: Inference via Induction

In extending the ideas of the previous section to multilayer networks, we face the problem that the nodes in the second layer, unlike those in the first, are not independent. But we can still adopt an inductive strategy to derive bounds on marginal probabilities. The crucial observation is that conditioned on the values of the incoming weighted sums at the nodes in the second layer, the variables {X_i^2} do become independent. More generally, conditioned on these weighted sums all falling "near" their means (an event whose probability we quantified in the last section), the nodes {X_i^2} become "almost" independent. It is exactly this near-independence that we now formalize and exploit inductively to compute bounds for multilayer networks. The first tool we require is an appropriate generalization of the large deviation bound, which does not rely on precise knowledge of the means of the random variables being summed.

Theorem 1 For all 1 <= j <= N, let X_j in {0, 1} denote independent binary random variables, and let |tau_j| <= tau. Suppose that the means are bounded by |E[X_j] - p_j| <= Delta_j, where 0 < Delta_j <= p_j <= 1 - Delta_j. Then for all epsilon > (1/N) sum_{j=1}^N |tau_j| Delta_j:

Pr[ |(1/N) sum_{j=1}^N tau_j (X_j - p_j)| > epsilon ] <= 2 e^{-(2N/tau^2) (epsilon - (1/N) sum_{j=1}^N |tau_j| Delta_j)^2}.    (5)

The proof of this result is omitted due to space considerations. Now for induction, consider the nodes in the lth layer of the network. Suppose we are told that for every i, the weighted sum sum_{j=1}^N theta_{ij}^{l-1} X_j^{l-1} entering into the node X_i^l lies in the interval [mu_i^l - epsilon_i^l, mu_i^l + epsilon_i^l], for some choice of the mu_i^l and the epsilon_i^l. Then the mean of node X_i^l is constrained to lie in the interval [p_i^l - Delta_i^l, p_i^l + Delta_i^l], where

p_i^l = (1/2) [f(mu_i^l - epsilon_i^l) + f(mu_i^l + epsilon_i^l)]    (6)
Delta_i^l = (1/2) [f(mu_i^l + epsilon_i^l) - f(mu_i^l - epsilon_i^l)].    (7)

Here we have simply run the leftmost and rightmost allowed values for the incoming weighted sums through the transfer function, and defined the interval around the mean of unit X_i^l to be centered around p_i^l. Thus we have translated uncertainties on the incoming weighted sums to layer l into conditional uncertainties on the means of the nodes X_i^l in layer l. To complete the cycle, we now translate these into conditional uncertainties on the incoming weighted sums to layer l+1. In particular, conditioned on the original intervals [mu_i^l - epsilon_i^l, mu_i^l + epsilon_i^l], what is the probability that for each i, sum_{j=1}^N theta_{ij}^l X_j^l lies inside some new interval [mu_i^{l+1} - epsilon_i^{l+1}, mu_i^{l+1} + epsilon_i^{l+1}]? In order to make some guarantee on this probability, we set mu_i^{l+1} = sum_{j=1}^N theta_{ij}^l p_j^l and assume that epsilon_i^{l+1} > sum_{j=1}^N |theta_{ij}^l| Delta_j^l. These conditions suffice to ensure that the new intervals contain the (conditional) expected values of the weighted sums sum_{j=1}^N theta_{ij}^l X_j^l, and that the new intervals are large enough to encompass the incoming uncertainties. Because these conditions are a minimal requirement for establishing any probabilistic guarantees, we shall say that the [mu_i^l - epsilon_i^l, mu_i^l + epsilon_i^l] define a valid set of epsilon-intervals if they meet these conditions for all 1 <= i <= N. Given a valid set of epsilon-intervals at the (l+1)th layer, it follows from Theorem 1 and the union bound that the weighted sums entering nodes in layer l+1 obey

Pr[ |sum_{j=1}^N theta_{ij}^l X_j^l - mu_i^{l+1}| > epsilon_i^{l+1} for some 1 <= i <= N ] <= sum_{i=1}^N delta_i^{l+1}    (8)

where

delta_i^{l+1} = 2 e^{-(2N/tau^2) (epsilon_i^{l+1} - sum_{j=1}^N |theta_{ij}^l| Delta_j^l)^2}.    (9)

In what follows, we shall frequently make use of the fact that the weighted sums sum_{j=1}^N theta_{ij}^l X_j^l are bounded by intervals [mu_i^{l+1} - epsilon_i^{l+1}, mu_i^{l+1} + epsilon_i^{l+1}]. This motivates the following definitions.

Definition 2 Given a valid set of epsilon-intervals and binary values {X_i^l = x_i^l} for the nodes in the lth layer, we say that the (l+1)st layer of the network satisfies its epsilon-intervals if |sum_{j=1}^N theta_{ij}^l x_j^l - mu_i^{l+1}| < epsilon_i^{l+1} for all 1 <= i <= N. Otherwise, we say that the (l+1)st layer violates its epsilon-intervals.

Suppose that we are given a valid set of epsilon-intervals and that we sample from the joint distribution defined by the probabilistic f-network. The right hand side of Equation (8) provides an upper bound on the conditional probability that the (l+1)st layer violates its epsilon-intervals, given that the lth layer did not. This upper bound may be vacuous (that is, larger than 1), so let us denote by delta^{l+1} whichever is smaller: the right hand side of Equation (8), or 1; in other words, delta^{l+1} = min{ sum_{i=1}^N delta_i^{l+1}, 1 }. Since at the lth layer, the probability of violating the epsilon-intervals is at most delta^l, we are guaranteed that with probability at least prod_{l>1} [1 - delta^l], all the layers satisfy their epsilon-intervals. Conversely, we are guaranteed that the probability that any layer violates its epsilon-intervals is at most 1 - prod_{l>1} [1 - delta^l]. Treating this as a throw-away probability, we can now compute upper and lower bounds on marginal probabilities involving nodes at the Lth layer exactly as in the case of nodes at the second layer. This yields the following theorem.

Theorem 2 For any subset {X_1^L, ..., X_K^L} of the outputs of a probabilistic f-network, for any setting x_1, ..., x_K, and for any valid set of epsilon-intervals, the marginal probability of partial evidence in the output layer obeys:

prod_{l>1} [1 - delta^l] prod_{i: x_i = 1} f(mu_i^L - epsilon_i^L) prod_{i: x_i = 0} [1 - f(mu_i^L + epsilon_i^L)] <= Pr[X_1^L = x_1, ..., X_K^L = x_K]    (10)

Pr[X_1^L = x_1, ..., X_K^L = x_K] <= prod_{l>1} [1 - delta^l] prod_{i: x_i = 1} f(mu_i^L + epsilon_i^L) prod_{i: x_i = 0} [1 - f(mu_i^L - epsilon_i^L)] + (1 - prod_{l>1} [1 - delta^l]).    (11)

Theorem 2 generalizes our earlier results for marginal probabilities over nodes in the second layer; for example, compare Equation (10) to Equation (4). Again, the upper and lower bounds can be efficiently computed for all common transfer functions.

5 Rates of Convergence

To demonstrate the power of Theorem 2, we consider how the gap (or additive difference) between these upper and lower bounds on Pr[X_1^L = x_1, ..., X_K^L = x_K] behaves for some crude (but informed) choices of the {epsilon_i^l}. Our goal is to derive the rate at which these upper and lower bounds converge to the same value as we examine larger and larger networks. Suppose we choose the epsilon-intervals inductively by defining Delta_j^1 = 0 and setting

epsilon_i^{l+1} = sum_{j=1}^N |theta_{ij}^l| Delta_j^l + sqrt( gamma tau^2 ln(N) / N )    (12)

for some gamma > 1. From Equations (8) and (9), this choice gives delta^{l+1} <= 2 N^{1 - 2 gamma} as an upper bound on the probability that the (l+1)th layer violates its epsilon-intervals. Moreover, denoting the gap between the upper and lower bounds in Theorem 2 by G, it can be shown that:

G <= c sqrt( gamma tau^2 ln(N) / N ) sum_{i=1}^K prod_{j != i} F_j + 2L / N^{2 gamma - 1}    (13)

where F_j denotes f(mu_j^L + epsilon_j^L) if x_j = 1 and 1 - f(mu_j^L - epsilon_j^L) if x_j = 0, and c is a constant determined by the slope alpha and the depth L. Let us briefly recall the definitions of the parameters on the right hand side of this equation: alpha is the maximal slope of the transfer function f, N is the number of nodes in each layer, K is the number of nodes with evidence, tau = N Theta is N times the largest weight in the network, L is the number of layers, and gamma > 1 is a parameter at our disposal. The first term of this bound essentially has a 1/sqrt(N) dependence on N, but is multiplied by a damping factor that we might typically expect to decay exponentially with the number K of outputs examined. To see this, simply notice that each of the factors f(mu_j^L + epsilon_j^L) and [1 - f(mu_j^L - epsilon_j^L)] is bounded by 1; furthermore, since all the means mu_j^L are bounded, if N is large compared to gamma then the epsilon_j^L are small, and each of these factors is in fact bounded by some value beta < 1. Thus the first term in Equation (13) is bounded by a constant times beta^{K-1} K sqrt(ln(N)/N). Since it is natural to expect the marginal probability of interest itself to decrease exponentially with K, this is desirable and natural behavior.

Of course, in the case of large K, the behavior of the resulting overall bound can be dominated by the second term 2L / N^{2 gamma - 1} of Equation (13). In such situations, however, we can consider larger values of gamma, possibly even of order K; indeed, for sufficiently large gamma, the first term (which scales like sqrt(gamma)) must necessarily overtake the second one. Thus there is a clear trade-off between the two terms, as well as an optimal value of gamma that sets them to be (roughly) the same magnitude. Generally speaking, for fixed K and large N, we observe that the difference between our upper and lower bounds on Pr[X_1^L = x_1, ..., X_K^L = x_K] vanishes as O(sqrt(ln(N)/N)).

6 An Algorithm for Fixed Multilayer Networks

We conclude by noting that the specific choices made for the parameters epsilon_i^l in Section 5 to derive rates of convergence may be far from the optimal choices for a fixed network of interest. However, Theorem 2 directly suggests a natural algorithm for approximate probabilistic inference. In particular, regarding the upper and lower bounds on Pr[X_1^L = x_1, ..., X_K^L = x_K] as functions of {epsilon_i^l}, we can optimize these bounds by standard numerical methods. For the upper bound, we may perform gradient descent in the {epsilon_i^l} to find a local minimum, while for the lower bound, we may perform gradient ascent to find a local maximum. The components of these gradients in both cases are easily computable for all the commonly studied transfer functions.
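The inductive procedure behind Theorem 2 can be sketched compactly. The following Python code propagates a valid set of epsilon-intervals through the layers (Equations (6), (7), and (9)) and then evaluates the resulting bounds for a given choice of the deviations. It assumes sigmoid transfer functions; the names are ours, and a practical implementation would additionally tune the deviations by the gradient methods just described:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilayer_bounds(weights, p1, eps, evidence, f=sigmoid):
    """Bounds of Theorem 2 for a layered probabilistic f-network (sketch).

    weights[l][i][j] : weight from node j in layer l+1 to node i in layer l+2
                       (a list of L-1 weight matrices, zero-indexed)
    p1[j]            : biases of the input layer
    eps[l][i]        : requested deviation for the weighted sum into node i of
                       layer l+2 (raised to the validity floor if necessary)
    evidence         : dict mapping output index i -> observed value x_i
    """
    N = len(p1)
    p, gap = list(p1), [0.0] * N   # means p^l_i and uncertainties Delta^l_i
    keep = 1.0                     # running product prod_{l>1} (1 - delta^l)
    mu, e = None, None
    for l, W in enumerate(weights):
        tau = N * max(abs(w) for row in W for w in row)
        mu = [sum(W[i][j] * p[j] for j in range(N)) for i in range(N)]
        # Validity floor: epsilon must exceed sum_j |theta_ij| Delta_j.
        slack = [sum(abs(W[i][j]) * gap[j] for j in range(N)) for i in range(N)]
        e = [max(eps[l][i], slack[i]) for i in range(N)]
        delta = min(1.0, sum(2.0 * math.exp(-2.0 * N * (e[i] - slack[i]) ** 2
                                            / tau ** 2) for i in range(N)))
        keep *= 1.0 - delta
        # Conditional means and uncertainties for the next layer (Eqs. 6-7).
        p = [0.5 * (f(mu[i] - e[i]) + f(mu[i] + e[i])) for i in range(N)]
        gap = [0.5 * (f(mu[i] + e[i]) - f(mu[i] - e[i])) for i in range(N)]
    lo, hi = 1.0, 1.0
    for i, x in evidence.items():
        if x == 1:
            lo *= f(mu[i] - e[i]); hi *= f(mu[i] + e[i])
        else:
            lo *= 1.0 - f(mu[i] + e[i]); hi *= 1.0 - f(mu[i] - e[i])
    return keep * lo, keep * hi + (1.0 - keep)
```

Optimizing the lower bound then amounts to gradient ascent of the first returned value in the eps[l][i], subject to the validity floor enforced above; when a requested deviation sits exactly at the floor, the corresponding delta clamps to 1 and the bounds degrade gracefully to the trivial interval [0, 1].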
Moreover, the constraint of maintaining valid epsilon-intervals can be enforced by maintaining a floor on the epsilon-intervals in one layer in terms of those at the previous one. The practical application of this algorithm to interesting Bayesian networks will be studied in future work.

References

Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42:393-405.

Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268:1158-1161.

Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1997). An introduction to variational methods for graphical models. In M. Jordan, ed., Learning in Graphical Models. Kluwer Academic.

Kearns, M., & Saul, L. (1998). Large deviation methods for approximate probabilistic inference. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence.

Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence 56:71-113.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
", "award": [], "sourceid": 1564, "authors": [{"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}]}