{"title": "Approximating Posterior Distributions in Belief Networks Using Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 416, "page_last": 422, "abstract": "", "full_text": "Approximating Posterior Distributions in Belief Networks using Mixtures\n\nChristopher M. Bishop\nNeil Lawrence\nNeural Computing Research Group\nDept. Computer Science & Applied Mathematics\nAston University\nBirmingham, B4 7ET, U.K.\n\nTommi Jaakkola\nMichael I. Jordan\nCenter for Biological and Computational Learning\nMassachusetts Institute of Technology\n79 Amherst Street, E10-243\nCambridge, MA 02139, U.S.A.\n\nAbstract\n\nExact inference in densely connected Bayesian networks is computationally intractable, and so there is considerable interest in developing effective approximation schemes. One approach which has been adopted is to bound the log likelihood using a mean-field approximating distribution. While this leads to a tractable algorithm, the mean field distribution is assumed to be factorial and hence unimodal. In this paper we demonstrate the feasibility of using a richer class of approximating distributions based on mixtures of mean field distributions. We derive an efficient algorithm for updating the mixture parameters and apply it to the problem of learning in sigmoid belief networks. Our results demonstrate a systematic improvement over simple mean field theory as the number of mixture components is increased.\n\n1 Introduction\n\nBayesian belief networks can be regarded as a fully probabilistic interpretation of feedforward neural networks. Maximum likelihood learning for Bayesian networks requires the evaluation of the likelihood function P(V|θ), where V denotes the set of instantiated (visible) variables, and θ represents the set of parameters (weights and biases) in the network. 
Evaluation of P(V|θ) requires summing over exponentially many configurations of the hidden variables H, and is computationally intractable except for networks with very sparse connectivity, such as trees. One approach is to consider a rigorous lower bound on the log likelihood, which is chosen to be computationally tractable, and to optimize the model parameters so as to maximize this bound instead.\n\nIf we introduce a distribution Q(H), which we regard as an approximation to the true posterior distribution, then it is easily seen that the log likelihood is bounded below by\n\nF[Q] = Σ_{H} Q(H) ln [ P(V,H) / Q(H) ].    (1)\n\nThe difference between the true log likelihood and the bound given by (1) is equal to the Kullback-Leibler divergence between the true posterior distribution P(H|V) and the approximation Q(H). Thus the correct log likelihood is reached when Q(H) exactly equals the true posterior. The aim of this approach is therefore to choose an approximating distribution which leads to computationally tractable algorithms and yet which is also flexible enough to permit a good representation of the true posterior. In practice it is convenient to consider parametrized distributions, and then to adapt the parameters to maximize the bound. This gives the best approximating distribution within the particular parametric family.\n\n1.1 Mean Field Theory\n\nConsiderable simplification results if the model distribution is chosen to be factorial over the individual variables, so that Q(H) = Π_i Q(h_i), which gives mean field theory. Saul et al. (1996) have applied mean field theory to the problem of learning in sigmoid belief networks (Neal, 1992). 
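As a numerical illustration of the bound (1), the following sketch builds an arbitrary toy joint distribution over one visible and two hidden binary variables (the parameters are illustrative assumptions, not a model from the paper), evaluates F[Q] for a factorial Q by explicit summation, and checks that the gap between the true log likelihood and the bound equals the Kullback-Leibler divergence to the true posterior:

```python
import math
from itertools import product

# A toy joint distribution P(v, h1, h2) over three binary variables,
# defined from arbitrary illustrative scores (not a model from the paper).
weights = {(v, h1, h2): math.exp(0.5 * h1 - 0.3 * h2 + 0.8 * v * (h1 + h2))
           for v, h1, h2 in product((0, 1), repeat=3)}
Z = sum(weights.values())
P = {cfg: w / Z for cfg, w in weights.items()}

v = 1  # the instantiated (visible) value
log_PV = math.log(sum(P[(v, h1, h2)] for h1, h2 in product((0, 1), repeat=2)))

# A factorial (mean field) approximation Q(H) with illustrative parameters.
mu = (0.7, 0.4)
def Q(h):
    return math.prod(m ** hi * (1.0 - m) ** (1 - hi) for m, hi in zip(mu, h))

# The lower bound (1), and the KL divergence between Q and the posterior P(H|V).
F = sum(Q(h) * math.log(P[(v,) + h] / Q(h)) for h in product((0, 1), repeat=2))
KL = sum(Q(h) * math.log(Q(h) * math.exp(log_PV) / P[(v,) + h])
         for h in product((0, 1), repeat=2))

assert F <= log_PV                     # the bound holds
assert abs((log_PV - F) - KL) < 1e-12  # the gap is exactly the KL divergence
```

Maximizing F[Q] over the parameters of Q is therefore equivalent to minimizing this KL divergence within the chosen family.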
These are Bayesian belief networks with binary variables in which the probability of a particular variable S_i being on is given by\n\nP(S_i = 1 | pa(S_i)) = σ( Σ_j J_ij S_j + b_i )    (2)\n\nwhere σ(z) ≡ (1 + e^{-z})^{-1} is the logistic sigmoid function, pa(S_i) denotes the parents of S_i in the network, and J_ij and b_i represent the adaptive parameters (weights and biases) in the model. Here we briefly review the framework of Saul et al. (1996) since this forms the basis for the illustration of mixture modelling discussed in Section 3. The mean field distribution is chosen to be a product of Bernoulli distributions of the form\n\nQ(H) = Π_i μ_i^{h_i} (1 - μ_i)^{1-h_i}    (3)\n\nin which we have introduced mean-field parameters μ_i. Although this leads to considerable simplification of the lower bound, the expectation over the log of the sigmoid function, arising from the use of the conditional distribution (2) in the lower bound (1), remains intractable. This can be resolved by using variational methods (Jaakkola, 1997) to find a lower bound on F[Q], which is therefore itself a lower bound on the true log likelihood. In particular, Saul et al. (1996) make use of the following inequality\n\n⟨ln[1 + e^{z_i}]⟩ ≤ ξ_i ⟨z_i⟩ + ln ⟨ e^{-ξ_i z_i} + e^{(1-ξ_i) z_i} ⟩    (4)\n\nwhere z_i is the argument of the sigmoid function in (2), and ⟨·⟩ denotes the expectation with respect to the mean field distribution. Again, the quality of the bound can be improved by adjusting the variational parameter ξ_i. Finally, the derivatives of the lower bound with respect to the J_ij and b_i can be evaluated for use in learning. In summary, the algorithm involves presenting training patterns to the network, and for each pattern adapting the μ_i and ξ_i to give the best approximation to the true posterior within the class of separable distributions of the form (3). 
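A minimal sketch of the conditional distribution (2), using a tiny assumed network with arbitrary illustrative weights and biases (none of these values come from the paper); the final check confirms that the product of such conditionals defines a properly normalized joint distribution:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy sigmoid belief net S1 -> {S2, S3}; weights J and biases b are
# illustrative values, not parameters from the paper.
J = {(2, 1): 1.5, (3, 1): -0.5}   # J[(i, j)]: weight into unit i from parent j
b = {1: 0.0, 2: -1.0, 3: 0.5}

def p_on(i, parent_states):
    # equation (2): P(S_i = 1 | pa(S_i)) = sigma(sum_j J_ij S_j + b_i)
    return sigmoid(b[i] + sum(J[(i, j)] * s for j, s in parent_states.items()))

def joint(s1, s2, s3):
    # product of conditionals along the network ordering
    p = sigmoid(b[1]) if s1 else 1.0 - sigmoid(b[1])
    for i, s in ((2, s2), (3, s3)):
        q = p_on(i, {1: s1})
        p *= q if s else 1.0 - q
    return p

total = sum(joint(s1, s2, s3)
            for s1 in (0, 1) for s2 in (0, 1) for s3 in (0, 1))
assert abs(total - 1.0) < 1e-12
```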
The gradients of the log likelihood bound with respect to the model parameters J_ij and b_i can then be evaluated for this pattern and used to adapt the parameters by taking a step in the gradient direction.\n\n2 Mixtures\n\nAlthough mean field theory leads to a tractable algorithm, the assumption of a completely factorized distribution is a very strong one. In particular, such representations can only effectively model posterior distributions which are uni-modal. Since we expect multi-modal distributions to be common, we therefore seek a richer class of approximating distributions which nevertheless remain computationally tractable. One approach (Saul and Jordan, 1996) is to identify a tractable substructure within the model (for example a chain) and then to use mean field techniques to approximate the remaining interactions. This can be effective where the additional interactions are weak or are few in number, but will again prove to be restrictive for more general, densely connected networks. We therefore consider an alternative approach [1] based on mixture representations of the form\n\nQ_mix(H) = Σ_{m=1}^{M} α_m Q(H|m)    (5)\n\nin which each of the components Q(H|m) is itself given by a mean-field distribution, for example of the form (3) in the case of sigmoid belief networks. Substituting (5) into the lower bound (1) we obtain\n\nF[Q_mix] = Σ_m α_m F[Q(H|m)] + I(m, H)    (6)\n\nwhere I(m, H) is the mutual information between the component label m and the set of hidden variables H, and is given by\n\nI(m, H) = Σ_m Σ_{H} α_m Q(H|m) ln [ Q(H|m) / Q_mix(H) ].    (7)\n\nThe first term in (6) is simply a convex combination of standard mean-field bounds and hence is no greater than the largest of these, and so gives no useful improvement over a single mean-field distribution. It is the second term, i.e. 
the mutual information, which characterises the gain in using mixtures. Since I(m, H) ≥ 0, the mutual information increases the value of the bound and hence improves the approximation to the true posterior.\n\n[1] Here we outline the key steps. A more detailed discussion can be found in Jaakkola and Jordan (1997).\n\n2.1 Smoothing Distributions\n\nAs it stands, the mutual information itself involves a summation over the configurations of hidden variables, and so is computationally intractable. In order to be able to treat it efficiently we first introduce a set of 'smoothing' distributions R(H|m), and rewrite the mutual information (7) in the form\n\nI(m, H) = Σ_m Σ_{H} α_m Q(H|m) ln R(H|m) - Σ_m α_m ln α_m - Σ_m Σ_{H} α_m Q(H|m) ln { R(H|m) Q_mix(H) / [α_m Q(H|m)] }.    (8)\n\nIt is easily verified that (8) is equivalent to (7) for arbitrary R(H|m). We next make use of the following inequality\n\n-ln x ≥ -λx + ln λ + 1    (9)\n\nto replace the logarithm in the third term in (8) with a linear function (conditionally on the component label m). This yields a lower bound on the mutual information given by I(m, H) ≥ I_λ(m, H) where\n\nI_λ(m, H) = Σ_m Σ_{H} α_m Q(H|m) ln R(H|m) - Σ_m α_m ln α_m - Σ_m λ_m Σ_{H} R(H|m) Q_mix(H) + Σ_m α_m ln λ_m + 1.    (10)\n\nWith I_λ(m, H) substituted for I(m, H) in (6) we again obtain a rigorous lower bound on the true log likelihood given by\n\nF_λ[Q_mix(H)] = Σ_m α_m F[Q(H|m)] + I_λ(m, H).    (11)\n\nThe summations over hidden configurations {H} in (10) can be performed analytically if we assume that the smoothing distributions R(H|m) factorize. In particular, we have to consider the following two summations over hidden variable configurations\n\nΣ_{H} R(H|m) Q(H|k) = Π_i Σ_{h_i} R(h_i|m) Q(h_i|k) ≡ π_{R,Q}(m, k)    (12)\n\nΣ_{H} Q(H|m) ln R(H|m) = Σ_i Σ_{h_i} Q(h_i|m) ln R(h_i|m) ≡ H(Q‖R|m).    (13)\n\nWe note that the left hand sides of (12) and (13) represent sums over exponentially many hidden configurations, while on the right hand sides these have been re-expressed in terms of expressions requiring only polynomial time to evaluate by making use of the factorization of R(H|m).\n\nIt should be stressed that the introduction of a factorized form for the smoothing distributions still yields an improvement over standard mean field theory. To see this, we note that if R(H|m) = const. for all {H, m} then I_λ(m, H) = 0, and so optimization over R(H|m) can only improve the bound.\n\n2.2 Optimizing the Mixture Distribution\n\nIn order to obtain the tightest bound within the class of approximating distributions, we can maximize the bound with respect to the component mean-field distributions Q(H|m), the mixing coefficients α_m, the smoothing distributions R(H|m) and the variational parameters λ_m, and we consider each of these in turn.\n\nWe will assume that the choice of a single mean field distribution leads to a tractable lower bound, so that the equations\n\n∂F[Q] / ∂Q(h_j) = const    (14)\n\ncan be solved efficiently [2]. Since I_λ(m, H) in (10) is linear in the marginals Q(h_j|m), it follows that its derivative with respect to Q(h_j|m) is independent of Q(h_j|m), although it will be a function of the other marginals, and so the optimization of (11) with respect to individual marginals again takes the form (14) and by assumption is therefore soluble.\n\n[2] In standard mean field theory the constant would be zero, but for many models of interest the slightly more general equations given by (14) will again be soluble.\n\nNext we consider the optimization with respect to the mixing coefficients α_m. Since all of the terms in (11) are linear in α_m, except for the entropy term, we can write\n\nF_λ[Q_mix(H)] = Σ_m α_m (-E_m) - Σ_m α_m ln α_m + 1    (15)
\n\nwhere we have used (10) and defined\n\n-E_m = F[Q(H|m)] + Σ_{H} Q(H|m) ln R(H|m) - Σ_k λ_k Σ_{H} R(H|k) Q(H|m) + ln λ_m.    (16)\n\nMaximizing (15) with respect to α_m, subject to the constraints 0 ≤ α_m ≤ 1 and Σ_m α_m = 1, we see that the mixing coefficients which maximize the lower bound are given by the Boltzmann distribution\n\nα_m = exp(-E_m) / Σ_k exp(-E_k).    (17)\n\nWe next maximize the bound (11) with respect to the smoothing marginals R(h_j|m). Some manipulation leads to the solution\n\nR(h_j|m) = α_m Q(h_j|m) [ λ_m Σ_k α_k π_{R,Q}^{(j)}(m, k) Q(h_j|k) ]^{-1}    (18)\n\nin which π_{R,Q}^{(j)}(m, k) denotes the expression defined in (12) but with the j term omitted from the product.\n\nThe optimization of the μ_mj takes the form of a re-estimation formula given by an extension of the result obtained for mean-field theory by Saul et al. (1996). For simplicity we omit the details here.\n\nFinally, we optimize the bound with respect to the λ_m, to give\n\n1/λ_m = (1/α_m) Σ_k α_k π_{R,Q}(m, k).    (19)\n\nSince the various parameters are coupled, and we have optimized them individually keeping the remainder constant, it will be necessary to maximize the lower bound iteratively until some convergence criterion is satisfied. Having done this for a particular instantiation of the visible nodes, we can then determine the gradients of the bound with respect to the parameters governing the original belief network, and use these gradients for learning.\n\n3 Application to Sigmoid Belief Networks\n\nWe illustrate the mixtures formalism by considering its application to sigmoid belief networks of the form (2). The components of the mixture distribution are given by factorized Bernoulli distributions of the form (3) with parameters μ_mi. Again we have to introduce variational parameters ξ_mi for each component using (4). 
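Two computations from the update equations of Section 2.2 can be checked numerically in a few lines. The sketch below uses arbitrary illustrative marginals and E_m values (assumptions, not quantities from the paper): it verifies the factorized evaluation of π_{R,Q}(m, k) in (12) against a brute-force sum over all 2^n hidden configurations, and evaluates the Boltzmann form (17) for the mixing coefficients, with a standard max-subtraction for numerical stability that is not part of the paper:

```python
import math
from itertools import product

# Illustrative factorized marginals R(h_i=1|m) and Q(h_i=1|k) over n
# binary hidden units (arbitrary values, not from the paper).
n = 4
r = [0.3, 0.8, 0.5, 0.6]
q = [0.7, 0.2, 0.9, 0.4]

def bern(p, h):
    return p if h else 1.0 - p

# Left-hand side of (12): explicit sum over all 2^n configurations.
brute = sum(math.prod(bern(r[i], h[i]) * bern(q[i], h[i]) for i in range(n))
            for h in product((0, 1), repeat=n))

# Right-hand side of (12): product of n two-term sums, O(n) time.
pi = math.prod(r[i] * q[i] + (1.0 - r[i]) * (1.0 - q[i]) for i in range(n))
assert abs(brute - pi) < 1e-12

# Equation (17): mixing coefficients as a Boltzmann (softmax) distribution.
E = [1.2, 0.4, 2.7]                       # illustrative E_m values
shift = min(E)                            # stabilize the exponentials
w = [math.exp(-(e - shift)) for e in E]
alpha = [wi / sum(w) for wi in w]
assert abs(sum(alpha) - 1.0) < 1e-12
assert alpha[E.index(min(E))] == max(alpha)  # smallest E_m gets largest weight
```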
The parameters {μ_mi, ξ_mi} are optimized along with {α_m, R(h_j|m), λ_m} for each pattern in the training set.\n\nWe first investigate the extent to which the use of a mixture distribution yields an improvement in the lower bound on the log likelihood compared with standard mean field theory. To do this, we follow Saul et al. (1996) and consider layered networks having 2 units in the first layer, 4 units in the second layer and 6 units in the third layer, with full connectivity between layers. In all cases the six final-layer units are considered to be visible and have their states clamped at zero. We generate 5000 networks with parameters {J_ij, b_i} chosen randomly with uniform distribution over (-1, 1). The number of hidden variable configurations is 2^6 = 64 and is sufficiently small that the true log likelihood can be computed directly by summation over the hidden states. We can therefore compare the value of the lower bound F with the true log likelihood L, using the normalized error (L - F)/L. Figure 1 shows histograms of the relative log likelihood error for various numbers of mixture components, together with the mean values taken from the histograms. These show a systematic improvement in the quality of the approximation as the number of mixture components is increased.\n\nFigure 1: Histograms of the normalized error between the true log likelihood and the lower bound, for various numbers of mixture components, together with the mean error from each histogram plotted against the number of components. Mean normalized errors: 1 component, 0.015731; 2 components, 0.013979; 3 components, 0.01288; 4 components, 0.012024; 5 components, 0.011394.\n\nNext we consider the impact of using mixture distributions on learning. To explore this we use a small-scale problem introduced by Hinton et al. (1995) involving binary images of size 4 x 4 in which each image contains either horizontal or vertical bars with equal probability, with each of the four possible locations for a bar occupied with probability 0.5. We trained networks having architecture 1-8-16 using distributions having between 1 and 5 components. Randomly generated patterns were presented to the network for a total of 500 presentations, and the μ_mi and ξ_mi were initialised from a uniform distribution over (0, 1). Again the networks are sufficiently small that the exact log likelihood for the trained models can be evaluated directly. A Hinton diagram of the hidden-to-output weights for the eight units in a network trained with 5 mixture components is shown in Figure 2. Figure 3 shows a plot of the true log likelihood versus the number M of components in the mixture for a set of experiments in which, for each value of M, the model was trained 10 times starting from different random parameter initializations. These results indicate that, as the number of mixture components is increased, the learning algorithm is able to find a set of network parameters having a larger likelihood, and hence that the improved flexibility of the approximating distribution is indeed translated into an improved training algorithm. 
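The 'bars' training distribution described above is easy to reproduce; the sampler below is a hypothetical reconstruction from the description in the text (horizontal or vertical orientation chosen with equal probability, each of the four bar positions occupied with probability 0.5):

```python
import random

def sample_bars_image(rng):
    """One 4x4 binary image from the 'bars' distribution of Hinton et al.
    (1995), as described in the text: horizontal or vertical bars with
    equal probability, each of the four positions occupied with p = 0.5."""
    horizontal = rng.random() < 0.5
    on = [rng.random() < 0.5 for _ in range(4)]
    img = [[0] * 4 for _ in range(4)]
    for idx in range(4):
        if on[idx]:
            for t in range(4):
                if horizontal:
                    img[idx][t] = 1   # fill row idx
                else:
                    img[t][idx] = 1   # fill column idx
    return img

rng = random.Random(0)
img = sample_bars_image(rng)
# every image consists only of complete rows or only of complete columns
assert all(sum(row) in (0, 4) for row in img) or \
       all(sum(col) in (0, 4) for col in zip(*img))
```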
\nWe are currently applying the mixture formalism to the large-scale problem of hand-written digit classification.\n\nFigure 2: Hinton diagrams of the hidden-to-output weights for each of the 8 hidden units in a network trained on the 'bars' problem using a mixture distribution having 5 components.\n\nFigure 3: True log likelihood (divided by the number of patterns) versus the number M of mixture components for the 'bars' problem, indicating a systematic improvement in performance as M is increased.\n\nReferences\n\nHinton, G. E., P. Dayan, B. J. Frey, and R. M. Neal (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158-1161.\n\nJaakkola, T. (1997). Variational Methods for Inference and Estimation in Graphical Models. Ph.D. thesis, MIT.\n\nJaakkola, T. and M. I. Jordan (1997). Approximating posteriors via mixture models. To appear in Proceedings NATO ASI Learning in Graphical Models, Ed. M. I. Jordan. Kluwer.\n\nNeal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.\n\nSaul, L. K., T. Jaakkola, and M. I. Jordan (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research 4, 61-76.\n\nSaul, L. K. and M. I. Jordan (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, pp. 486-492. MIT Press.\n", "award": [], "sourceid": 1392, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}