{"title": "A Variational Bayesian Framework for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 215, "abstract": null, "full_text": "A Variational Bayesian Framework for Graphical Models \n\nHagai Attias \nhagai@gatsby.ucl.ac.uk \nGatsby Unit, University College London \n17 Queen Square \nLondon WC1N 3AR, U.K. \n\nAbstract \n\nThis paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models. Our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. These posteriors fall out of a free-form optimization procedure, which naturally incorporates conjugate priors. Unlike in large sample approximations, the posteriors are generally non-Gaussian and no Hessian needs to be computed. Predictive quantities are obtained analytically. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. We demonstrate that this approach can be applied to a large class of models in several domains, including mixture models and source separation. \n\n1 Introduction \n\nA standard method to learn a graphical model^1 from data is maximum likelihood (ML). Given a training dataset, ML estimates a single optimal value for the model parameters within a fixed graph structure. However, ML is well known for its tendency to overfit the data. Overfitting becomes more severe for complex models involving high-dimensional real-world data such as images, speech, and text. Another problem is that ML prefers complex models, since they have more parameters and fit the data better. Hence, ML cannot optimize model structure. \nThe Bayesian framework provides, in principle, a solution to these problems. 
Rather than focusing on a single model, a Bayesian considers a whole (finite or infinite) class of models. For each model, its posterior probability given the dataset is computed. Predictions for test data are made by averaging the predictions of all the individual models, weighted by their posteriors. Thus, the Bayesian framework avoids overfitting by integrating out the parameters. In addition, complex models are automatically penalized by being assigned a lower posterior probability, so optimal structures can be identified. \n\n^1 We use the term 'model' to refer collectively to parameters and structure. \n\nUnfortunately, computations in the Bayesian framework are intractable even for very simple cases (e.g. factor analysis; see [2]). Most existing approximation methods fall into two classes [3]: Markov chain Monte Carlo (MCMC) methods and large sample methods (e.g., the Laplace approximation). MCMC methods attempt to achieve exact results but typically require vast computational resources, and become impractical for complex models in high data dimensions. Large sample methods are tractable, but typically make a drastic approximation by modeling the posteriors over all parameters as Normal, even for constrained parameters such as covariance matrices (which must be positive definite). In addition, they require the computation of the Hessian, which may become quite intensive. \nIn this paper I present Variational Bayes (VB), a practical framework for Bayesian computations in graphical models. VB draws together variational ideas from intractable latent variable models [8] and from Bayesian inference [4,5,9], which, in turn, draw on the work of [6]. This framework facilitates analytical calculations of posterior distributions over the hidden variables, parameters and structures. 
The posteriors fall out of a free-form optimization procedure which naturally incorporates conjugate priors, and emerge in standard forms, only one of which is Normal. They are computed via an iterative algorithm that is closely related to Expectation Maximization (EM) and whose convergence is guaranteed. No Hessian needs to be computed. In addition, averaging over models to compute predictive quantities can be performed analytically. Model selection is done using the posterior over structure; in particular, the BIC/MDL criteria emerge as a limiting case. \n\n2 General Framework \n\nWe restrict our attention in this paper to directed acyclic graphs (DAGs, a.k.a. Bayesian networks). Let Y = {y_1, ..., y_N} denote the visible (data) nodes, where n = 1, ..., N runs over the data instances, and let X = {x_1, ..., x_N} denote the hidden nodes. Let Θ denote the parameters, which are simply additional hidden nodes with their own distributions. A model with a fixed structure m is fully defined by the joint distribution p(Y, X, Θ | m). In a DAG, this joint factorizes over the nodes, i.e. p(Y, X | Θ, m) = ∏_i p(u_i | pa_i, θ_i, m), where u_i ∈ Y ∪ X, pa_i is the set of parents of u_i, and θ_i ∈ Θ parametrizes the edges directed toward u_i. In addition, we usually assume independent instances, p(Y, X | Θ, m) = ∏_n p(y_n, x_n | Θ, m). \nWe shall also consider a set of structures m ∈ M, where m controls the number of hidden nodes and the functional forms of the dependencies p(u_i | pa_i, θ_i, m), including the range of values assumed by each node (e.g., the number of components in a mixture model). Associated with the set of structures is a structure prior p(m). \nMarginal likelihood and posterior over parameters. For a fixed structure m, we are interested in two quantities. The first is the parameter posterior distribution p(Θ | Y, m). The second is the marginal likelihood p(Y | m), also known as the evidence assigned to structure m by the data. 
In the following, the reference to m is usually omitted but is always implied. Both quantities are obtained from the joint p(Y, X, Θ | m). For models with no hidden nodes the required computations can often be performed analytically. However, in the presence of hidden nodes, these quantities become computationally intractable. We shall approximate them using a variational approach as follows. \nConsider the joint posterior p(X, Θ | Y) over hidden nodes and parameters. Since it is intractable, consider a variational posterior q(X, Θ | Y), which is restricted to the factorized form \n\nq(X, Θ | Y) = q(X | Y) q(Θ | Y) ,   (1) \n\nwhere, given the data, the parameters and hidden nodes are independent. This restriction is the key: it makes q approximate but tractable. Notice that we do not require complete factorization, as the parameters and the hidden nodes may still be correlated amongst themselves. \nWe compute q by optimizing a cost function F_m[q] defined by \n\nF_m[q] = ∫ dΘ ∑_X q(X) q(Θ) log [ p(Y, X, Θ | m) / (q(X) q(Θ)) ] ≤ log p(Y | m) ,   (2) \n\nwhere the inequality holds for an arbitrary q and follows from Jensen's inequality (see [6]); it becomes an equality when q is the true posterior. Note that q is always understood to include conditioning on Y as in (1). Since F_m is bounded from above by the marginal likelihood, we can obtain the optimal posteriors by maximizing it w.r.t. q. This can be shown to be equivalent to minimizing the KL distance between q and the true posterior. Thus, optimizing F_m produces the best approximation to the true posterior within the space of distributions satisfying (1), as well as the tightest lower bound on the true marginal likelihood. \nPenalizing complex models. 
To see that the VB objective function F_m penalizes complexity, it is useful to rewrite it as \n\nF_m = ⟨ log [ p(Y, X | Θ) / q(X) ] ⟩_{X,Θ} − KL[ q(Θ) || p(Θ) ] ,   (3) \n\nwhere the average in the first term on the r.h.s. is taken w.r.t. q(X, Θ). The first term corresponds to the (averaged) likelihood. The second term is the KL distance between the prior and posterior over the parameters. As the number of parameters increases, the KL distance grows and consequently reduces F_m. \nThis penalized likelihood interpretation becomes transparent in the large sample limit N → ∞, where the parameter posterior is sharply peaked about the most probable value Θ = Θ_0. It can then be shown that the KL penalty reduces to (|Θ_0|/2) log N, which is linear in the number of parameters |Θ_0| of structure m. F_m then corresponds precisely to the Bayesian information criterion (BIC) and the minimum description length (MDL) criterion (see [3]). Thus, these popular model selection criteria follow as a limiting case of the VB framework. \nFree-form optimization and an EM-like algorithm. Rather than assuming a specific parametric form for the posteriors, we let them fall out of free-form optimization of the VB objective function. This results in an iterative algorithm directly analogous to ordinary EM. In the E-step, we compute the posterior over the hidden nodes by solving ∂F_m/∂q(X) = 0 to get \n\nq(X) ∝ e^⟨log p(Y, X | Θ)⟩_Θ ,   (4) \n\nwhere the average is taken w.r.t. q(Θ). \nIn the M-step, rather than the 'optimal' parameters, we compute the posterior distribution over the parameters by solving ∂F_m/∂q(Θ) = 0 to get \n\nq(Θ) ∝ e^⟨log p(Y, X | Θ)⟩_X p(Θ) ,   (5) \n\nwhere the average is taken w.r.t. q(X). \nThis is where the concept of conjugate priors becomes useful. Denoting the exponential term on the r.h.s. 
of (5) by f(Θ), we choose the prior p(Θ) from a family of distributions such that q(Θ) ∝ f(Θ) p(Θ) belongs to that same family. p(Θ) is then said to be conjugate to f(Θ). This procedure allows us to select a prior from a fairly large family of distributions (which includes non-informative ones as limiting cases) and thus not compromise generality, while facilitating mathematical simplicity and elegance. In particular, learning in the VB framework simply amounts to updating the hyperparameters, i.e., transforming the prior parameters to the posterior parameters. We point out that, while the use of conjugate priors is widespread in statistics, so far they could only be applied to models where all nodes were visible. \nStructure posterior. To compute q(m) we exploit Jensen's inequality once again to define a more general objective function, F[q] = ∑_{m∈M} q(m) [ F_m + log p(m)/q(m) ] ≤ log p(Y), where now q = q(X | m, Y) q(Θ | m, Y) q(m | Y). After computing F_m for each m ∈ M, the structure posterior is obtained by free-form optimization of F: \n\nq(m) ∝ e^{F_m} p(m) .   (6) \n\nHence, prior assumptions about the likelihood of different structures, encoded by the prior p(m), affect the selection of optimal model structures performed according to q(m), as they should. \nPredictive quantities. The ultimate goal of Bayesian inference is to estimate predictive quantities, such as a density or regression function. Generally, these quantities are computed by averaging over all models, weighting each model by its posterior. In the VB framework, exact model averaging is approximated by replacing the true posterior p(Θ | Y) by the variational q(Θ | Y). In density estimation, for example, the density assigned to a new data point y is given by p(y | Y) = ∫ dΘ p(y | Θ) q(Θ | Y). In some situations (e.g. 
source separation), an estimate of hidden node values x from new data y may be required. The relevant quantity here is the conditional p(x | y, Y), from which the most likely value of the hidden nodes is extracted. VB approximates it by p(x | y, Y) ∝ ∫ dΘ p(y, x | Θ) q(Θ | Y). \n\n3 Variational Bayes Mixture Models \n\nMixture models have been investigated and analyzed extensively over many years. However, the well known problems of regularizing against likelihood divergences and of determining the required number of mixture components are still open. Whereas in theory the Bayesian approach provides a solution, no satisfactory practical algorithm has emerged from the application of involved sampling techniques (e.g., [7]) and approximation methods [3] to this problem. We now present the solution provided by VB. \nWe consider models of the form \n\np(y_n | Θ, m) = ∑_{s=1}^m p(y_n | s_n = s, Θ) p(s_n = s | Θ) ,   (7) \n\nwhere y_n denotes the nth observed data vector, and s_n denotes the hidden component that generated it. The components are labeled by s = 1, ..., m, with the structure parameter m denoting the number of components. Whereas our approach can be applied to arbitrary models, for simplicity we consider here Normal component distributions, p(y_n | s_n = s, Θ) = N(μ_s, Γ_s), where μ_s is the mean and Γ_s the precision (inverse covariance) matrix. The mixing proportions are p(s_n = s | Θ) = π_s. \nWe use conjugate priors on the parameters Θ = {π_s, μ_s, Γ_s}. The mixing proportions are jointly Dirichlet, p({π_s}) = D(λ^0), the means (conditioned on the precisions) are Normal, p(μ_s | Γ_s) = N(ρ^0, β^0 Γ_s), and the precisions are Wishart, p(Γ_s) = W(ν^0, Φ^0). We find that the parameter posterior for a fixed m factorizes into q(Θ) = q({π_s}) ∏_s q(μ_s, Γ_s). 
The posteriors are obtained by the following iterative algorithm, termed VB-MOG. \nE-step. Compute the responsibilities for instance n using (4): \n\nγ_s^n ≡ q(s_n = s | y_n) ∝ π̄_s Γ̄_s^{1/2} e^{−(y_n − ρ_s)^T Γ̂_s (y_n − ρ_s)/2} e^{−d/(2β_s)} ,   (8) \n\nnoting that here X = S and q(S) = ∏_n q(s_n). This expression resembles the responsibilities in ordinary ML; the differences stem from integrating out the parameters. The special quantities in (8) are log π̄_s ≡ ⟨log π_s⟩ = ψ(λ_s) − ψ(∑_{s'} λ_{s'}), log Γ̄_s ≡ ⟨log |Γ_s|⟩ = ∑_{i=1}^d ψ((ν_s + 1 − i)/2) − log |Φ_s| + d log 2, and Γ̂_s ≡ ⟨Γ_s⟩ = ν_s Φ_s^{−1}, where ψ(x) = d log Γ(x)/dx is the digamma function, and the averages ⟨·⟩ are taken w.r.t. q(Θ). The other parameters are described below. \nM-step. Compute the parameter posterior in two stages. First, compute the quantities \n\nπ̂_s = (1/N) ∑_{n=1}^N γ_s^n ,   μ̂_s = (1/N̂_s) ∑_{n=1}^N γ_s^n y_n ,   Σ̂_s = (1/N̂_s) ∑_{n=1}^N γ_s^n C_s^n ,   (9) \n\nwhere C_s^n = (y_n − μ̂_s)(y_n − μ̂_s)^T and N̂_s = N π̂_s. This stage is identical to the M-step in ordinary EM, where it produces the new parameters. In VB, however, the quantities in (9) only help characterize the new parameter posteriors. These posteriors are functionally identical to the priors but have different parameter values. The mixing proportions are jointly Dirichlet, q({π_s}) = D({λ_s}), the means are Normal, q(μ_s | Γ_s) = N(ρ_s, β_s Γ_s), and the precisions are Wishart, q(Γ_s) = W(ν_s, Φ_s). The posterior parameters are updated in the second stage, using the simple rules \n\nλ_s = N̂_s + λ^0 ,   ρ_s = (N̂_s μ̂_s + β^0 ρ^0)/(N̂_s + β^0) ,   β_s = N̂_s + β^0 , \nν_s = N̂_s + ν^0 ,   Φ_s = N̂_s Σ̂_s + N̂_s β^0 (μ̂_s − ρ^0)(μ̂_s − ρ^0)^T/(N̂_s + β^0) + Φ^0 .   (10) \n\nThe final values of the posterior parameters form the output of the VB-MOG. 
We remark that (a) whereas no specific assumptions have been made about them, the parameter posteriors emerge in suitable, non-trivial (and generally non-Normal) functional forms; (b) the computational overhead of the VB-MOG compared to EM is minimal; (c) the covariance of the parameter posterior is O(1/N), and VB-MOG reduces to EM (regularized by the priors) as N → ∞; (d) VB-MOG has no divergence problems; (e) stability is guaranteed by the existence of an objective function; (f) finally, the approximate marginal likelihood F_m, required to optimize the number of components via (6), can also be obtained in closed form (omitted). \nPredictive Density. Using our posteriors, we can integrate out the parameters and show that the density assigned by the model to a new data vector y is a mixture of Student-t distributions, \n\np(y | Y) = ∑_{s=1}^m π̄_s t_{ω_s}(y | ρ_s, Λ_s) ,   (11) \n\nwhere component s has ω_s = ν_s + 1 − d degrees of freedom, mean ρ_s, covariance Λ_s = ((β_s + 1)/(β_s ω_s)) Φ_s, and proportion π̄_s = λ_s / ∑_{s'} λ_{s'}. (11) reduces to a MOG as N → ∞. \nNonlinear Regression. We may divide each data vector into input and output parts, y = (y^i, y^o), and use the model to estimate the regression function y^o = f(y^i) and error spheres. These may be extracted from the conditional p(y^o | y^i, Y) = ∑_{s=1}^m w_s t_{ω'_s}(y^o | ρ'_s, Λ'_s), which also turns out to be a mixture of Student-t distributions, with the means ρ'_s being linear, and the covariances Λ'_s and mixing proportions w_s nonlinear, in y^i, and given in terms of the posterior parameters. \n\n[Figure 1 appears here: left, example digits from the Buffalo post office dataset; right, misclassification rate histograms.] \n\nFigure 1: VB-MOG applied to handwritten digit recognition. 
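Eq. (11) is easy to check numerically. The sketch below (1-D case; the hyperparameter values in the usage are made up for illustration) evaluates the predictive mixture of Student-t densities; `student_t_1d` and `predictive_density` are this sketch's own names, not the paper's.

```python
import math

def student_t_1d(y, dof, mean, scale):
    # univariate Student-t density with `dof` degrees of freedom and
    # squared-scale parameter `scale` (the 1-D analogue of Lambda_s)
    c = math.exp(math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2))
    c /= math.sqrt(dof * math.pi * scale)
    return c * (1 + (y - mean) ** 2 / (dof * scale)) ** (-(dof + 1) / 2)

def predictive_density(y, lam, beta, rho, nu, phi, d=1):
    # eq. (11): mixture of Student-t components built from the
    # posterior hyperparameters (lam_s, beta_s, rho_s, nu_s, phi_s)
    lam_sum = sum(lam)
    total = 0.0
    for s in range(len(lam)):
        w = nu[s] + 1 - d                            # omega_s, degrees of freedom
        L = (beta[s] + 1) / (beta[s] * w) * phi[s]   # Lambda_s, covariance (scalar here)
        total += (lam[s] / lam_sum) * student_t_1d(y, w, rho[s], L)
    return total
```

Since the weights λ_s/∑λ_{s'} sum to one and each component is a proper density (for ν_s > d − 1), the mixture integrates to one, which a crude numerical quadrature confirms.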
\n\nVB-MOG was applied to the Boston housing dataset (UCI machine learning repos(cid:173)\nitory), where 13 inputs are used to predict the single output, a house's price. 100 \nrandom divisions of the N = 506 dataset into 481 training and 25 test points were \nused, resulting in an average MSE of 11.9. Whereas ours is not a discriminative \nmethod, it was nevertheless competitive with Breiman's (1994) bagging technique \nusing regression trees (MSE=11.7). For comparison, EM achieved MSE=14.6. \n\nClassification. Here, a separate parameter posterior is computed for each class c \nfrom a training dataset yc. Test data vector y is then classified according to the \nconditional p(c I y, {yC}), which has a form identical to (11) (with c-dependent \nparameters) multiplied by the relative size of yc. \nVB-MOG was applied to the Buffalo post office dataset, which contains 1100 exam(cid:173)\nples for each digit 0 - 9. Each digit is a gray-level 8 x 8 pixel array (see examples \nin Fig. 1 (left)). We used 10 random 500-digit batches for training, and a separate \nbatch of 200 for testing. An average misclassification rate of .018 was obtained \nusing m = 30 components; EM achieved .025. The misclassification histograms \n(VB=solid, EM=dashed) are shown in Fig. 1 (right). \n\n4 VB and Intractable Models: a Blind Separation Example \n\nThe discussion so far assumed that a free-form optimization of the VB objective \nfunction is feasible. Unfortunately, for many interesting models, in particular mod(cid:173)\nels where ordinary ML is intractable, this is not the case. For such models, we \nmodify the VB procedure as follows: (a) Specify a parametric functional form for \nthe posterior over the hidden nodes q(X) , and optimize w.r.t. its parameters, in the \nspirit of [8J. (b) Let the parameter posterior q(8) fall out of free-form optimization, \nas before. \n\nWe illustrate this approach in the context of the blind source separation (BSS) \nproblem (see, e.g., [1]). 
This problem is described by y_n = H x_n + u_n, where x_n is an unobserved m-dim source vector at instance n, H is an unknown mixing matrix, and the noise u_n is Normally distributed with an unknown precision λ. The task is to construct a source estimate x̂_n from the observed d-dim data y_n. The sources are independent and non-Normally distributed. Here we assume the high-kurtosis distribution p(x_j) ∝ cosh^{−2}(x_j/2), which is appropriate for modeling speech sources. \nOne important but heretofore unresolved problem in BSS is determining the number m of sources from data. Another is to avoid overfitting the mixing matrix. Both problems, typical of ML algorithms, can be remedied using VB. \nIt is the non-Normal nature of the sources that renders the source posterior p(X | Y) intractable even before a Bayesian treatment. We use a Normal variational posterior q(X) = ∏_n N(x_n | ρ_n, Γ_n) with instance-dependent mean and precision. The mixing matrix posterior q(H) then emerges as Normal. For simplicity, λ is optimized rather than integrated out. The resulting VB-BSS algorithm runs as follows: \n\n[Figure 2 appears here: left, log p(m) vs. the number of sources m; right, log source reconstruction error vs. SNR (dB).] \n\nFigure 2: Application of VB to blind source separation (see text). \n\nE-step. Optimize the variational mean ρ_n by iterating to convergence, for each n, the fixed-point equation λH^T(y_n − Hρ_n) − tanh(ρ_n/2) = C^{−1}ρ_n, where C is the source covariance conditioned on the data. The variational precision matrix turns out to be n-independent: Γ_n = λH^T H + I/2 + C^{−1}. \nM-step. Update the mean and precision of the posterior q(H) (rules omitted). 
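A minimal sketch of the E-step fixed point, reduced to a single source and a single sensor (scalars h, λ, c) and written as a root-finding problem. The equation is as reconstructed above from the OCR-damaged original, so this is an illustration under that assumption, not the paper's implementation; the left-hand side minus ρ/c is strictly decreasing in ρ, so bisection finds the unique solution.

```python
import math

def estep_rho_scalar(y, h, lam, c, tol=1e-10):
    """Solve lam*h*(y - h*rho) - tanh(rho/2) = rho/c for rho (scalar case).

    g(rho) below is strictly decreasing (each term has non-positive slope,
    assuming c > 0), so the root is unique and bisection is safe.
    """
    def g(rho):
        return lam * h * (y - h * rho) - math.tanh(rho / 2) - rho / c

    # bracket the root: expand until g(lo) >= 0 >= g(hi)
    lo, hi = -1.0, 1.0
    while g(lo) < 0:
        lo *= 2
    while g(hi) > 0:
        hi *= 2
    # bisect down to the requested interval width
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

In the paper's multivariate setting the same iteration runs per instance n with matrix H and covariance C; the scalar version just makes the monotone structure of the update easy to see.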
\nThis algorithm was applied to 11-dim data generated by linearly mixing five 100-msec-long speech and music signals obtained from commercial CDs. Gaussian noise was added at different SNR levels. A uniform structure prior p(m) = 1/K for m ≤ K was used. The resulting posterior over the number of sources (Fig. 2 (left)) is peaked at the correct value m = 5. The sources were then reconstructed from test data via p(x | y, Y). The log reconstruction error is plotted vs. SNR in Fig. 2 (right, solid). The ML error (which includes no model averaging) is also shown (dashed) and is larger, reflecting overfitting. \n\n5 Conclusion \n\nThe VB framework is applicable to a large class of graphical models. In fact, it may be integrated with the junction tree algorithm to produce general inference engines with minimal overhead compared to ML ones. Dirichlet, Normal and Wishart posteriors are not special to the models treated here but emerge as a general feature. Current research efforts include applications to multinomial models and to learning the structure of complex dynamic probabilistic networks. \n\nAcknowledgements \nI thank Matt Beal, Peter Dayan, David Mackay, Carl Rasmussen, and especially Zoubin Ghahramani, for important discussions. \n\nReferences \n[1] Attias, H. (1999). Independent Factor Analysis. Neural Computation 11, 803-851. \n[2] Bishop, C.M. (1999). Variational Principal Component Analysis. Proc. 9th ICANN. \n[3] Chickering, D.M. & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning 29, 181-212. \n[4] Hinton, G.E. & Van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. Proc. 6th COLT, 5-13. \n[5] Jaakkola, T. & Jordan, M.I. (1997). Bayesian logistic regression: A variational approach. Statistics and Artificial Intelligence 6 (Smyth, P. & Madigan, D., Eds). \n[6] Neal, R.M. 
& Hinton, G.E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 355-368 (Jordan, M.I., Ed). Kluwer Academic Press, Norwell, MA. \n[7] Richardson, S. & Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B 59, 731-792. \n[8] Saul, L.K., Jaakkola, T., & Jordan, M.I. (1996). Mean field theory of sigmoid belief networks. Journal of Artificial Intelligence Research 4, 61-76. \n[9] Waterhouse, S., Mackay, D., & Robinson, T. (1996). Bayesian methods for mixture of experts. NIPS-8 (Touretzky, D.S. et al., Eds). MIT Press. \n", "award": [], "sourceid": 1726, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}]}