{"title": "A Variational Baysian Framework for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 215, "abstract": null, "full_text": "A  Variational  Bayesian Framework for \n\nGraphical  Models \n\nHagai Attias \n\nhagai@gatsby.ucl.ac.uk \n\nGatsby Unit,  University  College  London \n\n17 Queen Square \n\nLondon WC1N  3AR, U.K. \n\nAbstract \n\nThis paper presents a novel practical framework for Bayesian model \naveraging  and  model  selection  in  probabilistic  graphical  models. \nOur approach approximates full  posterior distributions over model \nparameters and structures, as well as latent variables, in an analyt(cid:173)\nical  manner.  These  posteriors fall  out of a  free-form  optimization \nprocedure,  which  naturally  incorporates  conjugate  priors.  Unlike \nin  large sample  approximations,  the posteriors are  generally  non(cid:173)\nGaussian and no Hessian needs to be computed.  Predictive quanti(cid:173)\nties  are obtained analytically.  The resulting algorithm generalizes \nthe standard Expectation Maximization algorithm, and its conver(cid:173)\ngence  is  guaranteed.  We  demonstrate  that  this  approach  can  be \napplied  to  a  large  class  of  models  in  several  domains,  including \nmixture models and source separation. \n\n1 \n\nIntroduction \n\nA standard method to learn a graphical model  1  from  data is  maximum likelihood \n(ML).  Given a  training dataset, ML estimates a  single optimal value for  the model \nparameters within  a fixed  graph structure.  However,  ML  is  well  known for  its ten(cid:173)\ndency  to  overfit  the  data.  Overfitting  becomes  more  severe  for  complex  models \ninvolving  high-dimensional  real-world data such  as  images,  speech,  and text.  An(cid:173)\nother problem is that ML prefers complex models, since they have more parameters \nand fit  the data better.  Hence,  ML  cannot optimize model structure. \nThe Bayesian framework provides, in principle, a solution to these problems.  Rather \nthan focusing on a single model, a Bayesian considers a whole (finite or infinite) class \nof models.  For each model,  its posterior probability given the dataset is computed. \nPredictions for  test  data are made  by  averaging the predictions of all  the individ(cid:173)\nual  models,  weighted  by  their  posteriors.  Thus,  the  Bayesian  framework  avoids \noverfitting  by  integrating  out  the  parameters.  In  addition,  complex  models  are \nautomatically  penalized  by  being  assigned  a  lower  posterior  probability,  therefore \noptimal structures can be identified. \nUnfortunately,  computations  in  the  Bayesian  framework  are  intractable  even  for \n\nlWe use the term 'model'  to refer collectively  to parameters and structure. \n\n\f210 \n\nH.  Attias \n\nvery simple cases (e.g.  factor analysis; see  [2]).  Most existing approximation meth(cid:173)\nods fall  into two classes  [3]:  Markov chain Monte  Carlo methods and large sample \nmethods  (e.g.,  Laplace approximation).  MCMC  methods attempt to achieve exact \nresults but typically require vast computational resources,  and become impractical \nfor  complex models  in  high data dimensions.  Large sample  methods are  tractable, \nbut  typically  make  a  drastic  approximation  by  modeling  the 'posteriors  over  all \nparameters as  Normal,  even for  parameters that are  not  positive definite  (e.g.,  co(cid:173)\nvariance matrices).  
In addition, they require the computation of the Hessian, which may become quite intensive.

In this paper I present Variational Bayes (VB), a practical framework for Bayesian computations in graphical models. VB draws together variational ideas from intractable latent variable models [8] and from Bayesian inference [4,5,9], which, in turn, draw on the work of [6]. This framework facilitates analytical calculations of posterior distributions over the hidden variables, parameters, and structures. The posteriors fall out of a free-form optimization procedure which naturally incorporates conjugate priors, and emerge in standard forms, only one of which is Normal. They are computed via an iterative algorithm that is closely related to Expectation Maximization (EM) and whose convergence is guaranteed. No Hessian needs to be computed. In addition, averaging over models to compute predictive quantities can be performed analytically. Model selection is done using the posterior over structure; in particular, the BIC/MDL criteria emerge as a limiting case.

2 General Framework

We restrict our attention in this paper to directed acyclic graphs (DAGs, a.k.a. Bayesian networks). Let $Y = \{y_1, \ldots, y_N\}$ denote the visible (data) nodes, where $n = 1, \ldots, N$ runs over the data instances, and let $X = \{x_1, \ldots, x_N\}$ denote the hidden nodes. Let $\Theta$ denote the parameters, which are simply additional hidden nodes with their own distributions. A model with a fixed structure $m$ is fully defined by the joint distribution $p(Y, X, \Theta \mid m)$. In a DAG, this joint factorizes over the nodes, i.e., $p(Y, X \mid \Theta, m) = \prod_i p(u_i \mid \mathrm{pa}_i, \theta_i, m)$, where $u_i \in Y \cup X$, $\mathrm{pa}_i$ is the set of parents of $u_i$, and $\theta_i \in \Theta$ parametrize the edges directed toward $u_i$. In addition, we usually assume independent instances, $p(Y, X \mid \Theta, m) = \prod_n p(y_n, x_n \mid \Theta, m)$. We shall also consider a set of structures $m \in \mathcal{M}$, where $m$ controls the number of hidden nodes and the functional forms of the dependencies $p(u_i \mid \mathrm{pa}_i, \theta_i, m)$, including the range of values assumed by each node (e.g., the number of components in a mixture model). Associated with the set of structures is a structure prior $p(m)$.

Marginal likelihood and posterior over parameters. For a fixed structure $m$, we are interested in two quantities. The first is the parameter posterior distribution $p(\Theta \mid Y, m)$. The second is the marginal likelihood $p(Y \mid m)$, also known as the evidence assigned to structure $m$ by the data. In the following, the reference to $m$ is usually omitted but is always implied. Both quantities are obtained from the joint $p(Y, X, \Theta \mid m)$. For models with no hidden nodes the required computations can often be performed analytically. However, in the presence of hidden nodes, these quantities become computationally intractable. We shall approximate them using a variational approach as follows.

Consider the joint posterior $p(X, \Theta \mid Y)$ over hidden nodes and parameters. Since it is intractable, consider a variational posterior $q(X, \Theta \mid Y)$, which is restricted to the factorized form

$$q(X, \Theta \mid Y) = q(X \mid Y) \, q(\Theta \mid Y) \,, \qquad (1)$$

where, given the data, the parameters and hidden nodes are independent. This restriction is the key: it makes $q$ approximate but tractable. Notice that we do not require complete factorization, as the parameters and hidden nodes may still be correlated amongst themselves.
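To make the node-wise factorization concrete, here is a minimal Python sketch of a toy two-node DAG (hidden $x \to$ visible $y$, Normal conditionals). The model and all names are illustrative assumptions, not taken from the paper; the point is only that the log-joint decomposes into local terms $\log p(u_i \mid \mathrm{pa}_i, \theta_i)$ that add across i.i.d. instances.

```python
import numpy as np
from scipy.stats import norm

# Toy DAG (hypothetical): hidden x -> visible y, for one instance.
# The joint factorizes node-wise:
#   log p(y, x | theta) = log p(x) + log p(y | x, theta),
# and i.i.d. instances simply add their local terms.

def log_joint(x, y, theta):
    log_px = norm.logpdf(x, loc=0.0, scale=1.0)          # root node, no parents
    log_py = norm.logpdf(y, loc=theta["w"] * x,          # child given its parent
                         scale=theta["sigma"])
    return log_px + log_py

rng = np.random.default_rng(0)
xs = rng.normal(size=5)
ys = 2.0 * xs + 0.5 * rng.normal(size=5)
total = sum(log_joint(x, y, {"w": 2.0, "sigma": 0.5}) for x, y in zip(xs, ys))
print(total)  # log p(Y, X | theta) for the 5 instances
```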
We compute $q$ by optimizing a cost function $\mathcal{F}_m[q]$ defined by

$$\mathcal{F}_m[q] = \int d\Theta \sum_X q(X) \, q(\Theta) \, \log \frac{p(Y, X, \Theta \mid m)}{q(X) \, q(\Theta)} \;\leq\; \log p(Y \mid m) \,, \qquad (2)$$

where the inequality holds for an arbitrary $q$ and follows from Jensen's inequality (see [6]); it becomes an equality when $q$ is the true posterior. Note that $q$ is always understood to include conditioning on $Y$ as in (1). Since $\mathcal{F}_m$ is bounded from above by the marginal likelihood, we can obtain the optimal posteriors by maximizing it w.r.t. $q$. This can be shown to be equivalent to minimizing the KL distance between $q$ and the true posterior. Thus, optimizing $\mathcal{F}_m$ produces the best approximation to the true posterior within the space of distributions satisfying (1), as well as the tightest lower bound on the true marginal likelihood.

Penalizing complex models. To see that the VB objective function $\mathcal{F}_m$ penalizes complexity, it is useful to rewrite it as

$$\mathcal{F}_m = \left\langle \log \frac{p(Y, X \mid \Theta)}{q(X)} \right\rangle_{X, \Theta} - \mathrm{KL}\left[ q(\Theta) \,\|\, p(\Theta) \right] \,, \qquad (3)$$

where the average in the first term on the r.h.s. is taken w.r.t. $q(X, \Theta)$. The first term corresponds to the (averaged) likelihood. The second term is the KL distance between the prior and posterior over the parameters. As the number of parameters increases, the KL distance grows and consequently reduces $\mathcal{F}_m$.

This penalized likelihood interpretation becomes transparent in the large sample limit $N \to \infty$, where the parameter posterior is sharply peaked about the most probable value $\Theta = \Theta_0$. It can then be shown that the KL penalty reduces to $(|\Theta_0|/2) \log N$, which is linear in the number of parameters $|\Theta_0|$ of structure $m$. $\mathcal{F}_m$ then corresponds precisely to the Bayesian information criterion (BIC) and the minimum description length (MDL) criterion (see [3]). Thus, these popular model selection criteria follow as a limiting case of the VB framework.

Free-form optimization and an EM-like algorithm. Rather than assuming a specific parametric form for the posteriors, we let them fall out of free-form optimization of the VB objective function. This results in an iterative algorithm directly analogous to ordinary EM. In the E-step, we compute the posterior over the hidden nodes by solving $\partial \mathcal{F}_m / \partial q(X) = 0$ to get

$$q(X) \propto e^{\langle \log p(Y, X \mid \Theta) \rangle_\Theta} \,, \qquad (4)$$

where the average is taken w.r.t. $q(\Theta)$.

In the M-step, rather than the 'optimal' parameters, we compute the posterior distribution over the parameters by solving $\partial \mathcal{F}_m / \partial q(\Theta) = 0$ to get

$$q(\Theta) \propto e^{\langle \log p(Y, X \mid \Theta) \rangle_X} \, p(\Theta) \,, \qquad (5)$$

where the average is taken w.r.t. $q(X)$.

This is where the concept of conjugate priors becomes useful. Denoting the exponential term on the r.h.s. of (5) by $f(\Theta)$, we choose the prior $p(\Theta)$ from a family of distributions such that $q(\Theta) \propto f(\Theta) p(\Theta)$ belongs to that same family. $p(\Theta)$ is then said to be conjugate to $f(\Theta)$. This procedure allows us to select a prior from a fairly large family of distributions (which includes non-informative ones as limiting cases) and thus not compromise generality, while facilitating mathematical simplicity and elegance. In particular, learning in the VB framework simply amounts to updating the hyperparameters, i.e., transforming the prior parameters to the posterior parameters. We point out that, while the use of conjugate priors is widespread in statistics, so far they could only be applied to models where all nodes were visible.
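As a concrete illustration of 'learning amounts to updating the hyperparameters', the following sketch performs conjugate updating for a fully visible model: a Normal-Gamma prior on the mean and precision of scalar Gaussian data (no hidden nodes, so no E-step is needed). The prior values and variable names are our assumptions, not the paper's.

```python
import numpy as np

# Conjugate updating: with a Normal-Gamma prior on (mu, tau) for scalar
# Gaussian data, the posterior is again Normal-Gamma; learning just maps
# prior hyperparameters to posterior ones. Illustrative sketch only.

def normal_gamma_update(y, m0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Map prior hyperparameters (m0, kappa0, a0, b0) to posterior ones."""
    n = len(y)
    ybar = np.mean(y)
    ss = np.sum((y - ybar) ** 2)
    kappa_n = kappa0 + n
    m_n = (kappa0 * m0 + n * ybar) / kappa_n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * ss + kappa0 * n * (ybar - m0) ** 2 / (2.0 * kappa_n)
    return m_n, kappa_n, a_n, b_n

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=0.5, size=200)
print(normal_gamma_update(y))  # posterior mean m_n should be near 3.0
```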
Structure posterior. To compute $q(m)$ we exploit Jensen's inequality once again to define a more general objective function, $\mathcal{F}[q] = \sum_{m \in \mathcal{M}} q(m) \left[ \mathcal{F}_m + \log \frac{p(m)}{q(m)} \right] \leq \log p(Y)$, where now $q = q(X \mid m, Y) \, q(\Theta \mid m, Y) \, q(m \mid Y)$. After computing $\mathcal{F}_m$ for each $m \in \mathcal{M}$, the structure posterior is obtained by free-form optimization of $\mathcal{F}$:

$$q(m) \propto e^{\mathcal{F}_m} \, p(m) \,. \qquad (6)$$

Hence, prior assumptions about the likelihood of different structures, encoded by the prior $p(m)$, affect the selection of optimal model structures performed according to $q(m)$, as they should.
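Numerically, (6) is just a softmax over the per-structure objectives shifted by the log prior. A minimal sketch, with made-up $\mathcal{F}_m$ values standing in for the output of the VB iterations:

```python
import numpy as np

# q(m) ∝ exp(F_m) p(m): a softmax over free energies plus log prior.
# The F_m values below are made up, purely for illustration.
F = np.array([-4100.0, -4050.0, -4060.0, -4090.0])   # F_m for m = 1..4
log_prior = np.log(np.full(4, 0.25))                 # uniform p(m)

logits = F + log_prior
q_m = np.exp(logits - logits.max())                  # subtract max for stability
q_m /= q_m.sum()
print(q_m)  # mass concentrates on the structure with the largest F_m
```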
Predictive quantities. The ultimate goal of Bayesian inference is to estimate predictive quantities, such as a density or regression function. Generally, these quantities are computed by averaging over all models, weighting each model by its posterior. In the VB framework, exact model averaging is approximated by replacing the true posterior $p(\Theta \mid Y)$ by the variational $q(\Theta \mid Y)$. In density estimation, for example, the density assigned to a new data point $y$ is given by $p(y \mid Y) = \int d\Theta \, p(y \mid \Theta) \, q(\Theta \mid Y)$.

In some situations (e.g., source separation), an estimate of hidden node values $x$ from new data $y$ may be required. The relevant quantity here is the conditional $p(x \mid y, Y)$, from which the most likely value of the hidden nodes is extracted. VB approximates it by $p(x \mid y, Y) \propto \int d\Theta \, p(y, x \mid \Theta) \, q(\Theta \mid Y)$.

3 Variational Bayes Mixture Models

Mixture models have been investigated and analyzed extensively over many years. However, the well known problems of regularizing against likelihood divergences and of determining the required number of mixture components are still open. Whereas in theory the Bayesian approach provides a solution, no satisfactory practical algorithm has emerged from the application of involved sampling techniques (e.g., [7]) and approximation methods [3] to this problem. We now present the solution provided by VB.

We consider models of the form

$$p(y_n \mid \Theta, m) = \sum_{s=1}^{m} p(y_n \mid s_n = s, \Theta) \, p(s_n = s \mid \Theta) \,, \qquad (7)$$

where $y_n$ denotes the $n$th observed data vector, and $s_n$ denotes the hidden component that generated it. The components are labeled by $s = 1, \ldots, m$, with the structure parameter $m$ denoting the number of components. Whereas our approach can be applied to arbitrary models, for simplicity we consider here Normal component distributions, $p(y_n \mid s_n = s, \Theta) = \mathcal{N}(\mu_s, \Gamma_s)$, where $\mu_s$ is the mean and $\Gamma_s$ the precision (inverse covariance) matrix. The mixing proportions are $p(s_n = s \mid \Theta) = \pi_s$.

In hindsight, we use conjugate priors on the parameters $\Theta = \{\pi_s, \mu_s, \Gamma_s\}$. The mixing proportions are jointly Dirichlet, $p(\{\pi_s\}) = \mathcal{D}(\lambda^0)$, the means (conditioned on the precisions) are Normal, $p(\mu_s \mid \Gamma_s) = \mathcal{N}(\rho^0, \beta^0 \Gamma_s)$, and the precisions are Wishart, $p(\Gamma_s) = \mathcal{W}(\nu^0, \Phi^0)$. We find that the parameter posterior for a fixed $m$ factorizes into $q(\Theta) = q(\{\pi_s\}) \prod_s q(\mu_s, \Gamma_s)$. The posteriors are obtained by the following iterative algorithm, termed VB-MOG.

E-step. Compute the responsibilities for instance $n$ using (4):

$$\gamma_s^n \equiv q(s_n = s \mid y_n) \propto \bar{\pi}_s \, \bar{\Gamma}_s^{1/2} \, e^{-(y_n - \rho_s)^T \hat{\Gamma}_s (y_n - \rho_s)/2} \, e^{-d/(2\beta_s)} \,, \qquad (8)$$

noting that here $X = S$ and $q(S) = \prod_n q(s_n)$. This expression resembles the responsibilities in ordinary ML; the differences stem from integrating out the parameters. The special quantities in (8) are $\log \bar{\pi}_s \equiv \langle \log \pi_s \rangle = \psi(\lambda_s) - \psi(\sum_{s'} \lambda_{s'})$, $\log \bar{\Gamma}_s \equiv \langle \log |\Gamma_s| \rangle = \sum_{i=1}^{d} \psi((\nu_s + 1 - i)/2) - \log |\Phi_s| + d \log 2$, and $\hat{\Gamma}_s \equiv \langle \Gamma_s \rangle = \nu_s \Phi_s^{-1}$, where $\psi(x) = d \log \Gamma(x)/dx$ is the digamma function, and the averages $\langle \cdot \rangle$ are taken w.r.t. $q(\Theta)$. The other parameters are described below.

M-step. Compute the parameter posterior in two stages. First, compute the quantities

$$\bar{\pi}_s = \frac{1}{N} \sum_{n=1}^{N} \gamma_s^n \,, \qquad \bar{\mu}_s = \frac{1}{N_s} \sum_{n=1}^{N} \gamma_s^n \, y_n \,, \qquad \bar{\Sigma}_s = \frac{1}{N_s} \sum_{n=1}^{N} \gamma_s^n \, C_s^n \,, \qquad (9)$$

where $C_s^n = (y_n - \bar{\mu}_s)(y_n - \bar{\mu}_s)^T$ and $N_s = N \bar{\pi}_s$. This stage is identical to the M-step in ordinary EM, where it produces the new parameters. In VB, however, the quantities in (9) only help characterize the new parameter posteriors. These posteriors are functionally identical to the priors but have different parameter values. The mixing proportions are jointly Dirichlet, $q(\{\pi_s\}) = \mathcal{D}(\{\lambda_s\})$, the means are Normal, $q(\mu_s \mid \Gamma_s) = \mathcal{N}(\rho_s, \beta_s \Gamma_s)$, and the precisions are Wishart, $q(\Gamma_s) = \mathcal{W}(\nu_s, \Phi_s)$. The posterior parameters are updated in the second stage, using the simple rules

$$\lambda_s = N_s + \lambda^0 \,, \quad \beta_s = N_s + \beta^0 \,, \quad \nu_s = N_s + \nu^0 \,, \quad \rho_s = \frac{N_s \bar{\mu}_s + \beta^0 \rho^0}{N_s + \beta^0} \,, \quad \Phi_s = N_s \bar{\Sigma}_s + \frac{N_s \beta^0}{N_s + \beta^0} (\bar{\mu}_s - \rho^0)(\bar{\mu}_s - \rho^0)^T + \Phi^0 \,. \qquad (10)$$

The final values of the posterior parameters form the output of VB-MOG. We remark that (a) whereas no specific assumptions have been made about them, the parameter posteriors emerge in suitable, non-trivial (and generally non-Normal) functional forms; (b) the computational overhead of VB-MOG compared to EM is minimal; (c) the covariance of the parameter posterior is $O(1/N)$, and VB-MOG reduces to EM (regularized by the priors) as $N \to \infty$; (d) VB-MOG has no divergence problems; (e) stability is guaranteed by the existence of an objective function; and (f) the approximate marginal likelihood $\mathcal{F}_m$, required to optimize the number of components via (6), can also be obtained in closed form (omitted).
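To make one VB-MOG sweep concrete, here is a sketch of (8)-(10) specialized to one-dimensional data ($d = 1$, where the Wishart reduces to a Gamma). The hyperparameter initialization and synthetic data are our assumptions; the update rules follow the equations above.

```python
import numpy as np
from scipy.special import digamma

# One VB-MOG sweep, eqs (8)-(10), specialized to d = 1 (Wishart -> Gamma).
# Hyperparameter initialization is ours, for illustration.

def vb_mog_sweep(y, lam, beta, nu, rho, phi,
                 lam0=1.0, beta0=1.0, nu0=1.0, rho0=0.0, phi0=1.0):
    # --- E-step, eq (8): expected log weights / log precisions under q(Theta)
    log_pi = digamma(lam) - digamma(lam.sum())
    log_gam = digamma(nu / 2.0) - np.log(phi) + np.log(2.0)  # <log Gamma_s>, d=1
    gam_hat = nu / phi                                       # <Gamma_s>
    logits = (log_pi + 0.5 * log_gam
              - 0.5 * gam_hat * (y[:, None] - rho) ** 2
              - 0.5 / beta)                                  # shape (N, m)
    logits -= logits.max(axis=1, keepdims=True)
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)                        # responsibilities
    # --- M-step, eq (9): weighted sufficient statistics ---
    Ns = r.sum(axis=0)                                       # N_s = N * pi_bar_s
    mu_bar = (r * y[:, None]).sum(axis=0) / Ns
    sig_bar = (r * (y[:, None] - mu_bar) ** 2).sum(axis=0) / Ns
    # --- M-step, eq (10): hyperparameter updates ---
    lam, beta, nu = Ns + lam0, Ns + beta0, Ns + nu0
    rho = (Ns * mu_bar + beta0 * rho0) / (Ns + beta0)
    phi = Ns * sig_bar + Ns * beta0 * (mu_bar - rho0) ** 2 / (Ns + beta0) + phi0
    return lam, beta, nu, rho, phi

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
m = 2
params = (np.ones(m), np.ones(m), np.ones(m), np.array([-1.0, 1.0]), np.ones(m))
for _ in range(50):
    params = vb_mog_sweep(y, *params)
print(params[3])  # posterior means rho_s should approach [-2, 2]
```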
Predictive Density. Using our posteriors, we can integrate out the parameters and show that the density assigned by the model to a new data vector $y$ is a mixture of Student-t distributions,

$$p(y \mid Y) = \sum_{s=1}^{m} \bar{\pi}_s \, t_{\omega_s}(y \mid \rho_s, \Lambda_s) \,, \qquad (11)$$

where component $s$ has $\omega_s = \nu_s + 1 - d$ degrees of freedom, mean $\rho_s$, covariance $\Lambda_s = ((\beta_s + 1)/\beta_s \omega_s) \Phi_s$, and proportion $\bar{\pi}_s = \lambda_s / \sum_{s'} \lambda_{s'}$. Equation (11) reduces to a MOG as $N \to \infty$.
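A sketch of evaluating (11) for $d = 1$, where $\Lambda_s$ reduces to a squared scale for scipy's location-scale Student-t. The posterior hyperparameter values below are placeholders standing in for VB-MOG output.

```python
import numpy as np
from scipy.stats import t as student_t

# Predictive density, eq (11), for d = 1: a mixture of Student-t's with
# dof w_s = nu_s + 1 - d, mean rho_s, scale^2 = (beta_s+1)/(beta_s*w_s)*phi_s.
# Hyperparameter values are placeholders for VB-MOG output.

def predictive_density(y, lam, beta, nu, rho, phi, d=1):
    w = nu + 1 - d                                    # degrees of freedom
    scale = np.sqrt((beta + 1) / (beta * w) * phi)    # sqrt of Lambda_s, d = 1
    pi_bar = lam / lam.sum()                          # mixing proportions
    comps = student_t.pdf(y[:, None], df=w, loc=rho, scale=scale)
    return comps @ pi_bar                             # average over components

lam = np.array([301.0, 301.0]); beta = lam.copy(); nu = lam.copy()
rho = np.array([-2.0, 2.0]); phi = np.array([75.0, 75.0])
ygrid = np.linspace(-5, 5, 9)
print(predictive_density(ygrid, lam, beta, nu, rho, phi))
```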
Nonlinear Regression. We may divide each data vector into input and output parts, $y = (y^i, y^o)$, and use the model to estimate the regression function $y^o = f(y^i)$ and error spheres. These may be extracted from the conditional $p(y^o \mid y^i, Y) = \sum_{s=1}^{m} w_s \, t_{\omega'_s}(y^o \mid \rho'_s, \Lambda'_s)$, which also turns out to be a mixture of Student-t distributions, with the means $\rho'_s$ linear, and the covariances $\Lambda'_s$ and mixing proportions $w_s$ nonlinear, in $y^i$, all given in terms of the posterior parameters.

Figure 1: VB-MOG applied to handwritten digit recognition. (Left: examples of Buffalo post office digits; right: misclassification rate histograms.)

VB-MOG was applied to the Boston housing dataset (UCI machine learning repository), where 13 inputs are used to predict the single output, a house's price. 100 random divisions of the $N = 506$ dataset into 481 training and 25 test points were used, resulting in an average MSE of 11.9. Whereas ours is not a discriminative method, it was nevertheless competitive with Breiman's (1994) bagging technique using regression trees (MSE = 11.7). For comparison, EM achieved MSE = 14.6.

Classification. Here, a separate parameter posterior is computed for each class $c$ from a training dataset $Y^c$. A test data vector $y$ is then classified according to the conditional $p(c \mid y, \{Y^c\})$, which has a form identical to (11) (with $c$-dependent parameters) multiplied by the relative size of $Y^c$.

VB-MOG was applied to the Buffalo post office dataset, which contains 1100 examples for each digit 0-9. Each digit is a gray-level 8 x 8 pixel array (see examples in Fig. 1 (left)). We used 10 random 500-digit batches for training, and a separate batch of 200 for testing. An average misclassification rate of .018 was obtained using $m = 30$ components; EM achieved .025. The misclassification histograms (VB = solid, EM = dashed) are shown in Fig. 1 (right).

4 VB and Intractable Models: a Blind Separation Example

The discussion so far assumed that a free-form optimization of the VB objective function is feasible. Unfortunately, for many interesting models, in particular models where ordinary ML is intractable, this is not the case. For such models, we modify the VB procedure as follows: (a) specify a parametric functional form for the posterior over the hidden nodes $q(X)$, and optimize w.r.t. its parameters, in the spirit of [8]; (b) let the parameter posterior $q(\Theta)$ fall out of free-form optimization, as before.

We illustrate this approach in the context of the blind source separation (BSS) problem (see, e.g., [1]). This problem is described by $y_n = H x_n + u_n$, where $x_n$ is an unobserved $m$-dimensional source vector at instance $n$, $H$ is an unknown mixing matrix, and the noise $u_n$ is Normally distributed with an unknown precision $\lambda$. The task is to construct a source estimate $\hat{x}_n$ from the observed $d$-dimensional data $y_n$. The sources are independent and non-Normally distributed. Here we assume the high-kurtosis distribution $p(x_n^j) \propto \cosh^{-2}(x_n^j / 2)$, which is appropriate for modeling speech sources. One important but heretofore unresolved problem in BSS is determining the number $m$ of sources from data. Another is to avoid overfitting the mixing matrix. Both problems, typical of ML algorithms, can be remedied using VB.

It is the non-Normal nature of the sources that renders the source posterior $p(X \mid Y)$ intractable even before a Bayesian treatment. We use a Normal variational posterior $q(X) = \prod_n \mathcal{N}(x_n \mid \rho_n, \Gamma_n)$ with instance-dependent mean and precision. The mixing matrix posterior $q(H)$ then emerges as Normal. For simplicity, $\lambda$ is optimized rather than integrated out. The resulting VB-BSS algorithm runs as follows:

E-step. Optimize the variational mean $\rho_n$ by iterating to convergence, for each $n$, the fixed-point equation $\lambda \bar{H}^T (y_n - \bar{H} \rho_n) - \tanh(\rho_n / 2) = C^{-1} \rho_n$, where $\bar{H}$ is the posterior mean of $H$ and $C$ is the source covariance conditioned on the data. The variational precision matrix turns out to be $n$-independent: $\Gamma_n = \lambda \bar{H}^T \bar{H} + 1/2 + C^{-1}$.

M-step. Update the mean and precision of the posterior $q(H)$ (rules omitted).
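A sketch of the E-step fixed point as reconstructed above, solved by a damped gradient-style iteration; the damping scheme, the step-size bound, and all numerical values are our assumptions, and $\bar{H}$, $\lambda$, and $C^{-1}$ are simply treated as given.

```python
import numpy as np

# VB-BSS E-step sketch: iterate the fixed point
#   lam * H.T @ (y - H @ rho) - tanh(rho / 2) = C_inv @ rho
# for the variational mean of one instance. H, lam, C_inv are assumed given
# (e.g., current posterior means); the damped update is our choice.

def estep_mean(y, H, lam, C_inv, n_iter=500):
    rho = np.zeros(H.shape[1])
    # step < 1/L, with L an upper bound on the curvature of the concave objective
    step = 1.0 / (lam * np.linalg.norm(H.T @ H, 2)
                  + np.linalg.norm(C_inv, 2) + 0.5)
    for _ in range(n_iter):
        grad = lam * H.T @ (y - H @ rho) - np.tanh(rho / 2.0) - C_inv @ rho
        rho = rho + step * grad          # damped fixed-point / ascent step
    return rho

rng = np.random.default_rng(2)
d, m = 11, 5
H = rng.normal(size=(d, m))
x_true = rng.standard_t(df=3, size=m)    # heavy-tailed stand-in source
y = H @ x_true + 0.05 * rng.normal(size=d)
rho = estep_mean(y, H, lam=10.0, C_inv=np.eye(m))
print(np.round(rho, 2), np.round(x_true, 2))  # rho roughly recovers x_true
```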
Figure 2: Application of VB to blind source separation (see text). (Left: log posterior over the number of sources $m$; right: log source reconstruction error vs. SNR (dB).)

This algorithm was applied to 11-dimensional data generated by linearly mixing 5 100-msec-long speech and music signals obtained from commercial CDs. Gaussian noise was added at different SNR levels. A uniform structure prior $p(m) = 1/K$ for $m \leq K$ was used. The resulting posterior over the number of sources (Fig. 2 (left)) is peaked at the correct value $m = 5$. The sources were then reconstructed from test data via $p(x \mid y, Y)$. The log reconstruction error is plotted vs. SNR in Fig. 2 (right, solid). The ML error (which includes no model averaging) is also shown (dashed) and is larger, reflecting overfitting.

5 Conclusion

The VB framework is applicable to a large class of graphical models. In fact, it may be integrated with the junction tree algorithm to produce general inference engines with minimal overhead compared to ML ones. Dirichlet, Normal, and Wishart posteriors are not special to the models treated here but emerge as a general feature. Current research efforts include applications to multinomial models and to learning the structure of complex dynamic probabilistic networks.

Acknowledgements

I thank Matt Beal, Peter Dayan, David Mackay, Carl Rasmussen, and especially Zoubin Ghahramani, for important discussions.

References

[1] Attias, H. (1999). Independent Factor Analysis. Neural Computation 11, 803-851.
[2] Bishop, C.M. (1999). Variational Principal Component Analysis. Proc. 9th ICANN.
[3] Chickering, D.M. & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning 29, 181-212.
[4] Hinton, G.E. & Van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. Proc. 6th COLT, 5-13.
[5] Jaakkola, T. & Jordan, M.I. (1997). Bayesian logistic regression: a variational approach. Statistics and Artificial Intelligence 6 (Smyth, P. & Madigan, D., Eds.).
[6] Neal, R.M. & Hinton, G.E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, 355-368 (Jordan, M.I., Ed.). Kluwer Academic Press, Norwell, MA.
[7] Richardson, S. & Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B 59, 731-792.
[8] Saul, L.K., Jaakkola, T., & Jordan, M.I. (1996). Mean field theory of sigmoid belief networks. Journal of Artificial Intelligence Research 4, 61-76.
[9] Waterhouse, S., Mackay, D., & Robinson, T. (1996). Bayesian methods for mixtures of experts. NIPS-8 (Touretzky, D.S. et al., Eds.). MIT Press.
", "award": [], "sourceid": 1726, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}]}