{"title": "Variational Inference for Bayesian Mixtures of Factor Analysers", "book": "Advances in Neural Information Processing Systems", "page_first": 449, "page_last": 455, "abstract": null, "full_text": "Variational  Inference for  Bayesian \n\nMixtures of Factor Analysers \n\nZoubin  Ghahramani  and  Matthew J.  Beal \n\nGatsby Computational Neuroscience Unit \n\nUniversity  College  London \n\n17 Queen Square, London WC1N  3AR, England \n\n{zoubin,m.beal}Ggatsby.ucl.ac.uk \n\nAbstract \n\nWe  present an algorithm that infers the model structure of a  mix(cid:173)\nture of factor  analysers using  an efficient  and  deterministic  varia(cid:173)\ntional  approximation to  full  Bayesian  integration  over  model  pa(cid:173)\nrameters.  This  procedure  can  automatically  determine  the  opti(cid:173)\nmal  number  of components  and  the  local  dimensionality  of each \ncomponent  (Le.  the  number  of  factors  in  each  factor  analyser) . \nAlternatively  it  can  be  used  to  infer  posterior  distributions  over \nnumber of components and dimensionalities.  Since  all  parameters \nare integrated out the method is  not prone to overfitting.  Using a \nstochastic  procedure for  adding  components  it is  possible  to  per(cid:173)\nform  the variational optimisation incrementally and to avoid  local \nmaxima.  Results show that the method works very well in practice \nand  correctly  infers  the  number  and  dimensionality  of  nontrivial \nsynthetic examples. \nBy  importance  sampling  from  the  variational  approximation  we \nshow  how  to  obtain  unbiased  estimates  of the  true  evidence,  the \nexact predictive density,  and the KL divergence between the varia(cid:173)\ntional posterior and the true posterior, not only in this  model  but \nfor  variational approximations in general. \n\n1 \n\nIntroduction \n\nFactor  analysis  (FA)  is  a  method  for  modelling  correlations  in  multidimensional \ndata.  The model  assumes that each p-dimensional data vector y  was  generated by \nfirst  linearly  transforming a  k  < p  dimensional  vector  of unobserved  independent \nzero-mean unit-variance Gaussian sources, x, and then adding a p-dimensional zero(cid:173)\nmean Gaussian noise vector, n, with diagonal covariance matrix \\}!:  i.e.  y  =  Ax+n. \nIntegrating  out  x  and  n,  the  marginal  density  of y  is  Gaussian  with  zero  mean \nand  covariance  AA T  +  \\}!.  The  matrix  A is  known  as  the  factor  loading  matrix. \nGiven  data with  a  sample  covariance matrix I:,  factor  analysis  finds  the  A and  \\}! \nthat optimally fit  I:  in the maximum likelihood sense.  Since  k  < p,  a  single factor \nanalyser can be seen as a  reduced parametrisation of a full-covariance  Gaussian. 1 \n\nIFactor analysis and its relationship to principal components analysis  (peA) and mix(cid:173)\n\nture models is  reviewed in  (10). \n\n\f450 \n\nZ.  Ghahramani and M. J.  Heal \n\nA mixture of factor analysers (MFA)  models the density for y  as a weighted average \nof factor  analyser densities \n\ns \n\nP(yjA, q,,7r)  =  LP(sj7r)P(yjs,AS, '11), \n\n(1) \n\ns=1 \n\nwhere 7r  is  the vector of mixing proportions,  s  is  a  discrete indicator variable,  and \nA S  is  the factor  loading  matrix for  factor  analyser  s  which  includes  a  mean  vector \nfor  y. 
By exploiting the factor analysis parameterisation of covariance matrices, a mixture of factor analysers can be used to fit a mixture of Gaussians to correlated high-dimensional data without requiring O(p²) parameters or undesirable compromises such as axis-aligned covariance matrices. In an MFA each Gaussian cluster has intrinsic dimensionality k (or k_s if the dimensions are allowed to vary across clusters). Consequently, the mixture of factor analysers simultaneously addresses the problems of clustering and local dimensionality reduction. When Ψ is a multiple of the identity the model becomes a mixture of probabilistic PCAs. Tractable maximum likelihood procedures for fitting MFA and MPCA models can be derived from the Expectation Maximisation algorithm [4, 11].

The maximum likelihood (ML) approach to MFA can easily get caught in local maxima.^2 Ueda et al. [12] provide an effective deterministic procedure for avoiding local maxima by considering splitting a factor analyser in one part of space and merging two in another part. But splits and merges have to be considered simultaneously, because the number of factor analysers has to stay the same: adding a factor analyser is always expected to increase the training likelihood.

^2 Technically, the log likelihood is not bounded above if no constraints are put on the determinant of the component covariances. So the real ML objective for MFA is to find the highest finite local maximum of the likelihood.

A fundamental problem with maximum likelihood approaches is that they fail to take into account model complexity (i.e. the cost of coding the model parameters). So more complex models are not penalised, which leads to overfitting and the inability to determine the best model size and structure (or distributions thereof) without resorting to costly cross-validation procedures. Bayesian approaches overcome these problems by treating the parameters θ as unknown random variables and averaging over the ensemble of models they define:

P(Y) = ∫ dθ P(Y|θ) P(θ).   (2)

P(Y) is the evidence for a data set Y = {y^1, ..., y^N}. Integrating out parameters penalises models with more degrees of freedom, since these models can a priori model a larger range of data sets. All information inferred from the data about the parameters is captured by the posterior distribution P(θ|Y) rather than the ML point estimate θ̂.^3

^3 We sometimes use θ to refer to the parameters and sometimes to all the unknown quantities (parameters and hidden variables). Formally the only difference between the two is that the number of hidden variables grows with N, whereas the number of parameters usually does not.

While Bayesian theory deals with the problems of overfitting and model selection/averaging, in practice it is often computationally and analytically intractable to perform the required integrals. For Gaussian mixture models, Markov chain Monte Carlo (MCMC) methods have been developed to approximate these integrals by sampling [8, 7]. The main criticism of MCMC methods is that they are slow and it is usually difficult to assess convergence. Furthermore, the posterior density over parameters is stored as a set of samples, which can be inefficient.
Another approach to Bayesian integration for Gaussian mixtures [9] is the Laplace approximation, which makes a local Gaussian approximation around a maximum a posteriori parameter estimate. These approximations are based on large data limits and can be poor, particularly for small data sets (for which, in principle, the advantages of Bayesian integration over ML are largest). Local Gaussian approximations are also poorly suited to bounded or positive parameters such as the mixing proportions of the mixture model. Finally, it is difficult to see how this approach can be applied to online incremental changes to model structure.

In this paper we employ a third approach to Bayesian inference: variational approximation. We form a lower bound on the log evidence using Jensen's inequality:

L ≡ ln P(Y) = ln ∫ dθ P(Y, θ) ≥ ∫ dθ Q(θ) ln [P(Y, θ)/Q(θ)] ≡ F,   (3)

which we seek to maximise. Maximising F is equivalent to minimising the KL divergence between Q(θ) and P(θ|Y), so a tractable Q can be used as an approximation to the intractable posterior. This approach draws its roots from one way of deriving mean field approximations in physics, and has been used recently for Bayesian inference [13, 5, 1].

The variational method has several advantages over MCMC and Laplace approximations. Unlike MCMC, convergence can be assessed easily by monitoring F. The approximate posterior is encoded efficiently in Q(θ). Unlike Laplace approximations, the form of Q can be tailored to each parameter (in fact the optimal form of Q for each parameter falls out of the optimisation), the approximation is global, and Q optimises an objective function. Variational methods are generally fast, F is guaranteed to increase monotonically, and the bound transparently incorporates model complexity. To our knowledge, no one has done a full Bayesian analysis of mixtures of factor analysers.

Of course, vis-à-vis MCMC, the main disadvantage of variational approximations is that they are not guaranteed to find the exact posterior in the limit. However, with a straightforward application of sampling, it is possible to take the result of the variational optimisation and use it to sample from the exact posterior and exact predictive density. This is described in section 5.

In the remainder of this paper we first describe the mixture of factor analysers in more detail (section 2). We then derive the variational approximation (section 3). We show empirically that the model can infer both the number of components and their intrinsic dimensionalities, and is not prone to overfitting (section 6). Finally, we conclude in section 7.

2 The Model

Starting from (1), the evidence for the Bayesian MFA is obtained by averaging the likelihood under priors for the parameters (which have their own hyperparameters):

P(Y) = ∫ dπ P(π|α) ∫ dν P(ν|a, b) ∫ dΛ P(Λ|ν)
       × ∏_{n=1}^{N} [ ∑_{s^n} P(s^n|π) ∫ dx^n P(x^n) P(y^n|x^n, s^n, Λ^{s^n}, Ψ) ].   (4)

Here {α, a, b, Ψ} are hyperparameters^4 and ν are precision parameters (i.e. inverse variances) for the columns of Λ. The conditional independence relations between the variables in this model are shown graphically in the usual belief network representation in Figure 1.

^4 We currently do not integrate out Ψ, although this can also be done.

Figure 1: Generative model for the variational Bayesian mixture of factor analysers. Circles denote random variables, solid rectangles denote hyperparameters, and the dashed rectangle shows the plate (i.e. repetitions) over the data.
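As a hedged illustration of the generative process in (4), the sketch below samples a synthetic data set using the conjugate priors specified in the next paragraphs (symmetric Dirichlet on π, Gamma on the precisions ν, Gaussian columns of Λ). The hyperparameter values, the fixed noise level, and the omission of explicit component means are all assumptions made for brevity, not settings from the paper.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative hyperparameter values (not taken from the paper).
S, p, k = 3, 5, 2          # components, data dimension, factors per component
alpha, a, b = 1.0, 1.0, 1.0
Psi = 0.1 * np.eye(p)      # diagonal noise covariance (kept fixed, as in the paper)

# Priors on the parameters, following the first line of (4).
pi = rng.dirichlet(np.full(S, alpha / S))              # P(pi | alpha), symmetric Dirichlet
nu = rng.gamma(shape=a, scale=1.0 / b, size=(S, k))    # P(nu | a, b), column precisions
Lambdas = [rng.standard_normal((p, k)) / np.sqrt(nu_s) for nu_s in nu]  # P(Lambda | nu)

# Likelihood terms, following the second line of (4); means are omitted for brevity.
N = 100
s = rng.choice(S, size=N, p=pi)                        # P(s^n | pi)
x = rng.standard_normal((N, k))                        # P(x^n), unit-variance factors
Y = np.stack([Lambdas[s[n]] @ x[n] +
              rng.multivariate_normal(np.zeros(p), Psi) for n in range(N)])
print(Y.shape)   # (N, p) synthetic data set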
While arbitrary choices could be made for the priors on the first line of (4), choosing priors that are conjugate to the likelihood terms on the second line of (4) greatly simplifies inference and interpretability.^5 So we choose P(π|α) to be symmetric Dirichlet, which is conjugate to the multinomial P(s|π).

^5 Conjugate priors have the same effect as pseudo-observations.

The prior for the factor loading matrix plays a key role in this model. Each component of the mixture has a Gaussian prior P(Λ^s|ν^s), where each element of the vector ν^s is the precision of a column of Λ^s. If one of these precisions ν_l → ∞, then the outgoing weights for factor x_l will go to zero, which allows the model to reduce the intrinsic dimensionality of x if the data does not warrant this added dimension. This method of intrinsic dimensionality reduction has been used by Bishop [2] for Bayesian PCA, and is closely related to MacKay and Neal's method for automatic relevance determination (ARD) for inputs to a neural network [6].

To avoid overfitting it is important to integrate out all parameters whose cardinality scales with model complexity (i.e. the number of components and their dimensionalities). We therefore also integrate out the precisions using Gamma priors, P(ν|a, b).

3 The Variational Approximation

Applying Jensen's inequality repeatedly to the log evidence (4) we lower bound it using the following factorisation of the distribution of parameters and hidden variables: Q(Λ)Q(π, ν)Q(s, x). Given this factorisation, several additional factorisations fall out of the conditional independencies in the model, resulting in the variational objective function:

F = ∫ dπ Q(π) ln [P(π|α)/Q(π)]
    + ∑_s ∫ dν^s Q(ν^s) [ ln (P(ν^s|a, b)/Q(ν^s)) + ∫ dΛ^s Q(Λ^s) ln (P(Λ^s|ν^s)/Q(Λ^s)) ]
    + ∑_{n=1}^{N} ∑_{s^n} Q(s^n) [ ∫ dπ Q(π) ln (P(s^n|π)/Q(s^n)) + ∫ dx^n Q(x^n|s^n) ln (P(x^n)/Q(x^n|s^n))
    + ∫ dΛ^{s^n} Q(Λ^{s^n}) ∫ dx^n Q(x^n|s^n) ln P(y^n|x^n, s^n, Λ^{s^n}, Ψ) ]   (5)

The variational posteriors Q(·), as given in the Appendix, are derived by performing a free-form extremisation of F w.r.t. Q. It is not difficult to show that these extrema are indeed maxima of F. The optimal posteriors Q are of the same conjugate forms as the priors. The model hyperparameters which govern the priors can be estimated in the same fashion (see the Appendix).
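To illustrate how these conjugate updates implement the automatic dimensionality reduction described in section 2, the sketch below applies the Gamma posterior for the column precisions (given in the Appendix) to a loading matrix whose last column is nearly zero: that column receives a very large expected precision and its factor can be switched off. The hyperprior values, the pruning threshold, the example matrix, and the use of posterior mean loadings in place of full second moments are all simplifying assumptions for the sake of illustration.

import numpy as np

a, b = 1e-3, 1e-3          # broad Gamma hyperprior on the column precisions (illustrative)
p = 10

# Posterior mean loadings for one factor analyser: 3 columns, the last nearly zero.
Lambda_mean = np.column_stack([np.full(p, 1.0),
                               np.full(p, 0.5),
                               np.full(p, 1e-3)])

# Gamma posterior per column: a_tilde = a + p/2, b_tilde = b + 0.5 * sum_q <Lambda_ql^2>
# (here <Lambda_ql^2> is approximated by the squared mean, ignoring posterior variance).
a_tilde = a + p / 2.0
b_tilde = b + 0.5 * (Lambda_mean ** 2).sum(axis=0)
expected_precision = a_tilde / b_tilde       # mean of Gamma(a_tilde, b_tilde), rate form

keep = expected_precision < 1e4              # arbitrary pruning threshold
print(expected_precision)                    # third column gets a huge precision ...
print(keep)                                  # ... so its factor is effectively pruned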
4 Birth and Death

When optimising F, occasionally one finds that for some s: ∑_n Q(s^n) = 0. These zero-responsibility components are the result of there being insufficient support from the local data to overcome the dimensional complexity prior on the factor loading matrices. So components of the mixture die of natural causes when they are no longer needed. Removing these redundant components increases F.

Component birth does not happen spontaneously, so we introduce a heuristic. Whenever F has stabilised we pick a parent component stochastically with probability proportional to e^{−βF_s} and attempt to split it into two; F_s is the s-specific contribution to F with the last bracketed term in (5) normalised by ∑_n Q(s^n). This works better than both cycling through components and picking them at random, as it concentrates attempted births on components that are faring poorly. The parameter distributions of the two Gaussians created from the split are initialised by partitioning the responsibilities for the data, Q(s^n), along a direction sampled from the parent's distribution. This usually causes F to decrease, so by monitoring the future progress of F we can reject this attempted birth if F does not recover.

Although it is perfectly possible to start the model with many components and let them die, it is computationally more efficient to start with one component and allow it to spawn more when necessary.

5 Exact Predictive Density, True Evidence, and KL

By importance sampling from the variational approximation we can obtain unbiased estimates of three important quantities: the exact predictive density, the true log evidence L, and the KL divergence between the variational posterior and the true posterior. Letting θ = {Λ, π}, we sample θ_i ~ Q(θ). Each such sample is an instance of a mixture of factor analysers with predictive density given by (1). We weight these predictive densities by the importance weights w_i = P(θ_i, Y)/Q(θ_i), which are easy to evaluate. This results in a mixture of mixtures of factor analysers, and will converge to the exact predictive density, P(y|Y), as long as Q(θ) > 0 wherever P(θ|Y) > 0. The true log evidence can be similarly estimated by L = ln⟨w⟩, where ⟨·⟩ denotes averaging over the importance samples. Finally, the KL divergence is given by: KL(Q(θ)||P(θ|Y)) = ln⟨w⟩ − ⟨ln w⟩.

This procedure has three significant properties. First, the same importance weights can be used to estimate all three quantities. Second, while importance sampling can work very poorly in high dimensions for ad hoc proposal distributions, here the variational optimisation is used in a principled manner to pick Q to be a good approximation to P and therefore hopefully a good proposal distribution. Third, this procedure can be applied to any variational approximation. A detailed exposition can be found in [3].
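In code, these estimators are only a few lines. The sketch below is a minimal version assuming that log-density functions log_p_joint (for ln P(θ, Y)) and log_q (for ln Q(θ)) are available from a fitted variational model; those names are assumptions, not the paper's interface. The log weights are combined with a log-sum-exp for numerical stability. The same weights would also reweight the per-sample predictive densities (1) to estimate P(y|Y).

import numpy as np

def evidence_and_kl(log_p_joint, log_q, theta_samples):
    """Estimate ln P(Y) and KL(Q || P(.|Y)) from samples theta_i ~ Q.

    log_p_joint(theta) returns ln P(theta, Y); log_q(theta) returns ln Q(theta).
    Both are assumed to be supplied by the fitted variational model.
    """
    log_w = np.array([log_p_joint(t) - log_q(t) for t in theta_samples])   # ln w_i
    # ln<w> computed stably as a log-mean-exp of the log weights.
    log_evidence = np.logaddexp.reduce(log_w) - np.log(len(log_w))
    kl = log_evidence - log_w.mean()       # KL(Q||P) = ln<w> - <ln w> >= 0
    return log_evidence, kl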
6 Results

Experiment 1: Discovering the number of components. We tested the model on synthetic data generated from a mixture of 18 Gaussians with 50 points per cluster (Figure 2, top left). The variational algorithm has little difficulty finding the correct number of components and the birth heuristics are successful at avoiding local maxima. After finding the 18 Gaussians, repeated splits are attempted and rejected. Finding a distribution over the number of components using F is also simple.

Figure 2: (top) Exp 1: The frames from left to right are the data, and the 2 s.d. Gaussian ellipses after 7, 14, 16 and 22 accepted births. (bottom) Exp 2: Shrinking spiral data and 1 s.d. Gaussian ellipses after 6, 9, 12, and 17 accepted births. Note that the number of Gaussians increases from left to right.

Experiment 2: The shrinking spiral. We used the dataset of 800 data points from a shrinking spiral from [12] as another test of how well the algorithm could escape local maxima and how robust it was to initial conditions (Figure 2, bottom). Again local maxima did not pose a problem and the algorithm always found between 12-14 Gaussians regardless of whether it was initialised with 1 or 200. These runs took about 3-4 minutes on a 500MHz Alpha EV6 processor. A plot of F shows that most of the compute time is spent on accepted moves (Figure 3, left).

Figure 3: (left) Exp 2: F as a function of iteration for the spiral problem on a typical run. Drops in F constitute component births. Thick lines are accepted attempts, thin lines are rejected attempts. (middle) Exp 3: Means of the factor loading matrices. These results are analogous to those given by Bishop [2] for Bayesian PCA. (right) Exp 3: Table with learned number of Gaussians and dimensionalities as the training set size increases; boxes represent model components that capture several of the clusters.

Experiment 3: Discovering the local dimensionalities. We generated a synthetic data set of 300 data points in each of 6 Gaussians with intrinsic dimensionalities (7, 4, 3, 2, 2, 1) embedded in 10 dimensions. The variational Bayesian approach correctly inferred both the number of Gaussians and their intrinsic dimensionalities (Figure 3, middle). We varied the number of data points and found that, as expected, with fewer points the data could not provide evidence for as many components and intrinsic dimensions (Figure 3, right).

7 Discussion

Search over model structures for MFAs is computationally intractable if each factor analyser is allowed to have a different intrinsic dimensionality. In this paper we have shown that the variational Bayesian approach can be used to efficiently infer this model structure while avoiding overfitting and other deficiencies of ML approaches. One attraction of our variational method, which can be exploited in other models, is that once a factorisation of Q is assumed all inference is automatic and exact. We can also use F to get a distribution over structures if desired. Finally, we derive a generally applicable importance sampler that gives us unbiased estimates of the true evidence, the exact predictive density, and the KL divergence between the variational posterior and the true posterior.
Encouraged by the results on synthetic data, we have applied the Bayesian mixture of factor analysers to a real-world unsupervised digit classification problem. We will report the results of these experiments in a separate article.

Appendix: Optimal Q Distributions and Hyperparameters

Q(x^n|s^n) ~ N(x̄^{n,s}, Σ^s),   Q(Λ̃^s_q) ~ N(Λ̄^s_q, Σ^{q,s}),   Q(ν^s_l) ~ G(ã, b̃^s_l),   Q(π) ~ D(ωu)

ln Q(s^n) = [ψ(ωu_s) − ψ(ω)] + ½ ln|Σ^s| + ⟨ln P(y^n|x^n, s^n, Λ^s, Ψ)⟩ + c

x̄^{n,s} = Σ^s ⟨Λ^s⟩^T Ψ^{−1} y^n,   Λ̄^s_q = [Ψ^{−1}_{qq} ∑_{n=1}^{N} Q(s^n) y^n_q x̄^{n,sT}] Σ^{q,s},   ã = a + p/2,   b̃^s_l = b + ½ ∑_{q=1}^{p} ⟨(Λ^s_{ql})²⟩

Σ^{s,−1} = ⟨Λ^{sT} Ψ^{−1} Λ^s⟩ + I,   Σ^{q,s,−1} = Ψ^{−1}_{qq} ∑_{n=1}^{N} Q(s^n) ⟨x^n x^{nT}⟩ + diag(⟨ν^s⟩),   ωu_s = α/S + ∑_{n=1}^{N} Q(s^n)

where {N, G, D} denote Normal, Gamma and Dirichlet distributions respectively, ⟨·⟩ denotes expectation under the variational posterior, and ψ(x) is the digamma function ψ(x) ≡ d/dx ln Γ(x). Note that the optimal distributions Q(Λ^s) have block diagonal covariance structure; even though each Λ^s is a p × k matrix, its covariance only has O(pk²) parameters. Differentiating F with respect to the parameters, a and b, of the precision prior we get the fixed point equations ψ(a) = ⟨ln ν⟩ + ln b and b = a/⟨ν⟩. Similarly, the fixed point for the parameters of the Dirichlet prior is ψ(α) − ψ(α/S) + ∑_s [ψ(ωu_s) − ψ(ω)]/S = 0.
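For concreteness, here is a minimal NumPy sketch of two of the updates above: the Gaussian posterior Q(x^n|s^n) and an unnormalised log responsibility ln Q(s^n). It substitutes the posterior mean of Λ^s for its full expectation and a plug-in Gaussian log-likelihood at x̄ for ⟨ln P(y^n|x^n, s^n, Λ^s, Ψ)⟩; both shortcuts are simplifications made only to keep the sketch short, not part of the paper's derivation.

import numpy as np
from scipy.special import digamma

def q_x_given_s(y, Lambda_s, Psi_diag):
    """Q(x^n|s^n): Sigma_s = (I + Lambda_s^T Psi^-1 Lambda_s)^-1,
    x_bar = Sigma_s Lambda_s^T Psi^-1 y (posterior mean of Lambda_s used as a shortcut)."""
    k = Lambda_s.shape[1]
    Psi_inv = 1.0 / Psi_diag                         # Psi is diagonal
    Sigma_s = np.linalg.inv(np.eye(k) + (Lambda_s.T * Psi_inv) @ Lambda_s)
    x_bar = Sigma_s @ (Lambda_s.T * Psi_inv) @ y
    return x_bar, Sigma_s

def log_responsibility(y, Lambda_s, Psi_diag, omega_u_s, omega):
    """Unnormalised ln Q(s^n), with <ln P(y|x,s,Lambda,Psi)> replaced by a plug-in
    Gaussian log-likelihood evaluated at x_bar (an illustrative simplification)."""
    x_bar, Sigma_s = q_x_given_s(y, Lambda_s, Psi_diag)
    resid = y - Lambda_s @ x_bar
    loglik = -0.5 * np.sum(np.log(2 * np.pi * Psi_diag) + resid**2 / Psi_diag)
    return (digamma(omega_u_s) - digamma(omega)
            + 0.5 * np.linalg.slogdet(Sigma_s)[1] + loglik)

# Tiny usage example with arbitrary values.
y = np.ones(4)
Lam = np.eye(4)[:, :2]
print(log_responsibility(y, Lam, 0.1 * np.ones(4), 2.0, 6.0))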
References

[1] H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[2] C.M. Bishop. Variational PCA. In Proc. Ninth Int. Conf. on Artificial Neural Networks (ICANN), 1999.
[3] Z. Ghahramani, H. Attias, and M.J. Beal. Learning model structure. Technical Report GCNU-TR-1999-006 (in prep.), Gatsby Unit, Univ. College London, 1999.
[4] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1 [http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-1.ps.gz], Dept. of Comp. Sci., Univ. of Toronto, 1996.
[5] D.J.C. MacKay. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, University of Cambridge, 1997.
[6] R.M. Neal. Assessing relevance determination methods using DELVE. In C.M. Bishop, editor, Neural Networks and Machine Learning, 97-129. Springer-Verlag, 1998.
[7] C.E. Rasmussen. The infinite Gaussian mixture model. In Adv. Neur. Inf. Proc. Sys. 12. MIT Press, 2000.
[8] S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Stat. Soc. Ser. B, 59(4):731-758, 1997.
[9] S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to Gaussian mixture modeling. IEEE PAMI, 20(11):1133-1142, 1998.
[10] S.T. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-345, 1999.
[11] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482, 1999.
[12] N. Ueda, R. Nakano, Z. Ghahramani, and G.E. Hinton. SMEM algorithm for mixture models. In Adv. Neur. Inf. Proc. Sys. 11. MIT Press, 1999.
[13] S. Waterhouse, D.J.C. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. In Adv. Neur. Inf. Proc. Sys. 8. MIT Press, 1995.
", "award": [], "sourceid": 1672, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Matthew", "family_name": "Beal", "institution": null}]}