{"title": "Variational Inference for Bayesian Mixtures of Factor Analysers", "book": "Advances in Neural Information Processing Systems", "page_first": 449, "page_last": 455, "abstract": null, "full_text": "Variational Inference for Bayesian Mixtures of Factor Analysers \n\nZoubin Ghahramani and Matthew J. Beal \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \n\n17 Queen Square, London WC1N 3AR, England \n\n{zoubin,m.beal}@gatsby.ucl.ac.uk \n\nAbstract \n\nWe present an algorithm that infers the model structure of a mixture \nof factor analysers using an efficient and deterministic variational \napproximation to full Bayesian integration over model parameters. \nThis procedure can automatically determine the optimal \nnumber of components and the local dimensionality of each \ncomponent (i.e. the number of factors in each factor analyser). \nAlternatively it can be used to infer posterior distributions over \nnumber of components and dimensionalities. Since all parameters \nare integrated out the method is not prone to overfitting. Using a \nstochastic procedure for adding components it is possible to perform \nthe variational optimisation incrementally and to avoid local \nmaxima. Results show that the method works very well in practice \nand correctly infers the number and dimensionality of nontrivial \nsynthetic examples. \nBy importance sampling from the variational approximation we \nshow how to obtain unbiased estimates of the true evidence, the \nexact predictive density, and the KL divergence between the variational \nposterior and the true posterior, not only in this model but \nfor variational approximations in general. \n\n1 Introduction \n\nFactor analysis (FA) is a method for modelling correlations in multidimensional \ndata. 
The model assumes that each p-dimensional data vector y was generated by \nfirst linearly transforming a k < p dimensional vector of unobserved independent \nzero-mean unit-variance Gaussian sources, x, and then adding a p-dimensional zero-mean \nGaussian noise vector, n, with diagonal covariance matrix $\\Psi$: i.e. $y = \\Lambda x + n$. \nIntegrating out x and n, the marginal density of y is Gaussian with zero mean \nand covariance $\\Lambda \\Lambda^T + \\Psi$. The matrix $\\Lambda$ is known as the factor loading matrix. \nGiven data with a sample covariance matrix $\\Sigma$, factor analysis finds the $\\Lambda$ and $\\Psi$ \nthat optimally fit $\\Sigma$ in the maximum likelihood sense. Since k < p, a single factor \nanalyser can be seen as a reduced parametrisation of a full-covariance Gaussian.$^1$ \n\n$^1$ Factor analysis and its relationship to principal components analysis (PCA) and mixture \nmodels is reviewed in [10]. \n\nA mixture of factor analysers (MFA) models the density for y as a weighted average \nof factor analyser densities \n\n$$P(y|\\Lambda, \\Psi, \\pi) = \\sum_{s=1}^{S} P(s|\\pi) P(y|s, \\Lambda^s, \\Psi), \\qquad (1)$$ \n\nwhere $\\pi$ is the vector of mixing proportions, s is a discrete indicator variable, and \n$\\Lambda^s$ is the factor loading matrix for factor analyser s which includes a mean vector \nfor y. \nBy exploiting the factor analysis parameterisation of covariance matrices, a mixture \nof factor analysers can be used to fit a mixture of Gaussians to correlated high \ndimensional data without requiring $O(p^2)$ parameters or undesirable compromises \nsuch as axis-aligned covariance matrices. In an MFA each Gaussian cluster has intrinsic \ndimensionality k (or $k_s$ if the dimensions are allowed to vary across clusters). \nConsequently, the mixture of factor analysers simultaneously addresses the problems \nof clustering and local dimensionality reduction. When $\\Psi$ is a multiple of the \nidentity the model becomes a mixture of probabilistic PCAs. 
Tractable maximum \nlikelihood procedures for fitting MFA and MPCA models can be derived from the \nExpectation Maximisation algorithm [4, 11]. \nThe maximum likelihood (ML) approach to MFA can easily get caught in local \nmaxima.$^2$ Ueda et al. [12] provide an effective deterministic procedure for avoiding \nlocal maxima by considering splitting a factor analyser in one part of space and \nmerging two in another part. But splits and merges have to be considered simultaneously \nbecause the number of factor analysers has to stay the same since adding \na factor analyser is always expected to increase the training likelihood. \nA fundamental problem with maximum likelihood approaches is that they fail to \ntake into account model complexity (i.e. the cost of coding the model parameters). \nSo more complex models are not penalised, which leads to overfitting and the \ninability to determine the best model size and structure (or distributions thereof) \nwithout resorting to costly cross-validation procedures. Bayesian approaches overcome \nthese problems by treating the parameters $\\theta$ as unknown random variables \nand averaging over the ensemble of models they define: \n\n$$P(Y) = \\int d\\theta\\, P(Y|\\theta) P(\\theta). \\qquad (2)$$ \n\n$P(Y)$ is the evidence for a data set $Y = \\{y^1, \\ldots, y^N\\}$. Integrating out parameters \npenalises models with more degrees of freedom since these models can a priori \nmodel a larger range of data sets. All information inferred from the data about the \nparameters is captured by the posterior distribution $P(\\theta|Y)$ rather than the ML \npoint estimate $\\hat{\\theta}$.$^3$ \nWhile Bayesian theory deals with the problems of overfitting and model selection/averaging, \nin practice it is often computationally and analytically intractable to \nperform the required integrals. For Gaussian mixture models Markov chain Monte \nCarlo (MCMC) methods have been developed to approximate these integrals by \nsampling [8, 7]. 
The main criticism of MCMC methods is that they are slow and \nit is usually difficult to assess convergence. Furthermore, the posterior density over \nparameters is stored as a set of samples, which can be inefficient. \n\n$^2$ Technically, the log likelihood is not bounded above if no constraints are put on the \ndeterminant of the component covariances. So the real ML objective for MFA is to find \nthe highest finite local maximum of the likelihood. \n\n$^3$ We sometimes use $\\theta$ to refer to the parameters and sometimes to all the unknown \nquantities (parameters and hidden variables). Formally the only difference between the two \nis that the number of hidden variables grows with N, whereas the number of parameters \nusually does not. \n\nAnother approach to Bayesian integration for Gaussian mixtures [9] is the Laplace \napproximation which makes a local Gaussian approximation around a maximum a \nposteriori parameter estimate. These approximations are based on large data limits \nand can be poor, particularly for small data sets (for which, in principle, the advantages \nof Bayesian integration over ML are largest). Local Gaussian approximations \nare also poorly suited to bounded or positive parameters such as the mixing proportions \nof the mixture model. Finally, it is difficult to see how this approach can \nbe applied to online incremental changes to model structure. \nIn this paper we employ a third approach to Bayesian inference: variational approximation. \nWe form a lower bound on the log evidence using Jensen's inequality: \n\n$$\\mathcal{L} \\equiv \\ln P(Y) = \\ln \\int d\\theta\\, P(Y, \\theta) \\geq \\int d\\theta\\, Q(\\theta) \\ln \\frac{P(Y, \\theta)}{Q(\\theta)} \\equiv \\mathcal{F}, \\qquad (3)$$ \n\nwhich we seek to maximise. Maximising $\\mathcal{F}$ is equivalent to minimising the KL \ndivergence between $Q(\\theta)$ and $P(\\theta|Y)$, so a tractable Q can be used as an approximation \nto the intractable posterior. 
This approach draws its roots from one way \nof deriving mean field approximations in physics, and has been used recently for \nBayesian inference [13, 5, 1]. \nThe variational method has several advantages over MCMC and Laplace approximations. \nUnlike MCMC, convergence can be assessed easily by monitoring $\\mathcal{F}$. The \napproximate posterior is encoded efficiently in $Q(\\theta)$. Unlike Laplace approximations, \nthe form of Q can be tailored to each parameter (in fact the optimal form \nof Q for each parameter falls out of the optimisation), the approximation is global, \nand Q optimises an objective function. Variational methods are generally fast, $\\mathcal{F}$ \nis guaranteed to increase monotonically and transparently incorporates model complexity. \nTo our knowledge, no one has done a full Bayesian analysis of mixtures of \nfactor analysers. \nOf course, vis-a-vis MCMC, the main disadvantage of variational approximations \nis that they are not guaranteed to find the exact posterior in the limit. However, \nwith a straightforward application of sampling, it is possible to take the result of \nthe variational optimisation and use it to sample from the exact posterior and exact \npredictive density. This is described in section 5. \nIn the remainder of this paper we first describe the mixture of factor analysers in \nmore detail (section 2). We then derive the variational approximation (section 3). \nWe show empirically that the model can infer both the number of components and \ntheir intrinsic dimensionalities, and is not prone to overfitting (section 6). Finally, \nwe conclude in section 7. \n\n2 The Model \n\nStarting from (1), the evidence for the Bayesian MFA is obtained by averaging the \nlikelihood under priors for the parameters (which have their own hyperparameters): \n\n$$\\begin{aligned} P(Y) = {} & \\int d\\pi\\, P(\\pi|\\alpha) \\int d\\nu\\, P(\\nu|a, b) \\int d\\Lambda\\, P(\\Lambda|\\nu) \\cdot \\\\ & \\prod_{n=1}^{N} \\left[ \\sum_{s^n=1}^{S} P(s^n|\\pi) \\int dx^n\\, P(x^n) P(y^n|x^n, s^n, \\Lambda^s, \\Psi) \\right]. \\end{aligned} \\qquad (4)$$ \n\n
Here $\\{\\alpha, a, b, \\Psi\\}$ are hyperparameters,$^4$ and $\\nu$ are precision parameters (i.e. inverse variances) \nfor the columns of $\\Lambda$. The conditional independence relations between the \nvariables in this model are shown graphically in the usual belief network representation \nin Figure 1. \n\nWhile arbitrary choices could be made for the \npriors on the first line of (4), choosing priors that \nare conjugate to the likelihood terms on the second \nline of (4) greatly simplifies inference and \ninterpretability.$^5$ So we choose $P(\\pi|\\alpha)$ to be \nsymmetric Dirichlet, which is conjugate to the \nmultinomial $P(s|\\pi)$. \nThe prior for the factor loading matrix plays a \nkey role in this model. Each component of the \nmixture has a Gaussian prior $P(\\Lambda^s|\\nu^s)$, where \neach element of the vector $\\nu^s$ is the precision of \na column of $\\Lambda$. If one of these precisions $\\nu^s_l \\to \\infty$, \nthen the outgoing weights for factor $x_l$ will go to \nzero, which allows the model to reduce the intrinsic \ndimensionality of x if the data does not \nwarrant this added dimension. This method of \nintrinsic dimensionality reduction has been used \nby Bishop [2] for Bayesian PCA, and is closely \nrelated to MacKay and Neal's method for automatic \nrelevance determination (ARD) for inputs \nto a neural network [6]. \n\nFigure 1: Generative model for \nvariational Bayesian mixture of \nfactor analysers. Circles denote \nrandom variables, solid rectangles \ndenote hyperparameters, and the \ndashed rectangle shows the plate \n(i.e. repetitions) over the data. \n\nTo avoid overfitting it is important to integrate out all parameters whose cardinality \nscales with model complexity (i.e. number of components and their dimensionalities). \nWe therefore also integrate out the precisions using Gamma priors, $P(\\nu|a, b)$. 
\n\n3 The Variational Approximation \n\nApplying Jensen's inequality repeatedly to the log evidence (4) we lower bound it \nusing the following factorisation of the distribution of parameters and hidden variables: \n$Q(\\Lambda) Q(\\pi, \\nu) Q(s, x)$. Given this factorisation several additional factorisations \nfall out of the conditional independencies in the model resulting in the variational \nobjective function: \n\n$$\\begin{aligned} \\mathcal{F} = {} & \\int d\\pi\\, Q(\\pi) \\ln \\frac{P(\\pi|\\alpha)}{Q(\\pi)} + \\sum_{s=1}^{S} \\int d\\nu^s\\, Q(\\nu^s) \\left[ \\ln \\frac{P(\\nu^s|a, b)}{Q(\\nu^s)} + \\int d\\Lambda^s\\, Q(\\Lambda^s) \\ln \\frac{P(\\Lambda^s|\\nu^s)}{Q(\\Lambda^s)} \\right] \\\\ & + \\sum_{n=1}^{N} \\sum_{s=1}^{S} Q(s^n) \\left[ \\int d\\pi\\, Q(\\pi) \\ln \\frac{P(s^n|\\pi)}{Q(s^n)} + \\int dx^n\\, Q(x^n|s^n) \\ln \\frac{P(x^n)}{Q(x^n|s^n)} \\right. \\\\ & \\left. + \\int d\\Lambda^s\\, Q(\\Lambda^s) \\int dx^n\\, Q(x^n|s^n) \\ln P(y^n|x^n, s^n, \\Lambda^s, \\Psi) \\right] \\end{aligned} \\qquad (5)$$ \n\nThe variational posteriors $Q(\\cdot)$, as given in the Appendix, are derived by performing \na free-form extremisation of $\\mathcal{F}$ w.r.t. Q. It is not difficult to show that these extrema \nare indeed maxima of $\\mathcal{F}$. The optimal posteriors Q are of the same conjugate forms \nas the priors. The model hyperparameters which govern the priors can be estimated \nin the same fashion (see the Appendix). \n\n$^4$ We currently do not integrate out $\\Psi$, although this can also be done. \n$^5$ Conjugate priors have the same effect as pseudo-observations. \n\n4 Birth and Death \nWhen optimising $\\mathcal{F}$, occasionally one finds that for some s: $\\sum_n Q(s^n) = 0$. These \nzero responsibility components are the result of there being insufficient support from \nthe local data to overcome the dimensional complexity prior on the factor loading \nmatrices. So components of the mixture die of natural causes when they are no \nlonger needed. Removing these redundant components increases $\\mathcal{F}$. \nComponent birth does not happen spontaneously, so we introduce a heuristic. 
\nWhenever $\\mathcal{F}$ has stabilised we pick a parent-component stochastically with probability \nproportional to $e^{-\\beta \\mathcal{F}_s}$ and attempt to split it into two; $\\mathcal{F}_s$ is the s-specific \ncontribution to $\\mathcal{F}$ with the last bracketed term in (5) normalised by $\\sum_n Q(s^n)$. \nThis works better than both cycling through components and picking them at random \nas it concentrates attempted births on components that are faring poorly. The \nparameter distributions of the two Gaussians created from the split are initialised \nby partitioning the responsibilities for the data, $Q(s^n)$, along a direction sampled \nfrom the parent's distribution. This usually causes $\\mathcal{F}$ to decrease, so by monitoring \nthe future progress of $\\mathcal{F}$ we can reject this attempted birth if $\\mathcal{F}$ does not recover. \nAlthough it is perfectly possible to start the model with many components and let \nthem die, it is computationally more efficient to start with one component and allow \nit to spawn more when necessary. \n\n5 Exact Predictive Density, True Evidence, and KL \n\nBy importance sampling from the variational approximation we can obtain unbiased \nestimates of three important quantities: the exact predictive density, the true log \nevidence $\\mathcal{L}$, and the KL divergence between the variational posterior and the true \nposterior. Letting $\\theta = \\{\\Lambda, \\pi\\}$, we sample $\\theta_i \\sim Q(\\theta)$. Each such sample is an instance \nof a mixture of factor analysers with predictive density given by (1). We weight \nthese predictive densities by the importance weights $w_i = P(\\theta_i, Y)/Q(\\theta_i)$, which \nare easy to evaluate. This results in a mixture of mixtures of factor analysers, and \nwill converge to the exact predictive density, $P(y|Y)$, as long as $Q(\\theta) > 0$ wherever \n$P(\\theta|Y) > 0$. The true log evidence can be similarly estimated by $\\mathcal{L} = \\ln \\langle w \\rangle$, where \n$\\langle \\cdot \\rangle$ denotes averaging over the importance samples. Finally, the KL divergence is \ngiven by: $\\mathrm{KL}(Q(\\theta) \\| P(\\theta|Y)) = \\ln \\langle w \\rangle - \\langle \\ln w \\rangle$. \nThis procedure has three significant properties. 
First, the same importance weights \ncan be used to estimate all three quantities. Second, while importance sampling \ncan work very poorly in high dimensions for ad hoc proposal distributions, here the \nvariational optimisation is used in a principled manner to pick Q to be a good approximation \nto P and therefore hopefully a good proposal distribution. Third, this \nprocedure can be applied to any variational approximation. A detailed exposition \ncan be found in [3]. \n\n6 Results \nExperiment 1: Discovering the number of components. We tested the \nmodel on synthetic data generated from a mixture of 18 Gaussians with 50 points \nper cluster (Figure 2, top left). The variational algorithm has little difficulty finding \nthe correct number of components and the birth heuristics are successful at avoiding \nlocal maxima. After finding the 18 Gaussians repeated splits are attempted and \nrejected. Finding a distribution over number of components using $\\mathcal{F}$ is also simple. \n\nExperiment 2: The shrinking spiral. We used the dataset of 800 data points \nfrom a shrinking spiral from [12] as another test of how well the algorithm could \nescape local maxima and how robust it was to initial conditions (Figure 2, bottom). \nAgain local maxima did not pose a problem and the algorithm always found between \n12-14 Gaussians regardless of whether it was initialised with 0 or 200. These runs \ntook about 3-4 minutes on a 500MHz Alpha EV6 processor. A plot of $\\mathcal{F}$ shows that \nmost of the compute time is spent on accepted moves (Figure 3, left). \n\nFigure 2: (top) Exp 1: The frames from left to right are the data, and the 2 S.D. Gaussian \nellipses after 7, 14, 16 and 22 accepted births. (bottom) Exp 2: Shrinking spiral data \nand 1 S.D. Gaussian ellipses after 6, 9, 12, and 17 accepted births. Note that the number \nof Gaussians increases from left to right. \n\n[Figure 3, right panel: table of number of points per cluster against intrinsic dimensionalities.] \n\nFigure 3: (left) Exp 2: $\\mathcal{F}$ as function of iteration for the spiral problem on a typical run. \nDrops in $\\mathcal{F}$ constitute component births. Thick lines are accepted attempts, thin lines are \nrejected attempts. (middle) Exp 3: Means of the factor loading matrices. These results \nare analogous to those given by Bishop [2] for Bayesian PCA. (right) Exp 3: Table with \nlearned number of Gaussians and dimensionalities as training set size increases. Boxes \nrepresent model components that capture several of the clusters. \n\n
Experiment 3: Discovering the local dimensionalities. We generated a synthetic \ndata set of 300 data points in each of 6 Gaussians with intrinsic dimensionalities \n(7 4 3 2 2 1) embedded in 10 dimensions. The variational Bayesian approach \ncorrectly inferred both the number of Gaussians and their intrinsic dimensionalities \n(Figure 3, middle). We varied the number of data points and found that as expected \nwith fewer points the data could not provide evidence for as many components and \nintrinsic dimensions (Figure 3, right). \n\n7 Discussion \nSearch over model structures for MFAs is computationally intractable if each factor \nanalyser is allowed to have different intrinsic dimensionalities. In this paper we have \nshown that the variational Bayesian approach can be used to efficiently infer this \nmodel structure while avoiding overfitting and other deficiencies of ML approaches. \nOne attraction of our variational method, which can be exploited in other models, \nis that once a factorisation of Q is assumed all inference is automatic and exact. \nWe can also use $\\mathcal{F}$ to get a distribution over structures if desired. 
Finally we derive \n\n\fVariational Inference for Bayesian Mixtures of Factor Analysers \n\n455 \n\na generally applicable importance sampler that gives us unbiased estimates of the \ntrue evidence, the exact predictive density, and the KL divergence between the \nvariational posterior and the true posterior. \nEncouraged by the results on synthetic data, we have applied the Bayesian mixture \nof factor analysers to a real-world unsupervised digit classification problem. We \nwill report the results of these experiments in a separate article. \nAppendix: Optimal Q Distributions and Hyperparameters \nQ(xnlsn) '\" N(xn,\", E S ) \n\nQ(A~) \"\"' N(X:, Eq ,S) \n\nQ(vl) \"\"' Q(ai,bl) \n\nQ(1r) '\" D(wu) \n\nIn Q(sn) = [1jJ(wus ) -1jJ(w)] + 2 In IEs I + (In p(ynlxn, sn, AS , 'IT)) + c \n\n1 \n\nxn,s=EsxsT'lT-lyn, X:= ['IT-l\"tQ(sn)ynxn 'STEq,S] , ai=a+~, bi=b+~:t(A~12) \nE s - 1= (AST 'IT -1 AS) + I, Eq ,S -~ 'IT;ql L Q(sn)(xnxnT) +diag(vS), wUs = ~ + L Q(sn) \nwhere {N, Q, D} denote Normal, Gamma and Dirichlet distributions respectively, (-) de(cid:173)\nnotes expectation under the variational posterior, and 1jJ (x) is the digamma function \n1jJ(x) == tx lnr(x). Note that the optimal distributions Q(AS) have block diagonal co(cid:173)\n\nq=l \nN \n\nn=l \n\nn=l \n\nn=l \n\nN \n\nq \n\nvariance structure; even though each AS is a p x q matrix, its covariance only has O(pq2) \nparameters. Differentiating:F with respect to the parameters, a and b, of the precision pri(cid:173)\nor we get fixed point equations 1jJ(a) = (In v)+lnb and b = a/(v). Similarly the fixed point \nfor the parameters of the Dirichlet prior is 1jJ(a) -1jJ(a/S) + 2: [1jJ(wu s ) -1jJ(w)]/S = o. \nReferences \n[1] H. Attias. Inferring parameters and structure of latent variable models by variational \n\nBayes. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence, 1999. \n\n[2] C.M. Bishop. Variational PCA. In Proc. Ninth Int. Conf. on Artificial Neural Net(cid:173)\n\nworks. ICANN, 1999. 
\n\n[3] Z. Ghahramani, H. Attias, and M.J. Beal. Learning model structure. Technical \nReport GCNU-TR-1999-006, (in prep.) Gatsby Unit, Univ. College London, 1999. \n\n[4] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of factor \nanalyzers. Technical Report CRG-TR-96-1 [http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-1.ps.gz], \nDept. of Comp. Sci., Univ. of Toronto, 1996. \n\n[5] D.J.C. MacKay. Ensemble learning for hidden Markov models. Technical report, \nCavendish Laboratory, University of Cambridge, 1997. \n\n[6] R.M. Neal. Assessing relevance determination methods using DELVE. In C.M. Bishop, \neditor, Neural Networks and Machine Learning, 97-129. Springer-Verlag, 1998. \n\n[7] C.E. Rasmussen. The infinite gaussian mixture model. In Adv. Neur. Inf. Proc. Sys. \n12. MIT Press, 2000. \n\n[8] S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an unknown \nnumber of components. J. Roy. Stat. Soc.-Ser. B, 59(4):731-758, 1997. \n\n[9] S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to Gaussian \nmixture modeling. IEEE PAMI, 20(11):1133-1142, 1998. \n\n[10] S.T. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural \nComputation, 11(2):305-345, 1999. \n\n[11] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analyzers. \nNeural Computation, 11(2):443-482, 1999. \n\n[12] N. Ueda, R. Nakano, Z. Ghahramani, and G.E. Hinton. SMEM algorithm for mixture \nmodels. In Adv. Neur. Inf. Proc. Sys. 11. MIT Press, 1999. \n\n[13] S. Waterhouse, D.J.C. Mackay, and T. Robinson. Bayesian methods for mixtures of \nexperts. In Adv. Neur. Inf. Proc. Sys. 1. MIT Press, 1995. \n", "award": [], "sourceid": 1672, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Matthew", "family_name": "Beal", "institution": null}]}