{"title": "EM Algorithms for PCA and SPCA", "book": "Advances in Neural Information Processing Systems", "page_first": 626, "page_last": 632, "abstract": "", "full_text": "EM Algorithms for PCA and SPCA \n\nSam Roweis\u00b7 \n\nAbstract \n\nI present an expectation-maximization (EM) algorithm for principal \ncomponent analysis (PCA). The algorithm allows a few eigenvectors and \neigenvalues to be extracted from large collections of high dimensional \ndata. It is computationally very efficient in space and time. It also natu(cid:173)\nrally accommodates missing infonnation. I also introduce a new variant \nof PC A called sensible principal component analysis (SPCA) which de(cid:173)\nfines a proper density model in the data space. Learning for SPCA is also \ndone with an EM algorithm. I report results on synthetic and real data \nshowing that these EM algorithms correctly and efficiently find the lead(cid:173)\ning eigenvectors of the covariance of datasets in a few iterations using up \nto hundreds of thousands of datapoints in thousands of dimensions. \n\n1 Why EM for peA? \nPrincipal component analysis (PCA) is a widely used dimensionality reduction technique in \ndata analysis. Its popularity comes from three important properties. First, it is the optimal \n(in tenns of mean squared error) linear scheme for compressing a set of high dimensional \nvectors into a set of lower dimensional vectors and then reconstructing. Second, the model \nparameters can be computed directly from the data - for example by diagonalizing the \nsample covariance. Third, compression and decompression are easy operations to perfonn \ngiven the model parameters - they require only matrix multiplications. \n\nDespite these attractive features however, PCA models have several shortcomings. One is \nthat naive methods for finding the principal component directions have trouble with high \ndimensional data or large numbers of datapoints. 
Consider attempting to diagonalize the \nsample covariance matrix of n vectors in a space of p dimensions when n and p are several \nhundred or several thousand. Difficulties can arise both in the form of computational complexity and also data scarcity.1 Even computing the sample covariance itself is very costly, \nrequiring O(np^2) operations. In general it is best to avoid altogether computing the sample \n\n\u2022 roweis@cns.caltech.edu; Computation & Neural Systems, California Institute of Tech. \n1 On the data scarcity front, we often do not have enough data in high dimensions for the sample \ncovariance to be of full rank and so we must be careful to employ techniques which do not require full \nrank matrices. On the complexity front, direct diagonalization of a symmetric matrix thousands of \nrows in size can be extremely costly since this operation is O(p^3) for p x p inputs. Fortunately, several \ntechniques exist for efficient matrix diagonalization when only the first few leading eigenvectors and \neigenvalues are required (for example the power method [10] which is only O(p^2)). \n\ncovariance explicitly. Methods such as the snap-shot algorithm [7] do this by assuming that \nthe eigenvectors being searched for are linear combinations of the datapoints; their complexity is O(n^3). In this note, I present a version of the expectation-maximization (EM) \nalgorithm [1] for learning the principal components of a dataset. The algorithm does not require computing the sample covariance and has a complexity limited by O(knp) operations \nwhere k is the number of leading eigenvectors to be learned. \n\nAnother shortcoming of standard approaches to PCA is that it is not obvious how to deal \nproperly with missing data. 
Most of the methods discussed above cannot accommodate \nmissing values and so incomplete points must either be discarded or completed using a \nvariety of ad-hoc interpolation methods. On the other hand, the EM algorithm for PCA \nenjoys all the benefits [4] of other EM algorithms in terms of estimating the maximum \nlikelihood values for missing information directly at each iteration. \n\nFinally, the PCA model itself suffers from a critical flaw which is independent of the technique used to compute its parameters: it does not define a proper probability model in the \nspace of inputs. This is because the density is not normalized within the principal subspace. \nIn other words, if we perform PCA on some data and then ask how well new data are fit \nby the model, the only criterion used is the squared distance of the new data from their \nprojections into the principal subspace. A datapoint far away from the training data but \nnonetheless near the principal subspace will be assigned a high \"pseudo-likelihood\" or low \nerror. Similarly, it is not possible to generate \"fantasy\" data from a PCA model. In this note \nI introduce a new model called sensible principal component analysis (SPCA), an obvious \nmodification of PCA, which does define a proper covariance structure in the data space. Its \nparameters can also be learned with an EM algorithm, given below. \n\nIn summary, the methods developed in this paper provide three advantages. They allow \nsimple and efficient computation of a few eigenvectors and eigenvalues when working with \nmany datapoints in high dimensions. They permit this computation even in the presence of \nmissing data. On a real vision problem with missing information, I have computed the 10 \nleading eigenvectors and eigenvalues of 2^17 points in 2^12 dimensions in a few hours using \nMATLAB on a modest workstation. 
Finally, through a small variation, these methods allow \nthe computation not only of the principal subspace but of a complete Gaussian probabilistic \nmodel which allows one to generate data and compute true likelihoods. \n\n2 Whence EM for PCA? \nPrincipal component analysis can be viewed as a limiting case of a particular class of linear-Gaussian models. The goal of such models is to capture the covariance structure of an observed p-dimensional variable y using fewer than the p(p+1)/2 free parameters required in \na full covariance matrix. Linear-Gaussian models do this by assuming that y was produced \nas a linear transformation of some k-dimensional latent variable x plus additive Gaussian \nnoise. Denoting the transformation by the p x k matrix C, and the (p-dimensional) noise \nby v (with covariance matrix R) the generative model can be written as2 \n\ny = Cx + v \n\nx ~ N(0, I) \n\nv ~ N(0, R) \n\n(1a) \n\nThe latent or cause variables x are assumed to be independent and identically distributed \naccording to a unit variance spherical Gaussian. Since v are also independent and normally \ndistributed (and assumed independent of x), the model reduces to a single Gaussian model \n\n2 All vectors are column vectors. To denote the transpose of a vector or matrix I use the notation \nx^T. The determinant of a matrix is denoted by |A| and matrix inversion by A^-1. The zero matrix \nis 0 and the identity matrix is I. The symbol ~ means \"distributed according to\". A multivariate \nnormal (Gaussian) distribution with mean \u00b5 and covariance matrix \u03a3 is written as N(\u00b5, \u03a3). The \nsame Gaussian evaluated at the point x is denoted N(\u00b5, \u03a3)|x. \n\nfor y which we can write explicitly: \n\ny ~ N(0, CC^T + R) \n\n(1b) \n\nIn order to save parameters over the direct covariance representation in p-space, it is necessary to choose k < p and also to restrict the covariance structure of the Gaussian noise v by \nconstraining the matrix R.3 For example, if the shape of the noise distribution is restricted \nto be axis aligned (its covariance matrix is diagonal) the model is known as factor analysis. \n\n2.1 Inference and learning \nThere are two central problems of interest when working with the linear-Gaussian models \ndescribed above. The first problem is that of state inference or compression which asks: \ngiven fixed model parameters C and R, what can be said about the unknown hidden states \nx given some observations y? Since the datapoints are independent, we are interested in \nthe posterior probability P(x|y) over a single hidden state given the corresponding single \nobservation. This can be easily computed by linear matrix projection and the resulting \ndensity is itself Gaussian: \n\nP(x|y) = P(y|x)P(x)/P(y) = N(Cx, R)|y N(0, I)|x / N(0, CC^T + R)|y \n\n(2a) \n\nP(x|y) = N(\u03b2y, I - \u03b2C)|x, \u03b2 = C^T(CC^T + R)^-1 \n\n(2b) \n\nfrom which we obtain not only the expected value \u03b2y of the unknown state but also an \nestimate of the uncertainty in this value in the form of the covariance I - \u03b2C. Computing \ny from x (reconstruction) is also straightforward: P(y|x) = N(Cx, R)|y. Finally, \ncomputing the likelihood of any datapoint y is merely an evaluation under (1b). \n\nThe second problem is that of learning, or parameter fitting which consists of identifying \nthe matrices C and R that make the model assign the highest likelihood to the observed \ndata. 
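As an illustration, the inference formula (2b) is easy to sketch in numpy. This is my own transcription, not code from the paper; the helper name and test model are illustrative only.

```python
import numpy as np

def infer_posterior(y, C, R):
    # Posterior over the hidden state x for one observation y under the
    # linear-Gaussian model y = Cx + v (equation 2b):
    # P(x|y) = N(beta y, I - beta C), with beta = C^T (C C^T + R)^-1.
    k = C.shape[1]
    beta = C.T @ np.linalg.inv(C @ C.T + R)  # k x p gain matrix
    return beta @ y, np.eye(k) - beta @ C    # posterior mean, covariance
```

As R shrinks toward zero the posterior mean approaches the least-squares projection of y onto the columns of C and the posterior covariance vanishes, which is exactly the PCA limit taken in section 2.2 below.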
There are a family of EM algorithms to do this for the various cases of restrictions to \nR but all follow a similar structure: they use the inference formula (2b) above in the e-step \nto estimate the unknown state and then choose C and the restricted R in the m-step so as \nto maximize the expected joint likelihood of the estimated x and the observed y. \n\n2.2 Zero noise limit \nPrincipal component analysis is a limiting case of the linear-Gaussian model as the covariance of the noise v becomes infinitesimally small and equal in all directions. Mathematically, PCA is obtained by taking the limit R = lim_{\u03b5->0} \u03b5I. This has the effect of making \nthe likelihood of a point y dominated solely by the squared distance between it and its reconstruction Cx. The directions of the columns of C which minimize this error are known \nas the principal components. Inference now reduces to4 simple least squares projection: \n\nP(x|y) = N(\u03b2y, I - \u03b2C)|x, \u03b2 = lim_{\u03b5->0} C^T(CC^T + \u03b5I)^-1 \n\n(3a) \n\nP(x|y) = N((C^T C)^-1 C^T y, 0)|x = \u03b4(x - (C^T C)^-1 C^T y) \n\n(3b) \n\nSince the noise has become infinitesimal, the posterior over states collapses to a single \npoint and the covariance becomes zero. \n\n3 This restriction on R is not merely to save on parameters: the covariance of the observation noise \nmust be restricted in some way for the model to capture any interesting or informative projections in \nthe state x. If R were not restricted, the learning algorithm could simply choose C = 0 and then \nset R to be the covariance of the data thus trivially achieving the maximum likelihood model by \nexplaining all of the structure in the data as noise. (Remember that since the model has reduced to a \nsingle Gaussian distribution for y we can do no better than having the covariance of our model equal \nthe sample covariance of our data.) 
\n\n4 Recall that if C is p x k with p > k and is rank k then left multiplication by C^T(CC^T)^-1 \n(which appears not to be well defined because (CC^T) is not invertible) is exactly equivalent to left \nmultiplication by (C^T C)^-1 C^T. The intuition is that even though CC^T truly is not invertible, the \ndirections along which it is not invertible are exactly those which C^T is about to project out. \n\n3 An EM algorithm for PCA \nThe key observation of this note is that even though the principal components can be computed explicitly, there is still an EM algorithm for learning them. It can be easily derived as \nthe zero noise limit of the standard algorithms (see for example [3, 2] and section 4 below) \nby replacing the usual e-step with the projection above. The algorithm is: \n\n\u2022 e-step: X = (C^T C)^-1 C^T Y \n\u2022 m-step: C^new = Y X^T (X X^T)^-1 \n\nwhere Y is a p x n matrix of all the observed data and X is a k x n matrix of the unknown \nstates. The columns of C will span the space of the first k principal components. (To compute the corresponding eigenvectors and eigenvalues explicitly, the data can be projected \ninto this k-dimensional subspace and an ordered orthogonal basis for the covariance in the \nsubspace can be constructed.) Notice that the algorithm can be performed online using \nonly a single datapoint at a time and so its storage requirements are only O(kp) + O(k^2). \nThe workings of the algorithm are illustrated graphically in figure 1 below. \n\n[Figure 1 appears here: two scatter plots over axes x1 and x2, panel titles Gaussian Input Data and Non-Gaussian Input Data.] \n\nFigure 1: Examples of iterations of the algorithm. 
The left panel shows the learning of the first \nprincipal component of data drawn from a Gaussian distribution, while the right panel shows learning \non data from a non-Gaussian distribution. The dashed lines indicate the direction of the leading \neigenvector of the sample covariance. The dashed ellipse is the one standard deviation contour of \nthe sample covariance. The progress of the algorithm is indicated by the solid lines whose directions \nindicate the guess of the eigenvector and whose lengths indicate the guess of the eigenvalue at each \niteration. The iterations are numbered; number 0 is the initial condition. Notice that the difficult \nlearning on the right does not get stuck in a local minimum, although it does take more than 20 \niterations to converge which is unusual for Gaussian data (see figure 2). \n\nThe intuition behind the algorithm is as follows: guess an orientation for the principal \nsubspace. Fix the guessed subspace and project the data y into it to give the values of the \nhidden states x. Now fix the values of the hidden states and choose the subspace orientation \nwhich minimizes the squared reconstruction errors of the datapoints. For the simple two-dimensional example above, I can give a physical analogy. Imagine that we have a rod \npinned at the origin which is free to rotate. Pick an orientation for the rod. Holding the \nrod still, project every datapoint onto the rod, and attach each projected point to its original \npoint with a spring. Now release the rod. Repeat. The direction of the rod represents our \nguess of the principal component of the dataset. The energy stored in the springs is the \nreconstruction error we are trying to minimize. \n\n3.1 Convergence and Complexity \nThe EM learning algorithm for PCA amounts to an iterative procedure for finding the subspace spanned by the k leading eigenvectors without explicit computation of the sample \ncovariance. 
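For concreteness, the two-line iteration of section 3 can be sketched in numpy as follows. This is my own transcription of the e-step and m-step above; the function name, random initialization, and fixed iteration count are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def em_pca(Y, k, n_iter=50, seed=0):
    # EM for the principal subspace (section 3). Y is p x n with one
    # observation per column. Returns C whose k columns span the space
    # of the first k principal components.
    p, _ = Y.shape
    C = np.random.default_rng(seed).standard_normal((p, k))  # random guess
    for _ in range(n_iter):
        X = np.linalg.solve(C.T @ C, C.T @ Y)      # e-step: project data
        C = Y @ X.T @ np.linalg.inv(X @ X.T)       # m-step: refit subspace
    return C
```

To recover the eigenvectors and eigenvalues explicitly, one can orthonormalize C (for example with a QR decomposition) and then diagonalize the covariance of the projected data, as described in the parenthetical remark above.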
It is attractive for small k because its complexity is limited by O(knp) per iteration and so depends only linearly on both the dimensionality of the data and the number \nof points. Methods that explicitly compute the sample covariance matrix have complexities \nlimited by O(np^2), while methods like the snap-shot method that form linear combinations \nof the data must compute and diagonalize a matrix of all possible inner products between \npoints and thus are limited by O(n^2 p) complexity. The complexity scaling of the algorithm \ncompared to these methods is shown in figure 2 below. For each dimensionality, a random covariance matrix \u03a3 was generated5 and then 10p points were drawn from N(0, \u03a3). \nThe number of floating point operations required to find the first principal component was \nrecorded using MATLAB's flops function. As expected, the EM algorithm scales more \nfavourably in cases where k is small and both p and n are large. If k \u2248 p \u2248 n (we want all \nthe eigenvectors) then all methods are O(p^3). \n\nThe standard convergence proofs for EM [1] apply to this algorithm as well, so we can be \nsure that it will always reach a local maximum of likelihood. Furthermore, Tipping and \nBishop have shown [8, 9] that the only stable local extremum is the global maximum at \nwhich the true principal subspace is found; so it converges to the correct result. Another \npossible concern is that the number of iterations required for convergence may scale with \np or n. To investigate this question, I have explicitly computed the leading eigenvector for \nsynthetic data sets (as above, with n = 10p) of varying dimension and recorded the number \nof iterations of the EM algorithm required for the inner product of the eigendirection with \nthe current guess of the algorithm to be 0.999 or greater. Up to 450 dimensions (4500 \ndatapoints), the number of iterations remains roughly constant with a mean of 3.6. 
The \nratios of the first k eigenvalues seem to be the critical parameters controlling the number of \niterations until convergence. (For example, in figure 1b this ratio was 1.0001.) \n\n[Figure 2 appears here: two panels, left panel showing time complexity with legend entries EM method, Sample Covariance + Diag., and Sample Covariance only; right panel titled Convergence Behaviour.] \n\nFigure 2: Time complexity and convergence behaviour of the algorithm. In all cases, the number \nof datapoints n is 10 times the dimensionality p. For the left panel, the number of floating point \noperations to find the leading eigenvector and eigenvalue were recorded. The EM algorithm was \nalways run for exactly 20 iterations. The cost shown for diagonalization of the sample covariance \nuses the MATLAB functions cov and eigs. The snap-shot method is shown to indicate scaling only; \none would not normally use it when n > p. In the right hand panel, convergence was investigated \nby explicitly computing the leading eigenvector and then running the EM algorithm until the dot \nproduct of its guess and the true eigendirection was 0.999 or more. The error bars show \u00b1 one \nstandard deviation across many runs. The dashed line shows the number of iterations used to produce \nthe EM algorithm curve ('+') in the left panel. \n\n5 First, an axis-aligned covariance is created with the p eigenvalues drawn at random from a uniform distribution in some positive range. Then (p - 1) points are drawn from a p-dimensional zero \nmean spherical Gaussian and the axes are aligned in space using these points. \n\n3.2 Missing data \nIn the complete data setting, the values of the projections or hidden states x are viewed as \nthe \"missing information\" for EM. During the e-step we compute these values by projecting \nthe observed data into the current subspace. This minimizes the model error given the \nobserved data and the model parameters. 
However, if some of the input points are missing \ncertain coordinate values, we can easily estimate those values in the same fashion. Instead \nof estimating only x as the value which minimizes the squared distance between the point \nand its reconstruction we can generalize the e-step to: \n\n\u2022 generalized e-step: For each (possibly incomplete) point y find the unique pair of \npoints x* and y* (such that x* lies in the current principal subspace and y* lies in \nthe subspace defined by the known information about y) which minimize the norm \n||Cx* - y*||. Set the corresponding column of X to x* and the corresponding \ncolumn of Y to y*. \n\nIf y is complete, then y* = y and x* is found exactly as before. If not, then x* and y* are \nthe solution to a least squares problem and can be found by, for example, QR factorization \nof a particular constraint matrix. Using this generalized e-step I have found the leading \nprincipal components for datasets in which every point is missing some coordinates. \n\n4 Sensible Principal Component Analysis \nIf we require R to be a multiple \u03b5I of the identity matrix (in other words the covariance \nellipsoid of v is spherical) but do not take the limit as \u03b5 -> 0 then we have a model which \nI shall call sensible principal component analysis or SPCA. The columns of C are still \nknown as the principal components (it can be shown that they are the same as in regular \nPCA) and we will call the scalar value \u03b5 on the diagonal of R the global noise level. Note \nthat SPCA uses 1 + pk - k(k - 1)/2 free parameters to model the covariance. Once \nagain, inference is done with equation (2b). Notice however, that even though the principal \ncomponents found by SPCA are the same as those for PCA, the mean of the posterior is \nnot in general the same as the point given by the PCA projection (3b). Learning for SPCA \nalso uses an EM algorithm (given below). 
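One way to realize the generalized e-step for a single incomplete point is sketched below. This is my own formulation rather than the paper's QR-based solver: since the free coordinates of y* can always be set equal to the corresponding coordinates of Cx*, the joint minimization reduces to a least-squares fit over the observed rows of C alone.

```python
import numpy as np

def e_step_missing(y, observed, C):
    # Generalized e-step for one partially observed point (section 3.2).
    # 'observed' is a boolean mask over the coordinates of y. Minimizing
    # ||C x* - y*|| with the missing coordinates of y* left free reduces
    # to least squares on the observed rows of C; the missing coordinates
    # of y* are then filled in from the reconstruction C x*.
    x_star, *_ = np.linalg.lstsq(C[observed], y[observed], rcond=None)
    y_star = np.where(observed, y, C @ x_star)
    return x_star, y_star
```

When the mask is all True this returns exactly the complete-data projection; when some coordinates are masked out, x* is unique as long as the observed rows of C still have rank k.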
\n\nBecause it has a finite noise level \u03b5, SPCA defines a proper generative model and probability \ndistribution in the data space: \n\ny ~ N(0, CC^T + \u03b5I) \n\n(4) \n\nwhich makes it possible to generate data from or to evaluate the actual likelihood of new test \ndata under an SPCA model. Furthermore, this likelihood will be much lower for data far \nfrom the training set even if they are near the principal subspace, unlike the reconstruction \nerror reported by a PCA model. \n\nThe EM algorithm for learning an SPCA model is: \n\n\u2022 e-step: \u03b2 = C^T(CC^T + \u03b5I)^-1 \u00b5_x = \u03b2Y \u03a3_x = nI - n\u03b2C + \u00b5_x \u00b5_x^T \n\u2022 m-step: C^new = Y \u00b5_x^T \u03a3_x^-1 \u03b5^new = trace[YY^T - C^new \u00b5_x Y^T]/np \n\nTwo subtle points about complexity6 are important to notice; they show that learning for \nSPCA also enjoys a complexity limited by O(knp) and not worse. \n\n6 First, since \u03b5I is diagonal, the inversion in the e-step can be performed efficiently using the \nmatrix inversion lemma: (CC^T + \u03b5I)^-1 = (I/\u03b5 - C(I + C^T C/\u03b5)^-1 C^T/\u03b5^2). Second, since we \nare only taking the trace of the matrix in the m-step, we do not need to compute the full sample \ncovariance YY^T but instead can compute only the variance along each coordinate. \n\n5 Relationships to previous methods \nThe EM algorithm for PCA, derived above using probabilistic arguments, is closely related \nto two well-known sets of algorithms. The first are power iteration methods for solving matrix eigenvalue problems. Roughly speaking, these methods iteratively update their eigenvector estimates through repeated multiplication by the matrix to be diagonalized. In the \ncase of PCA, explicitly forming the sample covariance and multiplying by it to perform \nsuch power iterations would be disastrous. However since the sample covariance is in fact \na sum of outer products of individual vectors, we can multiply by it efficiently without ever \ncomputing it. 
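The trick just mentioned can be checked in a couple of lines (my own sketch): a product with the sample covariance S = YY^T/n is computed as Y(Y^T v)/n, which costs O(np) per multiply instead of the O(p^2) needed once S has been formed.

```python
import numpy as np

# Multiply a vector by the sample covariance without ever forming it.
rng = np.random.default_rng(4)
Y = rng.standard_normal((50, 400))   # p x n data matrix
v = rng.standard_normal(50)
S = Y @ Y.T / 400                    # explicit covariance, for checking only
assert np.allclose(S @ v, Y @ (Y.T @ v) / 400)
```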
In fact, the EM algorithm is exactly equivalent to performing power iterations \nfor finding C using this trick. Iterative methods for partial least squares (e.g. the NIPALS \nalgorithm) are doing the same trick for regression. Taking the singular value decomposition \n(SVD) of the data matrix directly is a related way to find the principal subspace. If Lanczos \nor Arnoldi methods are used to compute this SVD, the resulting iterations are similar \nto those of the EM algorithm. Space prohibits detailed discussion of these sophisticated \nmethods, but two excellent general references are [5, 6]. The second class of methods are \nthe competitive learning methods for finding the principal subspace such as Sanger's and \nOja's rules. These methods enjoy the same storage and time complexities as the EM algorithm; however their update steps reduce but do not minimize the cost and so they typically \nneed more iterations and require a learning rate parameter to be set by hand. \n\nAcknowledgements \nI would like to thank John Hopfield and my fellow graduate students for constant and excellent \nfeedback on these ideas. In particular I am grateful to Erik Winfree for significant contributions to the \nmissing data portion of this work, to Dawei Dong who provided image data to try as a real problem, \nas well as to Carlos Brody, Sanjoy Mahajan, and Maneesh Sahani. The work of Zoubin Ghahramani \nand Geoff Hinton was an important motivation for this study. Chris Bishop and Mike Tipping are \npursuing independent but yet unpublished work on a virtually identical model. The comments of \nthree anonymous reviewers and many visitors to my poster improved this manuscript greatly. \n\nReferences \n[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via \nthe EM algorithm. Journal of the Royal Statistical Society series B, 39:1-38, 1977. \n[2] B. S. Everitt. An Introduction to Latent Variable Models. 
Chapman and Hall, London, 1984. \n[3] Zoubin Ghahramani and Geoffrey Hinton. The EM algorithm for mixtures of factor analyzers. \nTechnical Report CRG-TR-96-1, Dept. of Computer Science, University of Toronto, Feb. 1997. \n[4] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an \nEM approach. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in \nNeural Information Processing Systems, volume 6, pages 120-127. Morgan Kaufmann, 1994. \n[5] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University \nPress, Baltimore, MD, USA, second edition, 1989. \n[6] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK users' guide: Solution of large \nscale eigenvalue problems with implicitly restarted Arnoldi methods. Technical Report \nfrom http://www.caam.rice.edu/software/ARPACK/, Computational and Applied Mathematics, Rice University, October 1997. \n[7] L. Sirovich. Turbulence and the dynamics of coherent structures. Quarterly Applied Mathematics, 45(3):561-590, 1987. \n[8] Michael Tipping and Christopher Bishop. Mixtures of probabilistic principal component analyzers. Technical Report NCRG/97/003, Neural Computing Research Group, Aston University, \nJune 1997. \n[9] Michael Tipping and Christopher Bishop. Probabilistic principal component analysis. Technical \nReport NCRG/97/010, Neural Computing Research Group, Aston University, September 1997. \n[10] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, England, 1965. \n", "award": [], "sourceid": 1398, "authors": [{"given_name": "Sam", "family_name": "Roweis", "institution": null}]}