{"title": "Automatic Choice of Dimensionality for PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 604, "abstract": null, "full_text": "Automatic choice of dimensionality for peA \n\nThomas P. Minka \n\nMIT Media Lab \n\n20 Ames St, Cambridge, MA 02139 \n\ntpminka@media.mit.edu \n\nAbstract \n\nA central issue in principal component analysis (PCA) is choosing the \nnumber of principal components to be retained. By interpreting PCA as \ndensity estimation, we show how to use Bayesian model selection to es(cid:173)\ntimate the true dimensionality of the data. The resulting estimate is sim(cid:173)\nple to compute yet guaranteed to pick the correct dimensionality, given \nenough data. The estimate involves an integral over the Steifel manifold \nof k-frames, which is difficult to compute exactly. But after choosing an \nappropriate parameterization and applying Laplace's method, an accu(cid:173)\nrate and practical estimator is obtained. In simulations, it is convincingly \nbetter than cross-validation and other proposed algorithms, plus it runs \nmuch faster. \n\n1 \n\nIntroduction \n\nRecovering the intrinsic dimensionality of a data set is a classic and fundamental problem \nin data analysis. A popular method for doing this is PCA or localized PCA. Modeling the \ndata manifold with localized PCA dates back to [4]. Since then, the problem of spacing and \nsizing the local regions has been solved via the EM algorithm and split/merge techniques \n[2, 6, 14,5]. \n\nHowever, the task of dimensionality selection has not been solved in a satisfactory way. \nOn the one hand we have crude methods based on eigenvalue thresholding [4] which are \nvery fast, or we have iterative methods [1] which require excessive computing time. This \npaper resolves the situation by deriving a method which is both accurate and fast. It is \nan application of Bayesian model selection to the probabilistic PCA model developed by \n[12, 15]. 
\n\nThe new method operates exclusively on the eigenvalues of the data covariance matrix. In the local PCA context, these would be the eigenvalues of the local responsibility-weighted covariance matrix, as defined by [14]. The method can be used to fit different PCA models to different classes, for use in Bayesian classification [11]. \n\n2 Probabilistic PCA \n\nThis section reviews the results of [15]. The PCA model is that a d-dimensional vector x was generated from a smaller k-dimensional vector w by a linear transformation (H, m) plus a noise vector e: x = Hw + m + e. Both the noise and the principal component vector w are assumed spherical Gaussian: \n\np(w) ~ N(0, I),   p(e) ~ N(0, vI)   (1) \n\nThe observation x is therefore Gaussian itself: \n\np(x|H, m, v) ~ N(m, HH^T + vI)   (2) \n\nThe goal of PCA is to estimate the basis vectors H and the noise variance v from a data set D = {x_1, ..., x_N}. The probability of the data set is \n\np(D|H, m, v) = (2π)^(-Nd/2) |HH^T + vI|^(-N/2) exp(-(1/2) tr((HH^T + vI)^(-1) S)),   where S = ∑_i (x_i - m)(x_i - m)^T   (3) \n\nAs shown by [15], the maximum-likelihood estimates are: \n\nm̂ = (1/N) ∑_i x_i,   v̂ = (∑_{j=k+1}^d λ_j) / (d - k)   (4) \n\nĤ = U (Λ - v̂ I)^(1/2) R   (5) \n\nwhere the orthogonal matrix U contains the top k eigenvectors of S/N, the diagonal matrix Λ contains the corresponding eigenvalues λ_1, ..., λ_k, and R is an arbitrary k × k orthogonal matrix. \n\n3 Bayesian model selection \n\nBayesian model selection scores models according to the probability they assign the observed data [9, 8]. It is completely analogous to Bayesian classification. It automatically encodes a preference for simpler, more constrained models, as illustrated in figure 1. Simple models only fit a small fraction of data sets, but they assign correspondingly higher probability to those data sets. Flexible models spread themselves out more thinly. 
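The closed-form estimates (4)-(5) above are easy to verify numerically. The following is a minimal sketch (my own code, not from the paper; names are illustrative) that computes them with NumPy, taking R = I since the rotation is arbitrary:

```python
import numpy as np

def ppca_mle(X, k):
    # Maximum-likelihood PPCA estimates, eqs. (4)-(5); R is taken as I.
    N, d = X.shape
    m = X.mean(axis=0)                      # eq. (4): sample mean
    S_over_N = np.cov(X.T, bias=True)       # S/N, the sample covariance
    lam, U = np.linalg.eigh(S_over_N)
    lam, U = lam[::-1], U[:, ::-1]          # sort eigenvalues descending
    v = lam[k:].sum() / (d - k)             # eq. (4): mean discarded eigenvalue
    H = U[:, :k] * np.sqrt(lam[:k] - v)     # eq. (5) with R = I
    return m, H, v
```

A quick sanity check on the result: the implied covariance H H^T + v I reproduces the top k sample eigenvalues exactly, with the remaining d - k eigenvalues flattened to v.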
\n\nThe probability of the data given the model is computed by integrating over the unknown parameter values in that model: \n\np(D|M) = ∫ p(D|θ) p(θ|M) dθ   (6) \n\n[Figure 1: Why Bayesian model selection prefers simpler models. A constrained model concentrates its probability on a few data sets and wins on those; a flexible model spreads itself over many data sets and wins elsewhere.] \n\nThis quantity is called the evidence for model M. A useful property of Bayesian model selection is that it is guaranteed to select the true model, if it is among the candidates, as the size of the data set grows to infinity. \n\n3.1 The evidence for probabilistic PCA \n\nFor the PCA model, we want to select the subspace dimensionality k. To do this, we compute the probability of the data for each possible dimensionality and pick the maximum. For a given dimensionality, this requires integrating over all PCA parameters (m, H, v). First we need to define a prior density for these parameters. Assuming there is no information other than the data D, the prior should be as noninformative as possible. A noninformative prior for m is uniform, and with such a prior we can integrate out m analytically, leaving \n\np(D|H, v) = N^(-d/2) (2π)^(-(N-1)d/2) |HH^T + vI|^(-(N-1)/2) exp(-(1/2) tr((HH^T + vI)^(-1) S))   (7) \n\nwhere S = ∑_i (x_i - m̂)(x_i - m̂)^T   (8) \n\nUnlike m, H must have a proper prior since it varies in dimension for different models. Let H be decomposed just as in (5): \n\nH = U (L - vI)^(1/2) R   (9) \n\nwhere L is diagonal with diagonal elements l_i. The orthogonal matrix U is the basis, L is the scaling (corrected for noise), and R is a rotation within the subspace (which will turn out to be irrelevant). 
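The claim that R is irrelevant can be checked numerically: rotating within the subspace leaves HH^T + vI, and hence the likelihood, unchanged. A small illustrative check (my own code, not from the paper; all names and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, v = 4, 2, 0.1

# Ingredients of decomposition (9): an orthonormal basis U (d x k),
# a diagonal scaling L with entries l_i > v, and a k x k rotation R.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
L = np.diag([3.0, 2.0])
R, _ = np.linalg.qr(rng.standard_normal((k, k)))

H_id = U @ np.sqrt(L - v * np.eye(k))   # R = I
H_rot = H_id @ R                        # any other rotation within the subspace

# The observation covariance HH^T + vI is the same either way, so the
# likelihood (7) cannot depend on R; it integrates out to a constant.
C_id = H_id @ H_id.T + v * np.eye(d)
C_rot = H_rot @ H_rot.T + v * np.eye(d)
assert np.allclose(C_id, C_rot)

# Its eigenvalues are exactly l_1, l_2 followed by v, as (9) intends.
assert np.allclose(np.sort(np.linalg.eigvalsh(C_id))[::-1], [3.0, 2.0, v, v])
```

This is why the decomposition isolates the redundant degrees of freedom: only (U, L, v) affect the data distribution.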
A conjugate prior for (U, L, R, v), parameterized by α, is \n\np(U, L, R, v) ∝ |HH^T + vI|^(-(α+2)/2) exp(-(α/2) tr((HH^T + vI)^(-1)))   (10) \n\nThis distribution happens to factor into p(U) p(L) p(R) p(v), which means the variables are a-priori independent: \n\np(L) ∝ |L|^(-(α+2)/2) exp(-(α/2) tr(L^(-1)))   (11) \n\np(v) ∝ v^(-(α+2)(d-k)/2) exp(-α(d - k)/(2v))   (12) \n\np(U) p(R) = (constant, defined in (20))   (13) \n\nThe hyperparameter α controls the sharpness of the prior. For a noninformative prior, α should be small, making the prior diffuse. Besides providing a convenient prior, the decomposition (9) is important for removing redundant degrees of freedom (R) and for separating H into independent components, as described in the next section. \n\nCombining the likelihood with the prior gives \n\np(D|k) = c_k ∫ |HH^T + vI|^(-n/2) exp(-(1/2) tr((HH^T + vI)^(-1) (S + αI))) dU dL dv   (14) \n\nn = N + 1 + α   (15) \n\nThe constant c_k includes N^(-d/2) and the normalizing terms for p(U), p(L), and p(v) (given in [10]); only p(U) will matter in the end. In this formula R has already been integrated out; the likelihood does not involve R, so we just get a multiplicative factor of ∫ p(R) dR = 1. \n\n3.2 Laplace approximation \n\nLaplace's method is a powerful method for approximating integrals in Bayesian statistics [8]: \n\n∫ f(θ) dθ ≈ f(θ̂) (2π)^(rows(A)/2) |A|^(-1/2)   (16) \n\nA = -d^2 log f(θ)/(dθ dθ^T) evaluated at θ = θ̂   (17) \n\nThe key to getting a good approximation is choosing a good parameterization for θ = (U, L, v). Since l_i and v are positive scale parameters, it is best to use l_i' = log(l_i) and v' = log(v). This results in \n\nl̂_i = (N λ_i + α)/(N - 1 + α),   v̂ = (N ∑_{j=k+1}^d λ_j)/(n(d - k) - 2)   (18) \n\nd^2 log f(θ)/(d l_i')^2 at θ̂ = -(N - 1 + α)/2,   d^2 log f(θ)/(d v')^2 at θ̂ = -(n(d - k) - 2)/2   (19) \n\nThe matrix U is an orthogonal k-frame and therefore lives on the Stiefel manifold [7], which is defined by condition (9). 
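The scalar pieces of the approximation, the maximizers (18) and the diagonal Hessian blocks (19), are one-liners given the eigenvalues. A sketch (my own code; function and variable names are illustrative):

```python
import numpy as np

def laplace_scalar_terms(lam, k, N, alpha=0.0):
    # MAP estimates (18) and second derivatives (19) in the log
    # parameterization l_i' = log l_i, v' = log v.  lam holds the
    # eigenvalues of S/N in decreasing order.
    d = lam.shape[0]
    n = N + 1 + alpha                                  # eq. (15)
    l_hat = (N * lam[:k] + alpha) / (N - 1 + alpha)    # eq. (18)
    v_hat = N * lam[k:].sum() / (n * (d - k) - 2)      # eq. (18)
    d2_l = -(N - 1 + alpha) / 2 * np.ones(k)           # eq. (19)
    d2_v = -(n * (d - k) - 2) / 2                      # eq. (19)
    return l_hat, v_hat, d2_l, d2_v
```

Note that as α → 0 these reduce to l̂_i = N λ_i/(N - 1) and v̂ = N ∑ λ_j/((N + 1)(d - k) - 2), slightly different from the maximum-likelihood noise variance in (4).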
The dimension of the manifold is m = dk - k(k+1)/2, since we are imposing k(k+1)/2 constraints on a d × k matrix. The prior density for U is the reciprocal of the area of the manifold [7]: \n\np(U) = 2^(-k) ∏_{i=1}^k Γ((d-i+1)/2) π^(-(d-i+1)/2)   (20) \n\nA useful parameterization of this manifold is given by the Euler vector representation: \n\nU = U_d exp(Z)   (21) \n\nwhere U_d is a fixed orthogonal matrix and Z is a skew-symmetric matrix of parameters, such as \n\nZ = [ 0 z_12 z_13 ; -z_12 0 z_23 ; -z_13 -z_23 0 ]   (22) \n\nThe first k rows of Z determine the first k columns of exp(Z), so the free parameters are z_ij with i < j and i <= k; the others are constant. This gives d(d-1)/2 - (d-k)(d-k-1)/2 = m parameters, as desired. For example, in the case (d = 3, k = 1) the free parameters are z_12 and z_13, which define a coordinate system for the sphere. \n\nAs a function of U, the integrand is simply \n\np(U|D, L, v) ∝ exp(-(1/2) tr((L^(-1) - v^(-1) I) U^T S U))   (23) \n\nThe density is maximized when U contains the top k eigenvectors of S. However, the density is unchanged if we negate any column of U. This means that there are actually 2^k different maxima, and we need to apply Laplace's method to each. Fortunately, these maxima are identical, so we can simply multiply (16) by 2^k to get the integral over the whole manifold. If we set U_d to the eigenvectors of S: \n\nU_d^T S U_d = N Λ   (24) \n\nthen we just need to apply Laplace's method at Z = 0. As shown in [10], if we define the estimated eigenvalue matrix \n\nΛ̂ = [ Λ 0 ; 0 v̂ I ]   (25) \n\n(so that λ̂_i = λ_i for i <= k and λ̂_i = v̂ for i > k), then the second differential at Z = 0 simplifies to \n\nd^2 log f(θ) at Z = 0 = -∑_{i=1}^k ∑_{j=i+1}^d (λ̂_j^(-1) - λ̂_i^(-1))(λ_i - λ_j) N dz_ij^2   (26) \n\nThere are no cross derivatives; the Hessian matrix A_Z is diagonal. 
So its determinant is the product of these second derivatives: \n\n|A_Z| = ∏_{i=1}^k ∏_{j=i+1}^d (λ̂_j^(-1) - λ̂_i^(-1))(λ_i - λ_j) N   (27) \n\nLaplace's method requires this to be nonsingular, so we must have k < N. The cross-derivatives between the parameters are all zero: \n\nd^2 log f(θ)/(d l_i' dZ) = d^2 log f(θ)/(dv' dZ) = d^2 log f(θ)/(d l_i' dv') = 0 at θ = θ̂   (28) \n\nso A is block diagonal and |A| = |A_Z| |A_L| |A_v|. We know A_L and A_v from (19), and A_Z from (27). We now have all of the terms needed in (16), and so the evidence approximation is \n\np(D|k) ≈ 2^k c_k (∏_{i=1}^k l̂_i)^(-n/2) v̂^(-n(d-k)/2) e^(-nd/2) (2π)^((m+k+1)/2) |A_Z|^(-1/2) |A_L|^(-1/2) |A_v|^(-1/2)   (29) \n\nFor model selection, the only terms that matter are those that strongly depend on k, and since α is small and N reasonably large we can simplify this to \n\np(D|k) ≈ p(U) (∏_{j=1}^k λ_j)^(-N/2) v̂^(-N(d-k)/2) (2π)^((m+k)/2) |A_Z|^(-1/2) N^(-k/2)   (30) \n\nv̂ = (∑_{j=k+1}^d λ_j)/(d - k)   (31) \n\nwhich is the recommended formula. Given the eigenvalues, the cost of computing p(D|k) is O(min(d, N) k), which is less than one loop over the data matrix. \n\nA simplification of Laplace's method is the BIC approximation [8]. This approximation drops all terms which do not grow with N, which in this case leaves only \n\np(D|k) ≈ (∏_{j=1}^k λ_j)^(-N/2) v̂^(-N(d-k)/2) N^(-(m+k)/2)   (32) \n\nBIC is compared to Laplace in section 4. \n\n4 Results \n\nTo test the performance of various algorithms for model selection, we sample data from a known model and see how often the correct dimensionality is recovered. The seven estimators implemented and tested in this study are Laplace's method (30), BIC (32), the two methods of [13] (called RR-N and RR-U), the algorithm in [3] (ER), the ARD algorithm of [1], and 5-fold cross-validation (CV). For cross-validation, the log-probability assigned to the held-out data is the scoring function. 
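The recommended score (30) can be evaluated directly from the eigenvalues. The sketch below is my own rendering of that computation, not the paper's code (it uses math.lgamma for the log-Gamma terms in (20); names are illustrative), and it returns log p(D|k):

```python
import math
import numpy as np

def log_evidence(lam, N, k):
    # Laplace evidence approximation (30), computed from the eigenvalues
    # lam of S/N in decreasing order.  Requires k < min(N, d).
    d = lam.shape[0]
    v = lam[k:].sum() / (d - k)                      # eq. (31)
    lam_hat = np.concatenate([lam[:k], np.full(d - k, v)])
    m = d * k - k * (k + 1) // 2                     # dim of the Stiefel manifold
    # log p(U): reciprocal area of the Stiefel manifold, eq. (20)
    log_pU = -k * math.log(2) + sum(
        math.lgamma((d - i + 1) / 2) - (d - i + 1) / 2 * math.log(math.pi)
        for i in range(1, k + 1))
    # log |A_Z|, eq. (27)
    log_det_Az = 0.0
    for i in range(k):
        j = np.arange(i + 1, d)
        log_det_Az += np.log(
            (1 / lam_hat[j] - 1 / lam_hat[i]) * (lam[i] - lam[j]) * N).sum()
    return (log_pU
            - N / 2 * np.log(lam[:k]).sum()          # (prod lambda_j)^(-N/2)
            - N * (d - k) / 2 * math.log(v)
            + (m + k) / 2 * math.log(2 * math.pi)
            - log_det_Az / 2
            - k / 2 * math.log(N))
```

Scoring every candidate k and taking the argmax implements the selection rule; the BIC score (32) is the same computation with the p(U), (2π), and |A_Z| terms dropped and N^(-(m+k)/2) as the penalty.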
ER is the most similar to this paper, since it performs Bayesian model selection on the same model, but uses a different kind of approximation combined with explicit numerical integration. RR-N and RR-U are maximum likelihood techniques on models slightly different from probabilistic PCA; the details are in [10]. ARD is an iterative estimation algorithm for H which sets columns to zero unless they are supported by the data. The number of nonzero columns at convergence is the estimate of dimensionality. \n\nMost of these estimators work exclusively from the eigenvalues of the sample covariance matrix. The exceptions are RR-U, cross-validation, and ARD; the latter two require diagonalizing a series of different matrices constructed from the data. In our implementation, the algorithms are ordered from fastest to slowest as RR-N, BIC, Laplace, cross-validation, RR-U, ARD, and ER (ER is slowest because of the numerical integrations required). \n\nThe first experiment tests the data-rich case where N >> d. The data is generated from a 10-dimensional Gaussian distribution with 5 \"signal\" dimensions and 5 noise dimensions. The eigenvalues of the true covariance matrix are: signal 10 8 6 4 2, noise 1 (×5), with N = 100. The number of times the correct dimensionality (k = 5) was chosen over 60 replications is shown at right [bar chart: ER, Laplace, CV, BIC, ARD, RR-N, RR-U]. The differences between ER, Laplace, and CV are not statistically significant. Results below the dashed line are worse than Laplace with a significance level of 95%. \n\nThe second experiment tests the case of sparse data and low noise: signal 10 8 6 4 2, noise 0.1 (×10), with N = 10. The results over 60 replications are shown at right. BIC and ER, which are derived from large-N approximations, do poorly. Cross-validation also fails, because it doesn't have enough data to work with. 
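The experimental protocol is straightforward to reproduce. The sketch below (my own code, not the paper's) draws data from a Gaussian with the stated eigenvalue profile and counts how often an eigenvalue-based selector recovers k = 5; the BIC score (32) is used here only because it is self-contained, and any of the other scores slots in the same way:

```python
import numpy as np

def sample_data(eigs, N, rng):
    # N draws from a zero-mean Gaussian whose covariance has the given
    # eigenvalues; the basis is irrelevant to eigenvalue-based selectors.
    return rng.standard_normal((N, len(eigs))) * np.sqrt(eigs)

def bic_choose_k(X):
    # Pick the dimensionality maximizing the BIC score (32).
    N, d = X.shape
    lam = np.sort(np.linalg.eigvalsh(np.cov(X.T, bias=True)))[::-1]
    def score(k):
        v = lam[k:].mean()
        m = d * k - k * (k + 1) // 2
        return (-N / 2 * np.log(lam[:k]).sum()
                - N * (d - k) / 2 * np.log(v)
                - (m + k) / 2 * np.log(N))
    return max(range(1, d), key=score)

# First experiment: eigenvalues 10 8 6 4 2 (signal), 1 x5 (noise), N = 100.
rng = np.random.default_rng(0)
eigs = np.array([10., 8., 6., 4., 2.] + [1.] * 5)
hits = sum(bic_choose_k(sample_data(eigs, 100, rng)) == 5 for _ in range(60))
```

Here hits/60 is the success rate plotted in the bar charts; repeating this with each estimator reproduces the comparisons in this section.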
\n\nThe third experiment tests the case of high noise dimensionality: signal 10 8 6 4 2, noise 0.25 (×95), with N = 60. The ER algorithm was not run in this case because of its excessive computation time for large d. The results over 60 replications are shown at right [bar chart: Laplace, CV, ARD, RR-U, BIC, RR-N]. \n\nThe final experiment tests the robustness to having a non-Gaussian data distribution within the subspace. We start with four sound fragments of 100 samples each. To make things especially non-Gaussian, the values in the third fragment are squared and the values in the fourth fragment are cubed. All fragments are standardized to zero mean and unit variance. Gaussian noise in 20 dimensions is added to get: signal = 4 sounds, noise 0.5 (×20), with N = 100. The results over 60 replications of the noise (the signals were constant) are reported at right [bar chart: Laplace, ARD, CV, BIC, RR-N, RR-U, ER]. \n\n5 Discussion \n\nBayesian model selection has been shown to provide excellent performance when the assumed model is correct or partially correct. The evaluation criterion was the number of times the correct dimensionality was chosen. It would also be useful to evaluate the trained model with respect to its performance on new data within an applied setting. In this case, Bayesian model averaging is more appropriate, and it is conceivable that a method like ARD, which encompasses a soft blend between different dimensionalities, might perform better by this criterion than selecting one dimensionality. \n\nIt is important to remember that these estimators are for density estimation, i.e. accurate representation of the data, and are not necessarily appropriate for other purposes like reducing computation or extracting salient features. For example, on a database of 301 face images the Laplace evidence picked 120 dimensions, which is far more than one would use for feature extraction. 
(This result also suggests that probabilistic PCA is not a good generative model for face images.) \n\nReferences \n\n[1] C. Bishop. Bayesian PCA. In Neural Information Processing Systems 11, pages 382-388, 1998. \n\n[2] C. Bregler and S. M. Omohundro. Surface learning with applications to lipreading. In NIPS, pages 43-50, 1994. \n\n[3] R. Everson and S. Roberts. Inferring the eigenvalues of covariance matrices from limited, noisy data. IEEE Trans Signal Processing, 48(7):2083-2091, 2000. http://www.robots.ox.ac.uk/~sjrob/Pubs/spectrum.ps.gz \n\n[4] K. Fukunaga and D. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Trans Computers, 20(2):176-183, 1971. \n\n[5] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems 12, 1999. \n\n[6] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996. http://www.gatsby.ucl.ac.uk/~zoubin/papers.html \n\n[7] A. James. Normal multivariate analysis and the orthogonal group. Annals of Mathematical Statistics, 25(1):40-75, 1954. \n\n[8] R. E. Kass and A. E. Raftery. Bayes factors and model uncertainty. Technical Report 254, University of Washington, 1993. http://www.stat.washington.edu/tech.reports/tr254.ps \n\n[9] D. J. C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6:469-505, 1995. http://wol.ra.phy.cam.ac.uk/mackay/abstracts/network.html \n\n[10] T. Minka. Automatic choice of dimensionality for PCA. Technical Report 514, MIT Media Lab Vision and Modeling Group, 1999. ftp://whitechapel.media.mit.edu/pub/tech-reports/TR-514-ABSTRACT.html \n\n[11] B. Moghaddam, T. Jebara, and A. Pentland. 
Bayesian modeling of facial similarity. In Neural Information Processing Systems 11, pages 910-916, 1998. \n\n[12] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997. \n\n[13] J. J. Rajan and P. J. W. Rayner. Model order selection for the singular value decomposition and the discrete Karhunen-Loeve transform using a Bayesian approach. IEE Vision, Image and Signal Processing, 144(2):116-123, 1997. \n\n[14] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999. http://citeseer.nj.nec.com/362314.html \n\n[15] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. J Royal Statistical Society B, 61(3), 1999.", "award": [], "sourceid": 1853, "authors": [{"given_name": "Thomas", "family_name": "Minka", "institution": null}]}