{"title": "An Approximate Inference Approach for the PCA Reconstruction Error", "book": "Advances in Neural Information Processing Systems", "page_first": 1035, "page_last": 1042, "abstract": null, "full_text": "An Approximate Inference Approach for the PCA Reconstruction Error\nManfred Opper, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, mo@ecs.soton.ac.uk\n\nAbstract\nThe problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed, and an alternative simplified derivation is also presented.\n\n1 Introduction\n\nThis paper was motivated by recent joint work with Ole Winther on approximate inference techniques (the expectation consistent (EC) approximation [1], related to Tom Minka's EP approach [2]) which allow us to tackle high-dimensional sums and integrals required for Bayesian probabilistic inference. I was looking for a nice model on which I could test this approximation. It had to be simple enough so that I would not be bogged down by large numerical simulations. But it had to be nontrivial enough to be of at least modest interest to Machine Learning. With the somewhat unorthodox application of approximate inference to resampling in PCA I hope to be able to stress the following points: Approximate efficient inference techniques can be useful in areas of Machine Learning where one would not necessarily assume that they are applicable. This can happen when the underlying probabilistic model is not immediately visible but shows up only as the result of a mathematical transformation. 
Approximate inference methods can be highly robust, allowing for analytic continuations of model parameters to the complex plane or even noninteger dimensions. It is not always necessary to use a large number of variational parameters in order to get reasonable accuracy. Inference methods could be systematically improved using perturbative corrections. The work was also stimulated by previous joint work with Dorthe Malzahn [3] on resampling estimates for generalization errors of Gaussian process models and Support Vector Machines.\n\n2 Resampling estimators for PCA\n\nPrincipal Component Analysis (PCA) is a well known and widely applied tool for data analysis. The goal is to project data vectors y from a typically high (d-) dimensional space into an optimally chosen lower (q-) dimensional linear space with q << d, thereby minimizing the expected projection error E = E ||y - P_q[y]||^2, where P_q[y] denotes the projection. E stands for an expectation over the distribution of the data. In practice, where the distribution is not available, one has to work with a data sample D_0 consisting of N vectors y_k = (y_k(1), y_k(2), . . . , y_k(d))^T, k = 1, . . . , N. We arrange these vectors into a (d x N) data matrix Y = (y_1, y_2, . . . , y_N). Assuming centered data, the optimal subspace is spanned by the eigenvectors u_l of the d x d data covariance matrix C = (1/N) Y Y^T corresponding to the q largest eigenvalues lambda_k. We will assume that these correspond to all eigenvectors with eigenvalues lambda_k > theta above some threshold value theta. After computing the PCA projection, one would be interested in finding out if the computed subspace represents the data well by estimating the average projection error on novel data y (i.e. not contained in D_0) which are drawn from the same distribution. Fixing the projection P_q, the error can be rewritten as\n\nE = E Tr[ y^T ( sum_{lambda_l < theta} u_l u_l^T ) y ]   (1)\n\nwhere the expectation is only over y and the training data are fixed. 
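As a concrete illustration of the thresholded projection error in (1), the following minimal NumPy sketch computes the retained eigenspace of the sample covariance and the squared projection onto the discarded eigenvectors; all sizes, the threshold, and the variable names are illustrative assumptions, not taken from the paper's experiments.

```python
import numpy as np

# Minimal sketch of eq. (1): PCA with an eigenvalue threshold theta.
rng = np.random.default_rng(0)
d, N, theta = 25, 32, 1.0
Y = rng.standard_normal((d, N))      # columns: centered data vectors y_k
C = Y @ Y.T / N                      # d x d sample covariance C = (1/N) Y Y^T
lam, U = np.linalg.eigh(C)           # eigenvalues (ascending) and eigenvectors
keep = lam > theta                   # retained principal subspace

def reconstruction_error(y):
    # squared norm of the component of y in the discarded eigenspace,
    # i.e. y^T ( sum_{lambda_l < theta} u_l u_l^T ) y from eq. (1)
    U_low = U[:, ~keep]
    return float(np.sum((U_low.T @ y) ** 2))

E_t = lam[~keep].sum()               # training error: sum of discarded eigenvalues
# Monte Carlo estimate of E over fresh data from the same distribution
E_new = np.mean([reconstruction_error(rng.standard_normal(d)) for _ in range(2000)])
```

Averaging reconstruction_error over fresh draws estimates E, while E_t is computed from the training sample alone and is optimistically biased, which is what motivates a resampling estimate.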
The training error E_t = sum_{lambda_l < theta} lambda_l can be obtained without knowledge of the distribution but will usually only give an optimistically biased estimate for E.\n\n2.1 A resampling estimate for the error\n\nNew artificial data samples D of arbitrary size can be created by resampling a number of data points from D_0 with or without replacement. A simple choice would be to choose all data independently with the same probability 1/N, but other possibilities can also be implemented within our formalism. Thus, some y_i in D_0 may appear multiple times in D and others not at all. The idea of performing PCA on resampled data sets D and testing on the remaining data D_0 \\ D motivates the following definition of a resample averaged reconstruction error\n\nE_r = (1/N_0) E_D Tr[ sum_{i not in D; lambda_l < theta} y_i^T u_l u_l^T y_i ]   (2)\n\nas a proxy for E. E_D is the expectation over the resampling process. This is an estimator of the bootstrap type [3, 4]. N_0 is the expected number of data in D_0 which are not contained in the random set D. The rest of the paper will discuss a method for efficiently approximating (2).\n\n2.2 Basic formalism\n\nWe introduce \"occupation numbers\" s_i which count how many times y_i is contained in D. We also introduce two matrices D and C. D is a diagonal random matrix with\n\nD_ii = D_i = s_i + eta delta_{s_i,0},   C(eta) = (1/N) Y D Y^T.   (3)\n\nC(0) is proportional to the covariance matrix of the resampled data. gamma is the sampling rate, i.e. gamma N = E_D[ sum_i s_i ] is the expected number of data in D (counting multiplicities). The role of eta will be explained later. Using eta, we can generate expressions that can be used in (2) to sum over the data which are not contained in the set D:\n\n(d/d eta) C(0) = (1/N) sum_j delta_{s_j,0} y_j y_j^T.   (4)\n\nIn the following, lambda_k and u_k will always denote eigenvalues and eigenvectors of the data dependent (i.e. random) covariance matrix C(0). 
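A brute-force Monte Carlo version of the resampled error (2) can be written directly from the occupation-number formalism; this sketch (illustrative sizes, resampling with replacement at rate gamma = 1) is the kind of computation the approximate method of this paper is meant to replace.

```python
import numpy as np

# Direct simulation of the resampled reconstruction error of eq. (2).
# Occupation numbers s_i count how often y_i enters the resample D;
# points with s_i = 0 form the held-out set. Sizes are illustrative.
rng = np.random.default_rng(1)
d, N, gamma, theta = 25, 50, 1.0, 1.0
Y = rng.standard_normal((d, N))

def resampled_error(Y, gamma, theta, n_rounds=200):
    d, N = Y.shape
    total, n_out = 0.0, 0
    for _ in range(n_rounds):
        s = np.bincount(rng.integers(0, N, size=int(gamma * N)), minlength=N)
        C0 = (Y * s) @ Y.T / N            # C(0) = (1/N) Y D Y^T with D_i = s_i
        lam, U = np.linalg.eigh(C0)
        U_low = U[:, lam < theta]         # discarded eigenspace of C(0)
        out = s == 0                      # data not contained in D
        total += np.sum((U_low.T @ Y[:, out]) ** 2)
        n_out += int(out.sum())
    return total / n_out                  # average error per held-out point

E_r = resampled_error(Y, gamma, theta)
```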
The desired averages can be constructed from the d x d matrix Green's function\n\nG(lambda) = (C(0) + lambda I)^{-1} = sum_k u_k u_k^T / (lambda_k + lambda).   (5)\n\nUsing the well known representation of the Dirac distribution given by delta(x) = (1/pi) lim_{eps -> 0+} Im 1/(x - i eps), where i = sqrt(-1) and Im denotes the imaginary part, we get\n\n(1/pi) lim_{eps -> 0+} Im G(-lambda - i eps) = sum_k u_k u_k^T delta(lambda_k - lambda).   (6)\n\nHence, we have\n\nE_r = E_r^0 + int_{0+}^{theta} d lambda rho_r(lambda)   (7)\n\nwhere\n\nrho_r(lambda) = (1/(pi N_0)) lim_{eps -> 0+} Im E_D[ sum_j delta_{s_j,0} Tr( y_j y_j^T G(-lambda - i eps) ) ].   (8)\n\nrho_r defines the error density from all eigenvalues lambda > 0 and E_r^0 is the contribution from the eigenspace with lambda_k = 0. The latter can also be easily expressed from G as\n\nE_r^0 = lim_{lambda -> 0} (lambda/N_0) E_D[ sum_j delta_{s_j,0} Tr( y_j y_j^T G(lambda) ) ].   (9)\n\nWe can also compute the resample averaged density of eigenvalues using\n\nrho(lambda) = (1/(pi N)) lim_{eps -> 0+} Im E_D[ Tr G(-lambda - i eps) ].   (10)\n\n3 A Gaussian probabilistic model\n\nThe matrix Green's function for lambda > 0 can be generated from a Gaussian partition function Z. This is a well known construction in statistical physics, and has also been used within the NIPS community to study the distribution of eigenvalues for an average case analysis of PCA [5]. Its use for computing the expected reconstruction error is to my knowledge new.\nWith the (N x N) kernel matrix K = (1/N) Y^T Y we define the Gaussian partition function\n\nZ = int dx exp( -(1/2) x^T (lambda K^{-1} + D) x )   (11)\n  = |K|^{1/2} lambda^{(d-N)/2} (2 pi)^{(N-d)/2} int dz exp( -(1/2) z^T (C(eta) + lambda I) z ).   (12)\n\nx is an N-dimensional integration variable. The equality can easily be shown by expressing the integrals as determinants. The first representation (11) is useful for computing the resampling average, and the second one connects directly to the definition of the matrix Green's function G. Note that, by its dependence on the kernel matrix K, a generalization to d = infinity dimensional feature spaces and kernel PCA is straightforward. The partition function can then be understood as a certain Gaussian process expectation. We will not discuss this point further. 
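The resolvent identities (5), (6) and (9) are easy to check numerically. In this sketch (random low-rank data, so that C has an exact zero eigenspace; sizes are illustrative) the smoothed imaginary part of Tr G approximates the eigenvalue density, and lambda G(lambda) converges to the projector onto the zero eigenspace:

```python
import numpy as np

# Numerical check of the resolvent identities (5), (6) and (9).
rng = np.random.default_rng(2)
d, N = 10, 6
Y = rng.standard_normal((d, N))
C = Y @ Y.T / N                      # rank N < d: zero is an eigenvalue of C

def G(z):
    # matrix Green's function (C + z I)^(-1); z may be complex
    return np.linalg.inv(C + z * np.eye(d))

# eq. (6)/(10): (1/pi) Im Tr G(-lam - i eps) is a smoothed eigenvalue density
eps = 1e-3
density = lambda lam: np.trace(G(-lam - 1j * eps)).imag / np.pi

# eq. (9): lam * G(lam) tends to the projector onto the zero eigenspace,
# so its trace tends to the zero-eigenspace dimension d - N
lam0 = 1e-9
P0 = lam0 * G(lam0)
```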
The free energy F = - ln Z enables us to generate the following quantities:\n\n-2 (d/d eta) ln Z |_{eta=0} = (1/N) sum_{j=1}^{N} delta_{s_j,0} Tr( y_j y_j^T G(lambda) )   (13)\n\n-2 (d/d lambda) ln Z = (N-d)/lambda + Tr G(lambda)   (14)\n\nwhere we have used (4) for (13). (13) will be used for the computation of (8), and (14) applies to the density of eigenvalues. Note that the definition of the partition function Z requires that lambda > 0, whereas the application to the reconstruction error (7) needs negative arguments -lambda < 0. Hence, an analytic continuation of end results must be performed.\n\n4 Resampling average and replicas\n\n(13) and (14) show that we can compute the desired resampling averages from the expected free energy -E_D[ln Z]. This can be expressed using the \"replica trick\" of statistical physics (see e.g. [6]) using\n\nE_D[ln Z] = lim_{n -> 0} (1/n) ln E_D[Z^n],   (15)\n\nwhere one attempts an approximate computation of E_D[Z^n] for integer n and uses a continuation to real numbers at the end. The n times replicated and averaged partition function (11) can be written in the form\n\nZ^{(n)} = E_D[Z^n] = int dx phi_1(x) phi_2(x)   (16)\n\nwhere we set x = (x_1, . . . , x_n) and\n\nphi_1(x) = E_D[ exp( -(1/2) sum_{a=1}^{n} x_a^T D x_a ) ],   phi_2(x) = exp( -(lambda/2) sum_{a=1}^{n} x_a^T K^{-1} x_a ).   (17)\n\nThe unaveraged partition function Z (11) is Gaussian, but the averaged Z^{(n)} is not and is usually intractable.\n\n5 Approximate inference\n\nTo approximate Z^{(n)}, we will use the EC approximation recently introduced by Opper & Winther [1]. For this method we need two auxiliary distributions\n\np_1(x) = (1/Z_1) phi_1(x) e^{-(Lambda_1/2) x^T x},   p_0(x) = (1/Z_0) e^{-(Lambda_0/2) x^T x},   (18)\n\nwhere Lambda_1 and Lambda_0 are \"variational\" parameters to be optimized. (If K has zero eigenvalues, a division of Z by |K|^{1/2} is necessary; this additive renormalization of the free energy - ln Z will not influence the subsequent computations.) p_1 tries to mimic the intractable p(x) proportional to phi_1(x) phi_2(x), replacing the multivariate Gaussian phi_2 by a simpler, i.e. 
tractable diagonal one. One may think of using a general diagonal matrix Lambda_1, but we will restrict ourselves in the present case to the simplest choice of a spherical Gaussian with a single parameter Lambda_1. The strategy is to split Z^{(n)} into a product of Z_1 and a term that has to be further approximated:\n\nZ^{(n)} = Z_1 int dx p_1(x) phi_2(x) e^{(Lambda_1/2) x^T x}   (19)\n  approx Z_1 int dx p_0(x) phi_2(x) e^{(Lambda_1/2) x^T x} = Z_EC^{(n)}(Lambda_1, Lambda_0).\n\nThe approximation replaces the intractable average over p_1 by a tractable one over p_0. To optimize Lambda_1 and Lambda_0 we argue as follows: We try to make p_0 as close as possible to p_1 by matching the moments <x^T x>_1 = <x^T x>_0. The index denotes the distribution which is used for averaging. By this step, Lambda_0 becomes a function of Lambda_1. Second, since the true partition function Z^{(n)} is independent of Lambda_1, we expect that a good approximation to Z^{(n)} should be stationary with respect to variations of Lambda_1. Both conditions can be expressed by the requirement that ln Z_EC^{(n)}(Lambda_1, Lambda_0) must be stationary with respect to variations of Lambda_1 and Lambda_0. Within this EC approximation we can carry out the replica limit E_D[ln Z] approx ln Z_EC = lim_{n -> 0} (1/n) ln Z_EC^{(n)} and get after some calculations\n\n- ln Z_EC = -E_D[ ln int dx e^{-(1/2) x^T (D + (Lambda_0 - Lambda) I) x} ] - ln int dx e^{-(1/2) x^T (lambda K^{-1} + Lambda I) x} + ln int dx e^{-(1/2) Lambda_0 x^T x}   (20)\n\nwhere we have set Lambda = Lambda_0 - Lambda_1. Since the first Gaussian integral factorises, we can now perform the resampling average in (20) relatively easily for the case when all s_j's in (3) are independent. Assuming e.g. Poisson probabilities p(s) = e^{-gamma} gamma^s / s! gives a good approximation for the case of resampling gamma N points with replacement. The variational equations which make (20) stationary are\n\n1/Lambda_0 = (1/N) sum_i E_D[ 1/(Lambda_0 - Lambda + D_i) ] = (1/N) sum_k l_k/(lambda + Lambda l_k)   (21)\n\nwhere l_k are the eigenvalues of the matrix K. The variational equations have to be solved in the region of negative lambda where the original partition function does not exist. 
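The stationarity conditions (21) reduce, after eliminating Lambda_0, to a single nonlinear equation in the complex plane that can be attacked by a Newton iteration. The following generic sketch shows such an iteration; the function f is a hypothetical stand-in, not the actual equation obtained from (21).

```python
# Generic Newton iteration in the complex plane, of the kind usable for the
# single-parameter equation obtained by eliminating Lambda_0 from (21).
# The function f below is a hypothetical stand-in, not the paper's equation.
def newton_complex(f, z0, tol=1e-12, max_iter=100, h=1e-7):
    z = complex(z0)
    for _ in range(max_iter):
        fz = f(z)
        if abs(fz) < tol:
            break
        dfz = (f(z + h) - f(z - h)) / (2 * h)   # numerical derivative
        z = z - fz / dfz
    return z

# toy usage: the root of z^2 + 1 = 0 in the upper half plane
root = newton_complex(lambda z: z * z + 1.0, 0.5 + 0.5j)
```

Starting from a point with positive imaginary part selects the upper-half-plane root, mirroring the fact that the variational parameters here come out complex.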
The resulting parameters Lambda_0 and Lambda will usually come out as complex numbers.\n\n6 Experiments\n\nBy eliminating the parameter Lambda_0 from (21) it is possible to reduce the numerical computations to solving a nonlinear equation for a single complex parameter, which can be solved easily and fast by a Newton method. While the analytical results are based on Poisson statistics, the simulations of random resampling were performed by choosing a fixed number (equal to the expected number of the Poisson distribution) of data at random with replacement. The first experiment was for a set of data generated at random from a spherical Gaussian. To show that resampling may be useful, we give on the left hand side of Figure 1 the reconstruction error as a function of the value of theta below which eigenvalues are discarded.\n\nFigure 1: Left: Errors for PCA on N = 32 spherically Gaussian data with d = 25 and gamma = 3. Smooth curve: approximate resampled error estimate; upper step function: true error; lower step function: training error. Right: Comparison of EC approximation (line) and simulation (histogram) of the resampled density of eigenvalues for N = 50 spherically Gaussian data of dimensionality d = 25. The sampling rate was gamma = 3.\n\nThe smooth function is the approximate resampling error (gamma = 3, oversampled so as to leave not many data out of the samples) from our method. The upper step function gives the true reconstruction error (easy to calculate for spherical data) from (1). The lower step function is the training error. The right panel demonstrates the accuracy of the approximation on a similar set of data. We compare the analytically approximated density of eigenvalues with the results of a true resampling experiment, where eigenvalues for many samples are counted into small bins. The theoretical curve follows closely the experiment. 
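The simulated histogram in the right panel of Figure 1 can be reproduced along these lines; the sizes d = 25, N = 50, gamma = 3 are quoted from the caption, while the number of resampling rounds and the binning are guesses.

```python
import numpy as np

# Empirical resampled eigenvalue density, as histogrammed in Figure 1 (right).
rng = np.random.default_rng(3)
d, N, gamma = 25, 50, 3.0
Y = rng.standard_normal((d, N))

eigs = []
for _ in range(300):
    idx = rng.integers(0, N, size=int(gamma * N))   # resample with replacement
    s = np.bincount(idx, minlength=N)                # occupation numbers s_i
    C0 = (Y * s) @ Y.T / N                           # C(0) from eq. (3)
    eigs.extend(np.linalg.eigvalsh(C0))

eigs = np.asarray(eigs)
hist, edges = np.histogram(eigs, bins=40, range=(0.0, 8.0), density=True)
```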
Since the good accuracy might be attributed to the high symmetry of the toy data, we have also performed experiments on a set of N = 100 handwritten digits with d = 784. The results in Figure 2 are promising. Although the density of eigenvalues is more accurate than the resampling error, the latter still comes out reasonably well.\n\n7 Corrections\n\nI will show next that the EC approximation can be augmented by a perturbation expansion. Going back to (19), we can write\n\nZ^{(n)}/Z_1 = int dx p_1(x) phi_2(x) e^{(Lambda_1/2) x^T x} = int dx phi_2(x) e^{(Lambda_1/2) x^T x} int (dk/(2 pi)^{Nn}) e^{-i k^T x} chi(k)\n\nwhere chi(k) = int dx p_1(x) e^{i k^T x} is the characteristic function of the density p_1 (18). ln chi(k) is the cumulant generating function. Using the symmetries of the density p_1, we can perform a power series expansion of ln chi(k), which starts with a quadratic term (second cumulant):\n\nln chi(k) = -(M_2/2) k^T k + R(k),   (22)\n\nwhere M_2 = <x_a^T x_a>_1. It can be shown that if we neglect R(k) (containing the higher order cumulants) and carry out the integral over k, we end up replacing p_1 by a simpler Gaussian p_0 with matching moments M_2, i.e. the EC approximation. Higher order corrections to the free energy -E_D[ln Z] = -ln Z_EC + F_1 + . . . can be obtained perturbatively by writing chi(k) = e^{-(M_2/2) k^T k} (1 + R(k) + . . .). This expansion is similar in spirit to Edgeworth expansions in statistics. The present case is more complicated by the extra dimensions introduced by the replication of variables and the limit n -> 0.\n\nFigure 2: Left: Resampling error (gamma = 1) for PCA on a set of 100 handwritten digits (\"5\") with d = 784. The approximation (line) for gamma = 1 is compared with simulations of the random resampling. Right: Resampled density of eigenvalues for the same data set. Only the nonzero eigenvalues are shown. 
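The role of the neglected remainder R(k) in (22) can be seen in a scalar toy version: for a symmetric non-Gaussian density, the log characteristic function deviates from its second-cumulant approximation only at fourth order in k. The Laplace density used here is purely illustrative, a stand-in for p_1.

```python
import numpy as np

# Scalar toy version of expansion (22): ln chi(k) = -(M2/2) k^2 + R(k),
# with R(k) of fourth order in k for a symmetric density.
rng = np.random.default_rng(4)
x = rng.laplace(size=1_000_000)          # symmetric, non-Gaussian samples
M2 = np.mean(x ** 2)                      # matched second moment

def log_chi(k):
    # empirical log characteristic function (real by symmetry of the density)
    return np.log(np.mean(np.cos(k * x)))

k = 0.3
R = log_chi(k) + 0.5 * M2 * k ** 2        # remainder beyond the Gaussian term
# for the unit Laplace density, ln chi(k) = -ln(1 + k^2), so R is about k^4/2
```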
After a lengthy calculation one finds for the lowest order correction (containing the monomials in k of order 4) to the free energy:\n\nF_1 = -(1/4) sum_i E_D[ ( Lambda_0/(Lambda_0 - Lambda + D_i) - 1 )^2 ] ( [ Lambda_0 (lambda K^{-1} + Lambda I)^{-1} - I ]_{ii} )^2.   (23)\n\nI illustrate the effect of F_1 on a correction to the reconstruction error in the \"zero subspace\" using (9) and (13) for the digit data as a function of gamma. Resampling used the Poisson approximation. The left panel of Figure 3 demonstrates that the true correction is fairly small. The right panel shows that the lowest order term F_1 accounts for a major part of the true correction when gamma < 3. The strong underestimation for larger gamma needs further investigation.\n\n8 The calculation without replicas\n\nKnowing with hindsight how the final EC result (20) looks, we can rederive it using another method which does not rely on the \"replica trick\". We first write down an exact expression for - ln Z before averaging. Expressing Gaussian integrals by determinants yields\n\n- ln Z = - ln int dx e^{-(1/2) x^T (D + (Lambda_0 - Lambda) I) x} - ln int dx e^{-(1/2) x^T (lambda K^{-1} + Lambda I) x} + ln int dx e^{-(1/2) Lambda_0 x^T x} + (1/2) ln det(I + r)   (24)\n\nwhere the matrix r has elements r_ij = -( Lambda_0/(Lambda_0 - Lambda + D_i) - 1 ) [ Lambda_0 (lambda K^{-1} + Lambda I)^{-1} - I ]_{ij}. The EC approximation is obtained by simply neglecting r. Corrections to this are found by expanding\n\nln det(I + r) = Tr ln(I + r) = sum_{k=1}^{infinity} ((-1)^{k+1}/k) Tr r^k.   (25)\n\nFigure 3: Left: Resampling error E_r^0 from the lambda = 0 subspace as a function of resampling rate gamma for the digits data. The approximation (lower line) is compared with simulations of the random resampling (upper line). 
Right: The difference between approximation and simulations (upper curve) and its estimate (lower curve) from the perturbative correction (23).\n\nThe first order term in the expansion (25) vanishes after averaging (see (21)) and the second order term gives exactly the correction of the cumulant method (23).\n\n9 Outlook\n\nIt will be interesting to extend the perturbative framework for the computation of corrections to inference approximations to other, more complex models. However, our results indicate that the use and convergence of such perturbation expansions need to be critically investigated and that the lowest order may not always give a clear indication of the accuracy of the approximation. The alternative derivation for our simple model could present an interesting ground for testing these ideas.\n\nAcknowledgments\n\nI would like to thank Ole Winther for the great collaboration on the EC approximation.\n\nReferences\n[1] Manfred Opper and Ole Winther. Expectation consistent free energies for approximate inference. In NIPS 17, 2005.\n[2] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI 2001, pages 362-369, 2001.\n[3] D. Malzahn and M. Opper. An approximate analytical approach to resampling averages. Journal of Machine Learning Research, pages 1151-1173, 2003.\n[4] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57, Chapman & Hall, 1993.\n[5] D. C. Hoyle and M. Rattray. Limiting form of the sample covariance matrix eigenspectrum in PCA and kernel PCA. In NIPS 16, 2003.\n[6] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.\n", "award": [], "sourceid": 2878, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}]}