{"title": "On the Convergence of Eigenspaces in Kernel Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1649, "page_last": 1656, "abstract": null, "full_text": "On the Convergence of Eigenspaces in Kernel Principal Component Analysis\n\nLaurent Zwald, Departement de Mathematiques, Universite Paris-Sud, Bat. 425, F-91405 Orsay, France, Laurent.Zwald@math.u-psud.fr\n\nGilles Blanchard, Fraunhofer First (IDA), Kekulestr. 7, D-12489 Berlin, Germany, blanchar@first.fhg.de\n\nAbstract\nThis paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Instead of considering the reconstruction error of KPCA, we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence, this allows us to infer stability results for these estimated spaces.\n\n1\n\nIntroduction.\n\nPrincipal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimensional projection of some data, in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D (an assumption that we will make throughout the present paper), the solution of this problem is obtained by considering the projection on the span S_D of the first D eigenvectors of the covariance matrix. Here, by 'first D eigenvectors' we mean eigenvectors associated to the D largest eigenvalues, counted with multiplicity; hereafter, with some abuse, the span of the first D eigenvectors will be called the \"D-eigenspace\" for short when there is no risk of confusion. 
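In matrix form, this objective is straightforward to compute. The following minimal sketch (assuming only numpy; the data and dimensions are purely illustrative) extracts the D-eigenspace of a sample covariance matrix together with the corresponding orthogonal projector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in R^5 whose variance is concentrated in the first
# two coordinate directions.
X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])

D = 2
C = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # reorder: largest eigenvalues first
top_D = eigvecs[:, order[:D]]          # first D eigenvectors, as columns

# Orthogonal projector onto the D-eigenspace S_D = span of the columns of top_D.
P_D = top_D @ top_D.T

# Projecting on S_D keeps as much variance as any D-dimensional subspace can.
projected = X @ top_D
```

On this toy data the recovered directions align closely with the first two coordinate axes, since those carry almost all the variance.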
The introduction of the 'kernel trick' has made it possible to extend this methodology to data mapped into a kernel feature space, giving rise to KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to a nonlinear interpretation in the original space: vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, a procedure which in the case of kernel spaces is linked to the so-called Nyström approximation [13]. The subspace given as an output is then the D-eigenspace Ŝ_D of the empirical covariance matrix or operator. An interesting question from a statistical or learning-theoretical point of view is then: how reliable is this estimate? This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in kernel space for simplicity) the average reconstruction error (squared norm of the distance to the projection) of Ŝ_D converges to the (optimal) reconstruction error of S_D, and bounds are known for the rate of convergence. However, this does not tell us much about the convergence of Ŝ_D to S_D: two very different subspaces can have very similar reconstruction errors, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set out to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. 
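The remark that very different subspaces can share almost the same reconstruction error is easy to see numerically. In the following sketch (numpy assumed; the covariance values are illustrative), the first two coordinate axes span maximally distant one-dimensional subspaces, yet their empirical reconstruction errors nearly coincide because the two leading eigenvalues nearly coincide:

```python
import numpy as np

rng = np.random.default_rng(2)

# Population covariance with two nearly equal leading eigenvalues.
C = np.diag([1.0, 0.999, 0.1])
X = rng.multivariate_normal(np.zeros(3), C, size=2000)

def recon_error(X, U):
    """Average squared distance of the points to their projection on span(U)."""
    P = U @ U.T
    residual = X - X @ P
    return np.mean(np.sum(residual**2, axis=1))

# Two one-dimensional candidate subspaces: the first two coordinate axes.
e0 = np.array([[1.0], [0.0], [0.0]])
e1 = np.array([[0.0], [1.0], [0.0]])

# The subspaces are maximally far apart (projector distance sqrt(2))...
dist_subspaces = np.linalg.norm(e0 @ e0.T - e1 @ e1.T)

# ...yet their reconstruction errors are nearly identical: roughly
# lambda_2 + lambda_3 for e0 versus lambda_1 + lambda_3 for e1.
err0, err1 = recon_error(X, e0), recon_error(X, e1)
```

So a small reconstruction error gap is compatible with a maximal subspace distance, which is exactly why a direct analysis of the eigenspaces is needed.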
First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. However, PCA is often used merely as a preprocessing step, and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning-theoretical standpoint), it is desirable to have more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds, and we essentially walk in the steps pioneered by Koltchinskii and Giné [7] (see also [4]), using tools of operator perturbation theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular, we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D+1)-th eigenvalue comes into account. This means that there is no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. 
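The role of the gap as the effective complexity parameter can be illustrated on a small synthetic example (a sketch assuming numpy; the matrices are hypothetical): when the two leading eigenvalues nearly coincide, a tiny symmetric perturbation rotates the individual top eigenvector dramatically, while the two-dimensional span, protected by the large gap below it, barely moves:

```python
import numpy as np

# A has two nearly equal leading eigenvalues (gap 1e-6) and a distant third one.
A = np.diag([1.0 + 1e-6, 1.0, 0.1])

# A small symmetric perturbation mixing the first two coordinates.
eps = 1e-3
B = np.zeros((3, 3))
B[0, 1] = B[1, 0] = eps

def top_projector(M, D):
    """Orthogonal projector onto the span of the first D eigenvectors of M."""
    w, V = np.linalg.eigh(M)
    U = V[:, np.argsort(w)[::-1][:D]]
    return U @ U.T

# Individual top eigenvector: rotated by roughly 45 degrees inside the span,
# because eps is much larger than the tiny gap between the top two eigenvalues.
P1, P1p = top_projector(A, 1), top_projector(A + B, 1)
# Two-dimensional span: essentially unchanged, because the gap between the
# second and third eigenvalues below it is large.
P2, P2p = top_projector(A, 2), top_projector(A + B, 2)

drift_1d = np.linalg.norm(P1 - P1p)
drift_2d = np.linalg.norm(P2 - P2p)
```

Here drift_1d is of order one while drift_2d is at numerical precision, matching the claim that only the gap below the D-th eigenvalue matters for the D-eigenspace.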
Our contribution in the present work is thus to adapt the operator perturbation result of [7] to D-eigenspaces, and to obtain non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to this tool. In section 2 we briefly introduce the notation, explain the main ingredients used, and obtain a first bound based on controlling separately the first D eigenvectors, which depends on the dimension D. In section 3 we explain why this first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is better adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix.\n\n2\n\nFirst result.\n\nNotation. The variable of interest X takes its values in some measurable space X, following the distribution P. We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping Φ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace S_D of H such that the projection of Φ(X) on S_D has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt, and the norm considered for these operators will be the Hilbert-Schmidt norm unless specified otherwise. Furthermore, we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of the variable Φ(X). To simplify notation, we assume that the nonzero eigenvalues λ_1 > λ_2 > ... of C are all simple (this is for convenience only; in the conclusion we discuss what changes have to be made if this is not the case). Let φ_1, φ_2, ... be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is S_D = span{φ_1, ..., φ_D}. 
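In practice, the D-eigenspace of the empirical covariance operator is computed from the eigendecomposition of the n x n kernel Gram matrix. The sketch below (numpy only; the kernel, bandwidth and sizes are illustrative, and the data is left uncentered as in the simplified setting above) recovers the empirical eigenvalues and the dual coefficients of unit-norm eigenfunctions:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = k(x_i, x_j) for the Gaussian RBF kernel."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
n, D = len(X), 4

K = rbf_gram(X)
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]

# Eigenvalues of the empirical covariance operator C_n are those of K / n.
lam = eigvals[order[:D]] / n

# Dual coefficients: each eigenfunction f_r = sum_i alpha[i, r] k(x_i, .)
# is normalized to unit RKHS norm, i.e. alpha^T K alpha = identity.
alpha = eigvecs[:, order[:D]] / np.sqrt(eigvals[order[:D]])

# Coordinates of a new point's feature vector in the empirical D-eigenspace.
x = rng.normal(size=(1, 3))
k_x = rbf_gram(np.vstack([X, x]))[:n, n]   # the vector (k(x_i, x))_i
coords = k_x @ alpha
```

This is the standard dual computation; the analysis in the paper concerns how close the span of these empirical eigenfunctions is to the population D-eigenspace.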
The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted C_n, and the subspace Ŝ_D spanned by its first D eigenvectors. We denote by P_{S_D}, P_{Ŝ_D} the orthogonal projectors onto these spaces.\n\nA first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are: 1. a non-asymptotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. an operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following lemma coming from [9]:\n\nLemma 1 (Corollary 5 of [9]) Supposing that sup_{x ∈ X} k(x, x) ≤ M, with probability greater than 1 − e^{−ξ},\n\n‖C_n − C‖ ≤ (2M/√n) (1 + √(ξ/2)).\n\nAs for the second step, [7] provides the following perturbation bound (see also e.g. [12]):\n\nTheorem 2 (simplified version of [7], Theorem 5.2) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ_1 > λ_2 > ... For an integer r such that λ_r > 0, let δ̃_r = min(δ_r, δ_{r−1}), where δ_r = (λ_r − λ_{r+1})/2. Let B ∈ HS(H) be another symmetric operator such that ‖B‖ < δ̃_r/2 and A + B is still a positive operator with simple nonzero eigenvalues. Let P_r(A) (resp. P_r(A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. A + B). Then these projectors satisfy:\n\n‖P_r(A) − P_r(A + B)‖ ≤ 2‖B‖/δ̃_r.\n\nRemark about the approximation error of the eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projectors onto eigenspaces implies a control on the approximation errors of the eigenvectors themselves. Indeed, let φ_r, φ̂_r denote the (normalized) r-th eigenvectors of the operators above, with signs chosen so that ⟨φ_r, φ̂_r⟩ > 0. 
Then ‖P_r − P̂_r‖² = 2(1 − ⟨φ_r, φ̂_r⟩²) ≥ 2(1 − ⟨φ_r, φ̂_r⟩) = ‖φ_r − φ̂_r‖².\n\nNow, the orthogonal projector on the direct sum of the first D eigenspaces is the sum Σ_{r=1}^D P_r. Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e^{−ξ} the following holds:\n\n‖P_{S_D} − P_{Ŝ_D}‖ ≤ (Σ_{r=1}^D δ̃_r^{−1}) (4M/√n) (1 + √(ξ/2)),\n\nprovided that n ≥ 16M² (1 + √(ξ/2))² (sup_{1 ≤ r ≤ D} δ̃_r^{−2}). The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δ_D enters into account.\n\n3\n\nImproved Result.\n\nWe first prove the following variant of the operator perturbation property, which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7].\n\nTheorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ_1 > λ_2 > ... Let D > 0 be an integer such that λ_D > 0, and let δ_D = (λ_D − λ_{D+1})/2. Let B ∈ HS(H) be another symmetric operator such that ‖B‖ < δ_D/2 and A + B is still a positive operator. Let P^D(A) (resp. P^D(A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors of A (resp. A + B). Then these satisfy:\n\n‖P^D(A) − P^D(A + B)‖ ≤ ‖B‖/δ_D. (1)\n\nThis then gives rise to our main result on KPCA:\n\nTheorem 4 Assume that sup_{x ∈ X} k(x, x) ≤ M. Let S_D, Ŝ_D be the subspaces spanned by the first D eigenvectors of C, resp. C_n, defined earlier. Denoting λ_1 > λ_2 > ... the eigenvalues of C, if D > 0 is such that λ_D > 0, put δ_D = (λ_D − λ_{D+1})/2 and\n\nB_D = (2M/δ_D) (1 + √(ξ/2)).\n\nThen, provided that n ≥ 4 B_D², the following bound holds with probability at least 1 − e^{−ξ}:\n\n‖P_{S_D} − P_{Ŝ_D}‖ ≤ B_D/√n. (2)\n\nThis entails in particular\n\nŜ_D ⊂ { g + h ; g ∈ S_D, h ∈ S_D^⊥, ‖h‖_{H_k} ≤ 2 B_D n^{−1/2} ‖g‖_{H_k} }. (3)\n\nThe important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D+1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only, or even a worse bound, since δ̃_D = min(δ_D, δ_{D−1}) appears in that case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in Ŝ_D, the tangent of the angle between this vector and its projection on S_D is upper bounded by 2 B_D/√n, which we can interpret as a stability property.\n\nComment about the centered case. In the actual (K)PCA procedure, the data is first empirically recentered, so that one has to consider the centered covariance operator C̄ and its empirical counterpart C̄_n. A result similar to Theorem 4 also holds in this case (up to some additional constant factors); indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows us to reach similar conclusions for the \"true\" centered KPCA.\n\n4\n\nConclusion and Discussion\n\nIn this paper, finite sample size confidence bounds for the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). 
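The perturbation bound of Theorem 3 can be probed numerically on random symmetric matrices (a sketch assuming numpy; this merely checks the inequality on samples, it does not prove it). Here ‖B‖ is set to 0.4 δ_D < δ_D/2, so the right-hand side ‖B‖/δ_D equals 0.4 by construction:

```python
import numpy as np

rng = np.random.default_rng(3)

def top_projector(M, D):
    """Orthogonal projector onto the span of the first D eigenvectors of M."""
    w, V = np.linalg.eigh(M)
    U = V[:, np.argsort(w)[::-1][:D]]
    return U @ U.T

D, dim = 3, 8
ratios = []
for _ in range(100):
    # Random symmetric positive A with (almost surely) simple eigenvalues.
    lam = np.sort(rng.uniform(5.0, 15.0, size=dim))[::-1]
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    A = Q @ np.diag(lam) @ Q.T

    delta_D = 0.5 * (lam[D - 1] - lam[D])   # the gap delta_D of Theorem 3

    # Symmetric perturbation with Hilbert-Schmidt (Frobenius) norm
    # 0.4 * delta_D < delta_D / 2; A + B stays positive because
    # ||B|| <= 0.4 * 5 = 2, below the smallest eigenvalue of A (>= 5).
    B = rng.normal(size=(dim, dim))
    B = B + B.T
    B *= 0.4 * delta_D / np.linalg.norm(B)

    lhs = np.linalg.norm(top_projector(A, D) - top_projector(A + B, D))
    rhs = np.linalg.norm(B) / delta_D       # equals 0.4 by construction
    ratios.append(lhs / rhs)
```

On such samples the ratio lhs/rhs stays below one, consistent with inequality (1); a small ratio simply reflects slack in the bound for that draw.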
We proved a bound in which the complexity factor for estimating the eigenspace S_D by its empirical counterpart depends only on the inverse gap between the D-th and (D+1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed, for simplicity of notation, that the eigenvalues are simple. In the case where the covariance operator C has nonzero eigenvalues with multiplicities m_1, m_2, ... possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m_1 + ... + m_r for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA), where no such restriction appears. However, it should be clear that we need this restriction when considering D-eigenspaces themselves, since the target space has to be unequivocally defined, otherwise convergence cannot occur. Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. A common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds, and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit theorems) in [6, 4] provide the correct goal to shoot for, since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic results. This of course opens directions for future work. 
Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778).\n\nA Appendix: proofs.\n\nProof of Lemma 1. This lemma is proved in [9]; we give a short proof for the sake of completeness. We have C_n − C = (1/n) Σ_{i=1}^n C_{X_i} − E[C_X], with C_X = Φ(X) ⊗ Φ(X) and ‖C_X‖ = ‖Φ(X)‖² = k(X, X) ≤ M. We can apply the bounded difference inequality to the variable ‖C_n − C‖, so that with probability greater than 1 − e^{−ξ},\n\n‖C_n − C‖ ≤ E[‖C_n − C‖] + 2M √(ξ/(2n)).\n\nMoreover, by Jensen's inequality, E[‖C_n − C‖] ≤ E[‖(1/n) Σ_{i=1}^n C_{X_i} − E[C_X]‖²]^{1/2}, and simple calculations lead to E[‖(1/n) Σ_{i=1}^n C_{X_i} − E[C_X]‖²] = (1/n) E[‖C_X − E[C_X]‖²] ≤ 4M²/n. This concludes the proof of Lemma 1.\n\nProof of Theorem 3. The variation of this proof with respect to that of Theorem 5.2 in [7] is (a) to work directly in an (infinite-dimensional) Hilbert space, requiring extra caution for some details, and (b) to obtain an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing us to work directly in an infinite-dimensional setting is that HS(H) is both a right and a left ideal of L_c(H, H), the Banach space of all continuous linear operators of H endowed with the operator norm ‖·‖_op. Indeed, for T ∈ HS(H) and S ∈ L_c(H, H), TS and ST belong to HS(H), with\n\n‖TS‖ ≤ ‖T‖ ‖S‖_op and ‖ST‖ ≤ ‖T‖ ‖S‖_op. (4)\n\nThe spectrum of a Hilbert-Schmidt operator T is denoted σ(T), and the sequence of its eigenvalues in non-increasing order is denoted λ(T) = (λ_1(T) ≥ λ_2(T) ≥ ...). In the following, P^D(T) denotes the orthogonal projector onto the D-eigenspace of T. The Hoffmann-Wielandt inequality in the infinite-dimensional setting [1] yields\n\n‖λ(A) − λ(A + B)‖_2 ≤ ‖B‖ ≤ δ_D/2, (5)\n\nimplying in particular that for all i > 0,\n\n|λ_i(A) − λ_i(A + B)| ≤ δ_D/2. (6)\n\nResults found in [5], p. 39, yield the formula\n\nP^D(A) − P^D(A + B) = −(1/(2πi)) ∮_γ (R_A(z) − R_{A+B}(z)) dz ∈ L_c(H, H), (7)\n\nwhere R_A(z) = (A − z Id)^{−1} is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and of A + B. Moreover, the same reference (p. 60) states that for z in the complement of σ(A),\n\n‖R_A(z)‖_op = dist(z, σ(A))^{−1}. (8)\n\nThe proof of the theorem now relies on a simple choice for the closed curve γ in (7), consisting of three straight lines and a semicircle of radius L: γ crosses the real axis at λ_D − δ_D, between λ_{D+1} and λ_D, and encloses the portion of the real axis to the right of this point up to radius L. [Figure: the contour γ in the complex plane, enclosing λ_1, ..., λ_D and leaving out λ_{D+1}, ...] For all L large enough, γ intersects neither the spectrum of A (by equation (6)) nor the spectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ_1(A), ..., λ_D(A) (resp. λ_1(A + B), ..., λ_D(A + B)). For z ∈ γ, T(z) = R_A(z) − R_{A+B}(z) = −R_{A+B}(z) B R_A(z) belongs to HS(H) and depends continuously on z by (4). Consequently,\n\n‖P^D(A) − P^D(A + B)‖ ≤ (1/(2π)) ∫_a^b ‖(R_A − R_{A+B})(γ(t))‖ |γ'(t)| dt.\n\nLet S_N = Σ_{n=0}^N (−1)^n (R_A(z) B)^n R_A(z). We have R_{A+B}(z) = (Id + R_A(z) B)^{−1} R_A(z), and, for z ∈ γ,\n\n‖R_A(z) B‖_op ≤ ‖R_A(z)‖_op ‖B‖ ≤ (δ_D/2) dist(z, σ(A))^{−1} ≤ 1/2\n\nimplies that S_N → R_{A+B}(z) (uniformly for z ∈ γ). Using property (4), since B ∈ HS(H), S_N B R_A(z) → R_{A+B}(z) B R_A(z) = R_{A+B}(z) − R_A(z). Finally,\n\nR_A(z) − R_{A+B}(z) = Σ_{n ≥ 1} (−1)^n (R_A(z) B)^n R_A(z),\n\nwhere the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies\n\n‖(R_A − R_{A+B})(γ(t))‖ ≤ Σ_{n ≥ 1} ‖B‖^n dist^{−(n+1)}(γ(t), σ(A)) ≤ 2‖B‖ dist^{−2}(γ(t), σ(A)),\n\nsince ‖B‖ ≤ dist(γ(t), σ(A))/2. Finally,\n\n‖P^D(A) − P^D(A + B)‖ ≤ (‖B‖/π) ∫_a^b |γ'(t)| dist^{−2}(γ(t), σ(A)) dt.\n\nSplitting the last integral into four parts according to the definition of the contour γ, we obtain\n\n∫_a^b |γ'(t)| dist^{−2}(γ(t), σ(A)) dt ≤ (2/δ_D) arctan(L/δ_D) + π/L + 2 (λ_1(A) − (λ_D(A) − δ_D))/L²,\n\nand letting L go to infinity leads to the result.\n\nProof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). 
Together with the assumption n ≥ 4 B_D², this implies ‖P_{S_D} − P_{Ŝ_D}‖ ≤ 1/2. Let f ∈ Ŝ_D; write f = P_{S_D}(f) + P_{S_D^⊥}(f). Lemma 5 below, with F = S_D and G = Ŝ_D, together with the fact that the operator norm is bounded by the Hilbert-Schmidt norm, implies that\n\n‖P_{S_D^⊥}(f)‖²_{H_k} ≤ (4/3) ‖P_{S_D} − P_{Ŝ_D}‖² ‖P_{S_D}(f)‖²_{H_k}.\n\nGathering the different inequalities, Theorem 4 is proved.\n\nLemma 5 Let F and G be two vector subspaces of H such that ‖P_F − P_G‖_op ≤ 1/2. Then the following bound holds:\n\n∀ f ∈ G, ‖P_{F^⊥}(f)‖²_H ≤ (4/3) ‖P_F − P_G‖²_op ‖P_F(f)‖²_H.\n\nProof of Lemma 5. For f ∈ G, we have P_G(f) = f, hence\n\n‖P_{F^⊥}(f)‖² = ‖f − P_F(f)‖² = ‖(P_G − P_F)(f)‖² ≤ ‖P_F − P_G‖²_op ‖f‖² = ‖P_F − P_G‖²_op (‖P_F(f)‖² + ‖P_{F^⊥}(f)‖²).\n\nGathering the terms containing ‖P_{F^⊥}(f)‖² on the left-hand side and using ‖P_F − P_G‖²_op ≤ 1/4 leads to the conclusion.\n\nReferences\n\n[1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc. Indian Acad. Sci. (Math. Sci.) 104 (3), p. 483-494, 1994.\n[2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Proceedings of the 17th Conference on Learning Theory (COLT 2004), p. 594-608. Springer, 2004.\n[3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Advances in Neural Information Processing Systems (NIPS 2004), p. 1649-1656. MIT Press, 2004.\n[4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of Multivariate Analysis 12, p. 136-154, 1982.\n[5] T. Kato. Perturbation Theory for Linear Operators. New York: Springer-Verlag, 1966.\n[6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191-227, 1998.\n[7] V. Koltchinskii and E. Giné. Random matrix approximation of spectra of integral operators. Bernoulli, 6(1):113-167, 2000.\n[8] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.\n[9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003.\n[10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005.\n[11] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004.\n[12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457-471. Springer, 2004.\n[13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159-1166. Morgan Kaufmann, 2000.\n", "award": [], "sourceid": 2762, "authors": [{"given_name": "Laurent", "family_name": "Zwald", "institution": null}, {"given_name": "Gilles", "family_name": "Blanchard", "institution": null}]}