{"title": "The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum", "book": "Advances in Neural Information Processing Systems", "page_first": 383, "page_last": 390, "abstract": null, "full_text": "The Stability of Kernel Principal \n\nComponents Analysis and its Relation to \n\nthe Process Eigenspectrum \n\nJohn Shawe-Taylor \n\nRoyal Holloway \n\nUniversity of London \njohn\u00a9cs.rhul.ac.uk \n\nChristopher K. I. Williams \n\nSchool of Informatics \n\nUniversity of Edinburgh \n\nc.k.i.williams\u00a9ed.ac.uk \n\nAbstract \n\nIn this paper we analyze the relationships between the eigenvalues \nof the m x m Gram matrix K for a kernel k(\u00b7, .) corresponding to a \nsample Xl, ... ,Xm drawn from a density p(x) and the eigenvalues \nof the corresponding continuous eigenproblem. We bound the dif(cid:173)\nferences between the two spectra and provide a performance bound \non kernel peA. \n\n1 \n\nIntroduction \n\nOver recent years there has been a considerable amount of interest in kernel methods \nfor supervised learning (e.g. Support Vector Machines and Gaussian Process predic(cid:173)\nt ion) and for unsupervised learning (e.g. kernel peA, Sch61kopf et al. (1998)). In \nthis paper we study the stability of the subspace of feature space extracted by kernel \npeA with respect to the sample of size m, and relate this to the feature space that \nwould be extracted in the infinite sample-size limit. This analysis essentially \"lifts\" \ninto (a potentially infinite dimensional) feature space an analysis which can also \nbe carried out for peA, comparing the k-dimensional eigenspace extracted from \na sample covariance matrix and the k-dimensional eigenspace extracted from the \npopulation covariance matrix, and comparing the residuals from the k-dimensional \ncompression for the m-sample and the population. \nEarlier work by Shawe-Taylor et al. 
(2002) discussed the concentration of spectral properties of Gram matrices and of the residuals of fixed projections. However, these results gave deviation bounds on the sampling variability of the eigenvalues of the Gram matrix, but did not address the relationship of sample and population eigenvalues, or the estimation problem of the residual of PCA on new data.

The structure of the remainder of the paper is as follows. In section 2 we provide background on the continuous kernel eigenproblem, and the relationship between the eigenvalues of certain matrices and the expected residuals when projecting into spaces of dimension $k$. Section 3 provides inequality relationships between the process eigenvalues and the expectation of the Gram matrix eigenvalues. Section 4 presents some concentration results and uses these to develop an approximate chain of inequalities. In section 5 we obtain a performance bound on kernel PCA, relating the performance on the training sample to the expected performance with respect to $p(x)$.

2 Background

2.1 The kernel eigenproblem

For a given kernel function $k(\cdot,\cdot)$ the $m \times m$ Gram matrix $K$ has entries $k(x_i, x_j)$, $i, j = 1, \dots, m$, where $\{x_i : i = 1, \dots, m\}$ is a given dataset. For Mercer kernels $K$ is symmetric positive semi-definite. We denote the eigenvalues of the Gram matrix as $\hat\lambda_1 \ge \hat\lambda_2 \ge \dots \ge \hat\lambda_m \ge 0$ and write its eigendecomposition as $K = Z\hat\Lambda Z'$, where $\hat\Lambda$ is a diagonal matrix of the eigenvalues and $Z'$ denotes the transpose of the matrix $Z$. The eigenvalues are also referred to as the spectrum of the Gram matrix.

We now describe the relationship between the eigenvalues of the Gram matrix and those of the underlying process. For a given kernel function and density $p(x)$ on a space $X$, we can also write down the eigenfunction problem

\[ \int_X k(x, y)\, p(x)\, \phi_i(x)\, dx = \lambda_i \phi_i(y). \]   (1)

Note that the eigenfunctions are orthonormal with respect to $p(x)$, i.e. $\int_X \phi_i(x) p(x) \phi_j(x)\, dx = \delta_{ij}$.
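The Gram-matrix side of this correspondence is easy to simulate. The following minimal numpy sketch (the RBF kernel with bandwidth $b = 3$ and the Gaussian density are illustrative assumptions, echoing the one-dimensional example discussed below) builds $K$ from a sample and confirms that its spectrum is nonnegative and ordered:

```python
import numpy as np

def rbf_kernel(x, y, b=3.0):
    # k(x, y) = exp(-b (x - y)^2) for one-dimensional inputs
    return np.exp(-b * (x[:, None] - y[None, :]) ** 2)

rng = np.random.default_rng(0)
m = 200
x = rng.normal(0.0, 0.5, size=m)      # sample from p(x) = N(0, 1/4)

K = rbf_kernel(x, x)                  # m x m Gram matrix
lam = np.linalg.eigvalsh(K)[::-1]     # spectrum, in descending order

# Mercer kernel: K is symmetric positive semi-definite
assert np.allclose(K, K.T)
assert lam.min() > -1e-8
print(lam[:5])                        # a few leading Gram eigenvalues
```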
Let the eigenvalues be ordered so that $\lambda_1 \ge \lambda_2 \ge \dots$.

This continuous eigenproblem can be approximated in the following way. Let $\{x_i : i = 1, \dots, m\}$ be a sample drawn according to $p(x)$. Then, as pointed out in Williams and Seeger (2000), we can approximate the integral with weight function $p(x)$ by an average over the sample points, and then plug in $y = x_j$ for $j = 1, \dots, m$ to obtain the matrix eigenproblem.

Thus we see that $\mu_i \stackrel{\mathrm{def}}{=} \frac{1}{m}\hat\lambda_i$ is an obvious estimator for the $i$th eigenvalue of the continuous problem. The theory of the numerical solution of eigenvalue problems (Baker 1977, Theorem 3.4) shows that for a fixed $k$, $\mu_k$ will converge to $\lambda_k$ in the limit as $m \to \infty$.

For the case that $X$ is one dimensional, $p(x)$ is Gaussian and $k(x, y) = \exp(-b(x - y)^2)$, there are analytic results for the eigenvalues and eigenfunctions of equation (1) as given in section 4 of Zhu et al. (1998). A plot in Williams and Seeger (2000) for $m = 500$ with $b = 3$ and $p(x) \sim N(0, 1/4)$ shows good agreement between $\mu_i$ and $\lambda_i$ for small $i$, but for larger $i$ the matrix eigenvalues underestimate the process eigenvalues. One of the by-products of this paper will be bounds on the degree of underestimation for this estimation problem in a fully general setting.

Koltchinskii and Giné (2000) discuss a number of results including rates of convergence of the $\mu$-spectrum to the $\lambda$-spectrum. The measure they use compares the whole spectrum rather than individual eigenvalues or subsets of eigenvalues. They also do not deal with the estimation problem for PCA residuals.

2.2 Projections, residuals and eigenvalues

The approach adopted in the proofs of the next section is to relate the eigenvalues to the sums of squares of residuals. Let $x$ be a random variable in $d$ dimensions, and let $X$ be a $d \times m$ matrix containing $m$ sample vectors $x_1, \dots, x_m$. Consider the $m \times m$ matrix $M = X'X$ with eigendecomposition $M = Z\Lambda Z'$.
Then taking $X = Z\sqrt{\Lambda}$ we obtain a finite dimensional version of Mercer's theorem. To set the scene, we now present a short description of the residuals viewpoint.

The starting point is the singular value decomposition of $X = U\Sigma Z'$, where $U$ and $Z$ are orthonormal matrices and $\Sigma$ is a diagonal matrix containing the singular values (in descending order). We can now reconstruct the eigenvalue decomposition of $M = X'X = Z\Sigma U'U\Sigma Z' = Z\Lambda Z'$, where $\Lambda = \Sigma^2$. But equally we can construct a $d \times d$ matrix $N = XX' = U\Sigma Z'Z\Sigma U' = U\Lambda U'$, with the same eigenvalues as $M$. We have made a slight abuse of notation by using $\Lambda$ to represent two matrices of potentially different dimensions, but the larger is simply an extension of the smaller with 0's. Note that $N = mC_x$, where $C_x$ is the sample correlation matrix.

Let $V$ be a linear space spanned by $k$ linearly independent vectors. Let $P_V(x)$ ($P_V^\perp(x)$) be the projection of $x$ onto $V$ (the space perpendicular to $V$), so that $\|x\|^2 = \|P_V(x)\|^2 + \|P_V^\perp(x)\|^2$. Using the Courant-Fischer minimax theorem it can be proved (Shawe-Taylor et al., 2002, equation 4) that

\[ \sum_{i=k+1}^m \lambda_i(M) = \sum_{j=1}^m \|x_j\|^2 - \sum_{i=1}^k \lambda_i(M) = \min_{\dim(V)=k} \sum_{j=1}^m \|P_V^\perp(x_j)\|^2. \]   (2)

Hence the subspace spanned by the first $k$ eigenvectors is characterised as that for which the sum of the squares of the residuals is minimal. We can also obtain similar results for the population case, e.g. $\sum_{i=1}^k \lambda_i = \max_{\dim(V)=k} \mathbb{E}[\|P_V(x)\|^2]$.

2.3 Residuals in feature space

Frequently, we consider all of the above as occurring in a kernel defined feature space, so that wherever we have written a vector $x$ we should have put $\psi(x)$, where $\psi$ is the corresponding feature map $\psi : x \in X \mapsto \psi(x) \in F$ to a feature space $F$. Hence, the matrix $M$ has entries $M_{ij} = \langle \psi(x_i), \psi(x_j) \rangle$.
The kernel function computes the composition of the inner product with the feature maps, $k(x, z) = \langle \psi(x), \psi(z) \rangle = \psi(x)'\psi(z)$, which can in many cases be computed without explicitly evaluating the mapping $\psi$. We would also like to evaluate the projections into eigenspaces without explicitly computing the feature mapping $\psi$. This can be done as follows. Let $u_i$ be the $i$-th singular vector in the feature space, that is the $i$-th eigenvector of the matrix $N$, with the corresponding singular value being $\sigma_i = \sqrt{\lambda_i}$ and the corresponding eigenvector of $M$ being $z_i$. The projection of an input $x$ onto $u_i$ is given by

\[ \psi(x)'u_i = (\psi(x)'U)_i = (\psi(x)'XZ\Sigma^{-1})_i = (k'Z\Sigma^{-1})_i, \]

where we have used the fact that $X = U\Sigma Z'$ and $k_j = \psi(x)'\psi(x_j) = k(x, x_j)$.

Our final background observation concerns the kernel operator and its eigenspaces. The operator in question is

\[ K(f)(x) = \int_X k(x, z) f(z) p(z)\, dz. \]

Provided the operator is positive semi-definite, by Mercer's theorem we can decompose $k(x, z)$ as a sum of eigenfunctions, $k(x, z) = \sum_{i=1}^\infty \lambda_i \phi_i(x)\phi_i(z) = \langle \psi(x), \psi(z) \rangle$, where the functions $(\phi_i(x))_{i=1}^\infty$ form a complete orthonormal basis with respect to the inner product $\langle f, g \rangle_p = \int_X f(x)g(x)p(x)\, dx$ and $\psi(x)$ is the feature space mapping

\[ \psi : x \mapsto (\psi_i(x))_{i=1}^\infty = \left(\sqrt{\lambda_i}\,\phi_i(x)\right)_{i=1}^\infty \in F. \]

Note that $\phi_i(x)$ has norm 1 and satisfies $\lambda_i\phi_i(x) = \int_X k(x, z)\phi_i(z)p(z)\, dz$ (equation 1), so that

\[ \lambda_i = \int_{X^2} k(y, z)\,\phi_i(y)\phi_i(z)\, p(z)p(y)\, dy\, dz. \]   (3)

If we let $\phi(x) = (\phi_i(x))_{i=1}^\infty \in F$, we can define the unit vector $u_i \in F$ corresponding to $\lambda_i$ by $u_i = \int_X \phi_i(x)\phi(x)p(x)\, dx$. For a general function $f(x)$ we can similarly define the vector $f = \int_X f(x)\phi(x)p(x)\, dx$.
Now the expected square of the norm of the projection $P_f(\psi(x))$ onto the vector $f$ (assumed to be of norm 1) of an input $\psi(x)$ drawn according to $p(x)$ is given by

\begin{align*}
\mathbb{E}\left[\|P_f(\psi(x))\|^2\right] &= \int_X \|P_f(\psi(x))\|^2 p(x)\, dx = \int_X (f'\psi(x))^2 p(x)\, dx \\
&= \int_X \int_X \int_X f(y)\phi(y)'\psi(x)p(y)\, dy\; f(z)\phi(z)'\psi(x)p(z)\, dz\; p(x)\, dx \\
&= \int_{X^3} f(y)f(z) \sum_{j=1}^\infty \sqrt{\lambda_j}\,\phi_j(y)\phi_j(x)p(y)\, dy \sum_{\ell=1}^\infty \sqrt{\lambda_\ell}\,\phi_\ell(z)\phi_\ell(x)p(z)\, dz\; p(x)\, dx \\
&= \int_{X^2} f(y)f(z) \sum_{j,\ell=1}^\infty \sqrt{\lambda_j}\,\phi_j(y)p(y)\, dy\; \sqrt{\lambda_\ell}\,\phi_\ell(z)p(z)\, dz \int_X \phi_j(x)\phi_\ell(x)p(x)\, dx \\
&= \int_{X^2} f(y)f(z) \sum_{j=1}^\infty \lambda_j\phi_j(y)\phi_j(z)\, p(y)p(z)\, dy\, dz \\
&= \int_{X^2} f(y)f(z)\, k(y, z)\, p(y)p(z)\, dy\, dz.
\end{align*}

Since all vectors $f$ in the subspace spanned by the image of the input space in $F$ can be expressed in this fashion, it follows using (3) that the finite case characterisation of eigenvalues and eigenvectors carries over, with the sum replaced by an expectation:

\[ \lambda_k = \max_{\dim(V)=k} \min_{0 \ne v \in V} \mathbb{E}\left[\|P_v(\psi(x))\|^2\right], \]   (4)

where $V$ is a linear subspace of the feature space $F$. Similarly,

\[ \sum_{i=1}^k \lambda_i = \max_{\dim(V)=k} \mathbb{E}\left[\|P_V(\psi(x))\|^2\right] = \mathbb{E}\left[\|\psi(x)\|^2\right] - \min_{\dim(V)=k} \mathbb{E}\left[\|P_V^\perp(\psi(x))\|^2\right], \]   (5)

where $P_V(\psi(x))$ ($P_V^\perp(\psi(x))$) is the projection of $\psi(x)$ into the subspace $V$ (the projection of $\psi(x)$ into the space orthogonal to $V$).

2.4 Plan of campaign

We are now in a position to motivate the main results of the paper. We consider the general case of a kernel defined feature space with input space $X$ and probability density $p(x)$. We fix a sample size $m$ and a draw of $m$ examples $S = (x_1, x_2, \dots, x_m)$ according to $p$. Further we fix a feature dimension $k$. Let $\hat V_k$ be the space spanned by the first $k$ eigenvectors of the sample kernel matrix $K$ with corresponding eigenvalues $\hat\lambda_1, \hat\lambda_2, \dots, \hat\lambda_k$, while $V_k$ is the space spanned by the first $k$ process eigenvectors with corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_k$. Similarly, let $\hat{\mathbb{E}}[f(x)]$ denote expectation with respect to the sample, $\hat{\mathbb{E}}[f(x)] = \frac{1}{m}\sum_{i=1}^m f(x_i)$, while as before $\mathbb{E}[\cdot]$ denotes expectation with respect to $p$.
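The residual characterization of equation (2) can be checked concretely in a finite-dimensional feature space. The numpy sketch below (plain linear features standing in for $\psi$, with dimensions chosen arbitrarily) verifies that the top-$k$ singular subspace attains the minimum residual, and that a randomly drawn $k$-dimensional subspace does no better:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, k = 5, 50, 2
X = rng.normal(size=(d, m))          # columns are the sample x_1, ..., x_m

M = X.T @ X
lam = np.linalg.eigvalsh(M)[::-1]    # eigenvalues of M, descending

U, s, Zt = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :k]                        # top-k eigenvectors of N = XX'

# residual of the top-k subspace: sum_j ||P_V^perp(x_j)||^2
resid_opt = np.sum(X ** 2) - np.sum((Uk.T @ X) ** 2)
assert np.isclose(resid_opt, lam[k:].sum())   # equation (2), minimum attained

# any other k-dimensional subspace leaves at least as large a residual
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
resid_rand = np.sum(X ** 2) - np.sum((Q.T @ X) ** 2)
assert resid_rand >= resid_opt - 1e-9
```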
\nWe are interested in the relationships between the following quantities: \nE [IIPVk (x)11 2] = ~ 2:7=1 ~i = 2:7=1 ILi , (ii) \n\n(i) \nlE [IIPVk(X)112] = 2:7=1 Ai (iii) \n\n\flE [IIPVk (x)11 2] and (iv) IE [IIPVk (x)11 2] . Bounding the difference between the first \nand second will relate the process eigenvalues to the sample eigenvalues, while the \ndifference between the first and third will bound the expected performance of the \nspace identified by kernel PCA when used on new data. \nOur first two observations follow simply from equation (5), \n\nIE [IIPYk (x) 112] \n\nand \n\nlE [IIPVk (x) 11 2] \n\nA \n\n[ \n\nk \n\n2] \nAi ~ lE IIPVk (x) II , \n\n1 l: A \n-\nm i=l \nk \nl: Ai ~ lE [IIPYk (x)11 2] . \ni=l \n\n(6) \n\n(7) \n\nOur strategy will be to show that the right hand side of inequality (6) and the left \nhand side of inequality (7) are close in value making the two inequalities approxi(cid:173)\nmately a chain of inequalities. We then bound the difference between the first and \nlast entries in the chain. \n\n3 A veraging over Samples and Population Eigenvalues \nThe sample correlation matrix is ex = ~XXI with eigenvalues ILl ~ IL2\u00b7\u00b7\u00b7 ~ ILd. \nIn the notation of the section 2 ILi = (l/m),\\i ' The corresponding population \ncorrelation matrix has eigenvalues Al ~ A2 ... ~ Ad and eigenvectors ul , . .. , U d. \nAgain by the observations above these are the process eigenvalues. Let lE.n [.] denote \naverages over random samples of size m . \n\nThe following proposition describes how lE.n [ILl ] is related to Al and lE.n [ILd] is related \nto Ad. It requires no assumption of Gaussianity. \nProposition 1 (Anderson, 1963, pp 145-146) lE.n [ILd ~ Al and lE.n[ILd] :s: Ad' \nProof: By the results of the previous section we have \n\nWe now apply the expectation operator lE.n to both sides. On the RHS we get \n\nlE.nIE [llFul (x )11 2] = lE [llFul (x)112] = Al \n\nby equation (5), which completes the proof. 
Correspondingly, $\mu_d$ is characterized by $\mu_d = \min_{0 \ne c} \hat{\mathbb{E}}[\|P_c(x)\|^2]$ (minor components analysis). $\Box$

Interpreting this result, we see that $\mathbb{E}_m[\mu_1]$ overestimates $\lambda_1$, while $\mathbb{E}_m[\mu_d]$ underestimates $\lambda_d$.

Proposition 1 can be generalized to give the following result, where we have also allowed for a kernel defined feature space of dimension $N_F \le \infty$.

Proposition 2 Using the above notation, for any $k$, $1 \le k \le m$, $\mathbb{E}_m\left[\sum_{i=1}^k \mu_i\right] \ge \sum_{i=1}^k \lambda_i$ and $\mathbb{E}_m\left[\sum_{i=k+1}^m \mu_i\right] \le \sum_{i=k+1}^{N_F} \lambda_i$.

Proof: Let $V_k$ be the space spanned by the first $k$ process eigenvectors. Then from the derivations above we have

\[ \sum_{i=1}^k \mu_i = \max_{\dim(V)=k} \hat{\mathbb{E}}\left[\|P_V(\psi(x))\|^2\right] \ge \hat{\mathbb{E}}\left[\|P_{V_k}(\psi(x))\|^2\right]. \]

Again, applying the expectation operator $\mathbb{E}_m$ to both sides of this equation and taking equation (5) into account, the first inequality follows. To prove the second we turn max into min, $P$ into $P^\perp$, and reverse the inequality. Again taking expectations of both sides proves the second part. $\Box$

Applying the results obtained in this section, it follows that $\mathbb{E}_m[\mu_1]$ will overestimate $\lambda_1$, and the cumulative sum $\sum_{i=1}^k \mathbb{E}_m[\mu_i]$ will overestimate $\sum_{i=1}^k \lambda_i$. At the other end, clearly for $N_F \ge k > m$, $\mu_k \equiv 0$ is an underestimate of $\lambda_k$.

4 Concentration of eigenvalues

We now make use of results from Shawe-Taylor et al. (2002) concerning the concentration of the eigenvalue spectrum of the Gram matrix. We have

Theorem 3 Let $k(x, z)$ be a positive semi-definite kernel function on a space $X$, and let $p$ be a probability density function on $X$. Fix natural numbers $m$ and $1 \le k < m$ and let $S = (x_1, \dots, x_m) \in X^m$ be a sample of $m$ points drawn according to $p$. Then for all $t > 0$,

\[ P\left\{ \left| \frac{1}{m}\hat\lambda^{\le k}(S) - \mathbb{E}_m\left[\frac{1}{m}\hat\lambda^{\le k}(S)\right] \right| \ge t \right\} \le 2\exp\left(\frac{-2t^2m}{R^4}\right), \]

where $\hat\lambda^{\le k}(S)$ is the sum of the largest $k$ eigenvalues of the matrix $K(S)$ with entries $K(S)_{ij} = k(x_i, x_j)$ and $R^2 = \max_{x \in X} k(x, x)$.
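The concentration asserted by Theorem 3 is easy to observe in simulation. The sketch below (an RBF kernel, for which $R^2 = \max_x k(x, x) = 1$, and a Gaussian density, both illustrative assumptions) draws repeated samples and measures the spread of $\frac{1}{m}\hat\lambda^{\le k}(S)$ across draws:

```python
import numpy as np

def rbf_kernel(x, y, b=3.0):
    return np.exp(-b * (x[:, None] - y[None, :]) ** 2)

rng = np.random.default_rng(2)
m, k, trials = 100, 3, 50

sums = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, 0.5, size=m)                  # fresh draw of S
    lam = np.linalg.eigvalsh(rbf_kernel(x, x))[::-1]
    sums[t] = lam[:k].sum() / m                       # (1/m) lambda^{<=k}(S)

# deviations should be of order R^2 / sqrt(m) = 0.1 or smaller
print(sums.mean(), sums.std())
```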
\nThis follows by a similar derivation to Theorem 5 in Shawe-Taylor et al. (2002). \nOur next result concerns the concentration of the residuals with respect to a fixed \nsubspace. For a subspace V and training set S, we introduce the notation \n\nFv(S) = t [llPv('IjJ(x)) 112] . \n\nTheorem 4 Let p be a probability density function on X. Fix natural numbers m \nand a subspace V and let S = (Xl' ... ' Xm) E xm be a sample of m points drawn \naccording to a probability density function p. Then for all t > 0, \n::::: t} :::; 2exp (~~~) . \n\nP{Fv(S) - Em [Fv(S)] 1 \n\nThis is theorem 6 in Shawe-Taylor et al. (2002). \nThe concentration results of this section are very tight. In the notation of the earlier \nsections they show that with high probability \n\nand \n\nk L Ai ~ t [IIPVk ('IjJ(x))W] , \n\ni = l \n\n(9) \n\nwhere we have used Theorem 3 to obtain the first approximate equality and Theo(cid:173)\nrem 4 with V = Vk to obtain the second approximate equality. \nThis gives the sought relationship to create an approximate chain of inequalities \n\n~ IE [IIPVk('IjJ(x))112] = L Ai::::: IE [IIPVk ('IjJ(X)) 112] . (10) \n\nk \n\ni = l \n\n\fThis approximate chain of inequalities could also have been obtained using Propo(cid:173)\nsition 2. It remains to bound the difference between the first and last entries in this \nchain. This together with the concentration results of this section will deliver the \nrequired bounds on the differences between empirical and process eigenvalues, as \nwell as providing a performance bound on kernel peA. \n\n5 Learning a projection matrix \n\nThe key observation that enables the analysis bounding the difference between \nt [IIPvJ!p(X)) 11 2] and IE [IIPvJ'I/J(x)) 11 2] is that we can view the projection norm \nIIPvJ'I/J(x))1 12 as a linear function of pairs offeatures from the feature space F. 
\nProposition 5 The projection norm IIPVk ('I/J(X)) 11 2 is a linear function j in a fea(cid:173)\nture space F for which the kernel function is given by k(x, z) = k(x , Z)2. Further(cid:173)\nmore the 2-norm of the function j is Vk. \nProof: Let X = Uy:.Z' be the singular value decomposition of the sample matrix X \nin the feature space. The projection norm is then given by j(x) = IIPVk ('I/J(X)) 11 2 = \n'I/J(x)'UkUk'I/J(x), where Uk is the matrix containing the first k columns of U. Hence \nwe can write \n\nIIPvJ'I/J(x))11 2 = l: (Xij'I/J(X) i'I/J(X)j = l: (Xij1p(X)ij, \n\nNF \n\nNF \n\nwhere 1p is the projection mapping into the feature space F consisting of all pairs \nof F features and (Xij = (UkUk)ij. The standard polynomial construction gives \n\nij=l \n\nij=l \n\nk(x, z) \n\nNF \n\nl: 'I/J(X)i'I/J(Z)i'I/J(X)j'I/J(z)j = l: ('I/J(X)i'I/J(X)j)('I/J(Z)i'I/J(Z)j) \n\nNF \n\ni,j=l \n\ni,j=l \n\nIt remains to show that the norm of the linear function is k. The norm satisfies \n(note that II . IIF denotes the Frobenius norm and U i the columns of U) \n\ni~' a1j ~ IIU,U;II} ~ (~\",U;, t, Ujuj) F ~ it, (U;Uj)' ~ k \n\nIlill' \n\nas required. D \nWe are now in a position to apply a learning theory bound where we consider a \nregression problem for which the target output is the square of the norm of the \nsample point 11'I/J(x)112. We restrict the linear function in the space F to have norm \nVk. The loss function is then the shortfall between the output of j and the squared \nnorm. \nUsing Rademacher complexity theory we can obtain the following theorems: \n\nTheorem 6 If we perform peA in the feature space defined by a kernel k(x, z) \nthen with probability greater than 1 - 6, for all 1 :::; k :::; m, if we project new data \n\n\fonto the space 11k , the expected squared residual is bounded by \n\n,\\,>. :<: IE [ IIPt; (\"'(x)) II' 1 < '~'~k [ ~ \\>l(S) + 7# ,----------------, \n\n+R2 ~ln C:) \n\nl max [~.x