{"title": "Kernel PCA and De-Noising in Feature Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 536, "page_last": 542, "abstract": null, "full_text": "Kernel PCA and De-Noising in Feature Spaces \n\nSebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, Gunnar Rätsch \nGMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany \n{mika, bs, smola, klaus, scholz, raetsch}@first.gmd.de \n\nAbstract \n\nKernel PCA as a nonlinear feature extractor has proven powerful as a preprocessing step for classification algorithms. But it can also be considered as a natural generalization of linear principal component analysis. This gives rise to the question how to use nonlinear features for data compression, reconstruction, and de-noising, applications common in linear PCA. This is a nontrivial task, as the results provided by kernel PCA live in some high dimensional feature space and need not have pre-images in input space. This work presents ideas for finding approximate pre-images, focusing on Gaussian kernels, and shows experimental results using these pre-images in data reconstruction and de-noising on toy examples as well as on real world data. \n\n1 PCA and Feature Spaces \n\nPrincipal Component Analysis (PCA) (e.g. [3]) is an orthogonal basis transformation. The new basis is found by diagonalizing the centered covariance matrix of a data set {x_k ∈ R^N | k = 1, ..., ℓ}, defined by C = ⟨(x_i − ⟨x_k⟩)(x_i − ⟨x_k⟩)^T⟩. The coordinates in the Eigenvector basis are called principal components. The size of an Eigenvalue λ corresponding to an Eigenvector v of C equals the amount of variance in the direction of v. Furthermore, the directions of the first n Eigenvectors corresponding to the biggest n Eigenvalues cover as much variance as possible by n orthogonal directions. In many applications they contain the most interesting information: for instance, in data compression, where we project onto the directions with biggest variance to retain as much information as possible, or in de-noising, where we deliberately drop directions with small variance. \n\nClearly, one cannot assert that linear PCA will always detect all structure in a given data set. By the use of suitable nonlinear features, one can extract more information. Kernel PCA is very well suited to extract interesting nonlinear structures in the data [9]. The purpose of this work is therefore (i) to consider nonlinear de-noising based on kernel PCA and (ii) to clarify the connection between feature space expansions and meaningful patterns in input space. Kernel PCA first maps the data into some feature space F via a (usually nonlinear) function Φ and then performs linear PCA on the mapped data, without carrying out the map Φ explicitly. A Mercer kernel is a function k(x, y) which for all data sets {x_i} gives rise to a positive matrix K_ij = k(x_i, x_j) [6]. One can show that using k instead of a dot product in input space corresponds to mapping the data with some Φ to a feature space F [1], i.e. k(x, y) = (Φ(x) · Φ(y)). Kernels that have proven useful include Gaussian kernels k(x, y) = exp(−||x − y||^2 / c) and polynomial kernels k(x, y) = (x · y)^d. Clearly, all algorithms that can be formulated in terms of dot products, e.g. Support Vector Machines [1], can be carried out in some feature space F without mapping the data explicitly. All these algorithms construct their solutions as expansions in the potentially infinite-dimensional feature space. \n\nThe paper is organized as follows: in the next section, we briefly describe the kernel PCA algorithm. In section 3, we present an algorithm for finding approximate pre-images of expansions in feature space. 
Experimental results on toy and real world data are given in section 4, followed by a discussion of our findings (section 5). \n\n2 Kernel PCA and Reconstruction \n\nTo perform PCA in feature space, we need to find Eigenvalues λ > 0 and Eigenvectors V ∈ F\{0} satisfying λV = CV with C = ⟨Φ(x_k)Φ(x_k)^T⟩.¹ Substituting C into the Eigenvector equation, we note that all solutions V must lie in the span of Φ-images of the training data. This implies that we can consider the equivalent system \n\nλ(Φ(x_k) · V) = (Φ(x_k) · CV) for all k = 1, ..., ℓ   (1) \n\nand that there exist coefficients α_1, ..., α_ℓ such that \n\nV = Σ_{i=1}^ℓ α_i Φ(x_i)   (2) \n\nSubstituting C and (2) into (1), and defining an ℓ × ℓ matrix K by K_ij := (Φ(x_i) · Φ(x_j)) = k(x_i, x_j), we arrive at a problem which is cast in terms of dot products: solve \n\nℓλα = Kα   (3) \n\nwhere α = (α_1, ..., α_ℓ)^T (for details see [7]). Normalizing the solutions V^k, i.e. (V^k · V^k) = 1, translates into λ_k(α^k · α^k) = 1. To extract nonlinear principal components for the Φ-image of a test point x we compute the projection onto the k-th component by β_k := (V^k · Φ(x)) = Σ_{i=1}^ℓ α_i^k k(x, x_i). For feature extraction, we thus have to evaluate ℓ kernel functions instead of a dot product in F, which is expensive if F is high-dimensional (or, as for Gaussian kernels, infinite-dimensional). To reconstruct the Φ-image of a vector x from its projections β_k onto the first n principal components in F (assuming that the Eigenvectors are ordered by decreasing Eigenvalue size), we define a projection operator P_n by \n\nP_n Φ(x) = Σ_{k=1}^n β_k V^k   (4) \n\nIf n is large enough to take into account all directions belonging to Eigenvectors with nonzero Eigenvalue, we have P_n Φ(x_i) = Φ(x_i). 
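As an aside, the eigenproblem (3), the normalization of the α^k (so that each V^k has unit length in F, i.e. α^k'Kα^k = 1), and the projection step β_k can be sketched in NumPy. This is a minimal sketch with our own function and variable names, not the authors' code:

```python
import numpy as np

def gaussian_kernel(X, Y, c):
    # k(x, y) = exp(-||x - y||^2 / c)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

def kernel_pca(X, c, n):
    # Solve (3) by eigendecomposing the kernel matrix K
    # (mapped data assumed centered in F, cf. footnote 1).
    K = gaussian_kernel(X, X, c)
    mu, A = np.linalg.eigh(K)          # eigenvalues of K, ascending
    mu, A = mu[::-1][:n], A[:, ::-1][:, :n]   # keep the n largest
    # Normalize so that (V^k . V^k) = alpha^k' K alpha^k = 1.
    return A / np.sqrt(mu)

def project(X_train, alphas, x, c):
    # beta_k = sum_i alpha_i^k k(x, x_i): nonlinear principal components
    kx = gaussian_kernel(x[None, :], X_train, c).ravel()
    return alphas.T @ kx
```

Note that, as the text says, each projection costs ℓ kernel evaluations: the test point is compared against every training point.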
Otherwise (kernel) PCA still satisfies (i) that the overall squared reconstruction error Σ_i ||P_n Φ(x_i) − Φ(x_i)||^2 is minimal and (ii) the retained variance is maximal among all projections onto orthogonal directions in F. In common applications, however, we are interested in a reconstruction in input space rather than in F. The present work attempts to achieve this by computing a vector z satisfying Φ(z) = P_n Φ(x). The hope is that for the kernel used, such a z will be a good approximation of x in input space. However, (i) such a z will not always exist and (ii) if it exists, it need not be unique.² As an example for (i), we consider a possible representation of F. One can show [7] that Φ can be thought of as a map Φ(x) = k(x, ·) into a Hilbert space H_k of functions Σ_i a_i k(x_i, ·) with a dot product satisfying (k(x, ·) · k(y, ·)) = k(x, y). Then H_k is called reproducing kernel Hilbert space (e.g. [6]). Now, for a Gaussian kernel, H_k contains all linear superpositions of Gaussian bumps on R^N (plus limit points), whereas by definition of Φ only single bumps k(x, ·) have pre-images under Φ. When the vector P_n Φ(x) has no pre-image z we try to approximate it by minimizing \n\nρ(z) = ||Φ(z) − P_n Φ(x)||^2   (5) \n\nThis is a special case of the reduced set method [2]. Replacing terms independent of z by Ω, we obtain \n\nρ(z) = ||Φ(z)||^2 − 2(Φ(z) · P_n Φ(x)) + Ω   (6) \n\nSubstituting (4) and (2) into (6), we arrive at an expression which is written in terms of dot products. \n\n¹ For simplicity, we assume that the mapped data are centered in F. Otherwise, we have to go through the same algebra using Φ̃(x) := Φ(x) − ⟨Φ(x_i)⟩. 
Consequently, we can introduce a kernel to obtain a formula for ρ (and thus ∇_z ρ) which does not rely on carrying out Φ explicitly \n\nρ(z) = k(z, z) − 2 Σ_{k=1}^n β_k Σ_{i=1}^ℓ α_i^k k(z, x_i) + Ω   (7) \n\n3 Pre-Images for Gaussian Kernels \n\nTo optimize (7) we employed standard gradient descent methods. If we restrict our attention to kernels of the form k(x, y) = k(||x − y||^2) (and thus satisfying k(x, x) ≡ const. for all x), an optimal z can be determined as follows (cf. [8]): we deduce from (6) that we have to maximize \n\nρ̃(z) = (Φ(z) · P_n Φ(x)) + Ω' = Σ_{i=1}^ℓ γ_i k(z, x_i) + Ω'   (8) \n\nwhere we set γ_i = Σ_{k=1}^n β_k α_i^k (for some Ω' independent of z). For an extremum, the gradient with respect to z has to vanish: ∇_z ρ̃(z) = Σ_{i=1}^ℓ γ_i k'(||z − x_i||^2)(z − x_i) = 0. This leads to a necessary condition for the extremum: z = Σ_i δ_i x_i / Σ_j δ_j, with δ_i = γ_i k'(||z − x_i||^2). For a Gaussian kernel k(x, y) = exp(−||x − y||^2 / c) we get \n\nz = (Σ_{i=1}^ℓ γ_i exp(−||z − x_i||^2 / c) x_i) / (Σ_{i=1}^ℓ γ_i exp(−||z − x_i||^2 / c))   (9) \n\nWe note that the denominator equals (Φ(z) · P_n Φ(x)) (cf. (8)). Making the assumption that P_n Φ(x) ≠ 0, we have (Φ(x) · P_n Φ(x)) = (P_n Φ(x) · P_n Φ(x)) > 0. As k is smooth, we conclude that there exists a neighborhood of the extremum of (8) in which the denominator of (9) is ≠ 0. Thus we can devise an iteration scheme for z by \n\nz_{t+1} = (Σ_{i=1}^ℓ γ_i exp(−||z_t − x_i||^2 / c) x_i) / (Σ_{i=1}^ℓ γ_i exp(−||z_t − x_i||^2 / c))   (10) \n\nNumerical instabilities related to (Φ(z) · P_n Φ(x)) being small can be dealt with by restarting the iteration with a different starting value. Furthermore we note that any fixed-point of (10) will be a linear combination of the kernel PCA training data x_i. 
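The fixed-point iteration (10) is straightforward to implement. The following NumPy sketch (our own naming, including the restart heuristic mentioned above) assumes the coefficients γ_i have already been computed:

```python
import numpy as np

def preimage(X, gamma, c, z0, n_iter=200, tol=1e-10, rng=None):
    # Fixed-point iteration (10) for an approximate pre-image z of
    # P_n Phi(x), given gamma_i = sum_k beta_k alpha_i^k.
    rng = rng or np.random.default_rng(0)
    z = z0.astype(float).copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        denom = w.sum()
        if abs(denom) < 1e-12:
            # denominator ~ (Phi(z) . P_n Phi(x)) vanished:
            # restart from a perturbed starting value
            z = z0 + 0.1 * rng.standard_normal(z0.shape)
            continue
        z_new = w @ X / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

Each iterate is a weighted combination of the training points x_i, so any fixed point indeed lies in their span, as noted in the text.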
If we regard (10) in the context of clustering we see that it resembles an iteration step for the estimation of the center of a single Gaussian cluster. The weights or 'probabilities' γ_i reflect the (anti-)correlation between the amount of Φ(x) in Eigenvector direction V^k and the contribution of Φ(x_i) to this Eigenvector. So the 'cluster center' z is drawn towards training patterns with positive γ_i and pushed away from those with negative γ_i, i.e. for a fixed-point z_∞ the influence of training patterns with smaller distance to x will tend to be bigger. \n\n² If the kernel allows reconstruction of the dot-product in input space, and under the assumption that a pre-image exists, it is possible to construct it explicitly (cf. [7]). But clearly, these conditions do not hold true in general. \n\n4 Experiments \n\nTo test the feasibility of the proposed algorithm, we ran several toy and real world experiments. They were performed using (10) and Gaussian kernels of the form k(x, y) = exp(−||x − y||^2 / (n c)) where n equals the dimension of input space. We mainly focused on the application of de-noising, which differs from reconstruction by the fact that we are allowed to make use of the original test data as starting points in the iteration. \n\nToy examples: In the first experiment (table 1), we generated a data set from eleven Gaussians in R^10 with zero mean and variance σ^2 in each component, by selecting from each source 100 points as a training set and 33 points for a test set (centers of the Gaussians randomly chosen in [−1, 1]^10). Then we applied kernel PCA to the training set and computed the projections β_k of the points in the test set. With these, we carried out de-noising, yielding an approximate pre-image in R^10 for each test point. 
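Putting the pieces together, the de-noising procedure just described (kernel PCA on the training set, projections β_k of a test point, coefficients γ_i, then iteration (10) started at the test point itself) can be sketched as follows. The data sizes here are scaled down from the experiment in the text, and all names are our own:

```python
import numpy as np

def gk(X, Y, c):
    # Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / c)
    return np.exp(-((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / c)

def denoise(X_train, x, c, n_comp, n_iter=100):
    K = gk(X_train, X_train, c)
    mu, A = np.linalg.eigh(K)
    mu, A = mu[::-1][:n_comp], A[:, ::-1][:, :n_comp]
    alphas = A / np.sqrt(mu)            # normalize: (V^k . V^k) = 1
    beta = alphas.T @ gk(x[None], X_train, c).ravel()  # projections
    gamma = alphas @ beta               # gamma_i = sum_k beta_k alpha_i^k
    z = x.copy()                        # start iteration at the noisy point
    for _ in range(n_iter):             # iteration (10); we assume the
        w = gamma * np.exp(-((X_train - z) ** 2).sum(1) / c)
        z = w @ X_train / w.sum()       # denominator stays away from 0
    return z

# scaled-down toy setup: a few Gaussian clusters in R^10
rng = np.random.default_rng(1)
centers = rng.uniform(-1, 1, size=(3, 10))
sigma = 0.1
X_train = np.vstack([c_ + sigma * rng.standard_normal((50, 10)) for c_ in centers])
x_noisy = centers[0] + sigma * rng.standard_normal(10)
z = denoise(X_train, x_noisy, c=2 * sigma**2 * 10, n_comp=3)
```

The kernel width follows the text's choice c = 2σ² (times the input dimension, matching the exp(−||x − y||²/(nc)) convention above).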
This procedure was repeated for different numbers of components in reconstruction, and for different values of σ. For the kernel, we used c = 2σ^2. We compared the results provided by our algorithm to those of linear PCA via the mean squared distance of all de-noised test points to their corresponding center. Table 1 shows the ratio of these values; here and below, ratios larger than one indicate that kernel PCA performed better than linear PCA. For almost every choice of n and σ, kernel PCA did better. Note that using all 10 components, linear PCA is just a basis transformation and hence cannot de-noise. The extreme superiority of kernel PCA for small σ is due to the fact that all test points are in this case located close to the eleven spots in input space, and linear PCA has to cover them with less than ten directions. Kernel PCA moves each point to the correct source even when using only a small number of components. \n\nTable 1: De-noising Gaussians in R^10 (see text). Performance ratios larger than one indicate how much better kernel PCA did, compared to linear PCA, for different choices of the Gaussians' std. dev. σ, and different numbers of components n used in reconstruction. \n\nσ \ n:  1        2        3       4       5       6       7       8       9 \n0.05   2058.42  1238.36  846.14  565.41  309.64  170.36  125.97  104.40  92.23 \n0.1    ... \n0.2    ... \n0.4    ... \n0.8    ... \n\nTo get some intuitive understanding in a low-dimensional case, figure 1 depicts the results of de-noising a half circle and a square in the plane, using kernel PCA, a nonlinear autoencoder, principal curves, and linear PCA. 
The principal curves algorithm [4] iteratively estimates a curve capturing the structure of the data. The data are projected to the closest point on a curve which the algorithm tries to construct such that each point is the average of all data points projecting onto it. It can be shown that the only straight lines satisfying the latter are principal components, so principal curves are a generalization of the latter. The algorithm uses a smoothing parameter which is annealed during the iteration. In the nonlinear autoencoder algorithm, a 'bottleneck' 5-layer network is trained to reproduce the input values as outputs (i.e. it is used in autoassociative mode). The hidden unit activations in the third layer form a lower-dimensional representation of the data, closely related to PCA (see for instance [3]). Training is done by conjugate gradient descent. In all algorithms, parameter values were selected such that the best possible de-noising result was obtained. The figure shows that on the closed square problem, kernel PCA does (subjectively) best, followed by principal curves and the nonlinear autoencoder; linear PCA fails completely. However, note that all algorithms except for kernel PCA actually provide an explicit one-dimensional parameterization of the data, whereas kernel PCA only provides us with a means of mapping points to their de-noised versions (in this case, we used four kernel PCA features, and hence obtain a four-dimensional parameterization). \n\nFigure 1: De-noising in 2-d (see text). Depicted are the data set (small points) and its de-noised version (big points, joining up to solid lines), for kernel PCA, a nonlinear autoencoder, principal curves, and linear PCA. For linear PCA, we used one component for reconstruction, as using two components, reconstruction is perfect and thus does not de-noise. 
Note that all algorithms except for our approach have problems in capturing the circular structure in the bottom example. \n\nUSPS example: To test our approach on real-world data, we also applied the algorithm to the USPS database of 256-dimensional handwritten digits. For each of the ten digits, we randomly chose 300 examples from the training set and 50 examples from the test set. We used (10) and Gaussian kernels with c = 0.50, equaling twice the average of the data's variance in each dimension. In figure 2, we give two possible depictions of the Eigenvectors found by kernel PCA, compared to those found by linear PCA for the USPS set. \n\nFigure 2: Visualization of Eigenvectors (see text). Depicted are the 2^0, ..., 2^8-th Eigenvectors (from left to right). First row: linear PCA, second and third row: different visualizations for kernel PCA. \n\nThe second row shows the approximate pre-images of the Eigenvectors V^k, k = 2^0, ..., 2^8, found by our algorithm. In the third row each image is computed as follows: pixel i is the projection of the [...] \n\n5 Discussion \n\n[...] In addition, if the structure to be extracted is nonlinear, then linear PCA must necessarily fail, as we have illustrated with toy examples. \n\nThese methods, along with depictions of pre-images of vectors in feature space, provide some understanding of kernel methods which have recently attracted increasing attention. Open questions include (i) what kind of results kernels other than Gaussians will provide, (ii) whether there is a more efficient way to solve either (6) or (8), and (iii) the comparison (and connection) to alternative nonlinear de-noising methods (cf. [5]). \n\nFigure 4: De-Noising of USPS data (see text). The left half shows: top: the first occurrence of each digit in the test set, second row: the upper digit with additive Gaussian noise (σ = 0.5), following five rows: the reconstruction for linear PCA using n = 1, 4, 16, 64, 256 components, and, last five rows: the results of our approach using the same number of components. In the right half we show the same but for 'speckle' noise with probability p = 0.4. \n\nReferences \n\n[1] B. Boser, I. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proc. COLT, pages 144-152, Pittsburgh, 1992. ACM Press. \n[2] C.J.C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proceedings, 13th ICML, pages 71-77, San Mateo, CA, 1996. \n[3] K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. Wiley, New York, 1996. \n[4] T. Hastie and W. Stuetzle. Principal curves. JASA, 84:502-516, 1989. \n[5] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, December 1993. \n[6] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England, 1988. \n[7] B. Schölkopf. Support vector learning. Oldenbourg Verlag, Munich, 1997. \n[8] B. Schölkopf, P. Knirsch, A. Smola, and C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. In P. Levi et al., editors, DAGM'98, pages 124-132, Berlin, 1998. Springer. \n[9] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998. 
", "award": [], "sourceid": 1491, "authors": [{"given_name": "Sebastian", "family_name": "Mika", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Matthias", "family_name": "Scholz", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}]}