{"title": "Sparse Kernel Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 639, "abstract": null, "full_text": "Sparse Kernel \n\nPrincipal Component Analysis \n\nMichael E. Tipping \n\nMicrosoft Research \n\nSt George House, 1 Guildhall St \n\nCambridge CB2 3NH, U.K. \nmtipping~microsoft.com \n\nAbstract \n\n'Kernel' principal component analysis (PCA) is an elegant non(cid:173)\nlinear generalisation of the popular linear data analysis method, \nwhere a kernel function implicitly defines a nonlinear transforma(cid:173)\ntion into a feature space wherein standard PCA is performed. Un(cid:173)\nfortunately, the technique is not 'sparse', since the components \nthus obtained are expressed in terms of kernels associated with ev(cid:173)\nery training vector. This paper shows that by approximating the \ncovariance matrix in feature space by a reduced number of exam(cid:173)\nple vectors, using a maximum-likelihood approach, we may obtain \na highly sparse form of kernel PCA without loss of effectiveness. \n\n1 \n\nIntroduction \n\nPrincipal component analysis (PCA) is a well-established technique for dimension(cid:173)\nality reduction, and examples of its many applications include data compression, \nimage processing, visualisation, exploratory data analysis, pattern recognition and \ntime series prediction. Given a set of N d-dimensional data vectors X n , which we \ntake to have zero mean, the principal components are the linear projections onto \nthe 'principal axes', defined as the leading eigenvectors of the sample covariance \nmatrix S = N-1Z=:=lXnX~ = N-1XTX, where X = (Xl,X2, ... ,XN)T is the \nconventionally-defined 'design' matrix. These projections are of interest as they \nretain maximum variance and minimise error of subsequent linear reconstruction. \n\nHowever, because PCA only defines a linear projection of the data, the scope of \nits application is necessarily somewhat limited. 
This has naturally motivated various developments of nonlinear 'principal component analysis' in an effort to model non-trivial data structures more faithfully, and a particularly interesting recent innovation has been 'kernel PCA' [4].

Kernel PCA, summarised in Section 2, makes use of the 'kernel trick', so effectively exploited by the 'support vector machine', in that a kernel function $k(\cdot,\cdot)$ may be considered to represent a dot (inner) product in some transformed space if it satisfies Mercer's condition, i.e. if it is the continuous symmetric kernel of a positive integral operator. This can be an elegant way to 'non-linearise' linear procedures which depend only on inner products of the examples.

Applications utilising kernel PCA are emerging [2], but in practice the approach suffers from one important disadvantage in that it is not a sparse method. Computation of principal component projections for a given input $\mathbf{x}$ requires evaluation of the kernel function $k(\mathbf{x}, \mathbf{x}_n)$ in respect of all $N$ 'training' examples $\mathbf{x}_n$. This is an unfortunate limitation as in practice, to obtain the best model, we would like to estimate the kernel principal components from as much data as possible.

Here we tackle this problem by first approximating the covariance matrix in feature space by a subset of outer products of feature vectors, using a maximum-likelihood criterion based on a 'probabilistic PCA' model detailed in Section 3. Subsequently applying (kernel) PCA defines sparse projections. Importantly, the approximation we adopt is principled and controllable, and is related to the choice of the number of components to 'discard' in the conventional approach. We demonstrate its efficacy in Section 4 and illustrate how it can offer similar performance to a full non-sparse kernel PCA implementation while offering much reduced computational overheads.
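As a concrete reference point for the linear case described above, here is a minimal numpy sketch (our own illustration on synthetic data, not code from the paper) of PCA via the leading eigenvectors of the sample covariance matrix:

```python
import numpy as np

# Illustrative sketch of linear PCA as defined in Section 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                 # zero-mean data, as the text assumes
N = X.shape[0]

S = X.T @ X / N                        # sample covariance S = N^{-1} X^T X
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]      # leading eigenvectors are the principal axes
V_pca = eigvecs[:, order[:2]]

Z = X @ V_pca                          # projections retaining maximum variance
```

The variance of the first projected coordinate equals the leading eigenvalue, which is why these projections retain maximum variance among all linear projections.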
2 Kernel PCA

Although PCA is conventionally defined (as above) in terms of the covariance, or outer-product, matrix, it is well established that the eigenvectors of $\mathbf{X}^{\rm T}\mathbf{X}$ can be obtained from those of the inner-product matrix $\mathbf{X}\mathbf{X}^{\rm T}$. If $\mathbf{V}$ is an orthogonal matrix of column eigenvectors of $\mathbf{X}\mathbf{X}^{\rm T}$ with corresponding eigenvalues in the diagonal matrix $\boldsymbol{\Lambda}$, then by definition $(\mathbf{X}\mathbf{X}^{\rm T})\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}$. Pre-multiplying by $\mathbf{X}^{\rm T}$ gives:

$$(\mathbf{X}^{\rm T}\mathbf{X})(\mathbf{X}^{\rm T}\mathbf{V}) = (\mathbf{X}^{\rm T}\mathbf{V})\boldsymbol{\Lambda}. \quad (1)$$

From inspection, it can be seen that the eigenvectors of $\mathbf{X}^{\rm T}\mathbf{X}$ are $\mathbf{X}^{\rm T}\mathbf{V}$, with eigenvalues $\boldsymbol{\Lambda}$. Note, however, that the column vectors $\mathbf{X}^{\rm T}\mathbf{V}$ are not normalised, since for column $i$, $\mathbf{u}_i^{\rm T}\mathbf{X}\mathbf{X}^{\rm T}\mathbf{u}_i = \lambda_i\mathbf{u}_i^{\rm T}\mathbf{u}_i = \lambda_i$, so the correctly normalised eigenvectors of $\mathbf{X}^{\rm T}\mathbf{X}$, and thus the principal axes of the data, are given by $\mathbf{V}_{\rm pca} = \mathbf{X}^{\rm T}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}$.

This derivation is useful if $d > N$, when the dimensionality of $\mathbf{x}$ is greater than the number of examples, but it is also fundamental for implementing kernel PCA. In kernel PCA, the data vectors $\mathbf{x}_n$ are implicitly mapped into a feature space by a set of functions $\{\phi\}: \mathbf{x}_n \rightarrow \boldsymbol{\phi}(\mathbf{x}_n)$. Although the vectors $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$ in the feature space are generally not known explicitly, their inner products are defined by the kernel: $\boldsymbol{\phi}_m^{\rm T}\boldsymbol{\phi}_n = k(\mathbf{x}_m, \mathbf{x}_n)$. Defining $\boldsymbol{\Phi}$ as the (notional) design matrix in feature space, and exploiting the above inner-product PCA formulation, allows the eigenvectors of the covariance matrix in feature space¹, $\mathbf{S}_{\phi} = N^{-1}\sum_n \boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\rm T}$, to be specified as:

$$\mathbf{V}_{\rm kpca} = \boldsymbol{\Phi}^{\rm T}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}, \quad (2)$$

where $\mathbf{V}$, $\boldsymbol{\Lambda}$ are the eigenvectors/values of the kernel matrix $\mathbf{K}$, with $(\mathbf{K})_{mn} = k(\mathbf{x}_m, \mathbf{x}_n)$. Although we can't compute $\mathbf{V}_{\rm kpca}$, since we don't know $\boldsymbol{\Phi}$ explicitly, we can compute projections of arbitrary test vectors $\mathbf{x}_* \rightarrow \boldsymbol{\phi}_*$ onto $\mathbf{V}_{\rm kpca}$ in feature space:

$$\boldsymbol{\phi}_*^{\rm T}\mathbf{V}_{\rm kpca} = \boldsymbol{\phi}_*^{\rm T}\boldsymbol{\Phi}^{\rm T}\mathbf{V}\boldsymbol{\Lambda}^{-1/2} = \mathbf{k}_*^{\rm T}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}, \quad (3)$$

where $\mathbf{k}_*$ is the $N$-vector of inner products of $\mathbf{x}_*$ with the data in kernel space: $(\mathbf{k}_*)_n = k(\mathbf{x}_*, \mathbf{x}_n)$.
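Equations (2) and (3) translate directly into code. A short numpy sketch (our own illustration with a Gaussian kernel and synthetic data, not the paper's implementation):

```python
import numpy as np

def gaussian_kernel(A, B, r=0.25):
    # k(x, x') = exp(-||x - x'||^2 / r^2), the kernel form used in Figure 1
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r ** 2)

rng = np.random.default_rng(1)
X = 0.1 * rng.normal(size=(60, 2))       # synthetic 2-D data

K = gaussian_kernel(X, X)                # (K)_mn = k(x_m, x_n)
lam, V = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]            # leading eigenvectors/values of K
lam, V = lam[order], V[:, order]

# Projections of test points onto the first q kernel principal axes,
# phi_*^T V_kpca = k_*^T V Lambda^{-1/2}, per equation (3).
q = 3
X_test = 0.1 * rng.normal(size=(5, 2))
K_star = gaussian_kernel(X_test, X)      # rows are k_*^T for each test point
proj = K_star @ V[:, :q] / np.sqrt(lam[:q])
```

Note that projecting the training data itself gives $\mathbf{Z} = \mathbf{K}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}$, whose columns are orthogonal with squared norms $\lambda_i$, consistent with the normalisation in (2).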
We can thus compute, and plot, these projections; Figure 1 gives an example for some synthetic 3-cluster data in two dimensions.

¹Here, and in the rest of the paper, we do not 'centre' the data in feature space, although this may be achieved if desired (see [4]). In fact, we would argue that when using a Gaussian kernel, it does not necessarily make sense to do so.

[Figure 1: nine contour-plot panels; the eigenvalues shown above the panels are 0.218, 0.203, 0.191, 0.057, 0.053, 0.051, 0.047, 0.043 and 0.036.]

Figure 1: Contour plots of the first nine principal component projections evaluated over a region of input space for data from 3 Gaussian clusters (standard deviation 0.1; axis scales are shown in Figure 3), each comprising 30 vectors. A Gaussian kernel, $\exp(-\|\mathbf{x}-\mathbf{x}'\|^2/r^2)$, with width $r = 0.25$, was used. The corresponding eigenvalues are given above each projection. Note how the first three components 'pick out' the individual clusters [4].

3 Probabilistic Feature-Space PCA

Our approach to sparsifying kernel PCA is to a priori approximate the feature-space sample covariance matrix $\mathbf{S}_{\phi}$ with a sum of weighted outer products of a reduced number of feature vectors. (The basis of this technique is thus general and its application not necessarily limited to kernel PCA.) This is achieved probabilistically, by maximising the likelihood of the feature vectors under a Gaussian density model $\boldsymbol{\phi} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$, where we specify the covariance $\mathbf{C}$ by:

$$\mathbf{C} = \sigma^2\mathbf{I} + \sum_{i=1}^{N} w_i\boldsymbol{\phi}_i\boldsymbol{\phi}_i^{\rm T} = \sigma^2\mathbf{I} + \boldsymbol{\Phi}^{\rm T}\mathbf{W}\boldsymbol{\Phi}, \quad (4)$$

where $w_1 \ldots w_N$ are the adjustable weights, $\mathbf{W}$ is a matrix with those weights on the diagonal, and $\sigma^2$ is an isotropic 'noise' component common to all dimensions of feature space.
Of course, a naive maximum of the likelihood under this model is obtained with $\sigma^2 = 0$ and all $w_i = 1/N$. However, if we fix $\sigma^2$, and optimise only the weighting factors $w_i$, we will find that the maximum-likelihood estimates of many $w_i$ are zero, thus realising a sparse representation of the covariance matrix.

This probabilistic approach is motivated by the fact that if we relax the form of the model, by defining it in terms of outer products of $N$ arbitrary vectors $\mathbf{v}_i$ (rather than the fixed training vectors), i.e. $\mathbf{C} = \sigma^2\mathbf{I} + \sum_{i=1}^{N} w_i\mathbf{v}_i\mathbf{v}_i^{\rm T}$, then we realise a form of 'probabilistic PCA' [6]. That is, if $\{\mathbf{u}_i, \lambda_i\}$ are the set of eigenvectors/values of $\mathbf{S}_{\phi}$, then the likelihood under this model is maximised by $\mathbf{v}_i = \mathbf{u}_i$ and $w_i = \lambda_i - \sigma^2$, for those $i$ for which $\lambda_i > \sigma^2$. For $\lambda_i \le \sigma^2$, the most likely weights $w_i$ are zero.

3.1 Computations in feature space

We wish to maximise the likelihood under a Gaussian model with covariance given by (4). Ignoring terms independent of the weighting parameters, its log is given by:

$$\mathcal{L} = -\frac{1}{2}\left[N\log|\mathbf{C}| + \sum_{n=1}^{N}\boldsymbol{\phi}_n^{\rm T}\mathbf{C}^{-1}\boldsymbol{\phi}_n\right]. \quad (5)$$

Computing (5) requires the quantities $|\mathbf{C}|$ and $\boldsymbol{\phi}_n^{\rm T}\mathbf{C}^{-1}\boldsymbol{\phi}_n$, which for infinite-dimensional feature spaces might appear problematic. However, by judicious re-writing of the terms of interest, we are able both to compute the log-likelihood (to within a constant) and to optimise it with respect to the weights. First, we can write:

$$\log|\sigma^2\mathbf{I} + \boldsymbol{\Phi}^{\rm T}\mathbf{W}\boldsymbol{\Phi}| = D\log\sigma^2 + \log|\mathbf{W}^{-1} + \sigma^{-2}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\rm T}| + \log|\mathbf{W}|. \quad (6)$$

The potential problem of infinite dimensionality, $D$, of the feature space now enters only in the first term, which is constant if $\sigma^2$ is fixed and so does not affect maximisation. The term in $|\mathbf{W}|$ is straightforward, and the remaining term can be expressed in terms of the inner-product (kernel) matrix:

$$\mathbf{W}^{-1} + \sigma^{-2}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\rm T} = \mathbf{W}^{-1} + \sigma^{-2}\mathbf{K}, \quad (7)$$

where $\mathbf{K}$ is the kernel matrix such that $(\mathbf{K})_{mn} = k(\mathbf{x}_m, \mathbf{x}_n)$.
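The identities (6) and (7), together with the Woodbury form derived next, let the weight-dependent part of (5) be computed purely from the kernel matrix. A hedged numpy sketch (our own function and variable names; the constant $D\log\sigma^2$ term is dropped, since it does not affect maximisation):

```python
import numpy as np

def log_likelihood_terms(K, w, sigma2):
    """Weight-dependent part of the log-likelihood (5), evaluated
    entirely in terms of the kernel matrix K. (Illustrative sketch;
    the constant D log sigma^2 term of (6) is omitted.)"""
    N = K.shape[0]
    B = np.diag(1.0 / w) + K / sigma2            # W^{-1} + sigma^{-2} K, from (7)
    logdet_B = np.linalg.slogdet(B)[1]
    log_det_C = logdet_B + np.log(w).sum()       # (6), minus the D log sigma^2 term

    # phi_n^T C^{-1} phi_n for all n at once, via the Woodbury identity (8)
    B_inv_K = np.linalg.solve(B, K)              # columns are (W^{-1}+sigma^{-2}K)^{-1} k_n
    quad = np.diag(K) / sigma2 - np.einsum('ij,ij->j', K, B_inv_K) / sigma2 ** 2
    return -0.5 * (N * log_det_C + quad.sum())
```

A useful sanity check is that, for an explicit finite-dimensional feature map with $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^{\rm T}$, this agrees with evaluating the Gaussian log-likelihood directly, up to the dropped $\frac{1}{2}ND\log\sigma^2$ constant.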
For the data-dependent term in the likelihood, we can use the Woodbury matrix inversion identity to compute the quantities $\boldsymbol{\phi}_n^{\rm T}\mathbf{C}^{-1}\boldsymbol{\phi}_n$:

$$\boldsymbol{\phi}_n^{\rm T}(\sigma^2\mathbf{I} + \boldsymbol{\Phi}^{\rm T}\mathbf{W}\boldsymbol{\Phi})^{-1}\boldsymbol{\phi}_n = \boldsymbol{\phi}_n^{\rm T}\left[\sigma^{-2}\mathbf{I} - \sigma^{-4}\boldsymbol{\Phi}^{\rm T}(\mathbf{W}^{-1} + \sigma^{-2}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\rm T})^{-1}\boldsymbol{\Phi}\right]\boldsymbol{\phi}_n$$
$$= \sigma^{-2}k(\mathbf{x}_n, \mathbf{x}_n) - \sigma^{-4}\mathbf{k}_n^{\rm T}(\mathbf{W}^{-1} + \sigma^{-2}\mathbf{K})^{-1}\mathbf{k}_n, \quad (8)$$

with $\mathbf{k}_n = [k(\mathbf{x}_n, \mathbf{x}_1), k(\mathbf{x}_n, \mathbf{x}_2), \ldots, k(\mathbf{x}_n, \mathbf{x}_N)]^{\rm T}$.

3.2 Optimising the weights

To maximise the log-likelihood with respect to the $w_i$, differentiating (5) gives us:

$$\frac{\partial\mathcal{L}}{\partial w_i} = \frac{1}{2}\left(\boldsymbol{\phi}_i^{\rm T}\mathbf{C}^{-1}\boldsymbol{\Phi}^{\rm T}\boldsymbol{\Phi}\mathbf{C}^{-1}\boldsymbol{\phi}_i - N\boldsymbol{\phi}_i^{\rm T}\mathbf{C}^{-1}\boldsymbol{\phi}_i\right), \quad (9)$$
$$= \frac{1}{2w_i^2}\left(\sum_{n=1}^{N}\mu_{ni}^2 + N\Sigma_{ii} - Nw_i\right), \quad (10)$$

where $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}_n$ are defined respectively by

$$\boldsymbol{\Sigma} = (\mathbf{W}^{-1} + \sigma^{-2}\mathbf{K})^{-1},$$
$$\boldsymbol{\mu}_n = \sigma^{-2}\boldsymbol{\Sigma}\mathbf{k}_n.$$

Setting (10) to zero gives re-estimation equations for the weights:

$$w_i^{\rm new} = N^{-1}\sum_{n=1}^{N}\mu_{ni}^2 + \Sigma_{ii}.$$
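Iterating this fixed-point update can be sketched as follows (our own illustrative code, not the paper's implementation, assuming the update $w_i \leftarrow N^{-1}\sum_n \mu_{ni}^2 + \Sigma_{ii}$ with $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}_n$ as defined above):

```python
import numpy as np

def reestimate_weights(K, w, sigma2, n_iter=50):
    """Iterate the weight re-estimation obtained by setting (10) to zero.
    Weights shrinking towards zero yield the sparse covariance model.
    (A sketch under our own naming conventions.)"""
    N = K.shape[0]
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(1.0 / w) + K / sigma2)
        Mu = Sigma @ K / sigma2          # column n is mu_n = sigma^{-2} Sigma k_n
        w = (Mu ** 2).sum(axis=1) / N + np.diag(Sigma)   # sum of mu_{ni}^2 over n
    return w
```

Each update keeps every $w_i$ strictly positive (since $\Sigma_{ii} > 0$), so weights reach zero only in the limit; in practice a small threshold can be used to prune them.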