{"title": "On a Connection between Kernel PCA and Metric Multidimensional Scaling", "book": "Advances in Neural Information Processing Systems", "page_first": 675, "page_last": 681, "abstract": null, "full_text": "On a Connection between Kernel PCA \nand Metric Multidimensional Scaling \n\nChristopher K. I. WilliaIns \n\nDivision of Informatics \n\nThe University of Edinburgh \n\n5 Forrest Hill, Edinburgh EH1 2QL, UK \n\nc.k.i.williams~ed.ac.uk \n\nhttp://anc.ed.ac.uk \n\nAbstract \n\nIn this paper we show that the kernel peA algorithm of Sch6lkopf \net al (1998) can be interpreted as a form of metric multidimensional \nscaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it \ndepends only on Ilx - yll. This leads to a metric MDS algorithm \nwhere the desired configuration of points is found via the solution \nof an eigenproblem rather than through the iterative optimization \nof the stress objective function. The question of kernel choice is \nalso discussed. \n\n1 \n\nIntroduction \n\nSuppose we are given n objects, and for each pair (i,j) we have a measurement \nof the \"dissimilarity\" Oij between the two objects. \nIn multidimensional scaling \n(MDS) the aim is to place n points in a low dimensional space (usually Euclidean) \nso that the interpoint distances dij have a particular relationship to the original \ndissimilarities. In classical scaling we would like the interpoint distances to be equal \nto the dissimilarities. For example, classical scaling can be used to reconstruct a \nmap of the locations of some cities given the distances between them. \nIn metric MDS the relationship is of the form dij ~ f(Oij) where f is a specific \nfunction. In this paper we show that the kernel peA algorithm of Sch6lkopf et al \n[7] can be interpreted as performing metric MDS if the kernel function is isotropic. \nThis is achieved by performing classical scaling in the feature space defined by the \nkernel. 
\nThe structure of the remainder of this paper is as follows: In section 2 classical and \nmetric MDS are reviewed, and in section 3 the kernel peA algorithm is described. \nThe link between the two methods is made in section 4. Section 5 describes ap(cid:173)\nproaches to choosing the kernel function, and we finish with a brief discussion in \nsection 6. \n\n\f2 Classical and metric MDS \n\n2.1 Classical scaling \n\nGiven n objects and the corresponding dissimilarity matrix, classical scaling is an \nalgebraic method for finding a set of points in space so that the dissimilarities are \nwell-approximated by the interpoint distances. The classical scaling algorithm is \nintroduced below by starting with the locations of n points, constructing a dis(cid:173)\nsimilarity matrix based on their Euclidean distances, and then showing how the \nconfiguration of the points can be reconstructed (as far as possible) from the dis(cid:173)\nsimilarity matrix. \nLet the coordinates of n points in p dimensions be denoted by Xi, i = 1, ... ,n. These \ncan be collected together in a n x p matrix X . The dissimilarities are calculated \nby 8;j = (Xi - Xj)T(Xi - Xj). Given these dissimilarities, we construct the matrix \nA such that aij = -! 8;j' and then set B = H AH, where H is the centering \nmatrix H = In - ~l1T . With 8;j = (Xi - Xj)T(Xi - Xj), the construction of B \nleads to bij = (Xi - xF(xj - x), where x = ~ L~=l Xi. In matrix form we have \nB = (HX)(HX)T, and B is real, symmetric and positive semi-definite. Let the \neigendecomposition of B be B = V A V T , where A is a diagonal matrix and V is a \nmatrix whose columns are the eigenvectors of B. If p < n, there will be n - p zero \neigenvaluesl . If the eigenvalues are ordered Al ~ A2 ~ ... ~ An ~ 0, then B = \nVpApVpT, where Ap = diag(Al, ... ,Ap) and Vp is the n x p matrix whose columns \ncorrespond to the first p eigenvectors of B, with the usual normalization so that \nthe eigenvectors have unit length. 
The matrix X̂ of the reconstructed coordinates
of the points can be obtained as X̂ = V_p Λ_p^{1/2}, with B = X̂ X̂^T. Clearly from the
information in the dissimilarities one can only recover the original coordinates up
to a translation, a rotation and reflections of the axes; the solution obtained for X̂
is such that the origin is at the mean of the n points, and the axes chosen by
the procedure are the principal axes of the X̂ configuration.

It may not be necessary to use all p dimensions to obtain a reasonable
approximation; a configuration X̂ in k dimensions can be obtained by using the largest k
eigenvalues, so that X̂ = V_k Λ_k^{1/2}. These are known as the principal coordinates of X
in k dimensions. The fraction of the variance explained by the first k eigenvalues is
Σ_{i=1}^k λ_i / Σ_{i=1}^p λ_i.

Classical scaling as explained above works on Euclidean distances as the dissimilarities.
However, one can run the same algorithm with a non-Euclidean dissimilarity
matrix, although in this case there is no guarantee that the eigenvalues will be
non-negative.

Classical scaling derives from the work of Schoenberg and of Young and Householder
in the 1930s. Expositions of the theory can be found in [5] and [2].

2.1.1 Optimality properties of classical scaling

Mardia et al [5] (section 14.4) give the following optimality property of the classical
scaling solution.

^1 In fact if the points are not in \"general position\" the number of zero eigenvalues will
be greater than n - p. Below we assume that the points are in general position, although
the arguments can easily be carried through with minor modifications if this is not the
case.

Theorem 1 Let X denote a configuration of points in R^p, with interpoint distances
δ_ij^2 = (x_i - x_j)^T (x_i - x_j). Let L be a p x p rotation matrix and set L = (L_1, L_2),
where L_1 is p x k for k < p.
Let X̂ = X L_1, the projection of X onto a k-dimensional
subspace of R^p, and let d_ij^2 = (x̂_i - x̂_j)^T (x̂_i - x̂_j). Amongst all projections X̂ = X L_1,
the quantity φ = Σ_{i,j} (δ_ij^2 - d_ij^2) is minimized when X is projected onto its principal
coordinates in k dimensions. For all i, j we have d_ij ≤ δ_ij. The value of φ for the
principal coordinate projection is φ = 2n(λ_{k+1} + ... + λ_p).

2.2 Relationships between classical scaling and PCA

There is a well-known relationship between PCA and classical scaling; see e.g. Cox
and Cox (1994) section 2.2.7.

Principal components analysis (PCA) is concerned with the eigendecomposition of
the sample covariance matrix S = (1/n) X^T H X. It is easy to show that the eigenvalues
of nS are the p non-zero eigenvalues of B. To see this note that H^2 = H and
thus that nS = (HX)^T (HX). Let v_i be a unit-length eigenvector of B so that
B v_i = λ_i v_i. Premultiplying by (HX)^T yields

(HX)^T (HX) (HX)^T v_i = λ_i (HX)^T v_i,                                  (1)

so we see that λ_i is an eigenvalue of nS. y_i = (HX)^T v_i is the corresponding
eigenvector; note that y_i^T y_i = λ_i. Centering X and projecting onto the unit vector
ŷ_i = λ_i^{-1/2} y_i we obtain

HX ŷ_i = λ_i^{-1/2} HX (HX)^T v_i = λ_i^{1/2} v_i.                       (2)

Thus we see that the projection of X onto the eigenvectors of nS returns the classical
scaling solution.

2.3 Metric MDS

The aim of classical scaling is to find a configuration of points X̂ so that the
interpoint distances d_ij well approximate the dissimilarities δ_ij. In metric MDS this
criterion is relaxed, so that instead we require

d_ij ≈ f(δ_ij),                                                           (3)

where f is a specified (analytic) function. For this definition see, e.g. Kruskal and
Wish [4] (page 22), where polynomial transformations are suggested.

A straightforward way to carry out metric MDS is to define an error function (or
stress)

S = Σ_{i<j} w_ij (d_ij - f(δ_ij))^2,                                      (4)

where the {w_ij} are appropriately chosen weights.
One can then obtain derivatives
of S with respect to the coordinates of the points that define the d_ij's and
use gradient-based (or more sophisticated) methods to minimize the stress. This
method is known as least-squares scaling. An early reference to this kind of method
is Sammon (1969) [6], where w_ij = 1/δ_ij and f is the identity function.

Note that if f(δ_ij) has some adjustable parameters θ and is linear with respect to θ,^2
then the function f can also be adapted, and the optimal value for those parameters
given the current d_ij's can be obtained by (weighted) least-squares regression.

^2 f can still be a non-linear function of its argument.

Critchley (1978) [3] (also mentioned in section 2.4.2 of Cox and Cox) carried out
metric MDS by running the classical scaling algorithm on the transformed
dissimilarities. Critchley suggests the power transformation f(δ_ij) = δ_ij^μ (for μ > 0). If
the dissimilarities are derived from Euclidean distances, we note that the kernel
k(x, y) = -||x - y||^β is conditionally positive definite (CPD) if β ≤ 2 [1]. When the
kernel is CPD, the centered matrix will be positive semi-definite. Critchley's use of the
classical scaling algorithm is similar to the algorithm discussed below, but crucially
the kernel PCA method ensures that the matrix B derived from the transformed
dissimilarities is non-negative definite, while this is not guaranteed by Critchley's
transformation for arbitrary μ.

A further member of the MDS family is nonmetric MDS (NMDS), also known as
ordinal scaling. Here it is only the relative rank ordering between the d's and the δ's
that is taken to be important; this constraint can be imposed by demanding that
the function f in equation 3 is monotonic. This constraint makes sense for some
kinds of dissimilarity data (e.g. from psychology) where only the rank orderings
have real meaning.
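The classical scaling recipe of section 2.1, and Critchley's variant of metric MDS (classical scaling run on power-transformed dissimilarities), can be sketched in a few lines. This is a minimal illustrative sketch, not the original implementation; the function names are mine and NumPy is assumed:

```python
import numpy as np

def classical_scaling(delta, k):
    # delta: (n, n) matrix of dissimilarities delta_ij
    n = delta.shape[0]
    A = -0.5 * delta ** 2                    # a_ij = -(1/2) delta_ij^2
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I_n - (1/n) 1 1^T
    B = H @ A @ H                            # doubly centred matrix B
    lam, V = np.linalg.eigh(B)               # eigh returns eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]           # reorder so lambda_1 >= lambda_2 >= ...
    # principal coordinates in k dimensions: X_hat = V_k Lambda_k^{1/2}
    return V[:, :k] * np.sqrt(np.maximum(lam[:k], 0.0))

def metric_mds_power(delta, k, mu=0.5):
    # Critchley-style metric MDS: classical scaling on delta_ij^mu, mu > 0
    return classical_scaling(delta ** mu, k)
```

For Euclidean input dissimilarities the configuration returned by classical_scaling reproduces the original interpoint distances exactly (up to translation, rotation and reflection), as section 2.1 describes.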
\n\n3 Kernel PCA \n\nIn recent years there has been an explosion of work on kernel methods. For super(cid:173)\nvised learning these include support vector machines [8], Gaussian process predic(cid:173)\ntion (see, e.g. [10]) and spline methods [9]. The basic idea of these methods is to use \nthe \"kernel trick\". A point x in the original space is re-represented as a point \u00a2(x) \nin a Np-dimensional feature space3 F, where \u00a2(x) = (\u00a21(X),\u00a22(X), ... ,\u00a2NF(X)). \nWe can think of each function \u00a2j(-) as a non-linear mapping. The key to the kernel \ntrick is to realize that for many algorithms, the only quantities required are of the \nform 4 \u00a2(Xi).\u00a2(Xj) and thus if these can be easily computed by a non-linear function \nk(Xi,Xj) = \u00a2(Xi).\u00a2(Xj) we can save much time and effort. \nSch6lkopf, Smola and Miiller [7] used this trick to define kernel peA. One could \ncompute the covariance matrix in the feature space and then calculate its eigen(cid:173)\nvectors/eigenvalues. However, using the relationship between B and the sample \ncovariance matrix S described above, we can instead consider the n x n matrix K \nwith entries Kij = k(Xi,Xj) for i,j = 1, .. . ,no If Np > n using K will be more \nefficient than working with the covariance matrix in feature space and anyway the \nlatter would be singular. \nThe data should be centered in the feature space so that L~=l \u00a2(Xi) = o. This \nis achieved by carrying out the eigendecomposition of K = H K H which gives the \ncoordinates of the approximating points as described in section 2.2. Thus we see \nthat the visualization of data by projecting it onto the first k eigenvectors is exactly \nclassical scaling in feature space. \n\n4 A relationship between kernel PCA and metric MDS \n\nWe consider two cases. In section 4.1 we deal with the case that the kernel is \nisotropic and obtain a close relationship between kernel PCA and metric MDS. 
If
the kernel is non-stationary a rather less close relationship is derived in section 4.2.

^3 For some kernels N_F = ∞.
^4 We denote the inner product of two vectors as either a·b or a^T b.

4.1 Isotropic kernels

A kernel function is stationary if k(x_i, x_j) depends only on the vector τ = x_i - x_j. A
stationary covariance function is isotropic if k(x_i, x_j) depends only on the distance
δ_ij with δ_ij^2 = τ·τ, so that we write k(x_i, x_j) = r(δ_ij). Assume that the kernel is
scaled so that r(0) = 1. An example of an isotropic kernel is the squared exponential
or RBF (radial basis function) kernel k(x_i, x_j) = exp{-θ (x_i - x_j)^T (x_i - x_j)}, for
some parameter θ > 0.

Consider the Euclidean distance in feature space δ̃_ij^2 = (φ(x_i) - φ(x_j))^T (φ(x_i) -
φ(x_j)). With an isotropic kernel this can be re-expressed as δ̃_ij^2 = 2(1 - r(δ_ij)).
Thus the matrix A has elements a_ij = r(δ_ij) - 1, which can be written as A =
K - 1 1^T. It can be easily verified that the centering matrix H annihilates 1 1^T, so
that HAH = HKH.

We see that the configuration of points derived from performing classical scaling
on K actually aims to approximate the feature-space distances computed as δ̃_ij =
(2(1 - r(δ_ij)))^{1/2}. As the δ̃_ij's are a non-linear function of the δ_ij's, this procedure
(kernel MDS) is an example of metric MDS.

Remark 1 Kernel functions are usually chosen to be conditionally positive definite,
so that the eigenvalues of the matrix K̃ will be non-negative. Choosing arbitrary
functions to transform the dissimilarities will not give this guarantee.

Remark 2 In nonmetric MDS we require that d_ij ≈ f(δ_ij) for some monotonic
function f. If the kernel function r is monotonically decreasing then clearly 1 - r
is monotonically increasing. However, there are valid isotropic kernel (covariance)
functions which are non-monotonic (e.g.
the exponentially damped cosine r(δ) = e^{-αδ} cos(ωδ);
see [11] for details), and thus we see that f need not be monotonic in
kernel MDS.

Remark 3 One advantage of PCA is that it defines a mapping from the original
space to the principal coordinates, and hence that if a new point x arrives, its
projection onto the principal coordinates defined by the original n data points can be
computed^5. The same property holds in kernel PCA, so that the projection
of φ(x) onto the rth principal direction in feature space can be computed
using the kernel trick as Σ_{i=1}^n α_i^r k(x, x_i), where α^r is the rth eigenvector of K̃ (see
equation 4.1 in [7]). This projection property does not hold for algorithms that
simply minimize the stress objective function; for example the Sammon \"mapping\"
algorithm [6] does not in fact define a mapping.

4.2 Non-stationary kernels

Sometimes non-stationary kernels (e.g. k(x_i, x_j) = (1 + x_i·x_j)^m for integer m)
are used. For non-stationary kernels we proceed as before and construct δ̃_ij^2 =
(φ(x_i) - φ(x_j))^T (φ(x_i) - φ(x_j)). We can again show that the kernel MDS procedure
operates on the matrix HKH. However, the distance δ̃_ij in feature space is not a
function of δ_ij, and so the relationship of equation 3 does not hold. The situation can
be saved somewhat if we follow Mardia et al (section 14.2.3) and relate similarities

^5 Note that this will be, in general, different to the solution found by doing PCA on the
full data set of n + 1 points.

Figure 1: The plot shows γ as a function of k for various values of β = θ/256 for
the USPS test set.
\n\nto dissimilarities through Jlj = Cii + Cjj - 2Cij, where Cij denotes the similarity \nbetween items i and j in feature space. Then we see that the similarity in feature \nspace is given by Cij = \u00a2(Xi).\u00a2(Xj) = k(Xi' Xj). For kernels (such as polynomial \nkernels) that are functions of Xi.Xj (the similarity in input space), we see then that \nthe similarity in feature space is a non-linear function of the similarity measured in \ninput space. \n\n5 Choice of kernel \n\nHaving performed kernel MDS one can plot the scatter diagram (or Shepard dia(cid:173)\ngram) of the dissimilarities against the fitted distances. We know that for each pair \nthe fitted distance d ij ::; Jij because of the projection property in feature space. The \nsum of the residuals is given by 2n E~=k+l Ai where the {Ai} are the eigenvalues of \nk = H K H. (See Theorem 1 above and recall that at most n of the eigenvalues of \nthe covariance matrix in feature space will be non-zero.) Hence the fraction of the \nsum-squared distance explained by the first k dimensions is 'Y = E:=1 Ad E~=1 Ai. \nOne idea for choosing the kernel would be to fix the dimensionality k and choose \nr(\u00b7) so that 'Y is maximized. Consider the effect of varying () in the RBF kernel \n\nk(Xi , Xj) =exp{-()(xi-xjf(Xi-Xj)}. \n\n(5) \nAs () -+ 00 we have Jlj = 2(1- c5(i,j)) (where c5(i,j) is the Kronecker delta), which \nare the distances corresponding to a regular simplex. Thus K -+ In, H K H = H \nand'Y = k/(n -1). Letting () -+ 0 and using e-oz ~ 1- ()z for small (), we can show \nthat Kij = 1 - ()c5lj as () -+ 0, and thus that the classical scaling solution is obtained \nin this limit. \n\nExperiments have been run on the US Postal Service database of handwritten digits, \nas used in [7]. The test set of 2007 images was used. The size of each image is 16 x 16 \npixels, with the intensity of the pixels scaled so that the average variance over all 256 \ndimensions is 0.5. 
In Figure 1, γ is plotted against k for various values of β = θ/256.
By choosing an index k one can observe from Figure 1 what fraction of the variance
is explained by the first k eigenvalues. The trend is that as θ decreases more and
more variance is explained by fewer components, which fits in with the idea above
that the θ → ∞ limit gives rise to the regular simplex case. Thus there does not
seem to be a non-trivial value of θ which minimizes the residuals.

6 Discussion

The results above show that kernel PCA using an isotropic kernel function can be
interpreted as performing a kind of metric MDS. The main difference between the
kernel MDS algorithm and other metric MDS algorithms is that kernel MDS uses
the classical scaling solution in feature space. The advantage of the classical scaling
solution is that it is computed from an eigenproblem, and avoids the iterative
optimization of the stress objective function that is used for most other MDS
solutions. The classical scaling solution is unique up to the unavoidable translation,
rotation and reflection symmetries (assuming that there are no repeated eigenvalues).
Critchley's work (1978) is somewhat similar to kernel MDS, but it lacks the
notion of a projection into feature space and does not always ensure that the matrix
B is non-negative definite.

We have also looked at the question of adapting the kernel so as to minimize the sum
of the residuals. However, for the case investigated this leads to a trivial solution.

Acknowledgements

I thank David Willshaw, Matthias Seeger and Amos Storkey for helpful conversations, and
the anonymous referees whose comments have helped improve the paper.

References

[1] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups.
Springer-Verlag, New York, 1984.

[2] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman and Hall, London,
1994.
\n\n[3] F. Critchley. Multidimensionsal scaling: a short critique and a new method. In L. C. A \nCorsten and J. Hermans, editors, COMPSTAT 1978. Physica-Verlag, Vienna, 1978. \n[4] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, Beverly \n\nHills, 1978. \n\n[5] Mardia, K V. and Kent, J. T. and Bibby, J. M. Multivariate Analysis. Academic \n\nPress, 1979. \n\n[6] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Trans. on \n\nComputers, 18:401-409, 1969. \n\n[7] B. Scholkopf, A. Smola, and K-R. Muller. Nonlinear component analysis as a kernel \n\neigenvalue problem. Neural Computation, 10:1299- 1319, 1998. \n\n[8] V. N. Vapnik. The nature of statistical learning theory. Springer Verlag, New York, \n\n1995. \n\n[9] G. Wahba. Spline models for observational data. Society for Industrial and Applied \nMathematics, Philadelphia, PA, 1990. CBMS-NSF Regional Conference series in \napplied mathematics. \n\n[10] C. K I. Williams and D . Barber. Bayesian classification with Gaussian processes. \nIEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342- 1351, \n1998. \n\n[11] A. M. Yaglom. Correlation Theory of Stationary and Related Random Functions \n\nVolume I:Basic Results. Springer Verlag, 1987. \n\n\f", "award": [], "sourceid": 1873, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}]}