{"title": "The Kernel Trick for Distances", "book": "Advances in Neural Information Processing Systems", "page_first": 301, "page_last": 307, "abstract": null, "full_text": "The Kernel Trick for Distances \n\nBernhard SchOikopf \nMicrosoft Research \n1 Guildhall Street \nCambridge, UK \n\nbs@kyb.tuebingen.mpg.de \n\nAbstract \n\nA method is described which, like the kernel trick in support vector ma(cid:173)\nchines (SVMs), lets us generalize distance-based algorithms to operate \nin feature spaces, usually nonlinearly related to the input space. This \nis done by identifying a class of kernels which can be represented as \nnorm-based distances in Hilbert spaces. It turns out that common kernel \nalgorithms, such as SVMs and kernel PCA, are actually really distance \nbased algorithms and can be run with that class of kernels, too. \nAs well as providing a useful new insight into how these algorithms \nwork, the present work can form the basis for conceiving new algorithms. \n\n1 Introduction \n\nOne of the crucial ingredients of SVMs is the so-called kernel trick for the computation of \ndot products in high-dimensional feature spaces using simple functions defined on pairs of \ninput patterns. This trick allows the formulation of nonlinear variants of any algorithm that \ncan be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. \nAlthough the mathematical result underlying the kernel trick is almost a century old [6], it \nwas only much later [1, 3,13] that it was made fruitful for the machine learning community. \nKernel methods have since led to interesting generalizations of learning algorithms and to \nsuccessful real-world applications. The present paper attempts to extend the utility of the \nkernel trick by looking at the problem of which kernels can be used to compute distances \nin feature spaces. 
Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15].

Let us consider training data (x_1, y_1), ..., (x_m, y_m) ∈ X × Y. Here, Y is the set of possible outputs (e.g., in pattern recognition, {±1}), and X is some nonempty set (the domain) that the patterns are taken from. We are interested in predicting the outputs y for previously unseen patterns x. This is only possible if we have some measure that tells us how (x, y) is related to the training examples. For many problems, the following approach works: informally, we want similar inputs to lead to similar outputs. To formalize this, we have to state what we mean by similar. On the outputs, similarity is usually measured in terms of a loss function. For instance, in the case of pattern recognition, the situation is simple: two outputs can either be identical or different. On the inputs, the notion of similarity is more complex. It hinges on a representation of the patterns and a suitable similarity measure operating on that representation.

One particularly simple yet surprisingly useful notion of (dis)similarity (the one we will use in this paper) derives from embedding the data into a Euclidean space and utilizing geometrical concepts. For instance, in SVMs, similarity is measured by dot products (i.e., angles and lengths) in some high-dimensional feature space F. Formally, the patterns are first mapped into F using φ : X → F, x ↦ φ(x), and then compared using a dot product ⟨φ(x), φ(x')⟩. To avoid working in the potentially high-dimensional space F, one tries to pick a feature space in which the dot product can be evaluated directly using a nonlinear function in input space, i.e., by means of the kernel trick

k(x, x') = ⟨φ(x), φ(x')⟩.
(1)

Often, one simply chooses a kernel k with the property that there exists some φ such that the above holds true, without necessarily worrying about the actual form of φ: already the existence of the linear space F facilitates a number of algorithmic and theoretical issues. It is well established that (1) works out for Mercer kernels [3, 13], or, equivalently, positive definite kernels [2, 14]. Here and below, indices i and j by default run over 1, ..., m.

Definition 1 (Positive definite kernel) A symmetric function k : X × X → ℝ which for all m ∈ ℕ, x_i ∈ X gives rise to a positive definite Gram matrix, i.e., for which for all c_i ∈ ℝ we have

Σ_{i,j=1}^m c_i c_j K_{ij} ≥ 0, where K_{ij} := k(x_i, x_j), (2)

is called a positive definite (pd) kernel.

One particularly intuitive way to construct a feature map satisfying (1) for such a kernel k proceeds, in a nutshell, as follows (for details, see [2]):

1. Define a feature map

φ : X → ℝ^X, x ↦ k(·, x). (3)

Here, ℝ^X denotes the space of functions mapping X into ℝ.

2. Turn it into a linear space by forming linear combinations

f(·) = Σ_{i=1}^m α_i k(·, x_i), g(·) = Σ_{j=1}^{m'} β_j k(·, x_j'). (4)

3. Endow it with a dot product ⟨f, g⟩ := Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x_j'), and turn it into a Hilbert space H_k by completing it in the corresponding norm.

Note that in particular, by definition of the dot product, ⟨k(·, x), k(·, x')⟩ = k(x, x'); hence, in view of (3), we have k(x, x') = ⟨φ(x), φ(x')⟩, the kernel trick. This shows that pd kernels can be thought of as (nonlinear) generalizations of one of the simplest similarity measures, the canonical dot product ⟨x, x'⟩, x, x' ∈ ℝ^N. The question arises as to whether there also exist generalizations of the simplest dissimilarity measure, the distance ‖x − x'‖².
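Definition 1 can be checked numerically on sample points (a sketch, not part of the paper; the Gaussian kernel and the tolerance are illustrative choices): the Gram matrix of a pd kernel has no negative eigenvalues, so every quadratic form Σ c_i c_j K_ij is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 sample points in R^3

# Gaussian kernel k(x, x') = exp(-||x - x'||^2), a standard pd kernel.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# Positive definiteness (Definition 1): all eigenvalues of the Gram matrix
# are >= 0, hence c^T K c >= 0 for every coefficient vector c.
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10

c = rng.normal(size=20)
assert c @ K @ c >= -1e-10
```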
Clearly, the distance ‖φ(x) − φ(x')‖² in the feature space associated with a pd kernel k can be computed using the kernel trick (1) as k(x, x) + k(x', x') − 2k(x, x'). Positive definite kernels are, however, not the full story: there exists a larger class of kernels that can be used as generalized distances, and the following section will describe why.

2 Kernels as Generalized Distance Measures

Let us start by considering how a dot product and the corresponding distance measure are affected by a translation of the data, x ↦ x − x_0. Clearly, ‖x − x'‖² is translation invariant, while ⟨x, x'⟩ is not. A short calculation shows that the effect of the translation can be expressed in terms of ‖· − ·‖² as

⟨(x − x_0), (x' − x_0)⟩ = ½ (−‖x − x'‖² + ‖x − x_0‖² + ‖x_0 − x'‖²). (5)

Note that this is, just like ⟨x, x'⟩, still a pd kernel: Σ_{i,j} c_i c_j ⟨(x_i − x_0), (x_j − x_0)⟩ = ‖Σ_i c_i (x_i − x_0)‖² ≥ 0. For any choice of x_0 ∈ X, we thus get a similarity measure (5) associated with the dissimilarity measure ‖x − x'‖.

This naturally leads to the question whether (5) might suggest a connection that holds true also in more general cases: what kind of nonlinear dissimilarity measure do we have to substitute instead of ‖· − ·‖² on the right hand side of (5) to ensure that the left hand side becomes positive definite? The answer is given by a known result. To state it, we first need to define the appropriate class of kernels.

Definition 2 (Conditionally positive definite kernel) A symmetric function k : X × X → ℝ which satisfies (2) for all m ∈ ℕ, x_i ∈ X and for all c_i ∈ ℝ with

Σ_{i=1}^m c_i = 0, (6)

is called a conditionally positive definite (cpd) kernel.

Proposition 3 (Connection pd - cpd [2]) Let x_0 ∈ X, and let k be a symmetric kernel on X × X. Then

k̃(x, x') := ½ (k(x, x') − k(x, x_0) − k(x_0, x') + k(x_0, x_0)) (7)

is positive definite if and only if k is conditionally positive definite.

The proof follows directly from the definitions and can be found in [2].

This result does generalize (5): the negative squared distance kernel is indeed cpd, for Σ_i c_i = 0 implies −Σ_{i,j} c_i c_j ‖x_i − x_j‖² = −Σ_i c_i Σ_j c_j ‖x_j‖² − Σ_j c_j Σ_i c_i ‖x_i‖² + 2 Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = 2 Σ_{i,j} c_i c_j ⟨x_i, x_j⟩ = 2 ‖Σ_i c_i x_i‖² ≥ 0. In fact, this implies that all kernels of the form

k(x, x') = −‖x − x'‖^β, 0 < β ≤ 2 (8)

are cpd (they are not pd), by application of the following result:

Proposition 4 ([2]) If k : X × X → (−∞, 0] is cpd, then so are −(−k)^α (0 < α < 1) and −log(1 − k).

To state another class of cpd kernels that are not pd, note first that, as trivial consequences of Definition 2, (i) sums of cpd kernels are cpd, and (ii) any constant b ∈ ℝ is cpd. Therefore, any kernel of the form k + b, where k is cpd and b ∈ ℝ, is also cpd. In particular, since pd kernels are cpd, we can take any pd kernel and offset it by b, and it will still be at least cpd. For further examples of cpd kernels, cf. [2, 14, 4, 11].

We now return to the main flow of the argument. Proposition 3 allows us to construct the feature map for k from that of the pd kernel k̃. To this end, fix x_0 ∈ X and define k̃ according to (7). Due to Proposition 3, k̃ is positive definite. Therefore, we may employ the Hilbert space representation φ : X → H of k̃ (cf. (1)), satisfying ⟨φ(x), φ(x')⟩ = k̃(x, x'); hence

‖φ(x) − φ(x')‖² = ⟨φ(x) − φ(x'), φ(x) − φ(x')⟩ = k̃(x, x) + k̃(x', x') − 2k̃(x, x'). (9)

Substituting (7) yields

‖φ(x) − φ(x')‖² = −k(x, x') + ½ (k(x, x) + k(x', x')). (10)

We thus have proven the following result.
Proposition 5 (Hilbert space representation of cpd kernels [7, 2]) Let k be a real-valued conditionally positive definite kernel on X, satisfying k(x, x) = 0 for all x ∈ X. Then there exists a Hilbert space H of real-valued functions on X, and a mapping φ : X → H, such that

‖φ(x) − φ(x')‖² = −k(x, x'). (11)

If we drop the assumption k(x, x) = 0, the Hilbert space representation reads

‖φ(x) − φ(x')‖² = −k(x, x') + ½ (k(x, x) + k(x', x')). (12)

It can be shown that if k(x, x) = 0 for all x ∈ X, then d(x, x') := √(−k(x, x')) = ‖φ(x) − φ(x')‖ is a semi-metric; it is a metric if k(x, x') ≠ 0 for x ≠ x' [2].

We next show how to represent general symmetric kernels (thus in particular cpd kernels) as symmetric bilinear forms Q in feature spaces. This generalization of the previously known feature space representation for pd kernels comes at a cost: Q will no longer be a dot product. For our purposes, we can get away with this. The result will give us an intuitive understanding of Proposition 3: we can then write k̃ as k̃(x, x') := Q(φ(x) − φ(x_0), φ(x') − φ(x_0)). Proposition 3 thus essentially adds an origin in feature space which corresponds to the image φ(x_0) of one point x_0 under the feature map. For translation invariant algorithms, we are always allowed to do this, and thus turn a cpd kernel into a pd one; in this sense, cpd kernels are \"as good as\" pd kernels.

Proposition 6 (Vector space representation of symmetric kernels) Let k be a real-valued symmetric kernel on X. Then there exists a linear space H of real-valued functions on X, endowed with a symmetric bilinear form Q(·, ·), and a mapping φ : X → H, such that

k(x, x') = Q(φ(x), φ(x')). (13)

Proof The proof is a direct modification of the pd case. We use the map (3) and linearly complete the image as in (4). Define Q(f, g) := Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x_j').
To see that it is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that Q(f, g) = Σ_{j=1}^{m'} β_j f(x_j'), independent of the α_i. Similarly, for g, note that Q(f, g) = Σ_i α_i g(x_i); hence it is independent of the β_j. The last two equations also show that Q is bilinear; clearly, it is symmetric. ∎

Note, moreover, that by definition of Q, k is a reproducing kernel for the feature space (which is not a Hilbert space): for all functions f (4), we have Q(k(·, x), f) = f(x); in particular, Q(k(·, x), k(·, x')) = k(x, x').

Rewriting k̃ as k̃(x, x') := Q(φ(x) − φ(x_0), φ(x') − φ(x_0)) suggests an immediate generalization of Proposition 3: in practice, we might want to choose other points as origins in feature space, points that do not have a preimage x_0 in input space, such as (usually) the mean of a set of points (cf. [12]). This will be useful when considering kernel PCA. Crucial is only that our reference point's behaviour under translations is identical to that of individual points. This is taken care of by the constraint on the sum of the c_i in the following proposition. The asterisk denotes the complex conjugate transpose.

Proposition 7 (Exercise 2.23, [2]) Let K be a symmetric matrix, e ∈ ℝ^m be the vector of all ones, I the m × m identity matrix, and let c ∈ ℂ^m satisfy e*c = 1. Then

K̃ := (I − ec*) K (I − ce*) (14)

is positive definite if and only if K is conditionally positive definite.

Proof \"⇒\": suppose K̃ is positive definite, i.e., for any a ∈ ℂ^m, we have

0 ≤ a* K̃ a = a* K a + a* e c* K c e* a − a* K c e* a − a* e c* K a. (15)

In the case a*e = e*a = 0 (cf. (6)), the three last terms vanish, i.e., 0 ≤ a* K a, proving that K is conditionally positive definite.

\"⇐\": suppose K is conditionally positive definite.
The map (I − ce*) has its range in the orthogonal complement of e, which can be seen by computing, for any a ∈ ℂ^m,

e* (I − ce*) a = e*a − e*c e*a = 0. (16)

Moreover, being symmetric and satisfying (I − ce*)² = (I − ce*), the map (I − ce*) is a projection. Thus K̃ is the restriction of K to the orthogonal complement of e, and by definition of conditional positive definiteness, that is precisely the space where K is positive definite. ∎

This result directly implies a corresponding generalization of Proposition 3:

Proposition 8 (Adding a general origin) Let k be a symmetric kernel, x_1, ..., x_m ∈ X, and let c_i ∈ ℂ satisfy Σ_{i=1}^m c_i = 1. Then

k̃(x, x') := ½ (k(x, x') − Σ_i c_i k(x, x_i) − Σ_i c_i k(x_i, x') + Σ_{i,j} c_i c_j k(x_i, x_j)) (17)

is positive definite if and only if k is conditionally positive definite.

Proof Consider a set of points x_1', ..., x_{m'}', m' ∈ ℕ, x_i' ∈ X, and let K be the (m + m') × (m + m') Gram matrix based on x_1, ..., x_m, x_1', ..., x_{m'}'. Apply Proposition 7 using c_{m+1} = ... = c_{m+m'} = 0. ∎

Example 9 (SVMs and kernel PCA) (i) The above results show that conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the SVM: maximization of the margin of separation between two classes of data is independent of the origin's position. Seen in this light, it is not surprising that the structure of the dual optimization problem (cf. [13]) allows cpd kernels: as noticed in [11, 10], the constraint Σ_{i=1}^m α_i y_i = 0 projects out the same subspace as (6) in the definition of conditionally positive definite kernels.

(ii) Another example of a kernel algorithm that works with conditionally positive definite kernels is kernel PCA [9], where the data is centered, thus removing the dependence on the origin in feature space. Formally, this follows from Proposition 7 for c_i = 1/m.
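The special case c_i = 1/m of Proposition 7 is the familiar double-centering operation. A numerical sketch (not from the paper; the cpd kernel −‖x − x'‖² from (8) and the tolerances are illustrative choices): the Gram matrix of a cpd kernel is generally not positive definite, but its centered version is.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
m = len(X)

# Gram matrix of the cpd (not pd) kernel k(x, x') = -||x - x'||^2.
K = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# K itself is not positive definite (zero trace, nonzero matrix) ...
assert np.linalg.eigvalsh(K).min() < -1e-8

# ... but (I - ec^T) K (I - ce^T) with c = e/m is, as stated in
# Proposition 7 (cf. Example 9 (ii): this is kernel PCA centering).
P = np.eye(m) - np.ones((m, m)) / m
K_tilde = P @ K @ P
assert np.linalg.eigvalsh(K_tilde).min() > -1e-8
```

This is also the classical multidimensional scaling construction (cf. [12]): double-centering negative squared distances recovers a positive semidefinite matrix of dot products.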
Example 10 (Parzen windows) One of the simplest distance-based classification algorithms conceivable proceeds as follows. Given m_+ points labelled with +1, m_− points labelled with −1, and a test point φ(x), we compute the mean squared distances between the latter and the two classes, and assign it to the one where this mean is smaller,

y = sgn( (1/m_−) Σ_{y_i=−1} ‖φ(x) − φ(x_i)‖² − (1/m_+) Σ_{y_i=1} ‖φ(x) − φ(x_i)‖² ). (18)

We use the distance kernel trick (Proposition 5) to express the decision function as a kernel expansion in input space: a short calculation shows that

y = sgn( (1/m_+) Σ_{y_i=1} k(x, x_i) − (1/m_−) Σ_{y_i=−1} k(x, x_i) + c ), (19)

with the constant offset c = (1/(2m_−)) Σ_{y_i=−1} k(x_i, x_i) − (1/(2m_+)) Σ_{y_i=1} k(x_i, x_i). Note that for some cpd kernels, such as (8), k(x_i, x_i) is always 0; thus c = 0. For others, such as the commonly used Gaussian kernel, k(x_i, x_i) is a nonzero constant, in which case c also vanishes.

For normalized Gaussians and other kernels that are valid density models, the resulting decision boundary can be interpreted as the Bayes decision based on two Parzen windows density estimates of the classes; for general cpd kernels, the analogy is a merely formal one.

Example 11 (Toy experiment) In Fig. 1, we illustrate the finding that kernel PCA can be carried out using cpd kernels. We use the kernel (8). Due to the centering that is built into kernel PCA (cf. Example 9 (ii), and (5)), the case β = 2 actually is equivalent to linear PCA. As we decrease β, we obtain increasingly nonlinear feature extractors.

Note, moreover, that as the kernel parameter β gets smaller, less weight is put on large distances, and we get more localized feature extractors (in the sense that the regions where they have large gradients, i.e., dense sets of contour lines in the plot, get more localized).
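The decision rule (19) of Example 10 can be sketched directly (an illustrative implementation on synthetic data; the choice β = 1 for kernel (8) and the cluster layout are assumptions, not from the paper). Since k(x_i, x_i) = 0 for kernel (8), the offset c drops out:

```python
import numpy as np

def parzen_predict(x, X, y, beta=1.0):
    # Decision function (19) with the cpd kernel k(x, x') = -||x - x'||^beta.
    # For this kernel k(x_i, x_i) = 0, so the constant offset c vanishes.
    k = -np.linalg.norm(X - x, axis=1) ** beta
    return np.sign(k[y == 1].mean() - k[y == -1].mean())

# Two synthetic Gaussian clusters as classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=+2.0, size=(30, 2)),
               rng.normal(loc=-2.0, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

assert parzen_predict(np.array([2.0, 2.0]), X, y) == 1
assert parzen_predict(np.array([-2.0, -2.0]), X, y) == -1
```

Each class vote is just the mean kernel value to that class, which is exactly the mean-squared-distance comparison (18) rewritten in input space.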
Figure 1: Kernel PCA on a toy dataset using the cpd kernel (8); contour plots of the feature extractors corresponding to projections onto the first two principal axes in feature space. From left to right: β = 2, 1.5, 1, 0.5. Notice how smaller values of β make the feature extractors increasingly nonlinear, which allows the identification of the cluster structure.

3 Conclusion

We have described a kernel trick for distances in feature spaces. It can be used to generalize all distance-based algorithms to a feature space setting by substituting a suitable kernel function for the squared distance. The class of kernels that can be used is larger than those commonly used in kernel methods (known as positive definite kernels). We have argued that this reflects the translation invariance of distance-based algorithms, as opposed to genuinely dot product based algorithms. SVMs and kernel PCA are translation invariant in feature space; hence they are really both distance rather than dot product based. We thus argued that they can both use conditionally positive definite kernels. In the case of the SVM, this drops out of the optimization problem automatically [11]; in the case of kernel PCA, it corresponds to the introduction of a reference point in feature space. The contribution of the present work is that it identifies translation invariance as the underlying reason, thus enabling us to use cpd kernels in a much larger class of kernel algorithms, and that it draws the learning community's attention to the kernel trick for distances.

Acknowledgments. Part of the work was done while the author was visiting the Australian National University. Thanks to Nello Cristianini, Ralf Herbrich, Sebastian Mika, Klaus Müller, John Shawe-Taylor, Alex Smola, Mike Tipping, Chris Watkins, Bob Williamson, Chris Williams and a conscientious anonymous reviewer for valuable input.

References

[1] M. A. Aizerman, E. M.
Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
[2] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.
[4] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
[5] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.
[6] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.
[7] I. J. Schoenberg. Metric spaces and positive definite functions. Trans. Amer. Math. Soc., 44:522-536, 1938.
[8] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[9] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[10] A. Smola, T. Frieß, and B. Schölkopf. Semiparametric support vector and linear programming machines. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 585-591, Cambridge, MA, 1999. MIT Press.
[11] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.
[12] W. S. Torgerson. Theory and Methods of Scaling.
Wiley, New York, 1958.
[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[14] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[15] C. Watkins. Personal communication, 2000.
", "award": [], "sourceid": 1862, "authors": [{"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}