{"title": "Regularization with Dot-Product Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 308, "page_last": 314, "abstract": null, "full_text": "Regularization with Dot-Product Kernels \n\nAlex J. SIDola, Zoltan L. Ovari, and Robert C. WilliaIDson \n\nDepartment of Engineering \n\nAustralian National University \n\nCanberra, ACT, 0200 \n\nAbstract \n\nIn this paper we give necessary and sufficient conditions under \nwhich kernels of dot product type k(x, y) = k(x . y) satisfy Mer(cid:173)\ncer's condition and thus may be used in Support Vector Ma(cid:173)\nchines (SVM), Regularization Networks (RN) or Gaussian Pro(cid:173)\ncesses (GP). In particular, we show that if the kernel is analytic \n(i.e. can be expanded in a Taylor series), all expansion coefficients \nhave to be nonnegative. We give an explicit functional form for the \nfeature map by calculating its eigenfunctions and eigenvalues. \n\n1 \n\nIntroduction \n\nKernel functions are widely used in learning algorithms such as Support Vector Ma(cid:173)\nchines, Gaussian Processes, or Regularization Networks. A possible interpretation \nof their effects is that they represent dot products in some feature space :7, i.e. \n\nk(x,y) = \u00a2(x)\u00b7 \u00a2(y) \n\n(1) \nwhere \u00a2 is a map from input (data) space X into:7. Another interpretation is to \nconnect \u00a2 with the regularization properties of the corresponding learning algorithm \n[8]. Most popular kernels can be described by three main categories: translation \ninvariant kernels [9] \n\n(2) \nkernels originating from generative models (e.g. those of Jaakkola and Haussler, or \nWatkins), and thirdly, dot-product kernels \n\nk(x, y) = k(x - y), \n\nk(x, y) = k(x . y). \n\n(3) \n\nSince k influences the properties of the estimates generated by any of the algorithms \nabove, it is natural to ask which regularization properties are associated with k. 
\n\nIn [8, 10, 9] the general connections between kernels and regularization properties \nare pointed out, containing details on the connection between the Fourier spectrum \nof translation invariant kernels and the smoothness properties of the estimates. In \na nutshell, the necessary and sufficient condition for k(x - y) to be a Mercer kernel \n(i.e. be admissible for any of the aforementioned kernel methods) is that its Fourier \ntransform be nonnegative. This also allowed for an easy to check criterion for new \nkernel functions. Moreover, [5] gave a similar analysis for kernels derived from \ngenerative models. \n\n\fDot product kernels k(x . y), on the other hand, have been eluding further theo(cid:173)\nretical analysis and only a necessary condition [1] was found, based on geometrical \nconsiderations. Unfortunately, it does not provide much insight into smoothness \nproperties of the corresponding estimate. \n\nOur aim in the present paper is to shed some light on the properties of dot product \nkernels, give an explicit equation how its eigenvalues can be determined, and, finally, \nshow that for analytic kernels that can be expanded in terms of monomials ~n or \nassociated Legendre polynomials P~(~) [4], i.e. \n\nk(x, y) = k(x\u00b7 y) with k(~) = L anC or k(~) = L bnP~(~) \n\n00 \n\n00 \n\n(4) \n\nn=O \n\nn=O \n\na necessary and sufficient condition is an ~ 0 for all n E N if no assumption \nabout the dimensionality of the input space is made (for finite dimensional spaces \nof dimension d, the condition is that bn ~ 0). In other words, the polynomial \nseries expansion in dot product kernels plays the role of the Fourier transform in \ntranslation invariant kernels. \n\n2 Regularization, Kernels, and Integral Operators \n\nLet us briefly review some results from regularization theory, needed for the fur(cid:173)\nther understanding of the paper. Many algorithms (SVM, GP, RN, etc.) 
can be understood as minimizing a regularized risk functional \n\nR_reg[f] := R_emp[f] + λ Ω[f],  (5) \n\nwhere R_emp is the training error of the function f on the given data, λ > 0, and Ω[f] is the so-called regularization term. The first term depends on the specific problem at hand (classification, regression, large margin algorithms, etc.), λ is generally adjusted by some model selection criterion, and Ω[f] is a nonnegative functional of f which models our belief about which functions should be considered simple (a prior in the Bayesian sense, or a structure in a Structural Risk Minimization sense). \n\n2.1 Regularization Operators \n\nOne possible interpretation of k is [8] that it leads to regularized risk functionals where \n\nΩ[f] = (1/2) ||Pf||^2 or equivalently ⟨Pk(x, ·), Pk(y, ·)⟩ = k(x, y).  (6) \n\nHere P is a regularization operator mapping functions f on X into a dot product space (we choose L_2(X)). The following theorem allows us to construct explicit operators P, and it provides a criterion for whether a symmetric function k(x, y) is suitable. \n\nTheorem 1 (Mercer [3]) Suppose k ∈ L_∞(X^2) is such that the integral operator T_k : L_2(X) → L_2(X), \n\nT_k f(·) := ∫_X k(·, x) f(x) dμ(x),  (7) \n\nis positive. Let ψ_j ∈ L_2(X) be the eigenfunction of T_k with eigenvalue λ_j ≠ 0, normalized such that ||ψ_j||_{L_2} = 1, and let ψ̄_j denote its complex conjugate. Then \n\n1. (λ_j(T))_j ∈ ℓ_1. \n2. ψ_j ∈ L_∞(X) and sup_j ||ψ_j||_{L_∞} < ∞. \n3. k(x, x') = Σ_{j∈ℕ} λ_j ψ_j(x) ψ̄_j(x') holds for almost all (x, x'), where the series converges absolutely and uniformly for almost all (x, x'). \n\nThis means that by finding the eigensystem (λ_i, ψ_i) of T_k we can also determine the regularization operator P via [8] \n\n(8) \n\nThe eigensystem (λ_i, ψ_i) tells us which functions are considered \"simple\" in terms of the operator P.
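Mercer positivity of T_k can be probed numerically: on a finite sample, the Gram matrix is a surrogate for the operator, and a negative eigenvalue witnesses failure of Mercer's condition. The sketch below (an illustration, not a procedure from the paper; the kernels exp(ξ) and tanh(0.5 + ξ) are assumed examples) samples points on the unit circle S^1:

```python
import numpy as np

# Evaluate a dot-product kernel on m points of the unit circle S^1,
# so the Gram matrix is a finite-sample surrogate for the operator T_k.
m = 200
theta = 2 * np.pi * np.arange(m) / m
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)

def min_eig(k):
    # Smallest eigenvalue of the Gram matrix of the dot-product kernel k.
    return np.linalg.eigvalsh(k(X @ X.T)).min()

# exp(x . y): Taylor coefficients 1/n! are all nonnegative, so no negative eigenvalues.
assert min_eig(np.exp) > -1e-8
# tanh(0.5 + x . y): mixed-sign expansion coefficients, and negative eigenvalues appear.
assert min_eig(lambda t: np.tanh(0.5 + t)) < -1e-8
```

Such a finite check can only refute positivity, never prove it; the theorems below give the exact criterion.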
Consequently, in order to determine the regularization properties of dot product kernels we have to find their eigenfunctions and eigenvalues. \n\n2.2 Specific Assumptions \n\nBefore we diagonalize T_k for a given kernel we have yet to specify the assumptions we make about the measure μ and the domain of integration X. Since a suitable choice can drastically simplify the problem, we try to keep as many of the symmetries imposed by k(x · y) as possible. The predominant symmetry in dot product kernels is rotation invariance. Therefore we choose the unit ball in ℝ^d, \n\nX := U_d := {x | x ∈ ℝ^d and ||x||_2 ≤ 1}.  (9) \n\nThis is a benign assumption since the radius can always be adjusted by rescaling k(x · y) → k((Ox) · (Oy)). Similar considerations apply to translation. In some cases the unit sphere in ℝ^d is more amenable to our analysis. There we choose \n\nX := S_{d−1} := {x | x ∈ ℝ^d and ||x||_2 = 1}.  (10) \n\nThe latter is a good approximation of the situation where dot product kernels perform best: when the training data has approximately equal Euclidean norm (e.g. in images of handwritten digits). For the sake of simplicity we will limit ourselves to (10) in most of the cases. \n\nSecondly, we choose μ to be the uniform measure on X. This means that we have to solve the following integral equation: find functions ψ_i ∈ L_2(X) together with coefficients λ_i such that T_k ψ_i(x) := ∫_X k(x · y) ψ_i(y) dy = λ_i ψ_i(x). \n\n3 Orthogonal Polynomials and Spherical Harmonics \n\nBefore we can give eigenfunctions or state necessary and sufficient conditions we need some basic relations about Legendre polynomials and spherical harmonics. \n\nDenote by P_n(ξ) the Legendre polynomials and by P_n^d(ξ) the associated Legendre polynomials (see e.g. [4] for details).
They have the following properties: \n\n• The polynomials P_n(ξ) and P_n^d(ξ) are of degree n, and moreover P_n = P_n^3. \n\n• The (associated) Legendre polynomials form an orthogonal basis with \n\n∫_{−1}^{1} P_n^d(ξ) P_m^d(ξ) (1 − ξ^2)^{(d−3)/2} dξ = (|S_{d−1}| / (|S_{d−2}| N(d, n))) δ_{m,n}.  (11) \n\nHere |S_{d−1}| = 2π^{d/2} / Γ(d/2) denotes the surface area of S_{d−1}, and N(d, n) denotes the multiplicity of spherical harmonics of order n on S_{d−1}, i.e. N(d, n) = ((2n + d − 2)/n) binom(n + d − 3, n − 1). \n\n• This admits the orthogonal expansion of any analytic function k(ξ) on [−1, 1] into the P_n^d by \n\nk(ξ) = Σ_{n=0}^∞ b_n P_n^d(ξ) with b_n = (|S_{d−2}| / |S_{d−1}|) N(d, n) ∫_{−1}^{1} k(ξ) P_n^d(ξ) (1 − ξ^2)^{(d−3)/2} dξ.  (12) \n\nMoreover, the Legendre polynomials may be expanded into an orthonormal basis of spherical harmonics Y_{n,j}^d by the Funk-Hecke equation (cf. e.g. [4]) to obtain \n\nP_n^d(x · y) = (|S_{d−1}| / N(d, n)) Σ_{j=1}^{N(d,n)} Y_{n,j}^d(x) Y_{n,j}^d(y),  (13) \n\nwhere ||x|| = ||y|| = 1, and moreover \n\n∫_{S_{d−1}} Y_{n,j}^d(x) Y_{n',j'}^d(x) dx = δ_{n,n'} δ_{j,j'}.  (14) \n\n4 Conditions and Eigensystems on S_{d−1} \n\nSchoenberg [7] gives necessary and sufficient conditions under which a function k(x · y) defined on S_{d−1} satisfies Mercer's condition. In particular he proves the following two theorems: \n\nTheorem 2 (Dot Product Kernels in Finite Dimensions) A kernel k(x · y) defined on S_{d−1} × S_{d−1} satisfies Mercer's condition if and only if its expansion into the Legendre polynomials P_n^d has only nonnegative coefficients, i.e. \n\nk(ξ) = Σ_{n=0}^∞ b_n P_n^d(ξ) with b_n ≥ 0.  (15) \n\nTheorem 3 (Dot Product Kernels in Infinite Dimensions) A kernel k(x · y) defined on the unit sphere in a Hilbert space satisfies Mercer's condition if and only if its Taylor series expansion has only nonnegative coefficients: \n\nk(ξ) = Σ_{n=0}^∞ a_n ξ^n with a_n ≥ 0.  (16) \n\nTherefore, all we have to do in order to check whether a particular kernel may be used in an SV machine or a Gaussian Process is to look at its polynomial series expansion and check the coefficients.
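The Taylor-coefficient criterion of Theorem 3 lends itself to a direct check. A minimal sketch (not from the paper; the choice of k(ξ) = (1 + ξ)^p is illustrative): the Taylor coefficients of (1 + ξ)^p are the generalized binomial coefficients C(p, n), generated by a one-line recurrence and inspected for sign.

```python
def taylor_coeffs_binom(p, order=12):
    # Generalized binomial coefficients C(p, n): Taylor coefficients of (1 + xi)^p,
    # via C(p, n + 1) = C(p, n) * (p - n) / (n + 1).
    c, out = 1.0, []
    for n in range(order):
        out.append(c)
        c *= (p - n) / (n + 1)
    return out

# p = 3: coefficients 1, 3, 3, 1, 0, 0, ... are all nonnegative -> admissible by (16).
assert all(c >= 0 for c in taylor_coeffs_binom(3))
# p = 2.5: from n = 4 onward the sign alternates, violating condition (16).
assert any(c < 0 for c in taylor_coeffs_binom(2.5))
```

This is the infinite-dimensional criterion; for fixed dimension d one would instead expand into the P_n^d and inspect the b_n of (15).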
This will be done in Section 5. \n\nBefore doing so, note that (16) is a more stringent condition than (15). In other words, in order to prove Mercer's condition for arbitrary dimensions it suffices to show that the Taylor expansion contains only nonnegative coefficients. On the other hand, in order to prove that a candidate kernel function will never satisfy Mercer's condition, it is sufficient to show this for (15) with P_n^d = P_n, i.e. for the Legendre polynomials. \n\nWe conclude this section with an explicit representation of the eigensystem of k(x · y). It is given by the following lemma: \n\nLemma 4 (Eigensystem of Dot Product Kernels) Denote by k(x · y) a kernel on S_{d−1} × S_{d−1} satisfying condition (15) of Theorem 2. Then the eigensystem of k is given by \n\nψ_{n,j} = Y_{n,j}^d with eigenvalues λ_{n,j} = b_n |S_{d−1}| / N(d, n) of multiplicity N(d, n).  (17) \n\nIn other words, N(d, n) determines the regularization properties of k(x · y). \n\nProof. Using the Funk-Hecke formula (13) we may expand (15) further into spherical harmonics Y_{n,j}^d. The latter, however, are orthonormal; hence computing the dot product of the resulting expansion with Y_{n,j}^d(y) over S_{d−1} leaves only the coefficient Y_{n,j}^d(x) b_n |S_{d−1}| / N(d, n), which proves that the Y_{n,j}^d are eigenfunctions of the integral operator T_k. ∎ \n\nIn order to obtain the eigensystem of k(x · y) on U_d we have to expand k into k(x · y) = Σ_{n=0}^∞ (||x|| ||y||)^n P_n^d((x/||x||) · (y/||y||)) and expand ψ into ψ(||x||) ψ(x/||x||). The latter is very technical and is thus omitted; see [6] for details. \n\n5 Examples and Applications \n\nIn the following we will analyze a few kernels and state under which conditions they may be used as SV kernels. \n\nExample 1 (Homogeneous Polynomial Kernels k(x, y) = (x · y)^p) It is well known that this kernel satisfies Mercer's condition for p ∈ ℕ. We will show that for p ∉ ℕ this is never the case.
\nThus we have to show that (15) cannot hold for an expansion in terms of Legendre polynomials (d = 3). From [2, 7.126.1] we obtain for k(ξ) = |ξ|^p (we need |ξ| to make k well-defined) \n\n∫_{−1}^{1} P_n(ξ) |ξ|^p dξ = √π Γ(p + 1) / (2^p Γ(1 + p/2 − n/2) Γ(3/2 + p/2 + n/2)) for n even.  (18) \n\nFor odd n the integral vanishes since P_n(−ξ) = (−1)^n P_n(ξ). In order to satisfy (15), the integral has to be nonnegative for all n. One can see that Γ(1 + p/2 − n/2) is the only term in (18) that may change its sign. Since the sign of the Γ function alternates with period 1 for negative arguments (and the function has poles at negative integer arguments), we cannot find any noninteger p for which both n = 2⌊p/2 + 1⌋ and n = 2⌈p/2 + 1⌉ correspond to positive values of the integral. \n\nExample 2 (Inhomogeneous Polynomial Kernels k(x, y) = (x · y + 1)^p) Likewise we might conjecture that k(ξ) = (1 + ξ)^p is an admissible kernel for all p > 0. Again, we expand k in a series of Legendre polynomials to obtain [2, 7.127] \n\n∫_{−1}^{1} P_n(ξ) (ξ + 1)^p dξ = 2^{p+1} Γ^2(p + 1) / (Γ(p + 2 + n) Γ(p + 1 − n)).  (19) \n\nFor p ∈ ℕ all terms with n > p vanish and the remainder is positive. For noninteger p, however, (19) may change its sign. This is due to Γ(p + 1 − n). In particular, for any p ∉ ℕ (with p > 0) we have Γ(p + 1 − n) < 0 for n = ⌈p⌉ + 1. This violates condition (15); hence such kernels cannot be used in SV machines either. \n\nExample 3 (Vovk's Real Polynomial k(x, y) = (1 − (x · y)^p) / (1 − (x · y)) with p ∈ ℕ) This kernel can be written as k(ξ) = Σ_{n=0}^{p−1} ξ^n, hence all the coefficients a_n = 1, which means that this kernel can be used regardless of the dimensionality of the input space. Likewise we can analyze an infinite power series: \n\nExample 4 (Vovk's Infinite Polynomial k(x, y) = (1 − (x · y))^{−1}) This kernel can be written as k(ξ) = Σ_{n=0}^∞ ξ^n, hence all the coefficients a_n = 1. Since the coefficients do not decay at all, this suggests poor generalization properties of that kernel.
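The sign pattern driving Examples 1 and 2 can be checked numerically. A small sketch (not from the paper; p = 2.5 is an arbitrary illustrative exponent) implements the closed form (18) and exhibits two even orders with coefficients of opposite sign:

```python
import math

def legendre_moment(p, n):
    # Right-hand side of (18): integral of P_n(xi) |xi|^p over [-1, 1] for even n.
    return (math.sqrt(math.pi) * math.gamma(p + 1)
            / (2 ** p * math.gamma(1 + p / 2 - n / 2) * math.gamma(1.5 + p / 2 + n / 2)))

# Sanity check against the elementary case p = n = 2: integral of x^2 P_2(x) is 4/15.
assert abs(legendre_moment(2, 2) - 4 / 15) < 1e-12

# For a noninteger exponent, the two even orders bracketing p have opposite signs,
# so the expansion coefficients cannot all be nonnegative and (15) fails.
p = 2.5
lo, hi = 2 * math.floor(p / 2 + 1), 2 * math.ceil(p / 2 + 1)
assert legendre_moment(p, lo) * legendre_moment(p, hi) < 0
```

The sign change comes entirely from Γ(1 + p/2 − n/2), matching the argument in Example 1.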
\nExample 5 (Neural Network Kernels k(x, y) = tanh(a + (x · y))) It is a longstanding open question whether kernels k(ξ) = tanh(a + ξ) may be used as SV kernels, or for which sets of parameters this might be possible. We show that it is impossible for any set of parameters. \n\nThe technique is identical to the one of Examples 1 and 2: we have to show that k fails the conditions of Theorem 2. Since this is very technical (and is best done using computer algebra programs, e.g. Maple), we refer the reader to [6] for details and explain, for the simpler case of Theorem 3, how the method works. Expanding tanh(a + ξ) into a Taylor series yields \n\ntanh(a + ξ) = tanh a + ξ / cosh^2 a − ξ^2 tanh a / cosh^2 a − (ξ^3 / 3)(1 − tanh^2 a)(1 − 3 tanh^2 a) + O(ξ^4).  (20) \n\nNow we analyze (20) coefficient-wise. Since all coefficients have to be nonnegative, we obtain from the first term a ∈ [0, ∞), from the third term a ∈ (−∞, 0], and finally from the fourth term |a| ∈ [arctanh(1/√3), ∞). This leaves us with a ∈ ∅; hence under no conditions on its parameters does the kernel above satisfy Mercer's condition. \n\n6 Eigensystems on U_d \n\nIn order to find the eigensystem of T_k on U_d we have to find a different representation of k where the radial part ||x|| ||y|| and the angular part ξ = (x/||x||) · (y/||y||) are factored out separately. We assume that k(x · y) can be written as \n\nk(x · y) = Σ_{n=0}^∞ K_n(||x||, ||y||) P_n^d(ξ),  (21) \n\nwhere the K_n are polynomials. To see that we can always find such an expansion for analytic functions, first expand k in a Taylor series and then expand each coefficient (||x|| ||y|| ξ)^n into (||x|| ||y||)^n Σ_{j=0}^n c_j(d, n) P_j^d(ξ). Rearranging terms into a series of the P_j^d gives expansion (21). This allows us to factorize the integral operator into its radial and its angular part.
We obtain the following theorem: \n\nTheorem 5 (Eigenfunctions of T_k on U_d) For any kernel k with expansion (21) the eigensystem of the integral operator T_k on U_d is given by \n\n(22) \n\nwith eigenvalues λ_{n,j,l} = (|S_{d−1}| / N(d, n)) λ_{n,l} and multiplicity N(d, n). \n\nProof. Apply the Funk-Hecke equation (13) to expand the associated Legendre polynomials P_n^d into the spherical harmonics Y_{n,j}^d. As in Lemma 4 this leads to the spherical harmonics as the angular part of the eigensystem. The remaining radial part is then (23). See [6] for more details. ∎ \n\nThis leads to the eigensystem of the homogeneous polynomial kernel k(x, y) = (x · y)^p: if we use (18) in conjunction with (12) to expand ξ^p into a series of the P_n^d(ξ), we obtain an expansion of type (21) in which K_n(r_x, r_y) ∝ (r_x r_y)^p for n ≤ p and K_n(r_x, r_y) = 0 otherwise. Hence the only solution to (23) is Φ_n(r) = r^p, and thus ψ_{n,j}(x) = ||x||^p Y_{n,j}^d(x/||x||). Eigenvalues can be obtained in a similar way. \n\n7 Discussion \n\nIn this paper we gave conditions on the properties of dot product kernels under which the latter satisfy Mercer's condition. While the requirements are relatively easy to check in the case where data is restricted to spheres (which allowed us to prove that several kernels may never be suitable SV kernels) and led to explicit formulations for eigenvalues and eigenfunctions, the corresponding calculations on balls are more intricate and mainly amenable to numerical analysis. \n\nAcknowledgments: AS was supported by the DFG (Sm 62-1). The authors thank Bernhard Schölkopf for helpful discussions. \n\nReferences \n\n[1] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 89-116, Cambridge, MA, 1999. MIT Press. \n[2] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, New York, 1981. \n[3] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909. \n[4] C. Müller. Analysis of Spherical Symmetries in Euclidean Spaces, volume 129 of Applied Mathematical Sciences. Springer, New York, 1997. \n[5] N. Oliver, B. Schölkopf, and A. J. Smola. Natural regularization in SVMs. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 51-60, Cambridge, MA, 2000. MIT Press. \n[6] Z. Óvári. Kernels, eigenvalues and support vector machines. Honours thesis, Australian National University, Canberra, 2000. \n[7] I. Schoenberg. Positive definite functions on spheres. Duke Math. J., 9:96-108, 1942. \n[8] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998. \n[9] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990. \n[10] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.", "award": [], "sourceid": 1790, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Zolt\u00e1n", "family_name": "\u00d3v\u00e1ri", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}