{"title": "Regularization with Dot-Product Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 308, "page_last": 314, "abstract": null, "full_text": "Regularization  with Dot-Product  Kernels \n\nAlex J.  SIDola,  Zoltan L.  Ovari, and Robert  C.  WilliaIDson \n\nDepartment of Engineering \n\nAustralian National University \n\nCanberra, ACT,  0200 \n\nAbstract \n\nIn  this  paper  we  give  necessary  and  sufficient  conditions  under \nwhich  kernels  of dot  product  type  k(x, y)  =  k(x . y)  satisfy Mer(cid:173)\ncer's  condition  and  thus  may  be  used  in  Support  Vector  Ma(cid:173)\nchines  (SVM),  Regularization  Networks  (RN)  or  Gaussian  Pro(cid:173)\ncesses  (GP).  In  particular,  we  show  that  if the  kernel  is  analytic \n(i.e.  can be expanded in a  Taylor series),  all expansion coefficients \nhave to be nonnegative.  We  give an explicit functional form for  the \nfeature map by calculating its eigenfunctions and eigenvalues. \n\n1 \n\nIntroduction \n\nKernel functions are widely used in learning algorithms such as Support Vector Ma(cid:173)\nchines,  Gaussian  Processes,  or Regularization Networks.  A  possible interpretation \nof their effects is that they represent dot products in some feature  space  :7,  i.e. \n\nk(x,y) =  \u00a2(x)\u00b7 \u00a2(y) \n\n(1) \nwhere  \u00a2  is  a  map  from  input  (data)  space  X  into:7.  Another interpretation is  to \nconnect \u00a2 with the regularization properties of the corresponding learning algorithm \n[8].  Most  popular  kernels  can be  described  by  three  main categories:  translation \ninvariant kernels  [9] \n\n(2) \nkernels originating from generative models  (e.g.  those of Jaakkola and Haussler,  or \nWatkins), and thirdly, dot-product kernels \n\nk(x, y)  = k(x - y), \n\nk(x, y)  =  k(x . y). \n\n(3) \n\nSince k  influences the properties of the estimates generated by any of the algorithms \nabove,  it is  natural to ask which regularization properties are associated with k. \n\nIn  [8,  10,  9]  the general connections between kernels and regularization properties \nare pointed out, containing details on the connection between the Fourier spectrum \nof translation invariant kernels  and the smoothness properties of the estimates.  In \na  nutshell, the necessary and sufficient condition for  k(x - y)  to be a  Mercer kernel \n(i.e.  be admissible for  any of the aforementioned kernel methods) is that its Fourier \ntransform be nonnegative.  This also  allowed for  an easy to check criterion for  new \nkernel  functions.  Moreover,  [5]  gave  a  similar  analysis  for  kernels  derived  from \ngenerative models. \n\n\fDot  product  kernels  k(x . y),  on  the  other  hand,  have  been eluding  further  theo(cid:173)\nretical analysis and only a  necessary condition  [1]  was found,  based on geometrical \nconsiderations.  Unfortunately,  it  does  not  provide  much  insight  into  smoothness \nproperties of the corresponding estimate. \n\nOur aim in the present paper is to shed some light on the properties of dot product \nkernels, give an explicit equation how its eigenvalues can be determined, and, finally, \nshow  that for  analytic kernels  that can be expanded in terms  of monomials  ~n or \nassociated Legendre polynomials P~(~) [4],  i.e. 
In [8, 10, 9] the general connections between kernels and regularization properties are pointed out, including details on the connection between the Fourier spectrum of translation invariant kernels and the smoothness properties of the estimates. In a nutshell, the necessary and sufficient condition for $k(x - y)$ to be a Mercer kernel (i.e. to be admissible for any of the aforementioned kernel methods) is that its Fourier transform be nonnegative. This also yields an easy-to-check criterion for new kernel functions. Moreover, [5] gave a similar analysis for kernels derived from generative models.\n\nDot product kernels $k(x \cdot y)$, on the other hand, have been eluding further theoretical analysis, and only a necessary condition [1] was found, based on geometrical considerations. Unfortunately, it does not provide much insight into the smoothness properties of the corresponding estimate.\n\nOur aim in the present paper is to shed some light on the properties of dot product kernels, to give an explicit equation by which their eigenvalues can be determined, and, finally, to show that for analytic kernels that can be expanded in terms of monomials $\xi^n$ or associated Legendre polynomials $P_n^d(\xi)$ [4], i.e.\n\n$$k(x, y) = k(x \cdot y) \text{ with } k(\xi) = \sum_{n=0}^{\infty} a_n \xi^n \text{ or } k(\xi) = \sum_{n=0}^{\infty} b_n P_n^d(\xi), \qquad (4)$$\n\na necessary and sufficient condition is $a_n \geq 0$ for all $n$ if no assumption about the dimensionality of the input space is made (for finite dimensional spaces of dimension $d$, the condition is $b_n \geq 0$). In other words, the polynomial series expansion of dot product kernels plays the role that the Fourier transform plays for translation invariant kernels.\n\n2 Regularization, Kernels, and Integral Operators\n\nLet us briefly review some results from regularization theory needed for the further understanding of the paper. Many algorithms (SVM, GP, RN, etc.) can be understood as minimizing a regularized risk functional\n\n$$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda \Omega[f] \qquad (5)$$\n\nwhere $R_{\mathrm{emp}}$ is the training error of the function $f$ on the given data, $\lambda > 0$, and $\Omega[f]$ is the so-called regularization term. The first term depends on the specific problem at hand (classification, regression, large margin algorithms, etc.), $\lambda$ is generally adjusted by some model selection criterion, and $\Omega[f]$ is a nonnegative functional of $f$ which models our belief about which functions should be considered simple (a prior in the Bayesian sense, or a structure in a Structural Risk Minimization sense).\n\n2.1 Regularization Operators\n\nOne possible interpretation of $k$ is [8] that it leads to regularized risk functionals where\n\n$$\Omega[f] = \frac{1}{2} \|Pf\|^2 \quad \text{or equivalently} \quad \langle Pk(x, \cdot), Pk(y, \cdot) \rangle = k(x, y). \qquad (6)$$\n\nHere $P$ is a regularization operator mapping functions $f$ on $X$ into a dot product space (we choose $L_2(X)$). The following theorem allows us to construct explicit operators $P$, and it provides a criterion for whether a symmetric function $k(x, y)$ is suitable.\n\nTheorem 1 (Mercer [3]) Suppose $k \in L_\infty(X^2)$ is such that the integral operator $T_k : L_2(X) \to L_2(X)$,\n\n$$T_k f(\cdot) := \int_X k(\cdot, x) f(x) \, d\mu(x), \qquad (7)$$\n\nis positive. Let $\phi_j \in L_2(X)$ be the eigenfunction of $T_k$ with eigenvalue $\lambda_j \not= 0$, normalized such that $\|\phi_j\|_{L_2} = 1$, and let $\bar{\phi}_j$ denote its complex conjugate. Then\n\n1. $(\lambda_j(T))_j \in \ell_1$.\n2. $\phi_j \in L_\infty(X)$ and $\sup_j \|\phi_j\|_{L_\infty} < \infty$.\n3. $k(x, x') = \sum_{j} \lambda_j \bar{\phi}_j(x) \phi_j(x')$ holds for almost all $(x, x')$, where the series converges absolutely and uniformly for almost all $(x, x')$.\n\nThis means that by finding the eigensystem $(\lambda_i, \phi_i)$ of $T_k$ we can also determine the regularization operator $P$ via [8]\n\n$$Pf = \sum_i \lambda_i^{-1/2} \langle \phi_i, f \rangle \phi_i. \qquad (8)$$\n\nThe eigensystem $(\lambda_i, \phi_i)$ tells us which functions are considered \"simple\" in terms of the operator $P$. Consequently, in order to determine the regularization properties of dot product kernels we have to find their eigenfunctions and eigenvalues.\n\n2.2 Specific Assumptions\n\nBefore we diagonalize $T_k$ for a given kernel we have yet to specify the assumptions we make about the measure $\mu$ and the domain of integration $X$. Since a suitable choice can drastically simplify the problem, we try to keep as many of the symmetries imposed by $k(x \cdot y)$ as possible. The predominant symmetry in dot product kernels is rotation invariance. Therefore we choose the unit ball in $\mathbb{R}^d$,\n\n$$X := U_d := \{x \mid x \in \mathbb{R}^d \text{ and } \|x\|_2 \leq 1\}. \qquad (9)$$\n\nThis is a benign assumption, since the radius can always be adjusted by rescaling $k(x \cdot y) \to k((cx) \cdot (cy))$ for some $c > 0$. Similar considerations apply to translation. In some cases the unit sphere in $\mathbb{R}^d$ is more amenable to our analysis. There we choose\n\n$$X := S_{d-1} := \{x \mid x \in \mathbb{R}^d \text{ and } \|x\|_2 = 1\}. \qquad (10)$$\n\nThe latter is a good approximation of the situation where dot product kernels perform best - when the training data has approximately equal Euclidean norm (e.g. images or handwritten digits). For the sake of simplicity we will limit ourselves to (10) in most of the cases.\n\nSecondly, we choose $\mu$ to be the uniform measure on $X$. This means that we have to solve the following integral equation: find functions $\phi_i \in L_2(X)$ together with coefficients $\lambda_i$ such that\n\n$$T_k \phi_i(x) := \int_X k(x \cdot y) \phi_i(y) \, dy = \lambda_i \phi_i(x).$$
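This eigenvalue problem can rarely be solved in closed form for arbitrary $k$; a common numerical workaround is the Nyström method, sketched below. This is our illustration, not part of the paper: the kernel $k(\xi) = e^{\xi}$, the choice $d = 3$, and the sample size are assumptions, and the eigenvalues of the rescaled Gram matrix only approximate those of $T_k$ up to Monte Carlo error.\n\n
import numpy as np
from math import gamma, pi

# Nystrom sketch: eigenvalues of the Gram matrix on uniform samples
# from S_{d-1}, rescaled by |S_{d-1}|/N, approximate those of T_k.
d, N = 3, 2000
rng = np.random.default_rng(0)

X = rng.normal(size=(N, d))                    # Gaussian vectors ...
X /= np.linalg.norm(X, axis=1, keepdims=True)  # ... normalized: uniform on S_{d-1}

K = np.exp(X @ X.T)                             # Gram matrix for k(xi) = exp(xi)
surface = 2 * pi ** (d / 2) / gamma(d / 2)      # |S_{d-1}|
lam = np.linalg.eigvalsh(K)[::-1] * surface / N
# the leading eigenvalues come in clusters of size 2n + 1 = N(3, n),
# matching the multiplicities derived in Section 4
print(lam[:9])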
3 Orthogonal Polynomials and Spherical Harmonics\n\nBefore we can give eigenfunctions or state necessary and sufficient conditions we need some basic relations for Legendre polynomials and spherical harmonics.\n\nDenote by $P_n(\xi)$ the Legendre polynomials and by $P_n^d(\xi)$ the associated Legendre polynomials (see e.g. [4] for details). They have the following properties:\n\n• The polynomials $P_n(\xi)$ and $P_n^d(\xi)$ are of degree $n$, and moreover $P_n = P_n^3$.\n\n• The (associated) Legendre polynomials form an orthogonal basis with\n\n$$\int_{-1}^{1} P_n^d(\xi) P_m^d(\xi) (1 - \xi^2)^{\frac{d-3}{2}} \, d\xi = \frac{|S_{d-1}|}{|S_{d-2}|} \frac{\delta_{m,n}}{N(d, n)}. \qquad (11)$$\n\nHere $|S_{d-1}| = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ denotes the surface area of $S_{d-1}$, and $N(d, n)$ denotes the multiplicity of spherical harmonics of order $n$ on $S_{d-1}$, i.e. $N(d, n) = \frac{2n + d - 2}{n} \binom{n + d - 3}{n - 1}$.\n\n• This admits the orthogonal expansion of any analytic function $k(\xi)$ on $[-1, 1]$ into the $P_n^d$ by\n\n$$k(\xi) = \sum_{n=0}^{\infty} b_n P_n^d(\xi) \text{ with } b_n = \frac{|S_{d-2}|}{|S_{d-1}|} N(d, n) \int_{-1}^{1} k(\xi) P_n^d(\xi) (1 - \xi^2)^{\frac{d-3}{2}} \, d\xi. \qquad (12)$$\n\nMoreover, the Legendre polynomials may be expanded into an orthonormal basis of spherical harmonics $Y_{n,j}^d$ by the Funk-Hecke equation (cf. e.g. [4]) to obtain\n\n$$P_n^d(x \cdot y) = \frac{|S_{d-1}|}{N(d, n)} \sum_{j=1}^{N(d, n)} Y_{n,j}^d(x) Y_{n,j}^d(y) \qquad (13)$$\n\nwhere $\|x\| = \|y\| = 1$, and moreover\n\n$$\int_{S_{d-1}} Y_{n,j}^d(x) Y_{n',j'}^d(x) \, dx = \delta_{n,n'} \delta_{j,j'}. \qquad (14)$$\n\n4 Conditions and Eigensystems on $S_{d-1}$\n\nSchoenberg [7] gives necessary and sufficient conditions under which a function $k(x \cdot y)$ defined on $S_{d-1}$ satisfies Mercer's condition. In particular he proves the following two theorems:\n\nTheorem 2 (Dot Product Kernels in Finite Dimensions) A kernel $k(x \cdot y)$ defined on $S_{d-1} \times S_{d-1}$ satisfies Mercer's condition if and only if its expansion into Legendre polynomials $P_n^d$ has only nonnegative coefficients, i.e.\n\n$$k(\xi) = \sum_{n=0}^{\infty} b_n P_n^d(\xi) \text{ with } b_n \geq 0. \qquad (15)$$\n\nTheorem 3 (Dot Product Kernels in Infinite Dimensions) A kernel $k(x \cdot y)$ defined on the unit sphere in a Hilbert space satisfies Mercer's condition if and only if its Taylor series expansion has only nonnegative coefficients:\n\n$$k(\xi) = \sum_{n=0}^{\infty} a_n \xi^n \text{ with } a_n \geq 0. \qquad (16)$$\n\nTherefore, all we have to do in order to check whether a particular kernel may be used in an SV machine or a Gaussian Process is to look at its polynomial series expansion and check the coefficients. This will be done in Section 5.
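Theorems 2 and 3 suggest a mechanical test. Below is a sketch of such a check (ours, not from the paper); it assumes $d \geq 3$ and uses the fact that $P_n^d$ is a positive multiple of the Gegenbauer polynomial $C_n^{(d-2)/2}$ (the Legendre polynomial $P_n$ for $d = 3$), so the sign of the weighted integral in (12) equals the sign of $b_n$.\n\n
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_gegenbauer

def b_sign(k, n, d=3):
    # value with the same sign as b_n in k(xi) = sum_n b_n P_n^d(xi);
    # for d >= 3, P_n^d is a positive multiple of C_n^{(d-2)/2}
    alpha = (d - 2) / 2
    val, _ = quad(lambda t: k(t) * eval_gegenbauer(n, alpha, t)
                  * (1 - t * t) ** ((d - 3) / 2), -1, 1)
    return val

for n in range(6):
    print(n,
          b_sign(np.exp, n) > 0,                    # exp: every b_n > 0
          b_sign(lambda t: (1 + t) ** 2.5, n) > 0)  # (1+xi)^2.5: fails at n = 4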
Before doing so, note that (16) is a more stringent condition than (15). In other words, in order to prove Mercer's condition for arbitrary dimensions it suffices to show that the Taylor expansion contains only positive coefficients. Conversely, in order to prove that a candidate kernel function will never satisfy Mercer's condition, it is sufficient to show this for (15) with $P_n^3 = P_n$, i.e. for the Legendre polynomials.\n\nWe conclude this section with an explicit representation of the eigensystem of $k(x \cdot y)$. It is given by the following lemma:\n\nLemma 4 (Eigensystem of Dot Product Kernels) Denote by $k(x \cdot y)$ a kernel on $S_{d-1} \times S_{d-1}$ satisfying condition (15) of Theorem 2. Then the eigensystem of $k$ is given by\n\n$$\psi_{n,j} = Y_{n,j}^d \text{ with eigenvalues } \lambda_{n,j} = b_n \frac{|S_{d-1}|}{N(d, n)} \text{ of multiplicity } N(d, n). \qquad (17)$$\n\nIn other words, $N(d, n)$ determines the regularization properties of $k(x \cdot y)$.\n\nProof. Using the Funk-Hecke formula (13) we may expand (15) further into spherical harmonics $Y_{n,j}^d$. The latter, however, are orthonormal, hence computing the dot product of the resulting expansion with $Y_{n,j}^d(y)$ over $S_{d-1}$ leaves only the term $Y_{n,j}^d(x) \, b_n \frac{|S_{d-1}|}{N(d, n)}$, which proves that the $Y_{n,j}^d$ are eigenfunctions of the integral operator $T_k$.\n\nIn order to obtain the eigensystem of $k(x \cdot y)$ on $U_d$ we have to expand $k$ into a series of the form $k(x \cdot y) = \sum_{n=0}^{\infty} K_n(\|x\|, \|y\|) P_n^d\left(\frac{x}{\|x\|} \cdot \frac{y}{\|y\|}\right)$ and use the ansatz $\psi(x) = \phi(\|x\|) \, \psi\left(\frac{x}{\|x\|}\right)$ for the eigenfunctions. The latter is very technical and is thus omitted; see [6] for details and Section 6 for the main result.\n\n5 Examples and Applications\n\nIn the following we analyze a few kernels and state under which conditions they may be used as SV kernels.\n\nExample 1 (Homogeneous Polynomial Kernels $k(x, y) = (x \cdot y)^p$) It is well known that this kernel satisfies Mercer's condition for $p \in \mathbb{N}$. We will show that for $p \not\in \mathbb{N}$ this is never the case.\n\nThus we have to show that (15) cannot hold for an expansion in terms of Legendre polynomials ($d = 3$). From [2, 7.126.1] we obtain for $k(\xi) = |\xi|^p$ (we need $|\xi|$ to make $k$ well-defined for noninteger $p$)\n\n$$\int_{-1}^{1} P_n(\xi) |\xi|^p \, d\xi = \frac{\sqrt{\pi} \, \Gamma(p + 1)}{2^p \, \Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right) \Gamma\left(\frac{3}{2} + \frac{p}{2} + \frac{n}{2}\right)} \quad \text{for even } n. \qquad (18)$$\n\nFor odd $n$ the integral vanishes since $P_n(-\xi) = (-1)^n P_n(\xi)$. In order to satisfy (15), the integral has to be nonnegative for all $n$. One can see that $\Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right)$ is the only term in (18) that may change its sign. Since the sign of the $\Gamma$ function alternates with period 1 for negative arguments (and $\Gamma$ has poles at negative integer arguments), we cannot find any $p \not\in \mathbb{N}$ for which the integrals at both $n = 2\lfloor \frac{p}{2} + 1 \rfloor$ and $n = 2\lceil \frac{p}{2} + 1 \rceil$ are nonnegative.\n\nExample 2 (Inhomogeneous Polynomial Kernels $k(x, y) = (x \cdot y + 1)^p$) Likewise we might conjecture that $k(\xi) = (1 + \xi)^p$ is an admissible kernel for all $p > 0$. Again, we expand $k$ in a series of Legendre polynomials to obtain [2, 7.127]\n\n$$\int_{-1}^{1} P_n(\xi) (\xi + 1)^p \, d\xi = \frac{2^{p+1} \Gamma^2(p + 1)}{\Gamma(p + 2 + n) \Gamma(p + 1 - n)}. \qquad (19)$$\n\nFor $p \in \mathbb{N}$ all terms with $n > p$ vanish and the remainder is positive. For noninteger $p$, however, (19) may change its sign. This is due to $\Gamma(p + 1 - n)$. In particular, for any $p \not\in \mathbb{N}$ (with $p > 0$) we have $\Gamma(p + 1 - n) < 0$ for $n = \lceil p \rceil + 1$. This violates condition (15), hence such kernels cannot be used in SV machines either.
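The sign pattern of (19) is easy to reproduce numerically. The following sketch (our addition; $p = 2.5$ is an arbitrary noninteger choice) compares the closed form against direct quadrature and exhibits the negative term at $n = \lceil p \rceil + 1 = 4$:\n\n
from scipy.integrate import quad
from scipy.special import eval_legendre, gamma

p = 2.5
for n in range(7):
    # closed form (19); scipy's gamma is signed for negative
    # noninteger arguments, so the sign flip shows up directly
    closed = 2 ** (p + 1) * gamma(p + 1) ** 2 / (gamma(p + 2 + n) * gamma(p + 1 - n))
    numeric, _ = quad(lambda t: eval_legendre(n, t) * (1 + t) ** p, -1, 1)
    print(n, float(closed), numeric)
# the n = 4 entry is negative, violating condition (15)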
Example 3 (Vovk's Real Polynomial $k(x, y) = \frac{1 - (x \cdot y)^p}{1 - (x \cdot y)}$ with $p \in \mathbb{N}$) This kernel can be written as $k(\xi) = \sum_{n=0}^{p-1} \xi^n$, hence all coefficients are $a_n = 1$, which means that this kernel can be used regardless of the dimensionality of the input space. Likewise we can analyze the corresponding infinite power series:\n\nExample 4 (Vovk's Infinite Polynomial $k(x, y) = (1 - (x \cdot y))^{-1}$) This kernel can be written as $k(\xi) = \sum_{n=0}^{\infty} \xi^n$, hence all coefficients are $a_n = 1$. Since the coefficients do not decay at all, this suggests poor generalization properties of the kernel.\n\nExample 5 (Neural Network Kernels $k(x, y) = \tanh(a + (x \cdot y))$) It is a longstanding open question whether kernels $k(\xi) = \tanh(a + \xi)$ may be used as SV kernels, or for which sets of parameters this might be possible. We show that it is impossible for any set of parameters.\n\nThe technique is identical to the one of Examples 1 and 2: we have to show that $k$ fails the conditions of Theorem 2. Since this is very technical (and is best done using computer algebra programs, e.g. Maple), we refer the reader to [6] for details and explain, for the simpler case of Theorem 3, how the method works. Expanding $\tanh(a + \xi)$ into a Taylor series yields\n\n$$\tanh(a + \xi) = \tanh a + \frac{\xi}{\cosh^2 a} - \frac{\xi^2 \tanh a}{\cosh^2 a} - \frac{\xi^3}{3} (1 - \tanh^2 a)(1 - 3 \tanh^2 a) + O(\xi^4). \qquad (20)$$\n\nNow we analyze (20) coefficient-wise. Since all coefficients have to be nonnegative, we obtain from the first term $a \in [0, \infty)$, from the third term $a \in (-\infty, 0]$, and finally from the fourth term $|a| \in [\operatorname{arctanh} \sqrt{1/3}, \operatorname{arctanh} 1]$. This leaves us with $a \in \emptyset$, hence for no choice of its parameters does the kernel above satisfy Mercer's condition.
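The coefficient analysis of (20) is easy to reproduce with a computer algebra system; here is a short sketch in SymPy (our substitute for the Maple computation mentioned above):\n\n
import sympy as sp

a, xi = sp.symbols('a xi', real=True)
expansion = sp.series(sp.tanh(a + xi), xi, 0, 4).removeO()

for n in range(4):
    print(n, sp.simplify(expansion.coeff(xi, n)))
# coefficient of xi^0: tanh(a)                    >= 0 forces a >= 0
# coefficient of xi^2: -tanh(a)(1 - tanh(a)^2)    >= 0 forces a <= 0
# coefficient of xi^3: nonnegative only if tanh(a)^2 >= 1/3
# the constraints have empty intersection, so no admissible a exists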
6 Eigensystems on $U_d$\n\nIn order to find the eigensystem of $T_k$ on $U_d$ we have to find a different representation of $k$ where the radial part $\|x\| \|y\|$ and the angular part $\xi = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$ are factored out separately. We assume that $k(x \cdot y)$ can be written as\n\n$$k(x \cdot y) = \sum_{n=0}^{\infty} K_n(\|x\|, \|y\|) \, P_n^d\left(\frac{x}{\|x\|} \cdot \frac{y}{\|y\|}\right) \qquad (21)$$\n\nwhere the $K_n$ are polynomials. To see that we can always find such an expansion for analytic functions, first expand $k$ in a Taylor series and then expand each coefficient $(\|x\| \|y\| \xi)^n$ into $(\|x\| \|y\|)^n \sum_{j=0}^{n} c_j(d, n) P_j^d(\xi)$. Rearranging terms into a series of $P_j^d$ gives expansion (21). This allows us to factorize the integral operator into its radial and its angular part. We obtain the following theorem:\n\nTheorem 5 (Eigenfunctions of $T_k$ on $U_d$) For any kernel $k$ with expansion (21) the eigensystem of the integral operator $T_k$ on $U_d$ is given by\n\n$$\varphi_{n,j,l}(x) = Y_{n,j}^d\left(\frac{x}{\|x\|}\right) \phi_{n,l}(\|x\|) \qquad (22)$$\n\nwith eigenvalues $\lambda_{n,j,l} = \frac{|S_{d-1}|}{N(d, n)} \lambda_{n,l}$ and multiplicity $N(d, n)$, where $(\phi_{n,l}, \lambda_{n,l})$ is the eigensystem of the integral operator\n\n$$\int_0^1 r_x^{d-1} K_n(r_x, r_y) \phi_{n,l}(r_x) \, dr_x = \lambda_{n,l} \phi_{n,l}(r_y). \qquad (23)$$\n\nIn general, (23) cannot be solved analytically. However, the accuracy of numerically solving (23) (a finite integral in one dimension) is much higher than when diagonalizing $T_k$ directly.\n\nProof. All we have to do is split the integral $\int_{U_d} dx$ into $\int_0^1 r^{d-1} \, dr \int_{S_{d-1}} d\Omega$. Moreover, note that since $T_k$ commutes with the group of rotations, it follows from group theory [4] that we may separate the angular and the radial part in the eigenfunctions, hence we use the ansatz $\varphi(x) = \varphi_0\left(\frac{x}{\|x\|}\right) \phi(\|x\|)$.\n\nNext apply the Funk-Hecke equation (13) to expand the associated Legendre polynomials $P_n^d$ into the spherical harmonics $Y_{n,j}^d$. As in Lemma 4 this leads to the spherical harmonics as the angular part of the eigensystem. The remaining radial part is then (23). See [6] for more details.\n\nThis leads to the eigensystem of the homogeneous polynomial kernel $k(x, y) = (x \cdot y)^p$: if we use (18) in conjunction with (12) to expand $\xi^p$ into a series of $P_n^d(\xi)$ we obtain an expansion of type (21) where $K_n(r_x, r_y) \propto (r_x r_y)^p$ for $n \leq p$ and $K_n(r_x, r_y) = 0$ otherwise. Hence the only solution to (23) is $\phi_n(r) = r^p$, and thus $\varphi_{n,j}(x) = \|x\|^p Y_{n,j}^d\left(\frac{x}{\|x\|}\right)$. Eigenvalues can be obtained in a similar way.\n\n7 Discussion\n\nIn this paper we gave conditions on the properties of dot product kernels under which the latter satisfy Mercer's condition. The requirements are relatively easy to check in the case where the data is restricted to spheres, which allowed us to prove that several kernels can never be suitable SV kernels and led to explicit formulations for eigenvalues and eigenfunctions; the corresponding calculations on balls are more intricate and mainly amenable to numerical analysis.\n\nAcknowledgments: AS was supported by the DFG (Sm 62-1). The authors thank Bernhard Schölkopf for helpful discussions.\n\nReferences\n\n[1] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 89-116, Cambridge, MA, 1999. MIT Press.\n[2] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, New York, 1981.\n[3] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.\n[4] C. Müller. Analysis of Spherical Symmetries in Euclidean Spaces, volume 129 of Applied Mathematical Sciences. Springer, New York, 1997.\n[5] N. Oliver, B. Schölkopf, and A. J. Smola. Natural regularization in SVMs. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 51-60, Cambridge, MA, 2000. MIT Press.\n[6] Z. Óvári. Kernels, eigenvalues and support vector machines. Honours thesis, Australian National University, Canberra, 2000.\n[7] I. Schoenberg. Positive definite functions on spheres. Duke Math. J., 9:96-108, 1942.\n[8] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.\n[9] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.\n[10] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.
", "award": [], "sourceid": 1790, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Zolt\u00e1n", "family_name": "\u00d3v\u00e1ri", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}