{"title": "Optimal learning rates for least squares SVMs using Gaussian kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1539, "page_last": 1547, "abstract": "We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions.", "full_text": "Optimal learning rates for least squares SVMs using\n\nGaussian kernels\n\nM. Eberts, I. Steinwart\n\nInstitute for Stochastics and Applications\n\nUniversity of Stuttgart\n\nD-70569 Stuttgart\n\n{eberts,ingo.steinwart}@mathematik.uni-stuttgart.de\n\nAbstract\n\nWe prove a new oracle inequality for support vector machines with Gaussian RBF\nkernels solving the regularized least squares regression problem. To this end, we\napply the modulus of smoothness. With the help of the new oracle inequality we\nthen derive learning rates that can also be achieved by a simple data-dependent\nparameter selection method. Finally, it turns out that our learning rates are asymp-\ntotically optimal for regression functions satisfying certain standard smoothness\nconditions.\n\n1\n\nIntroduction\n\nOn the basis of i.i.d. observations D := ((x1, y1) , . . . , (xn, yn)) of input/output observations drawn\nfrom an unknown distribution P on X \u21e5 Y , where Y \u21e2 R, the goal of non-parametric least squares\nregression is to \ufb01nd a function fD : X ! R such that, for the least squares loss L : Y \u21e5R ! [0,1)\nde\ufb01ned by L (y, t) = (y t)2, the risk\n\nRL,P (fD) :=ZX\u21e5Y\n\nL (y, fD (x)) dP (x, y) =ZX\u21e5Y\nis small. 
This means RL,P (fD) has to be close to the optimal risk\n\n(y fD (x))2 dP (x, y)\n\nR\u21e4L,P := inf {RL,P (f) | f : X ! R measureable} ,\n\ncalled the Bayes risk with respect to P and L. It is well known that the function f\u21e4L,P : X ! R\nde\ufb01ned by f\u21e4L,P (x) = EP (Y |x), x 2 X, is the only function for which the Bayes risk is attained.\nFurthermore, some simple transformations show\n\n(1)\n\nRL,P (f) R \u21e4L,P =ZXf f\u21e4L,P2 dPX =f f\u21e4L,P2\n\nL2(PX ) ,\n\nwhere PX is the marginal distribution of P on X.\nIn this paper, we assume that X \u21e2 Rd is a non-empty, open and bounded set such that its boundary\n@X has Lebesgue measure 0, Y := [M, M] for some M > 0 and P is a probability measure on\nX\u21e5Y such that PX is the uniform distribution on X. In Section 2 we also discuss that this condition\ncan easily be generalized by assuming that PX on X is absolutely continuous with respect to the\nLebesgue measure on X such that the corresponding density of PX is bounded away from 0 and 1.\nRecall that because of the \ufb01rst assumption, it suf\ufb01ces to restrict considerations to decision functions\n\nf : X ! [M, M]. To be more precise, if, we denote the clipped value of some t 2 R by\u00dbt, that is\n\nM if t < M\nt\nM if t > M ,\n\nif t 2 [M, M]\n\n\u00dbt :=8<:\n\n1\n\n\fthen it is easy to check that\n\nfor all f : X ! R.\nThe non-parametric least squares problem can be solved in many ways. Several of them are e.g. de-\nscribed in [1]. 
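The clipping step above can be checked numerically: since the labels lie in $Y = [-M, M]$, projecting any prediction onto $[-M, M]$ can only move it closer to the label, so the least squares risk cannot increase. A minimal NumPy sketch (illustrative only; the random draws stand in for $P$ and for an arbitrary decision function $f$):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1.0
n = 1000

# Labels take values in Y = [-M, M]; f is an arbitrary decision function
# whose values may leave [-M, M].
y = rng.uniform(-M, M, size=n)
f_vals = rng.normal(scale=3.0, size=n)

def clip(t, M):
    """The clipped value: -M if t < -M, t if t in [-M, M], M if t > M."""
    return np.clip(t, -M, M)

def empirical_risk(predictions, y):
    """Empirical least squares risk (1/n) sum_i (y_i - f(x_i))^2."""
    return float(np.mean((y - predictions) ** 2))

# Clipping never increases the pointwise distance to a label in [-M, M],
# so the risk of the clipped function is at most that of the original one.
assert empirical_risk(clip(f_vals, M), y) <= empirical_risk(f_vals, y)
```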
In this paper, we use SVMs to find a solution of the non-parametric least squares problem by solving the regularized problem
$$f_{D,\lambda} = \arg\min_{f \in H} \lambda\|f\|^2_H + \mathcal{R}_{L,D}(f). \qquad (2)$$
Here, $\lambda > 0$ is a fixed real number, $H$ is a reproducing kernel Hilbert space (RKHS) over $X$, and $\mathcal{R}_{L,D}(f)$ is the empirical risk of $f$, that is,
$$\mathcal{R}_{L,D}(f) = \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)).$$
In this work we restrict our considerations to Gaussian RBF kernels $k_\gamma$ on $X$, which are defined by
$$k_\gamma(x, x') = \exp\bigl(-\|x - x'\|_2^2 / \gamma^2\bigr), \qquad x, x' \in X,$$
for some width $\gamma \in (0, 1]$. Our goal is to deduce asymptotically optimal learning rates for the SVMs (2) using the RKHS $H_\gamma$ of $k_\gamma$. To this end, we first establish a general oracle inequality. Based on this oracle inequality, we then derive learning rates for the case that the regression function is contained in some Besov space, and it will turn out that these learning rates are asymptotically optimal. Finally, we show that these rates can be achieved by a simple data-dependent parameter selection method based on a hold-out set.

The rest of this paper is organized as follows: The next section presents the main theorems and, as a consequence, some corollaries yielding asymptotically optimal learning rates for regression functions contained in Sobolev or Besov spaces. Section 3 states some lemmata needed for the proof of the main statement, a version of [2, Theorem 7.23] applied to our special case, and the proof of the main theorem. Some further proofs and additional technical results can be found in the appendix.

2 Results

In this section we present our main results, including the optimal rates for LS-SVMs using Gaussian kernels. To this end, we first need to introduce some function spaces, which are later assumed to contain the regression function.
Let us begin by recalling from, e.g., [3, p. 44], [4, p. 398], and [5, p.
360], the modulus of smooth-\nness:\nDe\ufb01nition 1. Let \u2326 \u21e2 Rd with non-empty interior, \u232b be an arbitrary measure on \u2326, and f :\u2326 ! Rd\nbe a function with f 2 Lp (\u232b) for some p 2 (0,1). For r 2 N, the r-th modulus of smoothness of\nf is de\ufb01ned by\n\n!r,Lp(\u232b) (f, t) = sup\n\nh (f, \u00b7 )kLp(\u232b) ,\nwhere k\u00b7k 2 denotes the Euclidean norm and the r-th difference 4r\nj (1)rj f (x + jh)\n\nkhk2\uf8fftk4r\nj=0r\n\nh (f, x) =(Pr\n\n4r\n\n0\n\nt 0 ,\n\nh (f,\u00b7) is de\ufb01ned by\nif x 2 \u2326r,h\nif x /2 \u2326r,h\n\nfor h = (h1, . . . , hd) 2 Rd with hi 0 and \u2326r,h := {x 2 \u2326: x + sh 2 \u2326 8 s 2 [0, r]}.\nIt is well-known that the modulus of smoothness with respect to Lp (\u232b) is a nondecreasing function\nof t and for the Lebesgue measure on \u2326 it satis\ufb01es\n\n!r,Lp(\u2326) (f, s) ,\n\n(3)\n\n!r,Lp(\u2326) (f, t) \uf8ff\u27131 + t\ns\u25c6r\n\n2\n\n\ffor all f 2 Lp (\u2326) and all s > 0, see e.g. [6, (2.1)]. Moreover, the modulus of smoothness can be\nused to de\ufb01ne the scale of Besov spaces. Namely, for 1 \uf8ff p, q \uf8ff 1, \u21b5> 0, r := b\u21b5c + 1, and an\narbitrary measure \u232b, the Besov space B\u21b5\n\np,q (\u232b) is\n\nB\u21b5\n\np,q(\u232b) is de\ufb01ned by\n\n|f|B\u21b5\nand, for q = 1, it is de\ufb01ned by\n\np,q (\u232b) :=nf 2 Lp (\u232b) : |f|B\u21b5\nwhere, for 1 \uf8ff q < 1, the seminorm |\u00b7|B\u21b5\np,q(\u232b) :=\u2713Z 1\np,1(\u232b) := sup\n\np,q(\u232b) < 1o ,\nt \u25c6 1\nt\u21b5!r,Lp(\u232b) (f, t)q dt\nt>0t\u21b5!r,Lp(\u232b) (f, t) .\np,q(\u232b), see\nIn both cases the norm of B\u21b5\n(\u232b) = Lip\u21e4 (\u21b5, Lp (\u232b))\ne.g. [3, pp. 54/55] and [4, p. 398]. Finally, for q = 1, we often write B\u21b5\np,1\nand call Lip\u21e4 (\u21b5, Lp (\u232b)) the generalized Lipschitz space of order \u21b5. In addition, it is well-known,\np (Rd) fall into the scale of Besov spaces,\nsee e.g. [7, p. 25 and p. 
44], that the Sobolev spaces W \u21b5\nnamely\n\n|f|B\u21b5\np,q (\u232b) can be de\ufb01ned by kfkB\u21b5\n\np,q(\u232b) := kfkLp(\u232b) + |f|B\u21b5\n\n0\n\n,\n\nq\n\np (Rd) \u21e2 B\u21b5\nW \u21b5\n\np,q(Rd)\n\n(4)\n\nfor \u21b5 2 N, p 2 (1,1), and max{p, 2}\uf8ff q \uf8ff 1 and especially W \u21b5\nFor our results we need to extend functions f :\u2326 ! R to functions \u02c6f : Rd ! R such that the\nsmoothness properties of f described by some Sobolev or Besov space are preserved by \u02c6f. Recall\nthat Stein\u2019s Extension Theorem guarantees the existence of such an extension, whenever \u2326 is a\nbounded Lipschitz domain. To be more precise, in this case there exists a linear operator E mapping\nfunctions f :\u2326 ! R to functions Ef : Rd ! R with the properties:\n\n2 (Rd) = B\u21b5\n\n2,2(Rd).\n\n(a) E (f)\n(b) E continuously maps W m\n\n|\u2326 = f, that is, E is an extension operator.\n\np (\u2326) into W m\n\nThat is, there exist constants am,p 0, such that, for every f 2 W m\n\np (\u2326), we have\n\np Rd for all p 2 [1,1] and all integer m 0.\np,qRd for all p 2 (1,1), q 2 (0,1] and all \u21b5> 0.\n\np,q (\u2326), we have\n\np (Rd) \uf8ff am,p kfkW m\n\np (\u2326) .\n\n(5)\n\nkEfkW m\np,q (\u2326) into B\u21b5\n\n(c) E continuously maps B\u21b5\n\nThat is, there exist constants a\u21b5,p,q 0, such that, for every f 2 B\u21b5\n\nkEfkB\u21b5\n\np,q(Rd) \uf8ff a\u21b5,p,q kfkB\u21b5\n\np,q(\u2326) .\n\np\n\np\n\nand W m1\n\nFor detailed conditions on \u2326 ensuring the existence of E, we refer to [8, p. 181] and [9, p. 83].\nProperty (c) follows by some interpolation argument since B\u21b5\np,q can be interpreted as interpolation\nfor q 2 [1,1], p 2 (1,1), \u2713 2 (0, 1) and m0, m1 2 N0\nspace of the Sobolev spaces W m0\nwith m0 6= m1 and \u21b5 = m0(1 \u2713) + m1\u2713, see [10, pp. 65/66] for more details. In the following,\nwe always assume that we do have such an extension operator E. 
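The $r$-th difference $\triangle^r_h$ underlying Definition 1 is straightforward to evaluate. The following sketch is ours and simplifies the definition in three ways: it works in one dimension, uses the sup-norm instead of an $L_p$-norm, approximates the supremum over $\|h\|_2 \le t$ by sampling, and ignores the domain restriction $\Omega_{r,h}$. It illustrates why a bound like $\omega_r(f, t) \le c\,t^\alpha$ captures smoothness: third differences annihilate the smooth function $x \mapsto x^2$, while for $x \mapsto |x|$ the modulus decays only linearly in $t$.

```python
import numpy as np
from math import comb

def r_th_difference(f, x, h, r):
    """Forward difference Delta_h^r(f, x) = sum_{j=0}^r C(r,j) (-1)^(r-j) f(x + j*h)."""
    return sum(comb(r, j) * (-1) ** (r - j) * f(x + j * h) for j in range(r + 1))

def modulus_sup(f, xs, ts, r, n_h=50):
    """Crude sup-norm estimate of the r-th modulus of smoothness:
    for each t, take the largest |Delta_h^r(f, x)| over grid points xs
    and sampled step sizes 0 < h <= t."""
    out = []
    for t in ts:
        hs = np.linspace(t / n_h, t, n_h)
        out.append(max(np.max(np.abs(r_th_difference(f, xs, h, r))) for h in hs))
    return np.array(out)

# f(x) = x^2 (a degree-2 polynomial) is annihilated by third differences,
# while f(x) = |x| has a kink at 0 and its modulus behaves like 2t.
xs = np.linspace(-1.0, 1.0, 201)
ts = np.array([0.05, 0.1, 0.2])
w_smooth = modulus_sup(np.square, xs, ts, r=3)  # ~ 0 up to rounding error
w_kink = modulus_sup(np.abs, xs, ts, r=3)       # strictly increasing in t
```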
Moreover, if $\mu$ is the Lebesgue measure on $\Omega$ such that $\partial\Omega$ has Lebesgue measure $0$, the canonical extension of $\mu$ to $\mathbb{R}^d$ is given by $\tilde\mu(A) := \mu(A \cap \Omega)$ for all measurable $A \subset \mathbb{R}^d$. However, in a slight abuse of notation, we often write $\mu$ instead of $\tilde\mu$, since this simplifies the presentation. We proceed analogously for the uniform distribution on $\Omega$ and its canonical extension to $\mathbb{R}^d$, and the same convention will be applied to measures $P_X$ on $\Omega$ that are absolutely continuous w.r.t. the Lebesgue measure.

Finally, in order to state our main results, we denote the closed unit ball of the $d$-dimensional Euclidean space by $B_{\ell_2^d}$.

Theorem 1. Let $X \subset B_{\ell_2^d}$ be a domain such that we have an extension operator $E$ in the above sense. Furthermore, let $M > 0$, $Y := [-M, M]$, and $P$ be a distribution on $X \times Y$ such that $P_X$ is the uniform distribution on $X$. Assume that we have fixed a version $f^*_{L,P}$ of the regression function such that $f^*_{L,P}(x) = \mathbb{E}_P(Y|x) \in [-M, M]$ for all $x \in X$. Assume that, for $\alpha \ge 1$ and $r := \lfloor\alpha\rfloor + 1$, there exists a constant $c > 0$ such that, for all $t \in (0, 1]$, we have
$$\omega_{r,L_2(\mathbb{R}^d)}\bigl(Ef^*_{L,P}, t\bigr) \le c\, t^\alpha. \qquad (6)$$
Then, for all $\varepsilon > 0$ and $p \in (0, 1)$ there exists a constant $K > 0$ such that, for all $n \ge 1$, $\tau \ge 1$, $\lambda > 0$, and $\gamma \in (0, 1]$, the SVM using the RKHS $H_\gamma$ satisfies
$$\lambda\|f_{D,\lambda}\|^2_{H_\gamma} + \mathcal{R}_{L,P}(\hat f_{D,\lambda}) - \mathcal{R}^*_{L,P} \le K\lambda\gamma^{-d} + K c^2 \gamma^{2\alpha} + K\,\frac{\gamma^{-(1-p)(1+\varepsilon)d}}{\lambda^p n} + K\,\frac{\tau}{n}$$
with probability $P^n$ not less than $1 - e^{-\tau}$.

With this oracle inequality we can derive learning rates for the learning method (2).

Corollary 1. Under the assumptions of Theorem 1 and for $\varepsilon > 0$, $p \in (0, 1)$, and $\tau \ge 1$ fixed, we have, for all $n \ge 1$,
$$\lambda_n\|f_{D,\lambda_n}\|^2_{H_{\gamma_n}} + \mathcal{R}_{L,P}(\hat f_{D,\lambda_n}) - \mathcal{R}^*_{L,P} \le C n^{-\frac{2\alpha}{2\alpha + 2\alpha p + dp + (1-p)(1+\varepsilon)d}}$$
with probability $P^n$ not less than $1 - e^{-\tau}$ and with
$$\lambda_n = c_1 n^{-\frac{2\alpha + d}{2\alpha + 2\alpha p + dp + (1-p)(1+\varepsilon)d}}, \qquad \gamma_n = c_2 n^{-\frac{1}{2\alpha + 2\alpha p + dp + (1-p)(1+\varepsilon)d}}.$$
Here, $c_1 > 0$ and $c_2 > 0$ are user-specified constants and $C > 0$ is a constant independent of $n$.

Note that for every $\rho > 0$ we can find $\varepsilon, p \in (0, 1)$ sufficiently close to $0$ such that the learning rate in Corollary 1 is at least as fast as
$$n^{-\frac{2\alpha}{2\alpha + d} + \rho}.$$
To achieve these rates, however, we need to set $\lambda_n$ and $\gamma_n$ as in Corollary 1, which in turn requires us to know $\alpha$. Since in practice we usually do not know this value, we now show that a standard training/validation approach, see e.g. [2, Chapters 6.5, 7.4, 8.2], achieves the same rates adaptively, i.e. without knowing $\alpha$. To this end, let $\Lambda := (\Lambda_n)$ and $\Gamma := (\Gamma_n)$ be sequences of finite subsets $\Lambda_n, \Gamma_n \subset (0, 1]$. For a data set $D := ((x_1, y_1), \ldots, (x_n, y_n))$, we define
$$D_1 := ((x_1, y_1), \ldots, (x_m, y_m)),$$
$$D_2 := ((x_{m+1}, y_{m+1}), \ldots, (x_n, y_n)),$$
where $m := \lfloor n/2 \rfloor + 1$ and $n \ge 4$. We will use $D_1$ as a training set by computing the SVM decision functions
$$f_{D_1,\lambda,\gamma} := \arg\min_{f \in H_\gamma} \lambda\|f\|^2_{H_\gamma} + \mathcal{R}_{L,D_1}(f), \qquad (\lambda, \gamma) \in \Lambda_n \times \Gamma_n,$$
and use $D_2$ to determine $(\lambda, \gamma)$ by choosing a $(\lambda_{D_2}, \gamma_{D_2}) \in \Lambda_n \times \Gamma_n$ such that
$$\mathcal{R}_{L,D_2}\bigl(f_{D_1,\lambda_{D_2},\gamma_{D_2}}\bigr) = \min_{(\lambda,\gamma) \in \Lambda_n \times \Gamma_n} \mathcal{R}_{L,D_2}\bigl(f_{D_1,\lambda,\gamma}\bigr).$$

Theorem 2. Under the assumptions of Theorem 1 we fix sequences $\Lambda := (\Lambda_n)$ and $\Gamma := (\Gamma_n)$ of finite subsets $\Lambda_n, \Gamma_n \subset (0, 1]$ such that $\Lambda_n$ is an $\varepsilon_n$-net of $(0, 1]$ and $\Gamma_n$ is a $\delta_n$-net of $(0, 1]$ with $\varepsilon_n \le n^{-1}$ and $\delta_n \le n^{-\frac{1}{2+d}}$.
Furthermore, assume that the cardinalities $|\Lambda_n|$ and $|\Gamma_n|$ grow polynomially in $n$. Then, for all $\rho > 0$, the TV-SVM producing the decision functions $f_{D_1,\lambda_{D_2},\gamma_{D_2}}$ learns with the rate
$$n^{-\frac{2\alpha}{2\alpha + d} + \rho} \qquad (7)$$
with probability $P^n$ not less than $1 - e^{-\tau}$.

What is left to do is to relate Assumption (6) to the function spaces introduced earlier, so that we can show that the learning rates deduced above are asymptotically optimal under certain circumstances.

Corollary 2. Let $X \subset B_{\ell_2^d}$ be a domain such that we have an extension operator $E$ of the form described in front of Theorem 1. Furthermore, let $M > 0$, $Y := [-M, M]$, and $P$ be a distribution on $X \times Y$ such that $P_X$ is the uniform distribution on $X$. If, for some $\alpha \in \mathbb{N}$, we have $f^*_{L,P} \in W^\alpha_2(P_X)$, then, for all $\rho > 0$, both the SVM considered in Corollary 1 and the TV-SVM considered in Theorem 2 learn with the rate
$$n^{-\frac{2\alpha}{2\alpha + d} + \rho}$$
with probability $P^n$ not less than $1 - e^{-\tau}$. Moreover, if $\alpha > d/2$, then this rate is asymptotically optimal in a minimax sense.

Similarly to Corollary 2, we can verify assumption (6), and hence obtain asymptotically optimal learning rates, if the regression function is contained in a Besov space.

Corollary 3. Let $X \subset B_{\ell_2^d}$ be a domain such that we have an extension operator $E$ of the form described in front of Theorem 1. Furthermore, let $M > 0$, $Y := [-M, M]$, and $P$ be a distribution on $X \times Y$ such that $P_X$ is the uniform distribution on $X$. If, for some $\alpha \ge 1$, we have $f^*_{L,P} \in B^\alpha_{2,\infty}(P_X)$, then, for all $\rho > 0$, both the SVM considered in Corollary 1 and the TV-SVM considered in Theorem 2 learn with the rate
$$n^{-\frac{2\alpha}{2\alpha + d} + \rho}$$
with probability $P^n$ not less than $1 - e^{-\tau}$.

Since for the entropy numbers $e_i(\mathrm{id} : B^\alpha_{2,\infty}(P_X) \to L_2(P_X)) \asymp i^{-\alpha/d}$ holds (cf. [7, p. 151]) and since $B^\alpha_{2,\infty}(X)$ is continuously embedded into the space $\ell_\infty(X)$ of all bounded functions on $X$, we obtain by [11, Theorem 2.2] that $n^{-\frac{2\alpha}{2\alpha + d}}$ is the optimal learning rate in a minimax sense for $\alpha > d$ (cf. [12, Theorem 13]). Therefore, for $\alpha > d$, the learning rates obtained in Corollary 3 are asymptotically optimal.

So far, we always assumed that $P_X$ is the uniform distribution on $X$. This can be generalized by assuming that $P_X$ is absolutely continuous w.r.t. the Lebesgue measure $\mu$ such that the corresponding density is bounded away from zero and from infinity. Then we have $L_2(P_X) = L_2(\mu)$ with equivalent norms, and the results for $\mu$ hold for $P_X$ as well. Moreover, to derive learning rates, we actually only need that the Lebesgue density of $P_X$ is upper bounded; the assumption that the density is bounded away from zero is only needed to derive the lower bounds in Corollaries 2 and 3.

Furthermore, we assumed $\gamma \in (0, 1]$ in Theorem 1, and hence in Corollary 1 and Theorem 2 as well. Note that $\gamma$ does not need to be restricted by one. Instead, $\gamma$ only needs to be bounded from above by some constant such that the estimates on the entropy numbers for Gaussian kernels used in the proofs can be applied. For the sake of simplicity we have chosen one as upper bound; another upper bound would only influence the constants.

Several investigations of learning rates for SVMs using the least squares loss have already been made, see e.g. [13, 14, 15, 16, 17] and the references therein. In particular, optimal rates have been established in [16] if $f^*_P \in H$ and the eigenvalue behavior of the integral operator associated to $H$ is known. Moreover, if $f^*_P \notin H$, then [17] and [12] both establish learning rates of the form $n^{-\beta/(\beta + p)}$, where $\beta$ is a parameter describing the approximation properties of $H$ and $p$ is a parameter describing the eigenvalue decay.
Furthermore, in the introduction of [17] it is mentioned that the assumptions on the eigenvalues and eigenfunctions also hold for Gaussian kernels with fixed width, but this case, as well as the more interesting case of Gaussian kernels with variable widths, is not further investigated. In the first case, where Gaussian kernels with fixed width are considered, the approximation error behaves very badly, as shown in [18], and fast rates cannot be expected, as we discuss below. In the second case, where variable widths are considered as in our paper, it is crucial to carefully control the influence of $\gamma$ on all arising constants, which unfortunately has not been worked out in [17], either. In [17] and [12], however, additional assumptions on the interplay between $H$ and $L_2(P_X)$ are required, and [17] actually considers a different exponent in the regularization term of (2). On the other hand, [12] shows that the rate $n^{-\beta/(\beta + p)}$ is often asymptotically optimal in a minimax sense. In particular, the latter is the case for $H = W^m_2(X)$, $f \in W^s_2(X)$, and $s \in (d/2, m]$; that is, when using a Sobolev space as the underlying RKHS $H$, all target functions contained in a Sobolev space of lower smoothness $s > d/2$ can be learned with the asymptotically optimal rate $n^{-\frac{2s}{2s+d}}$. Here we note that the condition $s > d/2$ ensures, by Sobolev's embedding theorem, that $W^s_2(X)$ consists of bounded functions, and hence $Y = [-M, M]$ does not impose an additional assumption on $f^*_{L,P}$. If $s \in (0, d/2]$, then the results of [12] still yield the above mentioned rates, but we no longer know whether they are optimal in a minimax sense, since $Y = [-M, M]$ does impose an additional assumption. In addition, note that for Sobolev spaces this result, modulo an extra log factor, has already been proved by [1].
This result suggests that by using a $C^\infty$-kernel such as the Gaussian RBF kernel, one could actually learn the entire scale of Sobolev spaces with the above mentioned rates. Unfortunately, however, there are good reasons to believe that this is not the case. Indeed, [18] shows that for many analytic kernels the approximation error can only have polynomial decay if $f^*_{L,P}$ is analytic, too. In particular, for Gaussian kernels with fixed width and $f^*_{L,P} \notin C^\infty$, the approximation error does not decay polynomially fast, see [18, Proposition 1.1], and if $f^*_{L,P} \in W^m_2(X)$, then, in general, the approximation error function only has a logarithmic decay. Since it seems rather unlikely that these poor approximation properties can be balanced by superior bounds on the estimation error, the above-mentioned results indicate that Gaussian kernels with fixed width may have a poor performance. This conjecture is backed up by ample empirical experience gained throughout the last decade. Beginning with [19], research has thus focused on the learning performance of SVMs with varying widths. The result that is probably the closest to ours is [20]. Although these authors actually consider binary classification using convex loss functions, including the least squares loss, it is relatively straightforward to translate their findings to our least squares regression scenario. The result is the learning rate $n^{-\frac{m}{m+2d+2}}$, again under the assumption $f^*_{L,P} \in W^m_2(X)$ for some $m > 0$. Furthermore, [21] treats the case where $X$ is isometrically embedded into a $t$-dimensional, connected and compact $C^\infty$-submanifold of $\mathbb{R}^d$. In this case, it turns out that the resulting learning rate does not depend on the dimension $d$, but on the intrinsic dimension $t$ of the data. Namely, the authors show the rate $n^{-\frac{s}{8s+4t}}$, modulo a logarithmic factor, where $s \in (0, 1]$ and $f^*_{L,P} \in \mathrm{Lip}(s)$.
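For concreteness: the computational side of the method (2) being compared here is standard. For the least squares loss, the minimizer has the kernel ridge regression form $f_{D,\lambda} = \sum_j \alpha_j k_\gamma(\cdot, x_j)$ with $(K + n\lambda I)\alpha = y$, where $K$ is the kernel matrix. The following sketch (ours; the grids, sample size, and target function are illustrative stand-ins, not the $\varepsilon_n$-/$\delta_n$-nets of Theorem 2) combines this solver with the hold-out selection of $(\lambda, \gamma)$ described before Theorem 2:

```python
import numpy as np

def gauss_kernel(X1, X2, gamma):
    """Gaussian RBF kernel k_gamma(x, x') = exp(-||x - x'||_2^2 / gamma^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / gamma ** 2)

def lssvm_fit(X, y, lam, gamma):
    """Minimize lam*||f||_H^2 + (1/n) sum (y_i - f(x_i))^2 over H_gamma:
    the expansion coefficients solve (K + n*lam*I) a = y."""
    n = len(y)
    K = gauss_kernel(X, X, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def lssvm_predict(a, X_train, X_new, gamma, M):
    """Evaluate f = sum_j a_j k_gamma(., x_j) and clip to [-M, M]."""
    return np.clip(gauss_kernel(X_new, X_train, gamma) @ a, -M, M)

def tv_svm(X, y, lams, gammas, M):
    """Train on the first half D1; pick (lam, gamma) minimizing the
    empirical least squares risk on the second half D2."""
    m = len(y) // 2 + 1
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]
    best = None
    for lam in lams:
        for gamma in gammas:
            a = lssvm_fit(X1, y1, lam, gamma)
            risk = np.mean((y2 - lssvm_predict(a, X1, X2, gamma, M)) ** 2)
            if best is None or risk < best[0]:
                best = (risk, lam, gamma)
    return best

# Toy data: a bounded 1-d regression function plus noise, clipped to [-M, M].
rng = np.random.default_rng(1)
M = 1.0
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200), -M, M)
risk, lam, gamma = tv_svm(X, y, [1e-1, 1e-3, 1e-5], [0.1, 0.3, 1.0], M)
```

The selected pair achieves a small hold-out risk on this smooth target, which is exactly the adaptivity that Theorem 2 formalizes.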
Another direction of research that can be applied to\nGaussian kernels with varying widths are multi-kernel regularization schemes, see [22, 23, 24] for\nsome results in this direction. For example, [22] establishes learning rates of the form n 2md\n4(4md) +\u21e2\n2 (X) for some m 2 (d/2, d/2 + 2), where again \u21e2> 0 can be chosen to be\nwhenever f\u21e4L,P 2 W m\narbitrarily close to 0. Clearly, all these results provide rates that are far from being optimal, so that\nit seems fair to say that our results represent a signi\ufb01cant advance. Furthermore, we can conclude\nthat, in terms of asymptotical minmax rates, multi-kernel approaches applied Gaussian RBFs cannot\nprovide any signi\ufb01cant improvement over a simple training/validation approach for determining the\nkernel width and the regularization parameter, since the latter already leads to rates that are optimal\nmodulo an arbitrarily small \u21e2 in the exponent.\n\n3 Proof of the main result\n\nTo prove Theorem 1 we deduce an oracle inequality for the least squares loss by specializing [2,\nTheorem 7.23] (cf. Theorem 3). To be \ufb01nally able to show Theorem 1 originating from Theorem 3,\nwe have to estimate the approximation error.\nLemma 1. Let X \u21e2 Rd be a domain such that we have an extension operator E of the form de-\nscribed in front of Theorem 1, PX be the uniform distribution on X and f 2 L1 (X). Furthermore,\nlet \u02dcf be de\ufb01ned by\n\n2 Ef (x)\n\n\u02dcf (x) :=p\u21e1 d\nfor all x 2 Rd and, for r 2 N and > 0, K : Rd ! R be de\ufb01ned by\nrXj=1\u2713r\nj\u25c6 (1)1j 1\njd\u2713 2\np\u21e1\u25c6 d\nK (\u00b7) := exp k\u00b7k2\n2 ! 
.\n\nK (\u00b7) :=\n\nwith\n\n(8)\n\n(9)\n\n(\u00b7)\n\n2\n\nK jp2\n\n2\n\n6\n\n\fThen, for r 2 N, > 0, and q 2 [1,1), we have Ef 2 Lq(ePX) and\n\nq\n\nLq(PX ) \uf8ff Cr,q !q\n\nr,Lq(Rd) (Ef, /2) ,\n\nwhere Cr,q is a constant only depending on r, q and \u00b5(X).\n\nK \u21e4 \u02dcf f\n\nIn order to use the conclusion of Lemma 1 in the proof of Theorem 1 it is necessary to know some\nproperties of K \u21e4 \u02dcf. Therefore, we need the next two lemmata.\nLemma 2. Let g 2 L2Rd, H be the RKHS of the Gaussian RBF kernel k over X \u21e2 Rd and\np\u21e1\u25c6 d\n\nj\u25c6 (1)1j 1\n\nj22 !\n2kxk2\n\njd\u2713 2\n\nexp \n\nK (x) :=\n\n2\n\n2\n\nrXj=1\u2713r\n\nfor x 2 Rd and a \ufb01xed r 2 N. Then we have\nK \u21e4 g 2 H ,\nkK \u21e4 gkH \uf8ff (2r 1)kgkL2(Rd) .\n\nLemma 3. Let g 2 L1Rd, H be the RKHS of the Gaussian RBF kernel k over X \u21e2 Rd and\n\nK be as in Lemma 2. Then\n\n|K \u21e4 g (x)|\uf8ff p\u21e1 d\n\n2 (2r 1)kgkL1(Rd)\n\nholds for all x 2 X. Additionally, we assume that X is a domain in Rd such that we have an\nextension operator E of the form described in front of Theorem 1, Y := [M, M] and, for all x 2\n2 Ef\u21e4L,P (x), where f\u21e4L,P denotes a version of the conditional expectation\nRd, \u02dcf (x) := (p\u21e1) d\nsuch that f\u21e4L,P (x) = EP (Y |x) 2 [M, M] for all x 2 X. Then we have \u02dcf 2 L1Rd and\nfor all x 2 X, which implies\n\n|K \u21e4 \u02dcf (x)|\uf8ff a0,1 (2r 1) M\n\nL(y, K \u21e4 \u02dcf (x)) \uf8ff 4ra2M 2\n\nH\n\nfor the least squares loss L and all (x, y) 2 X \u21e5 Y .\nNext, we modify [2, Theorem 7.23], so that the proof of Theorem 1 can be build upon it.\n2 , Y := [M, M] \u21e2 R be a closed subset with M > 0 and P be a\nTheorem 3. Let X \u21e2 B`d\ndistribution on X \u21e5 Y . Furthermore, let L : Y \u21e5 R ! [0,1) be the least squares loss, k be\nthe Gaussian RBF kernel over X with width 2 (0, 1] and H be the associated RKHS. Fix an\nf0 2 H and a constant B0 4M 2 such that kL f0k1 \uf8ff B0. 
Then, for all \ufb01xed \u2327 1, > 0,\n\"> 0 and p 2 (0, 1), the SVM using H and L satis\ufb01es\nkfD,k2\n+3456M 2 + 15B0 (ln(3) + 1)\u2327\n\uf8ff 9\u21e3kf0k2\nwith probability Pn not less than 1 e\u2327 , where C\",p is a constant only depending on \", p and M.\nWith the previous results we are \ufb01nally able to prove the oracle inequality declared by Theorem 1.\nProof of Theorem 1. First of all, we want to apply Theorem 3 for f0 := K \u21e4 \u02dcf with\nj22 !\n2kxk2\n\n+ RL,P\u21e3\u00dbfD,\u2318 R \u21e4L,P\n\n+ RL,P (f0) R \u21e4L,P\u2318 + C\",p\n\nexp \n\nj\u25c6 (1)1j 1\n\np\u21e1\u25c6 d\n\njd\u2713 2\n\n(1p)(1+\")d\n\nK (x) :=\n\npn\n\nH\n\nn\n\n2\n\n2\n\nrXj=1\u2713r\n\n7\n\n\f(10)\n\nk \u02dcfkL2(Rd) =p\u21e1 d\n\uf8ffp\u21e1 d\np\u21e1\u25c6 d\n\uf8ff\u2713 2\ni.e. \u02dcf 2 L2Rd. Because of this and Lemma 2\n\n2\n\nf0 = K \u21e4 \u02dcf 2 H\n\n2 kEf\u21e4L,PkL2(Rd)\n2 a0,2kf\u21e4L,PkL2(X)\na0,2M ,\n\nis satis\ufb01ed and with Lemma 3 we have\n\nkL f0k1\n\n=\n\nsup\n\n(x,y)2X\u21e5Y |L (y, f0 (x))| =\n\nFurthermore, (1) and Lemma 1 yield\n\nsup\n\n(x,y)2X\u21e5YL\u21e3y, K \u21e4 \u02dcf (x)\u2318 \uf8ff 4ra2M 2 =: B0 .\n=K \u21e4 \u02dcf f\u21e4L,P\n\nRL,P (f0) R \u21e4L,P = RL,P\u21e3K \u21e4 \u02dcf\u2318 R \u21e4L,P\nr,L2(Rd)\u21e3Ef\u21e4L,P,\n\n2\u2318\n\nL2(PX )\n\n\n\n2\n\n\uf8ff Cr,2 !2\n\uf8ff Cr,2 c22\u21b5 ,\n\nand\n\n\u02dcf (x) :=p\u21e1 d\n\n2 Ef\u21e4L,P (x)\n\nfor all x 2 Rd. The choice f\u21e4L,P (x) 2 [M, M] for all x 2 X implies f\u21e4L,P 2 L2 (X) and the latter\ntogether with X \u21e2 B`d\n\n2 and (5) yields\n\nwhere we used the assumption\n\n!r,L2(Rd)\u21e3Ef\u21e4L,P,\n\n\n\n2\u2318 \uf8ff c\u21b5\n\nfor 2 (0, 1], \u21b5 1, r = b\u21b5c + 1 and a constant c > 0 in the last step. 
By Lemma 2 and (10) we\nknow\n\nkf0kH\n\n= kK \u21e4 \u02dcfkH \uf8ff (2r 1)k \u02dcfkL2(Rd) \uf8ff (2r 1)\u2713 2\n\np\u21e1\u25c6 d\n\n2\n\na0,2M .\n\nTherefore, Theorem 3 and the above choice of f0 yield, for all \ufb01xed \u2327 1, > 0, \"> 0 and\np 2 (0, 1), that the SVM using H and L satis\ufb01es\n\nH\n\nkfD,k2\n\n+ RL,P\u21e3\u00dbfD,\u2318 R \u21e4L,P\n\n\uf8ff 9 (2r 1)2\u2713 2\np\u21e1\u25c6d\n+3456 + 15 \u00b7 4ra2 M 2(ln(3) + 1)\u2327\n\n0,2M 2 + Cr,2c22\u21b5!\n\n(1p)(1+\")d\n\n+ C\",p\n\npn\n\na2\n\nn\n(1p)(1+\")d\n\npn\n\n+ C2\u2327\nn\n\n\uf8ff C1d + 9 Crc22\u21b5 + C\",p\n\nwith probability Pn not less than 1 e\u2327 and with constants C1 := 9 (2r 1)2 2d\u21e1 d\nC2 := (ln(3) + 1)3456 + 15 \u00b7 4ra2 M 2, a := max{a0,1, 1}, Cr := Cr,2 only depending on r\nand \u00b5(X) and C\",p as in Theorem 3.\n\n0,2M 2,\n\n2 a2\n\n8\n\n\fReferences\n[1] L. Gy\u00a8or\ufb01, M. Kohler, A. Krzy\u02d9zak, and H. Walk. A Distribution-Free Theory of Nonparametric\n\nRegression. Springer-Verlag New York, 2002.\n\n[2] I. Steinwart and A. Christmann. Support Vector Machines. Springer-Verlag, New York, 2008.\n[3] R.A. DeVore and G.G. Lorentz. Constructive Approximation. Springer-Verlag Berlin Heidel-\n\nberg, 1993.\n\n[4] R.A. DeVore and V.A. Popov. Interpolation of Besov Spaces. AMS, Volume 305, 1988.\n[5] H. Berens and R.A. DeVore. Quantitative Korovin theorems for positive linear operators on\n\nLp-spaces. AMS, Volume 245, 1978.\n\n[6] H. Johnen and K. Scherer. On the equivalence of the K-functional and moduli of continuity and\nsome applications. In Lecture Notes in Math., volume 571, pages 119\u2013140. Springer-Verlag\nBerlin, 1976.\n\n[7] D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential 0perators.\n\nCambridge University Press, 1996.\n\n[8] E.M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton Univ.\n\nPress, 1970.\n\n[9] R.A. Adams and J.J.F. Fournier. Sobolev Spaces. 
Academic Press, 2nd edition, 2003.
[10] H. Triebel. Theory of Function Spaces III. Birkhäuser Verlag, 2006.
[11] V. Temlyakov. Optimal estimators in learning theory. Banach Center Publications, Inst. Math. Polish Academy of Sciences, 72:341-366, 2006.
[12] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[13] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39:1-49, 2002.
[14] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math., 5:59-85, 2005.
[15] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constr. Approx., 26:153-172, 2007.
[16] A. Caponnetto and E. De Vito. Optimal rates for regularized least squares algorithm. Found. Comput. Math., 7:331-368, 2007.
[17] S. Mendelson and J. Neeman. Regularization in kernel learning. Ann. Statist., 38:526-565, 2010.
[18] S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Anal. Appl., Volume 1, 2003.
[19] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Ann. Statist., 35:575-607, 2007.
[20] D.-H. Xiang and D.-X. Zhou. Classification with Gaussians and convex loss. J. Mach. Learn. Res., 10:1447-1468, 2009.
[21] G.-B. Ye and D.-X. Zhou. Learning and approximation by Gaussians on Riemannian manifolds. Adv. Comput. Math., Volume 29, 2008.
[22] Y. Ying and D.-X. Zhou. Learnability of Gaussians with flexible variances. J. Mach. Learn. Res., 8, 2007.
[23] C.A. Micchelli, M. Pontil, Q. Wu, and D.-X. Zhou. Error bounds for learning the kernel. 2005.
[24] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In S.
Dasgupta and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory, 2009.", "award": [], "sourceid": 874, "authors": [{"given_name": "Mona", "family_name": "Eberts", "institution": null}, {"given_name": "Ingo", "family_name": "Steinwart", "institution": null}]}