{"title": "An Oracle Inequality for Clipped Regularized Risk Minimizers", "book": "Advances in Neural Information Processing Systems", "page_first": 1321, "page_last": 1328, "abstract": null, "full_text": "An Oracle Inequality for Clipped Regularized Risk Minimizers\nIngo Steinwart, Don Hush, and Clint Scovel\nModelling, Algorithms and Informatics Group, CCS-3\nLos Alamos National Laboratory\nLos Alamos, NM 87545\n{ingo, dhush, jcs}@lanl.gov\n\nAbstract\nWe establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a priori in [9].\n\n1 Introduction\nThe theoretical understanding of support vector machines (SVMs) and related kernel-based methods has been substantially improved in recent years. For example, using Talagrand's concentration inequality and local Rademacher averages it has recently been shown that SVMs for classification can learn with rates up to $n^{-1}$ under somewhat realistic assumptions on the data-generating distribution (see [9, 11] and the related work [2]). However, the so-called \"shrinking technique\" of [9, 11] for establishing such rates requires the free parameters to be chosen a priori, and in addition, the optimal values of these parameters depend on features of the data-generating distribution which are typically unknown. Consequently, [9, 11] do not provide a practical method for learning with fast rates. 
On the other hand, the oracle inequality in [2] only holds for distributions having Tsybakov noise exponent $\\infty$, and hence it describes a situation which is rarely met in practice. The goal of this work is to overcome these shortcomings by establishing a general oracle inequality (see Theorem 3.1) for regularized empirical risk minimizers. The key ingredient of this oracle inequality is the observation that for most commonly used loss functions it is possible to \"clip\" the decision function of the algorithm before beginning with the theoretical analysis. In addition, a careful choice of the weighted empirical process to which Talagrand's inequality is applied makes the \"shrinking technique\" superfluous. Finally, by explicitly dealing with $\\epsilon$-approximate minimizers of the regularized risk, our results also apply to actual SVM algorithms. With the help of the general oracle inequality we then establish an oracle inequality for SVM type algorithms (see Theorem 2.1) as well as a simple oracle inequality for model selection (see Theorem 4.2). For the former, we show that it leads to improved rates for e.g. binary classification under the assumptions considered in [9] and a priori known noise exponents. Using the model selection theorem we then show how our new oracle inequality for SVMs can be used to analyze a simple parameter selection procedure based on a validation set that achieves the same learning rates without prior knowledge of the noise exponents. The rest of this work is organized as follows: In Section 2 we present our oracle inequality for SVM type algorithms. We then discuss its implications and analyze the simple parameter selection procedure when using Gaussian RBF kernels. In Section 3 we then present and prove the general oracle inequality. 
The proof of Theorem 2.1 as well as the oracle inequality for model selection can be found in Section 4.\n\n2 Main Results\nThroughout this work we assume that $X$ is a compact metric space, $Y \\subset [-1, 1]$ is compact, $P$ is a Borel probability measure on $X \\times Y$, and $F$ is a set of functions over $X$ such that $0 \\in F$. Often $F$ is a reproducing kernel Hilbert space (RKHS) $H$ of continuous functions over $X$ with closed unit ball $B_H$. It is well-known that $H$ can then be continuously embedded into the space of continuous functions $C(X)$ equipped with the usual maximum-norm $\\|\\cdot\\|_\\infty$. In order to avoid constants we always assume that this embedding has norm 1, i.e. $\\|\\cdot\\|_\\infty \\le \\|\\cdot\\|_H$. Furthermore, $L : Y \\times \\mathbb{R} \\to [0, \\infty)$ always denotes a continuous function which is convex in its second variable such that $L(y, 0) \\le 1$. The functions $L$ will serve as loss functions, and consequently let us recall that the associated $L$-risk of a measurable function $f : X \\to \\mathbb{R}$ is defined by\n\n$$R_{L,P}(f) = E_{(x,y) \\sim P}\\, L\\bigl(y, f(x)\\bigr).$$\n\nNote that the assumption $L(y, 0) \\le 1$ immediately gives $R_{L,P}(0) \\le 1$. Furthermore, the minimal $L$-risk is denoted by $R^*_{L,P}$, i.e. $R^*_{L,P} = \\inf\\{R_{L,P}(f) \\mid f : X \\to \\mathbb{R} \\text{ measurable}\\}$, and a function attaining this infimum is denoted by $f^*_{L,P}$. We always assume that such an $f^*_{L,P}$ exists. The learning schemes we are mainly interested in are based on an optimization problem of the form\n\n$$f_{P,\\lambda} := \\arg\\min_{f \\in H} \\lambda \\|f\\|_H^2 + R_{L,P}(f), \\qquad (1)$$\n\nwhere $\\lambda > 0$. Note that if we identify a training set $T = ((x_1, y_1), \\ldots, (x_n, y_n)) \\in (X \\times Y)^n$ with its empirical measure, then $f_{T,\\lambda}$ denotes the empirical estimator of the above learning scheme. Obviously, support vector machines (see e.g. [5]) and regularization networks (see e.g. [7]) are both learning algorithms which fall into the above category. One way to describe the approximation error of these learning schemes is the approximation error function\n\n$$a(\\lambda) := \\lambda \\|f_{P,\\lambda}\\|_H^2 + R_{L,P}(f_{P,\\lambda}) - R^*_{L,P}, \\qquad \\lambda > 0,$$\n\nwhich has been discussed in some detail in [10]. 
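As a rough illustration of the learning scheme (1) with the training set identified with its empirical measure, the regularized empirical risk can be evaluated for a function given as a finite kernel expansion. The sketch below is not taken from the paper: it assumes the hinge loss, one-dimensional inputs, a Gaussian kernel of the form exp(-sigma^2 (x - x')^2), and hypothetical helper names (gaussian_kernel, hinge_loss, regularized_empirical_risk).

```python
import math


def gaussian_kernel(x1, x2, sigma=1.0):
    # Gaussian RBF kernel on the real line (an assumption made for this sketch).
    return math.exp(-(sigma ** 2) * (x1 - x2) ** 2)


def hinge_loss(y, t):
    # Hinge loss L(y, t) = max(0, 1 - y * t); note that L(y, 0) = 1.
    return max(0.0, 1.0 - y * t)


def regularized_empirical_risk(alpha, xs, ys, lam, sigma=1.0):
    # f = sum_j alpha[j] * k(xs[j], .), hence ||f||_H^2 = alpha^T K alpha.
    n = len(xs)
    gram = [[gaussian_kernel(xs[i], xs[j], sigma) for j in range(n)] for i in range(n)]
    f_values = [sum(alpha[j] * gram[i][j] for j in range(n)) for i in range(n)]
    norm_sq = sum(alpha[i] * f_values[i] for i in range(n))
    risk = sum(hinge_loss(ys[i], f_values[i]) for i in range(n)) / n
    # Objective of (1) with P replaced by the empirical measure of the sample.
    return lam * norm_sq + risk
```

For the zero function (all coefficients equal to 0) the objective reduces to the empirical hinge risk, which equals 1, in line with the assumption that L(y, 0) is at most 1.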
Furthermore, in order to deal with the complexity of the used RKHSs let us recall that for a subset $A \\subset E$ of a Banach space $E$ the covering numbers are defined by\n\n$$N(A, \\varepsilon, E) := \\min\\Bigl\\{ n \\ge 1 : \\exists\\, x_1, \\ldots, x_n \\in E \\text{ with } A \\subset \\bigcup_{i=1}^n (x_i + \\varepsilon B_E) \\Bigr\\}, \\qquad \\varepsilon > 0,$$\n\nwhere $B_E$ denotes the closed unit ball of $E$. Given a finite sequence $T = ((x_1, y_1), \\ldots, (x_n, y_n)) \\in (X \\times Y)^n$ we write $T_X := (x_1, \\ldots, x_n)$. For our main results we are particularly interested in covering numbers in the Hilbert space $L_2(T_X)$ which consists of all equivalence classes of functions $f : X \\times Y \\to \\mathbb{R}$ and which is equipped with the norm\n\n$$\\|f\\|_{L_2(T_X)} := \\Bigl( \\frac{1}{n} \\sum_{i=1}^n |f(x_i)|^2 \\Bigr)^{1/2}. \\qquad (2)$$\n\nIn other words, $L_2(T_X)$ is an $L_2$-space with respect to the empirical measure of $(x_1, \\ldots, x_n)$. Learning schemes of the form (1) typically produce functions $f_{P,\\lambda}$ with $\\lim_{\\lambda \\to 0} \\|f_{P,\\lambda}\\|_\\infty = \\infty$ (see e.g. [10] for a precise statement). Unfortunately, this behaviour has a serious negative impact on the learning rates when directly employing standard tools such as Hoeffding's, Bernstein's or Talagrand's inequality. On the other hand, when dealing with e.g. the hinge loss it is obvious that clipping the function $f_{P,\\lambda}$ at $-1$ and $1$ does not worsen the corresponding risks. Following this simple observation we will consider loss functions $L$ that satisfy the clipping condition\n\n$$L(y, t) \\ge \\begin{cases} L(y, 1) & \\text{if } t \\ge 1 \\\\ L(y, -1) & \\text{if } t \\le -1 \\end{cases} \\qquad (3)$$\n\nfor all $y \\in Y$. Recall that this type of loss function was already considered in [4, 11], but the clipping idea actually goes back to [1]. Moreover, it is elementary to check that most commonly used loss functions including the hinge loss and the least squares loss satisfy (3). Given a function $f : X \\to \\mathbb{R}$ we now define its clipped version $\\hat f : X \\to [-1, 1]$ by\n\n$$\\hat f(x) := \\begin{cases} 1 & \\text{if } f(x) > 1 \\\\ f(x) & \\text{if } f(x) \\in [-1, 1] \\\\ -1 & \\text{if } f(x) < -1. \\end{cases}$$\n\nIt is clear from (3) that we always have $L(y, \\hat f(x)) \\le L(y, f(x))$ and consequently we obtain $R_{L,P}(\\hat f) \\le R_{L,P}(f)$ for all distributions $P$. Finally, we also need the following Lipschitz condition\n\n$$|L|_1 := \\sup_{y \\in Y,\\ -1 \\le t_1, t_2 \\le 1} \\frac{|L(y, t_1) - L(y, t_2)|}{|t_1 - t_2|} \\le 2. \\qquad (4)$$\n\nWith the help of these definitions we can now state our main result which establishes an oracle inequality for clipped versions of $f_{T,\\lambda}$:\n\nTheorem 2.1 Let $P$ be a distribution on $X \\times Y$ and let $L$ be a loss function which satisfies (3) and (4). Let $H$ be a RKHS of continuous functions on $X$. For a fixed element $f_0 \\in H$ we define\n\n$$a(f_0) := \\lambda \\|f_0\\|_H^2 + R_{L,P}(f_0) - R^*_{L,P}, \\qquad B(f_0) := \\sup_{x \\in X,\\, y \\in Y} L(y, f_0(x)). \\qquad (5)$$\n\nIn addition, we assume that we have a variance bound of the form\n\n$$E_P \\bigl( L \\circ \\hat f - L \\circ f^*_{L,P} \\bigr)^2 \\le v \\bigl( E_P ( L \\circ \\hat f - L \\circ f^*_{L,P} ) \\bigr)^{\\vartheta} \\qquad (6)$$\n\nfor constants $v \\ge 1$, $\\vartheta \\in [0, 1]$ and all measurable $f : X \\to \\mathbb{R}$. Moreover, suppose that $H$ satisfies\n\n$$\\sup_{T \\in (X \\times Y)^n} \\log N\\bigl(B_H, \\varepsilon, L_2(T_X)\\bigr) \\le a\\, \\varepsilon^{-2p}, \\qquad \\varepsilon > 0, \\qquad (7)$$\n\nfor some constants $p \\in (0, 1)$ and $a \\ge 1$. For fixed $\\lambda > 0$ let $f_{T,\\lambda} \\in H$ be a function that minimizes $f \\mapsto \\lambda \\|f\\|_H^2 + R_{L,T}(f)$ up to some $\\epsilon > 0$. Then there exists a constant $K_{p,v}$ depending only on $p$ and $v$ such that for all $\\tau \\ge 1$ we have with probability not less than $1 - 3e^{-\\tau}$ that\n\n$$R_{L,P}(\\hat f_{T,\\lambda}) - R^*_{L,P} \\le \\Bigl( \\frac{K_{p,v}\\, a}{\\lambda^p n} \\Bigr)^{\\frac{1}{2 - \\vartheta + p(\\vartheta - 1)}} + \\frac{K_{p,v}\\, a}{\\lambda^p n} + 5 \\Bigl( \\frac{32 v \\tau}{n} \\Bigr)^{\\frac{1}{2 - \\vartheta}} + \\frac{140 \\tau}{n} + \\frac{14 B(f_0) \\tau}{3n} + 8 a(f_0) + 4\\epsilon. \\qquad (8)$$\n\nThe above oracle inequality has some interesting consequences as the following examples illustrate. We begin with an example that deals with a fixed kernel:\n\nExample 2.2 (Learning rates for single kernel) Assume that in Theorem 2.1 we have a Lipschitz continuous loss function such as the hinge loss. In addition assume that the approximation error function satisfies $a(\\lambda) \\le c \\lambda^{\\beta}$, $\\lambda > 0$, for some constants $c > 0$ and $\\beta \\in (0, 1]$. Setting $f_0 := f_{P,\\lambda}$ and optimizing (8) with respect to $\\lambda$ then shows that the corresponding SVM learns with rate $n^{-\\gamma}$, where\n\n$$\\gamma := \\min\\Bigl\\{ \\frac{\\beta}{\\beta (2 - \\vartheta + p(\\vartheta - 1)) + p},\\ \\frac{2\\beta}{\\beta + 1} \\Bigr\\}.$$\n\nRecall that this learning rate has already been obtained in [11]. 
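The exponent in Example 2.2 can be evaluated numerically. The sketch below assumes that the rate exponent reads gamma = min{ beta / (beta (2 - theta + p (theta - 1)) + p), 2 beta / (beta + 1) }, where beta is the approximation exponent, theta the variance exponent from (6), and p the covering exponent from (7). The function name rate_exponent is hypothetical and not part of the paper.

```python
def rate_exponent(beta, theta, p):
    # Assumed form of the exponent gamma in the rate n^(-gamma) of Example 2.2.
    if not (0.0 < beta <= 1.0 and 0.0 <= theta <= 1.0 and 0.0 < p < 1.0):
        raise ValueError('expected beta in (0, 1], theta in [0, 1], p in (0, 1)')
    # Exponent obtained by balancing the approximation error against the first term of (8).
    balanced = beta / (beta * (2.0 - theta + p * (theta - 1.0)) + p)
    # Cap that comes from the B(f0) / n term when f0 is the regularized minimizer.
    capped = 2.0 * beta / (beta + 1.0)
    return min(balanced, capped)
```

With beta = 1 and p = 0.5, for instance, the exponent grows from 1/2 at theta = 0 to 2/3 at theta = 1, which reflects the gain from a stronger variance bound.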
The next example investigates SVMs that use a Gaussian RBF kernel whose width may vary with the sample size:\n\nExample 2.3 (Classification with several Gaussian kernels) Let $X$ be the unit ball in $\\mathbb{R}^d$ and $Y := \\{-1, 1\\}$. Furthermore assume that we are interested in binary classification using the hinge loss and the Gaussian RKHSs $H_\\sigma$ that belong to the RBF kernels $k_\\sigma(x_1, x_2) := e^{-\\sigma^2 \\|x_1 - x_2\\|_2^2}$ with width $\\sigma > 0$. If $P$ has geometric noise exponent $\\alpha \\in (0, \\infty)$ in the sense of [9] then it was shown in [9] that there exists a function $f_0 \\in H_\\sigma$ with $\\|f_0\\|_\\infty \\le 1$ and\n\n$$a_\\sigma(f_0) \\le c \\bigl( \\sigma^d \\lambda + \\sigma^{-\\alpha d} \\bigr), \\qquad \\sigma > 0,\\ \\lambda > 0,$$\n\nwhere $c > 0$ is a constant independent of $\\sigma$ and $\\lambda$. Moreover, [9, Thm. 2.1] shows that $H_\\sigma$ satisfies (7) for all $p \\in (0, 1)$ with $a := c_{p,d,\\delta}\\, \\sigma^{(1-p)(1+\\delta)d}$, where $\\delta > 0$ can be arbitrarily chosen and $c_{p,d,\\delta}$ is a suitable constant. Now assume that $P$ has Tsybakov noise exponent $q \\in [0, \\infty]$ in the sense of [9]. It was then shown in [9] that (6) is satisfied for $\\vartheta := \\frac{q}{q+1}$. Minimizing (8) with respect to $\\sigma$ and $\\lambda$ and choosing $p$ and $\\delta$ sufficiently small then yields that the corresponding SVM can learn with rate $n^{-\\gamma + \\varepsilon}$, where\n\n$$\\gamma := \\frac{\\alpha (q + 1)}{\\alpha (q + 2) + q + 1}$$\n\nand $\\varepsilon > 0$ can be chosen arbitrarily small. Note that these rates are superior to those obtained in [9, Theorem 2.8]. In the above examples the optimal parameters $\\lambda$ and $\\sigma$ depend on the sample size $n$ but not on the training samples $T$. However, these optimal parameters require us to know certain characteristics of the distribution such as the approximation exponent $\\beta$ or the noise exponents $\\alpha$ and $q$. The following example shows that the oracle inequality of Theorem 2.1 can be used to find these optimal parameters in a data-dependent fashion which does not require any a priori knowledge:\n\nExample 2.4 In this example we assume that our training set $T$ consists of $2n$ samples. We write $T_0$ for the first $n$ samples and $T_1$ for the last $n$ samples. Let $f_{T_0,\\sigma,\\lambda}$ be the SVM solution using a Gaussian kernel with width $\\sigma$. Moreover, let $\\Sigma \\subset [1, n^{1/d})$ and $\\Lambda \\subset (0, 1]$ be finite sets with cardinality $m_\\Sigma$ and $m_\\Lambda$, respectively. 
Under the assumptions of Example 2.3 the oracle inequality (8) then shows that with probability not less than 1 - 3m m e- we have s d q+1 q+1 q +2- q +2 ^ ,, ) - R Kd,q,, RL,P (fT0 + + d + -d L,P n n imultaneously for all and , where (0, 1] is arbitrarily but fixed and Kd,q,, is a suitable constant. Now using a simple model selection approach (see e.g. Theorem 4.2) for the second half T1 of our training set we find that with probability not less than 1 - e- we have q+1 + log(m m ) q+2 ^ , , ) - R RL,P (fT0 T T C L,P 1 1 n , d q+1 q +2- + + C min d + -d , n\n where C is a constant only depending on d, q , , and , and (T1 , 1 ) is a pair that T minimizes the empirical risk RL,T1 (.) over . Now assume that n and n are 1/n- and 1/n2 -nets of [1, n1/d ) and (0, 1], respectively. Obviously, we can choose n and n such that mn n2 and mn n2 , respectively. With such parameter sets it is then easy to check that we obtain exactly the rates we have found in Example 2.3, but without knowing the noise exponents and q a-priori.\n\n3 An oracle inequality for clipped penalized ERM\nTheorem 2.1 is a consequence of a far more general oracle inequality on clipped penalized empirical risk minimizers. Since this result is of its own interest we now present it together with its proof in detail. To is end recall that a subroot is a nondecreasing function : [0, ) [0, ) such th that (r)/ r is nonincreasing in r. Moreover, for a Rademacher sequence := (1 , . . . , n ) with respect to the measure and a function h : Z R we define R h : Z n R by R h := . n-1 1 h(z1 ) + + n h(zn ) Now the general oracle inequality is:\n\n\f\nTheorem 3.1 Let P = be a set of (hyper)-parameters, F be a set of measurable functions f : X R with 0 F , and : P F [0, ] be a function. Let P be a distribution on X Y and L be a loss function which satisfies (3) and (4). For a fixed pair (p0 , f0 ) P F we define a (p0 , f0 ) := (p0 , f0 ) + RL,P (f0 ) - R ,P . 
L Moreover, let us assume that the quantity B (f0 ) defined in (5) is finite. In addition, we assume that we have a variance bound of the form (6) for constants v 1, [0, 1] and all measurable f : X R. Furthermore, suppose that there exists a subroot n with R ^ ET P n E n (r) , r > 0. (9) sup (L f - L fL,P )\n(p,f )P F ^ (p,f )+EP (Lf -LfL,P )r\n\nFinally, let (pT , , fT , ) be an -approximate minimizer of (p, f ) (p, f ) + RL,T (f ). Then for all 1 and all r satisfying 1 32v 2- 28 1 ( r max 20n (r), , 10) n n we have with probability not less than 1 - 3e- that ^ (pT , , fT , ) + RL,P (fT , ) - R ,P 5r + L 14B (f0 ) + 8a (p0 , f0 ) + 4 . 3n\n\n^ Proof: We write B for B (f0 ). For T (X Y )n we now observe (pT , , fT , ) + RL,T (fT , ) - (p0 , f0 ) - RL,T (f0 ) by the definition of (pT , , fT , ), and hence we find ^ (pT , , fT ,) + RL,P (fT ,) - R ,P L ^ ^ = RL,P (fT , ) - RL,T (fT , ) + RL,T (f0 ) - RL,P (f0 ) + a (p0 , f0 ) + ^ ^ RL,P (fT ,) - RL,P (f ) - RL,T (fT ,) + RL,T (f )\nL,P L,P\n\n(11) (12) (13)\n\n^ ^ +RL,T (f0 ) - RL,T (f0 ) - RL,P (f0 ) + RL,P (f0 ) ^ ^ +RL,T (f0 ) - RL,T (fL,P ) - RL,P (f0 ) + RL,P (fL,P ) +a (p0 , f0 ) + .\n\n^ Let us first estimate the term in line (12). To this end we write h1 := L f0 - L f0 . Then our assumption on L guarantees h1 0, and since we also have h1 B , we find h1 - EP h1 B . In addition, we obviously have EP (h1 - EP h1 )2 EP h2 B EP h1 . Consequently, Bernstein's 1 inequality [6, Thm. 8.2] shows that with probability not less than 1 - e- we have 2 B EP h1 2B + . ET h1 - EP h1 < n 3n 1 b Now using ab a + 2 we find 2 B EP h1 n- 2 EP h1 + Bn , and consequently we have 2 2 T 7B ^ ^ 1 - e- . (14) Pn Z n : RL,T (f0 ) - RL,T (f0 ) - RL,P (f0 ) + RL,P (f0 ) < EP h1 + 6n ^ Let us now estimate the term in line (13). To this end we write h2 := L f0 - L f . Then we have\nL,P\n\nh2 3 and h2 - EP h2 6. 
In addition, our variance bound gives EP (h2 - EP h2 )2 EP h2 v (EP h2 ) , and consequently, Bernstein's inequality shows that with probability not less 2 than 1 - e- we have 2 v (EP h2 ) 4 ET h2 - EP h2 < + . n n ( -1 -1 Now, for q -1 + (q ) = 1 the elementary inequality ab aq q -1 + bq q ) holds, and hence for 2EP h2 /2 1 2 2 q := 2- , q := , a := 21- v n- 2 , and b := we obtain 2 1 21- v 2- 1 v (EP h2 ) - + EP h2 . n 2 n\n\n\f\n2 1 Since elementary calculations show that - 2- 1 we obtain 2 1 1 2v 2- v (EP h2 ) - + EP h2 . n 2 n Therefore we have with probability not less than 1 - e- that 2v 2- 4 1 + . (15) n n ^ Let us finally estimate the term in line (11). To this end we write hf := L f - L fL,P , f F . Moreover, for r > 0 we define E h -h . Pf f Gr := : (p, f ) P F (p, f ) + EP (hf ) + r\n ^ ^ RL,T (f0 ) - RL,T (fL,P ) - RL,P (f0 ) + RL,P (fL,P ) < EP h2 +\n\n1\n\n-\n\n 2\n\nThen for gp,f :=\n\nEP hf -hf (p,f )+EP (hf )+r\n\n Gr we have EP gp,f = 0 and\n\ngp,f = sup\nz Z\n\nE h - h (z ) = EP hf - hf 6 Pf f . (p, f ) + EP (hf ) + r (p, f ) + EP (hf ) + r r EP h2 EP h2 v f f 2- 2- . (EP (hf ) + r)2 r (EP hf ) r sup\n(p,f )P F\n\nIn addition, the inequality a b2- (a + b)2 and the variance bound assumption (6) implies that\n2 EP gp,f \n\nNow define (r) := ET P n Standard symmetrization then yields ET P n sup\n(p,f )P F (p,f )+EP (hf )r\n\nEP hf - ET hf . (p, f ) + EP (hf ) + r sup\n(p,f )P F (p,f )+EP (hf )r\n\n|EP hf - ET hf | 2ET P n E\n\n|R hf | ,\n\nand hence Lemma 3.2 proved below together with (9) shows (r) 10n (r)r-1 , r > 0. Therefore applying Talagrand's inequality in the version of [3] to the class Gr we obtain 2 T 30n (r) v 7 n n + + 1 - e- . P Z : sup ET g r nr2- nr g Gr 2 v 1 /2 7 n Let us define r := 30r (r) + nr- + nr . 
Then the above inequality gives with probability 2 not less than 1 - e- that for all (p, f ) P F we have 2 + 7 vr + , EP hf - ET hf r (p, f ) + EP hf 30n (r) + n n and consequently we have with probability not less than 1 - e- that\n ^ ^ RL,P (fT , ) - RL,P (fL,P ) - RL,T (fT , ) + RL,T (fL,P )\n\n2 vr 7 r 30n (r) + + . (16) n n Now observe that for the functions h1 and h2 which we defined when estimating (12) and (13) we have EP g + EP h = RL,P (f0 ) - R ,P , (17) L and hence we can combine our estimates (16), (14), and (15) of the terms (11), (12), and (13) to obtain that with probability not less than 1 - 3e- we have ^ (1 - r ) (pT , , fT , ) + RL,P (fT , ) - R ,P L 2 2v 2- 1 vr (66 + 7B ) 30n (r) + + (1 - ) + + a (p0 , f0 ) + RL,P (f0 ) - R ,P + . L n 2 n 6n \n ^ (pT , , fT , ) + RL,P (fT , ) - RL,P (fL,P )\n\n+\n\n\f\n2 v 1 /2 n In particular, for r satisfying the assumption (10) we have 30r (r) 1 , nr- 1 , and 2 4 4 7 1 . This shows 1 - r 1 , and hence we obtain with probability not less than 1 - 3e- that nr 4 4 3 1 2v 2- 44 2 v r ^ (pT , , fT , ) + RL,P (fT , ) - RL,P 120n (r) + + 2(2 - ) + n n n R + 14B + + 4a (p0 , f0 ) + 4 L,P (f0 ) - R ,P 4. L 3n 3 v 1/2 2v 1 4 r, 4n 5r , and 2(2 - ) n 2- However we also have 120n (r) r, 2n r 3 r 2(2 - ) 4 r, and hence we find the assertion. For the proof of Theorem 3.1 it remains to show the following lemma: Lemma 3.2 Let P and F be as in Theorem 3.1. Furthermore, let W : F R and a : P F [0, ). Define Ea T W (f ) - EP W (f ) (r) := ET P n sup (p, f ) + r f P F and suppose that there exists a subroot such that E ET P n sup (r) , T W (f ) - EP W (f )\n(p,f )P F a(p,f )r\n\nr > 0.\n\nThen we have (r) 5 (r) for all r > 0. 
r Proof: For x > 1, r > 0, and T (X Y )n we obtain by a standard peeling approach that sup\n(p,f )P F\n\n|EP W (f ) - ET W (f )| a(p, f ) + r |EP W (f ) - ET W (f )| i + a(p, f ) + r =0\n \n\n\n\nsup\n(p,f )P F a(p,f )r\n\nsup\n(p,f )P F a(p,f )r xi a(p,f )r xi+1\n\n|EP W (f ) - ET W (f )| a(p, f ) + r\n\n\n\n|EP W (f ) - ET W (f )| i + r (p,f )P F =0 sup\na(p,f )r\n\nsup\n(p,f )P F a(p,f )r xi a(p,f )r xi+1\n\n|EP W (f ) - ET W (f )| rxi + r\n\n\n\n 1 1 1i sup |EP W (f ) - ET W (f )| + i+1 r (p,f )P F r =0 x a(p,f )r\n\nsup\n\n|EP W (f ) - ET W (f )|\n\n(p,f )P F a(p,f )r xi+1\n\n=\n\ni (rxi+1 ) . 1 (r) + r xi + 1 =0\ni+1 2\n\nHowever since is a subroot we obtain that (rxi+1 ) x by setting x := 4.\n\n(r) so that we obtain the assertion\n\n4 Proof of Theorem 2.1\nBefore we begin the proof of Theorem 2.1 let us state the following proposition which follows directly from [8] (see also [9, Prop. 5.7]) together with simple considerations on covering numbers: Proposition 4.1 Let F := H be a RKHS, P := {p0 } be a singleton, and (p0 , f ) := f 2. If (7) is satisfied then there exists a constant cp depending only on p such that (9) is satisfied for v r p a 1 r 1+p a 1+p . p 1 1 2 2 2 (1-p) r 2 (1-p) , n (r) := cp max n n\n\n\f\nProof of Theorem 2.1: From the covering bound assumption we observe that Proposition 4.1 implies we have the bound (9) with n (r) defined by the righthand side of Proposition 4.1 and therefore Theorem 3.1 implies that Condition (10) becomes 1 1 p 1 1 rpa2 r 1+p a 1+p 32v 2- 28 ( 1 2 r max 20cp v 2 (1-p) r 2 (1-p) 18) , 120cp , , n n n n and solving with respect to r yields the conclusion. Finally, for the parameter selection approach in Example 2.4 we need the following oracle inequality for model selection: Theorem 4.2 Let P be a distribution on X Y and let L be a loss function which satisfies (3), (4), and the variance bound (6). Furthermore, let F := {f1 , . . . , fm } be a finite set of functions mapping X into [-1, 1]. 
For $T \\in (X \\times Y)^n$ we define\n\n$$f_T := \\arg\\min_{f \\in F} R_{L,T}(f).$$\n\nThen there exists a universal constant $K$ such that for all $\\tau \\ge 1$ we have with probability not less than $1 - 3e^{-\\tau}$ that\n\n$$R_{L,P}(f_T) - R^*_{L,P} \\le 5 \\Bigl( \\frac{K \\log m}{n} \\Bigr)^{\\frac{1}{2 - \\vartheta}} + 5 \\Bigl( \\frac{32 v \\tau}{n} \\Bigr)^{\\frac{1}{2 - \\vartheta}} + \\frac{5 K \\log m + 154 \\tau}{n} + 8 \\min_{f \\in F} \\bigl( R_{L,P}(f) - R^*_{L,P} \\bigr).$$\n\nProof: Since all functions $f_i$ already map into $[-1, 1]$ we do not have to consider the clipping operator. For $r > 0$ we now define $F_r := \\{f \\in F : R_{L,P}(f) - R^*_{L,P} \\le r\\}$. Then the cardinality of $F_r$ is smaller than or equal to $m$ and hence we have $N(L \\circ F_r - L \\circ f^*_{L,P}, \\varepsilon, L_2(T)) \\le m$ for all $\\varepsilon > 0$. Using the technique of [8] (cf. also [9, Prop. 5.7]) we hence obtain that (9) is satisfied for\n\n$$\\varphi_n(r) := \\max\\Bigl\\{ \\sqrt{\\frac{v}{n} \\log m}\\; r^{\\vartheta/2},\\ \\frac{c \\log m}{n} \\Bigr\\},$$\n\nwhere $c$ is a universal constant. Applying Theorem 3.1 then yields the assertion.\n\nReferences\n[1] P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44:525-536, 1998.\n[2] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Technical Report, 2004.\n[3] O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris, 334:495-500, 2002.\n[4] D.R. Chen, Q. Wu, Y.M. Ying, and D.X. Zhou. Support vector machine soft margin classifiers: Error analysis. Journal of Machine Learning Research, 5:1143-1175, 2004.\n[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.\n[6] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.\n[7] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995.\n[8] S. Mendelson. Improving the sample complexity using global data. IEEE Trans. Inform. Theory, 48:1977-1991, 2002.\n[9] I. Steinwart and C. Scovel. 
Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, to appear.\n[10] I. Steinwart and C. Scovel. Fast rates for support vector machines. In Proceedings of the 18th Annual Conference on Learning Theory, COLT 2005, pages 279-294. Springer, 2005.\n[11] Q. Wu, Y. Ying, and D.-X. Zhou. Multi-kernel regularized classifiers. J. Complexity, to appear.\n", "award": [], "sourceid": 3066, "authors": [{"given_name": "Ingo", "family_name": "Steinwart", "institution": null}, {"given_name": "Don", "family_name": "Hush", "institution": null}, {"given_name": "Clint", "family_name": "Scovel", "institution": null}]}