{"title": "Shrinking the Tube: A New Support Vector Regression Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 330, "page_last": 336, "abstract": null, "full_text": "Shrinking the Tube: \n\nA New Support Vector Regression Algorithm \n\nBernhard Schölkopf§,*, Peter Bartlett*, Alex Smola§,*, Robert Williamson* \n\n§ GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany \n\n* FEIT/RSISE, Australian National University, Canberra 0200, Australia \n\nbs, smola@first.gmd.de, Peter.Bartlett, Bob.Williamson@anu.edu.au \n\nAbstract \n\nA new algorithm for Support Vector regression is described. For a priori chosen ν, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction ν of the data points lie outside. Moreover, it is shown how to use parametric tube shapes with non-constant radius. The algorithm is analysed theoretically and experimentally. \n\n1 INTRODUCTION \n\nSupport Vector (SV) machines comprise a new class of learning algorithms, motivated by results of statistical learning theory (Vapnik, 1995). Originally developed for pattern recognition, they represent the decision boundary in terms of a typically small subset (Schölkopf et al., 1995) of all training examples, called the Support Vectors. In order for this property to carry over to the case of SV Regression, Vapnik devised the so-called ε-insensitive loss function |y − f(x)|_ε = max{0, |y − f(x)| − ε}, which does not penalize errors below some ε > 0, chosen a priori. His algorithm, which we will henceforth call ε-SVR, seeks to estimate functions \n\nf(x) = (w · x) + b, w, x ∈ ℝ^N, b ∈ ℝ, (1) \n\nbased on data \n\n(x_1, y_1), ..., (x_ℓ, y_ℓ) ∈ ℝ^N × ℝ, (2) \n\nby minimizing the regularized risk functional \n\n‖w‖²/2 + C · 
R_emp^ε, (3) \n\nwhere C is a constant determining the trade-off between minimizing training errors and minimizing the model complexity term ‖w‖², and R_emp^ε := (1/ℓ) Σ_{i=1}^ℓ |y_i − f(x_i)|_ε. \n\nThe parameter ε can be useful if the desired accuracy of the approximation can be specified beforehand. In some cases, however, we just want the estimate to be as accurate as possible, without having to commit ourselves to a certain level of accuracy. \n\nWe present a modification of the ε-SVR algorithm which automatically minimizes ε, thus adjusting the accuracy level to the data at hand. \n\n2 ν-SV REGRESSION AND ε-SV REGRESSION \n\nTo estimate functions (1) from empirical data (2) we proceed as follows (Schölkopf et al., 1998a). At each point x_i, we allow an error of ε. Everything above ε is captured in slack variables ξ_i^(*) (the superscript (*) being a shorthand implying both the variables with and without asterisks), which are penalized in the objective function via a regularization constant C, chosen a priori (Vapnik, 1995). The tube size ε is traded off against model complexity and slack variables via a constant ν > 0: \n\nminimize τ(w, ξ^(*), ε) = ‖w‖²/2 + C · (νε + (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*)) (4) \n\nsubject to ((w · x_i) + b) − y_i ≤ ε + ξ_i (5) \n\ny_i − ((w · x_i) + b) ≤ ε + ξ_i^* (6) \n\nξ_i^(*) ≥ 0, ε ≥ 0. (7) \n\nHere and below, it is understood that i = 1, ..., ℓ, and that boldface Greek letters denote ℓ-dimensional vectors of the corresponding variables. Introducing a Lagrangian with multipliers α_i^(*), η_i^(*), β ≥ 0, we obtain the Wolfe dual problem. Moreover, as Boser et al. (1992), we substitute a kernel k for the dot product, corresponding to a dot product in some feature space related to input space via a nonlinear map Φ, \n\nk(x, y) = (Φ(x) · Φ(y)). 
\n\n(8) \n\nThis leads to the ν-SVR Optimization Problem: for ν ≥ 0, C > 0, \n\nmaximize W(α^(*)) = Σ_{i=1}^ℓ (α_i^* − α_i) y_i − (1/2) Σ_{i,j=1}^ℓ (α_i^* − α_i)(α_j^* − α_j) k(x_i, x_j) (9) \n\nsubject to Σ_{i=1}^ℓ (α_i − α_i^*) = 0 (10) \n\nα_i^(*) ∈ [0, C/ℓ] (11) \n\nΣ_{i=1}^ℓ (α_i + α_i^*) ≤ C · ν. (12) \n\nThe regression estimate can be shown to take the form \n\nf(x) = Σ_{i=1}^ℓ (α_i^* − α_i) k(x_i, x) + b, (13) \n\nwhere b (and ε) can be computed by taking into account that (5) and (6) (substitution of Σ_j (α_j^* − α_j) k(x_j, x) for (w · x) is understood) become equalities with ξ_i^(*) = 0 for points with 0 < α_i^(*) < C/ℓ, respectively, due to the Karush-Kuhn-Tucker conditions (cf. Vapnik, 1995). The latter moreover imply that in the kernel expansion (13), only those α_i^(*) will be nonzero that correspond to a constraint (5)/(6) which is precisely met. The respective patterns x_i are referred to as Support Vectors. \n\nBefore we give theoretical results explaining the significance of the parameter ν, the following observation concerning ε is helpful. If ν > 1, then ε = 0, since it does not pay to increase ε (cf. (4)). If ν ≤ 1, it can still happen that ε = 0, e.g. if the data are noise-free and can perfectly be interpolated with a low capacity model. The case ε = 0, however, is not what we are interested in; it corresponds to plain L1 loss regression. Below, we will use the term errors to refer to training points lying outside of the tube, and the term fraction of errors/SVs to denote the relative numbers of errors/SVs, i.e. divided by ℓ. \n\nProposition 1 Assume ε > 0. The following statements hold: \n\n(i) ν is an upper bound on the fraction of errors. \n\n(ii) ν is a lower bound on the fraction of SVs. \n\n(iii) Suppose the data (2) were generated iid from a distribution P(x, y) = P(x)P(y|x) with P(y|x) continuous. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors. 
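\n\nProposition 1 can be illustrated numerically. The following minimal sketch uses scikit-learn's NuSVR, a libsvm-based implementation of ν-SVR, as a stand-in for the optimizer used in this paper (an assumption of the sketch, as are the toy data and parameter values); it checks that the fraction of errors is at most ν and the fraction of SVs is at least ν. \n\n

```python
# Empirical sanity check of Proposition 1.  By the KKT conditions,
# points strictly outside the tube (the "errors") have their dual
# coefficient at the upper bound, so counting coefficients at the bound
# upper-bounds the error fraction; the SV fraction is read off directly.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(0)
n, nu = 200, 0.2
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sinc(X).ravel() + rng.normal(0, 0.2, size=n)  # noisy sinc data

model = NuSVR(nu=nu, C=100.0, kernel="rbf", gamma=1.0).fit(X, y)

coefs = np.abs(model.dual_coef_.ravel())   # |alpha_i - alpha_i^*|
frac_sv = len(coefs) / n                   # fraction of support vectors
bound = coefs.max()                        # dual coefficients at the bound
frac_err = np.sum(coefs > 0.999 * bound) / n

print(f"fraction of SVs:    {frac_sv:.3f}  (should be >= nu = {nu})")
print(f"fraction of errors: {frac_err:.3f}  (should be <= nu = {nu})")
```

\n\n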
\n\nThe first two statements of this proposition can be proven from the structure of the dual optimization problem, with (12) playing a crucial role. Presently, we instead give a graphical proof based on the primal problem (Fig. 1). \n\nTo understand the third statement, note that all errors are also SVs, but there can be SVs which are not errors: namely, if they lie exactly at the edge of the tube. Asymptotically, however, these SVs form a negligible fraction of the whole SV set, and the set of errors and the set of SVs essentially coincide. This is due to the fact that for a class of functions with well-behaved capacity (such as SV regression functions), and for a distribution satisfying the above continuity condition, the number of points that the tube edges f ± ε can pass through cannot asymptotically increase linearly with the sample size. Interestingly, the proof (Schölkopf et al., 1998a) uses a uniform convergence argument similar in spirit to those used in statistical learning theory. \n\nDue to this proposition, 0 ≤ ν ≤ 1 can be used to control the number of errors (note that for ν ≥ 1, (11) implies (12), since α_i · α_i^* = 0 for all i (Vapnik, 1995)). Moreover, since the constraint (10) implies that (12) is equivalent to Σ_i α_i^(*) ≤ Cν/2, we conclude that Proposition 1 actually holds for the upper and the lower edge of the tube separately, with ν/2 each. As an aside, note that by the same argument, the numbers of SVs at the two edges of the standard ε-SVR tube asymptotically agree. \n\nMoreover, note that this bears on the robustness of ν-SVR. At first glance, SVR seems all but robust: using the ε-insensitive loss function, only the patterns outside of the ε-tube contribute to the empirical risk term, whereas the patterns closest to the estimated regression have zero loss. This, however, does not mean that it is only the outliers that determine the regression. 
In fact, the contrary is the case: one can show that local movements of target values y_i of points x_i outside the tube do not influence the regression (Schölkopf et al., 1998c). Hence, ν-SVR is a generalization of an estimator for the mean of a random variable which throws away the largest and smallest examples (a fraction of at most ν/2 of either category), and estimates the mean by taking the average of the two extremal ones of the remaining examples. This is close in spirit to robust estimators like the trimmed mean. \n\nLet us briefly discuss how the new algorithm relates to ε-SVR (Vapnik, 1995). By rewriting (3) as a constrained optimization problem, and deriving a dual much like we did for ν-SVR, \n\nFigure 1: Graphical depiction of the ν-trick. Imagine increasing ε, starting from 0. The first term in νε + (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*) (cf. (4)) will increase proportionally to ν, while the second term will decrease proportionally to the fraction of points outside of the tube. Hence, ε will grow as long as the latter fraction is larger than ν. At the optimum, it therefore must be ≤ ν (Proposition 1, (i)). Next, imagine decreasing ε, starting from some large value. Again, the change in the first term is proportional to ν, but this time, the change in the second term is proportional to the fraction of SVs (even points on the edge of the tube will contribute). Hence, ε will shrink as long as the fraction of SVs is smaller than ν, eventually leading to Proposition 1, (ii). \n\none arrives at the following quadratic program: maximize \n\nW(α, α^*) = −ε Σ_{i=1}^ℓ (α_i^* + α_i) + Σ_{i=1}^ℓ (α_i^* − α_i) y_i − (1/2) Σ_{i,j=1}^ℓ (α_i^* − α_i)(α_j^* − α_j) k(x_i, x_j) (14) \n\nsubject to (10) and (11). 
Compared to (9), we have an additional term −ε Σ_{i=1}^ℓ (α_i^* + α_i), which makes it plausible that the constraint (12) is not needed. \n\nIn the following sense, ν-SVR includes ε-SVR. Note that in the general case, using kernels, w is a vector in feature space. \n\nProposition 2 If ν-SVR leads to the solution ε̄, w̄, b̄, then ε-SVR with ε set a priori to ε̄, and the same value of C, has the solution w̄, b̄. \n\nProof If we minimize (4), then fix ε and minimize only over the remaining variables, the solution does not change. ∎ \n\n3 PARAMETRIC INSENSITIVITY MODELS \n\nWe generalized ε-SVR by considering the tube as not given but instead estimating it as a model parameter. What we have so far retained is the assumption that the ε-insensitive zone has a tube (or slab) shape. We now go one step further and use parametric models of arbitrary shape. Let {ζ_q^(*)} (here and below, q = 1, ..., p is understood) be a set of 2p positive functions on ℝ^N. Consider the following quadratic program: for given ν_1^(*), ..., ν_p^(*) ≥ 0, minimize \n\nτ(w, ξ^(*), ε^(*)) = ‖w‖²/2 + C · (Σ_{q=1}^p (ν_q ε_q + ν_q^* ε_q^*) + (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*)) (15) \n\nsubject to ((w · x_i) + b) − y_i ≤ Σ_q ε_q ζ_q(x_i) + ξ_i (16) \n\ny_i − ((w · x_i) + b) ≤ Σ_q ε_q^* ζ_q^*(x_i) + ξ_i^* (17) \n\nξ_i^(*) ≥ 0, ε_q^(*) ≥ 0. (18) \n\nA calculation analogous to Sec. 2 shows that the Wolfe dual consists of maximizing (9) subject to (10), (11), and, instead of (12), the modified constraints Σ_{i=1}^ℓ α_i^(*) ζ_q^(*)(x_i) ≤ C · ν_q^(*). In the experiments in Sec. 4, we use a simplified version of this optimization problem, where we drop the term ν_q^* ε_q^* from the objective function (15), and use ε_q and ζ_q in (17). By this, we render the problem symmetric with respect to the two edges of the tube. In addition, we use p = 1. This leads to the same Wolfe dual, except for the last constraint, which becomes (cf. (12)) \n\nΣ_{i=1}^ℓ (α_i^* + α_i) ζ(x_i) ≤ C · ν. 
\n\n(19) \n\nThe advantage of this setting is that since the same ν is used for both sides of the tube, the computation of ε, b is straightforward: for instance, by solving a linear system, using two conditions as those described following (13). Otherwise, general statements are harder to make: the linear system can have a zero determinant, depending on whether the functions ζ_q^(*), evaluated on the x_i with 0 < α_i^(*) < C/ℓ, are linearly dependent. The latter occurs, for instance, if we use constant functions ζ^(*) ≡ 1. In this case, it is pointless to use two different values ν, ν^*; for, the constraint (10) then implies that both sums Σ_{i=1}^ℓ α_i^(*) will be bounded by C · min{ν, ν^*}. We conclude this section by giving, without proof, a generalization of Proposition 1, (iii), to the optimization problem with constraint (19): \n\nProposition 3 Assume ε > 0. Suppose the data (2) were generated iid from a distribution P(x, y) = P(x)P(y|x) with P(y|x) continuous. With probability 1, asymptotically, the fractions of SVs and errors equal ν · (∫ ζ(x) dP̃(x))⁻¹, where P̃ is the asymptotic distribution of SVs over x. \n\n4 EXPERIMENTS AND DISCUSSION \n\nIn the experiments, we used the optimizer LOQO (http://www.princeton.edu/~rvdb/). This has the serendipitous advantage that the primal variables b and ε can be recovered as the dual variables of the Wolfe dual (9) (i.e. the double dual variables) fed into the optimizer. In Fig. 2, the task was to estimate a regression of a noisy sinc function, given ℓ examples (x_i, y_i), with x_i drawn uniformly from [−3, 3], and y_i = sin(πx_i)/(πx_i) + υ_i, with υ_i drawn from a Gaussian with zero mean and variance σ². We used the default parameters ℓ = 50, C = 100, σ = 0.2, and the RBF kernel k(x, x') = exp(−|x − x'|²). 
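\n\nThis toy setting is easy to reproduce. The sketch below, again using scikit-learn's NuSVR in place of the LOQO-based solver (an assumption, as is the random seed), fits the noisy sinc data for ν = 0.2 and ν = 0.8 and confirms the qualitative behaviour of Fig. 2: a larger ν lets more points lie outside the automatically shrinking tube, so the SV fraction grows. \n\n

```python
# nu-SV regression of a noisy sinc function for two values of nu.
# gamma = 1.0 reproduces the kernel k(x, x') = exp(-|x - x'|^2).
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(1)
ell, sigma = 50, 0.2
x = rng.uniform(-3, 3, size=(ell, 1))
y = np.sinc(x).ravel() + rng.normal(0, sigma, size=ell)  # sin(pi x)/(pi x) + noise

sv_fractions = {}
for nu in (0.2, 0.8):
    model = NuSVR(nu=nu, C=100.0, kernel="rbf", gamma=1.0).fit(x, y)
    sv_fractions[nu] = len(model.support_) / ell

# The nu = 0.8 run should end up with a larger fraction of SVs.
print(sv_fractions)
```

\n\n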
\nFigure 3 gives an illustration of how one can make use of parametric insensitivity models as proposed in Sec. 3. Using the proper model, the estimate gets much better. In the parametric case, we used ν = 0.1 and ζ(x) = sin²((2π/3)x), which, due to ∫ ζ(x) dP(x) = 1/2, corresponds to our standard choice ν = 0.2 in ν-SVR (cf. Proposition 3). The experimental findings are consistent with the asymptotics predicted theoretically even if we assume a uniform distribution of SVs: for ℓ = 200, we got 0.24 and 0.19 for the fraction of SVs and errors, respectively. \n\nThis method allows the incorporation of prior knowledge into the loss function. Although this approach at first glance seems fundamentally different from incorporating prior knowledge directly into the kernel (Schölkopf et al., 1998b), from the point of view of statistical \n\nFigure 2: Left: ν-SV regression with ν = 0.2 (top) and ν = 0.8 (bottom). The larger ν allows more points to lie outside the tube (see Sec. 2). The algorithm automatically adjusts ε to 0.22 (top) and 0.04 (bottom). Shown are the sinc function (dotted), the regression f and the tube f ± ε. Middle: ν-SV regression on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ν = 0.2. The tube width automatically adjusts to the noise (top: ε = 0, bottom: ε = 1.19). Right: ε-SV regression (Vapnik, 1995) on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ε = 0.2; this choice, which has to be specified a priori, is ideal for neither case: in the top figure, the regression estimate is biased; in the bottom figure, ε does not match the external noise (cf. Smola et al., 1998). \n\nFigure 3: Toy example, using prior knowledge about an x-dependence of the noise. Additive noise (σ = 1) was multiplied by sin²((2π/3)x). 
Left: the same function was used as ζ as a parametric insensitivity tube (Sec. 3). Right: ν-SVR with standard tube. \n\nTable 1: Results for the Boston housing benchmark; top: ν-SVR, bottom: ε-SVR. MSE: mean squared errors, STD: standard deviations thereof (100 trials), Errors: fraction of training points outside the tube, SVs: fraction of training points which are SVs. \n\nν           | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 \nautomatic ε | 2.6 | 1.7 | 1.2 | 0.8 | 0.6 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 \nMSE         | 9.4 | 8.7 | 9.3 | 9.5 | 10.0 | 10.6 | 11.3 | 11.3 | 11.3 | 11.3 \nSTD         | 6.4 | 6.8 | 7.6 | 7.9 | 8.4 | 9.0 | 9.6 | 9.5 | 9.5 | 9.5 \nErrors      | 0.0 | 0.1 | 0.2 | 0.2 | 0.3 | 0.4 | 0.5 | 0.5 | 0.5 | 0.5 \nSVs         | 0.3 | 0.4 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | 1.0 | 1.0 | 1.0 \n\nε      | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 \nMSE    | 11.3 | 9.5 | 8.8 | 9.7 | 11.2 | 13.1 | 15.6 | 18.2 | 22.1 | 27.0 | 34.3 \nSTD    | 9.5 | 7.7 | 6.8 | 6.2 | 6.3 | 6.0 | 6.1 | 6.2 | 6.6 | 7.3 | 8.4 \nErrors | 0.5 | 0.2 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 \nSVs    | 1.0 | 0.6 | 0.4 | 0.3 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 \n\nlearning theory the two approaches are closely related: in both cases, the structure of the loss-function-induced class of functions (which is the object of interest for generalization error bounds) is customized; in the first case, by changing the loss function, in the second case, by changing the class of functions that the estimate is taken from. \n\nEmpirical studies using ε-SVR have reported excellent performance on the widely used Boston housing regression benchmark set (Stitson et al., 1999). 
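\n\nThe "automatic ε" row of Table 1 can be emulated for a fitted model: by the KKT conditions of Sec. 2, SVs whose dual coefficient lies strictly between 0 and the upper bound sit exactly on the tube edge, so their absolute residual equals ε. A sketch of this recovery, under the assumption that scikit-learn's NuSVR (whose dual coefficients are ordered like its support indices) replaces the paper's solver, and with hypothetical toy data: \n\n

```python
# Recover the automatically determined tube width epsilon from the
# residuals of "edge" SVs: those with dual coefficient strictly inside
# (0, upper bound), which by the KKT conditions lie on the tube edge.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X).ravel() + rng.normal(0, 0.5, size=200)

model = NuSVR(nu=0.3, C=100.0, kernel="rbf", gamma=1.0).fit(X, y)
coefs = np.abs(model.dual_coef_.ravel())
bound = coefs.max()
on_edge = (coefs > 1e-8) & (coefs < 0.999 * bound)

# Residuals of the SVs, in the same order as dual_coef_:
residuals = np.abs(y[model.support_] - model.predict(X[model.support_]))
eps_hat = float(np.median(residuals[on_edge])) if on_edge.any() else 0.0
print(f"estimated tube width epsilon ~ {eps_hat:.3f}")
```

A usage note: larger ν drives the recovered ε toward 0, mirroring the trend of the "automatic ε" row above. \n\n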
Due to Proposition 2, the only difference between ν-SVR and standard ε-SVR lies in the fact that different parameters, ε vs. ν, have to be specified a priori. Consequently, in this experiment we are only interested in these parameters and simply adjusted C and the width 2σ² in k(x, y) = exp(−‖x − y‖²/(2σ²)) as in Schölkopf et al. (1997): we used 2σ² = 0.3 · N, where N = 13 is the input dimensionality, and C/ℓ = 10 · 50 (i.e. the original value of 10 was corrected since in the present case, the maximal y-value is 50). We performed 100 runs, where each time the overall set of 506 examples was randomly split into a training set of ℓ = 481 examples and a test set of 25 examples. Table 1 shows that in a wide range of ν (note that only 0 ≤ ν ≤ 1 makes sense), we obtained performances which are close to the best performances that can be achieved by selecting ε a priori by looking at the test set. Finally, note that although we did not use validation techniques to select the optimal values for C and 2σ², we obtained performances which are state of the art (Stitson et al. (1999) report an MSE of 7.6 for ε-SVR using ANOVA kernels, and 11.7 for Bagging trees). Table 1 moreover shows that ν can be used to control the fraction of SVs/errors. \n\nDiscussion. The theoretical and experimental analysis suggest that ν provides a way to control an upper bound on the number of training errors which is tighter than the one used in the soft margin hyperplane (Vapnik, 1995). In many cases, this makes it a parameter which is more convenient than the one in ε-SVR. Asymptotically, it directly controls the number of Support Vectors, and the latter can be used to give a leave-one-out generalization bound (Vapnik, 1995). 
In addition, ν characterizes the compression ratio: it suffices to train the algorithm only on the SVs, leading to the same solution (Schölkopf et al., 1995). In ε-SVR, the tube width ε must be specified a priori; in ν-SVR, which generalizes the idea of the trimmed mean, it is computed automatically. Desirable properties of ε-SVR, including the formulation as a definite quadratic program, and the sparse SV representation of the solution, are retained. We are optimistic that in many applications, ν-SVR will be more robust than ε-SVR. Among these should be the reduced set algorithm of Osuna and Girosi (1999), which approximates the SV pattern recognition decision surface by ε-SVR. Here, ν should give a direct handle on the desired speed-up. \n\nOne of the immediate questions that a ν-approach to SV regression raises is whether a similar algorithm is possible for the case of pattern recognition. This question has recently been answered in the affirmative (Schölkopf et al., 1998c). Since the pattern recognition algorithm (Vapnik, 1995) does not use ε, the only parameter that we can dispose of by using ν is the regularization constant C. This leads to a dual optimization problem with a homogeneous quadratic form, and ν lower bounding the sum of the Lagrange multipliers. Whether we could have abolished C in the regression case, too, is an open problem. \n\nAcknowledgement This work was supported by the ARC and the DFG (# Ja 379171). \n\nReferences \n\nB. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, 1992. ACM Press. \n\nE. Osuna and F. Girosi. Reducing run-time complexity in support vector machines. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 271-283. MIT Press, Cambridge, MA, 1999. 
\n\nB. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995. \n\nB. Schölkopf, P. Bartlett, A. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In L. Niklasson, M. Boden, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 111-116, Berlin, 1998a. Springer Verlag. \n\nB. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, pages 640-646, Cambridge, MA, 1998b. MIT Press. \n\nB. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. 1998c. NeuroColt2-TR 1998-031; cf. http://www.neurocolt.com \n\nB. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45:2758-2765, 1997. \n\nA. Smola, N. Murata, B. Schölkopf, and K.-R. Müller. Asymptotically optimal choice of ε-loss for support vector machines. In L. Niklasson, M. Boden, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 105-110, Berlin, 1998. Springer Verlag. \n\nM. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector regression with ANOVA decomposition kernels. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 285-291. MIT Press, Cambridge, MA, 1999. \n\nV. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. 
", "award": [], "sourceid": 1563, "authors": [{"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}