{"title": "Support Vector Regression Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 155, "page_last": 161, "abstract": "", "full_text": "Support Vector Regression Machines \n\nHarris Drucker*, Chris J.C. Burges**, Linda Kaufman**, Alex Smola**, Vladimir Vapnik+ \n\n*Bell Labs and Monmouth University, Department of Electronic Engineering, West Long Branch, NJ 07764 \n**Bell Labs    +AT&T Labs \n\nAbstract \n\nA new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and with ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high dimensionality spaces because SVR optimization does not depend on the dimensionality of the input space. \n\n1. Introduction \n\nIn the following, lower case bold characters represent vectors and upper case bold characters represent matrices. Superscript \"t\" represents the transpose of a vector. y represents either a vector (in bold) or a single observation of the dependent variable in the presence of noise. y(p) indicates a predicted value due to the input vector x(p), not seen in the training set. \n\nSuppose we have an unknown function G(x) (the \"truth\") which is a function of a vector x (termed input space). The vector x^t = [x_1, x_2, ..., x_d] has d components, where d is termed the dimensionality of the input space. F(x, w) is a family of functions parameterized by w; the optimum w is that value which minimizes a measure of error between G(x) and F(x, w). Our objective is to estimate this optimum by observing the N training instances v_j, j = 1, ..., N. We will develop two approximations for the truth G(x). The first one is F_1(x, w), which we term a feature space representation. One (of many) such feature vectors is: \n\nz^t = [x_1^2, ..., x_d^2, x_1, ..., x_d, x_1 x_2, ..., x_i x_j, ..., x_{d-1} x_d, 1] \n\nwhich is a quadratic function of the input space components. Using the feature space representation, F_1(x, w) = z^t w; that is, F_1(x, w) is linear in feature space although it is quadratic in input space. In general, for a p'th order polynomial and d-dimensional input space, the feature dimensionality f of w is \n\nf = sum_{i=d-1}^{p+d-1} C^i_{d-1},  where C^n_k = n! / (k! (n-k)!). \n\nThe second representation is a support vector regression (SVR) representation that was developed by Vladimir Vapnik (1995): \n\nF_2(x, w) = sum_{i=1}^{N} (a_i^* - a_i)(v_i^t x + 1)^p + b \n\nF_2 is an expansion explicitly using the training examples. The rationale for calling it a support vector representation will be clear later, as will the necessity for having both an a and an a^* rather than just one multiplicative constant. In this case we must choose the 2N+1 values of a_i, a_i^*, and b. If we expand the term raised to the p'th power, we find f coefficients that multiply the various powers and cross product terms of the components of x. So, in this sense F_1 looks very similar to F_2 in that they have the same number of terms. However, F_1 has f free coefficients while F_2 has 2N+1 coefficients that must be determined from the N training vectors. \n\nWe let a represent the 2N values of a_i and a_i^*. The optimum values for the components of w or a depend on our definition of the loss function and the objective function. Here the primal objective function is: \n\nU sum_{j=1}^{N} L[y_j - F(v_j, w)] + ||w||^2 \n\nwhere L is a general loss function (to be defined later) and F could be F_1 or F_2, y_j is the observation of G(x) in the presence of noise, and the last term is a regularizer. 
The regularization constant is U, which in typical developments multiplies the regularizer but is placed in front of the first term for reasons discussed later. \n\nIf the loss function is quadratic, i.e., L[.] = [.]^2, and we let F = F_1, i.e., the feature space representation, the objective function may be minimized by using linear algebra techniques since the feature space representation is linear in that space. This is termed ridge regression (Miller, 1990). In particular, let V be a matrix whose i'th row is the i'th training vector represented in feature space (including the constant term \"1\" which represents a bias). V is a matrix where the number of rows is the number of examples (N) and the number of columns is the dimensionality of feature space f. Let E be the f x f diagonal matrix whose elements are 1/U. y is the N x 1 column vector of observations of the dependent variable. We then solve the following matrix formulation for w using a linear technique (Strang, 1986) with a linear algebra package (e.g., MATLAB): \n\nV^t y = [V^t V + E] w \n\nThe rationale for the regularization term is to trade off mean square error (the first term) in the objective function against the size of the w vector. If U is large, then essentially we are minimizing the mean square error on the training set, which may give poor generalization to a test set. We find a good value of U by varying U to find the best performance on a validation set and then applying that U to the test set. 
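As a concrete illustration, the ridge regression solve above can be sketched in a few lines. This is a minimal sketch, not the authors' code: NumPy stands in for the linear algebra package mentioned in the text, and the one-dimensional quadratic feature map is our own toy example.

```python
import numpy as np

# Ridge regression in feature space: solve V^t y = (V^t V + E) w,
# where E is the f x f diagonal matrix with elements 1/U.
def ridge_regression(V, y, U):
    f = V.shape[1]
    E = np.eye(f) / U                 # regularizer; large U -> near plain least squares
    return np.linalg.solve(V.T @ V + E, V.T @ y)

# Toy quadratic feature map for a 1-dimensional input: z = [x^2, x, 1].
x = np.linspace(-1.0, 1.0, 50)
V = np.column_stack([x ** 2, x, np.ones_like(x)])
y = 3 * x ** 2 - 2 * x + 1            # noiseless truth, for illustration only
w = ridge_regression(V, y, U=1e6)     # recovers approximately [3, -2, 1]
```

As in the text, U would in practice be chosen by sweeping it and keeping the value with the best validation-set prediction error.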
\n\nLet us now define a different type of loss function, termed an ε-insensitive loss (Vapnik, 1995): \n\nL = 0  if |y_i - F_2(x_i, w)| < ε \nL = |y_i - F_2(x_i, w)| - ε  otherwise \n\nThis defines an ε tube (Figure 1) so that if the predicted value is within the tube the loss is zero, while if the predicted point is outside the tube, the loss is the magnitude of the difference between the predicted value and the radius ε of the tube. \n\nSpecifically, we minimize: \n\nU (sum_{i=1}^{N} ξ_i^* + sum_{i=1}^{N} ξ_i) + (1/2)(w^t w) \n\nwhere ξ_i or ξ_i^* is zero if the sample point is inside the tube. If the observed point is \"above\" the tube, ξ_i^* is the positive difference between the observed value and ε, and a_i^* will be nonzero. Similarly, ξ_i will be nonzero if the observed point is below the tube, and in this case a_i will be nonzero. Since an observed point can not be simultaneously on both sides of the tube, either a_i or a_i^* will be nonzero, unless the point is within the tube, in which case both constants will be zero. If U is large, more emphasis is placed on the error, while if U is small, more emphasis is placed on the norm of the weights, leading to (hopefully) a better generalization. The constraints are (for all i, i = 1, ..., N): \n\ny_i - (w^t v_i) - b <= ε + ξ_i^* \n(w^t v_i) + b - y_i <= ε + ξ_i \nξ_i^* >= 0 \nξ_i >= 0 \n\nThe corresponding Lagrangian is: \n\nL = (1/2)(w^t w) + U (sum_{i=1}^{N} ξ_i^* + sum_{i=1}^{N} ξ_i) - sum_{i=1}^{N} a_i^* [ε + ξ_i^* - y_i + (w^t v_i) + b] - sum_{i=1}^{N} a_i [ε + ξ_i + y_i - (w^t v_i) - b] - sum_{i=1}^{N} (γ_i^* ξ_i^* + γ_i ξ_i) \n\nwhere the a_i, a_i^*, γ_i, and γ_i^* are nonnegative Lagrange multipliers. \n\nWe find a saddle point of L (Vapnik, 1995) by differentiating with respect to w, b, and the ξ's, which results in the equivalent maximization of the (dual space) objective function: \n\nW(a, a^*) = -ε sum_{i=1}^{N} (a_i^* + a_i) + sum_{i=1}^{N} y_i (a_i^* - a_i) - (1/2) sum_{i,j=1}^{N} (a_i^* - a_i)(a_j^* - a_j)(v_i^t v_j + 1)^p \n\nwith the constraints: \n\n0 <= a_i <= U,  0 <= a_i^* <= U,  i = 1, ..., N \nsum_{i=1}^{N} a_i^* = sum_{i=1}^{N} a_i \n\nWe must find N Lagrange multiplier pairs (a_i, a_i^*). We can also prove that the product of a_i and a_i^* is zero, which means that at least one of these two terms is zero. A v_i corresponding to a nonzero a_i or a_i^* is termed a support vector. There can be at most N support vectors. Suppose now we have a new vector x(p); then the corresponding prediction of y(p) is: \n\ny(p) = sum_{i=1}^{N} (a_i^* - a_i)(v_i^t x(p) + 1)^p + b \n\nMaximizing W is a quadratic programming problem, but the above expression for W is not in standard form for use in quadratic programming packages (which usually do minimization). If we let \n\nβ_i = a_i^*,  β_{i+N} = a_i,  i = 1, ..., N \n\nthen we minimize \n\n(1/2) β^t Q β + c^t β \n\nsubject to the constraints \n\nsum_{i=1}^{N} β_i = sum_{i=N+1}^{2N} β_i  and  0 <= β_i <= U,  i = 1, ..., 2N \n\nwhere \n\nc^t = [ε - y_1, ε - y_2, ..., ε - y_N, ε + y_1, ε + y_2, ..., ε + y_N] \n\nQ = [ D  -D ; -D  D ],  with d_ij = (v_i^t v_j + 1)^p,  i, j = 1, ..., N \n\nWe use an active set method (Bunch and Kaufman, 1980) to solve this quadratic programming problem. \n\n2. Nonlinear Experiments \n\nWe tried three artificial functions from (Friedman, 1991) and a problem (Boston Housing) from the UCI database. Because the first three problems are artificial, we know both the observed values and the truths. Boston Housing has 506 cases with the dependent variable being the median price of housing in the Boston area. There are twelve continuous predictor variables. This data was obtained from the UCI database (anonymous ftp at ftp.ics.uci.edu in directory /pub/machine-learning-databases). In this case, we have no \"truth\", only the observations. \n\nIn addition to the feature space representation and the SVR representation, we also tried bagging. Bagging is a technique that combines regressors, in this case regression trees (Breiman, 1994). 
We used this technique because we had a local version available. In the case of regression trees, the validation set was used to prune the trees. \n\nSuppose we have test points with input vectors x_i(p), i = 1, ..., M, and make a prediction y_i(p) using any procedure discussed here. Suppose y_i is the actually observed value, which is the truth G(x_i) plus noise. We define the prediction error (PE) and the modeling error (ME): \n\nME = (1/M) sum_{i=1}^{M} (y_i(p) - G(x_i))^2 \nPE = (1/M) sum_{i=1}^{M} (y_i(p) - y_i)^2 \n\nFor the three Friedman functions we calculated both the prediction error and the modeling error. For Boston Housing, since the \"truth\" was not known, we calculated the prediction error only. For the three Friedman functions, we generated (for each experiment) 200 training set examples and 40 validation set examples. The validation set examples were used to find the optimum regularization constant in the feature space representation. The following procedure was followed. Train on the 200 members of the training set with a choice of regularization constant and obtain the prediction error on the validation set. Now repeat with a different regularization constant until a minimum of prediction error occurs on the validation set. Now use the regularization constant that minimizes the validation set prediction error and test on a 1000 example test set. This experiment was repeated for 100 different training sets of size 200 and validation sets of size 40, but one test set of size 1000. Different size polynomials were tried (maximum power 3). Second order polynomials fared best. For Friedman function #1, the dimensionality of feature space is 66, while for the last two problems the dimensionality of feature space is 15 (for p=2). Thus the size of the feature space is smaller than the number of examples, and we would expect that a feature space representation should do well. 
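The two error measures defined above can be sketched directly. This is a minimal sketch; the function and array names are ours, not the paper's.

```python
import numpy as np

# ME: mean squared difference between predictions and the noiseless truth G(x).
def modeling_error(y_pred, g_truth):
    return float(np.mean((y_pred - g_truth) ** 2))

# PE: mean squared difference between predictions and the noisy observations y.
def prediction_error(y_pred, y_obs):
    return float(np.mean((y_pred - y_obs) ** 2))
```

Since ME requires the truth G(x), it is only available for the artificial Friedman functions; for Boston Housing only PE can be computed, as the text notes.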
\n\nA similar procedure was followed for the SVR representation, except that the regularization constant U, ε, and power p were varied to find the minimum validation prediction error. In the majority of cases p=2 was the optimum choice of power. \n\nFor the Boston Housing data, we picked randomly from the 506 cases using a training set of size 401, a validation set of size 80, and a test set of size 25. This was repeated 100 times. The optimum power as picked by the validation set varied between p=4 and p=5. \n\n3. Results of experiments \n\nThe first experiments we tried were bagging regression trees versus support vector regression (Table I). \n\nTable I. Modeling error and prediction error on the three Friedman problems (100 trials). \n\n     bagging ME   SVR ME   bagging PE   SVR PE   # trials SVR better \n#1      2.26        .67       3.36       1.75           100 \n#2    10,185      4,944     66,077     60,424            92 \n#3     .0302      .0261      .0677      .0692            46 \n\nRather than report the standard error, we did a comparison for each training set. That is, for the first experiment we tried both SVR and bagging on the same training, validation, and test set. If SVR had a better modeling error on the test set, it counted as a win. Thus for Friedman #1, SVR was always better than bagging on the 100 trials. There is no clear winner for Friedman function #3. \n\nSubsequent to our comparison of bagging to SVR, we attempted working directly in feature space. That is, we used F_1 as our approximating function with square loss and a second degree polynomial. The results of this ridge regression (Table II) are better than SVR. In retrospect, this is not surprising since the dimensionality of feature space is small (f=66 for Friedman #1 and f=15 for the two remaining functions) in relation to the number of training examples (200). This was due to the fact that the best approximating polynomial is second order. 
The other advantages of the feature space representation in this particular case are that both PE and ME are mean squared error and the loss function is mean squared error also. \n\nTable II. Modeling error for SVR and feature space polynomial approximation. \n\n      SVR     feature space \n#1     .67        .61 \n#2   4,944      3,051 \n#3   .0261      .0176 \n\nWe now ask the question whether U and ε are important in SVR by comparing the results in Table I with the results obtained by setting ε to zero and U to 100,000, making the regularizer insignificant (Table III). On Friedman #2 (and less so on Friedman #3), the proper choice of ε and U is important. \n\nTable III. Comparing the results above with those obtained by setting ε to zero and U to 100,000 (labeled suboptimum). \n\n     optimum ME   suboptimum ME \n#1      .67           .70 \n#2    4,944        34,506 \n#3    .0261         .0395 \n\nFor the case of Boston Housing, the prediction error using bagging was 12.4, while for SVR we obtained 7.2, and SVR was better than bagging on 71 out of 100 trials. The optimum power seems to be about five. We were never able to get the feature space representation to work well because the number of coefficients to be determined (6885) was much larger than the number of training examples (401). \n\n4. Conclusions \n\nSupport vector regression was compared to bagging and a feature space representation on four nonlinear problems. On three of these problems a feature space representation was best, bagging was worst, and SVR came in second. On the fourth problem, Boston Housing, SVR was best, and we were unable to construct a feature space representation because of the high dimensionality required of the feature space. On the two linear problems we tried, at varying signal to noise ratios, forward subset selection seems to be the method of choice. 
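For reference, the ε-insensitive loss underlying these comparisons is easy to state in code; setting ε to zero, as in the suboptimum runs above, reduces it to a plain absolute-error loss. This is a minimal sketch and the function name is ours.

```python
import numpy as np

# Vapnik's epsilon-insensitive loss: zero inside the tube of radius eps,
# |y - y_pred| - eps outside of it.  Setting eps = 0 gives absolute error.
def eps_insensitive_loss(y, y_pred, eps):
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

# The first point lies inside the tube and contributes no loss; the
# others contribute linearly in their distance beyond the tube boundary.
losses = eps_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                              np.array([1.05, 2.5, 1.0]), eps=0.1)
```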
\n\nIn retrospect, the problems we decided to test on were too simple. SVR probably has greatest use when the dimensionality of the input space and the order of the approximation create a dimensionality of the feature space representation much larger than the number of examples. This was not the case for the problems we considered. We thus need real life examples that fulfill these requirements. \n\n5. Acknowledgements \n\nThis project was supported by ARPA contract number N00014-94-C-1086. \n\n6. References \n\nLeo Breiman, \"Bagging Predictors\", Technical Report 421, September 1994, Department of Statistics, University of California, Berkeley, CA. Also at anonymous ftp site: ftp.stat.berkeley.edu/pub/tech-reports/421.ps.Z. \n\nJames R. Bunch and Linda C. Kaufman, \"A Computational Method for the Indefinite Quadratic Programming Problem\", Linear Algebra and Its Applications, Elsevier-North Holland, 1980. \n\nJerome H. Friedman, \"Multivariate Adaptive Regression Splines\", Annals of Statistics, vol. 19, no. 1, pp. 1-141, 1991. \n\nAlan J. Miller, Subset Selection in Regression, Chapman and Hall, 1990. \n\nGilbert Strang, Introduction to Applied Mathematics, Wellesley Cambridge Press, 1986. \n\nVladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995. \n\nFigure 1: The parameters for the support vector regression. \n", "award": [], "sourceid": 1238, "authors": [{"given_name": "Harris", "family_name": "Drucker", "institution": null}, {"given_name": "Christopher", "family_name": "Burges", "institution": null}, {"given_name": "Linda", "family_name": "Kaufman", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}