{"title": "Model Selection for Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 230, "page_last": 236, "abstract": null, "full_text": "Model Selection for Support Vector Machines \n\nOlivier Chapelle*†, Vladimir Vapnik* \n* AT&T Research Labs, Red Bank, NJ \n† LIP6, Paris, France \n{chapelle, vlad}@research.att.com \n\nAbstract \n\nNew functionals for parameter (model) selection of Support Vector Machines are introduced, based on the concepts of the span of support vectors and rescaling of the feature space. It is shown that using these functionals, one can both predict the best choice of parameters of the model and the relative quality of performance for any value of the parameter. \n\n1 Introduction \n\nSupport Vector Machines (SVMs) implement the following idea: they map input vectors into a high-dimensional feature space, where a maximal margin hyperplane is constructed [6]. It was shown that when the training data are separable, the error rate for SVMs can be characterized by \n\nR² / M², (1) \n\nwhere R is the radius of the smallest sphere containing the training data and M is the margin (the distance between the hyperplane and the closest training vector in feature space). This functional estimates the VC dimension of hyperplanes separating data with a given margin M. \n\nTo perform the mapping and to calculate R and M in the SVM technique, one uses a positive definite kernel K(x, x') which specifies an inner product in feature space. An example of such a kernel is the Radial Basis Function (RBF) \n\nK(x, x') = exp(-||x - x'||² / 2σ²). \n\nThis kernel has a free parameter σ and, more generally, most kernels require some parameters to be set. When treating noisy data with SVMs, another parameter, penalizing the training errors, also needs to be set. 
The problem of choosing the values of these parameters which minimize the expectation of test error is called the model selection problem. \n\nIt was shown that the parameter of the kernel that minimizes functional (1) provides a good choice for the model: the minimum of this functional coincides with the minimum of the test error [1]. However, the shapes of the two curves can be different. \n\nIn this article we introduce refined functionals that not only specify the best choice of parameters (both the parameter of the kernel and the parameter penalizing training errors), but also produce curves which better reflect the actual error rate. \n\nThe paper is organized as follows. Section 2 describes the basics of SVMs, section 3 introduces a new functional based on the concept of the span of support vectors, section 4 considers the idea of rescaling data in feature space and section 5 discusses experiments of model selection with these functionals. \n\n2 Support Vector Learning \n\nWe introduce some standard notation for SVMs; for a complete description, see [6]. Let (x_i, y_i), 1 ≤ i ≤ ℓ, be a set of training examples, x_i ∈ R^n, each belonging to a class labeled by y_i ∈ {-1, 1}. The decision function given by an SVM is \n\nf(x) = sign( Σ_i α_i⁰ y_i K(x_i, x) + b ), (2) \n\nwhere the coefficients α_i⁰ are obtained by maximizing the functional \n\nW(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) (3) \n\nsubject to the constraints Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C. The points x_i with α_i > 0 are called support vectors. We distinguish between those with 0 < α_i < C and those with α_i = C. We call them respectively support vectors of the first and second category. \n\n3 Prediction using the span of support vectors \n\nThe results introduced in this section are based on the leave-one-out cross-validation estimate. This procedure is usually used to estimate the probability of test error of a learning algorithm. \n\n3.1 The leave-one-out procedure \n\nThe leave-one-out procedure consists of removing one element from the training data, constructing the decision rule on the basis of the remaining training data and then testing on the removed element. In this fashion one tests all ℓ elements of the training data (using ℓ different decision rules). 
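The leave-one-out procedure described above is generic. A minimal sketch in plain NumPy, with a simple nearest-centroid classifier standing in for the SVM (the paper's solver is not reproduced here; the data are made up):

```python
import numpy as np

def loo_errors(X, y, fit_predict):
    """Count leave-one-out errors: for each i, train on the data with
    point i removed and test on the held-out pair (X[i], y[i])."""
    errors = 0
    n = len(y)
    for i in range(n):
        mask = np.arange(n) != i
        y_hat = fit_predict(X[mask], y[mask], X[i])
        errors += int(y_hat != y[i])
    return errors

def nearest_centroid(X_train, y_train, x_test):
    # Stand-in classifier (NOT an SVM): predict the class whose mean is closest.
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)
    return 1 if np.linalg.norm(x_test - mu_pos) < np.linalg.norm(x_test - mu_neg) else -1

# Two well-separated clusters: the leave-one-out procedure makes no errors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(loo_errors(X, y, nearest_centroid))  # → 0
```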
Let us denote the number of errors in the leave-one-out procedure by L(x_1, y_1, ..., x_ℓ, y_ℓ). It is known [6] that the leave-one-out procedure gives an almost unbiased estimate of the probability of test error: the expectation of test error for the machine trained on ℓ - 1 examples is equal to the expectation of (1/ℓ) L(x_1, y_1, ..., x_ℓ, y_ℓ). We now provide an analysis of the number of errors made by the leave-one-out procedure. For this purpose, we introduce a new concept, called the span of support vectors [7]. \n\n3.2 Span of support vectors \n\nSince the results presented in this section do not depend on the feature space, we will consider, without any loss of generality, linear SVMs, i.e. K(x_i, x_j) = x_i · x_j. \n\nSuppose that α⁰ = (α_1⁰, ..., α_ℓ⁰) is the solution of the optimization problem (3). For any fixed support vector x_p we define the set Λ_p of constrained linear combinations of the support vectors of the first category (x_i)_{i≠p}: \n\nΛ_p = { Σ_{i≠p} λ_i x_i : Σ_{i≠p} λ_i = 1, 0 ≤ α_i⁰ + y_i y_p α_p⁰ λ_i ≤ C }. (4) \n\nNote that λ_i can be less than 0. \n\nWe also define the quantity S_p, which we call the span of the support vector x_p, as the minimum distance between x_p and this set (see figure 1): \n\nS_p = d(x_p, Λ_p) = min_{x ∈ Λ_p} ||x_p - x||. (5) \n\nFigure 1: Three support vectors with α_1 = α_2 = α_3/2. The set Λ_1 is the semi-opened dashed line (λ_2 = +∞, λ_3 = -∞). \n\nIt was shown in [7] that the set Λ_p is not empty and that S_p = d(x_p, Λ_p) ≤ D_SV, where D_SV is the diameter of the smallest sphere containing the support vectors. 
\nIntuitively, the smaller S_p = d(x_p, Λ_p) is, the less likely the leave-one-out procedure is to make an error on the vector x_p. Formally, the following theorem holds: \n\nTheorem 1 [7] If in the leave-one-out procedure a support vector x_p corresponding to 0 < α_p < C is recognized incorrectly, then the following inequality holds: \n\nα_p⁰ ≥ 1 / (S_p max(D_SV, 1/√C)). \n\nThis theorem implies that in the separable case (C = ∞), the number of errors made by the leave-one-out procedure is bounded as follows: L(x_1, y_1, ..., x_ℓ, y_ℓ) ≤ Σ_p α_p⁰ max_p S_p D_SV = max_p S_p D_SV / M², because Σ α_p⁰ = 1/M² [6]. This is already an improvement over functional (1), since S_p ≤ D_SV. But depending on the geometry of the support vectors, the value of the span S_p can be much less than the diameter D_SV of the support vectors and can even be equal to zero. \n\nWe can go further under the assumption that the set of support vectors does not change during the leave-one-out procedure, which leads us to the following theorem: \n\nTheorem 2 If the sets of support vectors of the first and second categories remain the same during the leave-one-out procedure, then for any support vector x_p the following equality holds: \n\ny_p [f⁰(x_p) - f^p(x_p)] = α_p⁰ S_p², \n\nwhere f⁰ and f^p are the decision functions (2) given by the SVM trained respectively on the whole training set and after the point x_p has been removed. \n\nThe proof of the theorem follows that of Theorem 1 in [7]. \n\nThe assumption that the set of support vectors does not change during the leave-one-out procedure is obviously not satisfied in most cases. Nevertheless, the proportion of points which violate this assumption is usually small compared to the number of support vectors. 
In this case, Theorem 2 provides a good approximation of the result of the leave-one-out procedure, as pointed out by the experiments (see Section 5.1, figure 2). \n\nAs already noticed in [1], the larger α_p⁰ is, the more \"important\" the support vector x_p is in the decision function. Thus, it is not surprising that removing a point x_p causes a change in the decision function proportional to its Lagrange multiplier α_p⁰. The same kind of result as Theorem 2 has also been derived in [2], where for SVMs without threshold the following inequality was obtained: y_p(f⁰(x_p) - f^p(x_p)) ≤ α_p⁰ K(x_p, x_p). The span S_p takes into account the geometry of the support vectors in order to get a precise notion of how \"important\" a given point is. \n\nThe previous theorem enables us to compute the number of errors made by the leave-one-out procedure: \n\nCorollary 1 Under the assumption of Theorem 2, the test error prediction given by the leave-one-out procedure is \n\nt_ℓ = (1/ℓ) card{ p : α_p⁰ S_p² ≥ y_p f⁰(x_p) }. (6) \n\nNote that points which are not support vectors are correctly classified by the leave-one-out procedure. Therefore t_ℓ defines the proportion of errors of the leave-one-out procedure on the entire training set. \n\nUnder the assumption in Theorem 2, the box constraints in the definition of Λ_p (4) can be removed. Moreover, if we consider only hyperplanes passing through the origin, the constraint Σ λ_i = 1 can also be removed. Therefore, under those assumptions, the computation of the span S_p is an unconstrained minimization of a quadratic form and can be done analytically. For support vectors of the first category, this leads to the closed form S_p² = 1/(K_SV⁻¹)_pp, where K_SV is the matrix of dot products between support vectors of the first category. A similar result has also been obtained in [3]. \n\nIn Section 5, we use the span-rule (6) for model selection in both separable and non-separable cases. 
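The closed form S_p² = 1/(K_SV⁻¹)_pp can be checked numerically in the simplified setting just described (no box constraints, hyperplane through the origin), where the span reduces to the distance from x_p to the linear span of the other support vectors of the first category. The two support vectors below are made-up illustrative data, not from the paper:

```python
import numpy as np

X_sv = np.array([[1.0, 0.0],
                 [1.0, 1.0]])          # two illustrative "support vectors"
K_sv = X_sv @ X_sv.T                   # Gram matrix of the support vectors
span_sq = 1.0 / np.diag(np.linalg.inv(K_sv))   # S_p^2 = 1/(K_sv^{-1})_pp

# Direct geometric check for p = 0: distance from x_0 to the line spanned by x_1.
x0, x1 = X_sv
proj = (x0 @ x1) / (x1 @ x1) * x1      # orthogonal projection of x_0 onto x_1
d_sq = np.sum((x0 - proj) ** 2)
print(span_sq[0], d_sq)  # both equal 0.5
```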
\n\n4 Rescaling \n\nAs we already mentioned, functional (1) bounds the VC dimension of a linear margin classifier. This bound is tight when the data almost \"fill\" the surface of the sphere enclosing the training data, but when the data lie on a flat ellipsoid this bound is poor, since the radius of the sphere takes into account only the components with the largest deviations. The idea we present here is to rescale our data in feature space such that the radius of the sphere stays constant while the margin increases, and then apply this bound to the rescaled data and hyperplane. \n\nLet us first consider linear SVMs, i.e. without any mapping into a high-dimensional space. The rescaling can be achieved by computing the covariance matrix of our data and rescaling according to its eigenvalues. Suppose our data are centered and let (φ_1, ..., φ_n) be the normalized eigenvectors of the covariance matrix of our data. We can then compute the smallest enclosing box containing our data, centered at the origin and whose edges are parallel to (φ_1, ..., φ_n). This box is an approximation of the smallest enclosing ellipsoid. The length of the edge in the direction φ_k is μ_k = max_i |x_i · φ_k|. The rescaling consists of the following diagonal transformation: \n\nD : x → Dx = Σ_k μ_k (x · φ_k) φ_k. \n\nLet us consider x̃_i = D⁻¹ x_i and w̃ = Dw. The decision function is not changed under this transformation, since w · x_i = w̃ · x̃_i, and the data x̃_i fill a box of side length 1. Thus, in functional (1), we replace R² by 1 and 1/M² by w̃². Since we rescaled our data into a box, we actually estimated the radius of the enclosing ball using the ℓ∞-norm instead of the classical ℓ2-norm. Further theoretical work needs to be done to justify this change of norm. 
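A minimal numerical sketch of this rescaling (linear case, synthetic centered data and an arbitrary weight vector, none of it taken from the paper) confirms the two claimed properties: decision values are invariant and the rescaled data fill the unit box:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) * np.array([10.0, 0.1])  # a flat "ellipsoid"
X -= X.mean(axis=0)                                   # center the data
w = np.array([0.3, -2.0])                             # arbitrary weight vector

_, phis = np.linalg.eigh(np.cov(X.T))   # columns: eigenvectors phi_k of the covariance
mu = np.abs(X @ phis).max(axis=0)       # box edges mu_k = max_i |x_i . phi_k|

X_tilde = (X @ phis) / mu @ phis.T      # x_tilde = D^{-1} x
w_tilde = (w @ phis) * mu @ phis.T      # w_tilde = D w

assert np.allclose(X @ w, X_tilde @ w_tilde)      # decision values unchanged
assert np.abs(X_tilde @ phis).max() <= 1 + 1e-12  # rescaled data fill the unit box
bound = np.sum(((w @ phis) * mu) ** 2)            # rescaled functional: ||w_tilde||^2
```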
\n\nIn the non-linear case, note that even if we map our data into a high-dimensional feature space, they lie in the linear subspace spanned by these data. Thus, if the number of training data ℓ is not too large, we can work in this subspace of dimension at most ℓ. For this purpose, one can use the tools of kernel PCA [5]: if A is the matrix of normalized eigenvectors of the Gram matrix K_ij = K(x_i, x_j) and (λ_k) the eigenvalues, the dot product x_i · φ_k is replaced by √λ_k A_ik and w · φ_k becomes √λ_k Σ_i A_ik y_i α_i. Thus, we can still perform the diagonal transformation D, and functional (1) finally becomes \n\nΣ_k λ_k² max_i A_ik² (Σ_i A_ik y_i α_i)². \n\n5 Experiments \n\nTo check these new methods, we performed two series of experiments. One concerns the choice of σ, the width of the RBF kernel, on a linearly separable database, the postal database. This dataset consists of 7291 handwritten digits of size 16x16 with a test set of 2007 examples. Following [4], we split the training set into 23 subsets of 317 training examples. Our task consists of separating digits 0 to 4 from digits 5 to 9. Error bars in figures 2a and 3 are standard deviations over the 23 trials. In another experiment, we try to choose the optimal value of C on a noisy database, the breast-cancer database¹. The dataset has been split randomly 100 times into a training set containing 200 examples and a test set containing 77 examples. \n\nSection 5.1 describes experiments of model selection using the span-rule (6), both in the separable case and in the non-separable one, while Section 5.2 shows VC bounds for model selection in the separable case both with and without rescaling. \n\n5.1 Model selection using the span-rule \n\nIn this section, we use the prediction of test error derived from the span-rule (6) for model selection. 
Figure 2a shows the test error and the prediction given by the span for different values of the width σ of the RBF kernel on the postal database. Figure 2b plots the same functions for different values of C on the breast-cancer database. We can see that the method predicts the correct value of the minimum. Moreover, the prediction is very accurate and the curves are almost identical. \n\n¹ Available from http://horn.first.gmd.de/~raetsch/data/breast-cancer \n\n(a) choice of σ in the postal database  (b) choice of C in the breast-cancer database \nFigure 2: Test error and its prediction using the span-rule (6). \n\nThe computation of the span-rule (6) involves computing the span S_p (5) for every support vector. Note, however, that we are interested in the inequality S_p² ≤ y_p f⁰(x_p)/α_p⁰, rather than in the exact value of the span S_p. Thus, while minimizing S_p = d(x_p, Λ_p), if we find a point x* ∈ Λ_p such that d(x_p, x*)² ≤ y_p f⁰(x_p)/α_p⁰, we can stop the minimization, because this point will be correctly classified by the leave-one-out procedure. \n\nIt turned out in the experiments that the time required to compute the span was not prohibitive, since it was about the same as the training time. \n\nThere is a noteworthy extension in the application of the span concept. If we denote by θ one hyperparameter of the kernel and if the derivative ∂K(x_i, x_j)/∂θ is computable, then it is possible to compute analytically the derivative of Σ_i α_i S_i² / (y_i f⁰(x_i)), which is an upper bound on the number of errors made by the leave-one-out procedure (see Theorem 2). 
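For the RBF kernel used here, the needed derivative has the closed form ∂K/∂σ = K(x, x') ||x - x'||² / σ³. The short sketch below (with arbitrary points and an arbitrary σ, chosen only for illustration) verifies this formula against a central finite difference:

```python
import numpy as np

def rbf(x, xp, sigma):
    # RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def drbf_dsigma(x, xp, sigma):
    # Analytic derivative dK/dsigma = K * ||x - x'||^2 / sigma^3
    return rbf(x, xp, sigma) * np.sum((x - xp) ** 2) / sigma ** 3

x, xp, sigma, eps = np.array([1.0, 2.0]), np.array([0.5, -1.0]), 1.5, 1e-6
numeric = (rbf(x, xp, sigma + eps) - rbf(x, xp, sigma - eps)) / (2 * eps)
assert abs(numeric - drbf_dsigma(x, xp, sigma)) < 1e-8
```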
This provides us with a more powerful technique for model selection. Indeed, our initial approach was to choose the value of the width σ of the RBF kernel according to the minimum of the span-rule. In our case there was only one hyperparameter, so it was possible to try different values of σ. But if we have several hyperparameters, for example one σ_k per component, \n\nK(x, x') = exp(-Σ_k (x_k - x'_k)² / 2σ_k²), \n\nit is not possible to do an exhaustive search over all the possible values of the hyperparameters. Nevertheless, the previous remark enables us to find their optimal values by a classical gradient descent approach. \n\n(a) without rescaling  (b) with rescaling \nFigure 3: Bound on the VC dimension for different values of σ on the postal database. The shape of the curve with rescaling is very similar to the test error in figure 2. \n\n6 Conclusion \n\nIn this paper, we introduced two new techniques of model selection for SVMs. One is based on the span, the other on rescaling of the data in feature space. We demonstrated that using these techniques, one can both predict optimal values for the parameters of the model and evaluate the relative performance of different values of the parameters. These functionals can also lead to new learning techniques, as they establish that generalization ability is not due to the margin alone. \n\nAcknowledgments \n\nThe authors would like to thank Jason Weston and Patrick Haffner for helpful discussions and comments. \n\nReferences \n\n[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. \n\n[2] T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics, 1999. \n\n[3] M. 
Opper and O. Winther. Gaussian process classification and SVM: Mean field results and leave-one-out estimator. In Advances in Large Margin Classifiers. MIT Press, 1999. To appear. \n\n[4] B. Schölkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Kernel-dependent support vector error bounds. In Ninth International Conference on Artificial Neural Networks, pages 304-309, 1999. \n\n[5] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Artificial Neural Networks - ICANN'97, pages 583-588, Berlin, 1997. Springer Lecture Notes in Computer Science, Vol. 1327. \n\n[6] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. \n\n[7] V. Vapnik and O. Chapelle. Bounds on error expectation for SVM. Neural Computation, 1999. Submitted. \n", "award": [], "sourceid": 1663, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}