{"title": "Feature Selection for SVMs", "book": "Advances in Neural Information Processing Systems", "page_first": 668, "page_last": 674, "abstract": null, "full_text": "Feature Selection for SVMs \n\nJ. Weston t, S. Mukherjee tt , O. Chapelle*, M. Pontiltt \n\nT. Poggiott, V. Vapnik*,ttt \n\nt Barnhill Biolnformatics.com, Savannah, Georgia, USA. \n\ntt CBCL MIT, Cambridge, Massachusetts, USA. \n* AT&T Research Laboratories, Red Bank, USA. \n\nttt Royal Holloway, University of London, Egham, Surrey, UK. \n\nAbstract \n\nWe introduce a method of feature selection for Support Vector Machines. \nThe method is based upon finding those features which minimize bounds \non the leave-one-out error. This search can be efficiently performed via \ngradient descent. The resulting algorithms are shown to be superior to \nsome standard feature selection algorithms on both toy data and real-life \nproblems of face recognition, pedestrian detection and analyzing DNA \nmicro array data. \n\n1 Introduction \n\nIn many supervised learning problems feature selection is important for a variety of rea(cid:173)\nsons: generalization performance, running time requirements, and constraints and interpre(cid:173)\ntational issues imposed by the problem itself. \nIn classification problems we are given f data points Xi E ~n labeled Y E \u00b11 drawn i.i.d \nfrom a probability distribution P(x, y). We would like to select a subset of features while \npreserving or improving the discriminative ability of a classifier. As a brute force search \nof all possible features is a combinatorial problem one needs to take into account both the \nquality of solution and the computational expense of any given algorithm. \n\nSupport vector machines (SVMs) have been extensively used as a classification tool with a \ngreat deal of success from object recognition [5, 11] to classification of cancer morpholo(cid:173)\ngies [10] and a variety of other areas, see e.g [13] . 
In this article we introduce feature selection algorithms for SVMs. The methods are based on minimizing generalization bounds via gradient descent and are feasible to compute. This allows several new possibilities: one can speed up time-critical applications (e.g. object recognition) and one can perform feature discovery (e.g. cancer diagnosis). We also show how SVMs can perform badly in the situation of many irrelevant features, a problem which is remedied by using our feature selection approach. \n\nThe article is organized as follows. In section 2 we describe the feature selection problem, in section 3 we review SVMs and some of their generalization bounds, and in section 4 we introduce the new SVM feature selection method. Section 5 then describes results on toy and real-life data indicating the usefulness of our approach. \n\n2 The Feature Selection Problem \n\nThe feature selection problem can be addressed in the following two ways: (1) given a fixed m ≪ n, find the m features that give the smallest expected generalization error; or (2) given a maximum allowable generalization error γ, find the smallest m. In both of these problems the expected generalization error is of course unknown, and thus must be estimated. In this article we will consider problem (1). Note that choices of m in problem (1) can usually be reparameterized as choices of γ in problem (2). \n\nProblem (1) is formulated as follows. Given a fixed set of functions y = f(x, α) we wish to find a preprocessing of the data x ↦ (x ∗ σ), σ ∈ {0, 1}^n, and the parameters α of the function f that give the minimum value of \n\nτ(σ, α) = ∫ V(y, f((x ∗ σ), α)) dP(x, y) (1) \n\nsubject to ||σ||_0 = m, where P(x, y) is unknown, x ∗ σ = (x_1 σ_1, ..., x_n σ_n) denotes an elementwise product, V(·, ·) is a loss functional and || · ||_0 is the 0-norm. 
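As a concrete illustration of the preprocessing x ↦ (x ∗ σ) in equation (1), the following sketch (ours, not from the paper) applies a binary mask σ with ||σ||_0 = m to a data matrix:

```python
import numpy as np

def mask_features(X, sigma):
    """Elementwise product x * sigma from equation (1): multiplies each
    feature by its scaling factor, zeroing out features with sigma_i = 0.
    sigma is binary in problem (1); Section 4 relaxes it to real values."""
    return X * sigma  # sigma broadcasts over the sample axis

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
sigma = np.array([1.0, 0.0, 1.0])  # keep features 1 and 3: ||sigma||_0 = m = 2
print(mask_features(X, sigma))
```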
\nIn the literature one distinguishes between two types of method to solve this problem: the so-called filter and wrapper methods [2]. Filter methods are defined as a preprocessing step to induction that can remove irrelevant attributes before induction occurs, and thus wish to be valid for any set of functions f(x, α). For example, one popular filter method is to use Pearson correlation coefficients. \n\nThe wrapper method, on the other hand, is defined as a search through the space of feature subsets using the estimated accuracy from an induction algorithm as a measure of goodness of a particular feature subset. Thus, one approximates τ(σ, α) by minimizing \n\nτ_wrap(σ, α) = min τ_alg(σ) (2) \n\nsubject to σ ∈ {0, 1}^n, where τ_alg is a learning algorithm trained on data preprocessed with fixed σ. Wrapper methods can provide more accurate solutions than filter methods [9], but in general are more computationally expensive since the induction algorithm τ_alg must be evaluated over each feature set (vector σ) considered, typically using performance on a hold-out set as a measure of goodness of fit. \n\nIn this article we introduce a feature selection algorithm for SVMs that takes advantage of the performance increase of wrapper methods whilst avoiding their computational complexity. Note, some previous work on feature selection for SVMs does exist, however results have been limited to linear kernels [3, 7] or linear probabilistic models [8]. Our approach can be applied to nonlinear problems. In order to describe this algorithm, we first review the SVM method and some of its properties. \n\n3 Support Vector Learning \n\nSupport Vector Machines [13] realize the following idea: they map x ∈ ℝ^n into a high (possibly infinite) dimensional space and construct an optimal hyperplane in this space. Different mappings x ↦ Φ(x) ∈ H construct different SVMs. The mapping Φ(·) is performed by a kernel function K(·, ·) 
which defines an inner product in H. The decision function given by an SVM is thus: \n\nf(x) = w · Φ(x) + b = Σ_i α_i⁰ y_i K(x_i, x) + b. (3) \n\nThe optimal hyperplane is the one with the maximal distance (in H space) to the closest image Φ(x_i) from the training data (called the maximal margin). Finding it reduces to maximizing the following optimization problem: \n\nW²(α) = Σ_{i=1}^{ℓ} α_i − (1/2) Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j K(x_i, x_j) (4) \n\nunder the constraints Σ_{i=1}^{ℓ} α_i y_i = 0 and α_i ≥ 0, i = 1, ..., ℓ. For the non-separable case one can quadratically penalize errors with the modified kernel K ← K + (1/λ) I, where I is the identity matrix and λ a constant penalizing the training errors (see [4] for reasons for this choice). \n\nSuppose that the size of the maximal margin is M and the images Φ(x_1), ..., Φ(x_ℓ) of the training vectors are within a sphere of radius R. Then the following holds true [13]. \n\nTheorem 1 If images of training data of size ℓ belonging to a sphere of size R are separable with the corresponding margin M, then the expectation of the error probability has the bound \n\nEP_err ≤ (1/ℓ) E{R²/M²} = (1/ℓ) E{R² W²(α⁰)}, (5) \n\nwhere the expectation is taken over sets of training data of size ℓ. \n\nThis theorem justifies the idea that the performance depends on the ratio E{R²/M²} and not simply on the large margin M, where R is controlled by the mapping function Φ(·). Other bounds also exist; in particular Vapnik and Chapelle [4] derived an estimate using the concept of the span of support vectors. 
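To make equation (4) concrete, here is a small sketch (ours) that evaluates the dual objective W²(α) for a given multiplier vector; on a trivial two-point separable problem the optimal value can be checked by hand:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """W^2(alpha) of equation (4):
    sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

# Two points x = +1 (y = +1) and x = -1 (y = -1), linear kernel.
# The equality constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a,
# the objective becomes 2a - 2a^2, and maximizing by hand gives a = 1/2.
y = np.array([1.0, -1.0])
K = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
alpha = np.array([0.5, 0.5])
print(dual_objective(alpha, y, K))  # 0.5 at the optimum
```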
\n\nTheorem 2 Under the assumption that the set of support vectors does not change when removing the example p, \n\nEp_err^{ℓ−1} ≤ (1/ℓ) E { Σ_{p=1}^{ℓ} Ψ( α_p⁰ / (K_SV⁻¹)_pp − 1 ) }, (6) \n\nwhere Ψ is the step function, K_SV is the matrix of dot products between support vectors, p_err^{ℓ−1} is the probability of test error for the machine trained on a sample of size ℓ − 1, and the expectations are taken over the random choice of the sample. \n\n4 Feature Selection for SVMs \n\nIn the problem of feature selection we wish to minimize equation (1) over σ and α. The support vector method attempts to find the function from the set f(x, w, b) = w · Φ(x) + b that minimizes generalization error. We first enlarge the set of functions considered by the algorithm to f(x, w, b, σ) = w · Φ(x ∗ σ) + b. Note that the mapping Φ_σ(x) = Φ(x ∗ σ) can be represented by choosing the kernel function K_σ in equations (3) and (4): \n\nK_σ(x, y) = K((x ∗ σ), (y ∗ σ)) = (Φ_σ(x) · Φ_σ(y)) (7) \n\nfor any K. Thus for these kernels the bounds in Theorems (1) and (2) still hold. Hence, to minimize τ(σ, α) over α and σ we minimize the wrapper functional τ_wrap in equation (2) where τ_alg is given by the equations (5) or (6), choosing a fixed value of σ implemented by the kernel (7). Using equation (5) one minimizes over σ: \n\nR²W²(σ) = R²(σ) W²(α⁰, σ), (8) \n\nwhere the radius R for kernel K_σ can be computed by maximizing (see, e.g. [13]) \n\nR²(σ) = max_β Σ_i β_i K_σ(x_i, x_i) − Σ_{i,j} β_i β_j K_σ(x_i, x_j) (9) \n\nsubject to Σ_i β_i = 1, β_i ≥ 0, i = 1, ..., ℓ, and W²(α⁰, σ) is defined by the maximum of functional (4) using kernel (7). In a similar way, one can minimize the span bound over σ instead of equation (8). \n\nFinding the minimum of R²W² over σ requires searching over all possible subsets of n features, which is a combinatorial problem. To avoid this problem, classical methods of search include greedily adding or removing features (forward or backward selection) and hill climbing. 
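The masked kernel of equation (7) is straightforward to implement. The sketch below (ours) uses an RBF base kernel and checks that a feature with σ_k = 0 has no influence on the kernel values:

```python
import numpy as np

def k_sigma(X1, X2, sigma, gamma=1.0):
    """Masked kernel of equation (7): K_sigma(x, y) = K(x*sigma, y*sigma),
    here with an RBF base kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    A, B = X1 * sigma, X2 * sigma
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Two points that differ only in feature 2; masking that feature out
# (sigma_2 = 0) makes them identical in feature space.
X = np.array([[1.0, 5.0],
              [1.0, -7.0]])
K = k_sigma(X, X, sigma=np.array([1.0, 0.0]))
print(K[0, 1])  # 1.0: the masked feature no longer separates the points
```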
All of these methods are expensive to compute if n is large. \n\nAs an alternative to these approaches we suggest the following method: approximate the binary valued vector σ ∈ {0, 1}^n with a real valued vector σ ∈ ℝ^n. Then, to find the optimum value of σ one can minimize R²W², or some other differentiable criterion, by gradient descent. As explained in [4] the derivative of our criterion is: \n\n∂R²W²(σ)/∂σ_k = R²(σ) ∂W²(α⁰, σ)/∂σ_k + W²(α⁰, σ) ∂R²(σ)/∂σ_k (10) \n\n∂R²(σ)/∂σ_k = Σ_i β_i⁰ ∂K_σ(x_i, x_i)/∂σ_k − Σ_{i,j} β_i⁰ β_j⁰ ∂K_σ(x_i, x_j)/∂σ_k (11) \n\n∂W²(α⁰, σ)/∂σ_k = −Σ_{i,j} α_i⁰ α_j⁰ y_i y_j ∂K_σ(x_i, x_j)/∂σ_k (12) \n\nWe estimate the minimum of τ(σ, α) by minimizing equation (8) in the space σ ∈ ℝ^n using the gradients (10) with the following extra term which approximates integer programming: \n\nR²W²(σ) + λ Σ_{i=1}^{n} (σ_i)^p (13) \n\nsubject to Σ_i σ_i = m, σ_i ≥ 0, i = 1, ..., n. For large enough λ, as p → 0, only m elements of σ will be nonzero, approximating optimization problem τ(σ, α). One can further simplify computations by considering a stepwise approximation procedure to find m features. To do this one can minimize R²W²(σ) with σ unconstrained. One then sets the q ≪ n smallest values of σ to zero, and repeats the minimization until only m nonzero elements of σ remain. This can mean repeatedly training an SVM just a few times, which can be fast. \n\n5 Experiments \n\n5.1 Toy data \n\nWe compared standard SVMs, our feature selection algorithms, and three classical filter methods to select features followed by SVM training. The three filter methods chose the m largest features according to: Pearson correlation coefficients, the Fisher criterion score¹, and the Kolmogorov-Smirnov test². The Pearson coefficients and Fisher criterion cannot model nonlinear dependencies. 
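The stepwise procedure of Section 4 can be sketched as follows. The loop structure mirrors the description above, but the scaling criterion plugged in (absolute least-squares weights) is only a runnable stand-in for minimizing R²W²(σ) by gradient descent, and the names `stepwise_select`, `scale_fn` and `lsq_scale` are ours:

```python
import numpy as np

def stepwise_select(X, y, m, q, scale_fn):
    """Stepwise approximation: repeatedly obtain a real-valued scaling
    vector sigma for the surviving features (via scale_fn, standing in
    for minimizing R^2 W^2 by gradient descent), zero out the q smallest
    entries, and stop when only m features remain."""
    active = np.arange(X.shape[1])
    while active.size > m:
        sigma = scale_fn(X[:, active], y)        # real-valued scalings
        drop = min(q, active.size - m)
        order = np.argsort(np.abs(sigma))        # smallest scalings first
        active = np.delete(active, order[:drop])
    return active

def lsq_scale(X, y):
    # Stand-in criterion (NOT the paper's bound): least-squares weights.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X[:, 2].copy()                 # only feature 2 determines the target
print(stepwise_select(X, y, m=1, q=1, scale_fn=lsq_scale))  # [2]
```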
\n\nIn the two following artificial datasets our objective was to assess the ability of the algorithm to select a small number of target features in the presence of irrelevant and redundant features. \n\n¹ F(r) = (μ_r⁺ − μ_r⁻)² / ((σ_r⁺)² + (σ_r⁻)²), where μ_r^± is the mean value of the r-th feature in the positive and negative classes and (σ_r^±)² is the corresponding variance. \n\n² KS_tst(r) = √ℓ sup ( P̂{X ≤ f_r} − P̂{X ≤ f_r, y_r = 1} ), where f_r denotes the r-th feature from each training example, and P̂ is the corresponding empirical distribution. \n\nLinear problem Six dimensions of 202 were relevant. The probability of y = 1 or −1 was equal. The first three features {x_1, x_2, x_3} were drawn as x_i = y N(i, 1) and the second three features {x_4, x_5, x_6} were drawn as x_i = N(0, 1) with a probability of 0.7; otherwise the first three were drawn as x_i = N(0, 1) and the second three as x_i = y N(i − 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., 202. \n\nNonlinear problem Two dimensions of 52 were relevant. The probability of y = 1 or −1 was equal. The data are drawn from the following: if y = −1 then {x_1, x_2} are drawn from N(μ_1, Σ) or N(μ_2, Σ) with equal probability, μ_1 = {−3, −3} and μ_2 = {3, 3}, and Σ = I; if y = 1 then {x_1, x_2} are drawn again from two normal distributions with equal probability, with μ_1 = {3, −3} and μ_2 = {−3, 3} and the same Σ as before. The rest of the features are noise, x_i = N(0, 20), i = 3, ..., 52. \n\nIn the linear problem the first six features have redundancy and the rest of the features are irrelevant. In the nonlinear problem all but the first two features are irrelevant. \n\nWe used a linear SVM for the linear problem and a second order polynomial kernel for the nonlinear problem. For the filter methods and the SVM with feature selection we selected the 2 best features. 
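The Fisher criterion of footnote 1 is easy to compute per feature. This sketch (ours) scores an informative feature against one whose class means coincide:

```python
import numpy as np

def fisher_score(X, y):
    """Fisher criterion score of footnote 1, per feature r:
    F(r) = (mu_r_plus - mu_r_minus)^2 / (var_r_plus + var_r_minus)."""
    Xp, Xn = X[y == 1], X[y == -1]
    return (Xp.mean(0) - Xn.mean(0)) ** 2 / (Xp.var(0) + Xn.var(0))

y = np.array([1, 1, -1, -1])
X = np.array([[2.0, 0.0],
              [2.2, 1.0],
              [0.0, 1.0],
              [0.2, 0.0]])
scores = fisher_score(X, y)
print(scores)  # feature 0 separates the classes, feature 1 does not
```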
\n\nThe results are shown in Figure (1) for various training set sizes, taking the average test error on 500 samples over 30 runs of each training set size. The Fisher score (not shown in graphs due to space constraints) performed almost identically to correlation coefficients. In both problems standard SVMs perform poorly: in the linear example using ℓ = 500 points one obtains a test error of 13% for SVMs, which should be compared to a test error of 3% with ℓ = 50 using our methods. Our SVM feature selection methods also outperformed the filter methods, with forward selection being marginally better than gradient descent. In the nonlinear problem, among the filter methods only the Kolmogorov-Smirnov test improved performance over standard SVMs. \n\n[Figure 1 legend: Span-Bound & Forward Selection; R²W²-Bound & Gradient; Standard SVMs; Correlation Coefficients; Kolmogorov-Smirnov Test.] \n\nFigure 1: A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points. \n\n5.2 Real-life data \n\nFor the following problems we compared minimizing R²W² via gradient descent to the Fisher criterion score. 
\n\nFace detection The face detection experiments described in this section are for the system introduced in [12, 5]. The training set consisted of 2,429 positive images of frontal faces of size 19×19 and 13,229 negative images not containing faces. The test set consisted of 105 positive images and 2,000,000 negative images. A wavelet representation of these images [5] was used, which resulted in 1,740 coefficients for each image. \n\nPerformance of the system using all coefficients, 725 coefficients, and 120 coefficients is shown in the ROC curve in figure (2a). The best results were achieved using all features, however R²W² outperformed the Fisher score. In this case feature selection was not useful for eliminating irrelevant features, but one could obtain a solution with comparable performance but reduced complexity, which could be important for time-critical applications. \n\nPedestrian detection The pedestrian detection experiments described in this section are for the system introduced in [11]. The training set consisted of 924 positive images of people of size 128×64 and 10,044 negative images not containing pedestrians. The test set consisted of 124 positive images and 800,000 negative images. A wavelet representation of these images [5, 11] was used, which resulted in 1,326 coefficients for each image. \n\nPerformance of the system using all coefficients and 120 coefficients is shown in the ROC curve in figure (2b). The results showed the same trends that were observed in the face recognition problem. \n\nFigure 2: ROC curves; the x-axis is the false positive rate on a logarithmic scale. The solid line is using all features, the solid line with a circle is our feature selection method (minimizing R²W² by gradient descent) and the dotted line is the Fisher score. (a) The top ROC curves are for 725 features and the bottom ones for 120 features for face detection. 
(b) ROC curves using all features and 120 features for pedestrian detection. \n\nCancer morphology classification For DNA microarray data analysis one needs to determine the relevant genes in discrimination as well as discriminate accurately. We look at two leukemia discrimination problems [6, 10] and a colon cancer problem [1] (see also [7] for a treatment of both of these problems). \n\nThe first problem was classifying myeloid and lymphoblastic leukemias based on the expression of 7129 genes. The training set consists of 38 examples and the test set of 34 examples. Using all genes a linear SVM makes 1 error on the test set. Using 20 genes 0 errors are made for R²W² and 3 errors are made using the Fisher score. Using 5 genes 1 error is made for R²W² and 5 errors are made for the Fisher score. The method of [6] performs comparably to the Fisher score. \n\nThe second problem was discriminating B versus T cells for lymphoblastic cells [6]. Standard linear SVMs make 1 error for this problem. Using 5 genes 0 errors are made for R²W² and 3 errors are made using the Fisher score. \n\nIn the colon cancer problem [1] 62 tissue samples probed by oligonucleotide arrays contain 22 normal and 40 colon cancer tissues that must be discriminated based upon the expression of 2000 genes. Splitting the data into a training set of 50 and a test set of 12 in 50 separate trials, we obtained a test error of 13% for standard linear SVMs. Taking 15 genes for each feature selection method we obtained 12.8% for R²W², 17.0% for Pearson correlation coefficients, 19.3% for the Fisher score and 19.2% for the Kolmogorov-Smirnov test. Our method is only worse than the best filter method in 8 of the 50 trials. \n\n6 Conclusion \n\nIn this article we have introduced a method to perform feature selection for SVMs. 
This method is computationally feasible for high dimensional datasets compared to existing wrapper methods, and experiments on a variety of toy and real datasets show superior performance to the filter methods tried. This method, amongst other applications, speeds up SVMs for time-critical applications (e.g. pedestrian detection), and makes possible feature discovery (e.g. gene discovery). Secondly, in simple experiments we showed that SVMs can indeed suffer in high dimensional spaces where many features are irrelevant. Our method provides one way to circumvent this naturally occurring, complex problem. \n\nReferences \n\n[1] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96:6745-6750, 1999. \n\n[2] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245-271, 1997. \n\n[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 13th International Conference on Machine Learning, pages 82-90, San Francisco, CA, 1998. \n\n[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters for support vector machines. Machine Learning, 2000. \n\n[5] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Asian Conference on Computer Vision, 2000. \n\n[6] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999. \n\n[7] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. 
Gene selection for cancer classification using support vector machines. Machine Learning, 2000. \n\n[8] T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, 2000. \n\n[9] R. Kohavi. Wrappers for feature subset selection. AIJ special issue on relevance, 1995. \n\n[10] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machine classification of microarray data. AI Memo 1677, Massachusetts Institute of Technology, 1999. \n\n[11] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proc. Computer Vision and Pattern Recognition, pages 193-199, Puerto Rico, June 16-20, 1997. \n\n[12] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision, Bombay, India, January 1998. \n\n[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.", "award": [], "sourceid": 1850, "authors": [{"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Sayan", "family_name": "Mukherjee", "institution": null}, {"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}