{"title": "Adaptive Scaling for Feature Selection in SVMs", "book": "Advances in Neural Information Processing Systems", "page_first": 569, "page_last": 576, "abstract": null, "full_text": "Adaptive Scaling for Feature Selection in SVMs\n\nYves Grandvalet\n\nHeudiasyc, UMR CNRS 6599,\n\nUniversité de Technologie de Compiègne,\n\nCompiègne, France\n\nStéphane Canu\n\nPSI,\n\nINSA de Rouen,\n\nSt Etienne du Rouvray, France\n\nYves.Grandvalet@utc.fr\n\nStephane.Canu@insa-rouen.fr\n\nAbstract\n\nThis paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.\n\n1 Introduction\n\nIn pattern recognition, the problem of selecting relevant variables is difficult. Optimal subset selection is attractive as it yields simple and interpretable models, but it is combinatorial and acknowledged to be unstable [2]. In some problems, it may be better to resort to stable procedures penalizing irrelevant variables. This paper introduces such a procedure applied to Support Vector Machines (SVM).\n\nThe relevance of input features may be measured by continuous weights or scale factors, which define a diagonal metric in input space. Feature selection then consists in determining a sparse diagonal metric, and sparsity can be encouraged by constraining an appropriate norm on scale factors. Our approach can be summarized as a global optimization problem pertaining to 1) the parameters of the SVM classifier, and 2) the parameters of the feature space mapping defining the metric in input space. As in standard SVMs, only two tunable hyper-parameters have to be set: the penalization of training errors, and the magnitude of kernel bandwidths. In this formalism, we derive an efficient algorithm to monitor slack variables when optimizing the metric. The resulting algorithm is fast and stable.\n\nAfter presenting previous approaches to hard and soft feature selection in the context of SVMs, we describe our algorithm. This exposition is followed by an experimental section illustrating its performance, and by concluding remarks.\n\n2 Feature Selection via adaptive scaling\n\nScaling is a usual preprocessing step, which has important outcomes in many classification methods including SVM classifiers [9, 3]. It is defined by a linear transformation within the input space: x̃ = diag(σ) x, where diag(σ) is the diagonal matrix built from the vector of scale factors σ = (σ_1, …, σ_d). Adaptive scaling consists in letting σ be adapted during the estimation process with the explicit aim of achieving a better recognition rate. For kernel classifiers, σ is a set of hyper-parameters of the learning process. According to the structural risk minimization principle [8], σ can be tuned in two ways:\n\n1. 
estimate the parameters of the classifier f_σ by empirical risk minimization for several values of σ, so as to produce a structure of classifiers f_σ multi-indexed by σ = (σ_1, …, σ_d). Then select one element of the structure by finding the set (σ_1, …, σ_d) minimizing some estimate of generalization error.\n\n2. estimate the parameters of the classifier f and the hyper-parameters σ by empirical risk minimization, while a second level hyper-parameter, say σ_0, constrains (σ_1, …, σ_d) in order to avoid overfitting. This procedure produces a structure of classifiers indexed by σ_0, whose value is computed by minimizing some estimate of generalization error.\n\nThe usual paradigm consists in computing the estimate of generalization error for regularly spaced hyper-parameter values and picking the best solution among all trials. Hence, the first approach requires intensive computation, since the trials should be completed over a d-dimensional grid of σ values.\n\nSeveral authors suggested addressing this problem by optimizing an estimate of generalization error with respect to the hyper-parameters. For SVM classifiers, Cristianini et al. [4] first proposed to apply an iterative optimization scheme to estimate a single kernel width hyper-parameter. Weston et al. [9] and Chapelle et al. [3] generalized this approach to multiple hyper-parameters in order to perform adaptive scaling and variable selection.\n\nThe experimental results in [9, 3] show the benefits of this optimization. However, relying on the optimization of generalization error estimates over many hyper-parameters is hazardous. 
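As a minimal illustration of what is being tuned, assuming a Gaussian kernel (a common choice; the paper does not fix a kernel at this point, and the function name below is ours), the scale factors act on a kernel classifier by rescaling its inputs, and a zero scale factor removes the corresponding feature from the metric:

```python
import numpy as np

def scaled_rbf_kernel(x, z, sigma):
    """Gaussian kernel on scaled inputs:
    k_sigma(x, z) = exp(-0.5 * ||diag(sigma) (x - z)||^2).
    A zero scale factor sigma_j removes feature j from the input metric."""
    x, z, sigma = (np.asarray(a, dtype=float) for a in (x, z, sigma))
    diff = sigma * (x - z)  # per-feature scaled differences
    return float(np.exp(-0.5 * np.dot(diff, diff)))

# The two points differ only in feature 2; with sigma_2 = 0 that
# difference is invisible to the kernel:
print(scaled_rbf_kernel([1.0, 5.0], [1.0, -5.0], sigma=[1.0, 0.0]))  # -> 1.0
```

Searching a grid over all d such scale factors is what makes the first strategy above intractable.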
Once optimized, the unbiased estimates become down-biased, and the bounds provided by VC-theory usually hold for kernels defined a priori (see the proviso on the radius/margin bound in [8]). Optimizing these criteria may thus result in overfitting.\n\nIn the second solution, considered here, the estimate of generalization error is minimized with respect to σ_0, a single (second level) hyper-parameter, which constrains (σ_1, …, σ_d). The role of this constraint is twofold: control the complexity of the classifier, and encourage variable selection in input space. This approach is related to some successful soft-selection procedures, such as lasso and bridge [5] in the frequentist framework, and Automatic Relevance Determination (ARD) [7] in the Bayesian framework. Note that this type of optimization procedure has been proposed for linear SVMs in both the frequentist [1] and Bayesian [6] frameworks. Our method generalizes this approach to nonlinear SVMs.\n\n3 Algorithm\n\n3.1 Support Vector Machines\n\nThe decision function provided by SVM is sign(f(x)), where the function f is defined as\n\nf(x) = ∑_{i=1}^n α_i y_i k(x_i, x) + b, (1)\n\nand where the coefficients α_i and the offset b are the solution of the standard quadratic optimization problem\n\nmax_α ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j k(x_i, x_j) subject to 0 ≤ α_i ≤ C and ∑_i y_i α_i = 0, (2)\n\nwhere the hyper-parameter C penalizes training errors.\n\n3.2 Adaptive scaling\n\nScale factors enter the classifier through the kernel, which is evaluated on scaled inputs: k_σ(x, x') = k(diag(σ) x, diag(σ) x'). For a fixed σ, problems (1)-(2) are unchanged, with k replaced by k_σ. Treating σ as additional parameters of the classifier, and constraining their norm to encourage sparsity, yields the global optimization problem\n\nmin_{h, b, ξ, σ} (1/2)‖h‖² + C ∑_i ξ_i subject to y_i (h(diag(σ) x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∑_{j=1}^d σ_j^p ≤ σ_0^p, σ_j ≥ 0, (3)\n\nwhere h is the function in the feature space induced by k, and where the constraint on the ℓ_p norm of σ encourages sparse solutions.\n\n3.3 An alternated optimization 
scheme\n\nProblem (3) is complex; we propose to solve iteratively a series of simpler problems. The function f is first optimized with respect to the parameters (α, b) for a fixed mapping of the feature space (standard SVM problem). Then, the parameters σ of the feature space mapping are optimized while some characteristics of f are kept fixed. At step t, starting from a given value σ^t, the optimal (α^t, b^t) are computed by solving the standard quadratic optimization problem (2). Our implementation, based on an interior point method, will not be detailed here. Several SVM retrainings are necessary, but they are faster than the usual training since the algorithm is initialized appropriately with the solutions of the preceding round. Then, σ^{t+1} is determined by a descent algorithm.\n\nFor solving the minimization problem with respect to σ, we use a reduced conjugate gradient technique. The optimization problem was simplified by assuming that some of the other variables are fixed. We tried several versions: 1) f fixed; 2) Lagrange multipliers α fixed; 3) set of support vectors fixed. For the three versions, the optimal value of b, or at least the optimal value of the slack variables ξ, can be obtained by solving a linear program, whose optimum is computed directly (in a single iteration). We do not detail our first version here, since the last two performed much better. The main steps of the last two versions are sketched below.\n\n3.4 Scaling parameters update\n\nStarting from an initial solution (α^t, b^t, σ^t), our goal is to update σ by solving a simple intermediate problem providing an improved solution to the global problem (3). 
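The overall alternated scheme can be sketched as follows; fit_svm and sigma_step are hypothetical placeholders standing in for the quadratic problem (2) and for the descent step on the metric, and are not part of the paper:

```python
import numpy as np

def alternated_scaling(X, y, sigma0, n_rounds, fit_svm, sigma_step):
    """Alternate between standard SVM training on scaled inputs and a
    descent step on the scale factors sigma (sketch of the scheme)."""
    sigma = np.full(X.shape[1], float(sigma0))
    for _ in range(n_rounds):
        alpha, b = fit_svm(X * sigma, y)           # standard SVM problem (2)
        sigma = sigma_step(X, y, alpha, b, sigma)  # descent step on the metric
        sigma = np.clip(sigma, 0.0, None)          # scale factors stay nonnegative
    return sigma
```

Warm-starting fit_svm from the previous round's solution is what keeps the repeated retrainings cheap in the implementation described above.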
We first assume that the Lagrange multipliers α^t defining f^t are not affected by the update of σ, so that the updated classifier f̃ is defined as f̃(x) = ∑_{i=1}^n α_i^t y_i k_σ(x_i, x) + b. Regarding problem (3), f̃ is sub-optimal when σ varies; nevertheless, f̃ is guaranteed to be an admissible solution. Hence, we minimize an upper bound of the original primal cost, which guarantees that any admissible update of the intermediate problem (providing a decrease of its cost) will also provide a decrease of the cost of the original problem.\n\nThe intermediate optimization problem is stated as follows:\n\nmin_{b, ξ, σ} (1/2) ∑_{i,j} α_i^t α_j^t y_i y_j k_σ(x_i, x_j) + C ∑_i ξ_i subject to y_i f̃(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0, ∑_{j=1}^d σ_j^p ≤ σ_0^p, σ_j ≥ 0. (4)\n\nFor given values of σ, the optimal value of b and the slack variables ξ are the solution of a linear program, which can also be stated in dual form. 
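For a fixed offset b (a simplifying assumption of this sketch), the optimal slack variables of the intermediate problem reduce to hinge losses, so the intermediate cost can be evaluated as a plain function of σ; the Gaussian kernel below is likewise an assumption of the example, not imposed by the paper:

```python
import numpy as np

def intermediate_cost(X, y, alpha, C, sigma, b=0.0):
    """Upper bound of the primal cost for fixed multipliers alpha and a
    fixed offset b: the margin term uses the sigma-scaled Gaussian kernel,
    and the optimal slacks are hinge losses xi_i = max(0, 1 - y_i f(x_i))."""
    Xs = X * sigma                                         # scaled inputs
    sq = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * sq)                                  # k_sigma(x_i, x_j)
    ay = alpha * y
    margin = 0.5 * ay @ K @ ay                             # (1/2) ||h||^2 term
    f = K @ ay + b                                         # f(x_i) on the training set
    slack = np.maximum(0.0, 1.0 - y * f)                   # optimal slack variables
    return margin + C * slack.sum()
```

Since alpha is held fixed, this objective can be fed directly to a descent routine over sigma (subject to the norm constraint).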
This linear problem is solved directly by the following algorithm: 1) sort the classifier outputs in descending order, for all positive examples on the one side and for all negative examples on the other side; 2) compute the pairwise sums of the sorted values; 3) set the solution from the pairs of positive and negative examples whose sum is positive. The optimal slack variables and their derivatives with respect to σ are then easily computed. Parameters σ are then updated by a reduced conjugate gradient technique, i.e. a conjugate gradient algorithm ensuring that the set of constraints on σ is always verified.\n\n3.5 Updating Lagrange multipliers\n\nAssume now that only the set of support vectors remains fixed while optimizing σ. This assumption is used to derive a rule to update, at a reasonable computing cost, the Lagrange multipliers α together with σ. At (α, b), the following holds [3]:\n\n1. for support vectors of the first category (such that 0 < α_i < C): y_i f(x_i) = 1; (5)\n\n2. for support vectors of the second category (such that ξ_i > 0): α_i = C and ξ_i = 1 − y_i f(x_i). (6)\n\nFrom these equations, and the assumption that support vectors remain support vectors (and that their category does not change), one derives a system of linear equations defining the derivatives of α and b with respect to σ [3]:\n\n1. for support vectors of the first category: d(y_i f(x_i))/dσ = 0; (7)\n\n2. 
for support vectors of the second category: dα_i/dσ = 0; (8)\n\n3. finally, the system is completed by stating that the Lagrange multipliers should obey the constraint ∑_i y_i α_i = 0:\n\n∑_i y_i dα_i/dσ = 0. (9)\n\nα is updated from these equations, and the step size is limited to ensure that 0 ≤ α_i ≤ C remains satisfied for support vectors of the first category. Hence, in this version also, f̃ is an admissible sub-optimal solution regarding problem (3).\n\n4 Experiments\n\nIn the experiments reported below, the value of p for the constraint on σ was held fixed. The scale parameters were optimized with the last version of the algorithm, where the set of support vectors is assumed to be fixed. Finally, the hyper-parameters (C, σ_0) were chosen using the span bound [3]. Although the value of the bound itself was not a faithful estimate of test error, the average loss induced by using the minimizer of these bounds was quite small.\n\n4.1 Toy experiment\n\nIn [9], Weston et al. compared two versions of their feature selection algorithm to standard SVMs and to filter methods (i.e. preprocessing methods selecting features based either on Pearson correlation coefficients, the Fisher criterion score, or the Kolmogorov-Smirnov statistic). Their artificial data benchmarks provide a basis for comparing our approach with theirs, which is based on the minimization of error bounds. Two types of distributions are provided, whose detailed characteristics are not given here. In the linear problem, 6 dimensions out of 202 are relevant. In the nonlinear problem, two features out of 52 are relevant. For each distribution, 30 experiments are conducted, and the average test recognition rate measures the performance of each method.\n\nFor both problems, standard SVMs achieve a 50% error rate in the considered range of training set sizes. Our results are shown in Figure 1.\n\nFigure 1: Results obtained on the benchmarks of [9]. Left: linear problem; right: nonlinear problem. The number of training examples is represented on the x-axis, and the average test error rate on the y-axis.\n\nOur test performances are qualitatively similar to the ones obtained by gradient descent on the radius/margin bound in [9], which are only improved by the forward selection algorithm minimizing the span bound. Note however that the results of Weston et al. are obtained after the correct number of features was specified by the user, whereas the present results were obtained fully automatically. Knowing the number of features that should be selected by the algorithm is somewhat similar to selecting the optimal value of parameter p for each σ_0.\n\nIn the nonlinear problem, averages of 26.5 and 6.6 features are selected, depending on the training set size. These figures show that although our feature selection scheme is effective, it should be more stringent: a smaller value of p would be more appropriate for this type of problem. The two relevant variables are selected in the vast majority of cases for the two largest sample sizes, and for these two sample sizes they are even always ranked first and second.\n\nRegarding training times, the optimization of σ required an average of over 100 times more computing time than standard SVM fitting for the linear problem, and 40 times more for the nonlinear problem. These increases scale less than linearly with the number of variables, and can certainly still be improved.\n\n4.2 Expression recognition\n\nWe also tested our algorithm on a more demanding task, to test its ability to handle a large number of features. 
The considered problem consists in recognizing the happiness expression among the five other facial expressions corresponding to universal emotions (disgust, sadness, fear, anger, and surprise). The data sets are made of gray level images of frontal faces, with standardized positions of eyes, nose and mouth, split into a training set and a test set of positive and negative examples.\n\nWe used the raw pixel representation of the images, resulting in 4200 highly correlated features. For this task, the accuracy of standard SVMs is 92.6% (11 test errors). The recognition rate is not significantly affected by our feature selection scheme (10 errors), but more than 1300 pixels are considered to be completely irrelevant at the end of the iterative procedure (estimating σ required about 80 times more computing time than standard SVM training). This selection brings some important clues for building relevant attributes for the facial expression recognition task.\n\nFigure 2 represents the scaling factors σ, where black is zero and white represents the highest value. We see that, according to the classifier, the relevant areas for recognizing the happiness expression are mainly in the mouth area, especially on the mouth wrinkles, and to a lesser extent in the whites of the eyes (which detect open eyes) and the outer eyebrows. On the right hand side of this figure, we display masked support faces, i.e. support faces scaled by the expression mask. Although many features important for the identity of people are lost, the expression is still visible on these faces. Areas irrelevant for the recognition task (forehead, nose, and upper cheeks) have been erased or softened by the expression mask.\n\n5 Conclusion\n\nWe have introduced a method to perform automatic relevance determination and feature selection in nonlinear SVMs. 
Our approach considers that the metric in input space defines a set of parameters of the SVM classifier. The update of the scale factors is performed by iteratively minimizing an approximation of the SVM cost, which is efficiently minimized with respect to the slack variables when the metric varies. The approximation of the cost function is tight enough to allow large updates of the metric when necessary. Furthermore, because each step of the algorithm guarantees a decrease of the global cost, the procedure is stable.\n\nFigure 2: Left: expression mask of happiness provided by the scaling factors σ. Right, top row: the two positive masked support faces; right, bottom row: four negative masked support faces.\n\nPreliminary experimental results show that the method provides sensible results in a reasonable time, even in very high dimensional spaces, as illustrated on a facial expression recognition task. In terms of test recognition rates, our method is comparable with [9, 3]. Further comparisons are still needed to demonstrate the practical merits of each paradigm.\n\nFinally, it may also be beneficial to mix the two approaches: the method of Cristianini et al. [4] could be used to determine σ_0 and C. The resulting algorithm would differ from [9, 3], since the relative relevance of each feature (as measured by σ_j) would be estimated by empirical risk minimization, instead of being driven by an estimate of generalization error.\n\nReferences\n\n[1] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning, pages 82–90. Morgan Kaufmann, San Francisco, CA, 1998.\n\n[2] L. Breiman. 
Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996.\n\n[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.\n\n[4] N. Cristianini, C. Campbell, and J. Shawe-Taylor. Dynamically adapting kernels in support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.\n\n[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2001.\n\n[6] T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, 2000.\n\n[7] R. M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, 1996.\n\n[8] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Series in Statistics. Springer, 1995.\n\n[9] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13. MIT Press, 2000.", "award": [], "sourceid": 2156, "authors": [{"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "St\u00e9phane", "family_name": "Canu", "institution": null}]}