{"title": "A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound", "book": "Advances in Neural Information Processing Systems", "page_first": 1065, "page_last": 1072, "abstract": null, "full_text": "A feature selection algorithm based on the global\n minimization of a generalization error bound\n\n\n Dori Peleg Ron Meir\n Department of Electrical Engineering Department of Electrical Engineering\n Technion Technion\n Haifa, Israel Haifa, Israel\n dorip@tx.technion.ac.il rmeir@tx.technion.ac.il\n\n\n Abstract\n\n A novel linear feature selection algorithm is presented based on the\n global minimization of a data-dependent generalization error bound.\n Feature selection and scaling algorithms often lead to non-convex opti-\n mization problems, which in many previous approaches were addressed\n through gradient descent procedures that can only guarantee convergence\n to a local minimum. We propose an alternative approach, whereby the\n global solution of the non-convex optimization problem is derived via\n an equivalent optimization problem. Moreover, the convex optimization\n task is reduced to a conic quadratic programming problem for which effi-\n cient solvers are available. Highly competitive numerical results on both\n artificial and real-world data sets are reported.\n\n\n1 Introduction\n\nThis paper presents a new approach to feature selection for linear classification where the\ngoal is to learn a decision rule from a training set of pairs Sn = x(i), y(i) n , where\n i=1\nx(i) Rd are input patterns and y(i) {-1,1} are the corresponding labels. The goal\nof a classification algorithm is to find a separating function f(), based on the training set,\nwhich will generalize well, i.e. classify new patterns with as few errors as possible. Feature\nselection schemes often utilize, either explicitly or implicitly, scaling variables, {j}d ,\n j=1\nwhich multiply each feature. 
The aim of such schemes is to optimize an objective function over σ ∈ R^d. Feature selection can be viewed as the case σ_j ∈ {0, 1}, j = 1, ..., d, where feature j is removed if σ_j = 0. The more general case of feature scaling, i.e. σ_j ∈ R_+, is considered here. Clearly feature selection is a special case of feature scaling.
The overwhelming majority of feature selection algorithms in the literature separate the feature selection and classification tasks, while solving either a combinatorial or a non-convex optimization problem (e.g. [1], [2], [3], [4]). In either case there is no guarantee of efficiently locating a global optimum. This is particularly problematic in large scale classification tasks, which may initially contain several thousand features. Moreover, the objective function of many feature selection algorithms is unrelated to the Generalization Error (GE). Even for global solutions of such algorithms there is no theoretical guarantee of proximity to the minimum of the GE.
To overcome the above shortcomings we propose a feature selection algorithm based on the Global Minimization of an Error Bound (GMEB). This approach simultaneously finds the optimal classifier and the scaling factor of each feature by minimizing a GE bound. As in previous feature selection algorithms, a non-convex optimization problem must be solved. A novelty of this paper is the use of the concept of equivalent optimization problems, whereby a global optimum is guaranteed in polynomial time.
The development of the GMEB algorithm begins with the design of a GE bound for feature selection. This is followed by formulating an optimization problem which minimizes this bound. Invariably, the resulting problem is non-convex. 
To avoid the drawbacks of solving non-convex optimization problems, an equivalent convex optimization problem is formulated, whereby the exact global optimum of the non-convex problem can be computed. Next the dual problem is derived and formulated as a Conic Quadratic Programming (CQP) problem. This is advantageous because efficient CQP algorithms are available. Comparative numerical results on both artificial and real-world datasets are reported.
The notation and definitions were adopted from [5]. All vectors are column vectors unless transposed. Mathematical operators on scalars, such as the square root, are extended to vectors by operating componentwise. The notation R_+ denotes the nonnegative real numbers. The notation x ⪯ y denotes componentwise inequality between the vectors x and y. A vector with all components equal to one is denoted by 1. The domain of a function f is denoted by dom f. The set of points for which the objective and all the constraint functions are defined is called the domain of the optimization problem, D. For lack of space, only proof sketches are presented; the complete proofs are deferred to the full paper.

2 The Generalization Error Bounds

We establish GE bounds which are used to motivate an effective algorithm for feature scaling. Consider a sample S_n = {(x^(1), y^(1)), ..., (x^(n), y^(n))}, x^(i) ∈ X ⊆ R^d, y^(i) ∈ Y, where the pairs (x^(i), y^(i)) are generated independently from some distribution P. A set of nonnegative variables σ = (σ_1, ..., σ_d)^T is introduced to allow the additional freedom of feature scaling. The scaling variables transform the linear classifiers from f(x) = w^T x + b to f_σ(x) = w^T Σ x + b, where Σ = diag(σ). It may seem at first glance that these classifiers are essentially the same, since w can be redefined as Σw. 
However, the role of σ is to offer an extra degree of freedom to scale the features independently of w, in a way which can be exploited by an optimization algorithm.
For a real-valued classifier f, the 0-1 loss is the probability of error given by P(yf(x) ≤ 0) = E I(yf(x) ≤ 0), where I(·) is the indicator function.

Definition 1 The margin cost function φ_γ : R → R_+ is defined as φ_γ(z) = 1 - z/γ if z ≤ γ, and zero otherwise (note that I(yf(x) ≤ 0) ≤ φ_γ(yf(x))).

Consider a classifier f for which the input features have been rescaled, namely f_σ(x) is used instead of f(x). Let F be some class of functions and let Ê_n be the empirical mean. Using standard GE bounds, one can establish that for any choice of σ, with probability at least 1 - δ, for any f ∈ F

  P(yf_σ(x) ≤ 0) ≤ Ê_n φ_γ(yf_σ(x)) + Δ(f, σ, δ),   (1)

for some appropriate complexity measure Δ depending on the bounding technique. Unfortunately, (1) cannot be used directly when attempting to select optimal values of the variables σ, because the bound is not uniform in σ. In particular, we need a result which holds with probability 1 - δ for every choice of σ.

Definition 2 The indices of training patterns with labels -1 and +1 are denoted by I_-, I_+ respectively. The cardinalities of the sets I_-, I_+ are n_-, n_+ respectively. The empirical means of the second order moment of the jth feature over the training patterns belonging to the indices I_-, I_+ are v-_j = (1/n_-) Σ_{i∈I_-} (x_j^(i))^2 and v+_j = (1/n_+) Σ_{i∈I_+} (x_j^(i))^2 respectively.

Theorem 3 Fix B, r, γ > 0, and suppose that {(x^(i), y^(i))}_{i=1}^n are chosen independently at random according to some probability distribution P on X × {±1}, where ‖x‖_∞ ≤ r for x ∈ X. Define the class of functions

  F = { f : f_σ(x) = w^T Σ x + b, ‖w‖ ≤ B, |b| ≤ r, σ ⪰ 0 }.

Let σ_0 be an arbitrary positive number, and set l~_j = 2 max(σ_j, σ_0). Then with probability at least 1 - δ, for every function f ∈ F

  P(yf_σ(x) ≤ 0) ≤ Ê_n φ_γ(yf_σ(x)) + (2B/(γn)) [ √(n_+ Σ_{j=1}^d v+_j l~_j^2) + √(n_- Σ_{j=1}^d v-_j l~_j^2) ] + Δ(σ, γ, δ),   (2)

where K(σ) = (B‖l~‖ + 1)r and

  Δ(σ, γ, δ) = (1/√n) [ 2r/γ + K(σ) √(2 ln Σ_{j=1}^d log_2(l~_j/(2σ_0))) + (K(σ) + 1) √(2 ln(2/δ)) ].

Proof sketch We begin by assuming a fixed upper bound on the values of σ_j, say σ_j ≤ s_j, j = 1, 2, ..., d. This allows us to use the methods developed in [6] in order to establish upper bounds on the Rademacher complexity of the subclass of F for which σ_j ≤ s_j for all j. Finally, a simple variant of the union bound (the so-called multiple testing lemma) is used in order to obtain a bound which is uniform with respect to σ (see the proof technique of Theorem 10 in [6]).
In principle, we would like to minimize the r.h.s. of (2) with respect to the variables w, σ, b. However, in this work the focus is only on the data-dependent terms in (2), which include the empirical error term and the weighted norms of σ. Note that all other terms of (2) are of the same order of magnitude (as a function of n), but do not depend explicitly on the data. It should be commented that the extra terms appearing in the bound arise because of the assumed unboundedness of σ. Assuming σ to be bounded, e.g. σ ⪯ s, as is the case in most other bounds in the literature, one may replace σ by s in all terms except the first two, thus removing the explicit dependence on σ.
The data-dependent terms of the GE bound (2) are the basis of the objective function

  (1/n) Σ_{i=1}^n φ_γ(y^(i) f_σ(x^(i))) + (C_+ √n_+ / n) √(Σ_{j=1}^d v+_j σ_j^2) + (C_- √n_- / n) √(Σ_{j=1}^d v-_j σ_j^2),   (3)

where C_+ = C_- = 4 and the variables are subject to w^T w ≤ 1, σ ⪰ 0. The transition was performed by setting B = 1 and replacing l~ by 2σ (assuming that σ_j ≥ σ_0 for all j).
Due to the fact that only the sign of f_σ determines the estimated labels, it can be multiplied by any positive factor and produce identical results. 
The constraint on the norm of w induces a normalization on the classifier f(x) = w^T x + b, without which the classifier is not unique. However, by introducing the scale variables σ, the classifier was transformed to f_σ(x) = w^T Σ x + b. Hence, despite the constraint on w, the classifier is again not unique. If the variable γ in (3) is set to an arbitrary positive constant, then the solution is unique. This is true because γ appears in (3) only in the expressions b/γ, σ_1/γ, ..., σ_d/γ. We chose γ = 1.
The objective function is comprised of two elements: (1) the mean of the penalty on the training errors, and (2) two weighted l2 norms of the scale variables σ. The second element acts as the feature selection element. Note that the values of C_+, C_- following from Theorem 3 depend specifically on the bounding technique used in the proof. To allow more generality and flexibility in practical applications, we propose to turn the norm terms of (3) into inequality constraints which are bounded by hyperparameters R_+, R_- respectively. The interpretation of these hyperparameters is essentially the number of informative features. We propose that R_+, R_- be chosen via a Cross Validation (CV) scheme. These hyperparameters enable fine-tuning a general classifier to a specific classification task, as is done in many other classification algorithms such as the SVM algorithm.
Note that the present bound is sensitive to a shift of the features. Therefore, as a preprocessing step, the features of the training patterns should be set to zero mean and the features of the test set shifted accordingly.

3 The primal non-convex optimization problem

The problem of minimizing (3) with γ = 1 can then be expressed as

  minimize    1^T ξ
  subject to  w^T w ≤ 1
              y^(i) ( Σ_{j=1}^d x_j^(i) σ_j w_j + b ) ≥ 1 - ξ_i,  i = 1, ..., n   (4)
              Σ_{j=1}^d v+_j σ_j^2 ≤ R_+
              Σ_{j=1}^d v-_j σ_j^2 ≤ R_-
              ξ, σ ⪰ 0,

with variables w, σ ∈ R^d, ξ ∈ R^n, b ∈ R. 
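As a concrete illustration, the pieces of the primal problem (4) can be evaluated numerically. The following sketch (plain Python; the function names, argument layout and tolerance handling are our own, not part of the paper) computes the objective 1^T ξ and checks each constraint of (4) for a candidate point (w, σ, ξ, b):

```python
def primal_feasible(w, sigma, xi, b, X, y, v_plus, v_minus, R_plus, R_minus, tol=1e-9):
    """Check a candidate point against the constraints of problem (4):
    w^T w <= 1, the margin constraints with slacks xi, the weighted
    squared-sigma sums bounded by R+ and R-, and xi, sigma >= 0."""
    d, n = len(w), len(X)
    if sum(wj * wj for wj in w) > 1 + tol:
        return False
    for i in range(n):
        margin = y[i] * (sum(X[i][j] * sigma[j] * w[j] for j in range(d)) + b)
        if margin < 1 - xi[i] - tol:
            return False
    if sum(v_plus[j] * sigma[j] ** 2 for j in range(d)) > R_plus + tol:
        return False
    if sum(v_minus[j] * sigma[j] ** 2 for j in range(d)) > R_minus + tol:
        return False
    return all(x >= -tol for x in xi) and all(s >= -tol for s in sigma)

def primal_objective(xi):
    """Objective of (4): the sum of the slack variables, 1^T xi."""
    return sum(xi)
```

Such a checker is only a feasibility oracle; solving (4) itself requires the convexification of Section 3.1.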
Note that the constant factor 1/n multiplying the objective was discarded.

Remark 4 Consider a solution of problem (4) in which σ_j = 0 for some feature j. Only the constraint w^T w ≤ 1 then affects the value of w_j. A unique solution is established by setting σ_j = 0 ⟹ w_j = 0. If the original solution w satisfies the constraint w^T w ≤ 1, then the amended solution also satisfies the constraint and does not affect the value of the objective function.

The functions w_j σ_j in the second set of inequality constraints are neither convex nor concave (in fact they are quasiconcave [5]). To make matters worse, the functions w_j σ_j are multiplied by constants -y^(i) x_j^(i), which can be either positive or negative. Consequently problem (4) is not a convex optimization problem. The objective of Section 3.1 is to find the global minimum of (4) in polynomial time despite its non-convexity.

3.1 Convexification

In this paper the informal definition of equivalent optimization problems is adopted from [5, pp. 130-135]: two optimization problems are called equivalent if from a solution of one, a solution of the other is found, and vice versa. Instead of detailing a complicated formal definition of general equivalence, the specific equivalence relationships utilized in this paper are either formally introduced or cited from [5].
The functions w_j σ_j in problem (4) are not convex, and the signs of the multiplying constants -y^(i) x_j^(i) are data dependent. The only functions that remain convex irrespective of the sign of the constants which multiply them are linear functions. Therefore the functions w_j σ_j must be transformed into linear functions.
However, such a transformation must also maintain the convexity of the objective function and the remaining constraints. For this purpose the change of variables equivalence relationship, described in Appendix A, was utilized. The transformation φ : R^d × R^d → R^d × R^d was applied to the variables w, σ:

  σ_j = √(σ~_j),   w_j = w~_j / √(σ~_j),   j = 1, ..., d,   (5)

where dom φ = {(σ~, w~) | σ~ ⪰ 0}. If σ~_j = 0, then σ_j = w_j = 0 without regard to the value of w~_j, in accordance with Remark 4. Transformation (5) is clearly one-to-one and φ(dom φ) ⊇ D.

Lemma 5 The problem

  minimize    1^T ξ
  subject to  y^(i) ( w~^T x^(i) + b ) ≥ 1 - ξ_i,  i = 1, ..., n
              Σ_{j=1}^d w~_j^2 / σ~_j ≤ 1   (6)
              (v+)^T σ~ ≤ R_+
              (v-)^T σ~ ≤ R_-
              ξ, σ~ ⪰ 0

is convex and equivalent to the primal non-convex problem (4) with transformation (5).

Note that since w~_j = w_j σ_j, the new classifier is simply f_σ(x) = w~^T x + b. Therefore there is no need to invert transformation (5) to obtain the desired classifier. Also, one can use the Schur complement [5] to transform the non-linear constraint of (6) into a sparse linear matrix inequality constraint

  [ diag(σ~)  w~ ]
  [ w~^T       1 ]  ⪰ 0.

Thus problem (6) can be cast as a Semi-Definite Programming (SDP) problem. The primal problem therefore consists of n + 2d + 1 variables, 2n + d + 2 linear inequality constraints and a linear matrix inequality of dimension (d + 1) × (d + 1). Although the primal problem (6) is convex, its size depends heavily on the number of features d, which is typically the bottleneck for feature selection datasets. To alleviate this dependency, the dual problem is formulated.

Theorem 6 (Dual problem) The dual optimization problem associated with problem (6) is

  maximize    1^T α - λ - R_+ λ_+ - R_- λ_-
  subject to  ( Σ_{i=1}^n α_i y^(i) x_j^(i),  2λ,  λ_+ v+_j + λ_- v-_j ) ∈ K_r,  j = 1, ..., d
              α^T y = 0   (7)
              0 ⪯ α ⪯ 1
              λ_+, λ_- ≥ 0,

where K_r is the Rotated Quadratic Cone (RQC) K_r = {(x, y, z) | x^T x ≤ 2yz, y ≥ 0, z ≥ 0} (with a scalar first component in (7)), and the variables are α ∈ R^n and λ, λ_+, λ_- ∈ R.

Theorem 7 (Strong duality) Strong duality holds between problems (6) and (7).

The dual problem (7) is a CQP problem. The number of variables is n + 3; there are 2n + 2 linear inequality constraints, a single linear equality constraint and d RQC inequality constraints. 
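The RQC constraints of the dual can be verified pointwise. A minimal sketch (our own helper, not from the paper) for membership in K_r, stated for a vector first component as in the general definition:

```python
def in_rotated_quadratic_cone(x, y, z, tol=1e-9):
    """Membership test for the rotated quadratic cone
    K_r = {(x, y, z) | x^T x <= 2*y*z, y >= 0, z >= 0},
    where x is a vector and y, z are scalars."""
    return y >= -tol and z >= -tol and sum(v * v for v in x) <= 2.0 * y * z + tol
```

In the dual (7) the first component of each cone constraint is the scalar Σ_i α_i y^(i) x_j^(i), i.e. a one-dimensional x.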
Due to its reduced computational complexity, we used the dual formulation in all the experiments.

4 Experiments

Several algorithms were comparatively evaluated on a number of artificial and real-world two-class problem datasets. The GMEB algorithm was compared to the linear SVM (standard SVM with linear kernel) and the l1 SVM classifier [7].

4.1 Experimental Methodology

The algorithms are compared by two criteria: the number of selected features and the error rates. The weight assigned by a linear classifier to a feature j determines whether it shall be `selected' or `rejected'. This weight must fulfil at least one of the following two requirements:

  1. Absolute measure: |w_j| ≥ τ.
  2. Relative measure: |w_j| / max_j {|w_j|} ≥ τ.

In this paper τ = 0.01 was used. Ideally, τ should be set adaptively. Note that for the GMEB algorithm w~ should be used.
The definition of the error rate is intrinsically entwined with the protocol for determining the hyperparameters. Given an a-priori partitioning of the dataset into training and test sets, the following protocol for determining the values of R_+, R_- and defining the error rate is suggested:

  1. Define a set R of values of the hyperparameters R_+, R_- for all datasets. The set R consists of a predetermined number of values. For each algorithm the cardinality |R| = 49 was used.
  2. Calculate the N-fold CV error for each value of R_+, R_- from the set R on the training set. Five-fold CV was used throughout all the datasets.
  3. Use the classifier with the values of R_+, R_- which produced the lowest CV error to classify the test set. This is the reported error rate.

If the dataset is not partitioned a-priori into a training and test set, it is randomly divided into n_p contiguous pairs of training and `test' sets. Each training set contains n(n_p - 1)/n_p patterns and the corresponding test set consists of n/n_p patterns. Once the dataset is thus partitioned, the above steps 1-3 can be implemented. 
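Steps 1-3 above amount to a grid search over (R_+, R_-) with N-fold cross-validation. A schematic sketch in Python (the `train` and `error_rate` callables are placeholders standing in for an actual solver and evaluation routine; they are not part of the paper):

```python
def choose_hyperparameters(hyper_grid, folds, train, error_rate):
    """Steps 1-3 of the protocol: for each (R+, R-) pair in the grid,
    average the cross-validation error over the folds, and return the
    pair with the lowest CV error."""
    best_pair, best_err = None, float("inf")
    for r_plus, r_minus in hyper_grid:
        cv_err = 0.0
        for train_set, val_set in folds:
            clf = train(train_set, r_plus, r_minus)   # fit on the CV training split
            cv_err += error_rate(clf, val_set)        # evaluate on the held-out split
        cv_err /= len(folds)
        if cv_err < best_err:
            best_pair, best_err = (r_plus, r_minus), cv_err
    return best_pair
```

The chosen pair is then used to train on the full training set and the reported error rate is measured on the test set.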
The error rate and the number of selected features are then defined as the average over the n_p problems. The value n_p = 10 was used for all datasets where an a-priori partitioning was not available.
The hyperparameter set R used for the GMEB algorithm consisted of 7 × 7 = 49 pairs of linearly spaced values between 1 and 10. For the SVM algorithms the set R consisted of the values λ/(1 - λ), where λ ∈ {0.02, 0.04, ..., 0.98}, i.e. 49 linearly spaced values between 0.02 and 0.98.

4.2 Data sets

Tests were performed on the `Linear problem' synthetic datasets as described in [2], and on eight real-world problems. The number of features, the number of patterns and the partitioning into train and test sets of the real-world datasets are detailed in Table 2. The datasets were taken from the UCI repository unless stated otherwise. Dataset (1) is the Wisconsin Diagnostic Breast Cancer `WDBC' dataset, (2) the `Multiple Features' dataset, which was first introduced in [8], (3) the `Internet Advertisements' dataset, which was separated into a training and test set randomly, (4) the `Colon' dataset, taken from [2], (5) the `BUPA' dataset, (6) the `Pima Indians Diabetes' dataset, (7) the `Cleveland heart disease' dataset, and (8) the `Ionosphere' dataset.

Table 1: Mean and standard deviation of the mean of the test error rate percentage on the synthetic datasets, given n training patterns. The number of selected features is in brackets.

  n    SVM                        l1 SVM                     GMEB
  10   46.2 ± 1.9 (197.1 ± 2.1)   49.6 ± 1.9 (77.7 ± 83.8)   33.8 ± 14.2 (3.7 ± 2.1)
  20   44.9 ± 2.1 (196.8 ± 1.9)   38.5 ± 12.7 (10.7 ± 6.1)   13.9 ± 7.2 (4.8 ± 2.7)
  30   43.6 ± 1.7 (196.7 ± 2.8)   27.4 ± 12.4 (14.5 ± 8.7)   7.1 ± 5.6 (5.1 ± 2.3)
  40   41.8 ± 1.9 (197.2 ± 1.8)   19.2 ± 6.9 (16.2 ± 11.1)   5.0 ± 3.5 (5.5 ± 2.1)
  50   41.9 ± 1.8 (196.6 ± 2.6)   16.0 ± 5.3 (18.4 ± 11.3)   3.1 ± 2.7 (5.1 ± 1.8)

Table 2: The real-world datasets and the performance of the algorithms. The set R for the linear SVM algorithm on datasets 1, 5, 6 had to be adjusted to allow convergence.

  Dataset  Feat.  Patt.     Linear SVM                  l1 SVM                    GMEB
  1        30     569       5.3 ± 0.8 (27.3 ± 0.3)      4.9 ± 1.1 (16.4 ± 1.3)    4.2 ± 0.9 (6.0 ± 0.3)
  2        649    200/1800  0.3 (616)                   3.5 (15)                  0.2 (32)
  3        1558   200/3080  5.3 (322)                   4.7 (12)                  5.5 (98)
  4        2000   62        13.6 ± 5.9 (1941.8 ± 1.9)   10.7 ± 4.4 (23.3 ± 1.5)   10.7 ± 4.4 (59.1 ± 25.0)
  5        6      345       33.1 ± 3.5 (6.0 ± 0.0)      33.6 ± 3.6 (5.9 ± 0.1)    34.2 ± 4.4 (5.4 ± 0.5)
  6        8      768       22.8 ± 1.5 (5.8 ± 0.2)      22.9 ± 1.4 (5.8 ± 0.2)    22.5 ± 1.8 (4.8 ± 0.2)
  7        13     297       17.5 ± 1.9 (11.6 ± 0.2)     16.8 ± 1.6 (10.7 ± 0.3)   15.5 ± 2.0 (9.1 ± 0.3)
  8        34     351       11.7 ± 2.6 (32.8 ± 0.2)     12.0 ± 2.3 (27.9 ± 1.6)   10.0 ± 2.3 (12.1 ± 1.7)

4.3 Experimental results

Table 1 provides a comparison of the GMEB algorithm with the SVM algorithms on the synthetic datasets. The Bayes error is 0.4%. For further numerical comparison see [3]. Note that the number of features selected by the l1 SVM and the GMEB algorithms increases with the sample size. A possible explanation for this observation is that with only a few training patterns, a small training error can be achieved by many subsets containing a small number of features, i.e. a sparse solution. The particular subset selected is essentially random, leading to a large test error, possibly due to overfitting.
For all the synthetic datasets the GMEB algorithm clearly attained the lowest error rates. On the real-world datasets it produced the lowest error rates and the smallest number of features for the majority of the datasets investigated.

4.4 Discussion

The GMEB algorithm performs comparatively well against the linear and l1 SVM algorithms, with regard to both the test error and the number of selected features. A possible explanation is that the l1 SVM algorithm performs both classification and feature selection with the same variable w. In contrast, the GMEB algorithm performs feature selection and classification simultaneously, using the variables σ and w respectively. The use of two variables also allows the GMEB algorithm to reduce the weight of a feature j with both w_j and σ_j, while the l1 SVM uses only w_j. 
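The two-variable point can be made concrete: under transformation (5) the weight that GMEB effectively assigns to feature j is the product w_j σ_j, so either factor alone can drive a feature to zero. A trivial sketch:

```python
def effective_weights(w, sigma):
    """Effective per-feature weights w~_j = w_j * sigma_j of the GMEB
    classifier; a feature is removed when either factor vanishes."""
    return [wj * sj for wj, sj in zip(w, sigma)]
```

For example, a feature with a nonzero w_j is still discarded whenever its scale σ_j is zero, which the l1 SVM can only achieve by zeroing w_j itself.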
Perhaps this property of GMEB could explain why it produces comparable (and at times better) results than the SVM algorithms, both in classification problems where feature selection is required and in those where it is not.

5 Summary and future work

This paper presented a feature selection algorithm motivated by minimizing a GE bound. Finding the global optimum of the objective function requires solving a non-convex optimization problem; the equivalent optimization problems technique reduces this task to a convex problem. The dual problem formulation depends more weakly on the number of features d, and this enabled an extension of the GMEB algorithm to large scale classification problems.
The GMEB classifier is a linear classifier. Linear classifiers are the most important type of classifier in a feature selection framework, because feature selection is highly susceptible to overfitting. We believe that the GMEB algorithm is just the first of a series of algorithms which may globally minimize increasingly tighter bounds on the generalization error.

Acknowledgment R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion.

A Change of variables

Consider the optimization problem

  minimize    f_0(x)
  subject to  f_i(x) ≤ 0, i = 1, ..., m.   (8)

Suppose φ : R^n → R^n is one-to-one, with image covering the problem domain D, i.e., φ(dom φ) ⊇ D. We define the functions f~_i as f~_i(z) = f_i(φ(z)), i = 0, ..., m. Now consider the problem

  minimize    f~_0(z)   (9)
  subject to  f~_i(z) ≤ 0, i = 1, ..., m,

with variable z. Problems (8) and (9) are said to be related by the change of variables x = φ(z) and are equivalent: if x solves problem (8), then z = φ^{-1}(x) solves problem (9); if z solves problem (9), then x = φ(z) solves problem (8).

References
[1] Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 553-560. MIT Press, 2003.
[2] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, pages 668-674, 2000.
[3] Alain Rakotomamonjy. Variable selection using SVM-based criteria. The Journal of Machine Learning Research, 3:1357-1370, 2003.
[4] Jason Weston, Andre Elisseeff, Bernhard Scholkopf, and Mike Tipping. Use of the zero norm with linear models and kernel methods. The Journal of Machine Learning Research, 3:1439-1461, March 2003.
[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. http://www.stanford.edu/boyd/cvxbook.html
[6] R. Meir and T. Zhang. Generalization bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839-860, 2003.
[7] Glenn Fung and O. L. Mangasarian. Data selection for support vector machines classifiers. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 64-70, 2000.
[8] Simon Perkins, Kevin Lacker, and James Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333-1356, March 2003.
", "award": [], "sourceid": 2629, "authors": [{"given_name": "Dori", "family_name": "Peleg", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}]}