{"title": "1-norm Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 49, "page_last": 56, "abstract": "", "full_text": "1-norm Support Vector Machines\n\nJi Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani\n\nDepartment of Statistics\n\nStanford University\nStanford, CA 94305\n\n{jzhu,saharon,hastie,tibs}@stat.stanford.edu\n\nAbstract\n\nThe standard 2-norm SVM is known for its good performance in two-\nIn this paper, we consider the 1-norm SVM. We\nclass classi\u00a3cation.\nargue that the 1-norm SVM may have some advantage over the standard\n2-norm SVM, especially when there are redundant noise features. We\nalso propose an ef\u00a3cient algorithm that computes the whole solution path\nof the 1-norm SVM, hence facilitates adaptive selection of the tuning\nparameter for the 1-norm SVM.\n\n1 Introduction\n\nIn standard two-class classi\u00a3cation problems, we are given a set of training data (x1, y1),\n. . . (xn, yn), where the input xi \u2208 Rp, and the output yi \u2208 {1,\u22121} is binary. We wish to\n\u00a3nd a class\u00a3cation rule from the training data, so that when given a new input x, we can\nassign a class y from {1,\u22121} to it.\n\uf8ee\nTo handle this problem, we consider the 1-norm support vector machine (SVM):\nn(cid:1)\n\uf8f01 \u2212 yi\n(cid:2)\u03b2(cid:2)1 = |\u03b21| + \u00b7\u00b7\u00b7 + |\u03b2q| \u2264 s,\n\n(2)\nwhere D = {h1(x), . . . hq(x)} is a dictionary of basis functions, and s is a tuning parame-\nter. The solution is denoted as \u02c6\u03b20(s) and \u02c6\u03b2(s); the \u00a3tted model is\n\n\uf8eb\n\uf8ed\u03b20 +\n\n\uf8f6\n\uf8f8\n\n\uf8f9\n\uf8fb\n\nq(cid:1)\n\nj=1\n\nmin\n\u03b20,\u03b2\n\ns.t.\n\n\u03b2jhj(xi)\n\n(1)\n\ni=1\n\n+\n\n\u02c6f(x) = \u02c6\u03b20 +\n\n\u02c6\u03b2jhj(x).\n\n(3)\n\nj=1\n\nThe classi\u00a3cation rule is given by sign[ \u02c6f(x)]. The 1-norm SVM has been successfully\nused in [1] and [9]. 
We argue in this paper that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features.

To get a good fitted model f̂(x) that performs well on future data, we also need to select an appropriate tuning parameter s. In practice, people usually pre-specify a finite set of values for s that covers a wide range, then either use a separate validation data set or use cross-validation to select a value for s that gives the best performance among the given set. In this paper, we illustrate that the solution path β̂(s) is piece-wise linear as a function of s (in the R^q space); we also propose an efficient algorithm to compute the exact whole solution path {β̂(s), 0 ≤ s ≤ ∞}, which helps us understand how the solution changes with s and facilitates the adaptive selection of the tuning parameter s. Under some mild assumptions, we show that the computational cost to compute the whole solution path β̂(s) is O(nq·min(n, q)²) in the worst case and O(nq) in the best case.

Before delving into the technical details, we illustrate the concept of piece-wise linearity of the solution path β̂(s) with a simple example. We generate 10 training data in each of two classes. The first class has two standard normal independent inputs x1, x2. The second class also has two standard normal independent inputs, but conditioned on 4.5 ≤ x1² + x2² ≤ 8. The dictionary of basis functions is D = {√2·x1, √2·x2, √2·x1x2, x1², x2²}. The solution path β̂(s) as a function of s is shown in Figure 1. Any segment between two adjacent vertical lines is linear. Hence the right derivative of β̂(s) with respect to s is piece-wise constant (in R^q). The two solid paths are for x1² and x2², which are the two relevant features.

Figure 1: The solution path β̂(s) as a function of s.

In section 2, we motivate why we are interested in the 1-norm SVM. In section 3, we describe the algorithm that computes the whole solution path β̂(s). In section 4, we show some numerical results on both simulation data and real world data.

2 Regularized support vector machines

The standard 2-norm SVM is equivalent to fitting a model that solves

    min over (β0, β):  Σ_{i=1..n} [1 − yi(β0 + Σ_{j=1..q} βj hj(xi))]+ + λ‖β‖2²,    (4)

where λ is a tuning parameter. In practice, people usually choose the hj(x)'s to be the basis functions of a reproducing kernel Hilbert space. Then a kernel trick allows the dimension of the transformed feature space to be very large, even infinite in some cases (i.e. q = ∞), without causing extra computational burden ([2] and [12]). In this paper, however, we will concentrate on the basis representation (3) rather than a kernel representation.

Notice that (4) has the form loss + penalty, and λ is the tuning parameter that controls the tradeoff between loss and penalty. The loss (1 − yf)+ is called the hinge loss, and the penalty is called the ridge penalty. The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay. The ridge penalty shrinks the fitted coefficients β̂ towards zero. 
It is well known that this shrinkage has the effect of controlling the variances of β̂, hence possibly improves the fitted model's prediction accuracy, especially when there are many highly correlated features [6]. So from a statistical function estimation point of view, the ridge penalty could possibly explain the success of the SVM ([6] and [12]). On the other hand, computational learning theory has attributed the good performance of the SVM to its margin maximizing property [11], a property of the hinge loss. [8] makes some effort to build a connection between these two different views.

In this paper, we replace the ridge penalty in (4) with the L1-norm of β, i.e. the lasso penalty [10], and consider the 1-norm SVM problem:

    min over (β0, β):  Σ_{i=1..n} [1 − yi(β0 + Σ_{j=1..q} βj hj(xi))]+ + λ‖β‖1,    (5)

which is an equivalent Lagrange version of the optimization problem (1)-(2).

The lasso penalty was first proposed in [10] for regression problems, where the response y is continuous rather than categorical. It has also been used in [1] and [9] for classification problems under the framework of SVMs. Similar to the ridge penalty, the lasso penalty also shrinks the fitted coefficients β̂j towards zero, hence (5) also benefits from the reduction in the fitted coefficients' variances. Another property of the lasso penalty is that, because of its L1 nature, making λ sufficiently large, or equivalently s sufficiently small, will cause some of the coefficients β̂j to be exactly zero. For example, when s = 1 in Figure 1, only three fitted coefficients are non-zero. Thus the lasso penalty performs a kind of continuous feature selection, while this is not the case for the ridge penalty. 
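The contrast between the two penalties can be seen in a tiny numerical sketch (ours, not from the paper): for squared-error loss with an orthonormal design, the lasso solution is soft-thresholding while the ridge solution is proportional shrinkage. The loss differs from the hinge loss of (1), but the effect of the penalty is the same in spirit.

```python
import numpy as np

# Hypothetical least-squares coefficients under an orthonormal design.
z = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
lam = 0.5  # penalty strength (illustrative)

# Lasso: soft-thresholding sets small coefficients exactly to zero.
beta_lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Ridge: proportional shrinkage; no coefficient is exactly zero.
beta_ridge = z / (1.0 + lam)

print(beta_lasso)  # the three smallest coefficients become exactly 0
print(beta_ridge)  # every coefficient stays non-zero
```

Coefficients whose magnitude falls below the threshold λ are set exactly to zero by the lasso, mirroring the exact zeros seen along the path in Figure 1.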
In (4), none of the β̂j's will be equal to zero.

It is interesting to note that the ridge penalty corresponds to a Gaussian prior for the βj's, while the lasso penalty corresponds to a double-exponential prior. The double-exponential density has heavier tails than the Gaussian density. This reflects the greater tendency of the lasso to produce some large fitted coefficients and leave others at 0, especially in high dimensional problems. Recently, [3] considered a situation where we have a small number of training data, e.g. n = 100, and a large number of basis functions, e.g. q = 10,000. [3] argue that in the sparse scenario, i.e. when only a small number of the true coefficients βj are non-zero, the lasso penalty works better than the ridge penalty; while in the non-sparse scenario, e.g. when the true coefficients βj have a Gaussian distribution, neither the lasso penalty nor the ridge penalty will fit the coefficients well, since there is too little data from which to estimate these non-zero coefficients. This is the curse of dimensionality taking its toll. Based on these observations, [3] further propose the bet on sparsity principle for high-dimensional problems, which encourages using the lasso penalty.

3 Algorithm

Section 2 gives the motivation why we are interested in the 1-norm SVM. To solve the 1-norm SVM for a fixed value of s, we can transform (1)-(2) into a linear programming problem and use standard software packages; but to get a good fitted model f̂(x) that performs well on future data, we need to select an appropriate value for the tuning parameter s. 
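The linear programming transformation just mentioned can be sketched as follows (our sketch; the toy data, variable layout and the choice of scipy.optimize.linprog are ours, not the paper's): split β = β⁺ − β⁻ with β⁺, β⁻ ≥ 0, and introduce slacks ξi ≥ 0 for the hinge loss.

```python
import numpy as np
from scipy.optimize import linprog

def svm_1norm(H, y, s):
    # Solve min Σ_i [1 - y_i(b0 + H_i·β)]_+ subject to ||β||_1 <= s
    # as an LP with variables [b0, beta_plus(q), beta_minus(q), xi(n)].
    n, q = H.shape
    c = np.r_[0.0, np.zeros(2 * q), np.ones(n)]  # minimize total hinge slack
    A = np.zeros((n + 1, 1 + 2 * q + n))
    # Margin rows: -y_i*(b0 + H_i·(bp - bm)) - xi_i <= -1
    A[:n, 0] = -y
    A[:n, 1:1 + q] = -y[:, None] * H
    A[:n, 1 + q:1 + 2 * q] = y[:, None] * H
    A[:n, 1 + 2 * q:] = -np.eye(n)
    # L1 budget row: sum(bp + bm) <= s
    A[n, 1:1 + 2 * q] = 1.0
    b = np.r_[-np.ones(n), s]
    bounds = [(None, None)] + [(0, None)] * (2 * q + n)
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
    b0 = res.x[0]
    beta = res.x[1:1 + q] - res.x[1 + q:1 + 2 * q]
    return b0, beta, res.fun  # intercept, coefficients, hinge loss

# Toy separable data (illustrative): basis functions = raw inputs.
H = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b0, beta, loss = svm_1norm(H, y, s=10.0)
print(np.sign(b0 + H @ beta))  # matches y when s is generous enough
```

With a generous budget s the separable toy data are fitted with zero hinge loss; shrinking s forces some coefficients to exactly zero, which is the continuous feature selection discussed in section 2.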
In this section, we propose an efficient algorithm that computes the whole solution path β̂(s), hence facilitates adaptive selection of s.

3.1 Piece-wise linearity

If we follow the solution path β̂(s) of (1)-(2) as s increases, we will notice that since both Σ_i (1 − yi f̂i)+ and ‖β‖1 are piece-wise linear, the Karush-Kuhn-Tucker conditions will not change when s increases unless a residual (1 − yi f̂i) changes from non-zero to zero, or a fitted coefficient β̂j(s) changes from non-zero to zero; these correspond to the non-smooth points of Σ_i (1 − yi f̂i)+ and ‖β‖1. This implies that the derivative of β̂(s) with respect to s is piece-wise constant, because when the Karush-Kuhn-Tucker conditions do not change, the derivative of β̂(s) will not change either. Hence the whole solution path β̂(s) is piece-wise linear. See [13] for details.

Thus to compute the whole solution path β̂(s), all we need to do is to find the joints, i.e. the asterisk points in Figure 1, on this piece-wise linear path, then use straight lines to interpolate them; or equivalently, to start at β̂(0) = 0, find the right derivative of β̂(s), let s increase, and only change the derivative when β̂(s) gets to a joint.

3.2 Initial solution (i.e. s = 0)

The following notation is used. Let V = {j : β̂j(s) ≠ 0}, E = {i : 1 − yi f̂i = 0}, L = {i : 1 − yi f̂i > 0}, and let u denote the right derivative of β̂_V(s), with ‖u‖1 = 1, where β̂_V(s) denotes the components of β̂(s) with indices in V. Without loss of generality, we assume #{yi = 1} ≥ #{yi = −1}; then β̂0(0) = 1, β̂j(0) = 0. 
To compute the path that β̂(s) follows, we need to compute the derivative of β̂(s) at 0. We consider a modified problem:

    min over (β0, βj):  Σ_{yi=1} (1 − yi fi)+ + Σ_{yi=−1} (1 − yi fi)    (6)
    subject to:  ‖β‖1 ≤ Δs,  fi = β0 + Σ_{j=1..q} βj hj(xi).    (7)

Notice that if yi = 1, the loss is still (1 − yi fi)+; but if yi = −1, the loss becomes (1 − yi fi). In this setup, the derivative of β̂(Δs) with respect to Δs is the same no matter what value Δs takes, and one can show that it coincides with the right derivative of β̂(s) when s is sufficiently small. Hence this setup helps us find the initial derivative u of β̂(s). Solving (6)-(7), which can be transformed into a simple linear programming problem, we get the initial V, E and L. |E| should be equal to |V| + 1. We also have:

    (β̂0(Δs), β̂_V(Δs))ᵀ = (1, 0)ᵀ + Δs · (u0, u)ᵀ.    (8)

Δs starts at 0 and increases.

3.3 Main algorithm

The main algorithm that computes the whole solution path β̂(s) proceeds as follows:

1. Increase Δs until one of the following two events happens:
   • A training point hits E, i.e. 1 − yi fi ≠ 0 becomes 1 − yi fi = 0 for some i.
   • A basis function in V leaves V, i.e. β̂j ≠ 0 becomes β̂j = 0 for some j.
   Let the current β̂0, β̂ and s be denoted by β̂0_old, β̂_old and s_old.

2. For each j* ∉ V, we solve:

    u0 + Σ_{j∈V} uj hj(xi) + u_{j*} h_{j*}(xi) = 0 for i ∈ E,
    Σ_{j∈V} sign(β̂j_old) uj + |u_{j*}| = 1,    (9)

where u0, the uj and u_{j*} are the unknowns. We then compute:

    Δloss_{j*}/Δs = −Σ_{i∈L} yi (u0 + Σ_{j∈V} uj hj(xi) + u_{j*} h_{j*}(xi)).    (10)

3. For each i′ ∈ E, we solve:

    u0 + Σ_{j∈V} uj hj(xi) = 0 for i ∈ E∖{i′},
    Σ_{j∈V} sign(β̂j_old) uj = 1,    (11)

where u0 and the uj are the unknowns. We then compute:

    Δloss_{i′}/Δs = −Σ_{i∈L} yi (u0 + Σ_{j∈V} uj hj(xi)).    (12)

4. Compare the computed values of Δloss/Δs from step 2 and step 3. There are q − |V| + |E| = q + 1 such values. Choose the smallest negative Δloss/Δs. Hence,
   • If the smallest Δloss/Δs is non-negative, the algorithm terminates; else
   • If the smallest negative Δloss/Δs corresponds to a j* in step 2, we update

       V ← V ∪ {j*},  u ← (uᵀ, u_{j*})ᵀ.    (13)

   • If the smallest negative Δloss/Δs corresponds to an i′ in step 3, we update u and

       E ← E∖{i′},  L ← L ∪ {i′} if necessary.    (14)

   In either of the last two cases, β̂(s) changes as:

       (β̂0(s_old + Δs), β̂_V(s_old + Δs))ᵀ = (β̂0_old, β̂_V_old)ᵀ + Δs · (u0, u)ᵀ,    (15)

   and we go back to step 1.

In the end, we get a path β̂(s), which is piece-wise linear.

3.4 Remarks

Due to the page limit, we omit the proof that this algorithm does indeed give the exact whole solution path β̂(s) of (1)-(2) (see [13] for the detailed proof). 
Instead, we explain briefly what each step of the algorithm tries to do.

Step 1 of the algorithm indicates that β̂(s) gets to a joint on the solution path, and the right derivative of β̂(s) needs to be changed, if either a residual (1 − yi f̂i) changes from non-zero to zero, or the coefficient of a basis function β̂j(s) changes from non-zero to zero, as s increases. Then there are two possible types of actions that the algorithm can take: (1) add a basis function into V, or (2) remove a point from E.

Step 2 computes the possible right derivative of β̂(s) if each basis function h_{j*}(x) were added into V. Step 3 computes the possible right derivative of β̂(s) if each point i′ were removed from E. The possible right derivative of β̂(s) (determined by either (9) or (11)) is such that the training points in E are kept in E when s increases, until the next joint (step 1) occurs. Δloss/Δs indicates how fast the loss will decrease if β̂(s) changes according to u. Step 4 takes the action corresponding to the smallest negative Δloss/Δs. When the loss can not be decreased, the algorithm terminates.

Table 1: Simulation results of 1-norm and 2-norm SVM

                             Test Error (SE)
  Simulation         |D|   1-norm          2-norm        No Penalty    # Joints
  1  No noise input    5   0.073 (0.010)   0.08 (0.02)   0.08 (0.01)    94 (13)
  2  2 noise inputs   14   0.074 (0.014)   0.10 (0.02)   0.12 (0.03)   149 (20)
  3  4 noise inputs   27   0.074 (0.009)   0.13 (0.03)   0.20 (0.05)   225 (30)
  4  6 noise inputs   44   0.082 (0.009)   0.15 (0.03)   0.22 (0.06)   374 (52)
  5  8 noise inputs   65   0.084 (0.011)   0.18 (0.03)   0.22 (0.06)   499 (67)

3.5 Computational cost

We have proposed an algorithm that computes the whole solution path β̂(s). A natural question is then: what is the computational cost of this algorithm? 
Suppose |E| = m at a joint on the piece-wise linear solution path; then it takes O(qm²) computations to perform step 2 and step 3 of the algorithm via the Sherman-Morrison updating formula. If we assume the training data are separable by the dictionary D, then all the training data will eventually have loss (1 − yi f̂i)+ equal to zero. Hence it is reasonable to assume the number of joints on the piece-wise linear solution path is O(n). Since the maximum value of m is min(n, q) and the minimum value of m is 1, the worst-case computational cost is O(nq·min(n, q)²) and the best-case computational cost is O(nq). Notice that this is a rough calculation of the computational cost under some mild assumptions. Simulation results (section 4) actually indicate that the number of joints tends to be O(min(n, q)).

4 Numerical results

In this section, we use both simulation and real data results to illustrate the 1-norm SVM.

4.1 Simulation results

The data generation mechanism is the same as the one described in section 1, except that we generate 50 training data in each of two classes, and to make harder problems, we sequentially augment the inputs with an additional two, four, six and eight standard normal noise inputs. Hence the second class almost completely surrounds the first, like the skin surrounding the orange, in a two-dimensional subspace. The Bayes error rate for this problem is 0.0435, irrespective of dimension. In the original input space, a hyperplane cannot separate the classes; we use an enlarged feature space corresponding to the 2nd-degree polynomial kernel, hence the dictionary of basis functions is D = {√2·xj, √2·xj·xj′, xj², j, j′ = 1, ..., p}. We generate 1000 test data to compare the 1-norm SVM and the standard 2-norm SVM. The average test errors over 50 simulations, with different numbers of noise inputs, are shown in Table 1. 
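The two-class mechanism described above can be sketched as follows (our sketch; the rejection-sampling implementation and the seed are ours): class 1 draws standard normal (x1, x2); class 2 draws standard normals conditioned on 4.5 ≤ x1² + x2² ≤ 8; standard normal noise inputs are then appended.

```python
import numpy as np

def make_data(n_per_class=50, n_noise=4, seed=0):
    rng = np.random.default_rng(seed)
    # Class 1: two independent standard normal inputs.
    x_pos = rng.normal(size=(n_per_class, 2))
    # Class 2: standard normals conditioned on 4.5 <= x1^2 + x2^2 <= 8
    # (the skin surrounding the orange), via rejection sampling.
    kept = []
    while len(kept) < n_per_class:
        z = rng.normal(size=2)
        if 4.5 <= z @ z <= 8.0:
            kept.append(z)
    x_neg = np.array(kept)
    X = np.vstack([x_pos, x_neg])
    # Append pure-noise inputs, as in the harder simulation settings.
    X = np.hstack([X, rng.normal(size=(2 * n_per_class, n_noise))])
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    return X, y

X, y = make_data()
r2 = np.sum(X[y == -1, :2] ** 2, axis=1)  # squared radii of class 2
print(X.shape, r2.min(), r2.max())
```

The 2nd-degree polynomial dictionary of section 1 would then be built from these inputs before fitting either SVM.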
For both the 1-norm SVM and the 2-norm SVM, we choose the tuning parameters to minimize the test error, to be as fair as possible to each method. For comparison, we also include the results for the non-penalized SVM.

From Table 1 we can see that the non-penalized SVM performs significantly worse than the penalized ones; the 1-norm SVM and the 2-norm SVM perform similarly when there is no noise input (line 1), but the 2-norm SVM is adversely affected by noise inputs (line 2 - line 5). Since the 1-norm SVM has the ability to select relevant features and ignore redundant features, it does not suffer from the noise inputs as much as the 2-norm SVM. Table 1 also shows the number of basis functions q and the number of joints on the piece-wise linear solution path. Notice that q < n and there is a striking linear relationship between |D| and #Joints (Figure 2). Figure 2 also shows the 1-norm SVM result for one simulation.

Figure 2: Left and middle panels: 1-norm SVM when there are 4 noise inputs. The left panel is the piece-wise linear solution path β̂(s). The two upper paths correspond to x1² and x2², which are the relevant features. The middle panel is the test error along the solution path. The dashed lines correspond to the minimum of the test error. 
The right panel illustrates the linear relationship between the number of basis functions and the number of joints on the solution path when q < n.

4.2 Real data results

In this section, we apply the 1-norm SVM to classification of gene microarrays. Classification of patient samples is an important aspect of cancer diagnosis and treatment. The 2-norm SVM has been successfully applied to microarray cancer diagnosis problems ([5] and [7]). However, one weakness of the 2-norm SVM is that it only predicts a cancer class label but does not automatically select relevant genes for the classification. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. [4] and [5] have proposed gene selection methods, which we call univariate ranking (UR) and recursive feature elimination (RFE) (see [14]), that can be combined with the 2-norm SVM. However, these are two-step procedures that depend on external gene selection methods. On the other hand, the 1-norm SVM has an inherent gene (feature) selection property due to the lasso penalty. Hence the 1-norm SVM achieves the goals of classification of patients and selection of genes simultaneously.

We apply the 1-norm SVM to the leukemia data [4]. This data set consists of 38 training data and 34 test data of two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Each datum is a vector of p = 7,129 genes. We use the original input xj, i.e. the jth gene's expression level, as the basis function, i.e. q = p. The tuning parameter is chosen according to 10-fold cross-validation, then the final model is fitted on all the training data and evaluated on the test data. The number of joints on the solution path is 104, which appears to be O(n) ≪ O(q). The results are summarized in Table 2. 
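The gene-selection behavior has a simple structural explanation: the fit solves a linear program with n margin constraints plus the L1 budget, so a basic (vertex) solution can have no more than about n non-zero coefficients even when q ≫ n. A small sketch (ours; the synthetic data, budget s and solver choice are illustrative, not the paper's experiment):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, q = 8, 30                       # far more features than samples, as in microarrays
H = rng.normal(size=(n, q))
y = np.sign(H[:, 0] + H[:, 1])     # labels driven by two relevant features

# 1-norm SVM as an LP: variables [b0, beta_plus(q), beta_minus(q), xi(n)].
s = 2.0
c = np.r_[0.0, np.zeros(2 * q), np.ones(n)]
A = np.zeros((n + 1, 1 + 2 * q + n))
A[:n, 0] = -y
A[:n, 1:1 + q] = -y[:, None] * H
A[:n, 1 + q:1 + 2 * q] = y[:, None] * H
A[:n, 1 + 2 * q:] = -np.eye(n)
A[n, 1:1 + 2 * q] = 1.0            # L1 budget: sum(bp + bm) <= s
b = np.r_[-np.ones(n), s]
bounds = [(None, None)] + [(0, None)] * (2 * q + n)
# Dual simplex returns a vertex of the feasible polytope.
res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method='highs-ds')
beta = res.x[1:1 + q] - res.x[1 + q:1 + 2 * q]
n_selected = int(np.sum(np.abs(beta) > 1e-8))
print(n_selected)                  # no more than about n of the q=30 features
```

The count of non-zero coefficients is bounded by the number of basic variables of the LP, which is what limits the number of selected genes in the microarray setting.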
We can see that the 1-norm SVM performs similarly to the other methods in classification, and it has the advantage of automatically selecting relevant genes. We should notice that the maximum number of genes that the 1-norm SVM can select is upper bounded by n, which is usually much less than q in microarray problems.

Table 2: Results on Microarray Classification

  Method            CV Error   Test Error   # of Genes
  2-norm SVM UR     2/38       3/34         22
  2-norm SVM RFE    2/38       1/34         31
  1-norm SVM        2/38       2/34         17

5 Conclusion

We have considered the 1-norm SVM in this paper. We illustrate that the 1-norm SVM may have some advantage over the 2-norm SVM, especially when there are redundant features. The solution path β̂(s) of the 1-norm SVM is a piece-wise linear function of the tuning parameter s. We have proposed an efficient algorithm to compute the whole solution path β̂(s) of the 1-norm SVM, and to facilitate adaptive selection of the tuning parameter s.

Acknowledgments

Hastie was partially supported by NSF grant DMS-0204162 and NIH grant R01-CA-72028-01. Tibshirani was partially supported by NSF grant DMS-9971405 and NIH grant R01-CA-72028.

References

[1] Bradley, P. & Mangasarian, O. (1998) Feature selection via concave minimization and support vector machines. In J. Shavlik (ed.), ICML'98. Morgan Kaufmann.

[2] Evgeniou, T., Pontil, M. & Poggio, T. (1999) Regularization networks and support vector machines. Advances in Large Margin Classifiers. MIT Press.

[3] Friedman, J., Hastie, T., Rosset, S., Tibshirani, R. & Zhu, J. (2004) Discussion of “Consistency in boosting” by W. Jiang, G. Lugosi, N. Vayatis and T. Zhang. Annals of Statistics. To appear.

[4] Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J. & Caligiuri, M. 
(1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-536.

[5] Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46, 389-422.

[6] Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning. Springer-Verlag, New York.

[7] Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J. & Poggio, T. (1999) Support vector machine classification of microarray data. Technical Report AI Memo 1677, MIT.

[8] Rosset, S., Zhu, J. & Hastie, T. (2003) Boosting as a regularized path to a maximum margin classifier. Technical Report, Department of Statistics, Stanford University, CA.

[9] Song, M., Breneman, C., Bi, J., Sukumar, N., Bennett, K., Cramer, S. & Tugcu, N. (2002) Prediction of protein retention times in anion-exchange chromatography systems using support vector regression. Journal of Chemical Information and Computer Sciences, September.

[10] Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J.R.S.S.B. 58, 267-288.

[11] Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York.

[12] Wahba, G. (1999) Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods - Support Vector Learning, 69-88, MIT Press.

[13] Zhu, J. (2003) Flexible statistical modeling. Ph.D. Thesis, Stanford University.

[14] Zhu, J. & Hastie, T. (2003) Classification of gene microarrays by penalized logistic regression. Biostatistics. 
Accepted.
", "award": [], "sourceid": 2450, "authors": [{"given_name": "Ji", "family_name": "Zhu", "institution": null}, {"given_name": "Saharon", "family_name": "Rosset", "institution": null}, {"given_name": "Robert", "family_name": "Tibshirani", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}