{"title": "Efficient and Robust Feature Selection via Joint \u21132,1-Norms Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1813, "page_last": 1821, "abstract": "Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method with emphasizing joint \u21132,1-norm minimization on both loss function and regularization. The \u21132,1-norm based loss function is robust to outliers in data points and the \u21132,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied into both genomic and proteomic biomarkers discovery. Extensive empirical studies were performed on six data sets to demonstrate the effectiveness of our feature selection method.", "full_text": "Ef\ufb01cient and Robust Feature Selection via Joint\n\n(cid:96)2,1-Norms Minimization\n\nFeiping Nie\n\nComputer Science and Engineering\n\nUniversity of Texas at Arlington\nfeipingnie@gmail.com\n\nXiao Cai\n\nComputer Science and Engineering\n\nUniversity of Texas at Arlington\nxiao.cai@mavs.uta.edu\n\nHeng Huang\n\nComputer Science and Engineering\n\nUniversity of Texas at Arlington\n\nheng@uta.edu\n\nChris Ding\n\nComputer Science and Engineering\n\nUniversity of Texas at Arlington\n\nchqding@uta.edu\n\nAbstract\n\nFeature selection is an important component of many machine learning applica-\ntions. Especially in many bioinformatics tasks, ef\ufb01cient and robust feature se-\nlection methods are desired to extract meaningful features and eliminate noisy\nones. 
In this paper, we propose a new robust feature selection method that emphasizes joint \u21132,1-norm minimization on both the loss function and the regularization. The \u21132,1-norm based loss function is robust to outliers in data points, and the \u21132,1-norm regularization selects features across all data points with joint sparsity. An efficient algorithm with proved convergence is introduced. Our regression based objective makes the feature selection process more efficient. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method.\n\n1 Introduction\n\nFeature selection, the process of selecting a subset of relevant features, is a key component in building robust machine learning models for classification, clustering, and other tasks. Feature selection has been playing an important role in many applications since it can speed up the learning process, improve the model generalization capability, and alleviate the effect of the curse of dimensionality [15]. A large number of developments on feature selection have been made in the literature, and there are many recent reviews and workshops devoted to this topic, e.g., the NIPS Conference [7].\nIn the past ten years, feature selection has seen much activity, primarily due to the advances in bioinformatics, where a large amount of genomic and proteomic data are produced for biological and biomedical studies. For example, in genomics, DNA microarray data measure the expression levels of thousands of genes in a single experiment. Gene expression data usually contain a large number of genes but a small number of samples. A given disease or biological function is usually associated with a few genes [19]. Selecting a few relevant genes out of several thousand thus becomes a key problem in bioinformatics research [22]. 
In proteomics, high-throughput mass spectrometry (MS) screening measures the molecular weights of individual biomolecules (such as proteins and nucleic acids) and has the potential to discover putative proteomic biomarkers. Each spectrum is composed of peak amplitude measurements at approximately 15,500 features represented by a corresponding mass-to-charge value. The identification of meaningful proteomic features from MS is crucial for disease diagnosis and protein-based biomarker profiling [22].\n\nIn general, there are three models of feature selection methods in the literature: (1) filter methods [14], where the selection is independent of classifiers; (2) wrapper methods [12], where the prediction method is used as a black box to score subsets of features; and (3) embedded methods, where the procedure of feature selection is embedded directly in the training process. In bioinformatics applications, many feature selection methods from these categories have been proposed and applied. Widely used filter-type feature selection methods include the F-statistic [4], reliefF [11, 13], mRMR [19], the t-test, and information gain [21], which compute the sensitivity (correlation or relevance) of a feature with respect to (w.r.t.) the class label distribution of the data. These methods can be characterized as using global statistical information. Wrapper-type feature selection methods are tightly coupled with a specific classifier, such as correlation-based feature selection (CFS) [9] and support vector machine recursive feature elimination (SVM-RFE) [8]. They often have good performance, but their computational cost is very expensive.\nRecently, sparsity regularization in dimensionality reduction has been widely investigated and also applied to feature selection studies. \u21131-SVM was proposed to perform feature selection using the \u21131-norm regularization, which tends to give sparse solutions [3]. 
Because the number of selected features using \u21131-SVM is upper bounded by the sample size, a Hybrid Huberized SVM (HHSVM) was proposed that combines both the \u21131-norm and the \u21132-norm to form a more structured regularization [26]. But it was designed only for binary classification. In multi-task learning, in parallel works, Obozinski et al. [18] and Argyriou et al. [1] developed a similar model with \u21132,1-norm regularization to couple feature selection across tasks. Such regularization has close connections to the group lasso [28].\nIn this paper, we propose a novel efficient and robust feature selection method that employs joint \u21132,1-norm minimization on both the loss function and the regularization. Instead of an \u21132-norm based loss function, which is sensitive to outliers, an \u21132,1-norm based loss function is adopted in our work to reduce the effect of outliers. Motivated by previous research [1, 18], an \u21132,1-norm regularization is performed to select features across all data points with joint sparsity, i.e., each feature (gene expression or mass-to-charge value in MS) either has small scores for all data points or has large scores over all data points. To solve this new robust feature selection objective, we propose an efficient algorithm for the joint \u21132,1-norm minimization problem. We also provide an analysis of the algorithm and prove its convergence. Extensive experiments have been performed on six bioinformatics data sets, and our method outperforms five other commonly used feature selection methods in statistical learning and bioinformatics.\n\n2 Notations and Definitions\n\nWe summarize the notations and the definitions of the norms used in this paper. Matrices are written as boldface uppercase letters. Vectors are written as boldface lowercase letters. 
For a matrix M = (m_ij), its i-th row and j-th column are denoted by m^i and m_j, respectively.\nThe \u2113p-norm of a vector v \u2208 R^n is defined as \u2016v\u2016_p = (\u2211_{i=1}^n |v_i|^p)^{1/p}. The \u21130-norm of the vector v \u2208 R^n is defined as \u2016v\u2016_0 = \u2211_{i=1}^n |v_i|^0. The Frobenius norm of the matrix M \u2208 R^{n\u00d7m} is defined as\n\n\u2016M\u2016_F = (\u2211_{i=1}^n \u2211_{j=1}^m m_{ij}^2)^{1/2} = (\u2211_{i=1}^n \u2016m^i\u2016_2^2)^{1/2}.   (1)\n\nThe \u21132,1-norm of a matrix was first introduced in [5] as the rotational invariant \u21131 norm and has also been used for multi-task learning [1, 18] and tensor factorization [10]. It is defined as\n\n\u2016M\u2016_{2,1} = \u2211_{i=1}^n (\u2211_{j=1}^m m_{ij}^2)^{1/2} = \u2211_{i=1}^n \u2016m^i\u2016_2,   (2)\n\nwhich is rotational invariant for rows: \u2016MR\u2016_{2,1} = \u2016M\u2016_{2,1} for any rotational matrix R. The \u21132,1-norm can be generalized to the \u2113r,p-norm\n\n\u2016M\u2016_{r,p} = (\u2211_{i=1}^n (\u2211_{j=1}^m |m_{ij}|^r)^{p/r})^{1/p} = (\u2211_{i=1}^n \u2016m^i\u2016_r^p)^{1/p}.   (3)\n\nNote that the \u2113r,p-norm is a valid norm because it satisfies the three norm conditions, including the triangle inequality \u2016A\u2016_{r,p} + \u2016B\u2016_{r,p} \u2265 \u2016A + B\u2016_{r,p}. 
This can be proved as follows. Starting from the triangle inequality (\u2211_i |u_i + v_i|^p)^{1/p} \u2264 (\u2211_i |u_i|^p)^{1/p} + (\u2211_i |v_i|^p)^{1/p} and setting u_i = \u2016a^i\u2016_r and v_i = \u2016b^i\u2016_r, we obtain\n\n(\u2211_i \u2016a^i\u2016_r^p)^{1/p} + (\u2211_i \u2016b^i\u2016_r^p)^{1/p} \u2265 (\u2211_i |\u2016a^i\u2016_r + \u2016b^i\u2016_r|^p)^{1/p} \u2265 (\u2211_i |\u2016a^i + b^i\u2016_r|^p)^{1/p},   (4)\n\nwhere the second inequality follows from the triangle inequality for the \u2113r norm: \u2016a^i\u2016_r + \u2016b^i\u2016_r \u2265 \u2016a^i + b^i\u2016_r. Eq. (4) is just \u2016A\u2016_{r,p} + \u2016B\u2016_{r,p} \u2265 \u2016A + B\u2016_{r,p}.\nHowever, the \u21130-norm is not a valid norm because it does not satisfy the positive scalability \u2016\u03b1v\u2016_0 = |\u03b1| \u2016v\u2016_0 for a scalar \u03b1. The term \u201cnorm\u201d here is used for convenience.\n\n3 Robust Feature Selection Based on \u21132,1-Norms\n\nLeast square regression is one of the popular methods for classification. Given training data {x_1, x_2, \u00b7\u00b7\u00b7, x_n} \u2208 R^d and the associated class labels {y_1, y_2, \u00b7\u00b7\u00b7, y_n} \u2208 R^c, traditional least square regression solves the following optimization problem to obtain the projection matrix W \u2208 R^{d\u00d7c} and the bias b \u2208 R^c:\n\nmin_{W,b} \u2211_{i=1}^n \u2016W^T x_i + b \u2212 y_i\u2016_2^2.   (5)\n\nFor simplicity, the bias b can be absorbed into W when the constant value 1 is added as an additional dimension for each data point x_i (1 \u2264 i \u2264 n). Thus the problem becomes:\n\nmin_W \u2211_{i=1}^n \u2016W^T x_i \u2212 y_i\u2016_2^2.   (6)\n\nIn this paper, we use the robust loss function:\n\nmin_W \u2211_{i=1}^n \u2016W^T x_i \u2212 y_i\u2016_2,   (7)\n\nwhere the residual \u2016W^T x_i \u2212 y_i\u2016_2 is not squared, and thus outliers have less importance than under the squared residual \u2016W^T x_i \u2212 y_i\u2016_2^2. 
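To make the effect of the unsquared residual in Eq. (7) concrete, the following illustrative NumPy sketch (not from the paper; the residual values are invented for the illustration) compares the squared loss of Eq. (6) with the unsquared, row-wise loss of Eq. (7) on a residual matrix containing a single outlier row:

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1} = sum_i ||m^i||_2: sum of the l2-norms of the rows (Eq. (2))
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

rng = np.random.default_rng(0)
R = 0.1 * rng.standard_normal((50, 3))  # residual matrix W^T x_i - y_i, small inlier rows
R[0] = 20.0                             # one grossly corrupted data point

frobenius_sq = np.sum(R ** 2)           # squared l2 loss, as in Eq. (6)
robust = l21_norm(R)                    # unsquared l2,1-style loss, as in Eq. (7)

# Fraction of each objective contributed by the single outlier:
# close to 1 for the squared loss, noticeably smaller for the unsquared loss.
print(np.sum(R[0] ** 2) / frobenius_sq)
print(np.linalg.norm(R[0]) / robust)
```

Since the outlier's contribution grows only linearly rather than quadratically, a single corrupted sample perturbs the minimizer of Eq. (7) far less than that of Eq. (6).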
This loss function has a rotational invariant property, while the pure \u21131-norm loss function does not have such a desirable property [5].\nWe now add a regularization term R(W) with parameter \u03b3. The problem becomes:\n\nmin_W \u2211_{i=1}^n \u2016W^T x_i \u2212 y_i\u2016_2 + \u03b3R(W).   (8)\n\nSeveral regularizations are possible:\n\nR1(W) = \u2016W\u2016^2, R2(W) = \u2211_{j=1}^c \u2016w_j\u2016_1, R3(W) = \u2211_{i=1}^d \u2016w^i\u2016_2^0, R4(W) = \u2211_{i=1}^d \u2016w^i\u2016_2.   (9)\n\nR1(W) is the ridge regularization. R2(W) is the LASSO regularization. R3(W) and R4(W) penalize all c regression coefficients corresponding to a single feature as a whole, which has the effect of feature selection. Although the \u21130-norm of R3(W) is the most desirable [16], in this paper we use R4(W) instead. The reasons are: (A) the \u21131-norm of R4(W) is convex and can be easily optimized (the main contribution of this paper); (B) it was shown that the results of the \u21130-norm are identical or approximately identical to the \u21131-norm results under practical conditions.\nDenote the data matrix X = [x_1, x_2, \u00b7\u00b7\u00b7, x_n] \u2208 R^{d\u00d7n} and the label matrix Y = [y_1, y_2, \u00b7\u00b7\u00b7, y_n]^T \u2208 R^{n\u00d7c}. In this paper, we optimize\n\nmin_W J(W) = \u2211_{i=1}^n \u2016W^T x_i \u2212 y_i\u2016_2 + \u03b3R4(W) = \u2016X^T W \u2212 Y\u2016_{2,1} + \u03b3\u2016W\u2016_{2,1}.   (10)\n\nIt seems that solving this joint \u21132,1-norm problem is difficult, as both of the terms are non-smooth. Surprisingly, we will show in the next section that the problem can be solved using a simple yet efficient algorithm.\n\n4 An Efficient Algorithm\n\n4.1 Reformulation as a Constrained Problem\n\nFirst, the problem in Eq. 
(10) is equivalent to\n\nmin_W (1/\u03b3)\u2016X^T W \u2212 Y\u2016_{2,1} + \u2016W\u2016_{2,1},   (11)\n\nwhich is further equivalent to\n\nmin_{W,E} \u2016E\u2016_{2,1} + \u2016W\u2016_{2,1} s.t. X^T W + \u03b3E = Y.   (12)\n\nRewriting the above problem as\n\nmin_{W,E} \u2016[W; E]\u2016_{2,1} s.t. [X^T \u03b3I][W; E] = Y,   (13)\n\nwhere I \u2208 R^{n\u00d7n} is an identity matrix. Denote m = n + d. Let A = [X^T \u03b3I] \u2208 R^{n\u00d7m} and U = [W; E] \u2208 R^{m\u00d7c}; then the problem in Eq. (13) can be written as:\n\nmin_U \u2016U\u2016_{2,1} s.t. AU = Y.   (14)\n\nThe optimization problem in Eq. (14) has been widely used in the Multiple Measurement Vector (MMV) model in the signal processing community. It was generally felt that the \u21132,1-norm minimization problem is much more difficult to solve than the \u21131-norm minimization problem. Existing algorithms usually reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which can be solved by an interior point method or the bundle method. However, solving an SOCP or SDP is computationally very expensive, which limits their use in practice. Recently, an efficient algorithm was proposed to solve the specific problem in Eq. (14) by reformulating it as a min-max problem and then applying the proximal method [25]. The reported results show that the algorithm is more efficient than existing algorithms. However, the algorithm is a gradient descent type method and converges very slowly. 
Moreover, the algorithm is derived to solve this specific problem and cannot be applied directly to other general \u21132,1-norm minimization problems.\nIn the next subsection, we propose a very simple but much more efficient method to solve this problem. Theoretical analysis guarantees that the proposed method converges to the global optimum. More importantly, this method is very easy to implement and can be readily used to solve other general \u21132,1-norm minimization problems.\n\n4.2 An Efficient Algorithm to Solve the Constrained Problem\n\nThe Lagrangian function of the problem in Eq. (14) is\n\nL(U) = \u2016U\u2016_{2,1} \u2212 Tr(\u039b^T(AU \u2212 Y)).   (15)\n\nTaking the derivative of L(U) w.r.t. U and setting the derivative to zero, we have:\n\n\u2202L(U)/\u2202U = 2DU \u2212 A^T\u039b = 0,   (16)\n\nwhere D is a diagonal matrix with the i-th diagonal element\u00b9\n\nd_ii = 1/(2\u2016u^i\u2016_2).   (17)\n\nLeft multiplying the two sides of Eq. (16) by AD^{\u22121}, and using the constraint AU = Y, we have:\n\n2AU \u2212 AD^{\u22121}A^T\u039b = 0 \u21d2 2Y \u2212 AD^{\u22121}A^T\u039b = 0 \u21d2 \u039b = 2(AD^{\u22121}A^T)^{\u22121}Y.   (18)\n\nSubstituting Eq. (18) into Eq. (16), we arrive at:\n\nU = D^{\u22121}A^T(AD^{\u22121}A^T)^{\u22121}Y.   (19)\n\nSince the problem in Eq. (14) is a convex problem, U is a global optimum solution to the problem if and only if Eq. (19) is satisfied. Note that D depends on U and thus is also an unknown variable. We propose an iterative algorithm to obtain a solution U such that Eq. (19) is satisfied, and prove in the next subsection that the proposed iterative algorithm converges to the global optimum.\nThe algorithm is described in Algorithm 1. In each iteration, U is calculated with the current D, and then D is updated based on the currently calculated U. 
The iteration procedure is repeated until the algorithm converges.\n\nData: A \u2208 R^{n\u00d7m}, Y \u2208 R^{n\u00d7c}\nResult: U \u2208 R^{m\u00d7c}\nSet t = 0. Initialize D_t \u2208 R^{m\u00d7m} as an identity matrix.\nrepeat\n  Calculate U_{t+1} = D_t^{\u22121}A^T(AD_t^{\u22121}A^T)^{\u22121}Y.\n  Calculate the diagonal matrix D_{t+1}, where the i-th diagonal element is 1/(2\u2016u^i_{t+1}\u2016_2).\n  t = t + 1.\nuntil convergence\n\nAlgorithm 1: An efficient iterative algorithm to solve the optimization problem in Eq. (14).\n\n4.3 Algorithm Analysis\n\nAlgorithm 1 monotonically decreases the objective of the problem in Eq. (14) in each iteration. To prove this, we need the following lemma:\nLemma 1. For any nonzero vectors u, u_t \u2208 R^c, the following inequality holds:\n\n\u2016u\u2016_2 \u2212 \u2016u\u2016_2^2/(2\u2016u_t\u2016_2) \u2264 \u2016u_t\u2016_2 \u2212 \u2016u_t\u2016_2^2/(2\u2016u_t\u2016_2).   (20)\n\nProof. Beginning with the obvious inequality (\u221av \u2212 \u221av_t)^2 \u2265 0, we have\n\n(\u221av \u2212 \u221av_t)^2 \u2265 0 \u21d2 v \u2212 2\u221a(v v_t) + v_t \u2265 0 \u21d2 \u221av \u2212 v/(2\u221av_t) \u2264 \u221av_t \u2212 v_t/(2\u221av_t).   (21)\n\nSubstituting \u2016u\u2016_2^2 and \u2016u_t\u2016_2^2 for v and v_t in Eq. (21), respectively, we arrive at Eq. (20).\n\n\u00b9 When u^i = 0, d_ii = 0 is a subgradient of \u2016U\u2016_{2,1} w.r.t. u^i. However, we cannot set d_ii = 0 when u^i = 0; otherwise the derived algorithm cannot be guaranteed to converge. Two methods can be used to solve this problem. First, we see from Eq. (19) that we only need to calculate D^{\u22121}, so we can let the i-th diagonal element of D^{\u22121} be 2\u2016u^i\u2016_2. Second, we can regularize d_ii as d_ii = 1/(2\u221a((u^i)^T u^i + \u03c2)), and the derived algorithm can be proved to minimize the regularized \u21132,1-norm of U (defined as \u2211_{i=1}^n \u221a((u^i)^T u^i + \u03c2)) instead of the \u21132,1-norm of U. It is easy to see that the regularized \u21132,1-norm of U approximates the \u21132,1-norm of U when \u03c2 \u2192 0.\n\nThe convergence of Algorithm 1 is summarized in the following theorem:\nTheorem 1. Algorithm 1 monotonically decreases the objective of the problem in Eq. (14) in each iteration, and converges to the global optimum of the problem.\n\nProof. It can easily be verified that Eq. (19) is the solution to the following problem:\n\nmin_U Tr(U^T DU) s.t. AU = Y.   (22)\n\nThus in the t-th iteration,\n\nU_{t+1} = arg min_{AU=Y} Tr(U^T D_t U),   (23)\n\nwhich indicates that\n\nTr(U_{t+1}^T D_t U_{t+1}) \u2264 Tr(U_t^T D_t U_t).   (24)\n\nThat is to say,\n\n\u2211_{i=1}^m \u2016u^i_{t+1}\u2016_2^2/(2\u2016u^i_t\u2016_2) \u2264 \u2211_{i=1}^m \u2016u^i_t\u2016_2^2/(2\u2016u^i_t\u2016_2),   (25)\n\nwhere the vectors u^i_t and u^i_{t+1} denote the i-th rows of the matrices U_t and U_{t+1}, respectively. On the other hand, according to Lemma 1, for each i we have\n\n\u2016u^i_{t+1}\u2016_2 \u2212 \u2016u^i_{t+1}\u2016_2^2/(2\u2016u^i_t\u2016_2) \u2264 \u2016u^i_t\u2016_2 \u2212 \u2016u^i_t\u2016_2^2/(2\u2016u^i_t\u2016_2).   (26)\n\nThus the following inequality holds:\n\n\u2211_{i=1}^m (\u2016u^i_{t+1}\u2016_2 \u2212 \u2016u^i_{t+1}\u2016_2^2/(2\u2016u^i_t\u2016_2)) \u2264 \u2211_{i=1}^m (\u2016u^i_t\u2016_2 \u2212 \u2016u^i_t\u2016_2^2/(2\u2016u^i_t\u2016_2)).   (27)\n\nCombining Eq. (25) and Eq. (27), we arrive at\n\n\u2211_{i=1}^m \u2016u^i_{t+1}\u2016_2 \u2264 \u2211_{i=1}^m \u2016u^i_t\u2016_2.   (28)\n\nThat is to say,\n\n\u2016U_{t+1}\u2016_{2,1} \u2264 \u2016U_t\u2016_{2,1}.   (29)\n\nThus Algorithm 1 monotonically decreases the objective of the problem in Eq. (14) in each iteration t. At convergence, U_t and D_t satisfy Eq. (19). As the problem in Eq. (14) is a convex problem, satisfying Eq. (19) indicates that U is a global optimum solution to the problem in Eq. (14). Therefore, Algorithm 1 converges to the global optimum of the problem (14).\n\nNote that in each iteration, Eq. (19) can be solved efficiently. First, D is a diagonal matrix, and thus D^{\u22121} is also diagonal with the i-th diagonal element d_ii^{\u22121} = 2\u2016u^i\u2016_2. Second, the term Z = (AD^{\u22121}A^T)^{\u22121}Y in Eq. (19) can be efficiently obtained by solving the linear equation:\n\n(AD^{\u22121}A^T)Z = Y.   (30)\n\nEmpirical results show that the convergence is fast and only a few iterations are needed. Therefore, the proposed method can be applied to large scale problems in practice.\nIt is worth pointing out that the proposed method can be easily extended to solve other \u21132,1-norm minimization problems. 
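Algorithm 1 itself is straightforward to implement. The sketch below is an illustrative NumPy implementation (not the authors' code): it uses the regularized diagonal element from the footnote, with a small constant (here `eps`, playing the role of ς) guarding against zero rows, and obtains Z from the linear system of Eq. (30) instead of forming an explicit inverse.

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1}: sum of the l2-norms of the rows of M (Eq. (2))
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

def solve_l21_min(A, Y, eps=1e-8, max_iter=100, tol=1e-6):
    """Solve min_U ||U||_{2,1} s.t. AU = Y (Eq. (14)) via Algorithm 1."""
    n, m = A.shape
    d_inv = np.ones(m)                # D_0 = I, stored through its inverse diagonal
    obj_prev = np.inf
    for _ in range(max_iter):
        # U = D^{-1} A^T (A D^{-1} A^T)^{-1} Y, with Z from the system of Eq. (30)
        ADinv = A * d_inv             # A D^{-1}: scales column j of A by (D^{-1})_jj
        Z = np.linalg.solve(ADinv @ A.T, Y)
        U = (A.T * d_inv[:, None]) @ Z
        # Regularized update of D^{-1}: (D^{-1})_ii = 2 sqrt((u^i)^T u^i + eps)
        d_inv = 2.0 * np.sqrt(np.sum(U ** 2, axis=1) + eps)
        obj = l21_norm(U)
        if abs(obj_prev - obj) < tol:  # objective decreases monotonically (Theorem 1)
            break
        obj_prev = obj
    return U

# Toy usage: the returned U satisfies the constraint AU = Y.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
Y = rng.standard_normal((5, 3))
U = solve_l21_min(A, Y)
print(np.allclose(A @ U, Y))  # True
```

Note that the first iterate (with D = I) is exactly the minimum-Frobenius-norm solution A^T(AA^T)^{-1}Y; subsequent reweighted iterations drive whole rows of U toward zero, which is what produces the joint row sparsity.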
For example, consider a general \u21132,1-norm minimization problem as follows:\n\nmin_U f(U) + \u2211_k \u2016A_k U + B_k\u2016_{2,1} s.t. U \u2208 C.   (31)\n\nThe problem can be solved by iteratively solving the following problem:\n\nmin_U f(U) + \u2211_k Tr((A_k U + B_k)^T D_k (A_k U + B_k)) s.t. U \u2208 C,   (32)\n\nwhere D_k is a diagonal matrix with the i-th diagonal element 1/(2\u2016(A_k U + B_k)^i\u2016_2). Similar theoretical analysis can be used to prove that the iterative method converges to a local minimum. If the problem in Eq. (31) is a convex problem, i.e., f(U) is a convex function and C is a convex set, then the iterative method converges to the global minimum.\n\nFigure 1: Classification accuracy comparisons of six feature selection algorithms on six data sets: (a) ALLAML, (b) GLIOMA, (c) LUNG, (d) Carcinomas, (e) PROSTATE-GE, (f) PROSTATE-MS. SVM with 5-fold cross validation is used for classification. RFS is our method.\n\n5 Experimental Results\n\nIn order to validate the performance of our feature selection method, we applied it to two bioinformatics applications, gene expression and mass spectrometry classifications. In our experiments, we used five publicly available microarray data sets and one mass spectrometry (MS) data set: the ALLAML data set [6], the malignant glioma (GLIOMA) data set [17], the human lung carcinomas (LUNG) data set [2], the Human Carcinomas (Carcinomas) data set [24, 27], and the Prostate Cancer gene expression (Prostate-GE) data set [23] for microarray data; and Prostate Cancer (Prostate-MS) [20] for MS data. 
The Support Vector Machine (SVM) classifier is applied to these data sets, using 5-fold cross-validation.\n\n5.1 Data Sets Descriptions\n\nWe give a brief description of all data sets used in our experiments as follows.\nALLAML data set contains in total 72 samples in two classes, ALL and AML, which contain 47 and 25 samples, respectively. Every sample contains 7,129 gene expression values.\nGLIOMA data set contains in total 50 samples in four classes, cancer glioblastomas (CG), non-cancer glioblastomas (NG), cancer oligodendrogliomas (CO) and non-cancer oligodendrogliomas (NO), which have 14, 14, 7, 15 samples, respectively. Each sample has 12,625 genes. Genes with minimal variations across the samples were removed. For this data set, intensity thresholds were set at 20 and 16,000 units. Genes whose expression levels varied < 100 units between samples, or varied < 3 fold between any two samples, were excluded. After preprocessing, we obtained a data set with 50 samples and 4,433 genes.\nLUNG data set contains in total 203 samples in five classes, which have 139, 21, 20, 6, 17 samples, respectively. Each sample has 12,600 genes. The genes with standard deviations smaller than 50 expression units were removed, and we obtained a data set with 203 samples and 3,312 genes.\nCarcinomas data set is composed of in total 174 samples in eleven classes, prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas, and lung squamous cell carcinoma, which have 26, 8, 26, 23, 12, 11, 7, 27, 6, 14, 14 samples, respectively. In the original data [24], each sample contains 12,533 genes. 
In the preprocessed data set [27], there are 174 samples and 9,182 genes.\n\nTable 1: Classification accuracy of SVM using 5-fold cross validation. Six feature selection methods are compared. RF: ReliefF, F-s: F-score, IG: Information Gain, and RFS: our method.\n\nAverage accuracy of top 20 features (%):\nData set | RF | F-s | T-test | IG | mRMR | RFS\nALLAML | 90.36 | 89.11 | 92.86 | 93.21 | 93.21 | 95.89\nGLIOMA | 50 | 50 | 56 | 60 | 62 | 74\nLUNG | 91.68 | 87.7 | 89.22 | 93.1 | 92.61 | 93.63\nCarcinom. | 79.88 | 65.48 | 49.9 | 85.09 | 78.22 | 91.38\nPro-GE | 92.18 | 95.09 | 92.18 | 92.18 | 93.18 | 95.09\nPro-MS | 76.41 | 98.89 | 95.56 | 98.89 | 95.42 | 98.89\nAverage | 80.09 | 81.04 | 79.29 | 87.09 | 85.78 | 91.48\n\nAverage accuracy of top 80 features (%):\nData set | RF | F-s | T-test | IG | mRMR | RFS\nALLAML | 95.89 | 96.07 | 94.29 | 95.71 | 94.46 | 97.32\nGLIOMA | 54 | 60 | 58 | 66 | 66 | 70\nLUNG | 93.63 | 91.63 | 90.66 | 95.1 | 94.12 | 96.07\nCarcinom. | 90.24 | 83.33 | 68.91 | 89.65 | 87.92 | 93.66\nPro-GE | 91.18 | 93.18 | 93.18 | 89.27 | 86.36 | 95.09\nPro-MS | 89.93 | 98.89 | 94.44 | 98.89 | 93.14 | 100\nAverage | 85.81 | 87.18 | 83.25 | 89.10 | 87 | 92.02\n\nProstate-GE data set has in total 102 samples in two classes, tumor and normal, which have 52 and 50 samples, respectively. 
The original data set contains 12,600 genes. In our experiment, intensity thresholds were set at 100 and 16,000 units. We then filtered out the genes with max/min \u2264 5 or (max \u2212 min) \u2264 50. After preprocessing, we obtained a data set with 102 samples and 5,966 genes.\nProstate-MS data can be obtained from the FDA-NCI Clinical Proteomics Program Databank [20]. This MS data set consists of 190 samples diagnosed as benign prostate hyperplasia, 63 samples considered as having no evidence of disease, and 69 samples diagnosed as prostate cancer. The samples diagnosed as benign prostate hyperplasia as well as the samples having no evidence of prostate cancer were pooled into one set of 253 control samples, whereas the other 69 samples are the cancer samples.\n\n5.2 Classification Accuracy Comparisons\n\nAll data sets are standardized to be zero-mean and normalized by standard deviation. The SVM classifier has been individually performed on all data sets using 5-fold cross-validation. We utilize the linear kernel with the parameter C = 1. We compare our feature selection method (called RFS) to several popularly used feature selection methods in bioinformatics, such as the F-statistic [4], reliefF [11, 13], mRMR [19], the t-test, and information gain [21]. Because the above data sets are multi-class classification problems, we do not compare to \u21131-SVM, HHSVM and other methods that were designed for binary classification.\nFig. 1 shows the classification accuracy comparisons of all six feature selection methods on the six data sets. Table 1 shows the detailed experimental results using SVM. We compute the average accuracy using the top 20 and top 80 features for all feature selection approaches. Our approach outperforms the other methods significantly. 
With the top 20 features, our method is around 5%-12% better than the other methods on all six data sets.\n\n6 Conclusions\n\nIn this paper, we proposed a new efficient and robust feature selection method that emphasizes joint \u21132,1-norm minimization on both the loss function and the regularization. The \u21132,1-norm based regression loss function is robust to outliers in data points and also efficient in calculation. Motivated by previous work, the \u21132,1-norm regularization is used to select features across all data points with joint sparsity. We provided an efficient algorithm with proved convergence. Our method has been applied to both genomic and proteomic biomarker discovery. Extensive empirical studies have been performed on two bioinformatics tasks, six data sets, to demonstrate the performance of our method.\n\n7 Acknowledgements\n\nThis research was funded by US NSF-CCF-0830780, 0939187, 0917274, NSF DMS-0915228, NSF CNS-0923494, 1035913.\n\nReferences\n[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. NIPS, pages 41\u201348, 2007.\n[2] A. Bhattacharjee, W. G. Richards, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 98(24):13790\u201313795, 2001.\n[3] P. Bradley and O. Mangasarian. Feature selection via concave minimization and support vector machines. ICML, 1998.\n[4] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. Proceedings of the Computational Systems Bioinformatics, 2003.\n[5] C. Ding, D. Zhou, X. He, and H. Zha. R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization. Proc. Int'l Conf. Machine Learning (ICML), June 2006.\n[6] S. P. Fodor. DNA SEQUENCING: Massively Parallel Genomics. Science, 277(5324):393\u2013395, 1997.\n[7] I. Guyon and A. 
Elisseeff. An introduction to variable and feature selection. J. Machine Learning Research, 2003.\n[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1):389, 2002.\n[9] M. A. Hall and L. A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. 1999.\n[10] H. Huang and C. Ding. Robust tensor factorization using R1 norm. CVPR 2008, pages 1\u20138, 2008.\n[11] K. Kira and L. A. Rendell. A practical approach to feature selection. In A Practical Approach to Feature Selection, pages 249\u2013256, 1992.\n[12] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273\u2013324, 1997.\n[13] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171\u2013182, 1994.\n[14] P. Langley. Selection of relevant features in machine learning. In AAAI Fall Symposium on Relevance, pages 140\u2013144, 1994.\n[15] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Springer, 1998.\n[16] D. Luo, C. Ding, and H. Huang. Towards structural sparsity: An explicit \u21132/\u21130 approach. ICDM, 2010.\n[17] C. L. Nutt, D. R. Mani, R. A. Betensky, P. Tamayo, J. G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, and M. E. Mclaughlin. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res., 63:1602\u20131607, 2003.\n[18] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Technical report, Department of Statistics, University of California, Berkeley, 2006.\n[19] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. 
Pattern Analysis and Machine Intelligence, 27, 2005.\n[20] E. F. Petricoin, D. K. Ornstein, et al. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst., 94(20):1576\u20138, 2002.\n[21] L. E. Raileanu and K. Stoffel. Theoretical comparison between the Gini index and information gain criteria. University of Neuchatel, 2000.\n[22] Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507\u20132517, 2007.\n[23] D. Singh, P. Febbo, K. Ross, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, pages 203\u2013209, 2002.\n[24] A. I. Su, J. B. Welsh, L. M. Sapinoso, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 61:7388\u20137393, 2001.\n[25] L. Sun, J. Liu, J. Chen, and J. Ye. Efficient recovery of jointly sparse vectors. In Neural Information Processing Systems, 2009.\n[26] L. Wang, J. Zhu, and H. Zou. Hybrid huberized support vector machines for microarray classification. ICML, 2007.\n[27] K. Yang, Z. Cai, J. Li, and G. Lin. A stable gene selection in microarray data analysis. BMC Bioinformatics, 7:228, 2006.\n[28] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68:49\u201367, 2005.\n", "award": [], "sourceid": 1149, "authors": [{"given_name": "Feiping", "family_name": "Nie", "institution": null}, {"given_name": "Heng", "family_name": "Huang", "institution": null}, {"given_name": "Xiao", "family_name": "Cai", "institution": null}, {"given_name": "Chris", "family_name": "Ding", "institution": null}]}