{"title": "A Safe Screening Rule for Sparse Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1053, "page_last": 1061, "abstract": "The ℓ1-regularized logistic regression (or sparse logistic regression) is a widely used method for simultaneous classification and feature selection. Although many recent efforts have been devoted to its efficient implementation, its application to high-dimensional data still poses significant challenges. In this paper, we present a fast and effective sparse logistic regression screening rule (Slores) to identify the zero components in the solution vector, which may lead to a substantial reduction in the number of features to be entered into the optimization. An appealing feature of Slores is that the data set needs to be scanned only once to run the screening, and its computational cost is negligible compared to that of solving the sparse logistic regression problem. Moreover, Slores is independent of solvers for sparse logistic regression, so it can be integrated with any existing solver to improve efficiency. We have evaluated Slores using high-dimensional data sets from different applications. Extensive experimental results demonstrate that Slores outperforms the existing state-of-the-art screening rules and that the efficiency of solving sparse logistic regression is generally improved by one order of magnitude.", "full_text": "A Safe Screening Rule for Sparse Logistic Regression

Jie Wang, Arizona State University, Tempe, AZ 85287, jie.wang.ustc@asu.edu
Jiayu Zhou, Arizona State University, Tempe, AZ 85287, jiayu.zhou@asu.edu
Jun Liu, SAS Institute Inc., Cary, NC 27513, jun.liu@sas.com
Peter Wonka, Arizona State University, Tempe, AZ 85287, peter.wonka@asu.edu
Jieping Ye, Arizona State University, Tempe, AZ 85287, jieping.ye@asu.edu

Abstract

The ℓ1-regularized logistic regression (or sparse logistic regression) is a widely used method for simultaneous classification and feature selection. Although many recent efforts have been devoted to its efficient implementation, its application to high-dimensional data still poses significant challenges. In this paper, we present a fast and effective sparse logistic regression screening rule (Slores) to identify the "0" components in the solution vector, which may lead to a substantial reduction in the number of features to be entered into the optimization. An appealing feature of Slores is that the data set needs to be scanned only once to run the screening, and its computational cost is negligible compared to that of solving the sparse logistic regression problem. Moreover, Slores is independent of solvers for sparse logistic regression, so it can be integrated with any existing solver to improve efficiency. We have evaluated Slores using high-dimensional data sets from different applications.
Experiments demonstrate that Slores outperforms the existing state-of-the-art screening rules and that the efficiency of solving sparse logistic regression can be improved by one order of magnitude.

1 Introduction

Logistic regression (LR) is a popular and well-established classification method that has been widely used in many domains such as machine learning [4, 7], text mining [3, 8], image processing [9, 15], bioinformatics [1, 13, 19, 27, 28], and the medical and social sciences [2, 17]. When the number of feature variables is large compared to the number of training samples, logistic regression is prone to over-fitting. To reduce over-fitting, regularization has been shown to be a promising approach. Typical examples include ℓ2 and ℓ1 regularization. Although ℓ1-regularized LR is more challenging to solve than ℓ2-regularized LR, it has received much attention in the last few years and the interest in it is growing [20, 25, 28] due to the increasing prevalence of high-dimensional data. The most appealing property of ℓ1-regularized LR is the sparsity of the resulting models, which is equivalent to feature selection.

In the past few years, many algorithms have been proposed to efficiently solve the ℓ1-regularized LR [5, 12, 11, 18]. However, for large-scale problems, solving the ℓ1-regularized LR with high accuracy remains challenging. One promising solution is "screening": we first identify the "inactive" features, which have 0 coefficients in the solution, and then discard them from the optimization. This results in a reduced feature matrix and substantial savings in computational cost and memory. In [6], El Ghaoui et al. proposed novel screening rules, called "SAFE", to accelerate the optimization for a class of ℓ1-regularized problems, including LASSO [21], ℓ1-regularized LR and ℓ1-regularized support vector machines. Inspired by SAFE, Tibshirani et al. [22] proposed "strong rules" for a large class of ℓ1-regularized problems, including LASSO, elastic net, ℓ1-regularized LR and more general convex problems. In [26], Xiang et al. proposed "DOME" rules to further improve SAFE rules for LASSO, based on the observation that SAFE rules can be understood as a special case of the general "sphere test". Although both strong rules and the sphere tests are more effective in discarding features than SAFE for solving LASSO, it is worthwhile to mention that strong rules may mistakenly discard features that have non-zero coefficients in the solution, and the sphere tests are not easily generalized to handle the ℓ1-regularized LR. To the best of our knowledge, the SAFE rule is the only screening test for the ℓ1-regularized LR that is "safe", that is, it only discards features that are guaranteed to be absent from the resulting models.

In this paper, we develop novel screening rules, called "Slores", for the ℓ1-regularized LR. The proposed screening tests detect inactive features by estimating an upper bound of the inner product between each feature vector and the "dual optimal solution" of the ℓ1-regularized LR, which is unknown. The more accurate the estimation is, the more inactive features can be detected. An accurate estimation of such an upper bound turns out to be quite challenging. Indeed, most of the key ideas/insights behind existing "safe" screening rules for LASSO rely heavily on the least squares loss, and they are not applicable to the ℓ1-regularized LR due to the presence of the logistic loss. To this end, we propose a novel framework to accurately estimate an upper bound. Our key technical contribution is to formulate the estimation of an upper bound of the inner product as a constrained convex optimization problem and to show that it admits a closed form solution. Therefore, the estimation of the inner product can be computed efficiently. Our extensive experiments have shown that Slores discards far more features than SAFE yet requires much less computational effort. In contrast with strong rules, Slores is "safe", i.e., it never discards features which have non-zero coefficients in the solution.

To illustrate the effectiveness of Slores, we compare Slores, the strong rule and SAFE on a prostate cancer data set along a sequence of 86 parameter values equally spaced on the λ/λmax scale from 0.1 to 0.95, where λ is the parameter for the ℓ1 penalty and λmax is the smallest tuning parameter [10] such that the solution of the ℓ1-regularized LR is 0 [please refer to Eq. (1)]. The data matrix contains 132 patients with 15154 features. To measure the performance of different screening rules, we compute the rejection ratio, which is the ratio between the number of features discarded by a screening rule and the number of features with 0 coefficients in the solution. Therefore, the larger the rejection ratio is, the more effective the screening rule is. The results are shown in Fig. 1. We can see that Slores discards far more features than SAFE, especially when λ/λmax is large, while the strong rule is not applicable when λ/λmax ≤ 0.5. We present more results and discussions to demonstrate the effectiveness of Slores in Section 6.
For proofs of the lemmas, corollaries, and theorems, please refer to the long version of this paper [24].

Figure 1: Comparison of Slores, the strong rule and SAFE on the prostate cancer data set.

2 Basics and Motivations

In this section, we briefly review the basics of the ℓ1-regularized LR and then motivate the general screening rules via the KKT conditions. Suppose we are given a set of training samples {x_i}_{i=1}^m and the associated labels b ∈ R^m, where x_i ∈ R^p and b_i ∈ {1, −1} for all i ∈ {1, . . . , m}. The ℓ1-regularized logistic regression is:

min_{β,c} (1/m) Σ_{i=1}^m log(1 + exp(−⟨β, x̄_i⟩ − b_i c)) + λ‖β‖_1,    (LRP_λ)

where β ∈ R^p and c ∈ R are the model parameters to be estimated, x̄_i = b_i x_i, and λ > 0 is the tuning parameter. We denote by X̄ ∈ R^{m×p} the data matrix with the ith row being x̄_i and the jth column being x̄^j.

Let C = {θ ∈ R^m : θ_i ∈ (0, 1), i = 1, . . . , m} and f(y) = y log(y) + (1 − y) log(1 − y) for y ∈ (0, 1). The dual problem of (LRP_λ) [24] is given by

min_θ { g(θ) = (1/m) Σ_{i=1}^m f(θ_i) : ‖X̄^T θ‖_∞ ≤ mλ, ⟨θ, b⟩ = 0, θ ∈ C }.    (LRD_λ)

To simplify notation, we denote the feasible set of problem (LRD_λ) by F_λ, and let (β*_λ, c*_λ) and θ*_λ be the optimal solutions of problems (LRP_λ) and (LRD_λ) respectively. In [10], the authors have shown that for some special choices of the tuning parameter λ, both (LRP_λ) and (LRD_λ) have closed form solutions. In fact, let P = {i : b_i = 1}, N = {i : b_i = −1}, and let m^+ and m^− be the cardinalities of P and N respectively. We define

λmax = (1/m) ‖X̄^T θ*_λmax‖_∞,    (1)

where

[θ*_λmax]_i = m^−/m if i ∈ P, and m^+/m if i ∈ N, for i = 1, . . . , m.    (2)

([·]_i denotes the ith component of a vector.) Then it is known [10] that β*_λ = 0 and θ*_λ = θ*_λmax whenever λ ≥ λmax. When λ ∈ (0, λmax], it is known that (LRD_λ) has a unique optimal solution [24]. We can now write the KKT conditions of problems (LRP_λ) and (LRD_λ) as

⟨θ*_λ, x̄^j⟩ ∈ { {mλ} if [β*_λ]_j > 0;  {−mλ} if [β*_λ]_j < 0;  [−mλ, mλ] if [β*_λ]_j = 0 },  j = 1, . . . , p.    (3)

In view of Eq. (3), we can see that

|⟨θ*_λ, x̄^j⟩| < mλ ⇒ [β*_λ]_j = 0.    (R1)

In other words, if |⟨θ*_λ, x̄^j⟩| < mλ, then the KKT conditions imply that the coefficient of x̄^j in the solution β*_λ is 0, and thus the jth feature can be safely removed from the optimization of (LRP_λ). However, for the general case in which λ < λmax, (R1) is not applicable since it assumes knowledge of θ*_λ. Although θ*_λ is unknown, we can still estimate a region A_λ which contains θ*_λ. As a result, if max_{θ∈A_λ} |⟨θ, x̄^j⟩| < mλ, we can also conclude that [β*_λ]_j = 0 by (R1).
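As a concrete illustration of Eqs. (1)–(2) and rule (R1), the following minimal NumPy sketch computes λmax together with its closed-form dual optimum and applies the test |⟨θ*_λ, x̄^j⟩| < mλ. The function names are our own illustrative choices, not part of any released implementation:

```python
import numpy as np

def lambda_max_and_dual(X, b):
    """Closed-form lambda_max and dual optimum theta*_{lambda_max}
    for l1-regularized LR, following Eqs. (1) and (2).
    X is the m x p data matrix (rows x_i), b the +/-1 label vector."""
    m = b.size
    m_plus = np.count_nonzero(b == 1)
    m_minus = m - m_plus
    # [theta*]_i = m^-/m if b_i = +1, and m^+/m if b_i = -1   (Eq. 2)
    theta = np.where(b == 1, m_minus / m, m_plus / m)
    Xbar = X * b[:, None]                                  # xbar_i = b_i * x_i
    lam_max = np.linalg.norm(Xbar.T @ theta, np.inf) / m   # Eq. (1)
    return lam_max, theta

def screen_r1(Xbar, theta_opt, lam):
    """Rule (R1): feature j is inactive if |<theta*_lam, xbar^j>| < m*lam.
    Assumes the exact dual optimum theta*_lam is available."""
    m = theta_opt.size
    return np.abs(Xbar.T @ theta_opt) < m * lam   # True -> [beta*_lam]_j = 0
```

Note that the dual optimum in Eq. (2) is feasible by construction: ⟨θ, b⟩ = (m^−/m)·m^+ − (m^+/m)·m^− = 0.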
In other words, (R1) can be relaxed as

T(θ*_λ, x̄^j) := max_{θ∈A_λ} |⟨θ, x̄^j⟩| < mλ ⇒ [β*_λ]_j = 0.    (R1′)

In this paper, (R1′) serves as the foundation for constructing our screening rules, Slores. From (R1′), it is easy to see that screening rules with smaller T(θ*_λ, x̄^j) are more aggressive in discarding features. To give a tight estimation of T(θ*_λ, x̄^j), we need to make the region A_λ which includes θ*_λ as small as possible. In Section 3, we show that the estimation of the upper bound T(θ*_λ, x̄^j) can be obtained by solving a convex optimization problem. We show in Section 4 that the convex optimization problem admits a closed form solution, and we derive Slores in Section 5 based on (R1′).

3 Estimating the Upper Bound via Solving a Convex Optimization Problem

In this section, we present a novel framework to estimate an upper bound T(θ*_λ, x̄^j) of |⟨θ*_λ, x̄^j⟩|. In the subsequent development, we assume a parameter λ0 and the corresponding dual optimum θ*_λ0 are given. In our Slores rule to be presented in Section 5, we set λ0 and θ*_λ0 to be λmax and θ*_λmax given in Eqs. (1) and (2). We formulate the estimation of T(θ*_λ, x̄^j) as a constrained convex optimization problem in this section, which will be shown to admit a closed form solution in Section 4.

For the dual function g(θ), it follows that [∇g(θ)]_i = (1/m) log(θ_i/(1 − θ_i)) and [∇²g(θ)]_{i,i} = 1/(m θ_i(1 − θ_i)) ≥ 4/m. Since ∇²g(θ) is a diagonal matrix, it follows that ∇²g(θ) ⪰ (4/m)I, where I is the identity matrix. Thus, g(θ) is strongly convex with modulus μ = 4/m [16]. Rigorously, we have the following lemma.

Lemma 1. Let λ > 0 and θ1, θ2 ∈ F_λ. Then:

a). g(θ2) − g(θ1) ≥ ⟨∇g(θ1), θ2 − θ1⟩ + (2/m)‖θ2 − θ1‖²_2.    (4)

b). If θ1 ≠ θ2, the inequality in (4) becomes a strict inequality, i.e., "≥" becomes ">".

Given λ ∈ (0, λ0], it is easy to see that both θ*_λ and θ*_λ0 belong to F_λ0. Therefore, Lemma 1 is a useful tool to bound θ*_λ with the knowledge of θ*_λ0. In fact, we have the following theorem.

Theorem 2. Let λmax ≥ λ0 > λ > 0. Then the following holds:

a). ‖θ*_λ − θ*_λ0‖²_2 ≤ (m/2) [ g((λ/λ0) θ*_λ0) − g(θ*_λ0) + (1 − λ/λ0) ⟨∇g(θ*_λ0), θ*_λ0⟩ ].    (5)

b). If θ*_λ ≠ θ*_λ0, the inequality in (5) becomes a strict inequality, i.e., "≤" becomes "<".

Theorem 2 implies that θ*_λ is inside a ball centred at θ*_λ0 with radius

r = sqrt( (m/2) [ g((λ/λ0) θ*_λ0) − g(θ*_λ0) + (1 − λ/λ0) ⟨∇g(θ*_λ0), θ*_λ0⟩ ] ).    (6)

Recall that to make our screening rules more aggressive in discarding features, we need to get a tight upper bound T(θ*_λ, x̄^j) of |⟨θ*_λ, x̄^j⟩| [please see (R1′)].
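The radius in Eq. (6) only involves the dual objective g and its gradient at θ*_λ0, so it can be computed in O(m) time from the data. A small sketch under the paper's definitions (function names are illustrative):

```python
import numpy as np

def dual_radius(theta0, lam, lam0):
    """Radius r of the ball containing theta*_lam (Theorem 2 / Eq. (6)),
    given the dual optimum theta0 at a larger parameter lam0 >= lam."""
    m = theta0.size
    f = lambda y: y * np.log(y) + (1.0 - y) * np.log(1.0 - y)
    g = lambda th: np.sum(f(th)) / m               # dual objective g
    grad_g = lambda th: np.log(th / (1.0 - th)) / m  # [grad g(theta)]_i
    t = lam / lam0
    # bracketed term of Eq. (6); nonnegative by convexity of g
    gap = g(t * theta0) - g(theta0) + (1.0 - t) * np.dot(grad_g(theta0), theta0)
    return np.sqrt(m * gap / 2.0)
```

As expected from Theorem 2, the radius vanishes as λ → λ0 and grows as λ moves away from λ0.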
Thus, it is desirable to further restrict the possible region A_λ of θ*_λ. Clearly,

⟨θ*_λ, b⟩ = 0,    (7)

since θ*_λ is feasible for problem (LRD_λ). On the other hand, we call the set I_λ0 = {j : |⟨θ*_λ0, x̄^j⟩| = mλ0, j = 1, . . . , p} the "active set" of θ*_λ0. We have the following lemma for the active set.

Lemma 3. Given the optimal solution θ*_λ of problem (LRD_λ), the active set I_λ = {j : |⟨θ*_λ, x̄^j⟩| = mλ, j = 1, . . . , p} is not empty if λ ∈ (0, λmax].

Since λ0 ∈ (0, λmax], we can see that I_λ0 is not empty by Lemma 3. We pick j0 ∈ I_λ0 and set

x̄* = sign(⟨θ*_λ0, x̄^{j0}⟩) x̄^{j0}.    (8)

It follows that ⟨x̄*, θ*_λ0⟩ = mλ0. Due to the feasibility of θ*_λ for problem (LRD_λ), θ*_λ satisfies

⟨θ*_λ, x̄*⟩ ≤ mλ.    (9)

As a result, Theorem 2, Eq. (7) and Eq. (9) imply that θ*_λ is contained in the following set:

A^λ_λ0 := {θ : ‖θ − θ*_λ0‖²_2 ≤ r², ⟨θ, b⟩ = 0, ⟨θ, x̄*⟩ ≤ mλ}.

Since θ*_λ ∈ A^λ_λ0, we can see that |⟨θ*_λ, x̄^j⟩| ≤ max_{θ∈A^λ_λ0} |⟨θ, x̄^j⟩|. Therefore, (R1′) implies that if

T(θ*_λ, x̄^j; θ*_λ0) := max_{θ∈A^λ_λ0} |⟨θ, x̄^j⟩|    (UBP)

is smaller than mλ, we can conclude that [β*_λ]_j = 0 and x̄^j can be discarded from the optimization of (LRP_λ). Notice that we replace the notations A_λ and T(θ*_λ, x̄^j) with A^λ_λ0 and T(θ*_λ, x̄^j; θ*_λ0) to emphasize their dependence on θ*_λ0. Clearly, as long as we can solve for T(θ*_λ, x̄^j; θ*_λ0), (R1′) is an applicable screening rule for discarding features which have 0 coefficients in β*_λ. We give a closed form solution of problem (UBP) in the next section.

4 Solving the Convex Optimization Problem (UBP)

In this section, we show how to solve the convex optimization problem (UBP) based on the standard Lagrangian multiplier method. We first transform problem (UBP) into a pair of convex minimization problems (UBP′) via Eq. (11), and then show that strong duality holds for (UBP′) in Lemma 6. The strong duality guarantees the applicability of the Lagrangian multiplier method. We then give the closed form solution of (UBP′) in Theorem 8. After we solve problem (UBP′), it is straightforward to compute the solution of problem (UBP) via Eq. (11).

Before we solve (UBP) for the general case, it is worthwhile to mention a special case in which Px̄^j = x̄^j − (⟨x̄^j, b⟩/‖b‖²_2) b = 0. Clearly, P is the projection operator which projects a vector onto the orthogonal complement of the space spanned by b. In fact, we have the following theorem.

Theorem 4. Let λmax ≥ λ0 > λ > 0, and assume θ*_λ0 is known. For j ∈ {1, . . . , p}, if Px̄^j = 0, then T(θ*_λ, x̄^j; θ*_λ0) = 0.

Because of (R1′), we immediately have the following corollary.

Corollary 5. Let λ ∈ (0, λmax) and j ∈ {1, . . . , p}. If Px̄^j = 0, then [β*_λ]_j = 0.

For the general case in which Px̄^j ≠ 0, let

T_+(θ*_λ, x̄^j; θ*_λ0) := max_{θ∈A^λ_λ0} ⟨θ, +x̄^j⟩,  T_−(θ*_λ, x̄^j; θ*_λ0) := max_{θ∈A^λ_λ0} ⟨θ, −x̄^j⟩.    (10)

Clearly, we have

T(θ*_λ, x̄^j; θ*_λ0) = max{T_+(θ*_λ, x̄^j; θ*_λ0), T_−(θ*_λ, x̄^j; θ*_λ0)}.    (11)

Therefore, we can solve problem (UBP) by solving the two sub-problems in (10).

Let ξ ∈ {+1, −1}. Then the problems in (10) can be written uniformly as

T_ξ(θ*_λ, x̄^j; θ*_λ0) = max_{θ∈A^λ_λ0} ⟨θ, ξx̄^j⟩.    (UBPs)

To make use of the standard Lagrangian multiplier method, we transform problem (UBPs) into the following minimization problem:

−T_ξ(θ*_λ, x̄^j; θ*_λ0) = min_{θ∈A^λ_λ0} ⟨θ, −ξx̄^j⟩    (UBP′)

by noting that max_{θ∈A^λ_λ0} ⟨θ, ξx̄^j⟩ = −min_{θ∈A^λ_λ0} ⟨θ, −ξx̄^j⟩.

Lemma 6. Let λmax ≥ λ0 > λ > 0 and assume θ*_λ0 is known. Strong duality holds for problem (UBP′).
Moreover, problem (UBP′) admits an optimal solution in A^λ_λ0.

Because strong duality holds for problem (UBP′) by Lemma 6, the Lagrangian multiplier method is applicable for (UBP′). In general, we need to first solve the dual problem and then recover the optimal solution of the primal problem via the KKT conditions. Recall that r and x̄* are defined by Eq. (6) and Eq. (8) respectively. Lemma 7 derives the dual problems of (UBP′) for the different cases.

Lemma 7. Let λmax ≥ λ0 > λ > 0 and assume θ*_λ0 is known. For j ∈ {1, . . . , p} with Px̄^j ≠ 0, let x̄ = −ξx̄^j. Denote

U1 = {(u1, u2) : u1 > 0, u2 ≥ 0} and U2 = {(u1, u2) : u1 = 0, u2 = −⟨Px̄, Px̄*⟩/‖Px̄*‖²_2}.

a). If ⟨Px̄, Px̄*⟩/(‖Px̄‖_2 ‖Px̄*‖_2) ∈ (−1, 1], the dual problem of (UBP′) is equivalent to:

max_{(u1,u2)∈U1} ḡ(u1, u2) = −(1/(2u1)) ‖Px̄ + u2 Px̄*‖²_2 + u2 m(λ0 − λ) + ⟨θ*_λ0, x̄⟩ − (1/2) u1 r².    (UBD′)

Moreover, ḡ(u1, u2) attains its maximum in U1.

b). If ⟨Px̄, Px̄*⟩/(‖Px̄‖_2 ‖Px̄*‖_2) = −1, the dual problem of (UBP′) is equivalent to:

max_{(u1,u2)∈U1∪U2} ĝ(u1, u2) = { ḡ(u1, u2) if (u1, u2) ∈ U1;  −(‖Px̄‖_2/‖Px̄*‖_2) mλ if (u1, u2) ∈ U2 }.    (UBD′′)

We can now solve problem (UBP′) in the following theorem.

Theorem 8. Let λmax ≥ λ0 > λ > 0, let d = m(λ0 − λ)/(r ‖Px̄*‖_2), and assume θ*_λ0 is known. For j ∈ {1, . . . , p} with Px̄^j ≠ 0, let x̄ = −ξx̄^j.

a). If ⟨Px̄, Px̄*⟩/(‖Px̄‖_2 ‖Px̄*‖_2) ≥ d, then

T_ξ(θ*_λ, x̄^j; θ*_λ0) = r ‖Px̄‖_2 − ⟨θ*_λ0, x̄⟩;    (12)

b). If ⟨Px̄, Px̄*⟩/(‖Px̄‖_2 ‖Px̄*‖_2) < d, then

T_ξ(θ*_λ, x̄^j; θ*_λ0) = r ‖Px̄ + u*_2 Px̄*‖_2 − u*_2 m(λ0 − λ) − ⟨θ*_λ0, x̄⟩,    (13)

where

u*_2 = (−a1 + √Δ)/(2a2),
a2 = ‖Px̄*‖⁴_2 (1 − d²),
a1 = 2⟨Px̄, Px̄*⟩ ‖Px̄*‖²_2 (1 − d²),
a0 = ⟨Px̄, Px̄*⟩² − d² ‖Px̄‖²_2 ‖Px̄*‖²_2,
Δ = a1² − 4a2a0 = 4d²(1 − d²) ‖Px̄*‖⁴_2 (‖Px̄‖²_2 ‖Px̄*‖²_2 − ⟨Px̄, Px̄*⟩²).    (14)

Notice that, although the dual problems of (UBP′) in Lemma 7 are different, the resulting upper bound T_ξ(θ*_λ, x̄^j; θ*_λ0) can be given by Theorem 8 in a uniform way.
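Since Theorem 8 is stated in closed form, the bound T_ξ can be evaluated with a few vector operations per feature. The sketch below follows Eqs. (12)–(14) literally (the function and argument names are our own); it assumes Px̄^j ≠ 0 and inputs coming from a valid screening configuration, so that d ≤ 1:

```python
import numpy as np

def T_xi(xbar_j, xi, theta0, xstar, b, r, lam, lam0):
    """Upper bound T_xi of Theorem 8. xstar is the signed active feature
    of Eq. (8), theta0 the dual optimum at lam0, r the radius of Eq. (6)."""
    proj = lambda v: v - (np.dot(v, b) / np.dot(b, b)) * b  # P onto b-perp
    m = b.size
    x = -xi * xbar_j                     # the paper's xbar = -xi * xbar^j
    Px, Pxs = proj(x), proj(xstar)
    d = m * (lam0 - lam) / (r * np.linalg.norm(Pxs))
    cos = np.dot(Px, Pxs) / (np.linalg.norm(Px) * np.linalg.norm(Pxs))
    if cos >= d:                         # case (a), Eq. (12)
        return r * np.linalg.norm(Px) - np.dot(theta0, x)
    # case (b), Eqs. (13)-(14)
    n2 = np.dot(Pxs, Pxs)
    a2 = n2**2 * (1.0 - d**2)
    a1 = 2.0 * np.dot(Px, Pxs) * n2 * (1.0 - d**2)
    a0 = np.dot(Px, Pxs)**2 - d**2 * np.dot(Px, Px) * n2
    u2 = (-a1 + np.sqrt(a1**2 - 4.0 * a2 * a0)) / (2.0 * a2)
    return (r * np.linalg.norm(Px + u2 * Pxs)
            - u2 * m * (lam0 - lam) - np.dot(theta0, x))
```

A quick sanity check is to compare T_ξ against a brute-force maximization of ⟨θ, ξx̄^j⟩ over sampled points of A^λ_λ0; by construction the closed form must dominate every feasible sample.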
The tricky part is how to deal with the extremal cases in which ⟨Px̄, Px̄*⟩/(‖Px̄‖_2 ‖Px̄*‖_2) ∈ {−1, +1}.

5 The Proposed Slores Rule for ℓ1-Regularized Logistic Regression

Using (R1′), we are now ready to construct the screening rules for the ℓ1-regularized logistic regression. By Corollary 5, we can see that orthogonality between the jth feature and the response vector b implies the absence of x̄^j from the resulting model. For the general case in which Px̄^j ≠ 0, (R1′) implies that if T(θ*_λ, x̄^j; θ*_λ0) = max{T_+(θ*_λ, x̄^j; θ*_λ0), T_−(θ*_λ, x̄^j; θ*_λ0)} < mλ, then the jth feature can be discarded from the optimization of (LRP_λ). Notice that, letting ξ = ±1, T_+(θ*_λ, x̄^j; θ*_λ0) and T_−(θ*_λ, x̄^j; θ*_λ0) have been solved by Theorem 8. Rigorously, we have the following theorem.

Theorem 9 (Slores). Let λ0 > λ > 0 and assume θ*_λ0 is known.
1. If λ ≥ λmax, then β*_λ = 0;
2. If λmax ≥ λ0 > λ > 0 and either of the following holds:
(a) Px̄^j = 0,
(b) max{T_ξ(θ*_λ, x̄^j; θ*_λ0) : ξ = ±1} < mλ,
then [β*_λ]_j = 0.

Based on Theorem 9, we construct the Slores rule as summarized below in Algorithm 1. Notice that the output R of Slores is the set of indices of the features that need to be entered into the optimization. As a result, suppose the output of Algorithm 1 is R = {j1, . . . , jk}; we can substitute the full matrix X in problem (LRP_λ) with the sub-matrix X_R = (x̄^{j1}, . . . , x̄^{jk}) and just solve for [β*_λ]_R and c*_λ.

Algorithm 1  R = Slores(X, b, λ, λ0, θ*_λ0)
  Initialize R := {1, . . . , p};
  if λ ≥ λmax then
    set R = ∅;
  else
    for j = 1 to p do
      if Px̄^j = 0 then
        remove j from R;
      else if max{T_ξ(θ*_λ, x̄^j; θ*_λ0) : ξ = ±1} < mλ then
        remove j from R;
      end if
    end for
  end if
  Return: R

On the other hand, Algorithm 1 implies that Slores needs five inputs. Since X and b come with the data and λ is chosen by the user, we only need to specify θ*_λ0 and λ0. In other words, we need to provide Slores with a dual optimal solution of problem (LRD_λ) for an arbitrary parameter. A natural choice is to set λ0 = λmax and θ*_λ0 = θ*_λmax, given by Eq. (1) and Eq. (2) in closed form.

6 Experiments

We evaluate our screening rules using the newsgroup data set [10] and the Yahoo web pages data sets [23]. The newsgroup data set is cultured from the data by Koh et al. [10]. The Yahoo data sets include 11 top-level categories, each of which is further divided into a set of subcategories. In our experiment we construct five balanced binary classification data sets from the topics of Computers, Education, Health, Recreation, and Science. For each topic, we choose samples from one subcategory as the positive class and randomly sample an equal number of samples from the rest of the subcategories as the negative class. The statistics of the data sets are given in Table 1.

Table 1: Statistics of the test data sets.
Data set     m      p      no. nonzeros
newsgroup    11269  61188  1467345
Computers    216    25259  23181
Education    254    20782  28287
Health       228    18430  40145
Recreation   370    25095  49986
Science      222    24002  37227

We compare the performance of Slores and the strong rule, which achieves state-of-the-art performance for the ℓ1-regularized LR. We do not include SAFE because it is less effective in discarding features than strong rules and requires much higher computational time [22]. Fig. 1 has shown the performance of Slores, the strong rule and SAFE. We compare the efficiency of the three screening rules on the same prostate cancer data set in Table 2. All of the screening rules are tested along a sequence of 86 parameter values equally spaced on the λ/λmax scale from 0.1 to 0.95. We repeat the procedure 100 times, and during each run we undersample 80% of the data. We report the total running time of the three screening rules over the 86 values of λ/λmax in Table 2. For reference, we also report the total running time of the solver¹. We observe that the running time of Slores and the strong rule is negligible compared to that of the solver. However, SAFE takes much longer, even than the solver.

Table 2: Running time (in seconds) of Slores, the strong rule, SAFE and the solver.
Slores   Strong Rule   SAFE      Solver
0.37     0.33          1128.65   10.56

In Section 6.1, we evaluate the performance of Slores and the strong rule. Recall that we use the rejection ratio, i.e., the ratio between the number of features discarded by the screening rules and the number of features with 0 coefficients in the solution, to measure the performance of screening rules. Note that, because no features with non-zero coefficients in the solution would be mistakenly discarded by Slores, its rejection ratio is no larger than one. We then compare the efficiency of Slores and the strong rule in Section 6.2.

The experiment settings are as follows.
For each data set, we undersample 80% of the date and\nrun Slores and strong rules along a sequence of 86 parameter values equally spaced on the \u03bb/\u03bbmax\nscale from 0.1 to 0.95. We repeat the procedure 100 times and report the average performance and\nrunning time at each of the 86 values of \u03bb/\u03bbmax. Slores, strong rules and SAFE are all implemented\nin Matlab. All of the experiments are carried out on a Intel(R) (i7-2600) 3.4Ghz processor.\n\nSolver\n10.56\n\nSlores\n0.37\n\nStrong Rule\n\n0.33\n\n6.1 Comparison of Performance\n\nIn this experiment, we evaluate the performance of the Slores and the strong rule via the rejection\nratio. Fig. 2 shows the rejection ratio of Slores and strong rule on six real data sets. When \u03bb/\u03bbmax >\n0.5, we can see that both Slores and strong rule are able to identify almost 100% of the inactive\nfeatures, i.e., features with 0 coef\ufb01cients in the solution vector. However, when \u03bb/\u03bbmax \u2264 0.5,\nstrong rule can not detect the inactive features. In contrast, we observe that Slores exhibits much\nstronger capability in discarding inactive features for small \u03bb, even when \u03bb/\u03bbmax is close to 0.1.\nTaking the data point at which \u03bb/\u03bbmax = 0.1 for example, Slores discards about 99% inactive\nfeatures for the newsgroup data set. For the other data sets, more than 80% inactive features are\nidenti\ufb01ed by Slores. Thus, in terms of rejection ratio, Slores signi\ufb01cantly outperforms the strong\nrule. Moreover, the discarded features by Slores are guaranteed to have 0 coef\ufb01cients in the solution.\nBut strong rule may mistakenly discard features which have non-zero coef\ufb01cients in the solution.\n\n6.2 Comparison of Ef\ufb01ciency\n\nWe compare ef\ufb01ciency of Slores and the strong rule in this experiment. The data sets for evaluating\nthe rules are the same as Section 6.1. The running time of the screening rules reported in Fig. 
3 includes the computational cost of the rules themselves and that of the solver after screening. We plot the running time of the screening rules against that of the solver without screening. As indicated by Fig. 2, when λ/λmax > 0.5, Slores and the strong rule discard almost 100% of the inactive features.

¹In this paper, the ground truth is computed by SLEP [14].

Figure 2: Comparison of the performance of Slores and the strong rule on six real data sets: (a) newsgroup, (b) Computers, (c) Education, (d) Health, (e) Recreation, (f) Science.

Figure 3: Comparison of the efficiency of Slores and the strong rule on the same six data sets.

As a result, the size of the feature matrix involved in the optimization of problem (LRPλ) is greatly reduced. From Fig. 3, we observe that the efficiency is improved by about one order of magnitude on average compared to that of the solver without screening. However, when λ/λmax < 0.5, the strong rule cannot identify any inactive features, and thus the running time is almost the same as that of the solver without screening. In contrast, Slores is still able to identify more than 80% of the inactive features for the data sets derived from the Yahoo web pages collection, and thus the efficiency is improved by a factor of roughly five. For the newsgroup data set, about 99% of the inactive features are identified by Slores, which leads to a roughly ten-fold saving in running time. These results demonstrate the power of the proposed Slores rule in improving the efficiency of solving ℓ1-regularized LR.

7 Conclusions

In this paper, we propose a novel screening rule, Slores, to effectively discard features for ℓ1-regularized LR. Extensive numerical experiments on real data demonstrate that Slores outperforms the existing state-of-the-art screening rules. 
We plan to extend the framework of Slores to more general sparse formulations, including convex ones, such as the group Lasso, the fused Lasso and the ℓ1-regularized SVM, and non-convex ones, such as ℓp-regularized problems with 0 < p < 1.

References

[1] M. Asgary, S. Jahandideh, P. Abdolmaleki, and A. Kazemnejad. Analysis and identification of β-turn types using multinomial logistic regression and artificial neural network. Bioinformatics, 23(23):3125–3130, 2007.
[2] C. Boyd, M. Tolson, and W. Copes. Evaluating trauma care: The TRISS method, trauma score and the injury severity score. Journal of Trauma, 27:370–378, 1987.
[3] J. R. Brzezinski and G. J. Knafl. Logistic regression modeling for context-based classification. In DEXA Workshop, pages 755–759, 1999.
[4] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.
[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32:407–499, 2004.
[6] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination for the lasso and sparse supervised learning problems. arXiv:1009.4219v2.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 38(2), 2000.
[8] A. Genkin, D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49:291–304, 2007.
[9] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. International Journal of Computer Vision, 80(3):300–316, 2008.
[10] K. Koh, S. J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. J. Mach. Learn. Res., 8:1519–1555, 2007.
[11] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. 
Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27:957–968, 2005.
[12] S. Lee, H. Lee, P. Abbeel, and A. Ng. Efficient ℓ1 regularized logistic regression. In AAAI, 2006.
[13] J. Liao and K. Chin. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics, 23(15):1945–1951, 2007.
[14] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[15] S. Martins, L. Sousa, and J. Martins. Additive logistic regression applied to retina modelling. In ICIP (3), pages 309–312. IEEE, 2007.
[16] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[17] S. Palei and S. Das. Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach. Safety Science, 47:88–96, 2009.
[18] M. Park and T. Hastie. ℓ1-regularized path algorithm for generalized linear models. J. R. Statist. Soc. B, 69:659–677, 2007.
[19] M. Sartor, G. Leikauf, and M. Medvedovic. LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics, 25(2):211–217, 2009.
[20] D. Sun, T. Erp, P. Thompson, C. Bearden, M. Daley, L. Kushan, M. Hardt, K. Nuechterlein, A. Toga, and T. Cannon. Elucidating a magnetic resonance imaging-based neuroanatomic biomarker for psychosis: classification analysis using probabilistic brain atlas and machine learning algorithms. Biological Psychiatry, 66:1055–1060, 2009.
[21] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58:267–288, 1996.
[22] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. Tibshirani. 
Strong rules for discarding predictors in lasso-type problems. J. R. Statist. Soc. B, 74:245–266, 2012.
[23] N. Ueda and K. Saito. Parametric mixture models for multi-labeled text. Advances in Neural Information Processing Systems, 15:721–728, 2002.
[24] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A safe screening rule for sparse logistic regression. arXiv:1307.4145v2, 2013.
[25] T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25:714–721, 2009.
[26] Z. J. Xiang and P. J. Ramadge. Fast lasso screening tests based on correlations. In IEEE ICASSP, 2012.
[27] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, pages 1081–1088. MIT Press, 2001.
[28] J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5:427–443, 2004.