{"title": "Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 2132, "page_last": 2140, "abstract": "Sparse-Group Lasso (SGL) has been shown to be a powerful regression technique for simultaneously discovering group and within-group sparse patterns by using a combination of the l1 and l2 norms. However, in large-scale applications, the complexity of the regularizers entails great computational challenges. In this paper, we propose a novel two-layer feature reduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable for sparse models with one sparsity-inducing regularizer. To our best knowledge, TLFre is the first one that is capable of dealing with multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solvers. Experiments on both synthetic and real data sets show that TLFre improves the efficiency of SGL by orders of magnitude.", "full_text": "Two-Layer Feature Reduction for Sparse-Group\n\nLasso via Decomposition of Convex Sets\n\nJie Wang, Jieping Ye\n\nComputer Science and Engineering\n\nArizona State University, Tempe, AZ 85287\n\n{jie.wang.ustc, jieping.ye}@asu.edu\n\nAbstract\n\nSparse-Group Lasso (SGL) has been shown to be a powerful regression tech-\nnique for simultaneously discovering group and within-group sparse patterns by\nusing a combination of the (cid:96)1 and (cid:96)2 norms. However, in large-scale applications,\nthe complexity of the regularizers entails great computational challenges. 
In this paper, we propose a novel two-layer feature reduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable to sparse models with one sparsity-inducing regularizer. To the best of our knowledge, TLFre is the first method capable of dealing with multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solver. Experiments on both synthetic and real data sets show that TLFre improves the efficiency of SGL by orders of magnitude.

1 Introduction

Sparse-Group Lasso (SGL) [5, 16] is a powerful regression technique for identifying important groups and features simultaneously. To yield sparsity at both the group and individual feature levels, SGL combines the Lasso [18] and group Lasso [28] penalties. In recent years, SGL has found great success in a wide range of applications, including but not limited to machine learning [20, 27], signal processing [17], and bioinformatics [14]. Many research efforts have been devoted to developing efficient solvers for SGL [5, 16, 10, 21]. However, when the feature dimension is extremely high, the complexity of the SGL regularizers imposes great computational challenges. Therefore, there is an increasingly urgent need for nontraditional techniques to address the challenges posed by the massive volume of the data sources.

Recently, El Ghaoui et al. [4] proposed a promising feature reduction method, called SAFE screening, to screen out the so-called inactive features, which have zero coefficients in the solution, from the optimization.
Thus, the size of the data matrix needed for the training phase can be significantly reduced, which may lead to a substantial improvement in the efficiency of solving sparse models. Inspired by SAFE, various exact and heuristic feature screening methods have been proposed for many sparse models such as Lasso [25, 11, 19, 26] and group Lasso [25, 22, 19]. It is worthwhile to mention that the features discarded by exact feature screening methods such as SAFE [4], DOME [26] and EDPP [25] are guaranteed to have zero coefficients in the solution. However, heuristic feature screening methods like Strong Rule [19] may mistakenly discard features that have nonzero coefficients in the solution. More recently, the idea of exact feature screening has been extended to exact sample screening, which screens out the nonsupport vectors in SVM [13, 23] and LAD [23]. As a promising data reduction tool, exact feature/sample screening is of great practical importance because it can effectively reduce the data size without sacrificing optimality [12].

However, all of the existing feature/sample screening methods are only applicable to sparse models with one sparsity-inducing regularizer. In this paper, we propose an exact two-layer feature screening method, called TLFre, for the SGL problem. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to have zero coefficients in the solution. To the best of our knowledge, TLFre is the first screening method capable of dealing with multiple sparsity-inducing regularizers.

We note that most of the existing exact feature screening methods involve an estimation of the dual optimal solution. The difficulty in developing screening methods for sparse models with multiple sparsity-inducing regularizers like SGL is that the dual feasible set is the sum of simple convex sets.
Thus, to determine the feasibility of a given point, we need to know whether it is decomposable with respect to the summands, which is itself a nontrivial problem (see Section 2). One of our major contributions is an elegant decomposition method for any dual feasible solution of SGL via the framework of Fenchel's duality (see Section 3). Based on the Fenchel dual problem of SGL, we motivate TLFre by an in-depth exploration of its geometric properties and the optimality conditions. We derive the set of regularization parameter values corresponding to zero solutions. To develop TLFre, we need to estimate upper bounds involving the dual optimal solution. To this end, we first give an accurate estimation of the dual optimal solution via normal cones. Then, we formulate the estimation of the upper bounds as nonconvex optimization problems. We show that these nonconvex problems admit closed form solutions. Experiments on both synthetic and real data sets demonstrate that the speedup gained by TLFre in solving SGL can be orders of magnitude. All proofs are provided in the long version of this paper [24].

Notation: Let ‖·‖_1, ‖·‖ and ‖·‖_∞ be the ℓ1, ℓ2 and ℓ∞ norms, respectively. Denote by B_1^n, B^n, and B_∞^n the unit ℓ1, ℓ2, and ℓ∞ norm balls in R^n (we omit the superscript if it is clear from the context). For a set C, let int C be its interior. If C is closed and convex, we define the projection operator as P_C(w) := argmin_{u∈C} ‖w − u‖. We denote by I_C(·) the indicator function of C, which is 0 on C and ∞ elsewhere. Let Γ_0(R^n) be the class of proper closed convex functions on R^n. For f ∈ Γ_0(R^n), let ∂f be its subdifferential. The domain of f is the set dom f := {w : f(w) < ∞}. For w ∈ R^n, let [w]_i be its i-th component.
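The projection operators onto the ℓ∞ and ℓ2 balls used throughout this paper admit simple closed forms; a minimal numpy sketch (function names are ours, not from the paper):

```python
import numpy as np

def proj_linf_ball(w):
    # P_{B_inf}(w): clip each coordinate to [-1, 1]
    return np.clip(w, -1.0, 1.0)

def proj_l2_ball(w, r=1.0):
    # P_{rB}(w): rescale w onto the l2 ball of radius r if it lies outside
    nrm = np.linalg.norm(w)
    return w if nrm <= r else (r / nrm) * w
```

Both operators reappear below: the ℓ∞ projection inside the shrinkage operator, and the ℓ2 projection when characterizing the dual optimal solution.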
For γ ∈ R, let sgn(γ) = sign(γ) if γ ≠ 0, and sgn(0) = 0. We define

  SGN(w) = { s ∈ R^n : [s]_i ∈ {sign([w]_i)} if [w]_i ≠ 0; [s]_i ∈ [−1, 1] if [w]_i = 0 }.

We denote γ_+ = max(γ, 0). Then, the shrinkage operator S_γ(w) : R^n → R^n with γ ≥ 0 is

  [S_γ(w)]_i = (|[w]_i| − γ)_+ sgn([w]_i), i = 1, . . . , n.   (1)

2 Basics and Motivation

In this section, we briefly review some basics of SGL. Let y ∈ R^N be the response vector and X ∈ R^{N×p} be the matrix of features. With the group information available, the SGL problem [5] is

  min_{β∈R^p} (1/2)‖y − Σ_{g=1}^G X_g β_g‖² + λ1 Σ_{g=1}^G √n_g ‖β_g‖ + λ2 ‖β‖_1,   (2)

where n_g is the number of features in the g-th group, X_g ∈ R^{N×n_g} denotes the predictors in that group with the corresponding coefficient vector β_g, and λ1, λ2 are positive regularization parameters. Without loss of generality, let λ1 = αλ and λ2 = λ with α > 0. Then, problem (2) becomes:

  min_{β∈R^p} (1/2)‖y − Σ_{g=1}^G X_g β_g‖² + λ (α Σ_{g=1}^G √n_g ‖β_g‖ + ‖β‖_1).   (3)

By the Lagrangian multipliers method [24], the dual problem of SGL is

  sup_θ { (1/2)‖y‖² − (λ²/2)‖y/λ − θ‖² : X_g^T θ ∈ D_g^α := α√n_g B + B_∞, g = 1, . . . , G }.   (4)

It is well-known that the dual feasible set of Lasso is the intersection of closed half spaces (thus a polytope); for group Lasso, the dual feasible set is the intersection of ellipsoids. Surprisingly, the geometric properties of these dual feasible sets play fundamentally important roles in most of the existing screening methods for sparse models with one sparsity-inducing regularizer [23, 11, 25, 4]. When we incorporate multiple sparsity-inducing regularizers into a sparse model, problem (4) indicates that the dual feasible set can be much more complicated. Although (4) provides a geometric description of the dual feasible set of SGL, it is not suitable for further analysis. Notice that even the feasibility of a given point θ is not easy to determine, since it is nontrivial to tell whether X_g^T θ can be decomposed into b_1 + b_2 with b_1 ∈ α√n_g B and b_2 ∈ B_∞. Therefore, to develop screening methods for SGL, it is desirable to gain a deeper understanding of the sum of simple convex sets.

In the next section, we analyze the dual feasible set of SGL in depth via Fenchel's Duality Theorem. We show that for each X_g^T θ ∈ D_g^α, Fenchel's duality naturally leads to an explicit decomposition X_g^T θ = b_1 + b_2, with one term belonging to α√n_g B and the other belonging to B_∞. This lays the foundation of the proposed screening method for SGL.

3 The Fenchel Dual Problem of SGL

In Section 3.1, we derive the Fenchel dual of SGL via Fenchel's Duality Theorem. We then motivate TLFre and sketch our approach in Section 3.2. In Section 3.3, we discuss the geometric
In Section 3.3, we discuss the geometric\nproperties of the Fenchel\u2019s dual of SGL and derive the set of (\u03bb, \u03b1) leading to zero solutions.\n3.1 The Fenchel\u2019s Dual of SGL via Fenchel\u2019s Duality Theorem\nTo derive the Fenchel\u2019s dual problem of SGL, we need the Fenchel\u2019s Duality Theorem as stated in\nTheorem 1. The conjugate of f \u2208 \u03930(Rn) is the function f\u2217 \u2208 \u03930(Rn) de\ufb01ned by\n\n\u221a\n\nf\u2217(z) = supw (cid:104)w, z(cid:105) \u2212 f (w).\n\nTheorem 1. [Fenchel\u2019s Duality Theorem] Let f \u2208 \u03930(RN ), \u2126 \u2208 \u03930(Rp), and T (\u03b2) = y \u2212 X\u03b2\nbe an af\ufb01ne mapping from Rp to RN . Let p\u2217, d\u2217 \u2208 [\u2212\u221e,\u221e] be primal and dual values de\ufb01ned,\nrespectively, by the Fenchel problems:\n\np\u2217 = inf \u03b2\u2208Rp f (y \u2212 X\u03b2) + \u03bb\u2126(\u03b2); d\u2217 = sup\u03b8\u2208RN \u2212f\u2217(\u03bb\u03b8) \u2212 \u03bb\u2126\u2217(XT \u03b8) + \u03bb(cid:104)y, \u03b8(cid:105).\nOne has p\u2217 \u2265 d\u2217. If, furthermore, f and \u2126 satisfy the condition 0 \u2208 int (dom f \u2212 y + Xdom \u2126),\nthen the equality holds, i.e., p\u2217 = d\u2217, and the supreme is attained in the dual problem if \ufb01nite.\nWe omit the proof of Theorem 1 since it is a slight modi\ufb01cation of Theorem 3.3.5 in [2].\n2(cid:107)w(cid:107)2, and \u03bb\u2126(\u03b2) be the second term in (3). Then, SGL can be written as\nLet f (w) = 1\n\nmin\u03b2 f (y \u2212 X\u03b2) + \u03bb\u2126(\u03b2).\n\nTo derive the Fenchel\u2019s dual problem of SGL, Theorem 1 implies that we need to \ufb01nd f\u2217 and \u2126\u2217. It\n2(cid:107)z(cid:107)2. Therefore, we only need to \ufb01nd \u2126\u2217, where the concept in\ufb01mal\nis well-known that f\u2217(z) = 1\nconvolution is needed. Let h, g \u2208 \u03930(Rn). 
The infimal convolution of h and g is defined by

  (h □ g)(ξ) = inf_η h(η) + g(ξ − η),

and it is exact at a point ξ if there exists an η*(ξ) such that (h □ g)(ξ) = h(η*(ξ)) + g(ξ − η*(ξ)). h □ g is exact if it is exact at every point of its domain, in which case it is denoted by h ⊙ g.

Lemma 2. Let Ω_1^α(β) = α Σ_{g=1}^G √n_g ‖β_g‖, Ω_2(β) = ‖β‖_1 and Ω(β) = Ω_1^α(β) + Ω_2(β). Moreover, let C_g^α = α√n_g B ⊂ R^{n_g}, g = 1, . . . , G. Then, the following hold:

  (i): (Ω_1^α)*(ξ) = Σ_{g=1}^G I_{C_g^α}(ξ_g),  (Ω_2)*(ξ) = Σ_{g=1}^G I_{B_∞}(ξ_g),

  (ii): Ω*(ξ) = ((Ω_1^α)* ⊙ (Ω_2)*)(ξ) = Σ_{g=1}^G I_B((ξ_g − P_{B_∞}(ξ_g)) / (α√n_g)),

where ξ_g ∈ R^{n_g} is the sub-vector of ξ corresponding to the g-th group.

Note that P_{B_∞}(ξ_g) admits a closed form solution, i.e., [P_{B_∞}(ξ_g)]_i = sgn([ξ_g]_i) min(|[ξ_g]_i|, 1). Combining Theorem 1 and Lemma 2, the Fenchel dual of SGL can be derived as follows.

Theorem 3. For the SGL problem in (3), the following hold:

(i): The Fenchel dual of SGL is given by:

  inf_θ { ½‖y/λ − θ‖² − ½‖y/λ‖² : ‖X_g^T θ − P_{B_∞}(X_g^T θ)‖ ≤ α√n_g, g = 1, . . . , G }.   (5)

(ii): Let β*(λ, α) and θ*(λ, α) be the optimal solutions of problems (3) and (5), respectively. Then,

  λθ*(λ, α) = y − Xβ*(λ, α),   (6)
  X_g^T θ*(λ, α) ∈ α√n_g ∂‖β_g*(λ, α)‖ + ∂‖β_g*(λ, α)‖_1, g = 1, . . . , G.   (7)

Remark 1. We note that the shrinkage operator can also be expressed by

  S_γ(w) = w − P_{γB_∞}(w), γ ≥ 0.   (8)

Therefore, problem (5) can be written more compactly as

  inf_θ { ½‖y/λ − θ‖² − ½‖y/λ‖² : ‖S_1(X_g^T θ)‖ ≤ α√n_g, g = 1, . . . , G }.   (9)

Remark 2. Eq. (6) and Eq. (7) can be obtained by the Fenchel-Young inequality [2, 24]. They are the so-called KKT conditions [3] and can also be obtained by the Lagrangian multiplier method [24]. Moreover, for the SGL problem, its Lagrangian dual in (4) and Fenchel dual in (5) are indeed equivalent to each other [24].

Remark 3. An appealing advantage of the Fenchel dual in (5) is that we have a natural decomposition of all points ξ_g ∈ D_g^α: ξ_g = P_{B_∞}(ξ_g) + S_1(ξ_g), with P_{B_∞}(ξ_g) ∈ B_∞ and S_1(ξ_g) ∈ C_g^α. As a result, this leads to a convenient way to determine the feasibility of any dual variable θ by checking whether S_1(X_g^T θ) ∈ C_g^α, g = 1, . . . , G.

3.2 Motivation of the Two-Layer Screening Rules

We motivate the two-layer screening rules via the KKT condition in Eq. (7). As implied by the name, there are two layers in our method. The first layer aims to identify the inactive groups, and the second layer is designed to detect the inactive features of the remaining groups. By Eq. (7), noting that ∂‖w‖_1 = SGN(w) and

  ∂‖w‖ = { {w/‖w‖} if w ≠ 0; {u : ‖u‖ ≤ 1} if w = 0 },   (10)

we have the following cases.

Case 1. If β_g*(λ, α) ≠ 0, we have

  [X_g^T θ*(λ, α)]_i ∈ { α√n_g [β_g*(λ, α)]_i/‖β_g*(λ, α)‖ + sign([β_g*(λ, α)]_i) if [β_g*(λ, α)]_i ≠ 0; [−1, 1] if [β_g*(λ, α)]_i = 0 }.   (11)

In view of (11), we can see that

  (a): S_1(X_g^T θ*(λ, α)) = α√n_g β_g*(λ, α)/‖β_g*(λ, α)‖, and hence ‖S_1(X_g^T θ*(λ, α))‖ = α√n_g;
  (b): if |[X_g^T θ*(λ, α)]_i| ≤ 1, then [β_g*(λ, α)]_i = 0.   (12)

Case 2. If β_g*(λ, α) = 0, we have

  [X_g^T θ*(λ, α)]_i ∈ α√n_g [u_g]_i + [−1, 1], ‖u_g‖ ≤ 1.   (13)

The first layer (group-level) of TLFre. From (12) in Case 1, we have

  ‖S_1(X_g^T θ*(λ, α))‖ < α√n_g ⇒ β_g*(λ, α) = 0.   (R1)

Clearly, (R1) can be used to identify the inactive groups and is thus a group-level screening rule.

The second layer (feature-level) of TLFre. Let x_gi be the i-th column of X_g. We have [X_g^T θ*(λ, α)]_i = x_gi^T θ*(λ, α). In view of (12) and (13), we can see that

  |x_gi^T θ*(λ, α)| ≤ 1 ⇒ [β_g*(λ, α)]_i = 0.   (R2)

Different from (R1), (R2) detects the inactive features and is thus a feature-level screening rule.

However, we cannot directly apply (R1) and (R2) to identify the inactive groups/features because both need to know θ*(λ, α). Inspired by the SAFE rules [4], we can first estimate a region Θ containing θ*(λ, α). Let X_g^T Θ = {X_g^T θ : θ ∈ Θ}. Then, (R1) and (R2) can be relaxed as follows:

  sup_{ξ_g} { ‖S_1(ξ_g)‖ : ξ_g ∈ Ξ_g ⊇ X_g^T Θ } < α√n_g ⇒ β_g*(λ, α) = 0,   (R1*)
  sup_θ { |x_gi^T θ| : θ ∈ Θ } ≤ 1 ⇒ [β_g*(λ, α)]_i = 0.   (R2*)

Inspired by (R1*) and (R2*), we develop TLFre via the following three steps:

Step 1. Given λ and α, we estimate a region Θ that contains θ*(λ, α).
Step 2. We solve for the supremum values in (R1*) and (R2*).
Step 3. By plugging in the supremum values from Step 2, (R1*) and (R2*) yield the desired two-layer screening rules for SGL.

3.3 The Set of Parameter Values Leading to Zero Solution

For notational convenience, let F_g^α = {θ : ‖S_1(X_g^T θ)‖ ≤ α√n_g}, g = 1, . . . , G; thus the feasible set of the Fenchel dual of SGL is F^α = ∩_{g=1,...,G} F_g^α.
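Remark 3 turns dual feasibility into a componentwise computation: split ξ_g = P_{B_∞}(ξ_g) + S_1(ξ_g) and test ‖S_1(X_g^T θ)‖ ≤ α√n_g group by group. A numpy sketch (function names are ours, not from the paper):

```python
import numpy as np

def soft_threshold(xi, gamma=1.0):
    # S_gamma(xi) = xi - P_{gamma*B_inf}(xi): componentwise shrinkage toward 0
    return np.sign(xi) * np.maximum(np.abs(xi) - gamma, 0.0)

def is_dual_feasible(theta, X_groups, alpha):
    # theta lies in F^alpha iff ||S_1(X_g^T theta)|| <= alpha*sqrt(n_g) for all g
    for Xg in X_groups:
        xi = Xg.T @ theta
        # decomposition of Remark 3: xi = P_{B_inf}(xi) + S_1(xi)
        assert np.allclose(np.clip(xi, -1.0, 1.0) + soft_threshold(xi), xi)
        if np.linalg.norm(soft_threshold(xi)) > alpha * np.sqrt(Xg.shape[1]):
            return False
    return True
```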
In view of problem (5) [or (9)], we can see that θ*(λ, α) is the projection of y/λ onto F^α, i.e., θ*(λ, α) = P_{F^α}(y/λ). Thus, if y/λ ∈ F^α, we have θ*(λ, α) = y/λ. Moreover, by (R1), we can see that β*(λ, α) = 0 if y/λ is an interior point of F^α. Indeed, we have the following stronger result.

Theorem 4. For the SGL problem, let λ_max^α = max_g {ρ_g : ‖S_1(X_g^T y/ρ_g)‖ = α√n_g}. Then,

  y/λ ∈ F^α ⇔ θ*(λ, α) = y/λ ⇔ β*(λ, α) = 0 ⇔ λ ≥ λ_max^α.

ρ_g in the definition of λ_max^α admits a closed form solution [24]. Theorem 4 implies that the optimal solution β*(λ, α) is 0 as long as y/λ ∈ F^α. This geometric property also leads to an explicit characterization of the set of (λ1, λ2) such that the corresponding solution of problem (2) is 0. We denote by β̄*(λ1, λ2) the optimal solution of problem (2).

Corollary 5. For the SGL problem in (2), let λ_1^max(λ2) := max_g (1/√n_g) ‖S_{λ2}(X_g^T y)‖. Then,

  (i): β̄*(λ1, λ2) = 0 ⇔ λ1 ≥ λ_1^max(λ2).
  (ii): If λ1 ≥ λ_1^max := max_g (1/√n_g) ‖X_g^T y‖ or λ2 ≥ λ_2^max := ‖X^T y‖_∞, then β̄*(λ1, λ2) = 0.

4 The Two-Layer Screening Rules for SGL

We follow the three steps in Section 3.2 to develop TLFre. In Section 4.1, we give an accurate estimation of θ*(λ, α) via normal cones [15].
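Corollary 5's threshold λ_1^max(λ2) is cheap to evaluate; a sketch (helper names are ours, not from the paper), assuming the feature groups are stored as a list of matrices:

```python
import numpy as np

def lambda1_max(X_groups, y, lam2):
    # Corollary 5: lambda1_max(lam2) = max_g ||S_{lam2}(X_g^T y)|| / sqrt(n_g);
    # for lam1 >= this value the SGL solution of problem (2) is identically 0.
    vals = []
    for Xg in X_groups:
        z = Xg.T @ y
        s = np.sign(z) * np.maximum(np.abs(z) - lam2, 0.0)  # S_{lam2}(X_g^T y)
        vals.append(np.linalg.norm(s) / np.sqrt(Xg.shape[1]))
    return max(vals)
```

With lam2 = 0 this reduces to λ_1^max = max_g ‖X_g^T y‖/√n_g, matching part (ii) of the corollary.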
Then, we compute the supremum values in (R1*) and (R2*) by solving nonconvex problems in Section 4.2. We present the TLFre rules in Section 4.3.

4.1 Estimation of the Dual Optimal Solution

Because of the geometric property of the dual problem in (5), i.e., θ*(λ, α) = P_{F^α}(y/λ), we have a very useful characterization of the dual optimal solution via the so-called normal cones [15].

Definition 1. [15] For a closed convex set C ⊂ R^n and a point w ∈ C, the normal cone to C at w is

  N_C(w) = {v : ⟨v, w′ − w⟩ ≤ 0, ∀ w′ ∈ C}.   (14)

By Theorem 4, θ*(λ̄, α) is known if λ̄ = λ_max^α. Thus, we can estimate θ*(λ, α) in terms of θ*(λ̄, α). For the same reason, we only consider the cases with λ < λ_max^α for θ*(λ, α) to be estimated.

Remark 4. In many applications, the parameter values that perform the best are usually unknown. To determine appropriate parameter values, commonly used approaches such as cross validation and stability selection involve solving SGL many times over a grid of parameter values. Thus, given {α^(i)}_{i=1}^I and λ^(1) ≥ · · · ≥ λ^(J), we can fix the value of α each time and solve SGL by varying the value of λ. We repeat the process until we solve SGL for all of the parameter values.

Theorem 6. For the SGL problem in (3), suppose that θ*(λ̄, α) is known with λ̄ ≤ λ_max^α. Let ρ_g, g = 1, . . . , G, be defined by Theorem 4. For any λ ∈ (0, λ̄), we define

  n_α(λ̄) = { y/λ̄ − θ*(λ̄, α), if λ̄ < λ_max^α;  X_* S_1(X_*^T y/λ_max^α), if λ̄ = λ_max^α },

where X_* = argmax_{X_g} ρ_g, and

  v_α(λ, λ̄) = y/λ − θ*(λ̄, α),
  v_α^⊥(λ, λ̄) = v_α(λ, λ̄) − (⟨v_α(λ, λ̄), n_α(λ̄)⟩ / ‖n_α(λ̄)‖²) n_α(λ̄).

Then, the following hold:

  (i): n_α(λ̄) ∈ N_{F^α}(θ*(λ̄, α)),
  (ii): ‖θ*(λ, α) − (θ*(λ̄, α) + ½ v_α^⊥(λ, λ̄))‖ ≤ ½ ‖v_α^⊥(λ, λ̄)‖.

For notational convenience, let o_α(λ, λ̄) = θ*(λ̄, α) + ½ v_α^⊥(λ, λ̄). Theorem 6 shows that θ*(λ, α) lies inside the ball of radius ½‖v_α^⊥(λ, λ̄)‖ centered at o_α(λ, λ̄).

4.2 Solving for the Supremum Values via Nonconvex Optimization

We solve the optimization problems in (R1*) and (R2*). To simplify notation, let

  Θ = {θ : ‖θ − o_α(λ, λ̄)‖ ≤ ½‖v_α^⊥(λ, λ̄)‖},   (15)
  Ξ_g = {ξ_g : ‖ξ_g − X_g^T o_α(λ, λ̄)‖ ≤ ½‖v_α^⊥(λ, λ̄)‖ ‖X_g‖_2}, g = 1, . . . , G.   (16)

Theorem 6 indicates that θ*(λ, α) ∈ Θ. Moreover, we can see that X_g^T Θ ⊆ Ξ_g, g = 1, . . . , G. To develop the TLFre rule by (R1*) and (R2*), we need to solve the following optimization problems:

  s_g*(λ, λ̄; α) = sup_{ξ_g} {‖S_1(ξ_g)‖ : ξ_g ∈ Ξ_g}, g = 1, . . . , G,   (17)
  t_gi*(λ, λ̄; α) = sup_θ {|x_gi^T θ| : θ ∈ Θ}, i = 1, . . . , n_g, g = 1, . . . , G.   (18)

Solving problem (17). We consider the following equivalent problem of (17):

  ½ (s_g*(λ, λ̄; α))² = sup_{ξ_g} { ½‖S_1(ξ_g)‖² : ξ_g ∈ Ξ_g }.   (19)

We can see that the objective function of problem (19) is continuously differentiable and the feasible set is a ball. Thus, (19) is a nonconvex problem because we need to maximize a convex function over a convex set. We derive the closed form solutions of problems (17) and (19) as follows.

Theorem 7. For problems (17) and (19), let c = X_g^T o_α(λ, λ̄), r = ½‖v_α^⊥(λ, λ̄)‖ ‖X_g‖_2, and let Ξ_g* be the set of optimal solutions.

(i): Suppose that c ∉ B_∞, i.e., ‖c‖_∞ > 1. Let u = r S_1(c)/‖S_1(c)‖. Then,

  s_g*(λ, λ̄; α) = ‖S_1(c)‖ + r and Ξ_g* = {c + u}.   (20)

(ii): Suppose that c is a boundary point of B_∞, i.e., ‖c‖_∞ = 1. Then,

  s_g*(λ, λ̄; α) = r and Ξ_g* = {c + u : u ∈ N_{B_∞}(c), ‖u‖ = r}.   (21)

(iii): Suppose that c ∈ int B_∞, i.e., ‖c‖_∞ < 1. Let i* ∈ I* = {i : |[c]_i| = ‖c‖_∞}.
Then,

  s_g*(λ, λ̄; α) = (‖c‖_∞ + r − 1)_+,   (22)

  Ξ_g* = { Ξ_g, if Ξ_g ⊂ B_∞;  {c + r·sgn([c]_{i*}) e_{i*} : i* ∈ I*}, if Ξ_g ⊄ B_∞ and c ≠ 0;  {r·e_{i*}, −r·e_{i*} : i* ∈ I*}, if Ξ_g ⊄ B_∞ and c = 0 },

where e_i is the i-th standard basis vector.

Solving problem (18). Problem (18) can be solved directly via the Cauchy-Schwarz inequality.

Theorem 8. For problem (18), we have t_gi*(λ, λ̄; α) = |x_gi^T o_α(λ, λ̄)| + ½‖v_α^⊥(λ, λ̄)‖ ‖x_gi‖.

4.3 The Proposed Two-Layer Screening Rules

To develop the two-layer screening rules for SGL, we only need to plug the supremum values s_g*(λ, λ̄; α) and t_gi*(λ, λ̄; α) into (R1*) and (R2*). We present the TLFre rule as follows.

Theorem 9. For the SGL problem in (3), suppose that we are given α and a sequence of parameter values λ_max^α = λ^(0) > λ^(1) > · · · > λ^(J). For each integer 0 ≤ j < J, we assume that β*(λ^(j), α) is known. Let θ*(λ^(j), α), v_α^⊥(λ^(j+1), λ^(j)) and s_g*(λ^(j+1), λ^(j); α) be given by Eq. (6), Theorems 6 and 7, respectively. Then, for g = 1, . . . , G, the following holds:

  s_g*(λ^(j+1), λ^(j); α) < α√n_g ⇒ β_g*(λ^(j+1), α) = 0.   (L1)

For the ĝ-th group that does not pass the rule in (L1), we have [β_ĝ*(λ^(j+1), α)]_i = 0 if

  |x_ĝi^T ( (y − Xβ*(λ^(j), α))/λ^(j) + ½ v_α^⊥(λ^(j+1), λ^(j)) )| + ½‖v_α^⊥(λ^(j+1), λ^(j))‖ ‖x_ĝi‖ ≤ 1.   (L2)

(L1) and (L2) are the first-layer and second-layer screening rules of TLFre, respectively.

5 Experiments

We evaluate TLFre on both synthetic and real data sets. To measure the performance of TLFre, we compute the rejection ratios of (L1) and (L2), respectively. Let m be the number of features with zero coefficients in the solution, let 𝒢 be the index set of the groups discarded by (L1), and let 𝒫 be the set of inactive features detected by (L2). The rejection ratios of (L1) and (L2) are defined by r1 = (Σ_{g∈𝒢} n_g)/m and r2 = |𝒫|/m, respectively. We also report the speedup gained by TLFre: the ratio of the running time of the solver without screening to the running time of the solver with TLFre. The code of TLFre, integrated with the solver from SLEP [9], is available at dpc-screening.github.io.

To determine appropriate values of α and λ by cross validation or stability selection, we can run TLFre with as many parameter values as we need. Given a data set, for illustrative purposes only, we select seven values of α from {tan(ψ) : ψ = 5°, 15°, 30°, 45°, 60°, 75°, 85°}. Then, for each value of α, we run TLFre along a sequence of 100 values of λ equally spaced on the logarithmic scale of λ/λ_max^α from 1 to 0.01.
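The rejection ratios r1 and r2 defined above can be computed directly from the screening output; a small sketch (variable names are ours, not from the paper):

```python
def rejection_ratios(group_sizes, groups_discarded_by_L1, features_discarded_by_L2, m):
    # r1: fraction of the m inactive features removed wholesale by the group rule (L1)
    # r2: fraction removed feature-by-feature by the second-layer rule (L2)
    r1 = sum(group_sizes[g] for g in groups_discarded_by_L1) / m
    r2 = len(features_discarded_by_L2) / m
    return r1, r2

# e.g. 10 inactive features in total; (L1) drops groups 0 and 2 (3 + 2 features),
# and (L2) additionally drops 4 individual features
r1, r2 = rejection_ratios({0: 3, 1: 4, 2: 2}, [0, 2], [7, 8, 9, 10], 10)
```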
Thus, 700 pairs of parameter values of (\u03bb, \u03b1) are sampled in total.\n\n(cid:17)(cid:12)(cid:12)(cid:12) + 1\n\n\u03b1 (\u03bb(j+1), \u03bb(j))\n\n+ 1\n\n2 v\u22a5\n\nand r2 =\n\ng\u2208G ng\nm\n\n(L2)\n\n(cid:80)\n\n\u03bb(j)\n\n6\n\n\f(a)\n\n(b) \u03b1 = tan(5\u25e6)\n\n(c) \u03b1 = tan(15\u25e6)\n\n(d) \u03b1 = tan(30\u25e6)\n\n(e) \u03b1 = tan(45\u25e6)\n\n(f) \u03b1 = tan(60\u25e6)\n\n(g) \u03b1 = tan(75\u25e6)\n\n(h) \u03b1 = tan(85\u25e6)\n\nFigure 1: Rejection ratios of TLFre on the Synthetic 1 data set.\n\n(a)\n\n(b) \u03b1 = tan(5\u25e6)\n\n(c) \u03b1 = tan(15\u25e6)\n\n(d) \u03b1 = tan(30\u25e6)\n\n(e) \u03b1 = tan(45\u25e6)\n\n(f) \u03b1 = tan(60\u25e6)\n\n(g) \u03b1 = tan(75\u25e6)\n\n(h) \u03b1 = tan(85\u25e6)\n\nFigure 2: Rejection ratios of TLFre on the Synthetic 2 data set.\n\n5.1 Simulation Studies\nWe perform experiments on two synthetic data sets that are commonly used in the literature [19, 29].\nThe true model is y = X\u03b2\u2217 + 0.01\u0001, \u0001 \u223c N (0, 1). We generate two data sets with 250 \u00d7 10000\nentries: Synthetic 1 and Synthetic 2. We randomly break the 10000 features into 1000 groups. For\nSynthetic 1, the entries of the data matrix X are i.i.d. standard Gaussian with pairwise correlation\nzero, i.e., corr(xi, xi) = 0. For Synthetic 2, the entries of the data matrix X are drawn from i.i.d.\nstandard Gaussian with pairwise correlation 0.5|i\u2212j|, i.e., corr(xi, xj) = 0.5|i\u2212j|. To construct \u03b2\u2217,\nwe \ufb01rst randomly select \u03b31 percent of groups. Then, for each selected group, we randomly select \u03b32\npercent of features. The selected components of \u03b2\u2217 are populated from a standard Gaussian and the\nremaining ones are set to 0. We set \u03b31 = \u03b32 = 10 for Synthetic 1 and \u03b31 = \u03b32 = 20 for Synthetic 2.\n(\u03bb2) (see Corollary\nThe \ufb01gures in the upper left corner of Fig. 1 and Fig. 
2 show the plots of \u03bbmax\n5) and the sampled parameter values of \u03bb and \u03b1 (recall that \u03bb1 = \u03b1\u03bb and \u03bb2 = \u03bb). For the other\n\ufb01gures, the blue and red regions represent the rejection ratios of (L1) and (L2), respectively. We\ncan see that TLFre is very effective in discarding inactive groups/features; that is, more than 90%\nof inactive features can be detected. Moreover, we can observe that the \ufb01rst layer screening (L1)\nbecomes more effective with a larger \u03b1. Intuitively, this is because the group Lasso penalty plays a\nmore important role in enforcing the sparsity with a larger value of \u03b1 (recall that \u03bb1 = \u03b1\u03bb). The top\nand middle parts of Table 1 indicate that the speedup gained by TLFre is very signi\ufb01cant (up to 30\ntimes) and TLFre is very ef\ufb01cient. Compared to the running time of the solver without screening,\nthe running time of TLFre is negligible. The running time of TLFre includes that of computing\n(cid:107)Xg(cid:107)2, g = 1, . . . , G, which can be ef\ufb01ciently computed by the power method [6]. Indeed, this can\nbe shared for TLFre with different parameter values.\n5.2 Experiments on Real Data Set\nWe perform experiments on the Alzheimer\u2019s Disease Neuroimaging Initiative (ADNI) data set\n(http://adni.loni.usc.edu/). 
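As a reading aid, the experimental protocol described above (the (α, λ) grid and the rejection ratios r1 and r2, which are used for both the synthetic and the real data experiments) can be sketched in a few lines of Python. The function and variable names here are ours, not those of the released TLFre code:

```python
import numpy as np

# Seven values of alpha = tan(psi), as in the experiments.
alphas = [np.tan(np.deg2rad(psi)) for psi in (5, 15, 30, 45, 60, 75, 85)]

# For each alpha: 100 values of lambda/lambda_max, equally spaced on the
# logarithmic scale from 1 to 0.01 (so 700 (lambda, alpha) pairs in total).
lam_ratios = np.logspace(0, -2, 100)

def rejection_ratios(group_sizes, discarded_groups, n_l2_features, m):
    """r1: fraction of the m zero-coefficient features removed at the group
    level by (L1); r2: fraction removed feature-by-feature by (L2)."""
    r1 = sum(group_sizes[g] for g in discarded_groups) / m
    r2 = n_l2_features / m
    return r1, r2
```

When screening is nearly exact for a given (λ, α), r1 + r2 approaches 1, i.e., almost every zero coefficient is identified by one of the two layers.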
The data matrix consists of 747 samples with 426,040 single nucleotide polymorphisms (SNPs), which are divided into 94,765 groups. The response vector is the grey matter volume (GMV).

Table 1: Running time (in seconds) for solving SGL along a sequence of 100 tuning parameter values of λ equally spaced on the logarithmic scale of λ/λ_max^α from 1.0 to 0.01 by (a) the solver [9] without screening and (b) the solver combined with TLFre. The top and middle parts report the results of TLFre on Synthetic 1 and Synthetic 2. The bottom part reports the results of TLFre on the ADNI data set with the GMV data as response.

                           tan(5°)   tan(15°)  tan(30°)  tan(45°)  tan(60°)  tan(75°)  tan(85°)
Synthetic 1  solver         291.24    298.36    301.74    307.71    308.69    311.33    307.53
             TLFre            0.77      0.77      0.78      0.79      0.79      0.81      0.79
             TLFre+solver    12.93     10.26     12.47     17.69     15.73     19.71     21.95
             speedup         22.53     29.09     24.19     17.40     19.63     15.79     14.01
Synthetic 2  solver         292.24    294.64    294.92    297.50    297.29    297.59    295.51
             TLFre            0.82      0.79      0.80      0.81      0.80      0.81      0.81
             TLFre+solver    12.82     11.05     12.89     18.90     16.08     20.45     21.58
             speedup         22.80     26.66     22.88     15.74     18.49     14.55     13.69
ADNI+GMV     solver       30652.56  30755.63  30838.29  31096.10  30850.78  30728.27  30572.35
             TLFre           64.08     64.56     64.96     65.00     64.89     65.17     65.05
             TLFre+solver   372.04    383.17    386.80    402.72    391.63    385.98    382.62
             speedup         82.39     80.27     79.73     77.22     78.78     79.61     79.90

Figure 3: Rejection ratios of TLFre on the ADNI data set with grey matter volume as response. Panel (a) plots λ_1^max(λ2) and the sampled parameter values; panels (b)-(h) correspond to α = tan(5°), tan(15°), tan(30°), tan(45°), tan(60°), tan(75°), and tan(85°).

The figure in the upper left corner of Fig. 3 shows the plot of λ_1^max(λ2) (see Corollary 5) and the sampled parameter values of α and λ. The other figures present the rejection ratios of (L1) and (L2) by blue and red regions, respectively. We can see that almost all of the inactive groups and features are discarded by TLFre; the total rejection ratio r1 + r2 is very close to 1 in all cases. The bottom part of Table 1 shows that TLFre leads to a very significant speedup (about 80 times).
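For concreteness, the speedup reported in Table 1 is simply the ratio of the two running times. A minimal sketch (the function name is ours), using the ADNI+GMV column for α = tan(5°):

```python
def speedup(t_solver, t_tlfre_plus_solver):
    """Ratio of solver time without screening to total time with TLFre."""
    return t_solver / t_tlfre_plus_solver

# ADNI+GMV, alpha = tan(5 deg): 30652.56 s without screening,
# 372.04 s for TLFre combined with the solver (Table 1).
print(round(speedup(30652.56, 372.04), 2))  # -> 82.39
```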
In other words, the solver without screening needs about eight and a half hours to solve the 100 SGL problems for each value of α; combined with TLFre, the solver needs only six to seven minutes. Moreover, the computational cost of TLFre itself is negligible compared to that of the solver without screening. This demonstrates the efficiency of TLFre.

6 Conclusion

In this paper, we propose a novel feature reduction method for SGL via a decomposition of convex sets. We also derive the set of parameter values that lead to zero solutions of SGL. To the best of our knowledge, TLFre is the first method that is applicable to sparse models with multiple sparsity-inducing regularizers. More importantly, the proposed approach provides a novel framework for developing screening methods for complex sparse models with multiple sparsity-inducing regularizers, e.g., the ℓ1 SVM, which performs both sample and feature selection, and the fused Lasso and tree Lasso, which involve more than two regularizers. Experiments on both synthetic and real data sets demonstrate the effectiveness and efficiency of TLFre.
We plan to generalize the idea of TLFre to the ℓ1 SVM, fused Lasso and tree Lasso, which are expected to involve multiple layers of screening.

References

[1] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
[2] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization, Second Edition. Canadian Mathematical Society, 2006.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 8:667–698, 2012.
[5] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv:1001.0736.
[6] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
[7] J.-B. Hiriart-Urruty. From convex optimization to nonconvex optimization: necessary and sufficient conditions for global optimality. In Nonsmooth Optimization and Related Topics. Springer, 1988.
[8] J.-B. Hiriart-Urruty.
A note on the Legendre-Fenchel transform of convex composite functions. In Nonsmooth Mechanics and Analysis. Springer, 2006.
[9] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[10] J. Liu and J. Ye. Moreau-Yosida regularization for grouped tree structure learning. In Advances in Neural Information Processing Systems, 2010.
[11] J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe screening with variational inequalities and its application to lasso. In International Conference on Machine Learning, 2014.
[12] K. Ogawa, Y. Suzuki, S. Suzumura, and I. Takeuchi. Safe sample screening for support vector machine. arXiv:1401.6740, 2014.
[13] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In ICML, 2013.
[14] J. Peng, J. Zhu, A. Bergamaschi, W. Han, D. Noh, J. Pollack, and P. Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4:53–77, 2010.
[15] A. Ruszczyński. Nonlinear Optimization. Princeton University Press, 2006.
[16] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A Sparse-Group Lasso. Journal of Computational and Graphical Statistics, 22:231–245, 2013.
[17] P. Sprechmann, I. Ramírez, G. Sapiro, and Y. Eldar. C-HiLasso: a collaborative hierarchical sparse modeling framework. IEEE Transactions on Signal Processing, 59:4183–4198, 2011.
[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.
[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society Series B, 74:245–266, 2012.
[20] M. Vidyasagar.
Machine learning methods in the computational biology of cancer. Proceedings of the Royal Society A, 2014.
[21] M. Vincent and N. Hansen. Sparse group lasso and high dimensional multinomial classification. Computational Statistics and Data Analysis, 71:771–786, 2014.
[22] J. Wang, J. Jun, and J. Ye. Efficient mixed-norm regularization: Algorithms and safe screening methods. arXiv:1307.4156v1.
[23] J. Wang, P. Wonka, and J. Ye. Scaling SVM and least absolute deviations via exact data reduction. In International Conference on Machine Learning, 2014.
[24] J. Wang and J. Ye. Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. arXiv:1410.4210v1, 2014.
[25] J. Wang, J. Zhou, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, 2013.
[26] Z. J. Xiang and P. J. Ramadge. Fast lasso screening tests based on correlations. In IEEE ICASSP, 2012.
[27] D. Yogatama and N. Smith. Linguistic structured sparsity in text categorization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2014.
[28] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49–67, 2006.
[29] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67:301–320, 2005.