{"title": "GAP Safe Screening Rules for Sparse-Group Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 396, "abstract": "For statistical learning in high dimension, sparse regularizations have proven useful to boost both computational and statistical efficiency. In some contexts, it is natural to handle more refined structures than pure sparsity, such as for instance group sparsity. Sparse-Group Lasso has recently been introduced in the context of linear regression to enforce sparsity both at the feature and at the group level. We propose the first (provably) safe screening rules for Sparse-Group Lasso, i.e., rules that allow to discard early in the solver features/groups that are inactive at optimal solution. Thanks to efficient dual gap computations relying on the geometric properties of $\\epsilon$-norm, safe screening rules for Sparse-Group Lasso lead to significant gains in term of computing time for our coordinate descent implementation.", "full_text": "GAP Safe Screening Rules for Sparse-Group Lasso\n\nEugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon\n\nLTCI, CNRS, T\u00e9l\u00e9com ParisTech\n\nUniversit\u00e9 Paris-Saclay\n\n75013 Paris, France\n\nfirst.last@telecom-paristech.fr\n\nAbstract\n\nFor statistical learning in high dimension, sparse regularizations have proven useful\nto boost both computational and statistical ef\ufb01ciency. In some contexts, it is natural\nto handle more re\ufb01ned structures than pure sparsity, such as for instance group\nsparsity. Sparse-Group Lasso has recently been introduced in the context of linear\nregression to enforce sparsity both at the feature and at the group level. 
We propose the first (provably) safe screening rules for the Sparse-Group Lasso, i.e., rules that allow one to discard, early in the solver, features/groups that are inactive at the optimal solution. Thanks to efficient dual gap computations relying on the geometric properties of the $\epsilon$-norm, safe screening rules for the Sparse-Group Lasso lead to significant gains in terms of computing time for our coordinate descent implementation.

1 Introduction

Sparsity is a critical property for the success of regression methods, especially in high dimension. Often, group (or block) sparsity is helpful when a known group structure needs to be enforced. This is for instance the case in multi-task learning [1] or multinomial logistic regression [5, Chapter 3]. In the multi-task setting, the group structure appears natural, since one aims at jointly recovering signals whose supports are shared. In this context, sparsity and group sparsity are generally obtained by adding a regularization term to the data-fitting term: the $\ell_1$ norm for sparsity and the $\ell_{1,2}$ norm for group sparsity. Among recent works on hierarchical regularization, [12, 17] have focused on a specific case: the Sparse-Group Lasso. This method is the solution of a (convex) optimization program with a regularization term that is a convex combination of the two aforementioned norms, enforcing sparsity and group sparsity at the same time.
With such advanced regularizations, the computational burden can be particularly heavy in high dimension. Yet, it can be significantly reduced if one can exploit the known sparsity of the solution in the optimization. Following the seminal paper on "safe screening rules" [9], many contributions have investigated such strategies [21, 20, 3]. These so-called safe screening rules compute tests on dual feasible points to eliminate primal variables whose coefficients are guaranteed to be zero in the exact solution. 
Still, the computation of a dual feasible point can be challenging when the regularization is more complex than the $\ell_1$ or $\ell_{1,2}$ norms. This is the case for the Sparse-Group Lasso, as it is not straightforward to characterize whether a dual point is feasible or not [20]. Here, we propose an efficient computation of the associated dual norm. This is all the more crucial since the naive implementation computes the Sparse-Group Lasso dual norm with a quadratic complexity w.r.t. the group dimensions.
We propose here efficient safe screening rules for the Sparse-Group Lasso that combine sequential rules (i.e., rules that perform screening thanks to solutions obtained for a previously processed tuning parameter) and dynamic rules (i.e., rules that perform screening as the algorithm proceeds) in a unified way. We elaborate on GAP safe rules, a strategy relying on dual gap computations introduced for the Lasso [10] and extended to more general learning tasks in [15]. Note that alternative (unsafe) screening rules, for instance the "strong rules" [19], have been applied to the Lasso and its simple variants.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our contributions are twofold. First, we introduce the first safe screening rules for this problem; other alleged safe rules [20] for the Sparse-Group Lasso were in fact not safe, as explained in detail in [15], and could lead to a non-convergent implementation. Second, we link the Sparse-Group Lasso penalties to the $\epsilon$-norm of [6]. This allows us to provide a new algorithm to efficiently compute the required dual norms, adapting an algorithm introduced in [7]. We incorporate our proposed GAP safe rules in a block coordinate descent algorithm and show its practical efficiency on climate prediction tasks. Another strategy leveraging dual gap computations and active sets has recently been proposed under the name Blitz [13]. 
It could naturally benefit from our fast dual norm evaluations in this context.

Notation. For any integer $d \in \mathbb{N}$, we denote by $[d]$ the set $\{1, \dots, d\}$. The standard Euclidean norm is written $\|\cdot\|$, the $\ell_1$ norm $\|\cdot\|_1$, the $\ell_\infty$ norm $\|\cdot\|_\infty$, and the transpose of a matrix $Q$ is denoted by $Q^\top$. We also denote $(t)_+ = \max(0, t)$. Our observation vector is $y \in \mathbb{R}^n$ and the design matrix $X = [X_1, \dots, X_p] \in \mathbb{R}^{n \times p}$ has $p$ features, stored column-wise. We consider problems where the vector of parameters $\beta = (\beta_1, \dots, \beta_p)^\top$ admits a natural group structure. A group of features is a subset $g \subset [p]$ and $n_g$ is its cardinality. The set of groups is denoted by $\mathcal{G}$ and we focus only on non-overlapping groups that form a partition of $[p]$. We denote by $\beta_g$ the vector in $\mathbb{R}^{n_g}$ which is the restriction of $\beta$ to the indexes in $g$. We write $[\beta_g]_j$ for the $j$-th coordinate of $\beta_g$. We also use the notation $X_g \in \mathbb{R}^{n \times n_g}$ for the sub-matrix of $X$ assembled from the columns with indexes $j \in g$; similarly, $[X_g]_j$ is the $j$-th column of $X_g$.
For any norm $\Omega$, $\mathcal{B}_\Omega$ refers to the corresponding unit ball, and $\mathcal{B}$ (resp. $\mathcal{B}_\infty$) stands for the Euclidean (resp. $\ell_\infty$) unit ball. The soft-thresholding operator (at level $\tau \geq 0$), $\mathcal{S}_\tau$, is defined for any $x \in \mathbb{R}^d$ by $[\mathcal{S}_\tau(x)]_j = \mathrm{sign}(x_j)(|x_j| - \tau)_+$, while the group soft-thresholding (at level $\tau$) is $\mathcal{S}^{\mathrm{gp}}_\tau(x) = (1 - \tau/\|x\|)_+ x$. Denoting $\Pi_C$ the projection onto a closed convex set $C$, this yields $\mathcal{S}_\tau = \mathrm{Id} - \Pi_{\tau\mathcal{B}_\infty}$. The sub-differential of a convex function $f : \mathbb{R}^d \to \mathbb{R}$ at $x$ is defined by $\partial f(x) = \{z \in \mathbb{R}^d : \forall y \in \mathbb{R}^d, f(y) - f(x) \geq z^\top(y - x)\}$. 
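As a concrete reference, the two thresholding operators above (and the Sparse-Group Lasso norm $\Omega_{\tau,w}$ used throughout) can be sketched in a few lines of NumPy; the list-of-index-lists encoding of the group partition and the function names are our own choices, not fixed by the paper:

```python
import numpy as np

def soft_threshold(x, tau):
    """S_tau(x): entrywise sign(x_j) * (|x_j| - tau)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def group_soft_threshold(x, tau):
    """S^gp_tau(x) = (1 - tau/||x||)_+ * x (returns 0 when ||x|| <= tau)."""
    nrm = np.linalg.norm(x)
    if nrm == 0.0:
        return np.zeros_like(x)
    return max(1.0 - tau / nrm, 0.0) * x

def sgl_norm(beta, groups, tau, w):
    """Omega_{tau,w}(beta) = tau*||beta||_1 + (1-tau)*sum_g w_g*||beta_g||.

    `groups` is the partition of the coordinates, given here as a list of
    index lists (an illustrative encoding).
    """
    group_part = sum(wg * np.linalg.norm(beta[idx]) for wg, idx in zip(w, groups))
    return tau * np.abs(beta).sum() + (1.0 - tau) * group_part
```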
We recall that the sub-differential $\partial\|\cdot\|_1$ of the $\ell_1$ norm is $\mathrm{sign}(\cdot)$, defined element-wise for all $j \in [d]$ by $\mathrm{sign}(x)_j = \{\mathrm{sign}(x_j)\}$ if $x_j \neq 0$, and $[-1, 1]$ if $x_j = 0$. Note that the sub-differential $\partial\|\cdot\|$ of the Euclidean norm is $\partial\|\cdot\|(x) = \{x/\|x\|\}$ if $x \neq 0$, and $\mathcal{B}$ if $x = 0$.
For any norm $\Omega$ on $\mathbb{R}^d$, $\Omega^D$ is the dual norm of $\Omega$, defined for any $x \in \mathbb{R}^d$ by $\Omega^D(x) = \max_{v \in \mathcal{B}_\Omega} v^\top x$, e.g., $\|\cdot\|_1^D = \|\cdot\|_\infty$ and $\|\cdot\|^D = \|\cdot\|$. We only focus on the Sparse-Group Lasso norm, so we assume that $\Omega = \Omega_{\tau,w}$, where $\Omega_{\tau,w}(\beta) := \tau\|\beta\|_1 + (1-\tau)\sum_{g \in \mathcal{G}} w_g\|\beta_g\|$, for $\tau \in [0,1]$ and $w = (w_g)_{g \in \mathcal{G}}$ with $w_g \geq 0$ for all $g \in \mathcal{G}$. The case where $w_g = 0$ for some $g \in \mathcal{G}$ together with $\tau = 0$ is excluded ($\Omega_{\tau,w}$ is not a norm in such a case).

2 Sparse-Group Lasso regression

For $\lambda > 0$ and $\tau \in [0,1]$, the Sparse-Group Lasso estimator, denoted by $\hat\beta^{(\lambda,\Omega)}$, is defined as a minimizer of the primal objective $P_{\lambda,\Omega}$:

$\hat\beta^{(\lambda,\Omega)} \in \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|^2 + \lambda\Omega(\beta) =: P_{\lambda,\Omega}(\beta).$    (1)

A dual formulation (see [4, Th. 3.3.5]) of (1) is given by

$\hat\theta^{(\lambda,\Omega)} = \arg\max_{\theta \in \Delta_{X,\Omega}} \frac{1}{2}\|y\|^2 - \frac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|^2 =: D_\lambda(\theta),$    (2)

where $\Delta_{X,\Omega} = \{\theta \in \mathbb{R}^n : \Omega^D(X^\top\theta) \leq 1\}$. 
The parameter $\lambda > 0$ controls the trade-off between data-fitting and sparsity, and $\tau$ controls the trade-off between feature sparsity and group sparsity. In particular, one recovers the Lasso [18] if $\tau = 1$, and the Group-Lasso [22] if $\tau = 0$.

For the primal problem, Fermat's rule (cf. Appendix for details) reads:

$\lambda\hat\theta^{(\lambda,\Omega)} = y - X\hat\beta^{(\lambda,\Omega)}$ (link equation),    (3)
$X^\top\hat\theta^{(\lambda,\Omega)} \in \partial\Omega(\hat\beta^{(\lambda,\Omega)})$ (sub-differential inclusion).    (4)

Remark 1 (Dual uniqueness). The dual solution $\hat\theta^{(\lambda,\Omega)}$ is unique, while the primal solution $\hat\beta^{(\lambda,\Omega)}$ might not be. Indeed, the dual formulation (2) is equivalent to $\hat\theta^{(\lambda,\Omega)} = \arg\min_{\theta \in \Delta_{X,\Omega}}\|\theta - y/\lambda\|$, so $\hat\theta^{(\lambda,\Omega)} = \Pi_{\Delta_{X,\Omega}}(y/\lambda)$ is the projection of $y/\lambda$ onto the dual feasible set $\Delta_{X,\Omega}$.
Remark 2 (Critical parameter: $\lambda_{\max}$). There is a critical value $\lambda_{\max}$ such that $0$ is a primal solution of (1) for all $\lambda \geq \lambda_{\max}$. Indeed, Fermat's rule states $0 \in \arg\min_{\beta \in \mathbb{R}^p}\|y - X\beta\|^2/2 + \lambda\Omega(\beta) \iff 0 \in \{-X^\top y\} + \lambda\partial\Omega(0) \iff \Omega^D(X^\top y) \leq \lambda$ (since $\partial\Omega(0)$ is the unit ball of $\Omega^D$). Hence, the critical parameter is given by $\lambda_{\max} := \Omega^D(X^\top y)$. Note that evaluating $\lambda_{\max}$ relies heavily on the ability to (efficiently) compute the dual norm $\Omega^D$.

3 GAP safe rule for the Sparse-Group Lasso

The safe rule we propose here is an extension to the Sparse-Group Lasso of the GAP safe rules introduced for the Lasso and Group-Lasso [10, 15]. For the Sparse-Group Lasso, the geometry of the dual feasible set $\Delta_{X,\Omega}$ is more complex (an illustration is given in Fig. 1). Hence, computing a dual feasible point is more intricate. 
As seen in Section 3.2, the computation of a dual feasible point relies strongly on the ability to evaluate the dual norm $\Omega^D$. This crucial evaluation is discussed in Section 4. We first detail how GAP safe screening rules can be obtained for the Sparse-Group Lasso.

3.1 Description of the screening rules

Safe screening rules exploit the known sparsity of the solutions of problems such as (1). They discard inactive features/groups whose coefficients are guaranteed to be zero for optimal solutions. A significant reduction in computing time can then be obtained by ignoring "irrelevant" features/groups. The Sparse-Group Lasso benefits from two levels of screening: the safe rules can detect both group-wise zeros in the vector $\hat\beta^{(\lambda,\Omega)}$ and coordinate-wise zeros in the remaining groups.
To obtain useful screening rules, one needs a safe region, i.e., a set containing the optimal dual solution $\hat\theta^{(\lambda,\Omega)}$. Following [9], when we choose a ball $B(\theta_c, r)$ with radius $r$ and center $\theta_c$ as a safe region, we call it a safe sphere. A safe sphere is all the more useful as $r$ is small and $\theta_c$ is close to $\hat\theta^{(\lambda,\Omega)}$. The safe rules for the Sparse-Group Lasso work as follows: for any group $g$ in $\mathcal{G}$ and any safe sphere $B(\theta_c, r)$,

Group level safe screening rule: $\max_{\theta \in B(\theta_c,r)}\|\mathcal{S}_\tau(X_g^\top\theta)\| < (1-\tau)w_g \Rightarrow \hat\beta^{(\lambda,\Omega)}_g = 0$,    (5)
Feature level safe screening rule: $\forall j \in g,\ \max_{\theta \in B(\theta_c,r)}|X_j^\top\theta| < \tau \Rightarrow \hat\beta^{(\lambda,\Omega)}_j = 0$.    (6)

This means that, provided one of these two tests is true, the corresponding group or feature can be (safely) discarded. For screening variables, we rely on the following upper bounds.
Proposition 1. 
For all groups $g \in \mathcal{G}$ and $j \in g$,

$\max_{\theta \in B(\theta_c,r)}|X_j^\top\theta| \leq |X_j^\top\theta_c| + r\|X_j\|,$    (7)

and

$\max_{\theta \in B(\theta_c,r)}\|\mathcal{S}_\tau(X_g^\top\theta)\| \leq T_g := \begin{cases}\|\mathcal{S}_\tau(X_g^\top\theta_c)\| + r\|X_g\|, & \text{if } \|X_g^\top\theta_c\|_\infty > \tau,\\ \big(\|X_g^\top\theta_c\|_\infty + r\|X_g\| - \tau\big)_+, & \text{otherwise.}\end{cases}$    (8)

Assume now that one has found a safe sphere $B(\theta_c, r)$ (their construction is deferred to Section 3.2); then the safe screening rules given by (5) and (6) read:
Theorem 1 (Safe rules for the Sparse-Group Lasso). Using $T_g$ defined in (8), we can state the following safe screening rules:
Group level safe screening: $\forall g \in \mathcal{G}$, if $T_g < (1-\tau)w_g$, then $\hat\beta^{(\lambda,\Omega)}_g = 0$.
Feature level safe screening: $\forall g \in \mathcal{G}, \forall j \in g$, if $|X_j^\top\theta_c| + r\|X_j\| < \tau$, then $\hat\beta^{(\lambda,\Omega)}_j = 0$.

Figure 1: Lasso, Group-Lasso and Sparse-Group Lasso dual unit balls $\mathcal{B}_{\Omega^D} = \{\theta : \Omega^D(\theta) \leq 1\}$, for the case $\mathcal{G} = \{\{1,2\},\{3\}\}$ (i.e., $g_1 = \{1,2\}$, $g_2 = \{3\}$), $n = p = 3$, $w_{g_1} = w_{g_2} = 1$ and $\tau = 1/2$: (a) Lasso dual ball, $\Omega^D(\theta) = \|\theta\|_\infty$; (b) Group-Lasso dual ball, $\Omega^D(\theta) = \max(\sqrt{\theta_1^2 + \theta_2^2}, |\theta_3|)$; (c) Sparse-Group Lasso dual ball, $\mathcal{B}_{\Omega^D} = \{\theta : \forall g \in \mathcal{G}, \|\mathcal{S}_\tau(\theta_g)\| \leq (1-\tau)w_g\}$.

The screening rules detect which coordinates or groups of coordinates can be safely set to zero. This allows us to remove the corresponding features from the design matrix $X$ during the optimization process. 
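The two tests of Theorem 1 can be sketched as follows, for a given safe sphere $B(\theta_c, r)$; the function names and the list-of-index-lists group encoding are ours, and $\|X_g\|$ is taken as the operator (spectral) norm:

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def screen(X, groups, theta_c, r, tau, w):
    """Apply the group-level and feature-level tests of Theorem 1 for the
    safe sphere B(theta_c, r); returns the indices flagged as inactive.
    `groups` is the partition given as a list of index lists (our encoding)."""
    inactive_groups, inactive_feats = [], []
    for g, idx in enumerate(groups):
        Xg = X[:, idx]
        v = Xg.T @ theta_c
        op = np.linalg.norm(Xg, 2)          # operator (spectral) norm ||X_g||
        if np.max(np.abs(v)) > tau:         # upper bound T_g of Proposition 1
            Tg = np.linalg.norm(soft_threshold(v, tau)) + r * op
        else:
            Tg = max(np.max(np.abs(v)) + r * op - tau, 0.0)
        if Tg < (1.0 - tau) * w[g]:         # the whole group g is inactive
            inactive_groups.append(g)
            inactive_feats.extend(idx)
        else:                               # test remaining features one by one
            for j in idx:
                if abs(X[:, j] @ theta_c) + r * np.linalg.norm(X[:, j]) < tau:
                    inactive_feats.append(j)
    return inactive_groups, inactive_feats
```

In a solver, the features returned here would simply be removed from subsequent passes over the data.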
While standard algorithms solve (1) by scanning all variables, with safe screening strategies only the active ones, i.e., the non-screened-out variables (using the terminology from Section 3.3), need to be considered. This leads to significant computational speed-ups, especially with a coordinate descent algorithm for which it is natural to ignore features (see Algorithm 2 in Appendix G).

3.2 GAP safe sphere

We now show how to compute the safe sphere radius and center using the duality gap.

3.2.1 Computation of the radius

With a dual feasible point $\theta \in \Delta_{X,\Omega}$ and a primal vector $\beta \in \mathbb{R}^p$ at hand, let us construct a safe sphere centered at $\theta$, with radius obtained thanks to duality gap computations.
Theorem 2 (Safe radius). For any $\theta \in \Delta_{X,\Omega}$ and $\beta \in \mathbb{R}^p$, one has $\hat\theta^{(\lambda,\Omega)} \in B(\theta, r_{\lambda,\Omega}(\beta,\theta))$ for

$r_{\lambda,\Omega}(\beta,\theta) = \sqrt{\frac{2\big(P_{\lambda,\Omega}(\beta) - D_\lambda(\theta)\big)}{\lambda^2}},$

i.e., the aforementioned ball is a safe region for the Sparse-Group Lasso problem.

Proof. The result holds thanks to the strong concavity of the dual objective; cf. Appendix C.

3.2.2 Computation of the center

In GAP safe screening rules, the screening test relies crucially on the ability to compute a vector that belongs to the dual feasible set $\Delta_{X,\Omega}$. The geometry of this set is illustrated in Figure 1. Following [3], we leverage the primal/dual link equation (3) to construct a dual point from a current approximation $\beta$ of $\hat\beta^{(\lambda,\Omega)}$. When $\beta = \beta^{\lambda'}$ is obtained as an approximation for a previous value $\lambda' \neq \lambda$, we call such a strategy sequential screening. When $\beta = \beta_k$ is the primal iterate at iteration $k$ of an iterative algorithm, we call this dynamic screening. 
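In code, the safe radius of Theorem 2 is a one-liner once the primal and dual objectives (1) and (2) are available; a minimal sketch (the clipping of the gap at zero is a numerical precaution we add, not part of the theorem):

```python
import numpy as np

def primal(beta, X, y, lam, omega):
    """P_{lam,Omega}(beta) = ||y - X beta||^2 / 2 + lam * Omega(beta)."""
    return 0.5 * np.linalg.norm(y - X @ beta) ** 2 + lam * omega(beta)

def dual(theta, y, lam):
    """D_lam(theta) = ||y||^2 / 2 - lam^2/2 * ||theta - y/lam||^2."""
    return 0.5 * np.linalg.norm(y) ** 2 \
        - 0.5 * lam ** 2 * np.linalg.norm(theta - y / lam) ** 2

def gap_radius(p_val, d_val, lam):
    """Safe radius r = sqrt(2 * (P - D)) / lam; the gap is clipped at 0 to
    guard against tiny negative values caused by floating point."""
    return np.sqrt(2.0 * max(p_val - d_val, 0.0)) / lam
```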
Starting from a residual $\rho = y - X\beta$, one can create a dual feasible point by choosing¹:

$\theta = \frac{\rho}{\max(\lambda, \Omega^D(X^\top\rho))}.$    (9)

We refer to the sets $B(\theta, r_{\lambda,\Omega}(\beta,\theta))$ as GAP safe spheres. Note that the generalization to any smooth data fitting term would be straightforward; see [15].
Remark 3. Recall that $\lambda \geq \lambda_{\max}$ yields $\hat\beta^{(\lambda,\Omega)} = 0$, in which case $\rho := y - X\hat\beta^{(\lambda,\Omega)} = y$ is the optimal residual and $y/\lambda_{\max}$ is the dual solution. Thus, as for getting $\lambda_{\max} = \Omega^D(X^\top y)$, the scaling computation in (9) requires a dual norm evaluation.

¹We have used a simpler scaling than the choice of [2] (without noticing much difference in practice): $\theta = s\rho$ where $s = \min\big(\max\big(\frac{\rho^\top y}{\lambda\|\rho\|^2}, \frac{-1}{\Omega^D(X^\top\rho)}\big), \frac{1}{\Omega^D(X^\top\rho)}\big)$.

Algorithm 1 Computation of $\Lambda(x, \alpha, R)$.
Input: $x = (x_1, \dots, x_d)^\top \in \mathbb{R}^d$, $\alpha \in [0,1]$, $R \geq 0$
Output: $\Lambda(x, \alpha, R)$
if $\alpha = 0$ and $R = 0$ then $\Lambda(x, \alpha, R) = \infty$
else if $\alpha = 0$ and $R \neq 0$ then $\Lambda(x, \alpha, R) = \|x\|/R$
else if $R = 0$ then $\Lambda(x, \alpha, R) = \|x\|_\infty/\alpha$
else
    Get $I := \{i \in [d] : |x_i| > \alpha\|x\|_\infty/(\alpha + R)\}$; $n_I := \mathrm{Card}(I)$
    Sort the selected entries so that $|x_{(1)}| \geq |x_{(2)}| \geq \dots \geq |x_{(n_I)}|$
    $S_1 = |x_{(1)}|$, $S^{(2)}_1 = |x_{(1)}|^2$, $a_1 = 0$
    for $k \in [n_I - 1]$ do
        $S_{k+1} = S_k + |x_{(k+1)}|$; $S^{(2)}_{k+1} = S^{(2)}_k + |x_{(k+1)}|^2$
        $a_{k+1} = S^{(2)}_{k+1}/|x_{(k+1)}|^2 - 2S_{k+1}/|x_{(k+1)}| + (k+1)$
        if $R^2/\alpha^2 \in [a_k, a_{k+1})$ then $j_0 = k$; break
    if no break occurred then $j_0 = n_I$
    if $\alpha^2 j_0 - R^2 = 0$ then $\Lambda(x, \alpha, R) = \frac{S^{(2)}_{j_0}}{2\alpha S_{j_0}}$
    else $\Lambda(x, \alpha, R) = \frac{\alpha S_{j_0} - \sqrt{\alpha^2 S_{j_0}^2 - (\alpha^2 j_0 - R^2)S^{(2)}_{j_0}}}{\alpha^2 j_0 - R^2}$

3.3 Convergence of the active set

The next proposition states that the sequence of dual feasible points obtained from (9) converges to the dual solution $\hat\theta^{(\lambda,\Omega)}$ if $(\beta_k)_{k\in\mathbb{N}}$ converges to an optimal primal solution $\hat\beta^{(\lambda,\Omega)}$ (proof in Appendix). It guarantees that the GAP safe spheres $B(\theta_k, r_{\lambda,\Omega}(\beta_k, \theta_k))$ are converging safe regions in the sense introduced by [10], since by strong duality $\lim_{k\to\infty} r_{\lambda,\Omega}(\beta_k, \theta_k) = 0$.
Proposition 2. 
If $\lim_{k\to\infty}\beta_k = \hat\beta^{(\lambda,\Omega)}$, then $\lim_{k\to\infty}\theta_k = \hat\theta^{(\lambda,\Omega)}$.
For any safe region $\mathcal{R}$, i.e., a set containing $\hat\theta^{(\lambda,\Omega)}$, we define two levels of active sets, one at the group level and one at the feature level:

$\mathcal{A}_{\mathrm{gp}}(\mathcal{R}) := \{g \in \mathcal{G} : \max_{\theta\in\mathcal{R}}\|\mathcal{S}_\tau(X_g^\top\theta)\| \geq (1-\tau)w_g\}$, $\mathcal{A}_{\mathrm{ft}}(\mathcal{R}) := \bigcup_{g\in\mathcal{A}_{\mathrm{gp}}(\mathcal{R})}\{j \in g : \max_{\theta\in\mathcal{R}}|X_j^\top\theta| \geq \tau\}$.

If one considers a sequence of converging regions, then the next proposition (proof in Appendix) states that we can identify in finite time the optimal active sets, defined as follows:

$\mathcal{E}_{\mathrm{gp}} := \{g \in \mathcal{G} : \|\mathcal{S}_\tau(X_g^\top\hat\theta^{(\lambda,\Omega)})\| = (1-\tau)w_g\}$, $\mathcal{E}_{\mathrm{ft}} := \bigcup_{g\in\mathcal{E}_{\mathrm{gp}}}\{j \in g : |X_j^\top\hat\theta^{(\lambda,\Omega)}| \geq \tau\}$.

Proposition 3. Let $(\mathcal{R}_k)_{k\in\mathbb{N}}$ be a sequence of safe regions whose diameters converge to 0. Then $\lim_{k\to\infty}\mathcal{A}_{\mathrm{gp}}(\mathcal{R}_k) = \mathcal{E}_{\mathrm{gp}}$ and $\lim_{k\to\infty}\mathcal{A}_{\mathrm{ft}}(\mathcal{R}_k) = \mathcal{E}_{\mathrm{ft}}$.

4 Properties of the Sparse-Group Lasso

To apply our safe rule, we need to be able to evaluate the dual norm $\Omega^D$ efficiently. We describe such a step hereafter, along with some useful properties of the norm $\Omega$. Such evaluations are performed multiple times during the algorithm, motivating the derivation of an efficient procedure, as presented in Algorithm 1.

4.1 Connections with $\epsilon$-norms

Here, we establish a link between the Sparse-Group Lasso norm $\Omega$ and the $\epsilon$-norm (denoted $\|\cdot\|_\epsilon$) introduced in [6]. For any $\epsilon \in [0,1]$ and $x \in \mathbb{R}^d$, $\|x\|_\epsilon$ is defined as the unique nonnegative solution $\nu$ of the equation $\sum_{i=1}^d(|x_i| - (1-\epsilon)\nu)_+^2 = (\epsilon\nu)^2$ (with $\|x\|_0 := \|x\|_\infty$). 
Using soft-thresholding, this is equivalent to solving for $\nu$ the equation $\sum_{i=1}^d\mathcal{S}_{(1-\epsilon)\nu}(x_i)^2 = \|\mathcal{S}_{(1-\epsilon)\nu}(x)\|^2 = (\epsilon\nu)^2$. Moreover, the dual norm of the $\epsilon$-norm is given by² $\|y\|_\epsilon^D = \epsilon\|y\|^D + (1-\epsilon)\|y\|_\infty^D = \epsilon\|y\| + (1-\epsilon)\|y\|_1$. Now we can express the Sparse-Group Lasso norm $\Omega$ in terms of the dual $\epsilon$-norm and derive some basic properties.

²See [7, Eq. (42)] or the Appendix.

Proposition 4. For all groups $g$ in $\mathcal{G}$, let us introduce $\epsilon_g := \frac{(1-\tau)w_g}{\tau + (1-\tau)w_g}$. Then, the Sparse-Group Lasso norm satisfies the following properties: for any $\beta$ and $\xi$ in $\mathbb{R}^p$,

$\Omega(\beta) = \sum_{g\in\mathcal{G}}(\tau + (1-\tau)w_g)\|\beta_g\|^D_{\epsilon_g}$ and $\Omega^D(\xi) = \max_{g\in\mathcal{G}}\frac{\|\xi_g\|_{\epsilon_g}}{\tau + (1-\tau)w_g}$,    (10)

$\mathcal{B}_{\Omega^D} = \{\xi \in \mathbb{R}^p : \forall g \in \mathcal{G}, \|\mathcal{S}_\tau(\xi_g)\| \leq (1-\tau)w_g\}$.    (11)

The sub-differential at $\beta$ reads $\partial\Omega(\beta) = \{z \in \mathbb{R}^p : \forall g \in \mathcal{G}, z_g \in \tau\partial\|\cdot\|_1(\beta_g) + (1-\tau)w_g\partial\|\cdot\|(\beta_g)\}$.
We obtain from the characterization of the unit dual ball (11) that, for the Sparse-Group Lasso, any dual feasible point $\theta \in \Delta_{X,\Omega}$ verifies: $\forall g \in \mathcal{G}, X_g^\top\theta \in (1-\tau)w_g\mathcal{B} + \tau\mathcal{B}_\infty$. From the dual norm formulation (10), a vector $\theta \in \mathbb{R}^n$ is feasible if and only if $\Omega^D(X^\top\theta) \leq 1$, i.e., $\forall g \in \mathcal{G}, \|X_g^\top\theta\|_{\epsilon_g} \leq \tau + (1-\tau)w_g$. 
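Since $\|\xi_g\|_{\epsilon_g}$ is the root of a one-dimensional monotone equation, $\Omega^D$ in (10) can also be evaluated by simple bisection; the sketch below is a slower illustrative alternative to the sorting-based Algorithm 1, and the function names are ours. By Remark 2, applying `sgl_dual_norm` to the group blocks of $X^\top y$ yields $\lambda_{\max}$.

```python
import math

def epsilon_norm(x, eps, n_iter=100):
    """||x||_eps: the unique nu >= 0 with sum_i (|x_i| - (1-eps)*nu)_+^2
    = (eps*nu)^2, found here by bisection (eps=0 gives ||x||_inf)."""
    if all(v == 0 for v in x):
        return 0.0
    if eps == 0.0:
        return max(abs(v) for v in x)
    norm2 = math.sqrt(sum(v * v for v in x))
    lo, hi = 0.0, norm2 / eps  # at hi the left-hand side is <= the right-hand side
    for _ in range(n_iter):
        nu = 0.5 * (lo + hi)
        lhs = sum(max(abs(v) - (1.0 - eps) * nu, 0.0) ** 2 for v in x)
        if lhs > (eps * nu) ** 2:
            lo = nu
        else:
            hi = nu
    return 0.5 * (lo + hi)

def sgl_dual_norm(xi_groups, tau, w):
    """Omega^D via (10): max_g ||xi_g||_{eps_g} / (tau + (1-tau)*w_g),
    where xi_groups lists the group blocks of the input vector."""
    best = 0.0
    for xi_g, wg in zip(xi_groups, w):
        denom = tau + (1.0 - tau) * wg
        eps_g = (1.0 - tau) * wg / denom
        best = max(best, epsilon_norm(xi_g, eps_g) / denom)
    return best
```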
Hence we deduce from (11) a new characterization of the dual feasible set: $\Delta_{X,\Omega} = \{\theta \in \mathbb{R}^n : \forall g \in \mathcal{G}, \|X_g^\top\theta\|_{\epsilon_g} \leq \tau + (1-\tau)w_g\}$.

4.2 Efficient computation of the dual norm

The following proposition shows how to compute the dual norm of the Sparse-Group Lasso (and of the $\epsilon$-norm). This is turned into an efficient procedure in Algorithm 1 (see the Appendix for details).
Proposition 5. For $\alpha \in [0,1]$, $R \geq 0$ and $x \in \mathbb{R}^d$, the equation $\sum_{i=1}^d\mathcal{S}_{\nu\alpha}(x_i)^2 = (\nu R)^2$ has a unique solution $\nu := \Lambda(x, \alpha, R) \in \mathbb{R}_+$, which can be computed in $O(d\log d)$ operations in the worst case. With $n_I = \mathrm{Card}\{i \in [d] : |x_i| > \alpha\|x\|_\infty/(\alpha + R)\}$, the complexity of Algorithm 1 is $n_I + n_I\log(n_I)$, which is comparable to the ambient dimension $d$.
Thanks to Remark 2, we can make explicit the critical parameter $\lambda_{\max}$ for the Sparse-Group Lasso, namely

$\lambda_{\max} = \max_{g\in\mathcal{G}}\frac{\Lambda(X_g^\top y, 1-\epsilon_g, \epsilon_g)}{\tau + (1-\tau)w_g} = \Omega^D(X^\top y),$    (12)

and get a dual feasible point (9), since $\Omega^D(X^\top\rho) = \max_{g\in\mathcal{G}}\Lambda(X_g^\top\rho, 1-\epsilon_g, \epsilon_g)/(\tau + (1-\tau)w_g)$.

5 Implementation

In this section we provide details on how we solve the Sparse-Group Lasso primal problem and how we apply the GAP safe screening rules. We focus on the block coordinate iterative soft-thresholding algorithm (ISTA-BC); see [16]. This algorithm requires a block-wise Lipschitz gradient condition on the data fitting term $f(\beta) = \|y - X\beta\|^2/2$. For our problem (1), one can show that, for all groups $g$ in $\mathcal{G}$, $L_g = \|X_g\|_2^2$ (where $\|\cdot\|_2$ is the spectral norm of a matrix) is a suitable block-wise Lipschitz constant. 
We define the block coordinate descent algorithm according to the majorization-minimization principle: at each iteration $l$, we choose (e.g., cyclically) a group $g$, and the next iterate $\beta^{l+1}$ is defined such that $\beta^{l+1}_{g'} = \beta^l_{g'}$ if $g' \neq g$ and otherwise

$\beta^{l+1}_g = \arg\min_{\beta_g\in\mathbb{R}^{n_g}}\frac{1}{2}\Big\|\beta_g - \Big(\beta^l_g - \frac{\nabla_g f(\beta^l)}{L_g}\Big)\Big\|^2 + \frac{\lambda}{L_g}\big(\tau\|\beta_g\|_1 + (1-\tau)w_g\|\beta_g\|\big),$

where we denote $\alpha_g := \lambda/L_g$ for all $g$ in $\mathcal{G}$. This can be simplified to $\beta^{l+1}_g = \mathcal{S}^{\mathrm{gp}}_{(1-\tau)w_g\alpha_g}\big(\mathcal{S}_{\tau\alpha_g}\big(\beta^l_g - \nabla_g f(\beta^l)/L_g\big)\big)$. The expensive computation of the duality gap is not performed at each pass over the data, but only every $f^{ce}$ passes (in practice $f^{ce} = 10$ in all our experiments). A pseudo code is given in Appendix G.

6 Experiments

In this section we present our experiments and illustrate the numerical benefit of screening rules for the Sparse-Group Lasso.

6.1 Experimental settings and methods compared

We have run our ISTA-BC algorithm³ to obtain the Sparse-Group Lasso estimator for a non-increasing sequence of $T$ regularization parameters $(\lambda_t)_{t\in[T-1]}$, defined as follows: $\lambda_t := \lambda_{\max}10^{-\delta(t-1)/(T-1)}$.

³The source code can be found at https://github.com/EugeneNdiaye/GAPSAFE_SGL.

Figure 2: Experiments on a synthetic dataset ($\rho = 0.5$, $\gamma_1 = 10$, $\gamma_2 = 4$, $\tau = 0.2$). (a) Proportion of active variables, i.e., variables not safely eliminated, as a function of the parameters $(\lambda_t)$ and the number of iterations $K$. More red means more variables eliminated and better screening. 
(b) Time to reach convergence w.r.t. the accuracy on the duality gap, using various screening strategies.

By default, we choose $\delta = 3$ and $T = 100$, following the standard practice when running cross-validation with sparse models (see the R glmnet package [11]). The weights are always chosen as $w_g = \sqrt{n_g}$ (as in [17]).

We also provide a natural extension of the previous safe rules [9, 21, 3] to the Sparse-Group Lasso for comparison (please refer to Appendix D for more details). The static safe region [9] is given by $B(y/\lambda, \|y/\lambda_{\max} - y/\lambda\|)$. The corresponding dynamic safe region [3] is given by $B(y/\lambda, \|\theta_k - y/\lambda\|)$, where $(\theta_k)_{k\in\mathbb{N}}$ is a sequence of dual feasible points obtained by dual scaling; cf. Equation (9). DST3 is an improvement of the preceding safe region, see [21, 3], that we adapted to the Sparse-Group Lasso. The GAP safe sequential rule corresponds to using only GAP safe spheres whose centers are the (last) dual points output by the solver for a former value of $\lambda$ in the path. The GAP safe rule corresponds to performing our strategy both sequentially and dynamically. Presenting the sequential rule allows us to measure the respective benefits of the sequential and dynamic rules.
We now demonstrate the efficiency of our method on both synthetic (Fig. 2) and real (Fig. 3) datasets. For comparison, we report computation times to reach convergence up to a given tolerance on the duality gap for all the safe rules considered.
Synthetic dataset: We use a common framework [19, 20] based on the model $y = X\beta + 0.01\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \mathrm{Id}_n)$ and $X \in \mathbb{R}^{n\times p}$ follows a multivariate normal distribution such that $\forall (i,j) \in [p]^2, \mathrm{corr}(X_i, X_j) = \rho^{|i-j|}$. 
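The correlated Gaussian design just described can be sketched as follows (a Toeplitz covariance $\rho^{|i-j|}$; the function and parameter names are ours):

```python
import numpy as np

def make_design(n, p, rho, seed=0):
    """Draw n rows from N(0, Sigma) with Sigma_{ij} = rho**|i - j|,
    the correlated design used in the synthetic experiments above."""
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return rng.multivariate_normal(np.zeros(p), cov, size=n)
```

For large $p$, forming the dense $p \times p$ covariance is wasteful; an AR(1) recursion would generate the same distribution column by column.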
We fix $n = 100$ and randomly split the $p = 10000$ features into 1000 groups of size 10; we select $\gamma_1$ groups to be active, the others being set to zero. In each active group, $\gamma_2$ coordinates are drawn as $[\beta_g]_j = \mathrm{sign}(\xi)\times U$, where $U$ is uniform in $[0.5, 10]$ and $\xi$ is uniform in $[-1, 1]$.
Real dataset: NCEP/NCAR Reanalysis 1 [14]. The dataset contains monthly means of climate data measurements spread across the globe on a grid of 2.5° × 2.5° resolution (longitude × latitude: 144 × 73), from 1948/1/1 to 2015/10/31. Each grid point constitutes a group of 7 predictive variables (Air Temperature, Precipitable Water, Relative Humidity, Pressure, Sea Level Pressure, Horizontal Wind Speed and Vertical Wind Speed) whose concatenation across time constitutes our design matrix $X \in \mathbb{R}^{814\times 73577}$. Such data therefore have a natural group structure.
In our experiments, we considered as target variable $y \in \mathbb{R}^{814}$ the values of Air Temperature in a neighborhood of Dakar. Seasonality and trend are first removed, as is usually done in climate analysis for bias reduction in the regression estimates. Similar data have been used in [8], showing that the Sparse-Group Lasso estimator is well suited for prediction in climatology. Indeed, thanks to the sparsity structure, the estimates delineate via their support some predictive regions at the group level, as well as predictive features via coordinate-wise screening.
We choose $\tau$ in the set $\{0, 0.1, \dots, 0.9, 1\}$ by splitting the observations 50/50 and running a training-test validation procedure. For each value of $\tau$, we require a duality gap of $10^{-8}$ on the training part

Figure 3: Experiments on NCEP/NCAR Reanalysis 1 ($n = 814$, $p = 73577$): (a) Prediction error for the Sparse-Group Lasso path with 100 values of $\lambda$ and 11 values of $\tau$ (best: $\tau^\star = 0.4$). 
(b) Time to reach convergence, controlled by the duality gap (for the whole path $(\lambda_t)_{t\in[T]}$ with $\delta = 2.5$ and $\tau^\star = 0.4$). (c) Active groups for predicting Air Temperature in a neighborhood of Dakar (in blue). Cross-validation was run over 100 values of $\lambda$ and 11 values of $\tau$. At each location, the highest absolute value among the seven coefficients is displayed.

and pick the best one in terms of prediction accuracy on the test part. The result is displayed in Figure 3.(a). We fixed $\delta = 2.5$ for the computational time benchmark in Figure 3.(b).

6.2 Performance of the screening rules

In all our experiments, we observe that our proposed GAP safe rule outperforms the other rules in terms of computation time. In Figure 2.(b), we can see that we need 65 s to reach convergence whereas the other rules need up to 212 s at a precision of $10^{-8}$. A similar performance is observed on the real dataset (Figure 3.(b)), where we obtain up to a 5x speed-up over the other rules. The key reasons behind this performance gain are the convergence of the GAP safe regions toward the dual optimal point, as well as the efficient strategy to compute the screening rule. As shown in the results presented in Figure 2, our method still manages to screen out variables when $\lambda$ is small. This corresponds to low regularization, which leads to less sparse solutions but needs to be explored during cross-validation.
In the climate experiments, the support map in Figure 3.(c) shows that the most important coefficients are distributed in the vicinity of the target region (in agreement with our intuition). Nevertheless, some active variables with small coefficients remain and cannot be screened out.
Note that we do not compare our method to TLFre [20], since this sequential rule requires the exact knowledge of the dual optimal solution, which is not available in practice. 
As a consequence, one may discard active variables, which can prevent the algorithm from converging, as shown in [15].

7 Conclusion

The recently introduced GAP safe rules have brought substantial reductions in computing time for a wide range of regularized regression problems, especially in high dimension. To apply such GAP safe rules to the Sparse-Group Lasso, we have proposed a new description of the dual feasible set by establishing connections between the Sparse-Group Lasso norm and ε-norms. This geometrical connection has helped provide an efficient algorithm to compute the dual norm and dual feasible points, the bottlenecks for applying the GAP Safe rules. Extending GAP safe rules to general hierarchical regularizations is a possible direction for future research.

Acknowledgments: this work was supported by the ANR THALAMEEG ANR-14-NEUC-0002-01, the NIH R01 MH106174, the ERC Starting Grant SLAB ERC-YStG-676943 and the Chair Machine Learning for Big Data at Télécom ParisTech.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[2] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. A dynamic screening principle for the lasso. In EUSIPCO, 2014.

[3] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. Dynamic screening: accelerating first-order algorithms for the lasso and group-lasso. IEEE Trans. Signal Process., 63(19), 2015.

[4] J. M. Borwein and A. S. Lewis. Convex analysis and nonlinear optimization. Springer, New York, second edition, 2006.

[5] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Series in Statistics. Springer, Heidelberg, 2011.

[6] O. Burdakov. A new vector norm for nonlinear curve fitting and some other optimization problems. 33. Int. Wiss. Kolloq.
Vortragsreihe "Mathematische Optimierung – Theorie und Anwendungen", pages 15–17, 1988.

[7] O. Burdakov and B. Merkulov. On a new norm for data fitting and optimization problems. Tech. Rep. LiTH-MAT, Linköping University, Linköping, Sweden, 2001.

[8] S. Chatterjee, K. Steinhaeuser, A. Banerjee, S. Chatterjee, and A. Ganguly. Sparse group lasso: consistency and climate applications. In SIAM International Conference on Data Mining, pages 47–58, 2012.

[9] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

[10] O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In ICML, pages 333–342, 2015.

[11] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2):302–332, 2007.

[12] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res., 12:2297–2334, 2011.

[13] T. B. Johnson and C. Guestrin. Blitz: a principled meta-algorithm for scaling sparse optimization. In ICML, pages 1171–1179, 2015.

[14] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen, et al. The NCEP/NCAR 40-year reanalysis project. Bulletin of the American Meteorological Society, 77(3):437–471, 1996.

[15] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. GAP safe screening rules for sparse multi-task and multi-class models. In NIPS, pages 811–819, 2015.

[16] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group lasso. Mathematical Programming Computation, 5(2):143–169, 2013.

[17] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. J. Comput. Graph. Statist., 22(2):231–245, 2013.

[18] R. Tibshirani.
Regression shrinkage and selection via the lasso. JRSSB, 58(1):267–288, 1996.

[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. JRSSB, 74(2):245–266, 2012.

[20] J. Wang and J. Ye. Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. arXiv preprint arXiv:1410.4210, 2014.

[21] Z. J. Xiang, H. Xu, and P. J. Ramadge. Learning sparse representations of high dimensional data on large scale dictionaries. In NIPS, pages 900–908, 2011.

[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. JRSSB, 68(1):49–67, 2006.