{"title": "GAP Safe screening rules for sparse multi-task and multi-class models", "book": "Advances in Neural Information Processing Systems", "page_first": 811, "page_last": 819, "abstract": "High dimensional regression benefits from sparsity promoting regularizations. Screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up solvers. When the procedure is proven not to discard features wrongly the rules are said to be safe. In this paper we derive new safe rules for generalized linear models regularized with L1 and L1/L2 norms. The rules are based on duality gap computations and spherical safe regions whose diameters converge to zero. This allows to discard safely more variables, in particular for low regularization parameters. The GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed ups on all tested datasets with respect to previous safe rules.", "full_text": "GAP Safe screening rules for sparse multi-task and\n\nmulti-class models\n\nEugene Ndiaye Olivier Fercoq Alexandre Gramfort\n\nJoseph Salmon\n\nfirstname.lastname@telecom-paristech.fr\n\nLTCI, CNRS, T\u00b4el\u00b4ecom ParisTech, Universit\u00b4e Paris-Saclay\n\nParis, 75013, France\n\nAbstract\n\nHigh dimensional regression bene\ufb01ts from sparsity promoting regularizations.\nScreening rules leverage the known sparsity of the solution by ignoring some\nvariables in the optimization, hence speeding up solvers. When the procedure\nis proven not to discard features wrongly the rules are said to be safe. In this paper\nwe derive new safe rules for generalized linear models regularized with (cid:96)1 and\n(cid:96)1{(cid:96)2 norms. The rules are based on duality gap computations and spherical safe\nregions whose diameters converge to zero. 
This allows one to safely discard more variables, in particular for low regularization parameters. The GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed-ups on all tested datasets with respect to previous safe rules.

1 Introduction

The computational burden of solving high dimensional regularized regression problems has led to a vast literature over the last couple of decades on accelerating algorithmic solvers. With the increasing popularity of ℓ1-type regularization, ranging from the Lasso [18] or group-Lasso [24] to regularized logistic regression and multi-task learning, many algorithmic methods have emerged to solve the associated optimization problems. Although for simple ℓ1 regularized least squares a specific algorithm (e.g., the LARS [8]) can be considered, for more general formulations, penalties, and possibly larger dimensions, coordinate descent has proved to be a surprisingly efficient strategy [12]. Our main objective in this work is to propose a technique that can speed up any solver for such learning problems, and that is particularly well suited to coordinate descent methods, thanks to active set strategies.

The safe rules introduced by [9] for generalized ℓ1 regularized problems are a set of rules that allow one to eliminate features whose associated coefficients are proved to be zero at the optimum. Relaxing the safe rule, one can obtain some more speed-up at the price of possible mistakes. Such heuristic strategies, called strong rules [19], reduce the computational cost using an active set strategy, but require difficult post-processing to check for features possibly wrongly discarded. Another road to speed up screening methods has been the introduction of sequential safe rules [21, 23, 22]. 
The idea is to improve the screening thanks to the computations done for a previous regularization parameter. This scenario is particularly relevant in machine learning, where one computes solutions over a grid of regularization parameters, so as to select the best one (e.g., to perform cross-validation). Nevertheless, such strategies suffer from the same problem as strong rules, since relevant features can be wrongly disregarded: sequential rules usually rely on theoretical quantities that are not known by the solver, but only approximated. In particular, for such rules to work one needs the exact dual optimal solution from the previous regularization parameter.

Recently, the introduction of safe dynamic rules [6, 5] has opened a promising avenue by letting the screening be done not only at the beginning of the algorithm, but all along the iterations. Following a method introduced for the Lasso [11], we generalize this dynamic safe rule, called the GAP Safe rule (because it relies on duality gap computations), to a large class of learning problems with the following benefits:

• a unified and flexible framework for a wider family of problems,
• easy to insert in existing solvers,
• proved to be safe,
• more efficient than previous safe rules,
• achieves fast true active set identification.

We introduce our general GAP Safe framework in Section 2. We then specialize it to important machine learning use cases in Section 3. In Section 4 we apply our GAP Safe rules to a multi-task Lasso problem, relevant for brain imaging with magnetoencephalography data, as well as to multinomial logistic regression regularized with the ℓ1/ℓ2 norm for joint feature selection.

2 GAP Safe rules

2.1 Model and notations

We denote by [d] the set {1, . . . , d} for any integer d ∈ N, and by Q^⊤ the transpose of a matrix Q. 
Our observation matrix is Y ∈ R^{n×q}, where n represents the number of samples and q the number of tasks or classes. The design matrix X = [x^{(1)}, . . . , x^{(p)}] = [x_1, . . . , x_n]^⊤ ∈ R^{n×p} has p explanatory variables (or features) column-wise, and n observations row-wise. The standard ℓ2 norm is written ‖·‖_2, the ℓ1 norm ‖·‖_1, the ℓ∞ norm ‖·‖_∞. The ℓ2 unit ball is denoted by B_2 (or simply B), and we write B(c, r) for the ℓ2 ball with center c and radius r. For a matrix B ∈ R^{p×q}, we denote by ‖B‖^2 = Σ_{j=1}^p Σ_{k=1}^q B_{j,k}^2 the squared Frobenius norm, and by ⟨·, ·⟩ the associated inner product.

We consider the general optimization problem of minimizing a separable function with a group-Lasso regularization. The parameter to recover is a matrix B ∈ R^{p×q}, and for any j in [p], B_{j,:} is the j-th row of B, while for any k in [q], B_{:,k} is the k-th column. We would like to find

B̂^{(λ)} ∈ arg min_{B ∈ R^{p×q}} Σ_{i=1}^n f_i(x_i^⊤ B) + λ Ω(B) =: P_λ(B),   (1)

where f_i : R^{1×q} → R is a convex function with 1/γ-Lipschitz gradient. Hence F : B ↦ Σ_{i=1}^n f_i(x_i^⊤ B) is also convex with Lipschitz gradient. The function Ω : R^{p×q} → R_+ is the ℓ1/ℓ2 norm Ω(B) = Σ_{j=1}^p ‖B_{j,:}‖_2, promoting a few rows of B to be non-zero at a time. The parameter λ is a non-negative constant controlling the trade-off between data fitting and regularization.

Some elements of convex analysis used in the following are introduced here. For a convex function f : R^d → [−∞, +∞], the Fenchel-Legendre transform of f is the function f* : R^d → [−∞, +∞] defined by f*(u) = sup_{z ∈ R^d} ⟨z, u⟩ − f(z). The sub-differential of a function f at a point x is denoted by ∂f(x). 
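To make the objects in (1) concrete, here is a minimal NumPy sketch (an illustration, not the authors' implementation; the function names are ours) of the ℓ1/ℓ2 penalty Ω and of the primal objective P_λ(B) for the squared multi-task loss f_i(z) = ‖Y_{i,:} − z‖²/2:

```python
import numpy as np

def l1_l2_norm(B):
    """Omega(B) = sum_j ||B_{j,:}||_2: the l1/l2 norm, which
    promotes entire rows of B to be zero at once."""
    return np.sum(np.linalg.norm(B, axis=1))

def primal_objective(X, Y, B, lam):
    """P_lam(B) = sum_i f_i(x_i^T B) + lam * Omega(B), here with the
    squared multi-task loss f_i(z) = ||Y_{i,:} - z||^2 / 2."""
    return 0.5 * np.sum((Y - X @ B) ** 2) + lam * l1_l2_norm(B)
```

The row-wise grouping in Ω is what makes a screened feature j drop out of all q tasks simultaneously.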
The dual norm of Ω is the ℓ∞/ℓ2 norm and reads Ω*(B) = max_{j ∈ [p]} ‖B_{j,:}‖_2.

Remark 1. For ease of reading, all groups are weighted with equal strength, but the extension of our results to non-equal weights, as proposed in the original group-Lasso paper [24], would be straightforward.

2.2 Basic properties

First we recall the associated Fermat's condition and a dual formulation of the optimization problem.

Theorem 1 (Fermat's condition; see [3, Proposition 26.1] for a more general result). For any convex function f : R^n → R:

x ∈ arg min_{x ∈ R^n} f(x) ⇔ 0 ∈ ∂f(x).   (2)

(Footnote 1: the Fenchel-Legendre transform is also often referred to as the (convex) conjugate of a function.)

Theorem 2 ([9]). A dual formulation of (1) is given by

Θ̂^{(λ)} = arg max_{Θ ∈ Δ_X} −Σ_{i=1}^n f_i*(−λ Θ_{i,:}) =: D_λ(Θ),   (3)

where Δ_X = {Θ ∈ R^{n×q} : ∀j ∈ [p], ‖x^{(j)⊤} Θ‖_2 ≤ 1} = {Θ ∈ R^{n×q} : Ω*(X^⊤ Θ) ≤ 1}. The primal and dual solutions are linked by

∀i ∈ [n], Θ̂^{(λ)}_{i,:} = −∇f_i(x_i^⊤ B̂^{(λ)})/λ.   (4)

Furthermore, Fermat's condition reads:

∀j ∈ [p], x^{(j)⊤} Θ̂^{(λ)} ∈ {B̂^{(λ)}_{j,:}/‖B̂^{(λ)}_{j,:}‖_2} if B̂^{(λ)}_{j,:} ≠ 0, and x^{(j)⊤} Θ̂^{(λ)} ∈ B_2 if B̂^{(λ)}_{j,:} = 0.   (5)

Remark 2. Contrary to the primal, the dual problem has a unique solution under our assumption on the f_i. Indeed, the dual function is strongly concave, hence strictly concave.

Remark 3. For any Θ ∈ R^{n×q}, let us introduce G(Θ) = [∇f_1(Θ_{1,:})^⊤, . . . , ∇f_n(Θ_{n,:})^⊤] ∈ R^{n×q}. Then the primal/dual link can be written Θ̂^{(λ)} = −G(X B̂^{(λ)})/λ.

2.3 Critical parameter: λmax

For λ large enough the solution of the primal problem is simply 0. Thanks to Fermat's rule (2), 0 is optimal if and only if −∇F(0)/λ ∈ ∂Ω(0). By the properties of the dual norm Ω*, this is equivalent to Ω*(∇F(0)/λ) ≤ 1. Since ∇F(0) = X^⊤ G(0), 0 is a primal solution of P_λ if and only if λ ≥ λmax := max_{j ∈ [p]} ‖x^{(j)⊤} G(0)‖_2 = Ω*(X^⊤ G(0)). This development shows that for λ ≥ λmax, Problem (1) is trivial. So from now on, we will only focus on the case where λ ≤ λmax.

2.4 Screening rules description

Safe screening rules rely on a simple consequence of Fermat's condition:

‖x^{(j)⊤} Θ̂^{(λ)}‖_2 < 1 ⇒ B̂^{(λ)}_{j,:} = 0.   (6)

Stated in such a way, this relation is useless because Θ̂^{(λ)} is unknown (unless λ > λmax). However, it is often possible to construct a set R ⊂ R^{n×q}, called a safe region, containing it. Then, note that

max_{Θ ∈ R} ‖x^{(j)⊤} Θ‖_2 < 1 ⇒ B̂^{(λ)}_{j,:} = 0.   (7)

The so-called safe screening rules consist in removing the variable j from the problem whenever the previous test is satisfied, since B̂^{(λ)}_{j,:} is then guaranteed to be zero. This property leads to considerable speed-ups in practice, especially with active set strategies; see for instance [11] for the Lasso case. A natural goal is to find safe regions as narrow as possible: smaller safe regions can only increase the number of screened-out variables. However, complex regions could lead to a computational burden limiting the benefit of screening. 
Hence, we focus on constructing R to satisfy the following trade-off:

• R is as small as possible and contains Θ̂^{(λ)}.
• Computing max_{Θ ∈ R} ‖x^{(j)⊤} Θ‖_2 is cheap.

2.5 Spheres as safe regions

Various shapes have been considered in practice for the set R, such as balls (referred to as spheres) [9], domes [11], or more refined sets (see [23] for a survey). Here we consider the so-called "sphere regions", choosing a ball R = B(c, r) as a safe region. One can easily obtain a control on max_{Θ ∈ B(c,r)} ‖x^{(j)⊤} Θ‖_2 by extending the computation of the support function of a ball [11, Eq. (9)] to the matrix case:

max_{Θ ∈ B(c,r)} ‖x^{(j)⊤} Θ‖_2 ≤ ‖x^{(j)⊤} c‖_2 + r ‖x^{(j)}‖_2.

Note that here the center c is a matrix in R^{n×q}. We can now state the safe sphere test:

Sphere test: If ‖x^{(j)⊤} c‖_2 + r ‖x^{(j)}‖_2 < 1, then B̂^{(λ)}_{j,:} = 0.   (8)

2.6 GAP Safe rule description

In this section we derive a GAP Safe screening rule extending the one introduced in [11]. For this, we rely on the strong concavity of the dual objective function and on weak duality.

Finding a radius: remember that for all i ∈ [n], f_i is differentiable with a 1/γ-Lipschitz gradient. As a consequence, each f_i* is γ-strongly convex [14, Theorem 4.2.2, p. 83], and so D_λ is γλ²-strongly concave:

∀(Θ_1, Θ_2) ∈ R^{n×q} × R^{n×q}, D_λ(Θ_2) ≤ D_λ(Θ_1) + ⟨∇D_λ(Θ_1), Θ_2 − Θ_1⟩ − (γλ²/2) ‖Θ_1 − Θ_2‖².

Specifying the previous inequality for Θ_1 = Θ̂^{(λ)} and Θ_2 = Θ ∈ Δ_X, one has

D_λ(Θ) ≤ D_λ(Θ̂^{(λ)}) + ⟨∇D_λ(Θ̂^{(λ)}), Θ − Θ̂^{(λ)}⟩ − (γλ²/2) ‖Θ̂^{(λ)} − Θ‖².

By definition, Θ̂^{(λ)} maximizes D_λ on Δ_X, so we have ⟨∇D_λ(Θ̂^{(λ)}), Θ − Θ̂^{(λ)}⟩ ≤ 0. 
This implies

D_λ(Θ) ≤ D_λ(Θ̂^{(λ)}) − (γλ²/2) ‖Θ̂^{(λ)} − Θ‖².

By weak duality, for all B ∈ R^{p×q}, D_λ(Θ̂^{(λ)}) ≤ P_λ(B), so for all B ∈ R^{p×q} and Θ ∈ Δ_X, D_λ(Θ) ≤ P_λ(B) − (γλ²/2) ‖Θ̂^{(λ)} − Θ‖², and we deduce the following theorem:

Theorem 3. ∀B ∈ R^{p×q}, ∀Θ ∈ Δ_X,

‖Θ̂^{(λ)} − Θ‖ ≤ sqrt( 2 (P_λ(B) − D_λ(Θ)) / (γλ²) ) =: r̂_λ(B, Θ).   (9)

Provided one knows a dual feasible point Θ ∈ Δ_X and a B ∈ R^{p×q}, it is therefore possible to construct a safe sphere with radius r̂_λ(B, Θ) centered at Θ. We now only need to build a (relevant) dual point on which to center such a ball. The results of Section 2.3 ensure that −G(0)/λmax ∈ Δ_X, but this leads to a static rule, as introduced in [9]. We need a dynamic center to improve the screening as the solver proceeds.

Finding a center: remember that Θ̂^{(λ)} = −G(X B̂^{(λ)})/λ. Now assume that one has a converging algorithm for the primal problem, i.e., B_k → B̂^{(λ)}. A natural choice for creating a dual feasible point Θ_k is then to take it proportional to −G(X B_k), for instance by setting, with R_k = −G(X B_k):

Θ_k = R_k/λ if Ω*(X^⊤ R_k) ≤ λ, and Θ_k = R_k/Ω*(X^⊤ R_k) otherwise.   (10)

A refined method consists in solving the one dimensional problem arg max_{Θ ∈ Δ_X ∩ Span(R_k)} D_λ(Θ). In the Lasso and group-Lasso cases [5, 6, 11] such a step is simply a projection onto the intersection of a line and the (polytope) dual set, and can be computed efficiently. 
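To fix ideas, the radius (9), the dual point (10), and the sphere test (8) combine into a few lines. The sketch below is our own illustration, specialized to the multi-task Lasso of Table 1 (R_k = Y − X B_k, G(Θ) = Θ − Y, γ = 1; the function name is ours); it returns the mask of features that survive one screening pass:

```python
import numpy as np

def gap_safe_screen(X, Y, B, lam):
    """One GAP Safe screening pass for the multi-task Lasso
    (f_i(z) = ||Y_{i,:} - z||^2 / 2, gamma = 1).
    Returns a boolean mask: True = feature j may stay active."""
    R = Y - X @ B                                       # residual, R_k = -G(X B_k)
    # Dual feasible point (Eq. 10): rescale R so Omega*(X^T Theta) <= 1
    dual_norm = np.max(np.linalg.norm(X.T @ R, axis=1))  # l_inf/l_2 dual norm
    Theta = R / max(lam, dual_norm)
    # Primal and dual objectives for the quadratic multi-task loss
    P = 0.5 * np.sum(R ** 2) + lam * np.sum(np.linalg.norm(B, axis=1))
    D = lam * np.sum(Theta * Y) - 0.5 * lam ** 2 * np.sum(Theta ** 2)
    r = np.sqrt(2.0 * max(P - D, 0.0)) / lam            # GAP Safe radius (Eq. 9)
    # Sphere test (Eq. 8): screen j out if ||x_j^T Theta|| + r ||x_j|| < 1
    scores = np.linalg.norm(X.T @ Theta, axis=1) + r * np.linalg.norm(X, axis=0)
    return scores >= 1.0
```

For instance, at B = 0 and λ = λmax the duality gap (essentially) vanishes, the radius collapses, and every feature with ‖x^{(j)⊤} Y‖_2 strictly below λmax is screened out.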
However for logistic regression the computation is more involved, so we have opted for the simpler solution in Equation (10), where R_k = −G(X B_k). This still provides converging safe rules (see Proposition 1).

Dynamic GAP Safe rule summarized

We can now state our dynamic GAP Safe rule at the k-th step of an iterative solver:

1. Compute B_k, and then obtain Θ_k and r̂_λ(B_k, Θ_k) using (10).

2. If ‖x^{(j)⊤} Θ_k‖_2 + r̂_λ(B_k, Θ_k) ‖x^{(j)}‖_2 < 1, then set B̂^{(λ)}_{j,:} = 0 and remove x^{(j)} from X.

Dynamic safe screening rules are more efficient than existing methods in practice because they can increase the ability to screen as the algorithm proceeds. Since sharper and sharper dual regions are available along the iterations, support identification is improved. Provided one relies on a converging primal algorithm, one can show that the dual sequence we propose converges too. The convergence of the primal is unaltered by our GAP Safe rule: screening out unnecessary coefficients of B_k can only decrease its distance to its original limit. Moreover, a practical consequence is that one can observe surprising situations where lowering the tolerance of the solver reduces the computation time. This can happen in sequential setups.

Proposition 1. Let B_k be the current estimate of B̂^{(λ)} and Θ_k, defined in Eq. (10), be the current estimate of Θ̂^{(λ)}. Then lim_{k→+∞} B_k = B̂^{(λ)} implies lim_{k→+∞} Θ_k = Θ̂^{(λ)}.

Note that if the primal sequence converges to the optimum, our dual sequence also converges. But we know that the radius of our safe sphere is (2(P_λ(B_k) − D_λ(Θ_k))/(γλ²))^{1/2}. 
By strong duality, this radius converges to 0; hence we have certified that our sequence of GAP Safe regions B(Θ_k, r̂_λ(B_k, Θ_k)) is a converging safe rule (in the sense introduced in [11, Definition 1]).

Remark 4. The active set obtained by our GAP Safe rule (i.e., the indices of the non screened-out variables) converges to the equicorrelation set [20] E_λ := {j ∈ [p] : ‖x^{(j)⊤} Θ̂^{(λ)}‖_2 = 1}, allowing us to identify relevant features early (see Proposition 2 in the supplementary material for more details).

3 Special cases of interest

We now specialize our results to relevant supervised learning problems; see also Table 1.

3.1 Lasso

In the Lasso case q = 1, the parameter is a vector: B = β ∈ R^p, F(β) = (1/2)‖y − Xβ‖_2² = Σ_{i=1}^n (y_i − x_i^⊤ β)²/2, meaning that f_i(z) = (y_i − z)²/2 and Ω(β) = ‖β‖_1.

3.2 ℓ1/ℓ2 multi-task regression

In the multi-task Lasso, which is a special case of the group-Lasso, we assume that the observation is Y ∈ R^{n×q}, F(B) = (1/2)‖Y − XB‖² = (1/2) Σ_{i=1}^n ‖Y_{i,:} − x_i^⊤ B‖² (i.e., f_i(z) = ‖Y_{i,:} − z‖²/2) and Ω(B) = Σ_{j=1}^p ‖B_{j,:}‖_2. In signal processing, this model is also referred to as the Multiple Measurement Vector (MMV) problem. It allows one to jointly select the same features for multiple regression tasks [1, 2].

Remark 5. Our framework could easily encompass the case of non-overlapping groups with various sizes and weights presented in [6]. Since our aim is mostly multi-task and multinomial applications, we have rather presented a matrix formulation.

3.3 ℓ1 regularized logistic regression

Here, we consider the formulation given in [7, Chapter 3] for two-class logistic regression. 
In such a context, one observes for each i ∈ [n] a class label c_i ∈ {1, 2}. This information can be recast as y_i = 1_{c_i = 1}, and it is then customary to minimize (1) where

F(β) = Σ_{i=1}^n ( −y_i x_i^⊤ β + log(1 + exp(x_i^⊤ β)) ),   (11)

with B = β ∈ R^p (i.e., q = 1), f_i(z) = −y_i z + log(1 + exp(z)), and the penalty is simply the ℓ1 norm: Ω(β) = ‖β‖_1. Let us introduce Nh, the (binary) negative entropy function defined (with the convention 0 log(0) = 0) by:

Nh(x) = x log(x) + (1 − x) log(1 − x) if x ∈ [0, 1], and +∞ otherwise.   (12)

Then, one can easily check that f_i*(z_i) = Nh(z_i + y_i) and γ = 4.

Lasso: f_i(z) = (y_i − z)²/2; f_i*(u) = ((y_i + u)² − y_i²)/2; Ω(B) = ‖β‖_1; λmax = ‖X^⊤ y‖_∞; G(θ) = θ − y; γ = 1.
Multi-task regression: f_i(z) = ‖Y_{i,:} − z‖²/2; f_i*(u) = (‖Y_{i,:} + u‖² − ‖Y_{i,:}‖²)/2; Ω(B) = Σ_{j=1}^p ‖B_{j,:}‖_2; λmax = Ω*(X^⊤ Y); G(Θ) = Θ − Y; γ = 1.
Logistic regression: f_i(z) = log(1 + e^z) − y_i z; f_i*(u) = Nh(u + y_i); Ω(B) = ‖β‖_1; λmax = ‖X^⊤ (1_n/2 − y)‖_∞; G(θ) = e^θ/(1 + e^θ) − y; γ = 4.
Multinomial regression: f_i(z) = −Σ_{k=1}^q Y_{i,k} z_k + log(Σ_{k=1}^q e^{z_k}); f_i*(u) = NH(u + Y_{i,:}); Ω(B) = Σ_{j=1}^p ‖B_{j,:}‖_2; λmax = Ω*(X^⊤ (1_{n×q}/q − Y)); G(Θ) = RowNorm(e^Θ) − Y; γ = 1.

Table 1: Useful ingredients for computing GAP Safe rules. We have used lower case to indicate when the parameters are vectorial (i.e., q = 1). The function RowNorm normalizes a (non-negative) matrix row-wise, so that each row sums to one.

3.4 ℓ1/ℓ2 multinomial logistic regression

We adapt the formulation given in [7, Chapter 3] to multinomial regression. In such a context, one observes for each i ∈ [n] a class label c_i ∈ {1, . . . , q}. 
This information can be recast into a matrix Y ∈ R^{n×q} filled with 0's and 1's: Y_{i,k} = 1_{c_i = k}. In the same spirit as the multi-task Lasso, a matrix B ∈ R^{p×q} is formed by q vectors encoding the hyperplanes for the linear classification. The multinomial ℓ1/ℓ2 regularized regression reads:

F(B) = Σ_{i=1}^n ( −Σ_{k=1}^q Y_{i,k} x_i^⊤ B_{:,k} + log( Σ_{k=1}^q exp(x_i^⊤ B_{:,k}) ) ),   (13)

with f_i(z) = −Σ_{k=1}^q Y_{i,k} z_k + log(Σ_{k=1}^q exp(z_k)) to recover the formulation as in (1). Let us introduce NH, the negative entropy function defined (still with the convention 0 log(0) = 0) by

NH(x) = Σ_{i=1}^q x_i log(x_i) if x ∈ Σ_q = {x ∈ R_+^q : Σ_{i=1}^q x_i = 1}, and +∞ otherwise.   (14)

Again, one can easily check that f_i*(z) = NH(z + Y_{i,:}) and γ = 1.

Remark 6. For multinomial logistic regression, D_λ implicitly encodes the additional constraint Θ ∈ dom D_λ = {Θ : ∀i ∈ [n], −λ Θ_{i,:} + Y_{i,:} ∈ Σ_q}, where Σ_q is the q dimensional simplex, see (14). As 0 and R_k/λ both belong to this set, any convex combination of them, such as Θ_k defined in (10), satisfies this additional constraint.

Remark 7. The intercept has been neglected in our models for simplicity. Our GAP Safe framework can also handle such a feature, at the cost of more technical details (by adapting the results from [15], for instance). However, in practice, the intercept can be handled in the present formulation by adding a constant column to the design matrix X. The intercept is then regularized but, if the constant is set high enough, the regularization is small, and experiments show that it has little to no impact for high-dimensional problems. 
This is the strategy used by the Liblinear package [10].

4 Experiments

In this section we present results obtained with the GAP Safe rule. Results are on high dimensional data, both dense and sparse. The implementation was done in Python, with Cython for the critical parts. It is based on the multi-task Lasso implementation of Scikit-Learn [17] and on the coordinate descent logistic regression solver in the Lightning software [4]. In all experiments, the coordinate descent algorithm used follows the pseudo code from [11], with a screening step every 10 iterations.

Figure 1: Experiments on a MEG/EEG brain imaging dataset (dense data with n = 360, p = 22494 and q = 20). On the left: fraction of active variables as a function of λ and the number of iterations K. The GAP Safe strategy has a much longer range of λ with (red) small active sets. On the right: computation time to reach convergence using different screening strategies.

Note that we have not performed a comparison with the sequential screening rules commonly acknowledged as the state-of-the-art "safe" screening rules (such as the EDPP+ [21]), since we can show that this kind of rule is not safe. Indeed, the stopping criterion is based on dual gap accuracy, and comparisons would be unfair since such methods sometimes do not converge to the prescribed accuracy. This is backed up by a counter-example given in the supplementary material. Nevertheless, modifications of such rules, inspired by our GAP Safe rules, can make them safe. However, the obtained sequential rules are still outperformed by our dynamic strategies (see Figure 2 for an illustration).

4.1 ℓ1/ℓ2 multi-task regression

To demonstrate the benefit of the GAP Safe screening rule for a multi-task Lasso problem we used neuroimaging data. 
Electroencephalography (EEG) and magnetoencephalography (MEG) are brain imaging modalities that allow one to identify active brain regions. The problem to solve is a multi-task regression problem with squared loss, where every task corresponds to a time instant. Using a multi-task Lasso, one can constrain the recovered sources to be identical during a short time interval [13]. This corresponds to a temporal stationarity assumption. In this experiment we used joint MEG/EEG data with 301 MEG and 59 EEG sensors, leading to n = 360. The number of possible sources is p = 22,494 and the number of time instants is q = 20. With a 1 kHz sampling rate, this amounts to saying that the sources stay the same for 20 ms.

Results are presented in Figure 1. The GAP Safe rule is compared with the dynamic safe rule from [6]. The experimental setup consists in estimating the solutions of the multi-task Lasso problem for 100 values of λ on a logarithmic grid from λmax to λmax/10^3. For the experiments on the left, a fixed number of iterations from 2 to 2^11 is allowed for each λ, and the fraction of active variables is reported. Figure 1 illustrates that the GAP Safe rule screens out many more variables than the compared method, as well as the converging nature of our safe regions: the more iterations performed, the more variables the rule can screen. On the right, computation time confirms the effective speed-up. Our rule significantly improves the computation time for all duality gap tolerances from 10^-2 to 10^-8, especially when accurate estimates are required, e.g., for feature selection.

4.2 ℓ1 binary logistic regression

Results on the Leukemia dataset are reported in Figure 2. We compare the dynamic strategy of GAP Safe to a sequential, non-dynamic rule such as Slores [22]. We do not compare to the actual Slores rule as it requires the previous dual optimal solution, which is not available. 
Slores is indeed not a safe method (see Section B in the supplementary material). Nevertheless, one can observe that dynamic strategies outperform purely sequential ones (see Section C in the supplementary material).

Figure 2: ℓ1 regularized binary logistic regression on the Leukemia dataset (n = 72; p = 7,129; q = 1). Simple sequential and fully dynamic screening GAP Safe rules are compared. On the left: fraction of the variables that are active; each line corresponds to a fixed number of iterations for which the algorithm is run. On the right: computation times needed to solve the logistic regression path to the desired accuracy with 100 values of λ.

4.3 ℓ1/ℓ2 multinomial logistic regression

We also applied GAP Safe to an ℓ1/ℓ2 multinomial logistic regression problem on a sparse dataset. Data are bag of words features extracted from the News20 dataset (TF-IDF, removing English stop words and words occurring only once or more than 95% of the time). One can observe in Figure 3 the dynamic screening and its benefit as more iterations are performed. GAP Safe leads to a significant speedup: to get a duality gap smaller than 10^-2 on the 100 values of λ, we needed 1,353 s without screening and only 485 s with GAP Safe activated.

Figure 3: Fraction of the variables that are active for ℓ1/ℓ2 regularized multinomial logistic regression on 3 classes of the News20 dataset (sparse data with n = 2,757; p = 13,010; q = 3). Computation was run on the best 10% of the features using χ² univariate feature selection [16]. Each line corresponds to a fixed number of iterations for which the algorithm is run.

5 Conclusion

This contribution detailed new safe rules for accelerating algorithms that solve generalized linear models regularized with ℓ1 and ℓ1/ℓ2 norms. 
The rules proposed are safe, easy to implement, dynamic and converging, allowing one to discard significantly more variables than alternative safe rules. The positive impact in terms of computation time was observed on all tested datasets, and demonstrated here on a high dimensional regression task using brain imaging data as well as on binary and multiclass classification problems with dense and sparse data. Extensions to other generalized linear models, e.g., Poisson regression, are expected to reach the same conclusion. Future work could investigate the optimal screening frequency, determining when the screening has correctly detected the support.

Acknowledgment

We acknowledge the support from the Chair Machine Learning for Big Data at Télécom ParisTech and from the Orange/Télécom ParisTech think tank phi-TAB. This work benefited from the support of the \"FMJH Program Gaspard Monge in optimization and operation research\", and from the support to this program from EDF.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, pages 41-48, 2006.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, 2011.
[4] M. Blondel, K. Seki, and K. Uehara. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning, 93(1):31-52, 2013.
[5] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. A dynamic screening principle for the lasso. In EUSIPCO, 2014.
[6] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. Dynamic screening: accelerating first-order algorithms for the Lasso and group-Lasso. IEEE Trans. Signal Process., 63(19), 2015.
[7] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics. Springer, Heidelberg, 2011.
[8] B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407-499, 2004. With discussion, and a rejoinder by the authors.
[9] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667-698, 2012.
[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.
[11] O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In ICML, pages 333-342, 2015.
[12] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[13] A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7):1937-1961, 2012.
[14] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. II, volume 306. Springer-Verlag, Berlin, 1993.
[15] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. J. Mach. Learn. Res., 8(8):1519-1555, 2007.
[16] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825-2830, 2011.
[18] R. Tibshirani. Regression shrinkage and selection via the lasso. JRSSB, 58(1):267-288, 1996.
[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. JRSSB, 74(2):245-266, 2012.
[20] R. J. Tibshirani. The lasso problem and uniqueness. Electron. J. Stat., 7:1456-1490, 2013.
[21] J. Wang, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. arXiv preprint arXiv:1211.3966, 2012.
[22] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A safe screening rule for sparse logistic regression. In NIPS, pages 1053-1061, 2014.
[23] Z. J. Xiang, Y. Wang, and P. J. Ramadge. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.
[24] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. JRSSB, 68(1):49-67, 2006.
", "award": [], "sourceid": 525, "authors": [{"given_name": "Eugene", "family_name": "Ndiaye", "institution": "Institut Mines-T\u00e9l\u00e9com, T\u00e9l\u00e9com ParisTech, CNRS LTCI"}, {"given_name": "Olivier", "family_name": "Fercoq", "institution": "Telecom ParisTech"}, {"given_name": "Alexandre", "family_name": "Gramfort", "institution": "Telecom Paristech"}, {"given_name": "Joseph", "family_name": "Salmon", "institution": "T\u00e9l\u00e9com ParisTech"}]}