{"title": "A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups", "book": "Advances in Neural Information Processing Systems", "page_first": 2604, "page_last": 2612, "abstract": "We deal with the problem of variable selection when variables must be selected group-wise, with possibly overlapping groups defined a priori. In particular we propose a new optimization procedure for solving the regularized algorithm presented in Jacob et al. 09, where the group lasso penalty is generalized to overlapping groups of variables. While in Jacob et al. 09 the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and constrained Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows to deal with high dimensional problems without pre-processing to reduce the dimensionality of the data. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations.", "full_text": "A primal-dual algorithm for group sparse\nregularization with overlapping groups\n\nSo\ufb01a Mosci\n\nDISI- Universit`a di Genova\nmosci@disi.unige.it\n\nSilvia Villa\n\nDISI- Universit`a di Genova\nvilla@dima.unige.it\n\nAlessandro Verri\n\nDISI- Universit`a di Genova\nverri@disi.unige.it\n\nAbstract\n\nLorenzo Rosasco\n\nIIT - MIT\n\nlrosasco@MIT.EDU\n\nWe deal with the problem of variable selection when variables must be selected\ngroup-wise, with possibly overlapping groups de\ufb01ned a priori. In particular we\npropose a new optimization procedure for solving the regularized algorithm pre-\nsented in [12], where the group lasso penalty is generalized to overlapping groups\nof variables. 
While in [12] the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and a projected Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows us to deal with high dimensional problems without pre-processing for dimensionality reduction. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations.

1 Introduction

Sparsity has become a popular way to deal with small samples of high dimensional data and, in a broad sense, refers to the possibility of writing the solution in terms of a few building blocks. Often, sparsity based methods are the key towards finding interpretable models in real-world problems. In particular, regularization based on ℓ1 type penalties is a powerful approach for dealing with the problem of variable selection, since it provides sparse solutions by minimizing a convex functional. The success of ℓ1 regularization motivated the exploration of different kinds of sparsity properties for (generalized) linear models, exploiting available a priori information, which restricts the admissible sparsity patterns of the solution. An example of a sparsity pattern is when the input variables are partitioned into groups (known a priori), and the goal is to estimate a sparse model where variables belonging to the same group are either jointly selected or discarded.
This problem can be solved by regularizing with the group-ℓ1 penalty, also known as the group lasso penalty, which is the sum, over the groups, of the Euclidean norms of the coefficients restricted to each group.
A possible generalization of group lasso is to consider groups of variables which can potentially overlap, where the goal is to estimate a model whose support is a union of groups. This is a common situation in bioinformatics (especially in the context of high-throughput data such as gene expression and mass spectrometry data), where problems are characterized by a very low number of samples with several thousand variables. In fact, when the number of samples is not sufficient to guarantee accurate model estimation, an alternative is to take advantage of the huge amount of prior knowledge encoded in online databases such as the Gene Ontology. Largely motivated by applications in bioinformatics, a new type of penalty is proposed in [12], which is shown to give better performance than simple ℓ1 regularization.
A straightforward solution to the minimization problem underlying the method proposed in [12] is to apply state-of-the-art techniques for group lasso (we recall interior-point methods [3, 20], block coordinate descent [16], and proximal methods [9, 21], also known as forward-backward splitting algorithms, among others) in an expanded space, built by duplicating variables that belong to more than one group.
As already mentioned in [12], though very simple, such an implementation does not scale to large datasets when the groups have significant overlap, and a more scalable algorithm with no data duplication is needed. For this reason we propose an alternative optimization approach to solve the group lasso problem with overlap.
Our method does not require explicit replication of the features and is thus better suited to high dimensional problems with large groups overlap.
Our approach is based on a proximal method (see for example [18, 6, 5]), and two ad hoc results that allow us to efficiently compute the proximity operator in a much lower dimensional space: with Lemma 1 we identify the subset of active groups, whereas in Theorem 2 we formulate the reduced dual problem for computing the proximity operator, where the dual space dimensionality coincides with the number of active groups. The dual problem can then be solved via Bertsekas' projected Newton method [7]. We recall that a particular overlapping structure is the hierarchical structure, where the overlap between groups is limited to the inclusion of a descendant in its ancestors. In this case the CAP penalty [24] can be used for model selection, as has been done in [2, 13], but ancestors are forced to be selected when any of their descendants is selected. Thanks to the nested structure, the proximity operator of the penalty term can be computed exactly in a finite number of steps [14]. This is no longer possible in the case of general overlap. Finally it is worth noting that the penalty analyzed here can also be applied to hierarchical group lasso; differently from [2, 13], selection of ancestors is no longer enforced.
The paper is organized as follows. In Section 2 we recall the group lasso functional for overlapping groups and set some notation. In Section 3 we state the main results, present a new iterative optimization procedure, and discuss computational issues.
Finally in Section 4 we present some numerical experiments comparing the running time of our algorithm with state-of-the-art techniques. The proofs are reported in the Supplementary material.

2 Problem and Notations

We first fix some notation. Given a vector β ∈ R^d, ‖·‖ denotes the ℓ2-norm, and we will use the notation ‖β‖_G = (Σ_{j∈G} β_j²)^{1/2} to denote the ℓ2-norm of the components of β in G ⊂ {1, . . . , d}. Then, for any differentiable function f : R^B → R, we denote by ∂_r f its partial derivative with respect to variable r, and by ∇f = (∂_r f)_{r=1}^B its gradient.
We are now ready to cast group ℓ1 regularization with overlapping groups as the following variational problem. Given a training set {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n, a dictionary (ψ_j)_{j=1}^d, and B subsets of variables G = {G_r}_{r=1}^B with G_r ⊂ {1, . . . , d}, we assume the estimator to be described by a generalized linear model f(x) = Σ_{j=1}^d ψ_j(x) β_j and consider the following regularization scheme

    β* = argmin_{β ∈ R^d} E_τ(β) = argmin_{β ∈ R^d} { (1/n) ‖Ψβ − y‖² + 2τ Ω^G_overlap(β) },    (1)

where Ψ is the n × d matrix given by the features ψ_j in the dictionary evaluated at the training set points, [Ψ]_{i,j} = ψ_j(x_i). The term (1/n) ‖Ψβ − y‖² is the empirical error (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i), when the cost function¹ ℓ : R × Y → R₊ is the square loss, ℓ(f(x), y) = (y − f(x))².
The penalty term Ω^G_overlap : R^d → R₊ is lower semicontinuous, convex, and one-homogeneous (Ω^G_overlap(λβ) = λ Ω^G_overlap(β) for all β ∈ R^d and λ ∈ R₊), and is defined as

    Ω^G_overlap(β) = inf { Σ_{r=1}^B ‖v_r‖ : (v_1, . . . , v_B), v_r ∈ R^d, supp(v_r) ⊂ G_r, Σ_{r=1}^B v_r = β }.

¹Note our analysis would immediately apply to other loss functions, e.g. the logistic loss.

The functional Ω^G_overlap was introduced in [12] as a generalization of the group lasso penalty to allow overlapping groups, while maintaining the group lasso property of enforcing sparse solutions whose support is a union of groups. When groups do not overlap, Ω^G_overlap reduces to the group lasso penalty. Note that, as pointed out in [12], using Σ_{r=1}^B ‖β‖_{G_r} as a generalization of the group lasso penalty leads to a solution whose support is the complement of the union of groups. For an extensive study of the properties of Ω^G_overlap, its comparison with the ℓ1 norm, and its extension to graph lasso, we therefore refer the interested reader to [12].

3 The GLO-pridu Algorithm

If one needs to solve problem (1) for high dimensional data, the use of standard second-order methods such as interior-point methods is precluded (see for instance [6]), since they need to solve large systems of linear equations to compute the Newton steps.
On the other hand, first order methods inspired by Nesterov's seminal paper [19] (see also [18]) and based on proximal methods have already proved to be a computationally efficient alternative in many machine learning applications [9, 21].

3.1 A Proximal algorithm

Given the convex functional E_τ in (1), which is the sum of a differentiable term, namely (1/n) ‖Ψβ − y‖², and a non-differentiable one-homogeneous term 2τ Ω^G_overlap, its minimum can be computed with the following acceleration of the iterative forward-backward splitting scheme

    β^p = (I − π_{(τ/σ)K}) ( h^p − (1/(nσ)) Ψ^T (Ψ h^p − y) )
    c_p = (1 − t_p) c_{p−1},    t_{p+1} = ( −c_p + √(c_p² + 8 c_p) ) / 4    (2)
    h^{p+1} = β^p (1 − t_{p+1} + t_{p+1}/t_p) + β^{p−1} (t_p − 1) t_{p+1}/t_p

for a suitable choice of σ. Due to the one-homogeneity of Ω^G_overlap, the proximity operator associated to (τ/σ) Ω^G_overlap reduces to the identity minus the projection onto the subdifferential of (τ/σ) Ω^G_overlap at the origin, which is a closed and convex set. We will denote such a projection as π_{(τ/σ)K}, where K = ∂Ω^G_overlap(0).
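As an illustration, the accelerated iteration above can be sketched in a few lines of Python. This is not the authors' Matlab implementation: for simplicity it uses the standard FISTA momentum update t_{p+1} = (1 + √(1 + 4t_p²))/2 from [5], to which the scheme is equivalent, and leaves the projection as a user-supplied black box:

```python
import numpy as np

def accelerated_fb(Psi, y, tau, proj, max_iter=100):
    """Sketch of the accelerated forward-backward scheme.

    `proj(w, radius)` is assumed to return the projection of w onto
    radius*K, with K the subdifferential of the penalty at the origin;
    the proximity operator is then "identity minus projection".
    """
    n, d = Psi.shape
    sigma = np.linalg.norm(Psi.T @ Psi, 2) / n       # sigma = ||Psi^T Psi|| / n
    beta = np.zeros(d)
    h = beta.copy()
    t = 1.0
    for _ in range(max_iter):
        w = h - Psi.T @ (Psi @ h - y) / (n * sigma)  # gradient step on the data term
        beta_prev, beta = beta, w - proj(w, tau / sigma)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2    # standard FISTA momentum
        h = beta + ((t - 1) / t_next) * (beta - beta_prev)
        t = t_next
    return beta

# Stand-in penalty for a sanity check: plain lasso, where K is the
# l_inf unit ball and the prox reduces to soft-thresholding (this is
# not the overlapping-groups case treated in the paper).
linf_proj = lambda w, r: np.clip(w, -r, r)
```

With Ψ = I the iteration reduces to soft-thresholding of y at level nτ, which offers a quick check of the prox-as-identity-minus-projection identity.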
The above scheme is inspired by [10], and is equivalent to the algorithm named FISTA [5], whose convergence is guaranteed, as recalled in the following theorem.

Theorem 1 Given β⁰ ∈ R^d, and σ = ‖Ψ^T Ψ‖/n, let h¹ = β⁰ and t₁ = 1, c₀ = 1; then there exists a constant C₀ such that the iterative update (2) satisfies

    E_τ(β^p) − E_τ(β*) ≤ C₀ / p².    (3)

As it happens for other accelerations of the basic forward-backward splitting algorithm such as [19, 6, 4], convergence of the sequence β^p is no longer guaranteed unless strong convexity is assumed. However, sacrificing theoretical convergence for speed may be mandatory in large scale applications. Furthermore, there is strong empirical evidence that β^p is indeed convergent (see Section 4).

3.2 The projection

Note that the proximity operator of the penalty Ω^G_overlap does not admit a closed form and must be computed approximately. In fact the projection onto the convex set

    K = ∂Ω^G_overlap(0) = {v ∈ R^d : ‖v‖_{G_r} ≤ 1 for r = 1, . . . , B}

cannot be decomposed group-wise, as in standard group ℓ1 regularization, whose proximity operator reduces to a group-wise soft-thresholding operator (see Eq. (9) later). Nonetheless, the following lemma shows that, when evaluating the projection π_K, we can restrict ourselves to a subset of B̂ = |Ĝ| ≤ B active groups. This equivalence is crucial for speeding up the algorithm: in fact B̂ is the number of selected groups, which is small if one is interested in sparse solutions.

Lemma 1 Given β ∈ R^d, G = {G_r}_{r=1}^B with G_r ⊂ {1, . . . , d}, and τ > 0, the projection onto the convex set τK with K = {v ∈ R^d : ‖v‖_{G_r} ≤ 1 for r = 1, . . . , B} is given by

    Minimize    ‖v − β‖²
    subject to  v ∈ R^d, ‖v‖_G ≤ τ for G ∈ Ĝ,    (4)

where Ĝ := {G ∈ G : ‖β‖_G > τ}.

The proof (given in the supplementary material) is based on the fact that the convex set τK is the intersection of cylinders that are all centered on a coordinate subspace. Since B̂ is typically much smaller than d, it is convenient to solve the dual problem associated to (4).

Theorem 2 Given β ∈ R^d, {G_r}_{r=1}^B with G_r ⊂ {1, . . . , d}, and τ > 0, the projection onto the convex set τK with K = {v ∈ R^d : ‖v‖_{G_r} ≤ 1 for r = 1, . . . , B} is given by

    [π_{τK}(β)]_j = β_j / (1 + Σ_{r=1}^{B̂} λ*_r 1_{r,j})    for j = 1, . . . , d,    (5)

where λ* is the solution of

    λ* = argmax_{λ ∈ R₊^{B̂}} f(λ),   with   f(λ) := Σ_{j=1}^d ( −β_j² / (1 + Σ_{r=1}^{B̂} 1_{r,j} λ_r) ) − Σ_{r=1}^{B̂} λ_r τ²,    (6)

Ĝ = {G ∈ G : ‖β‖_G > τ} := {Ĝ₁, . . . , Ĝ_{B̂}}, and 1_{r,j} is 1 if j belongs to group Ĝ_r and 0 otherwise.

Equation (6) is the dual problem associated to (4), and, since strong duality holds, the minimum of (4) is equal to the maximum of the dual problem, which can be efficiently solved via Bertsekas' projected Newton method described in [7], and here reported as Algorithm 1.

Algorithm 1 Projection
    Given: β ∈ R^d, λ_init ∈ R^{B̂}, η ∈ (0, 1), δ ∈ (0, 1/2), ε > 0
    Initialize: q = 0, λ⁰ = λ_init
    while (∂_r f(λ^q) > 0 if λ^q_r = 0, or |∂_r f(λ^q)| > ε if λ^q_r > 0, for r = 1, . . . , B̂) do
        q := q + 1
        ε_q = min{ε, ‖λ^q − [λ^q − ∇f(λ^q)]₊‖}
        I^q₊ = {r such that 0 ≤ λ^q_r ≤ ε_q, ∂_r f(λ^q) > 0}
        H_{r,s} = 0 if r ≠ s, and r ∈ I^q₊ or s ∈ I^q₊; H_{r,s} = ∂_r ∂_s f(λ^q) otherwise    (7)
        λ(α) = [λ^q − α (H^q)^{−1} ∇f(λ^q)]₊
        m = 0
        while f(λ^q) − f(λ(η^m)) ≥ δ [ η^m Σ_{r∉I^q₊} ∂_r f(λ^q) + Σ_{r∈I^q₊} ∂_r f(λ^q) (λ^q_r − λ_r(η^m)) ] do
            m := m + 1
        end while
        λ^{q+1} = λ(η^m)
    end while
    return λ^{q+1}

Bertsekas' iterative scheme combines the basic simplicity of the steepest descent iteration [22] with the quadratic convergence of the projected Newton's method [8]. It does not involve the solution of a quadratic program, thereby avoiding the associated computational overhead.

3.3 Computing the regularization path

In Algorithm 2 we report the complete Group Lasso with Overlap primal-dual (GLO-pridu) scheme for computing the regularization path, i.e. the set of solutions corresponding to different values of the regularization parameter τ₁ > . . . > τ_T, for problem (1). Note that we employ the continuation strategy proposed in [11]. A similar warm starting is applied to the inner iteration, where at the p-th step λ_init is determined by the solution of the (p−1)-th projection.
Such an initialization empirically proved to guarantee convergence, despite the local nature of Bertsekas' scheme.

Algorithm 2 GLO-pridu regularization path
    Given: τ₁ > τ₂ > · · · > τ_T, G, η ∈ (0, 1), δ ∈ (0, 1/2), ε₀ > 0, ν > 0
    Let: σ = ‖Ψ^T Ψ‖/n
    Initialize: β(τ₀) = 0
    for t = 1, . . . , T do
        Initialize: β⁰ = β(τ_{t−1}), λ*₀ = 0
        while ‖β^p − β^{p−1}‖ > ν ‖β^{p−1}‖ do
            • w = h^p − (nσ)^{−1} Ψ^T (Ψ h^p − y)
            • Find Ĝ = {G ∈ G : ‖w‖_G ≥ τ}
            • Compute λ*_p via Algorithm 1 with groups Ĝ, initialization λ*_{p−1} and tolerance ε₀ p^{−3/2}
            • Compute β^p as β^p_j = w_j (1 + Σ_{r=1}^{B̂} λ*_{p,r} 1_{r,j})^{−1} for j = 1, . . . , d, see Equation (5)
            • Update c_p, t_p, and h^p as in (2)
        end while
        β(τ_t) = β^p
    end for
    return β(τ₁), . . . , β(τ_T)

3.4 The replicates formulation

An alternative way to solve the optimization problem (1) is proposed by [12], where the authors show that problem (1) is equivalent to standard group ℓ1 regularization (without overlap) in an expanded space built by replicating variables belonging to more than one group:

    β̃* ∈ argmin_{β̃ ∈ R^{d̃}} { (1/n) ‖Ψ̃β̃ − y‖² + 2τ Σ_{r=1}^B ‖β̃‖_{G̃_r} },    (8)

where Ψ̃ is the matrix built by concatenating copies of Ψ restricted each to a certain group, i.e. (Ψ̃_j)_{j∈G̃_r} = (Ψ_j)_{j∈G_r}, where {G̃₁, . . . , G̃_B} = {[1, . . . , |G₁|], [1 + |G₁|, . . . , |G₁| + |G₂|], . . . , [d̃ − |G_B| + 1, . . . , d̃]}, and d̃ = Σ_{r=1}^B |G_r| is the total number of variables obtained after including the replicates. One can then reconstruct β* from β̃* as β* = Σ_{r=1}^B φ_{G_r}(β̃*), where φ_{G_r} : R^{d̃} → R^d maps β̃ to v ∈ R^d such that supp(v) ⊂ G_r and (v_j)_{j∈G_r} = (β̃_j)_{j∈G̃_r}, for r = 1, . . . , B. The main advantage of the above formulation relies on the possibility of using any state-of-the-art optimization procedure for group lasso. In terms of proximal methods, a possible solution is given by Algorithm 3, where S_{τ/σ} is the proximity operator of the new penalty, and can be computed exactly as

    [S_{τ/σ}(β̃)]_j = ( ‖β̃‖_{G̃_r} − τ/σ )₊ β̃_j / ‖β̃‖_{G̃_r},    for j ∈ G̃_r, for r = 1, . . . , B.    (9)

Algorithm 3 GL-prox
    Given: β̃⁰ ∈ R^{d̃}, τ > 0, σ = ‖Ψ̃^T Ψ̃‖/n
    Initialize: p = 0, h̃¹ = β̃⁰, t₁ = 1
    while convergence not reached do
        p := p + 1
        β̃^p = S_{τ/σ} ( h̃^p − (nσ)^{−1} Ψ̃^T (Ψ̃ h̃^p − y) )
        c_p = (1 − t_p) c_{p−1},    t_{p+1} = ( −c_p + √(c_p² + 8 c_p) ) / 4    (10)
        h̃^{p+1} = β̃^p (1 − t_{p+1} + t_{p+1}/t_p) + β̃^{p−1} (t_p − 1) t_{p+1}/t_p
    end while
    return β̃^p

Note that in principle, by applying Lemma 1, the group-soft-thresholding operator in (9) can be computed only on the active groups.
In practice this does not yield any advantage, since the identification of the active groups has the same computational cost as the thresholding itself.

3.5 Computational issues

For both GL-prox and GLO-pridu, the complexity of one iteration is the sum of the complexity of computing the gradient of the data term and the complexity of computing the proximity operator of the penalty term. The former has complexity O(dn) and O(d̃n) for GLO-pridu and GL-prox, respectively, for the case n < d. One should then add, at each iteration, the cost of performing the projection onto K. This can be neglected for the case of replicated variables. On the other hand, the time complexity of one iteration of Algorithm 1 is driven by the number of active groups B̂. This number is typically small when looking for sparse solutions. The complexity is thus given by the sum of the complexity of evaluating the inverse of the B̂ × B̂ matrix H, O(B̂³), and the complexity of performing the product H^{−1}∇f(λ), O(B̂²). The worst case complexity would then be O(B̂³). Nevertheless, in practice the complexity is much lower because matrix H is highly sparse. In fact, Equation (7) tells us that the part of matrix H corresponding to the active set I₊ is diagonal. As a consequence, if B̂ = B̂₋ + B̂₊, where B̂₋ is the number of non active constraints and B̂₊ is the number of active constraints, then the complexity of inverting matrix H is at most O(B̂₊) + O(B̂₋³).
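The sparsity pattern just described can be made explicit with a small Python sketch (a hypothetical helper, not the paper's code) that assembles the Hessian of the dual f in (6), whose entries are ∂_r∂_s f(λ) = −2 Σ_{j∈Ĝ_r∩Ĝ_s} β_j²/(1 + s_j)³, and zeroes the off-diagonal entries indexed by I₊ as in (7):

```python
import numpy as np

def dual_hessian(beta, active_groups, lam, I_plus=()):
    """Hessian of the dual f in (6), with rows/columns in I_plus
    reduced to their diagonal as prescribed by (7)."""
    B = len(active_groups)
    ones = np.zeros((B, beta.size))
    for r, G in enumerate(active_groups):
        ones[r, np.asarray(G)] = 1.0
    s = 1.0 + ones.T @ lam                  # s_j = 1 + sum_r 1_{r,j} lam_r
    w = -2.0 * beta**2 / s**3
    H = ones @ np.diag(w) @ ones.T          # H[r,s] = sum_j 1_{r,j} 1_{s,j} w_j
    for r in I_plus:                        # diagonalize the I_plus block, as in (7)
        d = H[r, r]
        H[r, :] = 0.0
        H[:, r] = 0.0
        H[r, r] = d
    return H
```

In particular H[r, s] vanishes whenever groups Ĝ_r and Ĝ_s do not overlap, which is the source of the sparsity exploited when inverting H.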
Furthermore the B̂₋ × B̂₋ non-diagonal part of matrix H is highly sparse, since H_{r,s} = 0 if Ĝ_r ∩ Ĝ_s = ∅, and the complexity of inverting it is in practice much lower than O(B̂₋³). The worst case complexity for computing the projection onto K is thus O(q·B̂₊) + O(q·B̂₋³), where q is the number of iterations necessary to reach convergence. Note that even if, in order to guarantee convergence, the tolerance for evaluating convergence of the inner iteration must decrease with the number of external iterations, in practice, thanks to warm starting, we observed that q is rarely greater than 10 in the experiments presented here.
Concerning the number of iterations required to reach convergence for GL-prox in the replicates formulation, we empirically observed that it requires a much higher number of iterations than GLO-pridu (see Table 3). We argue that such behavior is due to the combination of two occurrences: 1) the local condition number of matrix Ψ̃ is 0 even if Ψ is locally well conditioned; 2) the decomposition of β* as β̃* is possibly not unique, which is required in order to have a unique solution for (8). The former is due to the presence of replicated columns in Ψ̃. In fact, since E_τ is convex but not necessarily strictly convex (as when n < d), uniqueness and convergence are not always guaranteed unless some further assumption is imposed. Most convergence results relative to ℓ1 regularization link uniqueness of the solution, as well as the rate of convergence of the Soft Thresholding Iteration, to some measure of local conditioning of the Hessian of the differentiable part of E_τ (see for instance Proposition 4.1 in [11], where the Hessian restricted to the set of relevant variables is required to be full rank).
In our case the Hessian for GL-prox is simply H̃ = (1/n) Ψ̃^T Ψ̃, so that, if the relevant groups have non-null intersection, then H̃ restricted to the set of relevant variables is by no means full rank. Concerning the latter argument, we must say that in many real world problems, such as bioinformatics, one cannot easily verify that the solution indeed has a unique decomposition. In fact, we can think of trivial examples where the replicates formulation does not have a unique solution.

4 Numerical Experiments

In this section we present numerical experiments aimed at comparing the running time performance of GLO-pridu with state-of-the-art algorithms. To ensure a fair comparison, we first run some preliminary experiments to identify the fastest codes for group ℓ1 regularization with no overlap. We refer to [6] for an extensive empirical and theoretical comparison of different optimization procedures for solving ℓ1 regularization. Further empirical comparisons can be found in [15].

4.1 Comparison of different implementations for standard group lasso

We considered three algorithms which are representative of the optimization techniques used to solve group lasso: interior-point methods, (group) coordinate descent and its variations, and proximal methods. As an instance of the first set of techniques we employed the publicly available Matlab code at http://www.di.ens.fr/~fbach/grouplasso/index.htm described in [1]. For coordinate descent methods, we employed the R package grplasso, which implements block coordinate gradient descent minimization for a set of possible loss functions. In the following we will refer to these two algorithms as "GL-IP" and "GL-BCGD". Finally we use our
Matlab implementation of Algorithm GL-prox as an instance of proximal methods.
We first observe that the solutions of the three algorithms coincide up to an error which depends on each algorithm's tolerance. We thus need to tune each tolerance in order to guarantee that all iterative algorithms are stopped when the level of approximation to the true solution is the same.

Table 1: Running time (mean and standard deviation) in seconds for computing the entire regularization path of GL-IP, GL-BCGD, and GL-prox for different values of B and n. For B = 1000, GL-IP could not be computed due to memory reasons.

    n = 100       B = 10            B = 100        B = 1000
    GL-IP         5.6 ± 0.6         60 ± 90        –
    GL-BCGD       2.1 ± 0.6         2.8 ± 0.6      14.4 ± 1.5
    GL-prox       0.21 ± 0.04       2.9 ± 0.4      183 ± 19

    n = 500       B = 10            B = 100        B = 1000
    GL-IP         2.30 ± 0.27       370 ± 30       –
    GL-BCGD       2.15 ± 0.16       4.7 ± 0.5      16.5 ± 1.2
    GL-prox       0.1514 ± 0.0025   2.54 ± 0.16    109 ± 6

    n = 1000      B = 10            B = 100        B = 1000
    GL-IP         1.92 ± 0.25       328 ± 22       –
    GL-BCGD       2.06 ± 0.26       18 ± 3         20.6 ± 2.2
    GL-prox       0.182 ± 0.006     4.7 ± 0.5      112 ± 6

Toward this end, we run Algorithm GL-prox with machine precision, ν = 10⁻¹⁶, in order to have a good approximation of the asymptotic solution. We observe that for many values of n and d, and over a large range of values of τ, the approximation of GL-prox with ν = 10⁻⁶ is of the same order as the approximation of GL-IP with optparam.tol = 10⁻⁹, and of GL-BCGD with tol = 10⁻¹². Note also that with these tolerances the three solutions coincide also in terms of selection, i.e. their supports are identical for each value of τ.
Therefore the following results correspond to optparam.tol = 10⁻⁹ for GL-IP, tol = 10⁻¹² for GL-BCGD, and ν = 10⁻⁶ for GL-prox. For the other parameters of GL-IP we used the values used in the demos supplied with the code.
Concerning the data generation protocol, the input variables x = (x₁, . . . , x_d) are uniformly drawn from [−1, 1]^d. The labels y are computed using a noise-corrupted linear regression function, i.e. y = β·x + w, where β depends on the first 30 variables (β_j = 1 if j = 1, . . . , 30, and 0 otherwise), w is additive Gaussian white noise, and the signal to noise ratio is 5:1. In this case the dictionary coincides with the variables, ψ_j(x) = x_j for j = 1, . . . , d. We then evaluate the entire regularization path for the three algorithms with B sequential groups of 10 variables (G₁ = [1, . . . , 10], G₂ = [11, . . . , 20], and so on), for different values of n and B. In order to make sure that we are working on the correct range of values for the parameter τ, we first evaluate the set of solutions of GL-prox corresponding to a large range of 500 values of τ, with ν = 10⁻⁴. We then determine the smallest value of τ which corresponds to selecting less than n variables, τ_min, and the smallest one returning the null solution, τ_max. Finally we build the geometric series of 50 values between τ_min and τ_max, and use it to evaluate the regularization path with the three algorithms. In order to obtain robust estimates of the running times, we repeat this 20 times for each pair n, B.
In Table 1 we report the computational times required to evaluate the entire regularization path for the three algorithms. Algorithms GL-BCGD and GL-prox are always faster than GL-IP which, due to memory reasons, cannot be applied to problems with more than 5000 variables, since it requires storing the d × d matrix Ψ^T Ψ.
It must be said that the code for GL-IP was made available mainly in order to allow reproducibility of the results presented in [1], and is not optimized in terms of time and memory occupation. However it is well known that standard second-order methods are typically precluded on large data sets, since they need to solve large systems of linear equations to compute the Newton steps. GL-BCGD is the fastest for B = 1000, whereas GL-prox is the fastest for B = 10, 100. The candidates as benchmark algorithms for comparison with GLO-pridu are GL-prox and GL-BCGD. Nevertheless we observed that, when the input data matrix contains a significant fraction of replicated columns, the latter does not provide sparse solutions. We therefore compare GLO-pridu with GL-prox only.

4.1.1 Projection vs duplication

The data generation protocol is equal to the one described in the previous experiments, but β depends on the first 12/5·b variables (which correspond to the first three groups):

    β = ( c, . . . , c, 0, 0, . . . , 0 ),

with c repeated b·12/5 times and 0 repeated d − b·12/5 times. We then define B groups of size b, so that d̃ = B·b > d. The first three groups correspond to the subset of relevant variables, and are defined as G₁ = [1, . . . , b], G₂ = [4/5·b + 1, . . . , 9/5·b], and G₃ = [1, . . . , b/5, 8/5·b + 1, . . . , 12/5·b], so that they have a 20% pair-wise overlap. The remaining B − 3 groups are built by randomly drawing sets of b indexes from [1, d]. In the following we will let n = 10 |G₁ ∪ G₂ ∪ G₃|, i.e. n is ten times the number of relevant variables, and vary d, b. We also vary the number of groups B, so that the dimension of the expanded space is α times the input dimension, d̃ = αd, with α = 1.2, 2, 5. Clearly this amounts to taking B = α·d/b.
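The construction of the three relevant overlapping groups can be sketched as follows (0-based indices in place of the paper's 1-based ranges; b is assumed to be a multiple of 5):

```python
import numpy as np

def overlapping_groups(b):
    """The three relevant groups of size b with 20% pairwise overlap."""
    G1 = np.arange(0, b)                                       # [1, ..., b]
    G2 = np.arange(4 * b // 5, 9 * b // 5)                     # [4/5 b + 1, ..., 9/5 b]
    G3 = np.concatenate([np.arange(0, b // 5),                 # [1, ..., b/5] together
                         np.arange(8 * b // 5, 12 * b // 5)])  # with [8/5 b + 1, ..., 12/5 b]
    return G1, G2, G3
```

Each pair of groups shares b/5 indices (a 20% overlap), and the union covers the 12/5·b relevant variables.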
The parameter α can be thought of as the average number of groups a single variable belongs to. We identify the correct range of values for τ as in the previous experiments, using GLO-pridu with a loose tolerance, and then evaluate the running time and the number of iterations necessary to compute the entire regularization path for GL-prox on the expanded space and for GLO-pridu, both with ν = 10⁻⁶. Finally we repeat this 20 times for each combination of the three parameters d, b, and α.

Table 2: Running time (mean ± standard deviation) in seconds for b = 10 (top) and b = 100 (below). For each d and α, the left and right entries correspond to GLO-pridu and GL-prox, respectively.

    b = 10       α = 1.2                     α = 2                       α = 5
    d = 1000     0.15 ± 0.04 | 0.20 ± 0.09   1.6 ± 0.9   | 5.1 ± 2.0     12.4 ± 1.3 | 68 ± 8
    d = 5000     1.1 ± 0.4   | 1.0 ± 0.6     1.55 ± 0.29 | 2.4 ± 0.7     103 ± 12   | 790 ± 57
    d = 10000    2.1 ± 0.7   | 2.1 ± 1.4     3.0 ± 0.6   | 4.5 ± 1.4     460 ± 110  | 2900 ± 400

    b = 100      α = 1.2                     α = 2                       α = 5
    d = 1000     11.7 ± 0.4  | 24.1 ± 2.5    11.6 ± 0.4  | 42 ± 4        13.5 ± 0.7 | 1467 ± 13
    d = 5000     31 ± 13     | 38 ± 15       90 ± 5      | 335 ± 21      85 ± 3     | 1110 ± 80
    d = 10000    16.6 ± 2.1  | 13 ± 3        90 ± 30     | 270 ± 120     296 ± 16   | –

Table 3: Number of iterations (mean ± standard deviation) for b = 10 (top) and b = 100 (below). For each d and α, the left and right entries correspond to GLO-pridu and GL-prox, respectively.

    b = 10       α = 1.2                α = 2                       α = 5
    d = 1000     100 ± 30 | 80 ± 30     1200 ± 500 | 1900 ± 800     2150 ± 160   | 11000 ± 1300
    d = 5000     100 ± 40 | 70 ± 30     148 ± 25   | 139 ± 24       6600 ± 500   | 27000 ± 2000
    d = 10000    100 ± 30 | 70 ± 40     160 ± 30   | 137 ± 26       13300 ± 1900 | 49000 ± 6000

    b = 100      α = 1.2                 α = 2                      α = 5
    d = 1000     913 ± 12  | 2160 ± 210  894 ± 11   | 2700 ± 300    895 ± 10  | 4200 ± 400
    d = 5000     600 ± 400 | 600 ± 300   1860 ± 110 | 4590 ± 290    1320 ± 30 | 6800 ± 500
    d = 10000    81 ± 11   | 63 ± 11     1000 ± 500 | 1800 ± 900    2100 ± 60 | –

Running times and numbers of iterations are reported in Tables 2 and 3, respectively. When the degree of overlap α is low, the computational times of GL-prox and GLO-pridu are comparable. As α increases, there is a clear advantage in using GLO-pridu instead of GL-prox. The same behavior occurs for the number of iterations.

5 Discussion

We have presented an efficient optimization procedure for computing the solution of group lasso with overlapping groups of variables, which allows dealing with high dimensional problems with large groups overlap. We have empirically shown that our procedure has a great computational advantage with respect to state-of-the-art algorithms for group lasso applied on the expanded space built by replicating variables belonging to more than one group. We also mention that computational performance may improve if our scheme is used as the core of the optimization step of active set methods, such as [23]. Finally, as shown in [17], the improved computational performance enables the use of group ℓ1 regularization with overlap for pathway analysis of high-throughput biomedical data, since it can be applied to the entire data set using all the information present in online databases, without pre-processing for dimensionality reduction.

References
[1] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[2] F. Bach.
High-dimensional non-linear variable selection through hierarchical kernel learning. Technical Report HAL 00413473, INRIA, 2009.

[3] F. R. Bach, G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, volume 69 of ACM International Conference Proceeding Series, 2004.

[4] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, 2009.

[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

[6] S. Becker, J. Bobin, and E. Candès. NESTA: A fast and accurate first-order method for sparse recovery, 2009.

[7] D. Bertsekas. Projected Newton methods for optimization problems with simple constraints. SIAM Journal on Control and Optimization, 20(2):221–246, 1982.

[8] R. Brayton and J. Cullum. An algorithm for minimizing a differentiable function subject to. J. Opt. Th. Appl., 29:521–558, 1979.

[9] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, December 2009.

[10] O. Güler. New proximal point algorithm for convex minimization. SIAM J. on Optimization, 2(4):649–664, 1992.

[11] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM J. on Optimization, 19(3):1107–1130, 2008.

[12] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML, page 55, 2009.

[13] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, INRIA, 2009.

[14] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning.
In Proceedings of ICML, 2010.

[15] I. Loris. On the performance of algorithms for the minimization of ℓ1-penalized functionals. Inverse Problems, 25(3):035008, 2009.

[16] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Statist. Soc. B, 70:53–71, 2008.

[17] S. Mosci, S. Villa, A. Verri, and L. Rosasco. A fast algorithm for structured gene selection. Presented at MLSB 2010, Edinburgh.

[18] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269(3):543–547, 1983.

[19] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Prog. Series A, 103(1):127–152, 2005.

[20] M. Y. Park and T. Hastie. L1-regularization path algorithm for generalized linear models. J. R. Statist. Soc. B, 69:659–677, 2007.

[21] L. Rosasco, S. Mosci, M. Santoro, A. Verri, and S. Villa. Iterative projection methods for structured sparsity regularization. Technical Report MIT-CSAIL-TR-2009-050, MIT, 2009.

[22] J. Rosen. The gradient projection method for nonlinear programming, part I: linear constraints. J. Soc. Ind. Appl. Math., 8:181–217, 1960.

[23] V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th ICML, 2008.

[24] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.