{"title": "Efficient Methods for Overlapping Group Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 352, "page_last": 360, "abstract": "The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms.", "full_text": "Ef\ufb01cient Methods for Overlapping Group Lasso\n\nLei Yuan\n\nArizona State University\n\nTempe, AZ, 85287\n\nLei.Yuan@asu.edu\n\nJun Liu\n\nArizona State University\n\nTempe, AZ, 85287\nj.liu@asu.edu\n\nJieping Ye\n\nArizona State University\n\nTempe, AZ, 85287\n\njieping.ye@asu.edu\n\nAbstract\n\nThe group Lasso is an extension of the Lasso for feature selection on (prede\ufb01ned)\nnon-overlapping groups of features. The non-overlapping group structure limits\nits applicability in practice. There have been several recent attempts to study a\nmore general formulation, where groups of features are given, potentially with\noverlaps between the groups. 
The resulting optimization is, however, much more\nchallenging to solve due to the group overlaps. In this paper, we consider the ef\ufb01-\ncient optimization of the overlapping group Lasso penalized problem. We reveal\nseveral key properties of the proximal operator associated with the overlapping\ngroup Lasso, and compute the proximal operator by solving the smooth and con-\nvex dual problem, which allows the use of the gradient descent type of algorithms\nfor the optimization. We have performed empirical evaluations using both syn-\nthetic and the breast cancer gene expression data set, which consists of 8,141\ngenes organized into (overlapping) gene sets. Experimental results show that the\nproposed algorithm is more ef\ufb01cient than existing state-of-the-art algorithms.\n\nIntroduction\n\n1\nProblems with high dimensionality have become common over the recent years. The high dimen-\nsionality poses signi\ufb01cant challenges in building interpretable models with high prediction accuracy.\nRegularization has been commonly employed to obtain more stable and interpretable models. A\nwell-known example is the penalization of the \u21131 norm of the estimator, known as Lasso [25]. The\n\u21131 norm regularization has achieved great success in many applications. However, in some appli-\ncations [28], we are interested in \ufb01nding important explanatory factors in predicting the response\nvariable, where each explanatory factor is represented by a group of input features. In such cases,\nthe selection of important features corresponds to the selection of groups of features. As an exten-\nsion of Lasso, group Lasso [28] based on the combination of the \u21131 norm and the \u21132 norm has been\nproposed for group feature selection, and quite a few ef\ufb01cient algorithms [16, 17, 19] have been\nproposed for ef\ufb01cient optimization. However, the non-overlapping group structure in group Lasso\nlimits its applicability in practice. 
For example, in microarray gene expression data analysis, genes may form overlapping groups, as each gene may participate in multiple pathways [12].

Several recent works [3, 12, 15, 18, 29] study the overlapping group Lasso, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. When optimizing the overlapping group Lasso problem, one can reformulate it as a second-order cone program and solve it by a generic toolbox, which, however, does not scale well. Jenatton et al. [13] proposed an alternating algorithm called SLasso for solving the equivalent reformulation. However, SLasso involves an expensive matrix inversion at each alternating iteration, and there is no known global convergence rate for such an alternating procedure. A reformulation [5] was also proposed such that the original problem can be solved by the Alternating Direction Method of Multipliers (ADMM), which involves solving a linear system at each iteration and may not scale well to high-dimensional problems. Argyriou et al. [1] adopted the proximal gradient method for solving the overlapping group Lasso, and a fixed-point method was developed to compute the proximal operator. Chen et al. [6] employed a smoothing technique to solve the overlapping group Lasso problem. Mairal [18] proposed to solve the proximal operator associated with the overlapping group Lasso defined as the sum of the ℓ∞ norms, which, however, is not applicable to the overlapping group Lasso defined as the sum of the ℓ2 norms considered in this paper.

In this paper, we develop an efficient algorithm for the overlapping group Lasso penalized problem via the accelerated gradient descent method. 
The accelerated gradient descent method has recently received increasing attention in machine learning due to its fast convergence rate even for non-smooth convex problems. One of the key operations is the computation of the proximal operator associated with the penalty. We reveal several key properties of the proximal operator associated with the overlapping group Lasso penalty, and propose several reformulations that can be solved efficiently. The main contributions of this paper include: (1) we develop a low-cost pre-processing procedure to identify (and then remove) zero groups in the proximal operator, which dramatically reduces the size of the problem to be solved; (2) we propose one dual formulation and two proximal splitting formulations for the proximal operator; (3) for the dual formulation, we further derive the duality gap, which can be used to check the quality of the solution and determine the convergence of the algorithm. We have performed empirical evaluations using both synthetic data and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results demonstrate the efficiency of the proposed algorithm in comparison with existing state-of-the-art algorithms.

Notations: ‖·‖ denotes the Euclidean norm, and 0 denotes a vector of zeros. SGN(·) and sgn(·) are defined in a component-wise fashion as: 1) if t = 0, then SGN(t) = [−1, 1] and sgn(t) = 0; 2) if t > 0, then SGN(t) = {1} and sgn(t) = 1; and 3) if t < 0, then SGN(t) = {−1} and sgn(t) = −1. G_i ⊆ {1, 2, . . . 
, p} denotes an index set, and x_{G_i} denotes the sub-vector of x restricted to G_i.

2 The Overlapping Group Lasso

We consider the following overlapping group Lasso penalized problem:

min_{x ∈ R^p} f(x) = l(x) + φ^{λ1}_{λ2}(x),    (1)

where l(·) is a smooth convex loss function, e.g., the least squares loss,

φ^{λ1}_{λ2}(x) = λ1 ‖x‖_1 + λ2 Σ_{i=1}^g w_i ‖x_{G_i}‖    (2)

is the overlapping group Lasso penalty, λ1 ≥ 0 and λ2 ≥ 0 are regularization parameters, w_i > 0, i = 1, 2, . . . , g, G_i ⊆ {1, 2, . . . , p} contains the indices corresponding to the i-th group of features, and ‖·‖ denotes the Euclidean norm. Note that the first term in (2) can be absorbed into the second term, which, however, would introduce p additional groups. The g groups of features are pre-specified, and they may overlap. The penalty in (2) is a special case of the more general Composite Absolute Penalty (CAP) family [29]. When the groups are disjoint with λ1 = 0 and λ2 > 0, the model in (1) reduces to the group Lasso [28]. If λ1 > 0 and λ2 = 0, then the model in (1) reduces to the standard Lasso [25].

In this paper, we propose to make use of the accelerated gradient descent (AGD) [2, 21, 22] for solving (1), due to its fast convergence rate. The algorithm is called "FoGLasso", which stands for Fast overlapping Group Lasso. One of the key steps in the proposed FoGLasso algorithm is the computation of the proximal operator associated with the penalty in (2); we present an efficient algorithm for this computation in the next section.

In FoGLasso, we first construct a model for approximating f(·) at the point x as:

f_{L,x}(y) = [l(x) + ⟨l′(x), y − x⟩] + φ^{λ1}_{λ2}(y) + (L/2)‖y − x‖²,    (3)

where L > 0. 
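To make the accelerated scheme concrete, the following is a small Python sketch (not the authors' Matlab/C implementation) of one step built on the model above: form a search point, run a doubling line search on L, and take a proximal step. The helpers `l`, `grad_l`, and `prox` are assumed to be supplied by the caller; `prox(z, t)` stands in for the proximal operator of the penalty scaled by t.

```python
import numpy as np

def agd_step(x_prev, x_curr, alpha_old, alpha, L, l, grad_l, prox):
    """One accelerated-gradient step for f(x) = l(x) + phi(x).

    alpha_old, alpha play the roles of alpha_{i-2}, alpha_{i-1};
    prox(z, t) must return argmin_y 0.5*||y - z||^2 + t*phi(y).
    """
    # Search point: affine combination of the last two iterates.
    beta = (alpha_old - 1.0) / alpha
    s = x_curr + beta * (x_curr - x_prev)
    g = grad_l(s)
    # Line search: double L until the quadratic model upper-bounds l at the
    # new point (the penalty phi cancels on both sides of the test).
    while True:
        x_new = prox(s - g / L, 1.0 / L)
        model = l(s) + g @ (x_new - s) + 0.5 * L * np.dot(x_new - s, x_new - s)
        if l(x_new) <= model + 1e-12:
            break
        L *= 2.0
    alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
    return x_new, alpha_next, L
```

With the soft-thresholding operator as `prox`, this reduces to a standard FISTA-style iteration for the plain Lasso; FoGLasso instead uses the overlapping-group proximal operator developed in Section 3.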
The model f_{L,x}(y) consists of the first-order Taylor expansion of the smooth function l(·) at the point x, the non-smooth penalty φ^{λ1}_{λ2}(y), and a regularization term (L/2)‖y − x‖². Next, a sequence of approximate solutions {x_i} is computed as follows: x_{i+1} = argmin_y f_{L_i,s_i}(y), where the search point s_i is an affine combination of x_{i−1} and x_i as s_i = x_i + β_i(x_i − x_{i−1}), for a properly chosen coefficient β_i, and L_i is determined by the line search according to the Armijo-Goldstein rule so that L_i is appropriate for s_i, i.e., f(x_{i+1}) ≤ f_{L_i,s_i}(x_{i+1}). A key building block in FoGLasso is the minimization of (3), whose solution is known as the proximal operator [20]. The computation of the proximal operator is the main technical contribution of this paper. The pseudo-code of FoGLasso is summarized in Algorithm 1, where the proximal operator π(·) is defined in (4). In practice, we can terminate Algorithm 1 if the change of the function values corresponding to adjacent iterations is within a small value, say 10^{-5}.

Algorithm 1 The FoGLasso Algorithm
Input: L_0 > 0, x_0, k
Output: x_{k+1}
1: Initialize x_1 = x_0, α_{−1} = 0, α_0 = 1, and L = L_0.
2: for i = 1 to k do
3:   Set β_i = (α_{i−2} − 1)/α_{i−1}, s_i = x_i + β_i(x_i − x_{i−1})
4:   Find the smallest L = 2^j L_{i−1}, j = 0, 1, . . . such that f(x_{i+1}) ≤ f_{L,s_i}(x_{i+1}) holds, where x_{i+1} = π^{λ1/L}_{λ2/L}(s_i − (1/L) l′(s_i))
5:   Set L_i = L and α_{i+1} = (1 + √(1 + 4α_i²))/2
6: end for

3 The Associated Proximal Operator and Its Efficient Computation

The proximal operator associated with the overlapping group Lasso penalty is defined as follows:

π^{λ1}_{λ2}(v) = argmin_{x ∈ R^p} { g^{λ1}_{λ2}(x) ≡ (1/2)‖x − v‖² + φ^{λ1}_{λ2}(x) },    (4)

which is a special case of (1) obtained by setting l(x) = (1/2)‖x − v‖². It can be verified that the approximate solution x_{i+1} = argmin_y f_{L_i,s_i}(y) is given by x_{i+1} = π^{λ1/L_i}_{λ2/L_i}(s_i − (1/L_i) l′(s_i)). Recently, it has been shown in [14] that the efficient computation of the proximal operator is key to many sparse learning algorithms. Next, we focus on the efficient computation of π^{λ1}_{λ2}(v) in (4) for a given v. The rest of this section is organized as follows. In Section 3.1, we discuss some key properties of the proximal operator, based on which we propose a pre-processing technique that significantly reduces the size of the problem. We then propose to solve it via the dual formulation in Section 3.2, where the duality gap is also derived. Several alternative methods for solving the proximal operator via proximal splitting methods are discussed in Section 3.3.

3.1 Key Properties of the Proximal Operator

We first reveal several basic properties of the proximal operator π^{λ1}_{λ2}(·).

Lemma 1. Suppose that λ1, λ2 ≥ 0, and w_i > 0, for i = 1, 2, . . . , g. Let x* = π^{λ1}_{λ2}(v). The following holds: 1) if v_i > 0, then 0 ≤ x*_i ≤ v_i; 2) if v_i < 0, then v_i ≤ x*_i ≤ 0; 3) if v_i = 0, then x*_i = 0; 4) SGN(v) ⊆ SGN(x*); and 5) π^{λ1}_{λ2}(v) = sgn(v) ⊙ π^{λ1}_{λ2}(|v|).

Proof. When λ1, λ2 ≥ 0, and w_i ≥ 0, for i = 1, 2, . . . , g, the objective function g^{λ1}_{λ2}(·) is strictly convex, and thus x* is the unique minimizer. 
We first show that if v_i > 0, then 0 ≤ x*_i ≤ v_i. If x*_i > v_i, then we can construct an x̂ as follows: x̂_j = x*_j, j ≠ i, and x̂_i = v_i. Similarly, if x*_i < 0, we can construct an x̂ as follows: x̂_j = x*_j, j ≠ i, and x̂_i = 0. It is easy to verify that x̂ achieves a lower objective function value than x* in both cases. We can prove the second and the third properties using similar arguments. Finally, we can prove the fourth and the fifth properties using the definition of SGN(·) and the first three properties.

Next, we show that π^{λ1}_{λ2}(·) can be directly derived from π^0_{λ2}(·) by soft-thresholding, so that we only need to focus on the case where λ1 = 0. This simplifies the optimization in (4), and it is an extension of the result for the fused Lasso in [10].

Theorem 1. Let u = sgn(v) ⊙ max(|v| − λ1, 0), and

π^0_{λ2}(u) = argmin_{x ∈ R^p} { h_{λ2}(x) ≡ (1/2)‖x − u‖² + λ2 Σ_{i=1}^g w_i ‖x_{G_i}‖ }.    (5)

Then, the following holds: π^{λ1}_{λ2}(v) = π^0_{λ2}(u).

Proof. Denote the unique minimizer of h_{λ2}(·) as x*. The sufficient and necessary condition for the optimality of x* is:

0 ∈ ∂h_{λ2}(x*) = x* − u + ∂φ^0_{λ2}(x*),    (6)

where ∂h_{λ2}(x) and ∂φ^0_{λ2}(x) are the sub-differential sets of h_{λ2}(·) and φ^0_{λ2}(·) at x, respectively. Next, we need to show 0 ∈ ∂g^{λ1}_{λ2}(x*). The sub-differential of g^{λ1}_{λ2}(·) at x* is given by

∂g^{λ1}_{λ2}(x*) = x* − v + ∂φ^{λ1}_{λ2}(x*) = x* − v + λ1 SGN(x*) + ∂φ^0_{λ2}(x*).    (7)

It follows from the definition of u that u ∈ v − λ1 SGN(u). Using the fourth property in Lemma 1, we have SGN(u) ⊆ SGN(x*). Thus,

u ∈ v − λ1 SGN(x*).    (8)

It follows from (6)-(8) that 0 ∈ ∂g^{λ1}_{λ2}(x*).

It follows from Theorem 1 that we only need to focus on the optimization of (5) in the following discussion. The difficulty in the optimization of (5) lies in the large number of groups that may overlap. In practice, many groups will be zero, yielding a sparse solution (a sparse solution is desirable in many applications). However, the zero groups are not known in advance. The key question we aim to address is how to identify as many zero groups as possible so as to reduce the complexity of the optimization. Next, we present a sufficient condition for a group to be zero.

Lemma 2. Denote the minimizer of h_{λ2}(·) in (5) by x*. If the i-th group satisfies ‖u_{G_i}‖ ≤ λ2 w_i, then x*_{G_i} = 0, i.e., the i-th group is zero.

Proof. We decompose h_{λ2}(x) into two parts as follows:

h_{λ2}(x) = ( (1/2)‖x_{G_i} − u_{G_i}‖² + λ2 w_i ‖x_{G_i}‖ ) + ( (1/2)‖x_{Ḡ_i} − u_{Ḡ_i}‖² + λ2 Σ_{j ≠ i} w_j ‖x_{G_j}‖ ),    (9)

where Ḡ_i = {1, 2, . . . , p} − G_i is the complementary set of G_i. We consider the minimization of h_{λ2}(x) in terms of x_{G_i} when x_{Ḡ_i} is fixed. It can be verified that if ‖u_{G_i}‖ ≤ λ2 w_i, then x*_{G_i} = 0 minimizes both terms in (9) simultaneously. Thus we have x*_{G_i} = 0.

Lemma 2 may not identify many true zero groups due to the strong condition imposed. The lemma below weakens the condition in Lemma 2. 
Intuitively, for a group G_i, we first identify all existing zero groups that overlap with G_i, and then compute the overlapping index subset S_i of G_i as:

S_i = ∪_{j ≠ i, x*_{G_j} = 0} (G_j ∩ G_i).    (10)

We can show that x*_{G_i} = 0 if ‖u_{G_i − S_i}‖ ≤ λ2 w_i is satisfied. Note that this condition is much weaker than the condition in Lemma 2, which requires that ‖u_{G_i}‖ ≤ λ2 w_i.

Lemma 3. Denote the minimizer of h_{λ2}(·) by x*. Let S_i, a subset of G_i, be defined in (10). If ‖u_{G_i − S_i}‖ ≤ λ2 w_i holds, then x*_{G_i} = 0.

Proof. Suppose that we have identified a collection of zero groups. By removing these groups, the original problem (5) can be reduced to:

min_{x(I_1) ∈ R^{|I_1|}} (1/2)‖x(I_1) − u(I_1)‖² + λ2 Σ_{i ∈ G_1} w_i ‖x_{G_i − S_i}‖,

where I_1 is the reduced index set, i.e., I_1 = {1, 2, . . . , p} − ∪_{i: x*_{G_i} = 0} G_i, and G_1 = {i : x*_{G_i} ≠ 0} is the index set of the remaining non-zero groups. Note that ∀i ∈ G_1, G_i − S_i ⊆ I_1. By applying Lemma 2 again, we show that if ‖u_{G_i − S_i}‖ ≤ λ2 w_i holds, then x*_{G_i − S_i} = 0. Thus, x*_{G_i} = 0.

Lemma 3 naturally leads to an iterative procedure for identifying the zero groups: for each group G_i, if ‖u_{G_i}‖ ≤ λ2 w_i, then we set u_{G_i} = 0; we cycle through all groups repeatedly until u does not change. Let p′ = |{u_i : u_i ≠ 0}| be the number of nonzero elements in u, let g′ = |{u_{G_i} : u_{G_i} ≠ 0}| be the number of nonzero groups, and let x* denote the minimizer of h_{λ2}(·). It follows from Lemma 3 and Lemma 1 that, if u_i = 0, then x*_i = 0. Therefore, by applying the above iterative procedure, we can find the minimizer of (5) by solving a reduced problem that has p′ ≤ p variables and g′ ≤ g groups. With some abuse of notation, we still use (5) to denote the resulting reduced problem. In addition, from Lemma 1, we only focus on u > 0 in the following discussion; the analysis can be easily generalized to a general u.

3.2 Reformulation as an Equivalent Smooth Convex Optimization Problem

It follows from the first two properties in Lemma 1 that we can rewrite (5) as:

π^0_{λ2}(u) = argmin_{x ∈ R^p, 0 ⪯ x ⪯ u} h_{λ2}(x),    (11)

where ⪯ denotes the element-wise inequality and

h_{λ2}(x) = (1/2)‖x − u‖² + λ2 Σ_{i=1}^g w_i ‖x_{G_i}‖,

and the minimizer of h_{λ2}(·) is constrained to be non-negative due to u > 0 (refer to the discussion at the end of Section 3.1).

Making use of the dual norm of the Euclidean norm ‖·‖, we can rewrite h_{λ2}(x) as:

h_{λ2}(x) = max_{Y ∈ Ω} (1/2)‖x − u‖² + Σ_{i=1}^g ⟨x, Y^i⟩,    (12)

where Ω is defined as follows:

Ω = {Y ∈ R^{p×g} : Y^i_{Ḡ_i} = 0, ‖Y^i‖ ≤ λ2 w_i, i = 1, 2, . . . , g},

Ḡ_i is the complementary set of G_i, Y is a sparse matrix satisfying Y_{ij} = 0 if the i-th feature does not belong to the j-th group, i.e., i ∉ G_j, and Y^i denotes the i-th column of Y. As a result, we can reformulate (11) as the following min-max problem:

min_{x ∈ R^p, 0 ⪯ x ⪯ u} max_{Y ∈ Ω} { ψ(x, Y) = (1/2)‖x − u‖² + ⟨x, Y e⟩ },    (13)

where e ∈ R^g is a vector of ones. It is easy to verify that ψ(x, Y) is convex in x and concave in Y, and the constraint sets are closed and convex for both x and Y. 
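The soft-thresholding step of Theorem 1 and the iterative zeroing procedure above can be sketched in a few lines of Python; this is an illustrative re-implementation under our own naming (`preprocess`, `groups`, `weights`), not the authors' C code.

```python
import numpy as np

def preprocess(v, groups, weights, lam1, lam2):
    """Reduce the proximal-operator problem (4) before solving it.

    groups  : list of integer index arrays G_i
    weights : list of w_i > 0
    Returns u with every provably-zero group set to 0; by Theorem 1 and
    Lemma 3, the minimizer of (5) vanishes wherever u does.
    """
    # Theorem 1: soft-threshold v to reduce to the lambda_1 = 0 case.
    u = np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)
    # Lemma 3: cycle through the groups, zeroing any not-yet-zero group
    # whose current norm is at most lambda_2 * w_i, until u stops changing.
    changed = True
    while changed:
        changed = False
        for G, w in zip(groups, weights):
            if np.any(u[G]) and np.linalg.norm(u[G]) <= lam2 * w:
                u[G] = 0.0
                changed = True
    return u
```

The surviving nonzero entries and groups (the p′ variables and g′ groups discussed above) define the reduced problem handed to the solver; note that zeroing one group can newly qualify an overlapping group on a later pass, which is exactly the weakening that Lemma 3 formalizes.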
Thus, (13) has a saddle point, and the min-max can be exchanged.

It is easy to verify that for a given Y, the optimal x minimizing ψ(x, Y) in (13) is given by

x = max(u − Y e, 0).    (14)

Plugging (14) into (13), we obtain the following minimization problem with regard to Y:

min_{Y ∈ Ω} { ω(Y) = −ψ(max(u − Y e, 0), Y) }.    (15)

Our methodology for minimizing h_{λ2}(·) defined in (5) is to first solve (15), and then construct the solution to h_{λ2}(·) via (14). Using standard optimization techniques, we can show that the function ω(·) is continuously differentiable with Lipschitz continuous gradient; we include the detailed proof in the supplemental material for completeness. Therefore, we convert the non-smooth problem (11) to the smooth problem (15), making the smooth convex optimization tools applicable. In this paper, we employ the accelerated gradient descent to solve (15), due to its fast convergence property. Note that the Euclidean projection onto the set Ω can be computed in closed form. We would like to emphasize that the problem (15) may have a much smaller size than (4).

3.2.1 Computing the Duality Gap

We show how to estimate the duality gap of the min-max problem (13), which can be used to check the quality of the solution and determine the convergence of the algorithm.

For any given approximate solution Ỹ ∈ Ω for ω(Y), we can construct the approximate solution x̃ = max(u − Ỹe, 0) for h_{λ2}(x). The duality gap for the min-max problem (13) at the point (x̃, Ỹ) can be computed as:

gap(Ỹ) = max_{Y ∈ Ω} ψ(x̃, Y) − min_{x ∈ R^p, 0 ⪯ x ⪯ u} ψ(x, Ỹ).    (16)

The main result of this subsection is summarized in the following theorem:

Theorem 2. Let gap(Ỹ) be the duality gap defined in (16). Then, the following holds:

gap(Ỹ) = Σ_{i=1}^g ( λ2 w_i ‖x̃_{G_i}‖ − ⟨x̃_{G_i}, Ỹ^i_{G_i}⟩ ).    (17)

In addition, we have

ω(Ỹ) − ω(Y*) ≤ gap(Ỹ),    (18)

h_{λ2}(x̃) − h_{λ2}(x*) ≤ gap(Ỹ).    (19)

Proof. Denote (x*, Y*) as the optimal solution to the problem (13). From (12)-(15), we have

−ω(Ỹ) = ψ(x̃, Ỹ) = min_{x ∈ R^p, 0 ⪯ x ⪯ u} ψ(x, Ỹ) ≤ ψ(x*, Ỹ),    (20)

ψ(x*, Ỹ) ≤ max_{Y ∈ Ω} ψ(x*, Y) = ψ(x*, Y*) = −ω(Y*),    (21)

h_{λ2}(x*) = ψ(x*, Y*) = min_{x ∈ R^p, 0 ⪯ x ⪯ u} ψ(x, Y*) ≤ ψ(x̃, Y*),    (22)

ψ(x̃, Y*) ≤ max_{Y ∈ Ω} ψ(x̃, Y) = h_{λ2}(x̃).    (23)

Incorporating (11) and (20)-(23), we prove (17)-(19).

In our experiments, we terminate the algorithm when the estimated duality gap is less than 10^{-10}.

3.3 Proximal Splitting Methods

Recently, a family of proximal splitting methods [8] has been proposed for converting a challenging optimization problem into a series of sub-problems with closed form solutions. We consider two reformulations of the proximal operator (4), based on the Dykstra-like proximal splitting method and the alternating direction method of multipliers (ADMM). The efficiency of these two methods for the overlapping group Lasso will be demonstrated in the next section.

In [5], Boyd et al. suggested that the original overlapping group Lasso problem (1) can be reformulated and solved by ADMM directly. We include the implementation of ADMM for our comparative study. 
We provide the details of all three reformulations in the supplemental materials.

4 Experiments

In this section, extensive experiments are performed to demonstrate the efficiency of our proposed methods. We use both synthetic data sets and a real-world data set, and the evaluation is done in various problem size and precision settings. The proposed algorithms are mainly implemented in Matlab, with the proximal operator implemented in standard C for improved efficiency. Several state-of-the-art methods are included for comparison purposes: the SLasso algorithm developed by Jenatton et al. [13], the ADMM reformulation [5], the Prox-Grad method by Chen et al. [6], and the Picard-Nesterov algorithm by Argyriou et al. [1].

4.1 Synthetic Data

In the first set of simulations we consider only the key component of our algorithm, the proximal operator. The group indices are predefined such that G_1 = {1, 2, . . . , 10}, G_2 = {6, 7, . . . , 20}, . . ., with each group overlapping half of the previous group. 100 examples are generated for each set of fixed problem size p and group size g, and the results are summarized in Figure 1. As we can observe from the figure, the dual formulation yields the best performance, followed closely by ADMM and then the Dykstra method. We can also observe that our method scales very well to high-dimensional problems, since even with p = 10^6, the proximal operator can be computed in a few seconds. It is also not surprising that the Dykstra method is much more sensitive to the number of groups, which equals the number of projections in one Dykstra step.

To illustrate the effectiveness of our pre-processing technique, we repeat the previous experiment with the pre-processing step removed. The results are shown in the right plot of Figure 1. 
As we can observe from the figure, the proposed pre-processing technique effectively reduces the computational time.

Figure 1: Time comparison for computing the proximal operators. The group number is fixed in the left figure and the problem size is fixed in the middle figure. In the right figure, the effectiveness of the pre-processing technique is illustrated.

As is evident from Figure 1, the dual formulation proposed in Section 3.2 consistently outperforms the other proximal splitting methods. In the following experiments, only the dual method will be used for computing the proximal operator, and our method will then be referred to as "FoGLasso".

4.2 Gene Expression Data

We have also conducted experiments to evaluate the efficiency of the proposed algorithm using the breast cancer gene expression data set [26], which consists of 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic). For the sake of analyzing microarrays in terms of biologically meaningful gene sets, different approaches have been used to organize the genes into (overlapping) gene sets. In our experiments, we follow [12] and employ the following two approaches for generating the overlapping gene sets (groups): pathways [24] and edges [7]. For pathways, the canonical pathways from the Molecular Signatures Database (MSigDB) [24] are used. It contains 639 groups of genes, of which 637 groups involve the genes in our study. The statistics of the 637 gene groups are summarized as follows: the average number of genes in each group is 23.7, the largest gene group has 213 genes, and 3,510 genes appear in these 637 groups with an average appearance frequency of about 4. For edges, the network built in [7] will be used, and we follow [12] to extract 42,594 edges from the network, leading to 42,594 overlapping gene sets of size 2. All 8,141 genes appear in the 42,594 groups with an average appearance frequency of about 10. The experimental settings are as follows: we solve (1) with the least squares loss l(x) = (1/2)‖Ax − b‖², and we set w_i = √|G_i|, where |G_i| denotes the size of the i-th group G_i, and λ1 = λ2 = γ × λ1^max, where λ1^max = ‖A^T b‖_∞ (the zero point is a solution to (1) if λ1 ≥ λ1^max), and γ is chosen from the set {5×10^{-1}, 2×10^{-1}, 1×10^{-1}, 5×10^{-2}, 2×10^{-2}, 1×10^{-2}, 5×10^{-3}, 2×10^{-3}, 1×10^{-3}}.

Comparison with SLasso, Prox-Grad and ADMM: We first compare our proposed FoGLasso with the SLasso algorithm [13], ADMM [5] and Prox-Grad [6]. The comparisons are based on computational time, since all these methods have efficient Matlab implementations with key components written in C. For a given γ, we first run SLasso until a certain precision level is reached, and then run the others until they achieve an objective function value smaller than or equal to that of SLasso. Different precision levels of the solutions are evaluated so that a fair comparison can be made. We vary the number of genes involved, and report the total computational time (in seconds) over all nine regularization parameters in Figure 2. 
We can observe that: 1) for all precision\nlevels, our proposed FoGLasso is much more ef\ufb01cient than SLasso, ADMM and Prox-Grad; 2) the\nadvantage of FoGLasso over other three methods in ef\ufb01ciency grows with the increasing number of\ngenes (variables). For example, with the grouping by pathways, FoGLasso is about 25 and 70 times\nfaster than SLasso for 1000 and 2000 genes, respectively; and 3) the ef\ufb01ciency on edges is inferior\nto that on pathways, due to the larger number of overlapping groups. Additional scalability study of\nour proposed method using larger problem size can be found in the supplemental materials.\nComparison with Picard-Nesterov Since the code acquired for Picard-Nesterov is implemented\npurely in Matlab, a computational time comparison might not be fair. Therefore, only the number\nof iterations required for convergence is reported, as both methods adopt the \ufb01rst order method.\nWe use edges to generate the groups, and vary the problem size from 100 to 400, using the same\nset of regularization parameters. 
For each problem, we record both the number of outer iterations (the gradient steps) and the total number of inner iterations (the steps required for computing the proximal operators).

[Figure 2: six panels plotting CPU time (in seconds, logarithmic scale) against the number of involved genes, for the edge-based and pathway-based groupings at precision levels 1e-02, 1e-04 and 1e-06.]

Figure 2: Comparison of SLasso [13], ADMM [5], Prox-Grad [6] and our proposed FoGLasso algorithm in terms of computational time (in seconds and in the logarithmic scale) when different numbers of genes (variables) are involved. Different precision levels are used for comparison.

Table 1: Comparison of FoGLasso and Picard-Nesterov using different numbers (p) of genes and various precision levels. For each particular method, the first row denotes the number of outer iterations required for convergence, while the second row represents the total number of inner iterations.

Precision Level                1e-02                    1e-04                    1e-06
p                        100    200    400        100    200     400        100    200     400
FoGLasso (outer)          81    189    353        192    371    1299        334    507    1796
FoGLasso (inner)         288    401    921        404    590    1912        547    727    2387
Picard-Nesterov (outer)   78    176    325        181    304    1028        318    504    1431
Picard-Nesterov (inner) 8271  6.8e4  2.2e5      2.6e4  1.0e5   7.8e5      5.1e4  1.3e5   1.1e6

The numbers of iterations, averaged over all the regularization parameters, are summarized in Table 1. As we can observe from the table, although the Picard-Nesterov method often takes fewer outer iterations to converge, it requires many more inner iterations to compute the proximal operator. It is straightforward to verify that the inner iterations of the Picard-Nesterov method and of our proposed method have the same complexity of O(pg).

5 Conclusion

In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem based on the accelerated gradient descent method. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem. Numerical experiments on both the synthetic and the breast cancer data sets demonstrate the efficiency of the proposed algorithm. Although the optimal convergence rate of the accelerated gradient method might not be guaranteed with an inexact proximal operator [23, 11], the algorithm performs quite well empirically. A theoretical analysis of the convergence property will be an interesting future direction.
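As an illustration of the dual-based computation of the proximal operator described above, the following is a minimal NumPy sketch (not the authors' implementation; the function name, interface, and the constant step size 1/L are our own illustrative choices). It maximizes the smooth dual by projected gradient steps, with each dual variable constrained to a Euclidean ball of radius equal to the corresponding group's regularization weight:

```python
import numpy as np

def prox_overlapping_group_lasso(v, groups, weights, lam, max_iter=1000, tol=1e-10):
    """Proximal operator of x -> lam * sum_g w_g * ||x_g||_2 with possibly
    overlapping groups, computed by projected gradient on the smooth dual.

    groups : list of integer index arrays (one per group)
    weights: per-group weights w_g
    """
    v = np.asarray(v, dtype=float)
    # One dual variable per group, constrained to ||Y_g||_2 <= lam * w_g.
    Y = [np.zeros(len(g)) for g in groups]
    # The dual gradient is Lipschitz with constant bounded by the maximum
    # number of groups covering any single coordinate.
    cover = np.zeros(v.size)
    for g in groups:
        cover[g] += 1.0
    L = max(cover.max(), 1.0)

    for _ in range(max_iter):
        # Primal point induced by the current dual variables: x = v - sum_g Y_g.
        x = v.copy()
        for g, y in zip(groups, Y):
            x[g] -= y
        max_change = 0.0
        for k, g in enumerate(groups):
            radius = lam * weights[k]
            y_new = Y[k] + x[g] / L          # gradient step on the dual
            norm = np.linalg.norm(y_new)
            if norm > radius:                # project back onto the ball
                y_new *= radius / norm
            max_change = max(max_change, np.max(np.abs(y_new - Y[k])))
            Y[k] = y_new
        if max_change < tol:
            break

    x = v.copy()
    for g, y in zip(groups, Y):
        x[g] -= y
    return x
```

As a sanity check, with a single group covering all coordinates this reduces to block soft-thresholding: for v = (3, 4), lam = 1 and w = 1, the result is (1 - 1/5) * v = (2.4, 3.2).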
In the future, we also plan to apply the proposed algorithm to other real-world applications involving overlapping groups.

Acknowledgments

This work was supported by NSF IIS-0812551, IIS-0953662, MCB-1026710, CCF-1025177, NGA HM1582-08-1-0016, and NSFC 60905035, 61035003.

References
[1] A. Argyriou, C.A. Micchelli, M. Pontil, L. Shen, and Y. Xu. Efficient first order methods for linear composite regularizers. arXiv preprint arXiv:1104.1436, 2011.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics, 64:115-123, 2008.
[4] J. F. Bonnans and A. Shapiro. Optimization problems with perturbations: A guided tour. SIAM Review, 40(2):228-264, 1998.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. 2010.
[6] X. Chen, Q. Lin, S. Kim, J.G. Carbonell, and E.P. Xing. An efficient proximal gradient method for general structured sparse learning. arXiv preprint arXiv:1005.4717, 2010.
[7] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Molecular Systems Biology, 3(140), 2007.
[8] P.L. Combettes and J.C. Pesquet. Proximal splitting methods in signal processing. arXiv preprint arXiv:0912.3522, 2009.
[9] J. M. Danskin. The theory of max-min and its applications to weapons allocation problems. Springer-Verlag, New York, 1967.
[10] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302-332, 2007.
[11] B. He and X. Yuan.
An accelerated inexact proximal point algorithm for convex minimization. 2010.
[12] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In ICML, 2009.
[13] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009.
[14] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In ICML, 2010.
[15] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, 2010.
[16] H. Liu, M. Palatucci, and J. Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In ICML, 2009.
[17] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l2,1-norm minimization. In UAI, 2009.
[18] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.
[19] L. Meier, S. Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70:53-71, 2008.
[20] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273-299, 1965.
[21] A. Nemirovski. Efficient methods in convex programming. Lecture Notes, 1994.
[22] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[23] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14:877, 1976.
[24] A. Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545-15550, 2005.
[25] R. Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society Series B, 58(1):267-288, 1996.
[26] M. J. Van de Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, 347(25):1999-2009, 2002.
[27] Y. Ying, C. Campbell, and M. Girolami. Analysis of SVM with indefinite kernels. In NIPS, 2009.
[28] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49-67, 2006.
[29] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468-3497, 2009.