{"title": "Fast Sparse Group Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 1702, "page_last": 1710, "abstract": "Sparse Group Lasso is a method of linear regression analysis that finds sparse parameters in terms of both feature groups and individual features.\nBlock Coordinate Descent is a standard approach to obtain the parameters of Sparse Group Lasso, and iteratively updates the parameters for each parameter group.\nHowever, as an update of only one parameter group depends on all the parameter groups or data points, the computation cost is high when the number of the parameters or data points is large.\nThis paper proposes a fast Block Coordinate Descent for Sparse Group Lasso.\nIt efficiently skips the updates of the groups whose parameters must be zeros by using the parameters in one group.\nIn addition, it preferentially updates parameters in a candidate group set, which contains groups whose parameters must not be zeros.\nTheoretically, our approach guarantees the same results as the original Block Coordinate Descent.\nExperiments show that our algorithm enhances the efficiency of the original algorithm without any loss of accuracy.", "full_text": "Fast Sparse Group Lasso\n\nYasutoshi Ida1,3\n\nYasuhiro Fujiwara2 Hisashi Kashima3,4\n\n1NTT Software Innovation Center\n\n2NTT Communication Science Laboratories\n\n3Kyoto University\n\n4RIKEN AIP\n\nyasutoshi.ida@ieee.org\n\nyasuhiro.fujiwara.kh@hco.ntt.co.jp\n\nkashima@i.kyoto-u.ac.jp\n\nAbstract\n\nSparse Group Lasso is a method of linear regression analysis that \ufb01nds sparse pa-\nrameters in terms of both feature groups and individual features. Block Coordinate\nDescent is a standard approach to obtain the parameters of Sparse Group Lasso,\nand iteratively updates the parameters for each parameter group. 
However, as an\nupdate of only one parameter group depends on all the parameter groups or data\npoints, the computation cost is high when the number of the parameters or data\npoints is large. This paper proposes a fast Block Coordinate Descent for Sparse\nGroup Lasso. It ef\ufb01ciently skips the updates of the groups whose parameters must\nbe zeros by using the parameters in one group. In addition, it preferentially updates\nparameters in a candidate group set, which contains groups whose parameters must\nnot be zeros. Theoretically, our approach guarantees the same results as the original\nBlock Coordinate Descent. Experiments show that our algorithm enhances the\nef\ufb01ciency of the original algorithm without any loss of accuracy.\n\n1\n\nIntroduction\n\nSparse Group Lasso (SGL) [3, 17] is a popular feature-selection method based on the linear regression\nmodel for data that have group structures. For the analysis of such data, it is important to identify not\nonly individual features but also groups of features that have some relationships with the response.\nSGL \ufb01nds such groups and features by obtaining sparse parameters corresponding to the features in\nthe linear regression model. In particular, SGL effectively achieves parameter sparsity by utilizing\ntwo types of regularizations: feature- and group-level regularization. Owing to its effectiveness, SGL\nis used in the analysis of the various data, e.g., gene expression data [11, 16] and climate data [13].\nIn order to obtain the sparse parameters in SGL, Block Coordinate Descent (BCD) is used as a\nstandard approach [3, 17]. BCD iteratively updates the parameters for each group until convergence.\nIn particular, it \ufb01rst checks whether the parameters in a group are zeros by using all the parameters or\ndata points. This process induces group-level sparsity. If the parameters in the group are determined\nas nonzeros, BCD updates the parameters in the group. 
It applies the aforementioned steps to the\nparameters of each group until the parameters of all groups converge.\nAlthough SGL is practical for analyzing group structured data, BCD suffers from high computation\ncosts. The main bottleneck is the computation to check whether the parameters of a group are\nzeros, because the computation uses all the parameters or data points. The screening technique is\nthe main existing approach for reducing the computation cost of BCD by reducing the data size\n[18, 12\u201314]. This technique eliminates features and groups whose parameters are zeros, before\nentering the iterations of BCD. However, the screening techniques cannot be expected to reduce the\ndata size when the initial parameters are far from optimal [9]. The screening techniques often face\nsuch problems in practice, and the ef\ufb01ciency of BCD would not be increased in such cases. Therefore,\nspeeding up BCD is still an important topic of study for handling large data sizes.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThis paper proposes a fast BCD for SGL. Our main idea is to identify the groups whose parameters\nmust be zeros by only using the parameters in the group, whereas the standard method uses all the\nparameters or data points. As the number of parameters in one group is much smaller than the total\nnumber of parameters or data points, our method ef\ufb01ciently skips the computation of groups whose\nparameters must be zeros. Another idea is to extract a candidate group set, which contains groups\nwhose parameters must not be zeros. As the parameters in the set are likely to largely contribute to the\nprediction [5, 4], we can expect BCD to effectively optimize the objective function by preferentially\nupdating the parameters in the set. The attractive point of our method is that it does not need any\nadditional hyperparameters, which incur additional tuning costs. 
In addition, it provably guarantees\nconvergence to the same value of the objective function as the original BCD. Experiments demonstrate\nthat our method enhances the ef\ufb01ciency of BCD while achieving the same prediction error. Although\nwe consider the case of non-overlapping groups in the paper, our method is relatively easy to be\nextended for overlapping groups by using overlap norm [8].\n\n2 Preliminary\n\n2.1 Sparse Group Lasso\n\nThis section de\ufb01nes the SGL as a method of linear regression analysis that \ufb01nds small sets of\ngroups in addition to features that achieve high prediction accuracy for the response. Let n be\nthe number of data points, where each data point is represented as a p-dimensional feature vector,\ny \u2208 Rn be a continuous response, and G be the total number of feature groups. A matrix of features\nX \u2208 Rn\u00d7p is represented as X = [X(1), X(2), ..., X(G)], where X(g) \u2208 Rn\u00d7pg is the block matrix\nof X corresponding to the g-th feature group with the number of features pg. Similarly, parameter\nvector \u03b2 \u2208 Rp is represented as \u03b2 = [\u03b2(1)T, \u03b2(2)T, ..., \u03b2(G)T]T, where \u03b2(g) \u2208 Rpg is the parameter\n(coef\ufb01cient) vector of group g. Therefore, the linear regression model in SGL is represented as\ny = X\u03b2 = X(1)\u03b2(1) + \u00b7\u00b7\u00b7 + X(G)\u03b2(G). 
Solution \u02c6\u03b2 is obtained by solving the following problem:\n\n\u02c6\u03b2 = arg min_{\u03b2 \u2208 R^p} (1/(2n)) ||y \u2212 \u2211_{g=1}^{G} X^{(g)}\u03b2^{(g)}||_2^2 + (1 \u2212 \u03b1)\u03bb \u2211_{g=1}^{G} \u221a(p_g) ||\u03b2^{(g)}||_2 + \u03b1\u03bb||\u03b2||_1,    (1)\n\nwhere \u03b1 \u2208 [0, 1] and \u03bb \u2265 0 are regularization constants; \u03b1 determines the balance of the convex\ncombination of l1 and l2 norm penalties and \u03bb controls the degree of sparsity of the solution.\n\n2.2 Block Coordinate Descent\n\nBCD is a standard approach used to obtain solution \u02c6\u03b2 of SGL [17]. It consists of a group-level outer\nloop and an element-level loop. The group-level outer loop checks whether parameter vector \u03b2(g)\nfor each feature group is a zero vector. If \u03b2(g) turns out to be a nonzero vector, the element-level loop\nupdates each parameter in \u03b2(g). The process terminates if the whole parameter vector converges.\nIn the element-level loop, BCD updates \u03b2(g) in group g if the parameter vector of the group is not a\nzero vector. The updated parameter vector \u03b2(g)_new is defined as follows:\n\n\u03b2^{(g)}_new = (1 \u2212 t(1 \u2212 \u03b1)\u03bb / ||S(Z^{(g)}, t\u03b1\u03bb)||_2)_+ S(Z^{(g)}, t\u03b1\u03bb).    (2)\n\nIn Equation (2), Z(g) = \u03b2(g) + (t/n)(X(g)Tr(\u2212g) \u2212 X(g)TX(g)\u03b2(g)), where t \u2265 0 is the step size, and\nr(\u2212g) is the partial residual and is defined as follows:\n\nr^{(\u2212g)} = y \u2212 \u2211_{l \u2260 g} X^{(l)}\u03b2^{(l)}.    (3)\n\nS(\u00b7) is the coordinate-wise soft-threshold operator; the j-th element is computed as S(z, \u03b3)[j] =\nsign(z[j])(|z[j]| \u2212 \u03b3)+. \u03b2(g) is iteratively updated using Equation (2) until convergence. If the\nparameter vector of the group is determined to be a zero vector, Equation (2) is skipped. The computation\ncost of Equation (2) is O(p_g^2) time because X(g)TX(g) in Z(g) is precomputed before entering the\nmain loop. 
In addition, X(g)Tr(\u2212g) has already been computed in the group-level outer loop, as\ndescribed next.\nIn the group-level outer loop, \u03b2(g) is the zero vector if the following condition holds:\n\n||S(X^{(g)T}r^{(\u2212g)}, \u03b1\u03bb)||_2 \u2264 \u221a(p_g)(1 \u2212 \u03b1)\u03bb.    (4)\n\nIn other words, if Equation (4) holds, the parameter vector of group g is a zero vector; Equation (2) is\nthen skipped, and the parameter vector is not updated. X(g)Tr(\u2212g) in Equation (4) is computed using\nthe following equation that consists of only matrix operations:\n\nX^{(g)T}r^{(\u2212g)} = X^{(g)T}y \u2212 X^{(g)T}X\u03b2 + X^{(g)T}X^{(g)}\u03b2^{(g)}.    (5)\n\nIn this equation, X(g)Ty, X(g)TX, and X(g)TX(g) are precomputed before entering the loops. The\ncosts of the precomputations have relatively low impacts on the total computational cost because\nprecomputations are performed only once in the total computation, and are easily parallelized. On\nthe other hand, the computation cost of Equation (5) is still high because it requires O(p p_g + p_g^2)\ntime, and it is repeatedly performed until convergence. As a result, we need O(p p_g + p_g^2) time for\nEquation (4) at every iteration. We can modify Equation (5) to have O(npg) time by maintaining\nthe partial residuals of Equation (3) as described in [7]. However, in either case, the computation\ncost of Equation (4) depends on p or n. Therefore, Equation (4) incurs a large computation cost for\nhigh-dimensional features or a large number of data points.\n\n3 Proposed Approach\n\nIn this section, we introduce our algorithm, which efficiently obtains the solution of SGL. First, we\nexplain the ideas underlying our algorithm. Next, we introduce several lemmas that are necessary to\nderive our algorithm. We then describe the algorithm.1\n\n3.1 Idea\n\nIn SGL, obtaining the solution through BCD incurs a high computation cost. 
This is because (i)\nEquation (4) requires O(p p_g + p_g^2) or O(npg) time, which incurs a large computation cost for large\nfeature vectors or a large number of data points, and (ii) BCD always checks all of the feature groups\nusing Equation (4) at every iteration even when most of the groups have zero vectors.\nOur main idea is to identify groups whose parameter vectors must be zero vectors by approximating\nEquation (4), which checks whether the parameter vector of each group is a zero vector. In particular, we\ncompute the upper bound of ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 instead of computing the exact value. If the upper\nbound does not exceed \u221apg(1 \u2212 \u03b1)\u03bb, the parameter vector of the group must be a zero vector. As a\nresult, we can safely skip the computation of the group. As our upper bound requires only O(pg)\ntime instead of the O(p p_g + p_g^2) or O(npg) time for the original Equation (4), we can effectively\nreduce the computation cost.\nAnother idea is to extract a candidate group set, which contains groups whose parameter vectors\nmust not be zero vectors. As the parameters in the set are likely to contribute substantially to the prediction\n[5, 4], we can expect BCD to effectively optimize the objective function by preferentially updating\nthe parameters in the set. In addition, our method only requires O(G) time to construct the set, and\nthus the computation cost is relatively low.\n\n3.2 Upper Bound for Skipping Computations\n\nWe introduce the upper bound of ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 in Equation (4). To derive a tight upper\nbound, we introduce reference parameter vectors and partial residuals of Equation (3) that are\ncomputed before entering the group-level outer loop. To be specific, we can obtain a tight bound by\nexplicitly utilizing the term representing the difference between the reference and current parameter\nvectors. 
As many parameter vectors rapidly converge during the iterations, the difference between the\nreference and current parameter vectors rapidly decreases. We define the upper bound as follows:\nDefinition 1 (Upper bound) Let U(g) be an upper bound of ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 in Equation (4),\nand \u02dcr(\u2212g) be a partial residual of Equation (3) before entering the group-level outer loop. Then, U(g)\nis defined as follows:\n\nU^{(g)} = ||X^{(g)T}\u02dcr^{(\u2212g)}||_2 + \u039b(g, g) + \u2211_{l=1}^{G} \u039b(g, l),    (6)\n\nwhere \u039b(g, l) = ||\u02c6K(g)[l]||2 ||\u03b2(l) \u2212 \u02dc\u03b2(l)||2. The i-th element of \u02c6K(g)[l] \u2208 R^{p_g} is given as ||K(g,l)[i, :]||2,\nthat is, the l2 norm of the i-th row vector in block matrix K(g,l) \u2208 R^{p_g \u00d7 p_l} of K := XTX \u2208 R^{p \u00d7 p}.\n\u02dc\u03b2(g) is a parameter vector before entering the group-level outer loop.\nNote that we can precompute ||X(g)T\u02dcr(\u2212g)||2 and ||\u02c6K(g)[\u00b7]||2 before entering the group-level outer\nloop and the main loop, respectively. The following lemma shows the property of the upper bound\ncorresponding to groups with parameters that must be zeros:\nLemma 1 (Groups with zero vectors) If U(g) satisfies U(g) \u2264 \u221apg(1 \u2212 \u03b1)\u03bb, parameter \u03b2(g) for\ngroup g is a zero vector.\n\nLemma 1 indicates that we can identify groups whose parameters must be zeros by using upper\nbound U(g) instead of ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2. The error bound of U(g) for ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 is\ndescribed in a later section.\n\n1We show all the proofs in the supplementary material.\n\n3.3 Online Update Scheme of Upper Bound\n\nAlthough we can identify groups whose parameters must be zeros by using upper bound U(g),\nO(p + pg) time is still required to compute Equation (6) of the upper bound even if we precompute\n||X(g)T\u02dcr(\u2212g)||2 and ||\u02c6K(g)[\u00b7]||2. 
As the standard approach requires O(p p_g + p_g^2) or O(npg) time, the\nefficiency of our approach would be moderate. This is the motivation behind our use of the online\nupdate scheme for the upper bound. In particular, when a parameter vector of a group is updated, we\nuse the following definition for the upper-bound computation:\nDefinition 2 (Online update scheme of upper bound) If \u03b2(g) is updated to \u03b2(g)\u2032, we update upper\nbound U(g) of Equation (6) as follows:\n\nU^{(g)\u2032} = U^{(g)} \u2212 2\u039b(g, g) + 2||\u02c6K^{(g)}[g]||_2 ||\u03b2^{(g)\u2032} \u2212 \u02dc\u03b2^{(g)}||_2.    (7)\n\nEquation (7) clearly holds because we subtract the old value of 2\u039b(g, g) from Equation (6), and add\nthe updated value of 2||\u02c6K(g)[g]||2 ||\u03b2(g)\u2032 \u2212 \u02dc\u03b2(g)||2 to the equation. In terms of the computation cost, we\nhave the following lemma:\nLemma 2 (Computation cost for online update scheme of upper bound) The computation of\nEquation (7) requires O(pg) time given precomputed ||X(g)T\u02dcr(\u2212g)||2 and ||\u02c6K(g)[\u00b7]||2 when the parameter\nvector of group g is updated.\nThe above lemma shows that we can update the upper bound in O(pg) time. The computation\ncost is significantly lower than that of Equations (4) and (6), which require\nO(p p_g + p_g^2) (or O(npg)) and O(p + pg) time, respectively. Therefore, we can efficiently identify\ngroups whose parameters must be zeros on the basis of Lemma 1 and Definition 2.\n\n3.4 Candidate Group Set for Selective Updates\n\nIn this section, we introduce a method to extract the candidate group set, which contains the groups\nwhose parameters must not be zeros. We expect BCD to effectively update the parameter vectors by\npreferentially updating the parameter vectors on the candidate group set. To extract the candidate\ngroup set, we utilize a criterion, which approximates ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 in Equation (4). 
If the\ncriterion, defined as follows, is above a threshold, the group is included in the set.\n\nDefinition 3 (Criterion to extract candidate group set) Let C(g) be a criterion, which is used to\ncheck whether the group is included in the candidate group set. Then, C(g) is defined as follows:\n\nC^{(g)} = ||X^{(g)T}\u02dcr^{(\u2212g)}||_2 \u2212 \u03b1\u03bb\u221a(p_g/2),    (8)\n\nwhere \u02dcr(\u2212g) is a partial residual of Equation (3) before entering the group-level outer loop.\n\nThe error bounds of C(g) and U(g) for ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 are shown as follows:\nLemma 3 (Error bound) Let \u03b5 be an error bound of C(g) for ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 such that\n|C(g) \u2212 ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2| \u2264 \u03b5. We then have \u03b5 = \u039b(g, g) + \u2211_{l=1}^{G} \u039b(g, l) + \u03b1\u03bb\u221a(p_g/2). In\naddition, we have |U(g) \u2212 ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2| \u2264 2\u03b5.\nThe above lemma suggests that C(g) approximates ||S(X(g)Tr(\u2212g), \u03b1\u03bb)||2 better than U(g) because\nthe error bound of C(g) is half the size of that of U(g). We extract candidate group set C with respect\nto C(g) by using the following definition:\nDefinition 4 (Candidate group set) Candidate group set C is defined as\n\nC = {g \u2208 {1, ..., G} | C^{(g)} > \u221a(p_g)(1 \u2212 \u03b1)\u03bb}.    (9)\n\nThe candidate group set has the following property:\nLemma 4 (Groups containing nonzero vectors) Candidate group set C contains the groups whose\nparameters must be nonzeros.\n\nThe above lemma suggests that the candidate group set comprises not only the groups whose\nparameters must be nonzeros but also groups whose parameters can be nonzeros. 
In terms of the\ncomputation cost, we have the following lemma to extract the candidate group set:\nLemma 5 (Computation cost of candidate group set) Given precomputed ||X(g)T\u02dcr(\u2212g)||2, we can\nextract candidate group set C in O(G) time.\n\n3.5 Algorithm\n\nThis section describes our algorithm, which utilizes the above-mentioned definitions and lemmas.\nAlgorithm 1 gives a full description of our approach, which is based on BCD with the sequential\nrule [6]: a standard approach for SGL. The sequential rule is used to tune regularization constant \u03bb\nwith respect to the sequence (\u03bb_q)_{q=0}^{Q\u22121}, where \u03bb_0 > \u03bb_1 > ... > \u03bb_{Q\u22121}: it sequentially optimizes the\nparameter vector by using (\u03bb_q)_{q=0}^{Q\u22121}, and reuses the solution of the previous \u03bb as the initial parameters\nfor the current \u03bb.\nOur main idea is to skip groups whose parameters must be zeros during the optimization by utilizing\nLemma 1. As upper bound U(g) in Lemma 1 can be computed with a low computation cost as\ndescribed in Lemma 2, we can efficiently avoid the computation of Equation (4), which is the main\nbottleneck of standard BCD. In addition, we extract the candidate group set before we start to optimize\nthe parameters for the current \u03bb. The impact of this extraction on the total computation cost is relatively\nlow, as shown in Lemma 5. We expect BCD to become more effective by preferentially updating\nthe parameters in the set, based on Lemma 4.\nAlgorithm 1 first precomputes ||\u02c6K(g)[l]||2 (lines 2\u20134), which is used for computing the upper\nbounds. In the loop of the sequential rule, we construct the candidate group set (lines 6\u201310). Although\nwe compute Equation (9) in the initial iteration, we reuse the term ||X(g)T\u02dcr(\u2212g)||2 of the previous\niteration in the equation for the other iterations. Next, BCD is performed on the parameter vectors\nof the set (lines 11\u201319). 
Then, the algorithm enters the loop of another BCD with upper bounds\n(lines 20\u201336). The reference parameter vector is set (line 21), and ||X(g)T\u02dcr(\u2212g)||2 is precomputed,\nwhich is also used for the computation of the upper bounds (lines 22 and 23). In the group-level outer\nloop, upper bound U(g) of group g is computed using Equation (7) (line 25). Note that Equation\n(6) is used for the initial computation of the upper bound. If bound U(g) does not exceed threshold\n\u221apg(1 \u2212 \u03b1)\u03bb, the parameters of the group are set to zeros by following Lemma 1 (lines 26 and\n27). If the bound does not meet the condition, the same procedure as that of the original BCD is\nperformed (lines 28\u201334). Next, ||\u03b2(g) \u2212 \u02dc\u03b2(g)||2, which is used for the computation of the upper bound,\nis updated (line 35).\nIn terms of the computation cost, our algorithm has the following property:\nTheorem 1 (Computation cost) Let S and S\u2032 be the rates of the un-skipped groups when Lemma 1\nand Equation (4) are used, respectively. Suppose that all groups have the same size, pg. If tm and
If tm and\n\n5\n\n\f(cid:46) A has all the group indices\n(cid:46) The precomputation for the bounds\n\n(cid:46) The loop for the sequential rule of regularization constants (\u03bbq)Q\u22121\n(cid:46) Initialize candidate group set C\n(cid:46) The loop for extracting candidate group set\n\nq=0\n\n(cid:46) Add groups to C by following Lemma 4\n\n(cid:46) The main loop for BCD on candidate group set C\n(cid:46) Group-level outer loop\n(cid:46) Check the condition of Equation (4)\n\n(cid:46) Element-level loop\n\n(cid:46) The main loop for BCD with the upper bounds\n(cid:46) Set the reference parameter vector\n(cid:46) The precomputation for the upper bounds\n\n(cid:46) Group-level outer loop\n\nrepeat\n\npg(1 \u2212 \u03b1)\u03bbq then\n\n\u03b2(g) \u2190 0;\n\nelse\n\nrepeat\n\nfor each l \u2208 A do\n\ncompute || \u02c6K(g)[l]||2;\n\nfor each g \u2208 C do\n\nif ||S(X(g)Tr(\u2212g), \u03b1\u03bbl)||2 \u2264 \u221a\n\ncompute C(g) by Equation (8);\nif C(g) >\n\npg(1 \u2212 \u03b1)\u03bbq then\n\nC = \u2205;\nfor each g \u2208 A do\n\u221a\nadd g to C;\n\nAlgorithm 1 Fast Sparse Group Lasso\n1: A = {1, ..., G}, \u03b2 \u2190 0, \u02dc\u03b2 \u2190 0;\n2: for each g \u2208 A do\n3:\n4:\n5: for q = 0 to Q \u2212 1 do\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14:\n15:\n16:\n17:\n18:\n19:\n20:\n21:\n22:\n23:\n24:\n25:\n26:\n27:\n28:\n29:\n30:\n31:\n32:\n33:\n34:\n35:\n36:\n\nfor each g \u2208 A do\ncompute U (g) by Equation (7);\nif U (g) \u2264 \u221a\n\u03b2(g) \u2190 0;\nif ||S(X(g)Tr(\u2212g), \u03b1\u03bbl)||2 \u2264 \u221a\n\nuntil \u03b2(g) converges\n\nupdate ||\u03b2(g) \u2212 \u02dc\u03b2(g)||2;\n\n\u02dc\u03b2 \u2190 \u03b2;\nfor each g \u2208 A do\n\ncompute ||X(g)T \u02dcr(\u2212g)||2;\n\n\u03b2(g) \u2190 0;\n\nelse\n\nrepeat\n\npg(1 \u2212 \u03b1)\u03bbq then\n\nupdate \u03b2(g) by Equation (2);\n\nuntil \u03b2(g) converges\n\nuntil \u03b2 converges\nrepeat\n\nelse\n\nuntil \u03b2 converges\n\n(cid:46) Skip the group whose parameters must be zeros by following 
Lemma 1\n\npg(1 \u2212 \u03b1)\u03bbq then\n\n(cid:46) Check the condition of Equation (4)\n\nupdate \u03b2(g) by Equation (2);\n\n(cid:46) Element-level loop\n\n(cid:46) For online update scheme for the upper bounds\n\ntf are the numbers of iterations of BCD for the main loop and element-level loop, respectively, our\ng) + S(cid:48)pgtm(tf pg + 1) + Q}) or O(G{(Q + Stm)npg +\napproach requires O(G{(Q + Stm)(ppg + p2\nS(cid:48)pgtm(tf pg + 1) + Q}) time.\nAccording to Theorem 1, when we have a large number of groups that are skipped on the basis\nof the upper bound, the rate of un-skipped groups S in Theorem 1 is small. As a result, the total\ncomputation cost is effectively reduced.\nIn terms of the convergence, our algorithm has the following property:\n\nTheorem 2 (Convergence property) Suppose that the regularization constants in Algorithm 1 are\nthe same as those of the original BCD, and the BCD converges. Then, the solution of Algorithm 1\nhas the same value of the objective function as that of the original BCD.\n\nTheorem 2 shows that our algorithm returns the same value of the objective function as the original\napproach. Therefore, our approach dose not decrease the accuracy compared to the original approach.\n\n4 Related Work\n\nTo improve the ef\ufb01ciency of optimization with sparsity-inducing regularization, safe screening is\ngenerally used [6]; it eliminates zero parameters in the solution before the optimization is initiated.\nAs the size of the feature vector can be reduced before optimizing the problem, the ef\ufb01ciency of the\noptimization is improved. The current state-of-the-art safe screening method for SGL is the GAP\n\n6\n\n\fSafe rule [18, 12\u201314], which is based on dual gap computation. The dual gap is computed as the\ndifference between the primal and dual problems of SGL. They de\ufb01ne a safe region that contains\nthe solution based on the dual gap. 
By utilizing the safe region, this approach can identify groups\nand features that must be inactive, and eliminates them. If the safe region is small, this approach\neffectively eliminates groups and features. However, unless \u03bb is large or a good approximate solution\nis already known, the screening is often ineffective [9]. To overcome this problem, Ndiaye et al. [14]\nused the dynamic safe rule [1, 2] with the GAP Safe rule for SGL. This dynamic GAP Safe rule\neffectively eliminates groups and features by repeatedly using the GAP Safe rule during the iterations\nof BCD.\n\n5 Experiments\n\nWe evaluated the processing time and prediction error of our approach by conducting experiments\non six datasets from the LIBSVM2 website (abalone, cpusmall, boston, bodyfat, eunite2001, and\npyrim). The numbers of data points were 4177, 8192, 506, 252, 336, and 74, respectively. In order\nto obtain group structure, we used the polynomial features of these datasets [15]. In particular, we\ncreated second-order polynomial features by following the method used in [16]. The groups, which\nconsisted of products over combinations of features up to the second degree, were created by using a\npolynomial kernel. As a result, the numbers of groups for each dataset were 36, 78, 91, 105, 136, and\n378, respectively. The total numbers of features were 176, 408, 481, 560, 736, and 2133, respectively.\nWe compared our method with the original BCD, GAP Safe rule [13], and dynamic GAP Safe\nrule [14]. We tuned \u03bb for all approaches based on the sequential rule by following the methods in\n[18, 12\u201314]. The search space was a decreasing sequence of Q parameters (\u03bb_q)_{q=0}^{Q\u22121} defined as\n\u03bb_q = \u03bb_max 10^{\u2212\u03b4q/(Q\u22121)}. We used \u03b4 = 4 and Q = 100 [18, 12\u201314]. 
For a fair comparison, \u03bbmax\nwas computed according to the dual norm by following the concept of GAP Safe rule [13]; Gap\nSafe rule safely eliminates groups and features under this setting. For dynamic GAP Safe rule, the\ninterval of dual gap computations is set to 10 [14]. For another tuning parameter \u03b1, we used the\nsettings \u03b1 \u2208 [0.2, 0.4, 0.6, 0.8]. We stopped the algorithm for each \u03bbq when the relative tolerance\n||\u03b2 \u2212 \u03b2new||2/||\u03b2new||2 dropped below 10\u22125 for all approaches [9, 10]. All the experiments were\nconducted on a Linux 2.20 GHz Intel Xeon server with 264 GB of main memory.\n\n5.1 Processing Time\nWe evaluated the processing times of the sequential rules for each \u03b1 \u2208 [0.2, 0.4, 0.6, 0.8]. Figure 1\nshows the processing time of each approach on the six datasets. Note that the processing times\ninclude precomputation times for a fair comparison. In the \ufb01gure, the terms origin, GAP, dynamic\nGAP, and ours represent the standard BCD, GAP Safe rule [13], dynamic GAP Safe rule [14], and\nour approach, respectively. Our approach is faster than the previous approaches for all datasets and\nhyperparameters; it reduces the processing time by up to 97% from the standard approach as shown\nin Figure 1 (f). Table 1 shows the number of computations for Equation (4), which is the main\nbottleneck of BCD. The result suggests the effectiveness of the upper bound and candidate group\nset, which effectively reduce the number of computations, and contribute to the reduction of the\nprocessing time, as shown in Figure 1. The GAP Safe rule and dynamic GAP Safe rule eliminate\ngroups and features that must be inactive, and increase the ef\ufb01ciency of BCD. However, when they\ncannot eliminate a signi\ufb01cant number of groups and features, they require a large computation cost\nfor BCD. 
To be specific, large numbers of groups and features remain when \u03bb has a small value\neven if we use dynamic GAP Safe rule. This is because the safe region is large for small \u03bb [12, 13],\nand it contains many groups and features that may be active. Furthermore, if the screening cannot\neliminate a significant number of groups and features, the processing time may increase owing to the\ncomputation of the dual gap, as shown for \u03b1 = 0.4 in Figure 1(a).\n\nFigure 1: Processing times of sequential rules for each hyperparameter \u03b1. Panels: (a) abalone, (b) cpusmall, (c) boston, (d) bodyfat, (e) eunite2001, (f) pyrim.\n\nTable 1: Numbers of computations for Eq. (4).\ndataset       origin          ours\nabalone       1.141 \u00d7 10^5    3.168 \u00d7 10^3\ncpusmall      2.105 \u00d7 10^5    7.768 \u00d7 10^4\nboston        1.248 \u00d7 10^6    9.998 \u00d7 10^4\nbodyfat       1.694 \u00d7 10^7    2.403 \u00d7 10^6\neunite2001    8.629 \u00d7 10^5    2.052 \u00d7 10^5\npyrim         7.667 \u00d7 10^7    7.523 \u00d7 10^6\n\nTable 2: Prediction errors.\ndataset       origin           ours\nabalone       2.232            2.232\ncpusmall      7.886            7.886\nboston        9.887            9.887\nbodyfat       5.434 \u00d7 10^\u22123    5.434 \u00d7 10^\u22123\neunite2001    2.010 \u00d7 10^2     2.010 \u00d7 10^2\npyrim         4.615 \u00d7 10^\u22123    4.615 \u00d7 10^\u22123\n\n2https://www.csie.ntu.edu.tw/~cjlin/libsvm/\n\n5.2 Accuracy\n\nIn this section, we evaluate the prediction error on test data to confirm the effectiveness of our\nalgorithm. We split the data into training and test data for each dataset. That is, 50% of a dataset was\nused as test data for evaluating the prediction error in terms of the squared loss for the response. The\nresults are shown in Table 2. The squared losses of our approach are the same as those of the original\napproach. 
This is because our approach is guaranteed to yield the same value of the objective function\nas that of the original approach, as described in Theorem 2. The results presented in Table 2 indicate\nthat our approach matches the prediction results of the original approach while improving the efficiency.\n\n6 Conclusion\n\nWe proposed a fast Block Coordinate Descent for Sparse Group Lasso. The main bottleneck of\nthe original Block Coordinate Descent is the computation to check whether groups have zero or\nnonzero parameter vectors, because it uses all the parameters or data points. In contrast, our approach\nidentifies the groups whose parameters must be zeros by using the parameters in the group, and\nskips the computation. Furthermore, the proposed approach identifies the candidate group set, which\ncontains the groups whose parameters must not be zeros. The parameters are preferentially updated\nin the set to raise the effectiveness of Block Coordinate Descent. The attractive point of our method\nis that it does not need any additional hyperparameters. In addition, it provably guarantees the\nsame results as the original method. The experimental results showed that our method reduces the\nprocessing time by up to 97% without any loss of accuracy compared with that of the original method.\n\n[Figure 1 plots: x-axis \u03b1 (0.2, 0.4, 0.6, 0.8); y-axis wall-clock time [s]; legend: origin, GAP, dynamic GAP, ours.]\n\nReferences\n[1] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and R\u00e9mi Gribonval. A Dynamic Screening\nPrinciple for the Lasso. 
In European Signal Processing Conference, pages 6\u201310, 2014.\n\n[2] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and R\u00e9mi Gribonval. Dynamic Screening: Accelerating First-Order Algorithms for the Lasso and Group-lasso. IEEE Trans. Signal Processing, 63(19):5121\u20135132, 2015.\n\n[3] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A Note on The Group Lasso and a Sparse Group Lasso. arXiv preprint arXiv:1001.0736, 2010.\n\n[4] Yasuhiro Fujiwara, Yasutoshi Ida, Junya Arai, Mai Nishimura, and Sotetsu Iwamura. Fast Algorithm for the Lasso based L1-Graph Construction. PVLDB, 10(3):229\u2013240, 2016.\n\n[5] Yasuhiro Fujiwara, Yasutoshi Ida, Hiroaki Shiokawa, and Sotetsu Iwamura. Fast Lasso Algorithm via Selective Coordinate Descent. In Proceedings of AAAI Conference on Artificial Intelligence, pages 1561\u20131567, 2016.\n\n[6] Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. Strong Rules for Discarding Predictors in Lasso-type Problems. Pacific Journal of Optimization, 8(4):667\u2013698, 2012.\n\n[7] Jian Huang, Patrick Breheny, and Shuangge Ma. A Selective Review of Group Selection in High-Dimensional Models. Statistical Science, 27(4):481\u2013499, 2012.\n\n[8] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group Lasso with Overlap and Graph Lasso. In Proceedings of International Conference on Machine Learning (ICML), pages 433\u2013440, 2009.\n\n[9] Tyler B. Johnson and Carlos Guestrin. Unified Methods for Exploiting Piecewise Linear Structure in Convex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 4754\u20134762, 2016.\n\n[10] Tyler B. Johnson and Carlos Guestrin. StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent. In Proceedings of International Conference on Machine Learning (ICML), pages 1752\u20131760, 2017.\n\n[11] Dougu Nam and Seon-Young Kim. Gene-set Approach for Expression Pattern Analysis. 
Brief. Bioinforma., 9(3):189\u2013197, 2008.\n\n[12] Eug\u00e8ne Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. GAP Safe Screening Rules for Sparse Multi-task and Multi-class models. In Advances in Neural Information Processing Systems (NeurIPS), pages 811\u2013819, 2015.\n\n[13] Eug\u00e8ne Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. GAP Safe Screening Rules for Sparse-Group Lasso. In Advances in Neural Information Processing Systems (NeurIPS), pages 388\u2013396, 2016.\n\n[14] Eug\u00e8ne Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap Safe Screening Rules for Sparsity Enforcing Penalties. Journal of Machine Learning Research (JMLR), 18(1):4671\u20134703, 2017.\n\n[15] Paul Pavlidis, Jason Weston, Jinsong Cai, and William Noble Grundy. Gene Functional Classification from Heterogeneous Data. In Proceedings of the Fifth Annual International Conference on Computational Biology, pages 249\u2013255. ACM, 2001.\n\n[16] Volker Roth and Bernd Fischer. The group-Lasso for Generalized Linear Models: Uniqueness of Solutions and Efficient Algorithms. In Proceedings of International Conference on Machine Learning (ICML), pages 848\u2013855, 2008.\n\n[17] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A Sparse-Group Lasso. Journal of Computational and Graphical Statistics, 22(2):231\u2013245, 2013.\n\n[18] Jie Wang and Jieping Ye. Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2132\u20132140, 2014.", "award": [], "sourceid": 958, "authors": [{"given_name": "Yasutoshi", "family_name": "Ida", "institution": "NTT"}, {"given_name": "Yasuhiro", "family_name": "Fujiwara", "institution": "NTT Communication Science Laboratories"}, {"given_name": "Hisashi", "family_name": "Kashima", "institution": "Kyoto University/RIKEN Center for AIP"}]}