{"title": "Exclusive Feature Learning on Arbitrary Structures via $\\ell_{1,2}$-norm", "book": "Advances in Neural Information Processing Systems", "page_first": 1655, "page_last": 1663, "abstract": "Group lasso is widely used to enforce structural sparsity, achieving sparsity at the inter-group level. In this paper, we propose a new formulation called ``exclusive group lasso'', which brings out sparsity at the intra-group level in the context of feature selection. The proposed exclusive group lasso is applicable to any feature structure, whether overlapping or non-overlapping. We analyze the properties of the exclusive group lasso, and propose an effective iteratively re-weighted algorithm to solve the corresponding optimization problem, with a rigorous convergence analysis. We show applications of the exclusive group lasso to uncorrelated feature selection. Extensive experiments on both synthetic and real-world datasets indicate the good performance of the proposed methods.", "full_text": "Exclusive Feature Learning on Arbitrary Structures via ℓ1,2-norm

Deguang Kong1, Ryohei Fujimaki2, Ji Liu3, Feiping Nie1, Chris Ding1
1 Dept. of Computer Science, University of Texas Arlington, TX, 76019;
2 NEC Laboratories America, Cupertino, CA, 95014;
3 Dept. of Computer Science, University of Rochester, Rochester, NY, 14627
Email: doogkong@gmail.com, rfujimaki@nec-labs.com, jliu@cs.rochester.edu, feipingnie@gmail.com, chqding@uta.edu

Abstract

Group LASSO is widely used to enforce structural sparsity, achieving sparsity at the inter-group level. In this paper, we propose a new formulation called "exclusive group LASSO", which brings out sparsity at the intra-group level in the context of feature selection. The proposed exclusive group LASSO is applicable to any feature structure, regardless of whether the groups overlap.
We provide an analysis of the properties of the exclusive group LASSO, and propose an effective iteratively re-weighted algorithm to solve the corresponding optimization problem, with a rigorous convergence analysis. We show applications of the exclusive group LASSO to uncorrelated feature selection. Extensive experiments on both synthetic and real-world datasets validate the proposed method.

1 Introduction

Structured-sparsity-inducing regularization terms [1, 8] have recently been widely used for feature learning, due to the inherent sparse structure of real-world data. Both theoretical and empirical studies have suggested the power of structured sparsity for feature learning, e.g., LASSO [24], group LASSO [29], exclusive LASSO [31], fused LASSO [25], and generalized LASSO [22]. By trading off the regularization term against the loss function, the sparsity-inducing optimization problem is expected to fit the data with better statistical properties. Moreover, the results obtained from sparse learning are easier to interpret, which gives insights for many practical applications, such as gene-expression analysis [9], human activity recognition [14], electronic medical records analysis [30], etc.

Motivation Of all the above sparse learning methods, group LASSO [29] is known to enforce sparsity on variables at an inter-group level, where variables from different groups compete to survive. Our work is motivated by a simple observation: in practice, not only do features from different groups compete to survive (i.e., group LASSO), but features within a seemingly cohesive group also compete with each other. The winner features in a group are set to large values, while the loser features are set to zero. This leads to sparsity at the intra-group level. To distinguish it from standard LASSO and group LASSO, we call it the "exclusive group LASSO" regularizer.
In the "exclusive group LASSO" regularizer, intra-group sparsity is achieved via the ℓ1 norm, while inter-group non-sparsity is achieved via the ℓ2 norm. Essentially, standard group LASSO achieves sparsity via the ℓ2,1 norm, while the proposed exclusive group LASSO achieves sparsity via the ℓ1,2 norm. An example of exclusive group LASSO is shown in Fig. 1 via Eq. (2). The significant difference from standard LASSO is that similar features in different groups are encouraged to co-exist (LASSO usually allows only one of them to survive). Overall, the exclusive group LASSO regularization encourages intra-group competition but discourages inter-group competition.

Figure 1: Explanation of the differences between group LASSO and exclusive group LASSO. Group setting: G_1 = {1, 2}, G_2 = {3, 4}, G_3 = {5, 6, 7}. The group LASSO solution of Eq. (3) at λ = 2 using the least square loss is: w = [0.0337; 0.0891; 0; 0; −0.2532; 0.043; 0.015]. The exclusive group LASSO solution of Eq. (2) at λ = 10 is: w = [0.0749; 0; 0; −0.0713; −0.1888; 0; 0]. Clearly, group LASSO introduces sparsity at an inter-group level, whereas exclusive LASSO enforces sparsity at an intra-group level.

We note that "exclusive LASSO" was first used in [31] for multi-task learning. Our "exclusive group LASSO" work, however, clearly differs from [31]: (1) we give a clear physical intuition for "exclusive group LASSO", which leads to sparsity at an intra-group level (Eq. 2), whereas [31] focuses on the "exclusive LASSO" problem in a multi-task setting; (2) we target a general "group" setting which allows arbitrary group structure, and which can easily be extended to multi-task/multi-label learning.

The main contributions of this paper are: (1) we propose a new formulation of "exclusive group LASSO" with a clear physical meaning, which allows arbitrary structure on the feature space; (2) we propose an effective iteratively re-weighted algorithm to tackle the non-smooth "exclusive group LASSO" term, with rigorous convergence guarantees; moreover, an effective algorithm is proposed to handle both the non-smooth ℓ1 and exclusive group LASSO terms (Lemma 4.1); (3) the proposed approach is validated via experiments on both synthetic and real datasets, specifically for uncorrelated feature selection problems.

Notation Throughout the paper, matrices are written as boldface uppercase, vectors are written as boldface lowercase, and scalars are denoted by lower-case letters (a, b). n is the number of data points, p is the dimension of the data, and K is the number of classes in a dataset. For any vector w ∈ R^p, the ℓq norm of w is ||w||_q = (Σ_{j=1}^p |w_j|^q)^{1/q} for q ∈ (0, ∞). A group of variables is a subset g ⊂ {1, 2, ..., p}. Thus, the set of possible groups is the power set of {1, 2, ..., p}: P({1, 2, ..., p}). G_g ∈ P({1, 2, ..., p}) denotes a group g, which is known in advance depending on the application. If two groups have one or more variables in common, we say that they are overlapped. For any group variable w_{G_g} ∈ R^p, only the entries in the group g are preserved (equal to those in w), while the other entries are set to zero. For example, if G_g = {1, 2, 4}, then w_{G_g} = [w_1, w_2, 0, w_4, 0, ..., 0] and ||w_{G_g}||_2 = sqrt(w_1^2 + w_2^2 + w_4^2). Let supp(w) ⊂ {1, 2, ..., p} be the set where w_i ≠ 0, and zero(w) ⊂ {1, 2, ..., p} the set where w_i = 0. Clearly, zero(w) = {1, 2, ..., p} \ supp(w). Let ∇f(w) be the gradient of f at w ∈ R^p, for any differentiable function f: R^p → R.

2 Exclusive group LASSO

Let G be a group set. The exclusive group LASSO penalty is defined as:

    ∀w ∈ R^p,  Ω_Eg^G(w) = Σ_{g∈G} ||w_{G_g}||_1^2.    (1)

When the groups g form a partition of the set of variables, Ω_Eg^G is an ℓ1/ℓ2-type penalty: an ℓ2-style norm is enforced across the different groups, while within each group an ℓ1 norm sums the intra-group variables. Minimizing such a convex risk function often leads to a solution in which some entries in a group are zero. For example, for a group G_g = {1, 2, 4}, there exists a solution w such that w_1 = 0, w_2 ≠ 0, w_4 ≠ 0. A concrete example is shown in Fig. 1, in which we solve:

    min_{w∈R^p} J_1(w),  J_1(w) = f(w) + λ Ω_Eg^G(w),    (2)

using the least square loss function f(w) = ||y − X^T w||_2^2, as compared to the standard group LASSO [29] solution of Eq. (3).

Figure 2: (a-b): Geometric shape of Ω(w) ≤ 1 in R^3. (a) non-overlap exclusive group LASSO: Ω(w) = (|w_1| + |w_2|)^2 + (|w_3|)^2; (b) overlap exclusive group LASSO: Ω(w) = (|w_1| + |w_2|)^2 + (|w_2| + |w_3|)^2; (c) feature correlation matrix R on the dataset House (506 data points, 14 variables). R_ij indicates the feature correlation between feature i and j.
Red colors indicate large values, while blue colors indicate small values.

    min_{w∈R^p} f(w) + λ Σ_g ||w_{G_g}||_2.    (3)

We observe that group LASSO introduces sparsity at an inter-group level, whereas exclusive LASSO enforces sparsity at an intra-group level.

Analysis of exclusive group LASSO For each group g, a feature index u ∈ supp(g) will be non-zero. Let v_g ∈ R^p be a variable which preserves the values of the non-zero indices for group g. Considering all groups, for the optimization goal w, we have supp(w) = ∪_g supp(v_g). (1) In the non-overlapping case, the groups form a partition of the feature set {1, 2, ..., p}, and there exists a unique decomposition w = Σ_g v_g. Since no two different groups G_u and G_v have common elements, i.e., supp(w_{G_u}) ∩ supp(w_{G_v}) = ∅, it is easy to see that v_g = w_{G_g}, ∀g ∈ G. (2) For overlapping groups, however, there can be element sets I ⊂ (G_u ∩ G_v), and therefore different groups G_u and G_v may have opposite effects on the features in I. A feature i ∈ I is prone to take different values if optimized separately, i.e., (w_{G_u})_i ≠ (w_{G_v})_i. For example, with G_u = [1, 2] and G_v = [2, 3], group u may require w_2 = 0 while group v requires w_2 ≠ 0. Thus, there are many possible combinations of feature values, and this leads to: Ω_Eg^G = inf_{Σ_g v_g = w} Σ_g ||v_g||_1^2. Further, if some groups overlap, the final zero set is contained in the intersection of the groups' zero sets: zero(w) ⊂ ∩_g zero(v_g).

Illustration of the geometric shape of exclusive LASSO Figure 2 shows the geometric shape of the norm ball in R^3 under different group settings, where in (a): G_1 = [1, 2], G_2 = [3]; and in (b): G_1 = [1, 2], G_2 = [2, 3]. In the non-overlapping case, variables w_1 and w_2 usually cannot be zero simultaneously. In contrast, in the overlapping case, variable w_2 cannot be zero unless both groups G_1 and G_2 require w_2 = 0.

Properties of exclusive LASSO The regularization term of Eq. (1) is convex. If ∪_{g∈G} G_g = {1, 2, ..., p}, then Ω_E^G(w) := sqrt(Ω_Eg^G(w)) is a norm. See the Appendix for proofs.

3 An effective algorithm for solving the Ω_Eg^G regularizer

The challenge in solving problems regularized by Eq. (1) is to tackle the exclusive group LASSO term, where f(w) can be any convex loss function w.r.t. w. The exclusive group LASSO term is generally much harder to handle than the standard LASSO term (which admits shrinkage thresholding). Existing algorithms formulate it as a quadratic programming problem [19], which can be solved by interior point or active set methods. However, the computational cost is expensive, which limits its use in practice. Recently, a primal-dual algorithm [27] was proposed to solve a similar problem, casting the non-smooth problem as a min-max problem. However, that algorithm is a gradient-descent-type method and converges slowly. Moreover, it is designed for the multi-task learning problem and cannot be applied directly to the exclusive group LASSO problem with arbitrary structures.

In the following, we first derive a very efficient yet simple algorithm. The proposed algorithm is generic: it allows arbitrary structure on the feature space, irrespective of specific feature structures [10], e.g., linear structure [28], tree structure [15], graph structure [7], etc. Theoretical analysis guarantees the convergence of the algorithm.
Moreover, the algorithm is easy to implement and ready to use in practice.

Key idea The idea of the proposed algorithm is to find an auxiliary function for Eq. (1) which can be easily solved. The updating rule for w is then derived. Finally, we prove that the solution is exactly the optimal solution of the original problem. Since the problem is convex, the optimal solution is the global optimum.

Procedure Instead of directly optimizing Eq. (1), we propose to optimize the following objective (the reasons will be seen immediately below):

    J_2(w) = f(w) + λ w^T F w,    (4)

where F ∈ R^{p×p} is a diagonal matrix which encodes the exclusive group information, and whose diagonal elements are given by

    F_ii = (Σ_g (I_{G_g})_i ||w_{G_g}||_1) / |w_i|.    (5)

Here I_{G_g} ∈ {0, 1}^{p×1} is the group index indicator for group g ∈ G. For example, if group G_1 is {1, 2}, then I_{G_1} = [1, 1, 0, ..., 0]. Thus the group variable w_{G_g} can be explicitly expressed as w_{G_g} = diag(I_{G_g}) × w.

Note that the computation of F depends on w, and thus the minimization over w depends on F. In the following, we propose an efficient iteratively re-weighted algorithm to find the globally optimal w, where in each iteration w is updated along the gradient descent direction. This process is iterated until the algorithm converges. Taking the derivative of Eq. (4) w.r.t. w and setting ∂J_2/∂w = 0, we have

    ∇_w f(w) + 2λFw = 0.    (6)

The complete algorithm is then:
(1) update w^t via Eq. (6);
(2) update F^t via Eq. (5).
The above two steps are iterated until the algorithm converges. We can prove that the obtained solution is exactly the global optimal solution of Eq. (1).

3.1 Convergence Analysis

In the following, we prove the convergence of the algorithm.

Theorem 3.1. Under the updating rule of Eq.
(6), J_1(w^{t+1}) − J_1(w^t) ≤ 0.

The proof is provided in the Appendix.

Discussion We note that a re-weighting strategy [26] was also used to solve problems such as zero-norm regularization of the parameters of linear models. However, it cannot be used directly to solve the "exclusive group LASSO" problem proposed in this paper, and it cannot handle arbitrary structures on the feature space.

Note on Eq. (5): when w_i = 0, F_ii is related to the subgradient of the penalty w.r.t. w_i. However, we cannot simply set F_ii = 0; otherwise the derived algorithm cannot be guaranteed to converge. Instead, we can regularize F_ii = Σ_g (I_{G_g})_i ||w_{G_g}||_1 / sqrt(w_i^2 + ε); the derived algorithm can then be proved to minimize the regularized penalty Σ_g ||(w + ε)_{G_g}||_1^2. It is easy to see that the regularized exclusive ℓ1 norm of w approximates the exclusive ℓ1 norm of w as ε → 0+.

Table 1: Characteristics of datasets

    Dataset      #data  #dimension  domain
    isolet       1560   617         UCI
    ionosphere   351    34          UCI
    mnist(0,1)   3125   784         image
    Leuml        72     3571        biology

4 Uncorrelated feature learning via exclusive group LASSO

Motivation It is known that in LASSO-type variable selection (including the elastic net) [24, 32], variable correlations are not taken into account. Therefore, some strongly correlated variables tend to be in or out of the model together. In practice, however, feature variables are often correlated. See the example on the housing dataset [4] with 506 samples and 14 attributes: although there are only 14 attributes, feature 5 is highly correlated with features 6, 7, 11, 12, etc. Moreover, strongly correlated variables may share similar properties, with overlapping or redundant information. Especially when the number of selected variables is limited, discriminative information with minimal correlation is desirable for prediction or classification purposes.
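The correlation structure described here is easy to inspect numerically. Below is a minimal sketch (our own illustration, not the authors' code; the random matrix merely stands in for a real dataset such as House), computing the absolute uncentered feature correlations that Eq. (8) formalizes:

```python
import numpy as np

def abs_feature_correlation(X):
    """Absolute uncentered correlation between the rows (features) of X (p x n):
    R[s, t] = |sum_i X[s, i] X[t, i]| / (||X[s]|| * ||X[t]||)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # (p, 1) feature norms
    return np.abs(X @ X.T) / (norms * norms.T)

rng = np.random.default_rng(0)
p, n = 6, 500
X = rng.normal(size=(p, n))
X[5] = 0.9 * X[4] + 0.1 * rng.normal(size=n)  # plant one highly correlated pair

R = abs_feature_correlation(X)
# R is symmetric with unit diagonal; R[4, 5] is close to 1, other pairs are small.
print(np.round(R, 2))
```

With a threshold θ, any pair (s, t) with R[s, t] > θ would be flagged as correlated.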
Therefore, it is natural to eliminate these correlations in the feature learning process.

Formulation The above observations motivate our work on uncorrelated feature learning via exclusive group LASSO. We consider the variable selection problem based on a LASSO-type optimization, where we make the selected variables as uncorrelated as possible. To be exact, we propose to optimize the following objective:

    min_{w∈R^p} f(w) + α||w||_1 + β Σ_g ||w_{G_g}||_1^2,    (7)

where f(w) is a loss function involving the class predictor y ∈ R^n and the data matrix X = [x_1, x_2, ..., x_n] ∈ R^{p×n}, and ||w_{G_g}||_1^2 is the exclusive group LASSO term involving feature correlation information. α and β are tuning parameters which balance the plain LASSO term and the exclusive group LASSO term.

The core of Eq. (7) is the use of the exclusive group LASSO regularizer to eliminate correlated features, which cannot be done by plain LASSO. Let the feature correlation matrix be R = (R_st) ∈ R^{p×p}; clearly R = R^T, and R_st represents the correlation between features s and t, i.e.,

    R_st = |Σ_i X_si X_ti| / (sqrt(Σ_i X_si^2) sqrt(Σ_i X_ti^2)).    (8)

To make the selected features as uncorrelated as possible, for any two features s, t whose correlation satisfies R_st > θ, we put them into an exclusive group. Therefore, only one feature can survive.
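This grouping rule is straightforward to implement. A hedged sketch (function names and the toy matrix are our own), pairing every two features whose correlation exceeds θ and evaluating the resulting penalty of Eq. (1):

```python
import numpy as np

def exclusive_groups_from_correlation(R, theta):
    """Each feature pair (s, t), s < t, with R[s, t] > theta forms one
    exclusive group, as described in the text."""
    p = R.shape[0]
    return [(s, t) for s in range(p) for t in range(s + 1, p) if R[s, t] > theta]

def exclusive_penalty(w, groups):
    """Exclusive group LASSO penalty of Eq. (1): sum_g ||w_Gg||_1^2."""
    return sum(np.abs(w[list(g)]).sum() ** 2 for g in groups)

# Toy correlation matrix: features 0 and 1 are strongly correlated.
R = np.array([[1.00, 0.95, 0.10],
              [0.95, 1.00, 0.20],
              [0.10, 0.20, 1.00]])
groups = exclusive_groups_from_correlation(R, theta=0.9)
w = np.array([0.5, -0.25, 0.3])
print(groups)                        # [(0, 1)]
print(exclusive_penalty(w, groups))  # (0.5 + 0.25)^2 = 0.5625
```

Only one member of each such pair is then encouraged to survive by the intra-group ℓ1 competition.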
For example, in Fig. 2(c), if we use θ = 0.93 as the threshold, we generate the following exclusive group LASSO term:

    Σ_g ||w_{G_g}||_1^2 = (|w_3| + |w_10|)^2 + (|w_5| + |w_6|)^2 + (|w_5| + |w_7|)^2 + (|w_5| + |w_11|)^2 + (|w_6| + |w_11|)^2 + (|w_6| + |w_12|)^2 + (|w_6| + |w_14|)^2 + (|w_7| + |w_11|)^2.    (9)

Algorithm Solving Eq. (7) is a convex optimization problem, because all three terms involved are convex. This also implies that there exists a unique global solution. Eq. (7) can be efficiently solved via the accelerated proximal gradient (FISTA) method [17, 2], irrespective of the loss function used in minimizing the empirical risk. Solving Eq. (7) is thus transformed into solving:

    min_{w∈R^p} (1/2)||w − a||_2^2 + α||w||_1 + β Σ_g ||w_{G_g}||_1^2,    (10)

where a = w^t − (1/L_t) ∇f(w^t) involves the current iterate w^t and the step size L_t. The challenge in solving Eq. (10) is that it involves two non-smooth terms. Fortunately, the following lemma relates the optimal solution of Eq. (10) to that of Eq. (11), whose solution has been well discussed in §3:

    min_{w∈R^p} (1/2)||w − u||_2^2 + β Σ_g ||w_{G_g}||_1^2.    (11)

Lemma 4.1. The optimal solution to Eq. (10) is the optimal solution to Eq. (11), where

    u = arg min_x (1/2)||x − a||_2^2 + α||x||_1 = sgn(a) ⊙ (|a| − α)_+,    (12)

and sgn(·), SGN(·)
are the operators defined componentwise: if v > 0, then sgn(v) = 1 and SGN(v) = {1}; if v = 0, then sgn(v) = 0 and SGN(v) = [−1, 1]; if v < 0, then sgn(v) = −1 and SGN(v) = {−1}.

The proof is provided in the Appendix.

Figure 3: (a-d): Feature selection results on the synthetic datasets using (a, b) the linear structure and (c, d) the hub structure. Panels: (a) RMSE on linear structure, (b) MAE on linear structure, (c) RMSE on hub structure, (d) MAE on hub structure. Evaluation metrics: RMSE, MAE. x-axis: number of selected features; y-axis: RMSE or MAE error in log scale. (e-h): Classification accuracy using SVM (linear kernel) with different numbers of selected features on four datasets: (e) isolet, (f) ionosphere, (g) mnist (0,1), (h) leuml. Compared methods: exclusive LASSO of Eq. (7), LASSO, ReliefF [21], F-statistics [3]. x-axis: number of selected features; y-axis: classification accuracy.

5 Experiment Results

To validate the effectiveness of our method, we first conduct experiments using Eq. (7) on two synthetic datasets, and then show experiments on real-world datasets.

5.1 Synthetic datasets

(1) Linear-correlated features. Let X1 = [x^1_1, x^1_2, ..., x^1_n] ∈ R^{p×n} and X2 = [x^2_1, x^2_2, ..., x^2_n] ∈ R^{p×n}, where each x^1_i ∼ N(0_{p×1}, I_{p×p}) and each x^2_i ∼ N(0_{p×1}, I_{p×p}); I is the identity matrix. We generate a group of p features as a linear combination of the features in X1 and X2, i.e., X3 = 0.5(X1 + X2) + ε, ε ∼ N(−0.1e, 0.1 I_{p×p}). Construct the data matrix X = [X1; X2; X3]; clearly, X ∈ R^{3p×n}. The features in dimensions [2p + 1, 3p] are highly correlated with the features in dimensions [1, p] and [p + 1, 2p]. Let w^1, w^2 ∈ R^p, where each w^1_i ∼ Uniform(−0.5, 0.5) and each w^2_i ∼ Uniform(−0.5, 0.5). Let w̃ = [w^1; w^2; 0_{p×1}]. We generate the predictor y ∈ R^n as y = w̃^T X + ε_y, where (ε_y)_i ∼ N(0, 0.1).

We solve Eq. (7) on y and X with the least square loss. The group settings are (i, p + i, 2p + i), for 1 ≤ i ≤ p. We compare the computed w* against the ground truth solution w̃ and the plain LASSO solution (i.e., β = 0 in Eq. 7). We use the root mean square error (RMSE) and the mean absolute error (MAE) to evaluate the differences between the values predicted by a model and the values actually observed. We generate n = 1000 data points, with p = [120, 140, ..., 220, 240], and do 5-fold cross-validation. The generalization errors in RMSE and MAE are shown in Figures 3(a) and 3(b). Clearly, our approach outperforms the standard LASSO solution and exactly recovers the true features.

(2) Correlated features on a hub structure. Let X = [X1; X2; ...; XB] ∈ R^{q×n}, where each block Xb = [X^b_{1:}; X^b_{2:}; ...; X^b_{p:}] ∈ R^{p×n}, 1 ≤ b ≤ B, q = p × B. In each block, for each data point 1 ≤ i ≤ n, X^b_{1i} = (1/B) Σ_{2≤j≤p} X^b_{ji} + (1/B) z_i + ε^b_i, where X^b_{ji} ∼ N(0, 1), z_i ∼ N(0, 1), and ε^b_i ∼ Uniform(−0.1, 0.1). Let w^1, w^2, ..., w^B ∈ R^p, where w^b = [w^b_1, 0]^T and w^b_1 ∼ Uniform(−0.5, 0.5). Let w̃ = [w^1; w^2; ...; w^B]; we generate the predictor y ∈ R^n as y = w̃^T X + ε_y, where (ε_y)_i ∼ N(0, 0.1).

The group settings are ((b − 1) × p + 1, ..., b × p), for 1 ≤ b ≤ B. We generate n = 1000 data points, B = 10, with varied p = [20, 21, ..., 24, 25], and do 5-fold cross-validation. The generalization errors in RMSE and MAE are shown in Figs. 3(c) and 3(d). Clearly, our approach outperforms the standard LASSO solution and recovers the exact features.

5.2 Real-world datasets

To validate the effectiveness of the proposed method, we perform feature selection via the proposed uncorrelated feature learning framework of Eq. (7) on 4 datasets (shown in Table 1), including 2 UCI datasets: isolet [6] and ionosphere [5]; 1 image dataset: mnist with only the digits "0" and "1" [16]; and 1 biology dataset: Leuml [13].

We perform classification tasks on these datasets. The compared methods include: the proposed method of Eq. (7) (shown as Exclusive), plain LASSO, ReliefF [21], and F-statistics [3].
We use logistic regression as the loss function in our method and in the plain LASSO method. In our method, the parameters α, β are tuned to select different numbers of features. The exclusive LASSO groups are set according to the feature correlations (i.e., the threshold θ is set to 0.90 in Eq. 8). After the specified number of features is selected, we feed them into a support vector machine (SVM) with a linear kernel; the classification results with different numbers of selected features are shown in Fig. 3.

A first glance at the experimental results indicates the better performance of our method as compared to plain LASSO. Moreover, our method is also generally better than two other popular feature selection methods, ReliefF and F-statistics. The experimental results further confirm our intuition: eliminating correlated features is really helpful for feature selection and thus improves classification performance. Because ℓ1,∞ [20], ℓ2,1 [12, 18], and non-convex feature learning via ℓp,∞ [11] (0 < p < 1) are designed for multi-task or multi-label feature learning, we do not compare against these methods.

Further, we list the mean and variance of the classification accuracy of the different algorithms in the following table, using 50% of all the features. The compared methods are: (1) LASSO (ℓ1); (2) plain exclusive LASSO (α = 0 in Eq. (7)); (3) exclusive group LASSO (α > 0 in Eq. (7)).

    dataset      # of features  LASSO         plain exclusive  exclusive group LASSO
    isolet       308            81.75 ± 0.49  82.05 ± 0.50     83.24 ± 0.23
    ionosphere   17             85.10 ± 0.27  85.21 ± 0.31     87.28 ± 0.42
    mnist(0,1)   392            92.35 ± 0.13  93.07 ± 0.20     94.51 ± 0.19
    leuml        1785           95.10 ± 0.31  95.67 ± 0.24     97.70 ± 0.27

The above experimental results indicate that the advantage of our method (exclusive group LASSO) over plain LASSO comes from the exclusive LASSO term. The results also suggest that plain exclusive LASSO performs very similarly to LASSO, whereas exclusive group LASSO (α > 0 in Eq. 7) performs definitely better than both standard LASSO and plain exclusive LASSO (1%-4% performance improvement). The exclusive LASSO regularizer eliminates the correlated and redundant features.

We show the running time of plain exclusive LASSO and exclusive group LASSO (α > 0 in Eq. 7) in the following table. We run the different algorithms on a desktop with an Intel i5-3317 CPU, 1.70GHz, and 8G RAM.

    dataset      plain exclusive (running time: sec)  exclusive group LASSO (running time: sec)
    isolet       47.24                                51.93
    ionosphere   22.75                                24.18
    mnist(0,1)   123.45                               126.51
    leuml        142.19                               144.08

The above experimental results indicate that the computational cost of exclusive group LASSO is only slightly higher than that of plain exclusive LASSO. The reason is that the solution to exclusive group LASSO is given by a simple thresholding on the plain exclusive LASSO result. This further confirms the theoretical analysis shown in Lemma 4.1.

6 Conclusion

In this paper, we propose a new formulation called "exclusive group LASSO" to enforce sparsity on features at an intra-group level. We investigate its properties and propose an effective algorithm with a rigorous convergence analysis. We show applications to uncorrelated feature selection, which indicate the good performance of the proposed method. Our work can easily be extended to multi-task or multi-label learning.

Acknowledgement The majority of the work was done during the internship of the first author at NEC Laboratories America, Cupertino, CA.

Appendix

Proof that Ω_E^G is a valid norm. Note that if Ω_E^G(w) = 0, then w = 0.
For any scalar a, Ω_E^G(aw) = |a| Ω_E^G(w). This proves that absolute homogeneity and the zero property hold. Next we consider the triangle inequality. Consider w, w̃ ∈ R^p. Let v_g and ṽ_g be optimal decompositions of w and w̃, such that Ω_E^G(w) = sqrt(Σ_g ||v_g||_1^2) and Ω_E^G(w̃) = sqrt(Σ_g ||ṽ_g||_1^2). Since v_g + ṽ_g is a decomposition of w + w̃, we have:

    Ω_E^G(w + w̃) ≤ sqrt(Σ_g ||v_g + ṽ_g||_1^2) ≤ sqrt(Σ_g ||v_g||_1^2) + sqrt(Σ_g ||ṽ_g||_1^2) = Ω_E^G(w) + Ω_E^G(w̃). □

(The second inequality needs the Cauchy inequality [23]: for any scalars a_g, b_g, (Σ_g a_g b_g)^2 ≤ (Σ_g a_g^2)(Σ_g b_g^2); let a_g = ||v_g||_1, b_g = ||ṽ_g||_1.)

To prove Theorem 3.1, we need two lemmas.

Lemma 6.1. Under the updating rule of Eq. (6), J_2(w^{t+1}) < J_2(w^t).

Lemma 6.2. Under the updating rule of Eq. (6),

    J_1(w^{t+1}) − J_1(w^t) ≤ J_2(w^{t+1}) − J_2(w^t).    (13)

Proof of Theorem 3.1 From Lemma 6.1 and Lemma 6.2, it is easy to see that J_1(w^{t+1}) − J_1(w^t) ≤ J_2(w^{t+1}) − J_2(w^t) ≤ 0. This completes the proof. □

Proof of Lemma 6.1 Eq. (4) is a convex function, and the solution of Eq. (6) is obtained by setting the derivative ∂J_2/∂w = 0; thus the obtained w* is the global optimal solution, and J_2(w^{t+1}) < J_2(w^t). □

Before the proof of Lemma 6.2, we need the following proposition.

Proposition 6.3. w^T F w = Σ_{g=1}^G (||w_{G_g}||_1)^2.

Proof of Lemma 6.2 Let ∆ = LHS − RHS of Eq. (13). Up to the positive factor λ (which does not affect the sign), we have

    ∆ = Σ_g ||w^{t+1}_{G_g}||_1^2 − Σ_{i,g} (I_{G_g})_i ||w^t_{G_g}||_1 (w^{t+1}_i)^2 / |w^t_i| − [ Σ_g ||w^t_{G_g}||_1^2 − Σ_{i,g} (I_{G_g})_i ||w^t_{G_g}||_1 (w^t_i)^2 / |w^t_i| ]    (14)

where the bracketed term vanishes by Proposition 6.3. Therefore,

    ∆ = Σ_g [ (Σ_{i∈G_g} |w^{t+1}_i|)^2 − (Σ_{i∈G_g} |w^t_i|)(Σ_{i∈G_g} (w^{t+1}_i)^2 / |w^t_i|) ]    (15)

    = Σ_g [ (Σ_{i∈G_g} a_i b_i)^2 − (Σ_{i∈G_g} a_i^2)(Σ_{i∈G_g} b_i^2) ] ≤ 0,    (16)

where a_i = |w^{t+1}_i| / sqrt(|w^t_i|) and b_i = sqrt(|w^t_i|). The inequality holds due to the Cauchy inequality [23]: for any scalars a_i, b_i, (Σ_i a_i b_i)^2 ≤ (Σ_i a_i^2)(Σ_i b_i^2). □

Proof of Lemma 4.1 For notational simplicity, let Ω_Eg^G(w) = Σ_g ||w_{G_g}||_1^2. Let w* be the optimal solution to Eq. (11); then we have

    0 ∈ w* − u + β ∂Ω_Eg^G(w*).    (17)

We wish to prove that w* is also the global optimal solution to Eq. (10), i.e.,

    0 ∈ w* − a + α SGN(w*) + β ∂Ω_Eg^G(w*).    (18)

First, from Eq. (12), we have 0 ∈ u − a + α SGN(u), which leads to u ∈ a − α SGN(u). According to the definition of Ω_Eg^G(w), from Eq. (11) it is easy to verify that (1) if u_i = 0, then w_i = 0; (2) if u_i ≠ 0, then sign(w_i) = sign(u_i) and 0 ≤ |w_i| ≤ |u_i|. This indicates that SGN(u) ⊂ SGN(w), and thus

    u ∈ a − α SGN(w).    (19)

Putting Eqs. (17) and (19) together exactly recovers Eq. (18), which completes the proof. □

References

[1] F. Bach. Structured sparsity and convex optimization. In ICPRAM, 2012.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.
[3] J. D. F. Habbema and J. Hermans. Selection of variables in discriminant analysis by F-statistic and error rate. Technometrics, 1977.
[4] Housing. http://archive.ics.uci.edu/ml/datasets/Housing.
[5] Ionosphere. http://archive.ics.uci.edu/ml/datasets/Ionosphere.
[6] Isolet. http://archive.ics.uci.edu/ml/datasets/ISOLET.
[7] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML, page 55, 2009.
[8] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777-2824, 2011.
[9] S. Ji, L. Yuan, Y. Li, Z. Zhou, S. Kumar, and J. Ye. Drosophila gene expression pattern annotation using sparse features and term-term interactions. In KDD, pages 407-416, 2009.
[10] D. Kong and C. H. Q. Ding. Efficient algorithms for selecting features with arbitrary group constraints via group lasso. In ICDM, pages 379-388, 2013.
[11] D. Kong and C. H. Q. Ding. Non-convex feature learning via ℓp,∞ operator. In AAAI, pages 1918-1924, 2014.
[12] D. Kong, C. H. Q. Ding, and H. Huang. Robust nonnegative matrix factorization using ℓ2,1-norm. In CIKM, pages 673-682, 2011.
[13] Leuml.
http://www.stat.duke.edu/courses/Spring01/sta293b/datasets.html.
[14] J. Liu, R. Fujimaki, and J. Ye. Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In ICML, 2014.
[15] J. Liu and J. Ye. Moreau-Yosida regularization for grouped tree structure learning. In NIPS, pages 1459–1467, 2010.
[16] MNIST. http://yann.lecun.com/exdb/mnist/.
[17] Y. Nesterov. Gradient methods for minimizing composite objective function. ECORE Discussion Paper, 2007.
[18] F. Nie, H. Huang, X. Cai, and C. H. Q. Ding. Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization. In NIPS, pages 1813–1821, 2010.
[19] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, Berlin, New York, 2006.
[20] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for $\ell_{1,\infty}$ regularization. In ICML, page 108, 2009.
[21] M. Robnik-Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23–69, 2003.
[22] V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
[23] J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. MAA Problem Book Series. Cambridge University Press, Cambridge, New York, NY, 2004.
[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
[25] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, pages 91–108, 2005.
[26] J. Weston, A. Elisseeff, B. Schölkopf, and P. Kaelbling. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.
[27] T. Yang, R. Jin, M. Mahdavi, and S. Zhu. An efficient primal-dual prox method for non-smooth optimization. CoRR, abs/1201.5283, 2012.
[28] L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. In NIPS, pages 352–360, 2011.
[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
[30] J. Zhou, F. Wang, J. Hu, and J. Ye. From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In KDD, pages 135–144, 2014.
[31] Y. Zhou, R. Jin, and S. C. H. Hoi. Exclusive lasso for multi-task feature selection. Journal of Machine Learning Research - Proceedings Track, 9:988–995, 2010.
[32] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
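Supplementary note. The monotone-decrease argument of Theorem 3.1 (via Lemmas 6.1 and 6.2 and Proposition 6.3) is easy to sanity-check numerically. The sketch below is illustrative only: it assumes a least-squares loss and a hypothetical non-overlapping group structure (neither is fixed by the appendix), and uses the diagonal re-weighting $F_{ii} = \sum_g (\mathbf{I}_{G_g})_i \|\mathbf{w}_{G_g}\|_1 / |w_i|$ that appears in Eq.(14). At each step it exactly minimizes the surrogate $J_2$ and checks that the true objective $J_1$ never increases.

```python
import numpy as np

# Sanity check (illustrative, not the paper's exact setup): iteratively
# re-weighted minimization of J1(w) = 0.5*||Xw - y||^2 + lam * sum_g ||w_Gg||_1^2
# should be monotonically non-increasing in J1 (Theorem 3.1).

rng = np.random.default_rng(0)
n, p, lam = 30, 8, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
groups = [[0, 1, 2], [3, 4], [5, 6, 7]]  # hypothetical group structure


def J1(w):
    # Exclusive group lasso penalty: sum over groups of squared ell_1 norms.
    reg = sum(np.sum(np.abs(w[g])) ** 2 for g in groups)
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * reg


w = rng.standard_normal(p)
vals = [J1(w)]
for _ in range(50):
    # Diagonal re-weighting F_ii = sum_g (I_Gg)_i * ||w_Gg||_1 / |w_i|,
    # cf. Eq.(14); a small epsilon guards against division by zero.
    F = np.zeros(p)
    for g in groups:
        F[g] += np.sum(np.abs(w[g])) / np.maximum(np.abs(w[g]), 1e-12)
    # Minimize the surrogate J2(w) = 0.5*||Xw - y||^2 + lam * w^T F w;
    # its stationarity condition is a linear system in w.
    w = np.linalg.solve(X.T @ X + 2.0 * lam * np.diag(F), X.T @ y)
    vals.append(J1(w))

# Non-increasing sequence, as guaranteed by Lemmas 6.1 and 6.2.
assert all(vals[t + 1] <= vals[t] + 1e-6 for t in range(len(vals) - 1))
```

Note that by Proposition 6.3, $\mathbf{w}^T\mathbf{F}\mathbf{w}$ evaluated at the current iterate equals the exclusive penalty itself, so $J_2$ touches $J_1$ at $\mathbf{w}^t$ and majorizes it elsewhere, which is exactly why the update above cannot increase $J_1$.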