{"title": "Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1150, "page_last": 1158, "abstract": "We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm. More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as ``forward greedy feature selection algorithm'' for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upperbound on the $l_\\infty$ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy.", "full_text": "Group Orthogonal Matching Pursuit for Variable Selection and Prediction\n\nAur\u00e9lie C. Lozano, Grzegorz \u015awirszcz, Naoki Abe\n\nIBM Watson Research Center, 1101 Kitchawan Road, Yorktown Heights NY 10598, USA\n\n{aclozano,swirszcz,nabe}@us.ibm.com\n\nAbstract\n\nWe consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm.
More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as \u201cforward greedy feature selection algorithm\u201d for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upper bound on the $l_\\infty$ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy.\n\n1 Introduction\n\nWe address the problem of variable selection for regression, where a natural grouping structure exists within the explanatory variables, and the goal is to select the correct group of variables, rather than the individual variables. This problem arises in many situations, e.g. in multifactor ANOVA, generalized additive models, time series data analysis (where lagged variables belonging to the same time series may form a natural group), and gene expression analysis from microarray data (where genes belonging to the same functional cluster may be considered as a group). In these settings, selecting the right groups of variables is often more relevant to the subsequent use of estimated models, which may involve interpreting the models and making decisions based on them.\nRecently, several methods have been proposed to address this variable group selection problem in the context of linear regression [12, 15]. These methods are based on extending the Lasso formulation [8] by modifying the $l_1$ penalty to account for the group structure. Specifically, Yuan & Lin [12] proposed the Group Lasso, which solves\n\n$$\\arg\\min_{\\beta} \\left( \\frac{1}{2} \\Big\\| y - \\sum_{j=1}^{J} X_{G_j} \\beta_{G_j} \\Big\\|^2 + \\lambda \\sum_{j=1}^{J} \\|\\beta_{G_j}\\|_2 \\right),$$\n\nwhere $X_{G_1}, \\ldots, X_{G_J}$ are the natural groupings within the variables of $X$ and $\\beta_{G_j}$ are the coefficient vectors for variables in groups $G_j$. Zhao et al. [15] considered a more general penalty class, the Composite Absolute Penalties family $T(\\beta) = \\sum_{j=1}^{J} \\|\\beta_j\\|_{l_j}^{l_0}$, of which the Group Lasso penalty is a special instance. This development opens up a new direction of research, namely that of extending the existing regression methods with variable selection to the variable group selection problem and investigating to what extent they carry over to the new scenario.\nThe present paper establishes that indeed one recent advance in variable selection methods for regression, the \u201cforward greedy feature selection algorithm\u201d, also known as the Orthogonal Matching Pursuit (OMP) algorithm in the signal processing community [5], can be generalized to the current setting of group variable selection. Specifically we propose the \u201cGroup Orthogonal Matching Pursuit\u201d algorithm (Group-OMP), which extends the OMP algorithm to leverage variable groupings, and prove that, under certain conditions, Group-OMP can identify the correct (groups of) variables when the sample size tends to infinity. We also provide an upper bound on the $l_\\infty$ norm of the difference between the estimated regression coefficients and the true coefficients. Hence our results generalize those of Zhang [13], which established consistency of the standard OMP algorithm. A key technical contribution of this paper is to provide a condition for Group-OMP to be consistent, which generalizes the \u201cExact Recovery Condition\u201d of [9] (Theorem 3.1) stated for OMP under the noiseless case.
This result should also be of interest to the signal processing community in the context of block-sparse approximation of signals. We also conduct an empirical evaluation to compare the performance of Group-OMP with existing methods on simulated and real world datasets. Our results indicate that Group-OMP compares favorably to the Group Lasso, OMP and Lasso algorithms, both in terms of the accuracy of prediction and that of variable selection. Related work includes [10, 3], which use OMP for simultaneous sparse approximation, [11], which shows that standard MP selects features from correct groups, and [4], which considers a more general setting than ours.\nThe rest of the paper is organized as follows. Section 2 describes the proposed Group-OMP procedure. The consistency results are then stated in Section 3. The empirical evaluation results are presented in Section 4. We conclude the paper with some discussion in Section 5.\n\n2 Group Orthogonal Matching Pursuit\n\nConsider the general regression problem $y = X\\bar\\beta + \\nu$, where $y \\in \\mathbb{R}^n$ is the response vector, $X = [f_1, \\ldots, f_d] \\in \\mathbb{R}^{n \\times d}$ is the matrix of feature (or variable) vectors $f_j \\in \\mathbb{R}^n$, $\\bar\\beta \\in \\mathbb{R}^d$ is the coefficient vector and $\\nu \\in \\mathbb{R}^n$ is the noise vector. We assume that the noise components $\\nu_i$, $i = 1, \\ldots, n$, are independent Gaussian variables with mean 0 and variance $\\sigma^2$. For any $G \\subset \\{1, \\ldots, d\\}$ let $X_G$ denote the restriction of $X$ to the set of variables $\\{f_j, j \\in G\\}$, where the columns $f_j$ are arranged in ascending order. Similarly, for any vector $\\beta \\in \\mathbb{R}^d$ of regression coefficients, denote by $\\beta_G$ its restriction to $G$, with reordering in ascending order. Suppose that a natural grouping structure exists within the variables of $X$, consisting of $J$ groups $X_{G_1}, \\ldots, X_{G_J}$, where $G_i \\subset \\{1, \\ldots, d\\}$, $G_i \\cap G_j = \\emptyset$ for $i \\neq j$, and $X_{G_j} \\in \\mathbb{R}^{n \\times d_j}$.
Then, the above regression problem can be decomposed with respect to the groups, i.e. $y = \\sum_{j=1}^{J} X_{G_j} \\bar\\beta_{G_j} + \\nu$, where $\\bar\\beta_{G_j} \\in \\mathbb{R}^{d_j}$. Furthermore, to simplify the exposition, assume that each $X_{G_j}$ is orthonormalized, i.e. $X_{G_j}^* X_{G_j} = I_{d_j}$.\nGiven $\\beta \\in \\mathbb{R}^d$ let $\\mathrm{supp}(\\beta) = \\{j : \\beta_j \\neq 0\\}$. For any such $G$ and $v \\in \\mathbb{R}^n$, denote by $\\hat\\beta_X(G, v)$ the coefficients resulting from applying ordinary least squares (OLS) with non-zero coefficients restricted to $G$, i.e., $\\hat\\beta_X(G, v) = \\arg\\min_{\\beta \\in \\mathbb{R}^d} \\|X\\beta - v\\|_2^2$ subject to $\\mathrm{supp}(\\beta) \\subset G$. Given the above setup, the Group-OMP procedure we propose is described in Figure 1, which extends the OMP procedure to deal with group selection. Note that this procedure picks the best group in each iteration, with respect to reduction of the residual error, and it then re-estimates the coefficients, $\\beta^{(k)}$, as in OMP. We recall that this re-estimation step is what distinguishes OMP, and our group version, from standard boosting-like procedures.\n\n\u2022 Input: The data matrix $X = [f_1, \\ldots, f_d] \\in \\mathbb{R}^{n \\times d}$, with group structure $G_1, \\ldots, G_J$, such that $X_{G_j}^* X_{G_j} = I_{d_j}$. The response $y \\in \\mathbb{R}^n$. Precision $\\epsilon > 0$ for the stopping criterion.\n\u2022 Output: The selected groups $G^{(k)}$, the regression coefficients $\\beta^{(k)}$.\n\u2022 Initialization: $G^{(0)} = \\emptyset$, $\\beta^{(0)} = 0$.\nFor $k = 1, 2, \\ldots$\n    Let $j^{(k)} = \\arg\\max_j \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2$.   ($*$)\n    If ($\\|X_{G_{j^{(k)}}}^*(X\\beta^{(k-1)} - y)\\|_2 \\le \\epsilon$) break\n    Set $G^{(k)} = G^{(k-1)} \\cup G_{j^{(k)}}$. Let $\\beta^{(k)} = \\hat\\beta_X(G^{(k)}, y)$.\nEnd\n\nFigure 1: Method Group-OMP\n\n3 Consistency Results\n\n3.1 Notation\n\nLet $G_{good}$ denote the set of all the groups included in the true model. We refer to the groups in $G_{good}$ as good groups. Similarly we call $G_{bad}$ the set of all the groups which are not included. We let $g_{good}$ and $g_{bad}$ denote the sets of \u201cgood indices\u201d and \u201cbad indices\u201d, i.e. $g_{good} = \\bigcup_{G_i \\in G_{good}} G_i$ and $g_{bad} = \\bigcup_{G_i \\in G_{bad}} G_i$. When they are used to restrict index sets for matrix columns or vectors, they are assumed to be in canonical (ascending) order, as we did for $G$. Furthermore, the elements of $G_{good}$ are groups of indices, and $|G_{good}|$ is the number of groups in $G_{good}$, while $g_{good}$ is defined in terms of individual indices, i.e. $g_{good}$ is the set of indices corresponding to the groups in $G_{good}$. The same holds for $G_{bad}$ and $g_{bad}$. In this notation $\\mathrm{supp}(\\bar\\beta) \\subset g_{good}$.\nWe denote by $\\rho_X(G_{good})$ the smallest eigenvalue of $X_{g_{good}}^* X_{g_{good}}$, i.e.\n\n$$\\rho_X(G_{good}) = \\inf_{\\beta} \\left\\{ \\|X\\beta\\|_2^2 / \\|\\beta\\|_2^2 : \\mathrm{supp}(\\beta) \\subset g_{good} \\right\\}.$$\n\nHere and throughout the paper we let $A^*$ denote the conjugate transpose of the matrix $A$ (which, for a real matrix $A$, coincides with its transpose) and $A^+$ denote the Moore\u2013Penrose pseudoinverse of the matrix $A$ (cf. [6, 7]). If the rows of $A$ are linearly independent, $A^+ = A^*(AA^*)^{-1}$, and when the columns of $A$ are linearly independent, $A^+ = (A^*A)^{-1}A^*$. Generally, for $u = \\{u_1, \\ldots, u_{|g_{good}|}\\}$ and $v = \\{v_1, \\ldots, v_{|g_{bad}|}\\}$ we define\n\n$$\\|u\\|_{(2,1)}^{good} = \\sum_{G_i \\in G_{good}} \\sqrt{\\sum_{j \\in G_i} u_j^2}, \\quad \\text{and} \\quad \\|v\\|_{(2,1)}^{bad} = \\sum_{G_i \\in G_{bad}} \\sqrt{\\sum_{j \\in G_i} v_j^2},$$\n\nand then for any matrix $A \\in \\mathbb{R}^{|g_{good}| \\times |g_{bad}|}$, let $\\|A\\|_{(2,1)}^{good/bad} = \\sup_{\\|v\\|_{(2,1)}^{bad} = 1} \\|Av\\|_{(2,1)}^{good}$. Then we define $\\mu_X(G_{good}) = \\|X_{g_{good}}^+ X_{g_{bad}}\\|_{(2,1)}^{good/bad}$.\n\n3.2 The Noiseless Case\n\nWe first focus on the noiseless case (i.e. 
$\\nu \\equiv 0$). For all $k$, let $r_k = X\\beta^{(k)} - y$. In the noiseless case, we have $r_0 = -y \\in \\mathrm{Span}(G_{good})$. So if Group-OMP has not made a mistake up to round $k$, we also have $r_k \\in \\mathrm{Span}(G_{good})$. The following theorem and its corollary provide a condition which guarantees that Group-OMP does not make a mistake at the next iteration, given that it has not made any mistakes up to that point. By induction on $k$, it implies that Group-OMP never makes a mistake.\nTheorem 1. Reorder the groups in such a way that $G_{good} = \\{G_1, \\ldots, G_m\\}$ and $G_{bad} = \\{G_{m+1}, \\ldots, G_J\\}$. Let $r \\in \\mathrm{Span}(X_{g_{good}})$. Then the following holds\n\n$$\\frac{\\|(\\|X_{G_{m+1}}^* r\\|_2, \\|X_{G_{m+2}}^* r\\|_2, \\ldots, \\|X_{G_J}^* r\\|_2)\\|_\\infty}{\\|(\\|X_{G_1}^* r\\|_2, \\|X_{G_2}^* r\\|_2, \\ldots, \\|X_{G_m}^* r\\|_2)\\|_\\infty} \\le \\mu_X(G_{good}). \\quad (1)$$\n\nProof of Theorem 1. Reorder the groups in such a way that $G_{good} = \\{G_1, \\ldots, G_m\\}$ and $G_{bad} = \\{G_{m+1}, \\ldots, G_J\\}$. Let $\\Phi^* : \\mathbb{R}^n \\to \\mathbb{R}^{d_1} \\oplus \\mathbb{R}^{d_2} \\oplus \\ldots \\oplus \\mathbb{R}^{d_m}$ be defined as\n\n$$\\Phi^*(x) = \\left( (X_{G_1}^* x)^T, (X_{G_2}^* x)^T, \\ldots, (X_{G_m}^* x)^T \\right)^T$$\n\nand analogously let $\\Psi^* : \\mathbb{R}^n \\to \\mathbb{R}^{d_{m+1}} \\oplus \\mathbb{R}^{d_{m+2}} \\oplus \\ldots \\oplus \\mathbb{R}^{d_J}$ be defined as\n\n$$\\Psi^*(x) = \\left( (X_{G_{m+1}}^* x)^T, (X_{G_{m+2}}^* x)^T, \\ldots, (X_{G_J}^* x)^T \\right)^T.$$\n\nWe shall denote $V^\\Phi = \\mathbb{R}^{d_1} \\oplus \\mathbb{R}^{d_2} \\oplus \\ldots \\oplus \\mathbb{R}^{d_m}$ with a norm $\\|\\cdot\\|_{(2,\\infty)}^{\\Phi}$ defined as: $\\|(v_1, v_2, \\ldots, v_m)\\|_{(2,\\infty)}^{\\Phi} = \\|(\\|v_1\\|_2, \\|v_2\\|_2, \\ldots, \\|v_m\\|_2)\\|_\\infty$ for $v_i \\in \\mathbb{R}^{d_i}$, $i = 1, \\ldots, m$. Analogously $V^\\Psi = \\mathbb{R}^{d_{m+1}} \\oplus \\mathbb{R}^{d_{m+2}} \\oplus \\ldots \\oplus \\mathbb{R}^{d_J}$ with a norm $\\|\\cdot\\|_{(2,\\infty)}^{\\Psi}$ defined as: $\\|(v_1, v_2, \\ldots, v_{J-m})\\|_{(2,\\infty)}^{\\Psi} = \\|(\\|v_1\\|_2, \\|v_2\\|_2, \\ldots, \\|v_{J-m}\\|_2)\\|_\\infty$ for $v_j \\in \\mathbb{R}^{d_{m+j}}$, $j = 1, \\ldots, J - m$.\nIt is easy to verify that $\\|\\cdot\\|_{(2,\\infty)}^{\\Phi}$ and $\\|\\cdot\\|_{(2,\\infty)}^{\\Psi}$ are indeed norms. Now the condition expressed by Eq. (1) can be rephrased as\n\n$$\\frac{\\|\\Psi^*(r)\\|_{(2,\\infty)}^{\\Psi}}{\\|\\Phi^*(r)\\|_{(2,\\infty)}^{\\Phi}} \\le \\mu_X(G_{good}). \\quad (2)$$\n\nLemma 1. The map $\\Phi^*$ restricted to $\\mathrm{Span} \\bigcup_{i=1}^{m} X_{G_i}$ is a linear isomorphism onto its image.\nProof of Lemma 1. By definition, if $\\Phi^*(x) = (0)_{V^\\Phi}$ then $x$ must be orthogonal to each of the subspaces spanned by $X_{G_i}$, $i = 1, \\ldots, m$. Thus $\\ker \\Phi^* \\cap \\mathrm{Span} \\bigcup_{i=1}^{m} X_{G_i} = 0$.\nLet $(\\Phi^*)^+$ denote the inverse mapping whose existence was proved in Lemma 1. The choice of symbol is not coincidental: the matrix of this mapping is indeed a pseudoinverse of the matrix $(X_{G_1} | X_{G_2} | \\ldots | X_{G_m})^T$. We have\n\n$$\\frac{\\|\\Psi^*(r)\\|_{(2,\\infty)}^{\\Psi}}{\\|\\Phi^*(r)\\|_{(2,\\infty)}^{\\Phi}} = \\frac{\\|\\Psi^*((\\Phi^*)^+ \\Phi^*(r))\\|_{(2,\\infty)}^{\\Psi}}{\\|\\Phi^*(r)\\|_{(2,\\infty)}^{\\Phi}} \\le \\|\\Psi^* \\circ (\\Phi^*)^+\\|_{(2,\\infty)},$$\n\nwhere the last term is the norm of the operator $\\Psi^* \\circ (\\Phi^*)^+ : V^\\Phi \\to V^\\Psi$. We are going to need the following\nLemma 2. A dual space of $V^\\Phi$ is $(V^\\Phi)^* = \\mathbb{R}^{d_1} \\oplus \\mathbb{R}^{d_2} \\oplus \\ldots \\oplus \\mathbb{R}^{d_m}$ with a norm $\\|\\cdot\\|_{(2,1)}^{\\Phi}$ defined as: $\\|(v_1, v_2, \\ldots, v_m)\\|_{(2,1)}^{\\Phi} = \\|(\\|v_1\\|_2, \\|v_2\\|_2, \\ldots, \\|v_m\\|_2)\\|_1$.\nA dual space of $V^\\Psi$ is $(V^\\Psi)^* = \\mathbb{R}^{d_{m+1}} \\oplus \\mathbb{R}^{d_{m+2}} \\oplus \\ldots \\oplus \\mathbb{R}^{d_J}$ with a norm $\\|\\cdot\\|_{(2,1)}^{\\Psi}$ defined as: $\\|(v_1, v_2, \\ldots, v_{J-m})\\|_{(2,1)}^{\\Psi} = \\|(\\|v_1\\|_2, \\|v_2\\|_2, \\ldots, \\|v_{J-m}\\|_2)\\|_1$.\nProof of Lemma 2. We prove the claim for $V^\\Psi$; the proof for $V^\\Phi$ is identical. Let $v^* = (v_1^*, v_2^*, \\ldots, v_{J-m}^*) \\in \\mathbb{R}^{d_{m+1}} \\oplus \\mathbb{R}^{d_{m+2}} \\oplus \\ldots \\oplus \\mathbb{R}^{d_J}$. We have\n\n$$\\|v^*\\| = \\sup_{v \\in V^\\Psi,\\ \\|v\\|_{(2,\\infty)} = 1} |v^*(v)| = \\sup_{\\|v\\|_{(2,\\infty)} = 1} \\Big| \\sum_{i=1}^{J-m} \\langle v_i^*, v_i \\rangle \\Big| = \\sum_{i=1}^{J-m} \\sup_{v_i \\in \\mathbb{R}^{d_{m+i}},\\ \\|v_i\\|_2 = 1} |\\langle v_i^*, v_i \\rangle| = \\sum_{i=1}^{J-m} \\|v_i^*\\|_2.$$\n\nThe last equality follows from $\\sup_{\\|v_i\\|_2 = 1} |\\langle v_i^*, v_i \\rangle| = \\|v_i^*\\|_2$ (as $\\ell_2^* = \\ell_2$) and the Cauchy-Schwarz inequality.\nA fundamental fact from Functional Analysis states that a (Hermitian) conjugation is an isometric isomorphism. Thus\n\n$$\\|\\Psi^* \\circ (\\Phi^*)^+\\|_{(2,\\infty)} = \\|(\\Phi)^+ \\circ \\Psi\\|_{(2,1)}. \\quad (3)$$\n\nWe used here $(A^*)^* = A$ and $(A^*)^+ = (A^+)^*$. The right hand side of (3) is equal to $\\|X_{g_{good}}^+ X_{g_{bad}}\\|_{(2,1)}^{good/bad}$ in matrix notation. Thus the inequality (1) holds. This concludes the proof of Theorem 1.\nCorollary 1. Under the conditions of Theorem 1, if $\\mu_X(G_{good}) < 1$ then the following holds\n\n$$\\frac{\\|(\\|X_{G_{m+1}}^* r\\|_2, \\|X_{G_{m+2}}^* r\\|_2, \\ldots, \\|X_{G_J}^* r\\|_2)\\|_\\infty}{\\|(\\|X_{G_1}^* r\\|_2, \\|X_{G_2}^* r\\|_2, \\ldots, \\|X_{G_m}^* r\\|_2)\\|_\\infty} < 1. \\quad (4)$$\n\nIntuitively, the condition $\\mu_X(G_{good}) < 1$ guarantees that no bad group \u201cmimics\u201d any good group too well. Note that Theorem 1 and Corollary 1 are the counterpart to Theorem 3.3 in [9], which states the Exact Recovery condition for the standard OMP algorithm, namely that $\\|X_{g_{good}}^+ X_{g_{bad}}\\|_{(1,1)} < 1$, where $g_{good}$ is not defined in terms of groups, but rather in terms of the variables present in the true model (since the notion of groups does not pertain to OMP in its original form).\n\n3.3 The Noisy Case\n\nThe following theorem extends the results of Theorem 1 to deal with non-zero Gaussian noise $\\nu$. It shows that under certain conditions the Group-OMP algorithm does not select bad groups. A sketch of the proof is provided at the end of this section.\nTheorem 2. Assume that $\\mu_X(G_{good}) < 1$ and $1 \\ge \\rho_X(G_{good}) > 0$. 
For any $\\eta \\in (0, 1/2)$, with probability at least $1 - 2\\eta$, if the stopping criterion of the Group-OMP algorithm is such that\n\n$$\\epsilon > \\frac{1}{1 - \\mu_X(G_{good})} \\sigma \\sqrt{2d \\ln(2d/\\eta)},$$\n\nthen when the algorithm stops all of the following hold:\n\n(C1) $G^{(k-1)} \\subset G_{good}$\n(C2) $\\|\\beta^{(k-1)} - \\hat\\beta_X(G_{good}, y)\\|_2 \\le \\epsilon \\frac{\\sqrt{|G_{good} \\setminus G^{(k-1)}|}}{\\rho_X(G_{good})}$\n(C3) $\\|\\hat\\beta_X(G_{good}, y) - \\bar\\beta\\|_\\infty \\le \\sigma \\sqrt{\\frac{2 \\ln(2|g_{good}|/\\eta)}{\\rho_X(G_{good})}}$\n(C4) $|G_{good} \\setminus G^{(k-1)}| \\le 2 \\left| \\left\\{ G_j \\in G_{good} : \\|\\bar\\beta_{G_j}\\|_2 < \\sqrt{8} \\epsilon \\rho_X(G_{good})^{-1} \\right\\} \\right|$.\n\nWe thus obtain the following theorem, which states the main consistency result for Group-OMP.\nTheorem 3. Assume that $\\mu_X(G_{good}) < 1$ and $1 \\ge \\rho_X(G_{good}) > 0$. For any $\\eta \\in (0, 1/2)$, with probability at least $1 - 2\\eta$, if the stopping criterion of the Group-OMP algorithm is such that $\\epsilon > \\frac{1}{1 - \\mu_X(G_{good})} \\sigma \\sqrt{2d \\ln(2d/\\eta)}$ and $\\min_{G_j \\in G_{good}} \\|\\bar\\beta_{G_j}\\|_2 \\ge \\sqrt{8} \\epsilon \\rho_X(G_{good})^{-1}$, then when the algorithm stops $G^{(k-1)} = G_{good}$ and $\\|\\beta^{(k-1)} - \\bar\\beta\\|_\\infty \\le \\sigma \\sqrt{(2 \\ln(2|G_{good}|/\\eta)) / \\rho_X(G_{good})}$.\nExcept for the condition on $\\mu_X(G_{good})$ (and the definition of $\\mu_X(G_{good})$ itself), the conditions in Theorem 2 and Theorem 3 are similar to those required for the standard OMP algorithm [13], the main advantage being that for Group-OMP it is the $l_2$ norm of the coefficient groups for the true model that needs to be lower-bounded, rather than the amplitude of the individual coefficients.$^1$\nProof Sketch of Theorem 2. 
To prove the theorem a series of lemmas is needed, whose proofs are omitted due to space constraints, as they can be derived using arguments similar to Zhang [13] for the standard OMP case. The following lemma gives a lower bound on the correlation between the good groups and the residuals from the OLS prediction where the coefficients have been restricted to a set of good groups.\nLemma 3. Let $G \\subset G_{good}$, i.e., $G$ is a set of good groups. Let $\\beta = \\hat\\beta_X(G, y)$, $\\beta' = \\hat\\beta_X(G_{good}, y)$, $f = X\\beta$ and $f' = X\\beta'$. Then\n\n$$\\max_{G_j \\in G_{good}} \\|X_{G_j}^*(y - f)\\|_2 \\ge \\frac{\\sqrt{\\rho_X(G_{good})}}{\\sqrt{|G_{good} \\setminus G|}} \\|f - f'\\|_2.$$\n\nThe following lemma relates the parameter $\\hat\\beta_X(G_{good}, y)$, which is estimated by OLS given that the set of good groups has been correctly identified, to the true parameter $\\bar\\beta$.\nLemma 4. For all $\\eta \\in (0, 1)$, with probability at least $1 - \\eta$, we have\n\n$$\\|\\hat\\beta_X(G_{good}, y) - \\hat\\beta_X(G_{good}, Ey)\\|_\\infty \\le \\sigma \\sqrt{\\frac{2 \\ln(2|g_{good}|/\\eta)}{\\rho_X(G_{good})}}.$$\n\nThe following lemma provides an upper bound on the correlation of the bad features to the residuals from the prediction by OLS given that the set of good groups has been correctly identified.\nLemma 5. Let $\\beta' = \\hat\\beta_X(G_{good}, y)$ and $f' = X\\beta'$. We have\n\n$$P\\left( \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(f' - y)\\|_2 \\le \\sigma \\sqrt{2d \\ln(2d/\\eta)} \\right) \\ge 1 - \\eta.$$\n\nWe are now ready to prove Theorem 2. We first prove, by induction on $k$, that for each iteration $k$ before the Group-OMP algorithm stops, $G^{(k-1)} \\subset G_{good}$. Now, suppose that the claim holds after $k - 1$ iterations, where $k \\ge 1$. So at the beginning of the $k$th iteration, we have $G^{(k-1)} \\subset G_{good}$. 
We have\n\n$$\\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 \\le \\max_{G_j \\notin G_{good}} \\|X_{G_j}^* X(\\beta^{(k-1)} - \\beta')\\|_2 + \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2$$\n\n$$\\le \\mu_X(G_{good}) \\max_{G_j \\in G_{good}} \\|X_{G_j}^* X(\\beta^{(k-1)} - \\beta')\\|_2 + \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2 \\quad (5)$$\n\n$$= \\mu_X(G_{good}) \\max_{G_j \\in G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 + \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2. \\quad (6)$$\n\nHere Eq. 5 follows by applying Theorem 1 to $r = X(\\beta^{(k-1)} - \\beta')$, and Eq. 6 is due to the fact that for all $G_j \\in G_{good}$, $X_{G_j}^*(X\\beta' - y) = 0_{(d_j)}$ holds.\nLemma 5 together with the condition on $\\epsilon$ of Theorem 2 implies that with probability at least $1 - \\eta$,\n\n$$\\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2 \\le \\sigma \\sqrt{2d \\ln(2d/\\eta)} < (1 - \\mu_X(G_{good})) \\epsilon. \\quad (7)$$\n\nLemma 3 together with the definition of $\\rho_X(G_{good})$ implies\n\n$$\\max_{G_j \\in G_{good}} \\|X_{G_j}^*(y - X\\beta^{(k-1)})\\|_2 \\ge \\frac{\\rho_X(G_{good})}{\\sqrt{|G_{good} \\setminus G^{(k-1)}|}} \\|\\beta^{(k-1)} - \\beta'\\|_2. \\quad (8)$$\n\nWe then have to deal with the following cases.\nCase 1: $\\|\\beta^{(k-1)} - \\beta'\\|_2 > \\epsilon \\frac{\\sqrt{|G_{good} \\setminus G^{(k-1)}|}}{\\rho_X(G_{good})}$. It follows that\n\n$$\\max_{G_j \\in G_{good}} \\|X_{G_j}^*(y - X\\beta^{(k-1)})\\|_2 > \\epsilon > \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2 / (1 - \\mu_X(G_{good})), \\quad (9)$$\n\nwhere the last inequality follows from Eq. 7. Then Eq. 6 implies that $\\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 < \\max_{G_j \\in G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2$. So a good group is selected, i.e., $G_{j^{(k)}} \\in G_{good}$, and Eq. 9 implies that the algorithm does not stop.\nCase 2: $\\|\\beta^{(k-1)} - \\beta'\\|_2 \\le \\epsilon \\frac{\\sqrt{|G_{good} \\setminus G^{(k-1)}|}}{\\rho_X(G_{good})}$. We then have three possibilities.\nCase 2.1: $G_{j^{(k)}} \\in G_{good}$ and the procedure does not stop.\nCase 2.2: $G_{j^{(k)}} \\in G_{good}$ and the procedure stops.\nCase 2.3: $G_{j^{(k)}} \\notin G_{good}$, in which case we have\n\n$$\\max_{G_j \\in G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 \\le \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 \\le \\mu_X(G_{good}) \\max_{G_j \\in G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 + \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2 \\le \\mu_X(G_{good}) \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 + \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2,$$\n\nwhere the second inequality follows from Eq. 6 and the last follows from applying the first inequality once again. We thus obtain that\n\n$$\\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta^{(k-1)} - y)\\|_2 \\le \\frac{1}{1 - \\mu_X(G_{good})} \\max_{G_j \\notin G_{good}} \\|X_{G_j}^*(X\\beta' - y)\\|_2 < \\epsilon,$$\n\nwhere the last inequality follows by Eq. 7. Hence the algorithm stops.\nThe above cases imply that if the algorithm does not stop we have $G_{j^{(k)}} \\in G_{good}$, and hence $G^{(k)} \\subseteq G_{good}$, and if the algorithm stops we have $\\|\\beta^{(k-1)} - \\beta'\\|_2 \\le \\epsilon \\frac{\\sqrt{|G_{good} \\setminus G^{(k-1)}|}}{\\rho_X(G_{good})}$. Thus by induction, if the Group-OMP algorithm stops at iteration $k$, we have that $G^{(k-1)} \\subseteq G_{good}$ and $\\|\\beta^{(k-1)} - \\beta'\\|_2 \\le \\epsilon \\frac{\\sqrt{|G_{good} \\setminus G^{(k-1)}|}}{\\rho_X(G_{good})}$. So (C1) and (C2) are satisfied. Lemma 4 implies that (C3) holds, and together with the theorem\u2019s condition on $\\epsilon$ also implies that with probability at least $1 - \\eta$, we have $\\|\\hat\\beta_X(G_{good}, y) - \\hat\\beta_X(G_{good}, Ey)\\|_\\infty \\le \\sigma \\sqrt{(2 \\ln(2|G_{good}|/\\eta)) / \\rho_X(G_{good})} < \\epsilon / \\sqrt{\\rho_X(G_{good})}$. This allows us to show that (C4) holds, using similar arguments as in [13], which we omit due to space constraints. This leads to Theorem 2.\n\n$^1$The sample size $n$ is explicitly part of the conditions in [13] while it is implicit here due to the different ways of normalizing the matrix $X$. One recovers the same dependency on $n$ by considering $X' = \\sqrt{n} X$, $\\beta'^{(k)} = \\beta^{(k)} / \\sqrt{n}$, $\\bar\\beta' = \\bar\\beta / \\sqrt{n}$, defining (as in [13]) $\\rho'_{X'}(G_{good}) = \\inf_{\\beta} \\left\\{ \\frac{1}{n} \\|X'\\beta\\|_2^2 / \\|\\beta\\|_2^2 : \\mathrm{supp}(\\beta) \\subset g_{good} \\right\\}$, and noting that $\\rho'_{X'}(G_{good}) = \\rho_X(G_{good})$ and $\\hat\\beta_{X'}(G_{good}, y) = \\hat\\beta_X(G_{good}, y) / \\sqrt{n}$. If $X$ had i.i.d. entries, with mean 0, variance $1/n$ and finite 4th moment, $\\rho_X(G_{good})$ converges a.s. to $(1 - \\sqrt{g})^2$ as $n \\to \\infty$ and $|g_{good}|/n \\to g \\le 1$ [2]. Hence the rates in C2-C4 are unaffected by $\\rho_X(G_{good})$.\n\n4 Experiments\n\n4.1 Simulation Results\n\nWe empirically evaluate the performance of the proposed Group-OMP method against the comparison methods OMP, Group Lasso, Lasso and OLS (Ordinary Least Squares). Comparison with OMP will test the effect of \u201cgrouping\u201d OMP, while Group Lasso is included as a representative existing method of group variable selection. We compare the performance of these methods in terms of the accuracy of variable selection, variable group selection and prediction. As the measure of variable (group) selection accuracy we use the F1 measure, which is defined as $F_1 = \\frac{2PR}{P+R}$, where $P$ denotes the precision and $R$ denotes the recall. For computing variable group F1 for a variable selection method, we consider a group to be selected if any of the variables in the group is selected.$^2$ As the measure of prediction accuracy, we use the model error, defined as $\\text{Model error} = (\\hat\\beta - \\bar\\beta)^* E(X^*X) (\\hat\\beta - \\bar\\beta)$, where $\\bar\\beta$ are the true model coefficients and $\\hat\\beta$ the estimated coefficients. Recall that Lasso solves $\\arg\\min_{\\beta} \\left( \\|Y - X\\beta\\|^2 + \\lambda \\|\\beta\\|_1 \\right)$. So the tuning parameter for Lasso and Group Lasso is the penalty parameter $\\lambda$. For Group-OMP and OMP, rather than parameterizing the models according to precision $\\epsilon$, we do so using the iteration number (i.e. a stopping point). We consider two estimates: the \u201coracle estimate\u201d and the \u201choldout validated estimate\u201d. For the oracle estimate, the tuning parameter is chosen so as to minimize the model error. 
Note that such an estimate can only be computed in simulations and not in practical situations, but it is useful for evaluating the relative performance of comparison methods, independently of the appropriateness of the complexity parameter. The holdout-validated estimate is a practical version of the oracle estimate, obtained by selecting the tuning parameter by minimizing the average squared error on a validation set. We now describe the experimental setup.\nExperiment 1: We use an additive model with categorical variables taken from [12] (model I). Consider variables $Z_1, \\ldots, Z_{15}$, where $Z_i \\sim N(0, 1)$ $(i = 1, \\ldots, 15)$ and $\\mathrm{cov}(Z_i, Z_j) = 0.5^{|i-j|}$. Let $W_1, \\ldots, W_{15}$ be such that $W_i = 0$ if $Z_i < \\Phi^{-1}(1/3)$, $W_i = 1$ if $Z_i > \\Phi^{-1}(2/3)$ and $W_i = 2$ if $\\Phi^{-1}(1/3) \\le Z_i \\le \\Phi^{-1}(2/3)$, where $\\Phi^{-1}$ is the quantile function for the normal distribution. The responses in the data are generated using the true model:\n\n$$Y = 1.8 I(W_1 = 1) - 1.2 I(W_1 = 0) + I(W_3 = 1) + 0.5 I(W_3 = 0) + I(W_5 = 1) + I(W_5 = 0) + \\nu,$$\n\nwhere $I$ denotes the indicator function and $\\nu \\sim N(0, \\sigma = 1.476)$. Then let $(X_{2(i-1)+1}, X_{2i}) = (I(W_i = 1), I(W_i = 0))$, which are the variables that the estimation methods use as the explanatory variables, with the following variable groups: $G_i = \\{2i - 1, 2i\\}$ $(i = 1, \\ldots, 15)$. We ran 100 runs, each with 50 observations for training and 25 for validation.\nExperiment 2: We use an additive model with continuous variables taken from [12] (model III), where the groups correspond to the expansion of each variable into a third-order polynomial. Consider variables $Z_1, \\ldots, Z_{17}$, with $Z_i$ i.i.d. $\\sim N(0, 1)$ $(i = 1, \\ldots, 17)$. Let $W_1, \\ldots, W_{16}$ be defined as $W_i = (Z_i + Z_{17}) / \\sqrt{2}$. The true model is $Y = W_3^3 + W_3^2 + W_3 + \\frac{1}{3} W_6^3 - W_6^2 + \\frac{2}{3} W_6 + \\nu$, where $\\nu \\sim N(0, \\sigma = 2)$. Then let the explanatory variables be $(X_{3(i-1)+1}, X_{3(i-1)+2}, X_{3i}) = (W_i^3, W_i^2, W_i)$ with the variable groups $G_i = \\{3(i - 1) + 1, 3(i - 1) + 2, 3i\\}$ $(i = 1, \\ldots, 16)$. We ran 100 runs, each with 100 observations for training and 50 for validation.\nExperiment 3: We use an additive model with continuous variables similar to that of [16]. Consider three independent hidden variables $Z_1, \\ldots, Z_3$ such that $Z_i \\sim N(0, \\sigma = 1)$. Consider 40 predictors defined as: $X_i = Z_{\\lfloor (i-1)/3 \\rfloor + 1} + \\nu_i$ for $i = 1, \\ldots, 15$ and $X_i \\sim N(0, 1)$ for $i = 16, \\ldots, 40$, where $\\nu_i$ i.i.d. $\\sim N(0, \\sigma = 0.1^{1/2})$. The true model is\n\n$$Y = 3 \\sum_{i=1}^{5} X_i + 4 \\sum_{i=6}^{10} X_i + 2 \\sum_{i=11}^{15} X_i + \\nu, \\quad \\text{where } \\nu \\sim N(0, \\sigma = 15),$$\n\nand the groups are $G_k = \\{5(k - 1) + 1, \\ldots, 5k\\}$ for $k = 1, \\ldots, 3$, and $G_k = \\{k + 12\\}$ for $k > 3$. We ran 100 runs, each with 500 observations for training and 50 for validation.\nExperiment 4: We use an additive model with continuous variables taken from [15]. Consider five hidden variables $Z_1, \\ldots, Z_5$ such that $Z_i$ i.i.d. $\\sim N(0, \\sigma = 1)$. Consider 10 measurements of each of these hidden variables such that $X_i = (0.05) Z_{\\lfloor (i-1)/10 \\rfloor + 1} + (1 - 0.05^2)^{1/2} \\nu_i$, $i = 1, \\ldots, 50$, where $\\nu_i \\sim N(0, 1)$ and $\\mathrm{cov}(\\nu_i, \\nu_j) = 0.5^{|i-j|}$. The true model is $Y = X\\bar\\beta + \\nu$, where $\\nu \\sim N(0, \\sigma = 19.22)$, and\n\n$$\\bar\\beta_i = \\begin{cases} 7 & \\text{for } i = 1, \\ldots, 10 \\\\ 2 & \\text{for } i = 11, \\ldots, 20 \\\\ 1 & \\text{for } i = 21, \\ldots, 30 \\\\ 0 & \\text{for } i = 31, \\ldots, 50 \\end{cases}$$\n\nThe groups are $G_k = \\{10(k - 1) + 1, \\ldots, 10k\\}$, for $k = 1, \\ldots, 5$. 
We ran 100 runs, each with\n300 observations for training and 50 for validation.\nThe results of the four experiments are presented in Table 1. We note that F1 (Var) and F1 (Group)\nare identical for the grouped methods for Experiments 1, 2 and 4, since in these the groups have\nequal size. Overall, Group-OMP performs consistently better than all the comparison methods, with\nrespect to all measures considered. In particular, Group-OMP does better than OMP not only for\nvariable group selection, but also for variable selection and predictive accuracy. Against Group\nLasso, Group-OMP does better in all four experiments with respect to variable (group) selection\nwhen using Oracle, while it does worse in one case when using holdout validation. Group-OMP also\ndoes better than Group Lasso with respect to the model error in three out of the four experiments.\n\n2 Other ways of translating variable selection to variable group selection are possible, but the F1 measure is\nrelatively robust with respect to this choice.\n\nF1 (Var)                Exp 1            Exp 2            Exp 3            Exp 4\nOLS                     0.333 \u00b1 0        0.222 \u00b1 0        0.545 \u00b1 0        0.750 \u00b1 0\nLasso (Oracle)          0.483 \u00b1 0.010    0.541 \u00b1 0.010    0.771 \u00b1 0.007    0.817 \u00b1 0.004\nLasso (Holdout)         0.389 \u00b1 0.012    0.528 \u00b1 0.015    0.758 \u00b1 0.015    0.810 \u00b1 0.005\nOMP (Oracle)            0.531 \u00b1 0.019    0.787 \u00b1 0.009    0.532 \u00b1 0.004    0.781 \u00b1 0.005\nOMP (Holdout)           0.422 \u00b1 0.014    0.728 \u00b1 0.013    0.477 \u00b1 0.006    0.741 \u00b1 0.006\nGroup Lasso (Oracle)    0.545 \u00b1 0.010    0.449 \u00b1 0.011    0.693 \u00b1 0.005    0.755 \u00b1 0.002\nGroup Lasso (Holdout)   0.624 \u00b1 0.017    0.459 \u00b1 0.016    0.706 \u00b1 0.013    0.794 \u00b1 0.008\nGroup-OMP (Oracle)      0.730 \u00b1 0.017    0.998 \u00b1 0.002    0.999 \u00b1 0.001    0.998 \u00b1 0.002\nGroup-OMP (Holdout)     0.615 \u00b1 0.020    0.921 \u00b1 0.012    0.918 \u00b1 0.011    0.890 \u00b1 0.011\n\nF1 (Group)              Exp 1            Exp 2            Exp 3            Exp 4\nOLS                     0.333 \u00b1 0        0.222 \u00b1 0        0.194 \u00b1 0        0.750 \u00b1 0\nLasso (Oracle)          0.458 \u00b1 0.012    0.346 \u00b1 0.008    0.494 \u00b1 0.011    0.751 \u00b1 0.001\nLasso (Holdout)         0.511 \u00b1 0.010    0.340 \u00b1 0.014    0.547 \u00b1 0.029    0.776 \u00b1 0.006\nOMP (Oracle)            0.687 \u00b1 0.018    0.808 \u00b1 0.020    0.224 \u00b1 0.004    0.842 \u00b1 0.010\nOMP (Holdout)           0.621 \u00b1 0.020    0.721 \u00b1 0.025    0.421 \u00b1 0.026    0.827 \u00b1 0.010\nGroup Lasso (Oracle)    0.545 \u00b1 0.010    0.449 \u00b1 0.011    0.317 \u00b1 0.006    0.755 \u00b1 0.002\nGroup Lasso (Holdout)   0.624 \u00b1 0.017    0.459 \u00b1 0.016    0.364 \u00b1 0.018    0.794 \u00b1 0.008\nGroup-OMP (Oracle)      0.730 \u00b1 0.017    0.998 \u00b1 0.002    0.998 \u00b1 0.001    0.998 \u00b1 0.002\nGroup-OMP (Holdout)     0.615 \u00b1 0.020    0.921 \u00b1 0.012    0.782 \u00b1 0.025    0.890 \u00b1 0.011\n\nME                      Exp 1            Exp 2            Exp 3            Exp 4\nOLS                     3.184 \u00b1 0.129    7.063 \u00b1 0.251    19.592 \u00b1 0.451   46.845 \u00b1 0.985\nLasso (Oracle)          1.203 \u00b1 0.078    1.099 \u00b1 0.067    9.228 \u00b1 0.285    30.343 \u00b1 0.796\nLasso (Holdout)         2.536 \u00b1 0.097    1.309 \u00b1 0.080    12.987 \u00b1 0.670   38.089 \u00b1 1.353\nOMP (Oracle)            0.711 \u00b1 0.020    1.052 \u00b1 0.061    19.006 \u00b1 0.443   38.497 \u00b1 0.926\nOMP (Holdout)           0.945 \u00b1 0.031    1.394 \u00b1 0.102    28.246 \u00b1 1.942   48.564 \u00b1 1.957\nGroup Lasso (Oracle)    0.457 \u00b1 0.021    0.867 \u00b1 0.052    11.538 \u00b1 0.370   31.053 \u00b1 0.831\nGroup Lasso (Holdout)   1.279 \u00b1 0.017    1.047 \u00b1 0.075    14.979 \u00b1 0.538   37.359 \u00b1 1.260\nGroup-OMP (Oracle)      0.601 \u00b1 0.0273   0.379 \u00b1 0.035    6.727 \u00b1 0.252    27.765 \u00b1 0.703\nGroup-OMP (Holdout)     0.965 \u00b1 0.050    0.605 \u00b1 0.089    12.553 \u00b1 1.469   35.989 \u00b1 1.127\n\nTable 1: Average F1 score at the variable level and group level, and model error for the models\noutput by Ordinary Least Squares, Lasso, OMP, Group Lasso, and Group-OMP.\n\n4.2 Experiment on a real dataset\n\nWe use the \u201cBoston Housing\u201d dataset (UCI Machine Learning Repository). The continuous variables\nappear to have non-linear effects on the target value, so for each such variable, say X_i, we\nconsider its third-order polynomial expansion, i.e., X_i, X_i^2 and X_i^3, and treat them as a variable\ngroup. We ran 100 runs, where for each run we select at random half of the instances as training\nexamples, one quarter as validation set, and the remaining quarter as test examples. The penalty\nparameter was chosen with holdout validation for all methods. The average test set prediction error\nand the average number of selected original variables (i.e. groups) are reported in Table 2. These results\nconfirm that Group-OMP has the highest prediction accuracy among the comparison methods, and\nalso leads to the sparsest model.\n\nBoston Housing                OLS            Lasso          OMP            Group Lasso    Group-OMP\nPrediction Error              29.30 \u00b1 3.25   17.82 \u00b1 0.48   19.10 \u00b1 0.78   18.45 \u00b1 0.59   17.60 \u00b1 0.51\nNumber of Original Variables  13 \u00b1 0         12.82 \u00b1 0.05   11.51 \u00b1 0.20   12.50 \u00b1 0.13   9.09 \u00b1 0.31\n\nTable 2: Average test set prediction error and average number of original variables for the models\noutput by OLS, Lasso, OMP, Group Lasso, and Group-OMP on the \u201cBoston Housing\u201d dataset.\n\n5 Concluding Remarks\n\nIn addition to its merits in terms of consistency and accuracy, Group-OMP is particularly attractive\ndue to its computational efficiency (the entire path is computed in J rounds, where J is the number\nof groups).\n
Interesting directions for future research include comparing the conditions for the\nconsistency of Group-OMP to those for Group Lasso and the bounds on their respective accuracy in\nestimating the regression coefficients, evaluating modified versions of Group-OMP where the group\nselection step (\u2217) in Figure 1 includes a penalty to account for the group size, and considering a\nforward/backward extension that allows correcting for mistakes (similarly to [14]).\n\nReferences\n\n[1] BACH F.R., Consistency of the Group Lasso and Multiple Kernel Learning, J. Mach. Learn. Res., 9, 1179-1225, 2008.\n[2] BAI Z.D., YIN Y.Q., Limit of the smallest eigenvalue of a large dimensional sample covariance matrix, Ann. Probab., 21, 1275-1294, 1993.\n[3] CHEN J., HUO X., Sparse representations for multiple measurement vectors (MMV) in an overcomplete dictionary, in Proc. of the 2005 IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., 2005.\n[4] HUANG J., ZHANG T., METAXAS D., Learning with Structured Sparsity, in ICML\u201909, 2009.\n[5] MALLAT S., ZHANG Z., Matching pursuits with time-frequency dictionaries, IEEE Transactions on Signal Processing, 41, 3397-3415, 1993.\n[6] MOORE E.H., On the reciprocal of the general algebraic matrix, Bulletin of the American Mathematical Society, 26, 394-395, 1920.\n[7] PENROSE R., A generalized inverse for matrices, Proceedings of the Cambridge Philosophical Society, 51, 406-413, 1955.\n[8] TIBSHIRANI R., Regression shrinkage and selection via the lasso, J. R. Statist. Soc. B, 58(1), 267-288, 1996.\n[9] TROPP J.A., Greed is good: Algorithmic results for sparse approximation, IEEE Trans. Info. Theory, 50(10), 2231-2242, 2004.\n[10] TROPP J.A., GILBERT A.C., STRAUSS M.J., Algorithms for simultaneous sparse approximation, Part I: greedy pursuit, Signal Proc., 86(3), 572-588, 2006.\n[11] PEOTTA L., VANDERGHEYNST P., Matching Pursuit with Block Incoherent Dictionaries, Signal Proc., 55(9), 2007.\n[12] YUAN M., LIN Y., Model selection and estimation in regression with grouped variables, J. R. Statist. Soc. B, 68, 49-67, 2006.\n[13] ZHANG T., On the consistency of feature selection using greedy least squares regression, J. Machine Learning Research, 2008.\n[14] ZHANG T., Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models, in NIPS\u201908, 2008.\n[15] ZHAO P., ROCHA G., YU B., Grouped and hierarchical model selection through composite absolute penalties, Manuscript, 2006.\n[16] ZOU H., HASTIE T., Regularization and variable selection via the Elastic Net, J. R. Statist. Soc. B, 67(2), 301-320, 2005.\n", "award": [], "sourceid": 868, "authors": [{"given_name": "Grzegorz", "family_name": "Swirszcz", "institution": null}, {"given_name": "Naoki", "family_name": "Abe", "institution": null}, {"given_name": "Aurelie", "family_name": "Lozano", "institution": null}]}