{"title": "High-dimensional support union recovery in multivariate regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1217, "page_last": 1224, "abstract": null, "full_text": "High-dimensional support union recovery in\n\nmultivariate regression\n\nGuillaume Obozinski\nDepartment of Statistics\n\nUC Berkeley\n\ngobo@stat.berkeley.edu\n\nMartin J. Wainwright\nDepartment of Statistics\n\nDept. of Electrical Engineering and Computer Science\n\nUC Berkeley\n\nwainwright@stat.berkeley.edu\n\nDepartment of Electrical Engineering and Computer Science\n\nMichael I. Jordan\n\nDepartment of Statistics\n\nUC Berkeley\n\njordan@stat.berkeley.edu\n\nAbstract\n\nWe study the behavior of block (cid:96)1/(cid:96)2 regularization for multivariate regression,\nwhere a K-dimensional response vector is regressed upon a \ufb01xed set of p co-\nvariates. The problem of support union recovery is to recover the subset of\ncovariates that are active in at least one of the regression problems. Study-\ning this problem under high-dimensional scaling (where the problem parame-\nters as well as sample size n tend to in\ufb01nity simultaneously), our main result\nis to show that exact recovery is possible once the order parameter given by\n\u03b8(cid:96)1/(cid:96)2(n, p, s) : = n/[2\u03c8(B\u2217) log(p \u2212 s)] exceeds a critical threshold. Here n is\nthe sample size, p is the ambient dimension of the regression model, s is the size\nof the union of supports, and \u03c8(B\u2217) is a sparsity-overlap function that measures a\ncombination of the sparsities and overlaps of the K-regression coef\ufb01cient vectors\nthat constitute the model. 
This sparsity-overlap function reveals that block $\ell_1/\ell_2$ regularization for multivariate regression never harms performance relative to a naive $\ell_1$-approach, and can yield substantial improvements in sample complexity (up to a factor of $K$) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

1 Introduction

A recent line of research in machine learning has focused on regularization based on block-structured norms. Such structured norms are well motivated in various settings, among them kernel learning [3, 8], grouped variable selection [12], hierarchical model selection [13], simultaneous sparse approximation [10], and simultaneous feature selection in multi-task learning [7]. Block norms that compose an $\ell_1$-norm with other norms yield solutions that tend to be sparse like the Lasso, but the structured norm also enforces blockwise sparsity, in the sense that parameters within blocks are more likely to be zero (or non-zero) simultaneously.

The focus of this paper is the model selection consistency of block-structured regularization in the setting of multivariate regression. Our goal is to perform model or variable selection, by which we mean extracting the subset of relevant covariates that are active in at least one regression. We refer to this problem as the support union problem. In line with a large body of recent work in statistical machine learning (e.g., [2, 9, 14, 11]), our analysis is high-dimensional in nature, meaning that we allow the model dimension $p$ (as well as other structural parameters) to grow along with the sample size $n$.
A great deal of work has focused on the case of ordinary $\ell_1$-regularization (Lasso) [2, 11, 14], showing for instance that the Lasso can recover the support of a sparse signal even when $p \gg n$. Some more recent work has studied consistency issues for block-regularization schemes, including classical analysis ($p$ fixed) of the group Lasso [1], and high-dimensional analysis of the predictive risk of block-regularized logistic regression [5]. Although there have been various empirical demonstrations of the benefits of block regularization, the generalizations of the result of [11] obtained by [6, 4] fail to capture the improvements observed in practice. In this paper, our goal is to understand the following question: under what conditions does block regularization lead to a quantifiable improvement in statistical efficiency, relative to more naive regularization schemes? Here statistical efficiency is assessed in terms of the sample complexity, meaning the minimal sample size $n$ required to recover the support union; we wish to know how this scales as a function of problem parameters. Our main contribution is to provide a function quantifying the benefits of block regularization schemes for the problem of multivariate linear regression, showing in particular that, under suitable structural conditions on the data, the block-norm regularization we consider never harms performance relative to naive $\ell_1$-regularization and can lead to substantial gains in sample complexity.

More specifically, we consider the following problem of multivariate linear regression: a group of $K$ scalar outputs are regressed on the same design matrix $X \in \mathbb{R}^{n \times p}$.
Representing the regression coefficients as a $p \times K$ matrix $B^*$, the regression model takes the form

$$Y = XB^* + W, \qquad (1)$$

where $Y \in \mathbb{R}^{n \times K}$ and $W \in \mathbb{R}^{n \times K}$ are matrices of observations and zero-mean noise respectively, and $B^*$ has columns $\beta^{*(1)}, \ldots, \beta^{*(K)}$, which are the parameter vectors of the individual univariate regressions. We are interested in recovering the union of the supports of the individual regressions; more specifically, if $S_k := \{ i \in \{1, \ldots, p\} : \beta^{*(k)}_i \neq 0 \}$, we would like to recover $S = \cup_k S_k$. The Lasso is often presented as a relaxation of so-called $\ell_0$ regularization, i.e., the count of the number of non-zero parameter coefficients, an intractable non-convex function. More generally, block-norm regularizations can be thought of as relaxations of a non-convex regularization which counts the number of covariates $i$ for which at least one of the univariate regression parameters $\beta^{*(k)}_i$ is non-zero. More specifically, let $\beta^*_i$ denote the $i$th row of $B^*$, and define, for $q \geq 1$,

$$\|B^*\|_{\ell_0/\ell_q} = \big|\{ i \in \{1, \ldots, p\} : \|\beta^*_i\|_q > 0 \}\big| \qquad \text{and} \qquad \|B^*\|_{\ell_1/\ell_q} = \sum_{i=1}^p \|\beta^*_i\|_q.$$

All $\ell_0/\ell_q$ norms define the same function, but differ conceptually in that they lead to different $\ell_1/\ell_q$ relaxations. In particular, $\ell_1/\ell_1$ regularization is the same as the usual Lasso. The other conceptually most natural block norms are $\ell_1/\ell_2$ and $\ell_1/\ell_\infty$.
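For concreteness, the block norms above are straightforward to compute row by row. The following sketch (our own illustration; the helper names `block_norm` and `block_support_size` are hypothetical, not from the paper) evaluates the $\ell_1/\ell_q$ norm and the common $\ell_0/\ell_q$ support count for a small coefficient matrix:

```python
import numpy as np

def block_norm(B, q):
    """l1/lq block norm: sum over rows of the lq norm of each row."""
    return np.sum(np.linalg.norm(B, ord=q, axis=1))

def block_support_size(B, q=2, tol=0.0):
    """l0/lq count: number of rows whose lq norm is non-zero."""
    return int(np.sum(np.linalg.norm(B, ord=q, axis=1) > tol))

B = np.array([[1.0, -1.0],
              [0.0,  0.0],
              [3.0,  4.0]])

# l1/l1 coincides with the entrywise l1 norm (the ordinary Lasso penalty).
assert block_norm(B, 1) == np.abs(B).sum()
# l1/l2 sums the Euclidean row norms: sqrt(2) + 0 + 5.
assert np.isclose(block_norm(B, 2), np.sqrt(2) + 5.0)
# Two of the three rows are active, regardless of q.
assert block_support_size(B, q=2) == 2
```

Note that the $\ell_0/\ell_q$ count is the same for every $q$, as stated above; only the convex $\ell_1/\ell_q$ relaxations differ.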
While $\ell_1/\ell_\infty$ is of interest, it seems intuitively to be relevant essentially to situations where the support is exactly the same for all regressions, an assumption that we are not willing to make.

In the current paper, we focus on the $\ell_1/\ell_2$ case and consider the estimator $\hat{B}$ obtained by solving the following disguised second-order cone program:

$$\min_{B \in \mathbb{R}^{p \times K}} \left\{ \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \|B\|_{\ell_1/\ell_2} \right\}, \qquad (2)$$

where $|||M|||_F := (\sum_{i,j} m_{ij}^2)^{1/2}$ denotes the Frobenius norm. We study the support union problem under high-dimensional scaling, meaning that the number of observations $n$, the ambient dimension $p$ and the size $s$ of the union of supports can all tend to infinity. The main contribution of this paper is to show that, under certain technical conditions on the design and noise matrices, the model selection performance of block-regularized $\ell_1/\ell_2$ regression (2) is governed by the control parameter

$$\theta_{\ell_1/\ell_2}(n, p\,; B^*) := \frac{n}{2\,\psi(B^*, \Sigma_{SS}) \log(p - s)},$$

where $n$ is the sample size, $p$ is the ambient dimension, $s = |S|$ is the size of the union of the supports, and $\psi(\cdot)$ is a sparsity-overlap function defined below. More precisely, the probability of correct support union recovery converges to one for all sequences $(n, p, s, B^*)$ such that the control parameter $\theta_{\ell_1/\ell_2}(n, p\,; B^*)$ exceeds a fixed critical threshold $\theta_{\mathrm{crit}} < +\infty$. Note that $\theta_{\ell_1/\ell_2}$ is a measure of the sample complexity of the problem, that is, the sample size required for exact recovery as a function of the problem parameters.
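The convex program (2) can be solved by generic SOCP solvers, but a simple first-order method also works. The sketch below (our own minimal implementation, not from the paper; the solver, step size, and tolerances are assumptions) runs proximal gradient descent, whose proximal step for the $\ell_1/\ell_2$ penalty is a row-wise group soft-thresholding:

```python
import numpy as np

def group_soft_threshold(B, tau):
    """Proximal operator of tau * ||B||_{l1/l2}: shrink each row's l2 norm by tau."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0) * B

def block_l2_regression(X, Y, lam, n_iter=1000):
    """Proximal-gradient iterations for (1/2n)|||Y - XB|||_F^2 + lam * ||B||_{l1/l2}."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L, with L the gradient's Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n
        B = group_soft_threshold(B - step * grad, step * lam)
    return B

# Small synthetic instance: s = 4 active rows out of p = 20, K = 2 tasks.
rng = np.random.default_rng(0)
n, p, K, s = 200, 20, 2, 4
X = rng.standard_normal((n, p))
B_star = np.zeros((p, K))
B_star[:s] = rng.choice([-1.0, 1.0], size=(s, K))
Y = X @ B_star + 0.1 * rng.standard_normal((n, K))

B_hat = block_l2_regression(X, Y, lam=0.1)
support = np.flatnonzero(np.linalg.norm(B_hat, axis=1) > 1e-3)
```

In this easy regime ($n \gg s \log(p-s)$, low noise), the estimated support union `support` matches the true support $\{1, \ldots, s\}$; the thresholding step returns exactly-zero rows outside it.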
Whereas the ratio $n/\log p$ is standard for high-dimensional theory on $\ell_1$-regularization (essentially due to covering numbers of $\ell_1$ balls), the function $\psi(B^*, \Sigma_{SS})$ is a novel and interesting quantity, which measures both the sparsity of the matrix $B^*$ and the overlap between the different regression tasks (columns of $B^*$).

In Section 2, we introduce the models and assumptions, define key characteristics of the problem, and state our main result and its consequences. Section 3 is devoted to the proof of this main result, with most technical results deferred to the appendix. Section 4 illustrates with simulations the sharpness of our analysis and how quickly the asymptotic regime arises.

1.1 Notations

For a (possibly random) matrix $M \in \mathbb{R}^{p \times K}$, and for parameters $1 \leq a \leq b \leq \infty$, we distinguish the $\ell_a/\ell_b$ block norms from the $(a, b)$-operator norms, defined respectively as

$$\|M\|_{\ell_a/\ell_b} := \left\{ \sum_{i=1}^p \left( \sum_{k=1}^K |m_{ik}|^b \right)^{a/b} \right\}^{1/a} \qquad \text{and} \qquad |||M|||_{a,b} := \sup_{\|x\|_b = 1} \|Mx\|_a, \qquad (3)$$

although $\ell_\infty/\ell_p$ norms belong to both families (see Lemma B.0.1). For brevity, we denote the spectral norm $|||M|||_{2,2}$ as $|||M|||_2$, and the $\ell_\infty$-operator norm $|||M|||_{\infty,\infty} = \max_i \sum_j |M_{ij}|$ as $|||M|||_\infty$.

2 Main result and some consequences

The analysis of this paper applies to multivariate linear regression problems of the form (1), in which the noise matrix $W \in \mathbb{R}^{n \times K}$ is assumed to consist of i.i.d. elements $W_{ij} \sim N(0, \sigma^2)$. In addition, we assume that the measurement or design matrices $X$ have rows drawn in an i.i.d.
manner from a zero-mean Gaussian $N(0, \Sigma)$, where $\Sigma \succ 0$ is a $p \times p$ covariance matrix.

Suppose that we partition the full set of covariates into the support set $S$ and its complement $S^c$, with $|S| = s$ and $|S^c| = p - s$. Consider the following block decompositions of the regression coefficient matrix, the design matrix and its covariance matrix:

$$B^* = \begin{bmatrix} B^*_S \\ B^*_{S^c} \end{bmatrix}, \qquad X = [X_S \;\; X_{S^c}], \qquad \text{and} \qquad \Sigma = \begin{bmatrix} \Sigma_{SS} & \Sigma_{SS^c} \\ \Sigma_{S^cS} & \Sigma_{S^cS^c} \end{bmatrix}.$$

We use $\beta^*_i$ to denote the $i$th row of $B^*$, and the sparsity of $B^*$ is assessed as follows:

(A0) Sparsity: The matrix $B^*$ has row support $S := \{ i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \}$, with $s = |S|$.

In addition, we make the following assumptions about the covariance $\Sigma$ of the design matrix:

(A1) Bounded eigenspectrum: There exist constants $C_{\min} > 0$ and $C_{\max} < +\infty$ such that all eigenvalues of $\Sigma_{SS}$ are greater than $C_{\min}$ and all eigenvalues of $\Sigma$ are smaller than $C_{\max}$.

(A2) Mutual incoherence: There exists $\gamma \in (0, 1]$ such that $|||\Sigma_{S^cS}(\Sigma_{SS})^{-1}|||_\infty \leq 1 - \gamma$.

(A3) Self incoherence: There exists a constant $D_{\max}$ such that $|||(\Sigma_{SS})^{-1}|||_\infty \leq D_{\max}$.

Assumption A1 is a standard condition required to prevent excess dependence among elements of the design matrix associated with the support $S$. The mutual incoherence assumption A2 is also well known from previous work on model selection with the Lasso [10, 14].
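Given a candidate covariance $\Sigma$ and support $S$, the incoherence constants in (A2) and (A3) can be evaluated directly. The following sketch (our own illustration; the function names are hypothetical) computes $\gamma$ and $D_{\max}$ from the $\ell_\infty$-operator norms defined in Section 1.1:

```python
import numpy as np

def linf_operator_norm(M):
    """|||M|||_inf = max over rows of the l1 norm of the row."""
    return np.max(np.sum(np.abs(M), axis=1))

def incoherence_constants(Sigma, S):
    """Return (gamma, D_max) implied by assumptions (A2) and (A3)."""
    Sc = np.setdiff1d(np.arange(Sigma.shape[0]), S)
    Sigma_SS_inv = np.linalg.inv(Sigma[np.ix_(S, S)])
    gamma = 1.0 - linf_operator_norm(Sigma[np.ix_(Sc, S)] @ Sigma_SS_inv)
    D_max = linf_operator_norm(Sigma_SS_inv)
    return gamma, D_max

# For the standard Gaussian ensemble Sigma = I_p, gamma = 1 and D_max = 1,
# matching the constants quoted in the text.
gamma, D_max = incoherence_constants(np.eye(5), np.array([0, 1]))
assert gamma == 1.0 and D_max == 1.0
```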
These assumptions are trivially satisfied by the standard Gaussian ensemble ($\Sigma = I_p$), with $C_{\min} = C_{\max} = D_{\max} = \gamma = 1$. More generally, it can be shown that various matrix classes satisfy these conditions [14, 11].

2.1 Statement of main result

With the goal of estimating the union of supports $S$, our main result is a set of sufficient conditions for the following procedure. Solve the block-regularized problem (2) with regularization parameter $\lambda_n > 0$, thereby obtaining a solution $\hat{B} = \hat{B}(\lambda_n)$. Use this solution to compute an estimate of the support union as $\hat{S}(\hat{B}) := \{ i \in \{1, \ldots, p\} \mid \hat{\beta}_i \neq 0 \}$. This estimator is unambiguously defined if the solution $\hat{B}$ is unique, and as part of our analysis, we show that the solution $\hat{B}$ is indeed unique with high probability in the regime of interest. We study the behavior of this estimator for a sequence of linear regressions indexed by the triplet $(n, p, s)$, for which the data follow the general model presented in the previous section, with defining parameters $B^*(n)$ and $\Sigma(n)$ satisfying A0-A3. As $(n, p, s)$ tends to infinity, we give conditions on the triplet and on properties of $B^*$ under which $\hat{B}$ is unique and $P[\hat{S} = S] \to 1$.

The central objects in our main result are the sparsity-overlap function and the sample complexity parameter, which we define here. For any vector $\beta_i \neq 0$, define $\zeta(\beta_i) := \beta_i / \|\beta_i\|_2$. We extend the function $\zeta$ to any matrix $B_S \in \mathbb{R}^{s \times K}$ with non-zero rows by defining the matrix $\zeta(B_S) \in \mathbb{R}^{s \times K}$ with $i$th row $[\zeta(B_S)]_i = \zeta(\beta_i)$.
With this notation, we define the sparsity-overlap function $\psi(B)$ and the sample complexity parameter $\theta_{\ell_1/\ell_2}(n, p\,; B^*)$ as

$$\psi(B) := \big|\big|\big| \zeta(B_S)^T (\Sigma_{SS})^{-1} \zeta(B_S) \big|\big|\big|_2 \qquad \text{and} \qquad \theta_{\ell_1/\ell_2}(n, p\,; B^*) := \frac{n}{2\,\psi(B^*) \log(p - s)}. \qquad (4)$$

Finally, we use $b^*_{\min} := \min_{i \in S} \|\beta^*_i\|_2$ to denote the minimal $\ell_2$ row norm of the matrix $B^*_S$. With this notation, we have the following result:

Theorem 1. Consider a random design matrix $X$ drawn with i.i.d. $N(0, \Sigma)$ row vectors, an observation matrix $Y$ specified by model (1), and a regression matrix $B^*$ such that $(b^*_{\min})^2$ decays strictly more slowly than $\frac{f(p)}{n} \max\{s, \log(p - s)\}$, for some function $f(p) \to +\infty$. Suppose that we solve the block-regularized program (2) with regularization parameter $\lambda_n = \Theta\big(\sqrt{f(p) \log(p)/n}\big)$. Then for any sequence $(n, p, B^*)$ such that the $\ell_1/\ell_2$ control parameter $\theta_{\ell_1/\ell_2}(n, p\,; B^*)$ exceeds the critical threshold $\theta_{\mathrm{crit}}(\Sigma) := C_{\max}/\gamma^2$, with probability greater than $1 - \exp(-\Theta(\log p))$:

(a) the block-regularized program (2) has a unique solution $\hat{B}$, and

(b) its support set $\hat{S}(\hat{B})$ is equal to the true support union $S$.

Remarks: (i) For the standard Gaussian ensemble ($\Sigma = I_p$), the critical threshold is simply $\theta_{\mathrm{crit}}(\Sigma) = 1$. (ii) A technical condition that we require on the regularization parameter is

$$\frac{\lambda_n^2 n}{\log(p - s)} \to \infty, \qquad (5)$$

which is satisfied by the choice given in the statement.

2.2 Some consequences of Theorem 1

It is interesting to consider some special cases of our main result.
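The sparsity-overlap function $\psi$ and the control parameter $\theta_{\ell_1/\ell_2}$ are easy to compute numerically. The sketch below (our own illustration of the definitions in (4); the function names are hypothetical) also previews the two extremes discussed next, with $\Sigma_{SS} = I_s$: identical regression vectors give $\psi = s$, while orthogonal columns of $\zeta(B^*)$ give $\psi = s/K$:

```python
import numpy as np

def zeta(B_S):
    """Row-normalize: [zeta(B_S)]_i = beta_i / ||beta_i||_2 (rows assumed non-zero)."""
    return B_S / np.linalg.norm(B_S, axis=1, keepdims=True)

def sparsity_overlap(B_S, Sigma_SS):
    """psi(B) = spectral norm of zeta(B_S)^T (Sigma_SS)^{-1} zeta(B_S)."""
    Z = zeta(B_S)
    return np.linalg.norm(Z.T @ np.linalg.solve(Sigma_SS, Z), 2)

def theta(n, p, s, B_S, Sigma_SS):
    """Sample complexity parameter theta = n / (2 psi(B*) log(p - s))."""
    return n / (2.0 * sparsity_overlap(B_S, Sigma_SS) * np.log(p - s))

s, K = 8, 2
# Identical regressions (K copies of one vector): psi = s ...
B_id = np.ones((s, K))
assert np.isclose(sparsity_overlap(B_id, np.eye(s)), s)
# ... while orthogonal columns of equal length give psi = s / K.
B_orth = np.kron(np.ones((s // 2, 1)), np.array([[1.0, 1.0], [1.0, -1.0]]))
assert np.isclose(sparsity_overlap(B_orth, np.eye(s)), s / K)
```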
The simplest special case is the univariate regression problem ($K = 1$), in which case the function $\zeta(\beta^*)$ outputs an $s$-dimensional sign vector with elements $z^*_i = \mathrm{sign}(\beta^*_i)$, so that $\psi(\beta^*) = z^{*T}(\Sigma_{SS})^{-1} z^* = \Theta(s)$. Consequently, the order parameter of block $\ell_1/\ell_2$-regression for univariate regression is given by $\Theta(n/(2s\log(p - s)))$, which matches the scaling established in previous work on the Lasso [11].

More generally, given our assumption (A1) on $\Sigma_{SS}$, the sparsity overlap $\psi(B^*)$ always lies in the interval $[\frac{s}{K C_{\max}}, \frac{s}{C_{\min}}]$. At the most pessimistic extreme, suppose that $B^* := \beta^* \vec{1}_K^T$, that is, $B^*$ consists of $K$ copies of the same coefficient vector $\beta^* \in \mathbb{R}^p$, with support of cardinality $|S| = s$. We then have $[\zeta(B^*)]_{ij} = \mathrm{sign}(\beta^*_i)/\sqrt{K}$, from which we see that $\psi(B^*) = z^{*T}(\Sigma_{SS})^{-1} z^*$, with $z^*$ again the $s$-dimensional sign vector with elements $z^*_i = \mathrm{sign}(\beta^*_i)$, so that there is no benefit in sample complexity relative to the naive strategy of solving separate Lasso problems and constructing the union of individually estimated supports. This might seem a pessimistic result, since under model (1), we essentially have $Kn$ observations of the coefficient vector $\beta^*$ with the same design matrix but $K$ independent noise realizations. However, the thresholds as well as the rates of convergence in high-dimensional results such as Theorem 1 are not determined by the noise variance, but rather by the number of interfering variables $(p - s)$.

At the most optimistic extreme, consider the case where $\Sigma_{SS} = I_s$ and (for $s > K$) suppose that $B^*$ is constructed such that the columns of the $s \times K$ matrix $\zeta(B^*)$ are all orthogonal and of equal length. Under this condition, we have

Corollary 1 (Orthonormal tasks).
If the columns of the matrix $\zeta(B^*)$ are all orthogonal with equal length and $\Sigma_{SS} = I_{s \times s}$, then the block-regularized problem (2) succeeds in union support recovery once the sample complexity parameter $n/(2\frac{s}{K}\log(p - s))$ is larger than 1.

For the standard Gaussian ensemble, it is known [11] that the Lasso fails with probability one for all sequences such that $n < (2 - \nu)s\log(p - s)$ for any arbitrarily small $\nu > 0$. Consequently, Corollary 1 shows that under suitable conditions on the regression coefficient matrix $B^*$, $\ell_1/\ell_2$ regularization can provide a $K$-fold reduction in the number of samples required for exact support recovery.

As a third illustration, consider, for $\Sigma_{SS} = I_{s \times s}$, the case where the supports $S_k$ of the individual regression problems are all disjoint. The sample complexity parameter for each individual Lasso is $n/(2s_k\log(p - s_k))$, where $|S_k| = s_k$, so that the sample size required to recover the support union from individual Lassos scales as $n = \Theta(\max_k [s_k \log(p - s_k)])$. However, if the supports are all disjoint, then the columns of the matrix $Z^*_S = \zeta(B^*_S)$ are orthogonal, and $Z^{*T}_S Z^*_S = \mathrm{diag}(s_1, \ldots, s_K)$, so that $\psi(B^*) = \max_k s_k$ and the sample complexity is the same. In other words, even though there is no sharing of variables at all, there is surprisingly no penalty from regularizing jointly with the $\ell_1/\ell_2$-norm.
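The disjoint-support claim $Z^{*T}_S Z^*_S = \mathrm{diag}(s_1, \ldots, s_K)$ is easy to verify numerically. A minimal sketch (our own example, with hypothetical sizes $s_1 = 3$, $s_2 = 5$):

```python
import numpy as np

# Disjoint supports: task 1 active on the first 3 rows, task 2 on the next 5.
s1, s2 = 3, 5
B = np.zeros((s1 + s2, 2))
B[:s1, 0] = 1.0
B[s1:, 1] = -1.0

Z = B / np.linalg.norm(B, axis=1, keepdims=True)  # zeta(B_S): unit-norm rows
# Z^T Z = diag(s1, s2), so psi(B*) = max(s1, s2) when Sigma_SS = I.
assert np.allclose(Z.T @ Z, np.diag([s1, s2]))
assert np.isclose(np.linalg.norm(Z.T @ Z, 2), max(s1, s2))
```

The spectral norm picks out the largest block, $\max_k s_k$, exactly as in the text: joint $\ell_1/\ell_2$ regularization pays no penalty for the absence of shared variables (for this identity design).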
However, this is not always true if $\Sigma_{SS} \neq I_{s \times s}$, and in many situations $\ell_1/\ell_2$-regularization can have higher sample complexity than separate Lassos.

3 Proof of Theorem 1

High-level proof outline: At a high level, our proof is based on the notion of what we refer to as a primal-dual witness: we first formulate the problem (2) as a second-order cone program (SOCP), with the same primal variable $B$ as in (2) and a dual variable $Z$ whose rows coincide at optimality with the subgradient of the $\ell_1/\ell_2$ norm. In addition to previous notations, the proofs use the shorthands $\hat{\Sigma}_{SS} := \frac{1}{n} X_S^T X_S$ and $\hat{\Sigma}_{S^cS} := \frac{1}{n} X_{S^c}^T X_S$, and $\Pi_S := \frac{1}{n} X_S (\hat{\Sigma}_{SS})^{-1} X_S^T$ denotes the orthogonal projection onto the range of $X_S$. We then construct a primal matrix $\hat{B}$ along with a dual matrix $\hat{Z}$ such that, under the conditions of Theorem 1, with probability converging to 1:

(a) The pair $(\hat{B}, \hat{Z})$ satisfies the Karush-Kuhn-Tucker (KKT) conditions of the SOCP.

(b) In spite of the fact that for general high-dimensional problems (with $p \gg n$) the SOCP need not have a unique solution a priori, a strict feasibility condition satisfied by the dual variables $\hat{Z}$ guarantees that $\hat{B}$ is the unique optimal solution of (2).

(c) The support union $\hat{S}$ of $\hat{B}$ is identical to the support union $S$ of $B^*$.

At the core of our constructive procedure is the following convex-analytic result, which characterizes an optimal primal-dual pair for which the primal solution $\hat{B}$ correctly recovers the support set $S$:

Lemma 1.
Suppose that there exists a primal-dual pair $(\hat{B}, \hat{Z})$ satisfying the conditions

$$\hat{Z}_S = \zeta(\hat{B}_S), \qquad (6a)$$

$$\hat{\Sigma}_{SS}(\hat{B}_S - B^*_S) - \frac{1}{n} X_S^T W = -\lambda_n \hat{Z}_S, \qquad (6b)$$

$$\big\|\lambda_n \hat{Z}_{S^c}\big\|_{\ell_\infty/\ell_2} := \Big\| \hat{\Sigma}_{S^cS}(\hat{B}_S - B^*_S) - \frac{1}{n} X_{S^c}^T W \Big\|_{\ell_\infty/\ell_2} < \lambda_n, \qquad (6c)$$

$$\hat{B}_{S^c} = 0. \qquad (6d)$$

Then $(\hat{B}, \hat{Z})$ is the unique optimal solution to the block-regularized problem, with $\hat{S}(\hat{B}) = S$ by construction.

Appendix A proves Lemma 1, with the strict feasibility of $\hat{Z}_{S^c}$ given by (6c) certifying uniqueness.

3.1 Construction of primal-dual witness

Based on Lemma 1, we construct the primal-dual pair $(\hat{B}, \hat{Z})$ as follows. First, we set $\hat{B}_{S^c} = 0$, to satisfy condition (6d).
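The KKT conditions summarized by Lemma 1 can also be checked numerically at a computed solution. The sketch below (our own construction, not from the paper; the solver, instance sizes, and tolerances are all assumptions) fits $\hat{B}$ by proximal gradient descent and verifies that on active rows the negative gradient equals $\lambda_n \zeta(\hat{\beta}_i)$, while inactive rows are strictly dual-feasible:

```python
import numpy as np

def solve_block_l2(X, Y, lam, n_iter=2000):
    """Hypothetical proximal-gradient solver for the l1/l2-regularized problem (2)."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        R = B - step * (X.T @ (X @ B - Y) / n)
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        B = np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0) * R
    return B

rng = np.random.default_rng(1)
n, p, K, s, lam = 100, 10, 2, 3, 0.2
X = rng.standard_normal((n, p))
B_star = np.zeros((p, K))
B_star[:s] = rng.choice([-1.0, 1.0], (s, K))
Y = X @ B_star + 0.1 * rng.standard_normal((n, K))
B_hat = solve_block_l2(X, Y, lam)

G = X.T @ (X @ B_hat - Y) / n          # gradient of the smooth part
active = np.linalg.norm(B_hat, axis=1) > 1e-8
# Active rows: G_i = -lam * zeta(beta_hat_i), i.e. the dual rows are the subgradient.
zeta_hat = B_hat[active] / np.linalg.norm(B_hat[active], axis=1, keepdims=True)
assert np.allclose(G[active], -lam * zeta_hat, atol=1e-4)
# Inactive rows: strict dual feasibility ||G_i||_2 < lam, matching condition (6c).
assert np.all(np.linalg.norm(G[~active], axis=1) < lam)
```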
Next, we obtain the pair $(\hat{B}_S, \hat{Z}_S)$ by solving a restricted version of (2):

$$\hat{B}_S = \arg\min_{B_S \in \mathbb{R}^{s \times K}} \left\{ \frac{1}{2n} \Big|\Big|\Big| Y - X \begin{bmatrix} B_S \\ 0_{S^c} \end{bmatrix} \Big|\Big|\Big|_F^2 + \lambda_n \|B_S\|_{\ell_1/\ell_2} \right\}. \qquad (7)$$

Since $s < n$, the empirical covariance (sub)matrix $\hat{\Sigma}_{SS} = \frac{1}{n} X_S^T X_S$ is strictly positive definite with probability one, which implies that the restricted problem (7) is strictly convex and therefore has a unique optimum $\hat{B}_S$. We then choose $\hat{Z}_S$ to be the solution of equation (6b). Since any such matrix $\hat{Z}_S$ is also a dual solution to the SOCP (7), it must be an element of the subdifferential $\partial \|\hat{B}_S\|_{\ell_1/\ell_2}$. It remains to show that this construction satisfies conditions (6a) and (6c). In order to satisfy condition (6a), it suffices to show that no row of $\hat{B}_S$ is identically zero, i.e., that $\hat{\beta}_i \neq 0$ for all $i \in S$. Since $\hat{\Sigma}_{SS}$ is invertible, we may solve equation (6b) as

$$\hat{B}_S - B^*_S = \big(\hat{\Sigma}_{SS}\big)^{-1} \left[ \frac{X_S^T W}{n} - \lambda_n \hat{Z}_S \right] =: U_S. \qquad (8)$$

For any row $i \in S$, we have $\|\hat{\beta}_i\|_2 \geq \|\beta^*_i\|_2 - \|U_S\|_{\ell_\infty/\ell_2}$. Thus, it suffices to show that the event

$$\mathcal{E}(U_S) := \Big\{ \|U_S\|_{\ell_\infty/\ell_2} \leq \tfrac{1}{2}\, b^*_{\min} \Big\} \qquad (9)$$

occurs with high probability.
We establish this result later in this section.

Turning to condition (6c), by substituting expression (8) for the difference $(\hat{B}_S - B^*_S)$ into equation (6c), we obtain a $(p - s) \times K$ random matrix $V_{S^c}$, whose row $j \in S^c$ is given by

$$V_j := X_j^T \left[ \frac{(\Pi_S - I_n)\, W}{n} - \lambda_n \frac{X_S}{n} \big(\hat{\Sigma}_{SS}\big)^{-1} \hat{Z}_S \right]. \qquad (10)$$

In order for condition (6c) to hold, it is necessary and sufficient that the probability of the event

$$\mathcal{E}(V_{S^c}) := \big\{ \|V_{S^c}\|_{\ell_\infty/\ell_2} < \lambda_n \big\} \qquad (11)$$

converges to one as $n$ tends to infinity.

Correct inclusion of supporting covariates: We begin by analyzing the probability of $\mathcal{E}(U_S)$.

Lemma 2. Under assumption A3 and condition (5) of Theorem 1, with probability $1 - \exp(-\Theta(\log s))$, we have

$$\|U_S\|_{\ell_\infty/\ell_2} \leq O\Big(\sqrt{(\log s)/n}\Big) + \lambda_n \Big( D_{\max} + O\big(\sqrt{s^2/n}\big) \Big).$$

This lemma is proved in the Appendix.
With the assumed scaling $n = \Omega(s \log(p - s))$ and the assumed slow decrease of $b^*_{\min}$, which we write explicitly as $(b^*_{\min})^2 \geq \frac{1}{\varepsilon_n^2} \frac{f(p) \max\{s, \log(p - s)\}}{n}$ for some $\varepsilon_n \to 0$, we have

$$\frac{\|U_S\|_{\ell_\infty/\ell_2}}{b^*_{\min}} \leq O(\varepsilon_n), \qquad (12)$$

so that the conditions of Theorem 1 ensure that $\mathcal{E}(U_S)$ occurs with probability converging to one.

Correct exclusion of non-support: Next we analyze the event $\mathcal{E}(V_{S^c})$. For simplicity, in the following arguments, we drop the index $S^c$ and write $V$ for $V_{S^c}$. In order to show that $\|V\|_{\ell_\infty/\ell_2} < \lambda_n$ with probability converging to one, we make use of the decomposition

$$\frac{1}{\lambda_n} \|V\|_{\ell_\infty/\ell_2} \leq \sum_{i=1}^{3} T'_i,$$

where

$$T'_1 := \frac{1}{\lambda_n} \big\|\mathbb{E}[V \mid X_S]\big\|_{\ell_\infty/\ell_2}, \qquad T'_2 := \frac{1}{\lambda_n} \big\|\mathbb{E}[V \mid X_S, W] - \mathbb{E}[V \mid X_S]\big\|_{\ell_\infty/\ell_2},$$

$$T'_3 := \frac{1}{\lambda_n} \big\|V - \mathbb{E}[V \mid X_S, W]\big\|_{\ell_\infty/\ell_2}.$$

Lemma 3. Under assumption A2, $T'_1 \leq 1 - \gamma$. Under condition (5) of Theorem 1, $T'_2 = o_p(1)$.

Therefore, to show that $\frac{1}{\lambda_n}\|V\|_{\ell_\infty/\ell_2} < 1$ with high probability, it suffices to show that $T'_3 < \gamma$ with high probability. Until now, we haven't appealed to the sample complexity parameter $\theta_{\ell_1/\ell_2}(n, p\,; B^*)$.
In the next section, we prove that $\theta_{\ell_1/\ell_2}(n, p\,; B^*) > \theta_{\mathrm{crit}}(\Sigma)$ implies that $T'_3 < \gamma$ with high probability.

Lemma 4. Conditionally on $W$ and $X_S$, we have

$$\big\|V_j - \mathbb{E}[V_j \mid X_S, W]\big\|_2^2 \;\stackrel{d}{=}\; (\Sigma_{S^c \mid S})_{jj}\, \xi_j^T M_n \xi_j,$$

where $\xi_j \sim N(\vec{0}_K, I_K)$ and where the $K \times K$ matrix $M_n = M_n(X_S, W)$ is given by

$$M_n := \frac{\lambda_n^2}{n}\, \hat{Z}_S^T \big(\hat{\Sigma}_{SS}\big)^{-1} \hat{Z}_S + \frac{1}{n^2}\, W^T (\Pi_S - I_n) W. \qquad (13)$$

But the covariance matrix $M_n$ is itself concentrated. Indeed,

Lemma 5. Under condition (5) of Theorem 1, for any $\delta > 0$, the following event $\mathcal{T}(\delta)$ has probability converging to 1:

$$\mathcal{T}(\delta) := \Big\{ |||M_n|||_2 \leq \frac{\lambda_n^2}{n}\, \psi(B^*)\,(1 + \delta) \Big\}. \qquad (14)$$

For any fixed $\delta > 0$, we have $P[T'_3 \geq \gamma] \leq P[T'_3 \geq \gamma \mid \mathcal{T}(\delta)] + P[\mathcal{T}(\delta)^c]$; but, from Lemma 5, $P[\mathcal{T}(\delta)^c] \to 0$, so that it suffices to deal with the first term. Given that $(\Sigma_{S^c \mid S})_{jj} \leq (\Sigma_{S^cS^c})_{jj} \leq C_{\max}$ for all $j$, on the event $\mathcal{T}(\delta)$ we have

$$\max_{j \in S^c} (\Sigma_{S^c \mid S})_{jj}\, \xi_j^T M_n \xi_j \;\leq\; C_{\max}\, |||M_n|||_2 \max_{j \in S^c} \|\xi_j\|_2^2 \;\leq\; C_{\max}\, \frac{\lambda_n^2}{n}\, \psi(B^*)\,(1 + \delta) \max_{j \in S^c} \|\xi_j\|_2^2,$$

so that, with $t^*(n, B^*) := \frac{\gamma^2 n}{2\, C_{\max}\, \psi(B^*)\,(1 + \delta)}$, the bound $T'_3 < \gamma$ holds on $\mathcal{T}(\delta)$ whenever $\max_{j \in S^c} \|\xi_j\|_2^2 < 2\, t^*(n, B^*)$. Finally, using the union bound and a large-deviation bound for $\chi^2$ variates, we get the following lemma, whose condition is equivalent to the condition $\theta_{\ell_1/\ell_2}(n, p\,; B^*) > \theta_{\mathrm{crit}}(\Sigma)$ of Theorem 1:

Lemma 6.
$P\big[\max_{j \in S^c} \|\xi_j\|_2^2 \geq 2\, t^*(n, B^*)\big] \to 0$ if $t^*(n, B^*) > (1 + \nu)\log(p - s)$ for some $\nu > 0$.

4 Simulations

In this section, we illustrate the sharpness of Theorem 1 and furthermore ascertain how quickly the predicted behavior is observed as $n, p, s$ grow in different regimes, for two regression tasks (i.e., $K = 2$). In the following simulations, the matrix $B^*$ of regression coefficients is designed with entries $\beta^*_{ij} \in \{-1/\sqrt{2}, 1/\sqrt{2}\}$ to yield a desired value of $\psi(B^*)$. The design matrix $X$ is sampled from the standard Gaussian ensemble. Since $|\beta^*_{ij}| = 1/\sqrt{2}$ in this construction, we have $b^*_{\min} = 1$ and $B^*_S = \zeta(B^*_S)$. Moreover, since $\Sigma = I_p$, the sparsity overlap $\psi(B^*)$ is simply $|||\zeta(B^*)^T \zeta(B^*)|||_2$. From our analysis, the sample complexity parameter $\theta_{\ell_1/\ell_2}$ is controlled by the "interference" of irrelevant covariates, and not by the variance of a noise component.

We consider linear sparsity with $s = \alpha p$, for $\alpha = 1/8$, for various ambient model dimensions $p \in \{32, 256, 1024\}$.
For each value of $p$, we perform simulations varying the sample size $n$ to match corresponding values of the basic Lasso sample complexity parameter $\theta_{\mathrm{Las}} := n/(2s\log(p - s))$, in the interval $[0.25, 1.5]$. In each case, we solve the block-regularized problem (2) with sample size $n = 2\theta_{\mathrm{Las}}\, s \log(p - s)$, using the regularization parameter $\lambda_n = \sqrt{\log(p - s)(\log s)/n}$. In all cases, the noise level is set at $\sigma = 0.1$.

For our construction of the matrices $B^*$, we choose both $p$ and the scalings for the sparsity so that the obtained values of $s$ are multiples of four, and construct the columns $Z^{(1)*}$ and $Z^{(2)*}$ of the matrix $Z^* = \zeta(B^*)$ from copies of vectors of length 4. Denoting by $\otimes$ the usual matrix tensor product, we consider:

Identical regressions: We set $Z^{(1)*} = Z^{(2)*} = \frac{1}{\sqrt{2}}\vec{1}_s$, so that the sparsity overlap is $\psi(B^*) = s$.

Intermediate angles: In this intermediate case, the columns $Z^{(1)*}$ and $Z^{(2)*}$ are at a 60° angle, which leads to $\psi(B^*) = \frac{3}{4}s$. We set $Z^{(1)*} = \frac{1}{\sqrt{2}}\vec{1}_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}}\vec{1}_{s/4} \otimes (1, 1, 1, -1)^T$.

Orthogonal regressions: Here $B^*$ is constructed with $Z^{(1)*} \perp Z^{(2)*}$, so that $\psi(B^*) = \frac{s}{2}$, the most favorable situation. To achieve this, we set $Z^{(1)*} = \frac{1}{\sqrt{2}}\vec{1}_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}}\vec{1}_{s/2} \otimes (1, -1)^T$.

Figure 1 shows plots of all three cases and the reference Lasso case for the three different values of the ambient dimension and the two types of sparsity described above. Note how the curves all undergo a threshold phenomenon, with the location consistent with the predictions of Theorem 1.

Figure 1.
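The three constructions above can be reproduced and their sparsity overlaps checked directly (here with $\Sigma_{SS} = I_s$, so $\psi(B^*) = |||Z^{*T} Z^*|||_2$). A minimal sketch, using a hypothetical small size $s = 8$:

```python
import numpy as np

def psi_identity(Z):
    """Sparsity overlap when Sigma_SS = I: spectral norm of Z^T Z."""
    return np.linalg.norm(Z.T @ Z, 2)

s = 8  # a multiple of four, as in the construction above
ones = np.ones((s, 1))

# Identical regressions: Z1 = Z2, so psi = s.
Z_id = np.hstack([ones, ones]) / np.sqrt(2)
# Intermediate (60 degree) angle: psi = 3s/4.
pat4 = np.array([[1.0], [1.0], [1.0], [-1.0]])
Z_mid = np.hstack([ones, np.kron(np.ones((s // 4, 1)), pat4)]) / np.sqrt(2)
# Orthogonal regressions: psi = s/2.
pat2 = np.array([[1.0], [-1.0]])
Z_orth = np.hstack([ones, np.kron(np.ones((s // 2, 1)), pat2)]) / np.sqrt(2)

assert np.isclose(psi_identity(Z_id), s)
assert np.isclose(psi_identity(Z_mid), 0.75 * s)
assert np.isclose(psi_identity(Z_orth), s / 2)
```

The three values $s$, $\frac{3}{4}s$, and $\frac{s}{2}$ correspond exactly to the predicted thresholds $\theta_{\mathrm{Las}} > 1$, $0.75$, and $0.50$ reported in Figure 1.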
Plots of support recovery probability P[Ŝ = S] versus the basic ℓ1 control parameter θ_Las = n/[2s log(p − s)] for linear sparsity s = p/8, and for increasing values of p ∈ {32, 256, 1024} from left to right. Each graph shows four curves, corresponding to the case of independent ℓ1 regularization (pluses) and, for ℓ1/ℓ2 regularization, the cases of identical regressions (crosses), intermediate angles (nablas), and orthogonal regressions (squares). As plotted in dotted vertical lines, Theorem 1 predicts that the identical case should succeed for θ_Las > 1 (the same as the ordinary Lasso), the intermediate case for θ_Las > 0.75, and the orthogonal case for θ_Las > 0.50. The shift of these curves confirms this prediction.

5 Discussion

We have studied support union recovery under high-dimensional scaling with ℓ1/ℓ2 regularization, and shown that its sample complexity is determined by the function ψ(B*). The latter integrates the sparsity of each univariate regression with the overlap of all the supports and the discrepancies between the estimated parameter vectors. In favorable cases, for K regressions, the sample complexity for ℓ1/ℓ2 is K times smaller than that of the Lasso. Moreover, this gain is not obtained at the expense of an assumption of shared support over the data. In fact, for standard Gaussian designs, the regularization seems "adaptive", in the sense that it does not perform worse than the Lasso even for disjoint supports. This is not necessarily the case for more general designs, and in some situations, which remain to be characterized in future work, it could do worse than the Lasso.

References

[1] F. Bach. Consistency of the group Lasso and multiple kernel learning. Technical report, INRIA - Département d'Informatique, Ecole Normale Supérieure, 2008.

[2] F. Bach, G. Lanckriet, and M. Jordan.
Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. Int. Conf. Machine Learning (ICML). Morgan Kaufmann, 2004.

[3] D. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info. Theory, 52(1):6–18, January 2006.

[4] H. Liu and J. Zhang. On the ℓ1–ℓq regularized regression. Technical Report arXiv:0802.1517v1, Carnegie Mellon University, 2008.

[5] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Technical report, Mathematics Department, Swiss Federal Institute of Technology Zürich, 2007.

[6] Y. Nardi and A. Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605–633, 2008.

[7] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2009. To appear.

[8] M. Pontil and C. A. Micchelli. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

[9] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. SpAM: sparse additive models. In Neural Info. Proc. Systems (NIPS) 21, Vancouver, Canada, December 2007.

[10] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Info. Theory, 52(3):1030–1051, March 2006.

[11] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programs. Technical Report 709, Department of Statistics, UC Berkeley, 2006.

[12] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68(1):49–67, 2006.

[13] P. Zhao, G. Rocha, and B. Yu.
Grouped and hierarchical model selection through composite absolute penalties. Technical report, Statistics Department, UC Berkeley, 2007.

[14] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.