{"title": "Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2304, "page_last": 2312, "abstract": "Given $n$ noisy samples with $p$ dimensions, where $n \\ll p$, we show that the multi-stage thresholding procedures can accurately estimate a sparse vector $\\beta \\in \\R^p$ in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of $s$, which is the number of non-zero elements in the true parameter $\\beta$. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if $X$ obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Cand\\{e}s-Tao 07) achieves the $\\ell_2$ loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model.", "full_text": "Thresholding Procedures for High Dimensional\n\nVariable Selection and Statistical Estimation\n\nShuheng Zhou\n\nSeminar f\u00a8ur Statistik\n\nETH Z\u00a8urich\n\nCH-8092, Switzerland\n\nAbstract\n\nGiven n noisy samples with p dimensions, where n \u226a p, we show that the multi-\nstep thresholding procedure can accurately estimate a sparse vector \u03b2 \u2208 Rp in a\nlinear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov\n09). Thus our conditions for model selection consistency are considerably weaker\nthan what has been achieved in previous works. 
More importantly, this method al-\nlows very signi\ufb01cant values of s, which is the number of non-zero elements in the\ntrue parameter. For example, it works for cases where the ordinary Lasso would\nhave failed. Finally, we show that if X obeys a uniform uncertainty principle and\nif the true parameter is suf\ufb01ciently sparse, the Gauss-Dantzig selector (Cand`es-\nTao 07) achieves the \u21132 loss within a logarithmic factor of the ideal mean square\nerror one would achieve with an oracle which would supply perfect information\nabout which coordinates are non-zero and which are above the noise level, while\nselecting a suf\ufb01ciently sparse model.\n\n1\n\nIntroduction\n\nIn a typical high dimensional setting, the number of variables p is much larger than the number of\nobservations n. This challenging setting appears in linear regression, signal recovery, covariance\nselection in graphical modeling, and sparse approximations. In this paper, we consider recovering\n\u03b2 \u2208 Rp in the following linear model:\n\nY = X\u03b2 + \u01eb,\n\n(1.1)\nwhere X is an n \u00d7 p design matrix, Y is a vector of noisy observations and \u01eb is the noise term. We\nassume throughout this paper that p \u2265 n (i.e. high-dimensional), \u01eb \u223c N (0, \u03c32In), and the columns\nof X are normalized to have \u21132 norm \u221an. Given such a linear model, two key tasks are to identify\nthe relevant set of variables and to estimate \u03b2 with bounded \u21132 loss.\nIn particular, recovery of the sparsity pattern S = supp(\u03b2) := {j : \u03b2j 6= 0}, also known as variable\n(model) selection, refers to the task of correctly identifying the support set (or a subset of \u201csigni\ufb01-\ncant\u201d coef\ufb01cients in \u03b2) based on the noisy observations. Even in the noiseless case, recovering \u03b2 (or\nits support) from (X, Y ) seems impossible when n \u226a p. 
However, a line of recent research shows that it becomes possible when β is also sparse, that is, when it has a relatively small number of nonzero coefficients, and when the design matrix X is sufficiently nice, which we elaborate on below. One important stream of research, which we also adopt here, requires computational feasibility of the estimation methods; among these, the Lasso and the Dantzig selector are both well studied and come with provably good statistical properties; see for example [11, 9, 19, 21, 5, 18, 12, 2]. For a chosen penalization parameter λn ≥ 0, regularized estimation with the ℓ1-norm penalty, also known as the Lasso [16] or Basis Pursuit [6], refers to the following convex optimization problem

β̂ = arg min_{β ∈ Rᵖ} (1/(2n)) ‖Y − Xβ‖₂² + λn ‖β‖₁,   (1.2)

where the scaling factor 1/(2n) is chosen for convenience. The Dantzig selector [5] is defined as

(DS)   arg min_{β̂ ∈ Rᵖ} ‖β̂‖₁  subject to  ‖(1/n) X^T (Y − Xβ̂)‖∞ ≤ λn.   (1.3)

Our goal in this work is to recover S as accurately as possible: we wish to obtain β̂ such that |supp(β̂) \ S| (and sometimes |S △ supp(β̂)| as well) is small with high probability, while at the same time ‖β̂ − β‖₂² is bounded within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level (hence achieving the oracle inequality studied in [7, 5]). We regard the bound on the ℓ2-loss as a natural criterion for evaluating a sparse model when it is not exactly S. Let s = |S|. Given T ⊆ {1, . . . , p}, let X_T denote the n × |T| submatrix obtained by extracting the columns of X indexed by T; similarly, let β_T ∈ R^{|T|} be the subvector of β ∈ Rᵖ confined to T.

Formally, we study a Multi-step Procedure. First we obtain an initial estimator β_init using the Lasso as in (1.2) or the Dantzig selector as in (1.3), with λn = Θ(σ√(2 log p / n)).

1. We threshold the estimator β_init with t0, with the general goal that we obtain a set I1 of cardinality at most 2s; in general, we also have |I1 ∪ S| ≤ 2s, where I1 = {j ∈ {1, . . . , p} : |β_{j,init}| ≥ t0} for some t0 to be specified. Set I = I1.

2. We then feed (Y, X_I) to either the Lasso estimator as in (1.2) or the ordinary least squares (OLS) estimator to obtain β̂, where we set β̂_I = (X_I^T X_I)⁻¹ X_I^T Y and β̂_{I^c} = 0.

3. We then possibly threshold β̂_{I1} with t1 = 4λn √|I1| (to be specified) to obtain I2, repeat step 2 with I = I2 to obtain β̂_I, and set all other coordinates to zero; return β̂.

Our algorithm is constructive in that it does not rely on the unknown parameters s, β_min := min_{j∈S} |β_j|, or those that characterize the incoherence conditions on X; instead, our choice of λn and of the thresholding parameters depends only on σ, n, and p. In our experiments, we apply only the first two steps, which we refer to as a two-step procedure; in particular, the Gauss-Dantzig selector is a two-step procedure with the Dantzig selector as β_init [5]. In theory, we apply the third step only when β_min is sufficiently large and when we wish to obtain a "sparser" model I.

More definitions. For a matrix A, let Λmin(A) and Λmax(A) denote its smallest and largest eigenvalues respectively. We refer to a vector υ ∈ Rᵖ with at most s non-zero entries, where s ≤ p, as an s-sparse vector. 
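In code, the multi-step procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: ISTA (iterative soft-thresholding) stands in for any solver of the Lasso (1.2), the first-stage threshold is fixed at 4λn (one concrete choice of t0; the theory allows a range), and the demo dimensions are illustrative.

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=500):
    # Initial estimator beta_init: minimize (1/(2n))||Y - X b||_2^2 + lam*||b||_1
    # by iterative soft-thresholding -- a simple stand-in for any Lasso solver.
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2           # 1/L, L = Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - Y) / n)
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold
    return b

def multi_step(X, Y, sigma, third_step=False):
    # Steps 1-3 of the procedure; the thresholds depend only on sigma, n and p.
    n, p = X.shape
    lam = sigma * np.sqrt(2 * np.log(p) / n)       # lambda_n = Theta(sigma*sqrt(2 log p / n))
    b_init = lasso_ista(X, Y, lam)
    I = np.flatnonzero(np.abs(b_init) >= 4 * lam)  # step 1: hard-threshold beta_init at t0

    def ols_refit(I):                              # step 2: OLS on the selected columns only
        b = np.zeros(p)
        if I.size:
            b[I] = np.linalg.lstsq(X[:, I], Y, rcond=None)[0]
        return b

    b_hat = ols_refit(I)
    if third_step and I.size:                      # step 3: re-threshold at t1 = 4*lam*sqrt(|I1|)
        I = np.flatnonzero(np.abs(b_hat) >= 4 * lam * np.sqrt(I.size))
        b_hat = ols_refit(I)
    return b_hat, I

# Demo on synthetic data with a strong signal (beta_min well above sigma/sqrt(n)).
rng = np.random.default_rng(0)
n, p, s, sigma = 100, 200, 5, 0.5
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # columns normalized to l2-norm sqrt(n)
beta = np.zeros(p)
supp = rng.choice(p, s, replace=False)
beta[supp] = 3.0
Y = X @ beta + sigma * rng.standard_normal(n)
b_hat, I = multi_step(X, Y, sigma)
```

The hard threshold removes the Lasso's false positives and the OLS refit removes its shrinkage bias; that division of labor is what the analysis below formalizes.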
Throughout this paper, we assume that n ≥ 2s and

Λmin(2s) ≜ min_{υ≠0; 2s-sparse} ‖Xυ‖₂² / (n‖υ‖₂²) > 0.   (1.4)

It is clear that n ≥ 2s is necessary, as any submatrix with more than n columns must be singular. In general, we also assume Λmax(s) ≜ max_{υ≠0; s-sparse} ‖Xυ‖₂² / (n‖υ‖₂²) < ∞. As defined in [4], the s-restricted isometry constant δs of X is the smallest quantity such that

(1 − δs)‖υ‖₂² ≤ ‖X_T υ‖₂² / n ≤ (1 + δs)‖υ‖₂²

for all T ⊆ {1, . . . , p} with |T| ≤ s and all coefficient sequences (υ_j)_{j∈T}. It is clear that δs is non-decreasing in s and that 1 − δs ≤ Λmin(s) ≤ Λmax(s) ≤ 1 + δs; hence δ2s < 1 implies (1.4). Occasionally, we use β_T ∈ R^{|T|}, where T ⊆ {1, . . . , p}, to also represent its 0-extended version β′ ∈ Rᵖ such that β′_{T^c} = 0 and β′_T = β_T; for example in (1.5) below.

Oracle inequalities. The following idea has been explained in [5]; we hence describe it only briefly. Note that due to a different normalization of the columns of X, our expressions differ slightly from those in [5]. Consider the least-squares estimator β̂_I = (X_I^T X_I)⁻¹ X_I^T Y, where |I| ≤ s, and consider the ideal least-squares estimator

β⋆ = arg min_{I ⊆ {1,...,p}, |I| ≤ s} E‖β − β̂_I‖₂²,   (1.5)

which minimizes the expected mean squared error. It follows from [5] that for Λmax(s) < ∞,

E‖β − β⋆‖₂² ≥ min(1, 1/Λmax(s)) Σ_{i=1}^p min(βᵢ², σ²/n).   (1.6)

Now we check whether, for Λmax(s) < ∞, it holds with high probability that

‖β̂ − β‖₂² = O(log p) Σ_{i=1}^p min(βᵢ², σ²/n),   (1.7)

so that

‖β̂ − β‖₂² = O(log p) max(1, Λmax(s)) E‖β⋆ − β‖₂²   (1.8)

in view of (1.6). These bounds are meaningful since Σ_{i=1}^p min(βᵢ², σ²/n) = min_{I⊆{1,...,p}} (‖β − β_I‖₂² + |I|σ²/n) represents the ideal squared bias plus variance. We elaborate on conditions on the design under which we accomplish these goals using the multi-step procedures in the rest of this section. We now define a constant λσ,a,p for each a > 0, by which we bound the maximum correlation between the noise and the covariates of X; we apply it only to X with column ℓ2 norms bounded by √n. Let

Ta := {ε : ‖X^T ε / n‖∞ ≤ λσ,a,p}, where λσ,a,p = σ√(1 + a) √(2 log p / n),   (1.9)

hence

P(Ta) ≥ 1 − (√(π log p) pᵃ)⁻¹, for a ≥ 0; see [5].   (1.10)

Variable selection. Our first result in Theorem 1.1 shows that consistent variable selection is possible under the Restricted Eigenvalue conditions, as formalized in [2]. 
Similar conditions have been used by [10] and [17].

Assumption 1.1 (Restricted Eigenvalue assumption RE(s, k0, X) [2]) For some integer 1 ≤ s ≤ p and a positive number k0, the following holds:

1/K(s, k0, X) ≜ min_{J0 ⊆ {1,...,p}, |J0| ≤ s}  min_{υ ≠ 0, ‖υ_{J0^c}‖₁ ≤ k0 ‖υ_{J0}‖₁}  ‖Xυ‖₂ / (√n ‖υ_{J0}‖₂) > 0.   (1.11)

If RE(s, k0, X) is satisfied with k0 ≥ 1, then the square submatrices of size ≤ 2s of X^T X are necessarily positive definite (see [2]), and hence (1.4) must hold. We do not impose any extra constraint on s besides what is allowed in order for (1.11) to hold. Note that when s > n/2, it is impossible for the restricted eigenvalue assumption to hold, as X_I for any I with |I| = 2s becomes singular in this case. Hence our algorithm is especially relevant if one would like to estimate a parameter β such that s is very close to n; see Section 4 for such examples. Let β_min := min_{j∈S} |β_j|.

Theorem 1.1 (Variable selection under Assumption 1.1) Suppose that the RE(s, k0, X) condition holds, where k0 = 1 for the DS and k0 = 3 for the Lasso. Suppose λn ≥ B λσ,a,p for λσ,a,p as in (1.9), where B ≥ 1 for the DS and B ≥ 2 for the Lasso. Let B2 = 1/(B Λmin(2s)). Let s ≥ K⁴(s, k0, X) and

β_min ≥ 4√2 max(K(s, k0, X), 1) λn √s + max(4K²(s, k0, X), √2 B2) λn √s.

Then with probability at least P(Ta), the multi-step procedure returns β̂ such that

S ⊆ I := supp(β̂), where |I \ S| < B2²/16,  and

‖β̂ − β‖₂² ≤ λσ,a,p² |I| / Λmin²(|I|) ≤ 2 log p (1 + a) s σ² (1 + B2²/16) / (n Λmin²(2s)),

which satisfies (1.7) and (1.8) given that β_min ≥ σ/√n and Σ_{i=1}^p min(βᵢ², σ²/n) = sσ²/n.

Our analysis builds upon the rate of convergence bounds for β_init derived in [2]. The first implication of this work, and also one of the motivations for analyzing the thresholding methods, is: under Assumption 1.1, one can obtain consistent variable selection for very significant values of s, if only a few extra variables are allowed to be included in the estimator β̂. In our simulations, we recover the exact support set S with very high probability using a two-step procedure. Note that we did not optimize the lower bound on s, as we focus on cases when the support S is large.

Thresholding that achieves the oracle inequalities. The natural question upon obtaining Theorem 1.1 is: is there a good thresholding rule that enables us to obtain a sufficiently sparse estimator β̂ when some components of β_S (and hence β_min) are well below σ/√n, which also satisfies the oracle inequality as in (1.7)? 
Before we answer this question, we de\ufb01ne s0 as the smallest integer\nsuch that\n\nmin(\u03b22\n\npXi=1\n\ni , \u03bb2\u03c32) \u2264 s0\u03bb2\u03c32, where \u03bb =p2 log p/n,\n| h XT c, XT \u2032 c\u2032 i /n| \u2264 \u03b8s,s\u2032 kck2 kc\u2032k2\n\n(1.12)\n\nand the (s, s\u2032)-restricted orthogonality constant [4] \u03b8s,s\u2032 as the smallest quantity such that\n\n(1.13)\nholds for all disjoint sets T, T \u2032 \u2286 {1, . . . , p} of cardinality |T| \u2264 s and |T \u2032| < s\u2032, where s + s\u2032 \u2264\np. Note that \u03b8 is non-decreasing in s, s\u2032 and small values of \u03b8s,s\u2032 indicates that disjoint subsets\ncovariates in XT and XT \u2032 span nearly orthogonal subspaces.\n\nTheorem 1.2 says that under a uniform uncertainty principle (UUP), thresholding of an initial\n\nDantzig selector \u03b2init, at the level of \u0398(\u03c3p2 log p/n) indeed identi\ufb01es a sparse model I of car-\n\n2-loss for its corresponding least-squares estimator is indeed\ndinality at most 2s0 such that the \u21132\nbounded within O(log p) of the ideal mean square error as in (1.5), when \u03b2 is as sparse as required\nby the Dantzig selector to achieve such an oracle inequality [5]. This is accomplished without any\nknowledge of the signi\ufb01cant coordinates of \u03b2 and not being able to observe parameter values.\nAssumption 1.2 (A Uniform Uncertainly Principle) [5] For some integer 1 \u2264 s < n/3, assume\n\u03b42s + \u03b8s,2s < 1, which implies that \u03bbmin(2s) > \u03b8s,2s given that 1 \u2212 \u03b42s \u2264 \u039bmin(2s).\nTheorem 1.2 Choose \u03c4, a > 0 and set \u03bbn = \u03bbp,\u03c4 \u03c3, where \u03bbp,\u03c4 := (\u221a1 + a + \u03c4 \u22121)p2 log p/n,\nin (1.3). Suppose \u03b2 is s-sparse with \u03b42s + \u03b8s,2s < 1 \u2212 \u03c4 . Let threshold t0 be chosen from the\nrange (C1\u03bbp,\u03c4 \u03c3, C4\u03bbp,\u03c4 \u03c3] for some constants C1, C4 to be de\ufb01ned. 
Then with probability at least\n1\u2212(\u221a\u03c0 log ppa)\u22121, the Gauss-Dantzig selectorb\u03b2 selects a model I := supp(b\u03b2) such that |I| \u2264 2s0,\n\n(1.14)\n\n3 log p \u03c32/n +\n\n2 \u2264 2C 2\n\n|I \\ S| \u2264 s0 \u2264 s, and kb\u03b2 \u2212 \u03b2k2\n\nmin(\u03b22\n\ni , \u03c32/n)! ,\n\npXi=1\n\nwhere C3 depends on a, \u03c4 , \u03b42s, \u03b8s,2s and C4; see (3.3).\n\nOur analysis builds upon [5]. Note that allowing t0 to be chosen from a range (as wide as one\nwould like, with the cost of increasing the constant C3 in (1.14)), saves us from having to estimate\nC1, which indeed depends on \u03b42s and \u03b8s,2s. Assumption 1.2 implies that Assumption 1.1 holds for\n\nk0 = 1 with K(s, k0, X) = p\u039bmin(2s)/(\u039bmin(2s) \u2212 \u03b8s,2s) \u2264 p\u039bmin(2s)/(1 \u2212 \u03b42s \u2212 \u03b8s,2s)\n\n(see [2]); It is an open question if we can derive the same result under Assumption 1.1.\n\nPrevious work. Finally, we brie\ufb02y review related work in multi-step procedures and the role of\nsparsity for high-dimensional statistical inference. Before this work, hard thresholding idea has\nbeen shown in [5] (via Gauss-Dantzig selector) as a method to correct the bias of the initial Dantzig\nselector. The empirical success of the Gauss-Dantzig selector in terms of improving the statistical\naccuracy is strongly evident in their experimental results. 
Our theoretical analysis on the oracle\ninequalities, which hold for the Gauss-Dantzig selector under a uniform uncertainty principle, is\nexactly inspired by their theoretical analysis of the initial Dantzig selector under the same conditions.\nFor the Lasso, [12] has also shown in theoretical analysis that thresholding is effective in obtaining\n\n4\n\n\fa two-step estimatorb\u03b2 that is consistent in its support with \u03b2; however, the choice of threshold level\n\ndepends on the unknown value \u03b2min (which needs to be suf\ufb01ciently large) and s, and their theory\ndoes not directly yield (or imply) an algorithm for \ufb01nding such parameters. Further, as pointed out\nby [2], a weakening of their condition is still suf\ufb01cient for Assumption 1.1 to hold.\n\nThe sparse recovery problem under arbitrary noise is also well studied, see [3, 15, 14]. Although\nas argued in [3, 14], the best accuracy under arbitrary noise has essentially been achieved in both\nwork, their bounds are worse than that in [5] (hence the present paper) under the stochastic noise as\ndiscussed in the present paper; see more discussions in [5]. Moreover, greedy algorithms in [15, 14]\nrequire s to be part of their input, while the iterative algorithms in the present paper do not have such\nrequirement, and hence adapt to the unknown level of sparsity s well. A more general framework\non multi-step variable selection was studied by [20]. They control the probability of false positives\nat the price of false negatives, similar to what we aim for in the present paper. Unfortunately, their\nanalysis is constrained to the case when s is a constant. Finally, under a restricted eigenvalue con-\n\ndition slightly stronger than Assumption 1.1, [22] requires s = O(pn/ log p) in order to achieve\n\nvariable selection consistency using the adaptive Lasso [23] as the second step procedure.\n\nOrganization of the paper. We prove Theorem 1.1 essentially in Section 2. 
A thresholding frame-\nwork for the general setting is described in Section 3, which also sketches the proof of Theorem 1.2.\nSection 4 brie\ufb02y discusses the relationship between linear sparsity and random design matrices.\nSection 5 includes simulation results showing that our two-step procedure is consistent with our\ntheoretical analysis on variable selection.\n\n2 Thresholding procedure when \u03b2min is large\nWe use a penalization parameter \u03bbn = B\u03bb\u03c3,a,p and assume \u03b2min > C\u03bbn\u221as for some constants\nB, C throughout this section; we \ufb01rst specify the thresholding parameters in this case. We then show\nin Theorem 2.1 that our algorithm works under any conditions so long as the rate of convergence\nof the initial estimator obeys the bounds in (2.2). Theorem 1.1 is a corollary of Theorem 2.1 under\nAssumption 1.1, given the rate of convergence bounds for \u03b2init following derivations in [2].\n\nThe Iterative Procedure. We obtain an initial estimator \u03b2init using the Lasso or the Dantzig selector.\n\nbSi+1 =(cid:26)j \u2208 bSi : b\u03b2(i)\n\nLet bS0 = {j : \u03b2j,init > 4\u03bbn}, and b\u03b2(0) := \u03b2init; Iterate through the following steps twice, for i =\n0, 1: (a) Set ti = 4\u03bbnq|bSi|; (b) Threshold b\u03b2(i) with ti to obtain I := bSi+1, where\nj \u2265 4\u03bbnq|bSi|(cid:27) and compute b\u03b2(i+1)\n= b\u03b2(2)\nReturn the \ufb01nal set of variables in bS2 and output b\u03b2 such that b\u03b2 bS2\nTheorem 2.1 Let \u03bbn \u2265 B\u03bb\u03c3,a,p, where B \u2265 1 is a constant suitably chosen such that the initial\nestimator \u03b2init satis\ufb01es on Ta, for \u03c5init = \u03b2init \u2212 \u03b2 and some constants B0, B1,\nk\u03c5init,Sk2 \u2264 B0\u03bbn\u221as and k\u03c5init,Sck1 \u2264 B1\u03bbns;\n\nand b\u03b2j = 0,\u2200j \u2208 bSc\n\nI XI )\u22121X T\n\n= (X T\n\n(2.2)\n\n(2.1)\n\nI Y.\n\n2.\n\nwhere B2 = 1/(B\u039bmin(2s)). 
Then for s \u2265 B2\n\n1 /16, it holds on Ta that |bSi| \u2264 2s,\u2200i = 1, 2, and\nwhere b\u03b2(i) are the OLS estimators based on I = bSi; Finally, the Iterative Procedure includes the\ncorrect set of variables in bS2 such that S \u2286 bS2 \u2286 bS1 and\n\nSuppose \u03b2min \u2265 (cid:16)max(cid:16)pB1, 2(cid:17) 2\u221a2 + max(cid:16)B0,\u221a2B2(cid:17)(cid:17) \u03bbn\u221as,\nkb\u03b2(i) \u2212 \u03b2k2 \u2264 \u03bb\u03c3,a,pq|bSi|/\u039bmin(|bSi|) \u2264 \u03bbnB2\u221a2s,\u2200i = 1, 2,\n(cid:12)(cid:12)(cid:12)bS2 \\ S(cid:12)(cid:12)(cid:12) :=(cid:12)(cid:12)(cid:12)supp(b\u03b2) \\ S(cid:12)(cid:12)(cid:12) \u2264\n\n1\n16B2\u039b2\n\nB2\n2\n16\n\n(2.3)\n\n(2.4)\n\n(2.5)\n\n.\n\nmin(|bS1|) \u2264\n\nI\n\nbS2\n\n5\n\n\fRemark 2.2 Without the knowledge of \u03c3, one could use b\u03c3 \u2265 \u03c3 in \u03bbn; this will put a stronger\nbS1 such that |bS1| \u2264 2s and bS1 \u2287 S, we only need to threshold \u03b2init at t0 = B1\u03bbn (see Section 3 and\n\nrequirement on \u03b2min, but all conclusions of Theorem 2.1 hold. We also note that in order to obtain\nLemma 3.2 for an example); instead of having to estimate B1, we use t0 = \u0398(\u03bbn\u221as) to threshold.\n\n3 A thresholding framework for the general setting\n\nIn this section, we wish to derive a meaningful criteria for consistency in variable selection, when\n\u03b2min is well below the noise level. Suppose that we are given an initial estimator \u03b2init that achieves\nthe rate of convergence bound as in (1.14), which adapts nearly ideally to the uncertainty in the\nsupport set S and the \u201csigni\ufb01cant\u201d set. We show that although we cannot guarantee the presence\n\nof variables indexed by {j : |\u03b2j| < \u03c3p2 log p/n} to be included in the \ufb01nal set I (cf. 
(3.7)) due\n\nto their lack of strength, we wish to include the signi\ufb01cant variables from S in I such that the OLS\nestimator based on I achieves this almost ideal rate of convergence as \u03b2init does, even though some\nvariables from S are missing in I. Here we pay a price for the missing variables in order to obtain a\nsparse model I. Toward this goal, we analyze the following algorithm under Assumption 1.2.\nThe General Two-step Procedure: Assume \u03b42s + \u03b8s,2s < 1 \u2212 \u03c4 , where \u03c4 > 0;\n\n1. First we obtain an initial estimator \u03b2min using the Dantzig selector with \u03bbp,\u03c4 := (\u221a1 + a+\n\u03c4 \u22121)p2 log p/n, where \u03c4, a \u2265 0; we then threshold \u03b2init with t0, chosen from the range\n\n(C1\u03bbp,\u03c4 \u03c3, C4\u03bbp,\u03c4 \u03c3], to obtain a set I of cardinality at most 2s, (we prove a stronger result\nin Lemma 3.2), where\n\nI := {j \u2208 {1, . . . , p} : \u03b2j,init > t0} ,\n\nfor C1 as de\ufb01ned in (3.3);\n\n(3.1)\n\n2. In the second step, given a set I of cardinality at most 2s, we run the OLS regression to\n\nobtain obtained via (3.1),b\u03b2I = (X T\n\nI XI )\u22121X T\n\nI Y and set b\u03b2j = 0,\u2200j 6\u2208 I.\n\nTheorem 2 in [5] has shown that the Dantzig selector achieves nearly the ideal level of MSE.\nProposition 3.1 [5] Let Y = X\u03b2 + \u01eb, for \u01eb being i.i.d. N (0, \u03c32) and kXjk2\nand set \u03bbn = \u03bbp,\u03c4 \u03c3 := (\u221a1 + a + \u03c4 \u22121)\u03c3p2 log p/n in (1.3). 
Then if \u03b2 is s-sparse with \u03b42s +\n\u03b8s,2s < 1\u2212 \u03c4 , the Dantzig selector obeys with probability at least 1\u2212 (\u221a\u03c0 log ppa)\u22121,(cid:13)(cid:13)(cid:13)b\u03b2 \u2212 \u03b2(cid:13)(cid:13)(cid:13)\n2 (\u221a1 + a + \u03c4 \u22121)2 log p(cid:0)\u03c32/n +Pp\n\nFrom this point on we let \u03b4 := \u03b42s and \u03b8 := \u03b8s,2s; Analysis in [5] (Theorem 2) and the current paper\nyields the following constants, where C3 has not been optimized,\n\ni=1 min(cid:0)\u03b22\n\ni , \u03c32/n(cid:1)(cid:1) .\n\n2 = n. Choose \u03c4, a > 0\n\n2 \u2264\n\n2C 2\n\n2\n\nC2 = 2C \u2032\n\n0 +\n\nwhere C \u2032\n\n0 =\n\n1 + \u03b4\n1 \u2212 \u03b4 \u2212 \u03b8\n\nC0\n\n1 \u2212 \u03b4 \u2212 \u03b8\n\n+\n\n\u03b8(1 + \u03b4)\n\n(1 \u2212 \u03b4 \u2212 \u03b8)2 ,\n\n(3.2)\n\nwhere C0 = 2\u221a2(cid:16)1 + 1\u2212\u03b42\n\n1\u2212\u03b4\u2212\u03b8(cid:17) + (1 + 1/\u221a2) (1+\u03b4)2\n\n1\u2212\u03b4\u2212\u03b8 ; We now de\ufb01ne\n\nand C 2\n\n3 = 3(\u221a1 + a + \u03c4 \u22121)2((C \u2032\n\n0 + C4)2 + 1) +\n\nC1 = C \u2032\n\n0 +\n\n1 + \u03b4\n1 \u2212 \u03b4 \u2212 \u03b8\n\n4(1 + a)\n\u039b2\nmin(2s0)\n\n.\n\n(3.3)\n\nWe \ufb01rst set up the notation following that in [5]. We order the \u03b2j\u2019s in decreasing order of magnitude\n(3.4)\ni , \u03bb2\u03c32) \u2264 s0\u03bb2\u03c32, where \u03bb =\n\nRecall that s0 is the smallest integer such that Pp\np2 log p/n. Thus by de\ufb01nition of s0, as essentially shown in [5], that 0 \u2264 s0 \u2264 s and\nmin(cid:18)\u03b22\n\n\u03c32\n\nn(cid:19)! (3.5)\n\n|\u03b21| \u2265 |\u03b22|... 
\u2265 |\u03b2p|.\n\ni=1 min(\u03b22\n\ni ,\n\n+\n\npXi=1\n\nmin(\u03b22\n\ni , \u03bb2\u03c32) \u2264 2 log p \u03c32\nj , \u03bb2\u03c32) \u2265 (s0 + 1) min(\u03b22\n\nn\n\nmin(\u03b22\n\ns0\u03bb2\u03c32 \u2264 \u03bb2\u03c32 +\ns0+1Xj=1\n\nand s0\u03bb2\u03c32 \u2265\n\ns0+1, \u03bb2\u03c32) for s < p,\n\n(3.6)\n\npXi=1\n\n6\n\n\fwhich implies that min(\u03b22\n\ns0+1, \u03bb2\u03c32) < \u03bb2\u03c32 and hence by (3.4),\n\n|\u03b2j| < \u03bb\u03c3 for all j > s0.\n\n(3.7)\n\nWe now show in Lemma 3.2 that thresholding at the level of C\u03bb\u03c3 at step 1 selects a set I of at most\n2s0 variables, among which at most s0 are from Sc.\nLemma 3.2 Choose \u03c4 > 0 such that \u03b42s + \u03b8s,2s < 1 \u2212 \u03c4 . Let \u03b2init be the \u21131-minimizer subject to\nthe constraints, for \u03bb :=p2 log p/n and \u03bbp,\u03c4 := (\u221a1 + a + t\u22121)p2 log p/n,\nGiven some constant C4 \u2265 C1, for C1 as in (3.3), choose a thresholding parameter t0 so that\n\nX T (Y \u2212 X\u03b2init)(cid:13)(cid:13)(cid:13)(cid:13)\u221e \u2264 \u03bbp,\u03c4 \u03c3.\n\n(cid:13)(cid:13)(cid:13)(cid:13)\n\n(3.8)\n\n1\nn\n\nC4\u03bbp,\u03c4 \u03c3 \u2265 t0 > C1\u03bbp,\u03c4 \u03c3; Set I = {j : |\u03b2j,init| > t0}.\n\nThen with probability at least P (Ta), as detailed in Proposition 3.1, we have for C \u2032\n\n0 as in (3.2),\n\n|I| \u2264 2s0, and |I \u222a S| \u2264 s + s0, and\nk\u03b2Dk2 \u2264 q(C \u2032\n\n0 + C4)2 + 1\u03bbp,\u03c4 \u03c3\u221as0, where D := {1, . . . , p} \\ I.\n\n(3.9)\n\n(3.10)\n\nNext we show that even if we miss some columns of X in S, we can still hope to get the convergence\nrate as required in Theorem 1.2 so long as k\u03b2Dk2 is bounded and I is suf\ufb01ciently sparse, for example,\nas bounded in Lemma 3.2. We \ufb01rst show in Lemma 3.3 a general result on rate of convergence of\nthe OLS estimator based on a chosen model I, where a subset of relevant variables are missing.\nLemma 3.3 (OLS estimator with missing variables) Let D := {1, . 
. . , p} \\ I and SR = D \u2229 S\nsuch that I \u2229 SR = \u2205. Suppose |I \u222a SR| \u2264 2s. Then we have on Ta, for the least squares estimator\nbased on I,b\u03b2I = (X T\n(cid:13)(cid:13)(cid:13)b\u03b2I \u2212 \u03b2(cid:13)(cid:13)(cid:13)\n\n2 \u2264 (cid:16)(cid:16)\u03b8|I|,|SR| k\u03b2Dk2 + \u03bb\u03c3,a,pp|I|(cid:17) /\u039bmin(|I|)(cid:17)2\n\nNow Theorem 1.2 is an immediate corollary of Lemma 3.2 and 3.3 in view of (3.5), given that\n|SR| < s, and |I| \u2264 2s0 and |I \u222a SR| \u2264 |I \u222a S| \u2264 s + s0 \u2264 2s as in Lemma 3.2 (3.9). Hence it is\nclear by (3.10) that we cannot cut too many \u201csigni\ufb01cant\u201d variables; in particular, for those that are\nlarger \u03bb\u03c3\u221as0, we can cut at most a constant number of them.\n\nI XI )\u22121X T\n2\n\nI Y , it holds that\n\n+ k\u03b2Dk2\n2 .\n\n4 Linear sparsity and random matrices\n\nA special case of design matrices that satisfy the Restricted Eigenvalue assumptions are the random\ndesign matrices. This is shown in a large body of work, for example [3, 4, 5, 1, 13], which shows\nthat the uniform uncertainty principle (UUP) holds for \u201cgeneric\u201d or random design matrices for very\nsigni\ufb01cant values of s. For example, it is well known that for a random matrix with i.i.d. Gaussian\nvariables (that is, Gaussian Ensemble, subject to normalizations of columns), and the Bernoulli and\nSubgaussian Ensembles [1, 13], the UUP holds for s = O(n/ log(p/n)); hence the thresholding\nprocedure can recover a sparse model using nearly a constant number of measurements per non-\nzero component despite the stochastic noise, when n is a nonnegligible fraction of p. See [5] for\nother examples of random designs. In our simulations as shown in Section 5, exact recovery rate of\nthe sparsity pattern is very high for a few types of random matrices using a two-step procedure, once\nthe number of samples passes a certain threshold. For example, for an i.i.d. 
Gaussian Ensemble, the threshold for exact recovery is n = Θ(s log(p/n)), where Θ hides a very small constant, when β_min is sufficiently large; this is in strong contrast with the ordinary Lasso, for which the probability of success in terms of exact recovery of the sparsity pattern tends to zero when n < 2s log(p − s) [19]. In ongoing work, the author is exploring thresholding algorithms for a broader class of random designs that satisfy the Restricted Eigenvalue assumptions.

[Figure 1 appears here: four panels plotting probability of success against n. (a) p = 256, two-step procedure vs. Lasso at s = 8 and s = 64; (b) p = 512 and (c) p = 1024 at several sparsity levels; (d) p = 1024, sample size n vs. sparsity s at the 80% and 90% success levels.]

Figure 1: (a) Compare the probability of success under s = 8 and 64 for p = 256. The two-step procedure requires far fewer samples than the ordinary Lasso. (b), (c) show the probability of success of the two-step procedure under different levels of sparsity as n increases, for p = 512 and 1024 respectively; (d) the number of samples n increases almost linearly with s for p = 1024.

5 Illustrative experiments

In our implementation, we choose to use the Lasso as the initial estimator. 
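One repetition of this experiment can be sketched as follows; this is a simplified stand-in for the actual setup (a plain ISTA solver replaces the Lasso/LARS path, the dimensions are smaller than those in Figure 1, f_t = 1/4 is one choice from the stated range [1/6, 1/3], and the threshold is read as t0 = f_t √(ŝ log p / n)).

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=400):
    # ISTA stand-in for the Lasso initial estimator.
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - Y) / n)
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b

def one_trial(p, s, n, f_t=0.25, rng=None):
    # One repetition: Gaussian ensemble X, beta with +-0.9 on s random coordinates,
    # eps ~ N(0, I_n); success = exact sign-pattern recovery by the two-step procedure.
    if rng is None:
        rng = np.random.default_rng()
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # column l2-norms = sqrt(n)
    beta = np.zeros(p)
    supp = rng.choice(p, s, replace=False)
    beta[supp] = rng.choice([-0.9, 0.9], s)
    Y = X @ beta + rng.standard_normal(n)              # sigma = 1
    lam = 0.69 * np.sqrt(2 * np.log(p) / n)            # lambda_n as fixed in the experiments
    b_init = lasso_ista(X, Y, lam)
    s_hat = int(np.sum(np.abs(b_init) >= 0.5 * lam))   # s_hat = |S0_hat|
    t0 = f_t * np.sqrt(max(s_hat, 1) * np.log(p) / n)  # t0 = f_t * sqrt(s_hat * log(p) / n)
    I = np.flatnonzero(np.abs(b_init) >= t0)
    b_hat = np.zeros(p)
    if I.size:                                         # OLS refit on the selected columns
        b_hat[I] = np.linalg.lstsq(X[:, I], Y, rcond=None)[0]
    success = bool(np.array_equal(np.sign(b_hat), np.sign(beta)))
    return success, b_hat

success, b_hat = one_trial(p=128, s=8, n=110, rng=np.random.default_rng(1))
```

Averaging `success` over 100 such repetitions at each tuple (p, s, n) reproduces one point on the curves in Figure 1.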
We show in Figure 1 that the two-step procedure indeed recovers a sparse model using a small number of samples per non-zero component in β when X is a Gaussian Ensemble. Similar behavior was also observed for the Bernoulli Ensemble in our simulations. We run three cases, p = 256, 512, 1024; for each p, we increase the sparsity s in roughly equal steps from s = 0.2p/log(0.2p) to p/4. For each tuple (p, s, n), we first generate a random Gaussian Ensemble of size n × p as X, where X_ij ∼ N(0, 1), which is then normalized to have column ℓ2-norm √n. For a given (p, s, n) and X, we repeat the following experiment 100 times: 1) Generate a vector β of length p: within β, randomly choose s non-zero positions, and to each such position assign the value 0.9 or −0.9 at random. 2) Generate a noise vector ǫ of length n according to N(0, I_n), where I_n is the identity matrix. 3) Compute Y = Xβ + ǫ; Y and X are then fed to the two-step procedure to obtain β̂. 4) Compare β̂ with β; if all components match in sign, count the experiment as a success. At the end of the 100 experiments, we compute the percentage of successful runs as the probability of success. We compare with the ordinary Lasso, for which we search over the full path of LARS [8] and always choose the β̂ that best matches β in terms of support. Inside the two-step procedure, we always fix $\lambda_n \approx 0.69\sqrt{2\log p/n}$ and threshold β_init at $t_0 = f_t\sqrt{\log p/n}\,\hat{s}$, where $\hat{s} = |\hat{S}_0|$ for $\hat{S}_0 = \{j : |\beta_{j,\mathrm{init}}| \ge 0.5\lambda_n\}$, and f_t is a constant chosen from the range [1/6, 1/3].

Acknowledgments. This research was supported by the Swiss National Science Foundation (SNF) Grant 20PA21-120050/1. The author thanks Larry Wasserman, Sara van de Geer and Peter Bühlmann for helpful discussions, comments and their kind support throughout this work.

References

[1] R. G.
Baraniuk, M. Davenport, R. A. DeVore, and M. B. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.

[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[3] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, August 2006.

[4] E. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Info. Theory, 51:4203–4215, 2005.

[5] E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

[6] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.

[7] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1994.

[8] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[9] E. Greenshtein and Y. Ritov. Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10:971–988, 2004.

[10] V. Koltchinskii. Dantzig selector and sparsity oracle inequalities. Bernoulli, 15(3):799–828, 2009.

[11] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[12] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.

[13] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles.
Constructive Approximation, 28(3):277–289, 2008.

[14] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2008.

[15] D. Needell and R. Vershynin. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, to appear, 2009.

[16] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

[17] S. A. van de Geer. The deterministic Lasso. The JSM Proceedings, American Statistical Association, 2007.

[18] S. A. van de Geer. High-dimensional generalized linear models and the Lasso. The Annals of Statistics, 36:614–645, 2008.

[19] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming. IEEE Trans. Inform. Theory, to appear, 2008; also posted as Technical Report 709, Department of Statistics, UC Berkeley, 2006.

[20] L. Wasserman and K. Roeder. High dimensional variable selection. The Annals of Statistics, 37(5A):2178–2201, 2009.

[21] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.

[22] S. Zhou, S. van de Geer, and P. Bühlmann. Adaptive Lasso for high dimensional regression and Gaussian graphical modeling, 2009. arXiv:0903.2515.

[23] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.