{"title": "Estimating LASSO Risk and Noise Level", "book": "Advances in Neural Information Processing Systems", "page_first": 944, "page_last": 952, "abstract": "We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector $\\theta_0\\in R^p$ from noisy linear observation $y=X\\theta_0+w\\in R^n$ and the popular estimation procedure of solving an $\\ell_1$-penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the $\\ell_2$ estimation risk $\\|\\hat{\\theta}-\\theta_0\\|_2$ and the variance of the noise. These can be used to select the regularization parameter optimally. Our approach combines Stein unbiased risk estimate (Stein'81) and recent results of (Bayati and Montanari'11-12) on the analysis of approximate message passing and risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices $X$ of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on the validity of a certain conjecture from statistical physics. Our approach is the first that provides an asymptotically consistent risk estimator. In addition, we demonstrate through simulation that our variance estimation outperforms several existing methods in the literature.", "full_text": "Estimating LASSO Risk and Noise Level\n\nMohsen Bayati\nStanford University\n\nbayati@stanford.edu\n\nMurat A. Erdogdu\nStanford University\n\nerdogdu@stanford.edu\n\nAndrea Montanari\nStanford University\n\nmontanar@stanford.edu\n\nAbstract\n\nWe study the fundamental problems of variance and risk estimation in high di-\nmensional statistical modeling. 
In particular, we consider the problem of learning a coefficient vector $\theta_0 \in R^p$ from noisy linear observations $y = X\theta_0 + w \in R^n$ ($p > n$) and the popular estimation procedure of solving the $\ell_1$-penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the $\ell_2$ estimation risk $\|\hat{\theta} - \theta_0\|_2$ and the variance of the noise when the distributions of $\theta_0$ and $w$ are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein's unbiased risk estimate [Ste81] and the recent results of [BM12a, BM12b] on the analysis of approximate message passing and the risk of the LASSO.
We establish high-dimensional consistency of our estimators for sequences of matrices $X$ of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics.
To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature.

1 Introduction

In the Gaussian random design model for linear regression, we seek to reconstruct an unknown coefficient vector $\theta_0 \in R^p$ from a vector of noisy linear measurements $y \in R^n$:

$y = X\theta_0 + w$,   (1.1)

where $X \in R^{n\times p}$ is a measurement (or feature) matrix with iid rows generated through a multivariate normal density. The noise vector $w$ has iid entries with mean 0 and variance $\sigma^2$. While this problem is well understood in the low dimensional regime $p \ll n$, a growing corpus of research addresses the more challenging high-dimensional scenario in which $p > n$.
The Basis Pursuit Denoising (BPDN) or LASSO [CD95, Tib96] is an extremely popular approach in this regime, which finds an estimate for $\theta_0$ by minimizing the cost function

$\mathcal{C}_{X,y}(\lambda, \theta) \equiv (2n)^{-1} \|y - X\theta\|_2^2 + \lambda \|\theta\|_1$,   (1.2)

with $\lambda > 0$. In particular, $\theta_0$ is estimated by $\hat{\theta}(\lambda; X, y) = \mathrm{argmin}_\theta\, \mathcal{C}_{X,y}(\lambda, \theta)$. This method is well suited for the ubiquitous case in which $\theta_0$ is sparse, i.e. a small number of features effectively predict the outcome. Since this optimization problem is convex, it can be solved efficiently, and fast specialized algorithms have been developed for this purpose [BT09].
Research has established a number of important properties of the LASSO estimator under suitable conditions on the design matrix $X$, and for sufficiently sparse vectors $\theta_0$. Under irrepresentability conditions, the LASSO correctly recovers the support of $\theta_0$ [ZY06, MB06, Wai09]. Under weaker conditions, such as restricted isometry or compatibility properties, correct recovery of the support fails; however, the $\ell_2$ estimation error $\|\hat{\theta} - \theta_0\|_2$ is of the same order as the one achieved by an oracle estimator that knows the support [CRT06, CT07, BRT09, BdG11]. Finally, [DMM09, RFG09, BM12b] provided asymptotic formulas for the MSE or other operating characteristics of $\hat{\theta}$, for Gaussian design matrices $X$.
While the aforementioned research provides solid justification for using the LASSO estimator, it is of limited guidance to the practitioner. For instance, a crucial question is how to set the regularization parameter $\lambda$. This question becomes even more urgent for high-dimensional methods with multiple regularization terms. The oracle bounds of [CRT06, CT07, BRT09, BdG11] suggest to take $\lambda = c\,\sigma\sqrt{\log p}$ with $c$ a dimension-independent constant (say $c = 1$ or $2$). However, in practice a factor two in $\lambda$ can make a substantial difference for statistical applications. Related to this issue is the question of estimating accurately the $\ell_2$ error $\|\hat{\theta} - \theta_0\|_2^2$. The above oracle bounds have the form $\|\hat{\theta} - \theta_0\|_2^2 \le C\,k\,\lambda^2$, with $k = \|\theta_0\|_0$ the number of nonzero entries in $\theta_0$, as long as $\lambda \ge c\,\sigma\sqrt{\log p}$. As a consequence, minimizing the bound does not yield a recipe for setting $\lambda$. Finally, estimating the noise level is necessary for applying these formulae, and this is in itself a challenging question.
The results of [DMM09, BM12b] provide exact asymptotic formulae for the risk, and its dependence on the regularization parameter $\lambda$. This might appear promising for choosing the optimal value of $\lambda$, but has one serious drawback. The formulae of [DMM09, BM12b] depend on the empirical distribution¹ of the entries of $\theta_0$, which is of course unknown, as well as on the noise level². A step towards the resolution of this problem was taken in [DMM11], which determined the least favorable noise level and distribution of entries, and hence suggested a prescription for $\lambda$ and a predicted risk in this case. While this settles the question (in an asymptotic sense) from a minimax point of view, it would be preferable to have a prescription that is adaptive to the distribution of the entries of $\theta_0$ and to the noise level.
Our starting point is the asymptotic results of [DMM09, DMM11, BM12a, BM12b]. These provide a construction of unbiased pseudo-data $\hat{\theta}^u$ that is asymptotically Gaussian with mean $\theta_0$. The LASSO estimator $\hat{\theta}$ is obtained by applying a denoiser function to $\hat{\theta}^u$.
We then use Stein's Unbiased Risk Estimate (SURE) [Ste81] to derive an expression for the $\ell_2$ risk (mean squared error) of this operation. What results is an expression for the mean squared error of the LASSO that only depends on the observed data $y$ and $X$. Finally, by modifying this formula we obtain an estimator for the noise level.
We prove that these estimators are asymptotically consistent for sequences of design matrices $X$ with converging aspect ratio and iid Gaussian entries. We expect that the consistency holds far beyond this case. In particular, for the case of general Gaussian design matrices, consistency holds conditionally on a conjectured formula stated in [JM13] on the basis of the "replica method" from statistical physics.
For the sake of concreteness, let us briefly describe our method in the case of standard Gaussian design, that is, when the design matrix $X$ has iid Gaussian entries. We construct the unbiased pseudo-data vector by

$\hat{\theta}^u = \hat{\theta} + X^T (y - X\hat{\theta})/[n - \|\hat{\theta}\|_0]$.   (1.3)

Our estimator of the mean squared error is derived by applying SURE to the unbiased pseudo-data. In particular, our estimator is $\hat{R}(y, X, \lambda, \hat{\tau})$ where

$\hat{R}(y, X, \lambda, \tau) \equiv \tau^2\big(2\|\hat{\theta}\|_0/p - 1\big) + \|X^T(y - X\hat{\theta})\|_2^2 \big/ \big[p\,(n - \|\hat{\theta}\|_0)^2\big]$.   (1.4)

Here $\hat{\theta}(\lambda; X, y)$ is the LASSO estimator and $\hat{\tau} = \|y - X\hat{\theta}\|_2/[n - \|\hat{\theta}\|_0]$. Our estimator of the noise level is $\hat{\sigma}^2/n = \hat{\tau}^2 - \hat{R}(y, X, \lambda, \hat{\tau})/\delta$, where $\delta = n/p$.
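The estimators above are simple enough to compute directly from a LASSO fit. The following is a minimal sketch, assuming numpy and scikit-learn; scikit-learn's Lasso minimizes exactly the objective of Eq. (1.2) with alpha playing the role of $\lambda$. The function name lasso_risk_and_noise is ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_risk_and_noise(X, y, lam):
    """Sketch of the estimators in Eqs. (1.3)-(1.4) and the noise-level
    estimator hat{sigma}^2/n = hat{tau}^2 - hat{R}/delta."""
    n, p = X.shape
    # sklearn's Lasso minimizes (2n)^{-1} ||y - X theta||_2^2 + alpha ||theta||_1,
    # which matches Eq. (1.2) with alpha = lambda.
    theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=3000).fit(X, y).coef_
    df = np.count_nonzero(theta_hat)                # ||theta_hat||_0
    resid = y - X @ theta_hat
    # Unbiased pseudo-data, Eq. (1.3) (not needed for R_hat itself, but returned)
    theta_u = theta_hat + X.T @ resid / (n - df)
    tau_hat = np.linalg.norm(resid) / (n - df)
    # MSE estimator, Eq. (1.4)
    R_hat = tau_hat**2 * (2 * df / p - 1) \
        + np.linalg.norm(X.T @ resid)**2 / (p * (n - df)**2)
    delta = n / p
    sigma2_over_n = tau_hat**2 - R_hat / delta      # noise-level estimator
    return theta_hat, theta_u, R_hat, sigma2_over_n
```

Note that everything on the right-hand side is observable: only $y$, $X$ and $\lambda$ enter, never $\theta_0$ or $\sigma^2$.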
Although our rigorous results are asymptotic in the problem dimensions, we show through numerical simulations that they are accurate already on problems with a few thousands of variables. To the best of our knowledge, this is the first method for estimating the LASSO mean squared error solely based on data. We compare our approach with earlier work on the estimation of the noise level. The authors of [NSvdG10] target this problem by using an $\ell_1$-penalized maximum log-likelihood estimator (PMLE), and a related method called "Scaled Lasso" [SZ12] (also studied by [BC13]) considers an iterative algorithm to jointly estimate the noise level and $\theta_0$. Moreover, the authors of [FGH12] developed a refitted cross-validation (RCV) procedure for the same task. Under some conditions, the aforementioned studies provide consistency results for their noise level estimators. We compare our estimator with these methods through extensive numerical simulations.
The rest of the paper is organized as follows. In order to motivate our theoretical work, we start with numerical simulations in Section 2. The necessary background on SURE and the asymptotic distributional characterization of the LASSO is presented in Section 3.

¹The probability distribution that puts a point mass $1/p$ at each of the $p$ entries of the vector.
²Note that our definition of noise level $\sigma$ corresponds to $\sigma\sqrt{n}$ in most of the compressed sensing literature.

Figure 1: Red color represents the estimated values by our estimators and green color represents the true values to be estimated. Left: MSE versus regularization parameter $\lambda$. Here, $\delta = 0.5$, $\sigma^2/n = 0.2$, $X \in R^{n\times p}$ with iid $N_1(0, 1)$ entries where $n = 4000$. Right: $\hat{\sigma}^2/n$ versus $\lambda$. Comparison of different estimators of $\sigma^2$ under the same model parameters. Scaled Lasso's prescribed choice of $(\lambda, \hat{\sigma}^2/n)$ is marked with a bold x.
Finally, our main theoretical results can be found in Section 4.

2 Simulation Results

In this section, we validate the accuracy of our estimators through numerical simulations. We also analyze the behavior of our variance estimator as $\lambda$ varies, along with four other methods. Two of these methods rely on the minimization problem

$(\hat{\theta}, \hat{\sigma}) = \mathrm{argmin}_{\theta,\sigma} \left\{ \frac{\|y - X\theta\|_2^2}{2n\,h_1(\sigma)} + h_2(\sigma) + \lambda\, \frac{\|\theta\|_1}{h_3(\sigma)} \right\}$,

where for the PMLE $h_1(\sigma) = \sigma^2$, $h_2(\sigma) = \log(\sigma)$, $h_3(\sigma) = \sigma$, and for the Scaled Lasso $h_1(\sigma) = \sigma$, $h_2(\sigma) = \sigma/2$, and $h_3(\sigma) = 1$. The third method is a naïve procedure that estimates the variance in two steps: (i) use the LASSO to determine the relevant variables; (ii) apply ordinary least squares on the selected variables to estimate the variance. The fourth method is Refitted Cross-Validation (RCV) by [FGH12], which also has two stages. RCV requires the sure screening property, that is, the model selected in its first stage must include all the relevant variables. Note that this requirement may not be satisfied for many values of $\lambda$. In our implementation of RCV, we used the LASSO for variable selection.
In our simulation studies, we used the LASSO solver l1_ls [SJKG07]. We simulated 50 replications and, within each, generated a new Gaussian design matrix $X$. We solved the LASSO over 20 equidistant $\lambda$'s in the interval [0.1, 2].
For each $\lambda$, a new signal $\theta_0$ and noise independent from $X$ were generated.

Figure 2: Red color represents the estimated values by our estimators and green color represents the true values to be estimated. Left: MSE versus regularization parameter $\lambda$. Here, $\delta = 0.5$, $\sigma^2/n = 0.2$, rows of $X \in R^{n\times p}$ are iid from $N_p(0, \Sigma)$ where $n = 5000$ and $\Sigma$ has entries 1 on the main diagonal and 0.4 above and below the main diagonal. Right: Comparison of different estimators of $\sigma^2/n$. Parameter values are the same as in Figure 1. Scaled Lasso's prescribed choice of $(\lambda, \hat{\sigma}^2/n)$ is marked with a bold x.

The results are demonstrated in Figures 1 and 2. Figure 1 is obtained using $n = 4000$, $\delta = 0.5$ and $\sigma^2/n = 0.2$. The coordinates of the true signal independently take the values 0, 1, −1 with probabilities 0.9, 0.05, 0.05 respectively. For each replication, we used a design matrix $X$ with $X_{i,j} \sim_{\mathrm{iid}} N_1(0, 1)$. Figure 2 is obtained with $n = 5000$ and the same values of $\delta$ and $\sigma^2$ as in Figure 1. The coordinates of the true signal independently take the values 0, 1, −1 with probabilities 0.9, 0.05, 0.05 respectively. For each replication, we used a design matrix $X$ where each row is independently generated through $N_p(0, \Sigma)$, where $\Sigma$ has 1 on the main diagonal and 0.4 above and below the diagonal.
As can be seen from the figures, the asymptotic theory applies quite well to the finite dimensional data.
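A rough version of this experiment fits in a few lines; the following sketch (numpy/scikit-learn, with smaller problem sizes and a coarser $\lambda$ grid than in the figures, both our choices) sweeps $\lambda$, computes the estimator $\hat{R}$ of Eq. (1.4), and compares it against the true MSE, which is available only because this is a simulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 1000, 2000                        # delta = n/p = 0.5
sigma2_over_n = 0.2                      # noise level in the paper's scaling
theta0 = rng.choice([0.0, 1.0, -1.0], size=p, p=[0.9, 0.05, 0.05])
X = rng.standard_normal((n, p))
y = X @ theta0 + rng.normal(scale=np.sqrt(sigma2_over_n * n), size=n)

results = []
for lam in [0.75, 1.0, 1.25, 1.5]:
    theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=3000).fit(X, y).coef_
    df = np.count_nonzero(theta_hat)
    resid = y - X @ theta_hat
    tau2 = (np.linalg.norm(resid) / (n - df)) ** 2
    # MSE estimator of Eq. (1.4), evaluated at hat{tau}
    R_hat = tau2 * (2 * df / p - 1) + np.linalg.norm(X.T @ resid) ** 2 / (p * (n - df) ** 2)
    mse_true = np.sum((theta_hat - theta0) ** 2) / p    # known only in simulation
    results.append((lam, R_hat, mse_true))

best_lam = min(results, key=lambda r: r[1])[0]          # lambda minimizing estimated MSE
```

Since $\hat{R}$ depends only on the data, picking best_lam this way is a data-driven choice of the regularization parameter.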
We refer the reader to [BEM13] for a more detailed simulation analysis.

3 Background and Notations

3.1 Preliminaries and Definitions

First, we provide a brief introduction to the approximate message passing (AMP) algorithm suggested by [DMM09] and its connection to the LASSO (see [DMM09, BM12b] for more details). For an appropriate sequence of non-linear denoisers $\{\eta_t\}_{t\ge 0}$, the AMP algorithm constructs a sequence of estimates $\{\theta^t\}_{t\ge 0}$, pseudo-data $\{y^t\}_{t\ge 0}$, and residuals $\{\epsilon^t\}_{t\ge 0}$, where $\theta^t, y^t \in R^p$ and $\epsilon^t \in R^n$. These sequences are generated according to the iteration

$\theta^{t+1} = \eta_t(y^t)$,  $y^t = \theta^t + X^T \epsilon^t/n$,  $\epsilon^t = y - X\theta^t + \frac{1}{\delta}\, \epsilon^{t-1} \langle \eta'_{t-1}(y^{t-1}) \rangle$,   (3.1)

where $\delta \equiv n/p$ and the algorithm is initialized with $\theta^0 = 0 \in R^p$, $\epsilon^0 = 0 \in R^n$. In addition, each denoiser $\eta_t(\cdot)$ is a separable function and its derivative is denoted by $\eta'_t(\cdot)$. Given a scalar function $f$ and a vector $u \in R^m$, we let $f(u)$ denote the vector $(f(u_1), \ldots, f(u_m)) \in R^m$ obtained by applying $f$ component-wise, and $\langle u \rangle \equiv m^{-1} \sum_{i=1}^m u_i$ is the average of the vector $u \in R^m$.
Next, consider the state evolution for the AMP algorithm. For the random variable $\Theta_0 \sim p_{\theta_0}$, a positive constant $\sigma^2$ and a given sequence of non-linear denoisers $\{\eta_t\}_{t\ge 0}$, define the sequence $\{\tau_t^2\}_{t\ge 0}$ iteratively by

$\tau_{t+1}^2 = F_t(\tau_t^2)$,  $F_t(\tau^2) \equiv \sigma^2 + \frac{1}{\delta}\, E\{[\eta_t(\Theta_0 + \tau Z) - \Theta_0]^2\}$,   (3.2)

where $\tau_0^2 = \sigma^2 + E\{\Theta_0^2\}/\delta$ and $Z \sim N_1(0, 1)$ is independent of $\Theta_0$. From Eq. (3.2), it is apparent that the function $F_t$ depends on the distribution of $\Theta_0$. It is shown in [BM12a] that the pseudo-data $y^t$ has the same asymptotic distribution as $\Theta_0 + \tau_t Z$. This result can be roughly interpreted as follows: the pseudo-data generated by AMP is the sum of the true signal and a normally distributed noise with zero mean, whose variance is determined by the state evolution. In other words, each iteration produces pseudo-data that is distributed normally around the true signal, i.e. $y^t_i \approx \theta_{0,i} + N_1(0, \tau_t^2)$. The importance of this result will appear later when we use Stein's method in order to obtain an estimator for the MSE and the variance of the noise.
We will use state evolution in order to describe the behavior of a specific type of converging sequence, defined as follows:
Definition 1. The sequence of instances $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in N}$ indexed by $n$ is said to be a converging sequence if $\theta_0(n) \in R^p$, $X(n) \in R^{n\times p}$, $\sigma^2(n) \in R$ and $p = p(n)$ is such that $n/p \to \delta \in (0, \infty)$, $\sigma^2(n)/n \to \sigma_0^2$ for some $\sigma_0 \in R$, and in addition the following conditions hold:
(a) The empirical distribution of $\{\theta_{0,i}(n)\}_{i=1}^p$ converges in distribution to a probability measure $p_{\theta_0}$ on $R$ with bounded second moment. Further, as $n \to \infty$, $p^{-1}\sum_{i=1}^p \theta_{0,i}(n)^2 \to E_{p_{\theta_0}}\{\Theta_0^2\}$.
(b) If $\{e_i\}_{1\le i\le p} \subset R^p$ denotes the standard basis, then $n^{-1/2} \max_{i\in[p]} \|X(n)e_i\|_2 \to 1$ and $n^{-1/2} \min_{i\in[p]} \|X(n)e_i\|_2 \to 1$, as $n \to \infty$, with $[p] \equiv \{1, \ldots, p\}$.
We provide rigorous results for the special class of converging sequences where the entries of $X$ are iid $N_1(0, 1)$ (i.e., the standard Gaussian design model). We also provide results (assuming Conjecture 4.4 is correct) when the rows of $X$ are iid multivariate normal $N_p(0, \Sigma)$ (i.e., the general Gaussian design model).
In order to discuss the LASSO connection for the AMP algorithm, we need to use a specific class of denoisers and apply an appropriate calibration to the state evolution. Here we describe briefly how this can be done, and we refer the reader to [BEM13] for a detailed discussion.
Denote by $\eta : R \times R^+ \to R$ the soft thresholding denoiser

$\eta(x; \xi) = \begin{cases} x - \xi & \text{if } x > \xi \\ 0 & \text{if } -\xi \le x \le \xi \\ x + \xi & \text{if } x < -\xi \end{cases}$

Also denote by $\eta'(\,\cdot\,;\,\cdot\,)$ the derivative of the soft-thresholding function with respect to its first argument. We will use the AMP algorithm with the soft-thresholding denoiser $\eta_t(\cdot) = \eta(\,\cdot\,; \xi_t)$, along with a suitable sequence of thresholds $\{\xi_t\}_{t\ge 0}$, in order to obtain a connection to the LASSO. Let $\alpha > 0$ be a constant and, at every iteration $t$, choose the threshold $\xi_t = \alpha\tau_t$.
It was shown in [DMM09] and [BM12b] that the state evolution has a unique fixed point $\tau_* = \lim_{t\to\infty} \tau_t$, and there exists a mapping $\alpha \mapsto \tau_*(\alpha)$ between those two parameters. Further, it was shown that the function $\alpha \mapsto \lambda(\alpha)$, with domain $(\alpha_{\min}(\delta), \infty)$ for some constant $\alpha_{\min}$ and given by

$\lambda(\alpha) \equiv \alpha\tau_* \Big(1 - \frac{1}{\delta}\, E\big[\eta'(\Theta_0 + \tau_* Z; \alpha\tau_*)\big]\Big)$,

admits a well-defined, continuous and non-decreasing inverse $\alpha : (0, \infty) \to (\alpha_{\min}, \infty)$. In particular, the functions $\lambda \mapsto \alpha(\lambda)$ and $\alpha \mapsto \tau_*(\alpha)$ provide a calibration between the AMP algorithm and the LASSO, where $\lambda$ is the regularization parameter.

3.2 Distributional Results for the LASSO

We proceed by stating a distributional result on the LASSO which was established in [BM12b].
Theorem 3.1. Let $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in N}$ be a converging sequence of instances of the standard Gaussian design model.
Denote the LASSO estimator of $\theta_0(n)$ by $\hat{\theta}(n, \lambda)$ and the unbiased pseudo-data generated by the LASSO by $\hat{\theta}^u(n, \lambda) \equiv \hat{\theta} + X^T(y - X\hat{\theta})/[n - \|\hat{\theta}\|_0]$.
Then, as $n \to \infty$, the empirical distribution of $\{\hat{\theta}^u_i, \theta_{0,i}\}_{i=1}^p$ converges weakly to the joint distribution of $(\Theta_0 + \tau_* Z, \Theta_0)$, where $\Theta_0 \sim p_{\theta_0}$, $\tau_* = \tau_*(\alpha(\lambda))$, $Z \sim N_1(0, 1)$, and $\Theta_0$ and $Z$ are independent random variables.
The above theorem combined with the stationarity condition of the LASSO implies that the empirical distribution of $\{\hat{\theta}_i, \theta_{0,i}\}_{i=1}^p$ converges weakly to the joint distribution of $(\eta(\Theta_0 + \tau_* Z; \xi_*), \Theta_0)$, where $\xi_* = \alpha(\lambda)\tau_*(\alpha(\lambda))$. It is also important to emphasize a relation between the asymptotic MSE, $\tau_*^2$, and the model variance. By Theorem 3.1 and the state evolution recursion, almost surely,

$\lim_{p\to\infty} \|\hat{\theta} - \theta_0\|_2^2/p = E\big[(\eta(\Theta_0 + \tau_* Z; \xi_*) - \Theta_0)^2\big] = \delta(\tau_*^2 - \sigma_0^2)$,   (3.3)

which will be helpful to get an estimator for the noise level.

3.3 Stein's Unbiased Risk Estimator

In [Ste81], Stein proposed a method to estimate the risk of an almost arbitrary estimator of the mean of a multivariate normal vector. A generalized form of his method can be stated as follows.
Proposition 3.2.
[Ste81, Joh12] Let $x, \mu \in R^n$ and $V \in R^{n\times n}$ be such that $x \sim N_n(\mu, V)$. Suppose that $\hat{\mu}(x) \in R^n$ is an estimator of $\mu$ for which $\hat{\mu}(x) = x + g(x)$, where $g : R^n \to R^n$ is weakly differentiable and, for all $i, j \in [n]$, $E_\nu[|x_i g_i(x)| + |x_j g_j(x)|] < \infty$, with $\nu$ the measure corresponding to the multivariate Gaussian distribution $N_n(\mu, V)$. Define the functional

$S(x, \hat{\mu}) \equiv \mathrm{Tr}(V) + 2\,\mathrm{Tr}(V\,Dg(x)) + \|g(x)\|_2^2$,

where $Dg$ is the vector derivative. Then $S(x, \hat{\mu})$ is an unbiased estimator of the risk, i.e. $E_\nu\|\hat{\mu}(x) - \mu\|_2^2 = E_\nu[S(x, \hat{\mu})]$.
In the statistics literature, the above estimator is called "Stein's Unbiased Risk Estimator" or SURE. The following remark will be helpful to build intuition about our approach.
Remark 1. If we consider the risk of the soft thresholding estimator $\eta(x_i; \xi)$ for $\mu_i$ when $x_i \sim N_1(\mu_i, \sigma^2)$ for $i \in [m]$, the above formula suggests the functional

$\frac{S(x, \eta(\,\cdot\,; \xi))}{m} = \sigma^2 - \frac{2\sigma^2}{m} \sum_{i=1}^m 1_{\{|x_i|\le\xi\}} + \frac{1}{m} \sum_{i=1}^m [\min\{|x_i|, \xi\}]^2$,

as an estimator of the corresponding MSE.

4 Main Results

4.1 Standard Gaussian Design Model

We start by defining two estimators that are motivated by Proposition 3.2.
Definition 2. Define $\hat{R}_\psi(x, \tau)$, where $x \in R^m$, $\tau \in R^+$, and $\psi : R \to R$ is a suitable non-linear function.
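The functional in Remark 1 can be checked numerically: averaged over many coordinates, it should match the realized squared error of soft thresholding. A minimal sketch (the two-point prior on the means is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, xi = 200_000, 1.0, 1.2
mu = rng.choice([0.0, 3.0], size=m, p=[0.9, 0.1])   # illustrative sparse-ish means
x = mu + sigma * rng.standard_normal(m)

# SURE functional for soft thresholding (Remark 1):
# sigma^2 - (2 sigma^2 / m) sum 1{|x_i| <= xi} + (1/m) sum min(|x_i|, xi)^2
sure = sigma**2 - 2 * sigma**2 * np.mean(np.abs(x) <= xi) \
    + np.mean(np.minimum(np.abs(x), xi) ** 2)

eta = np.sign(x) * np.maximum(np.abs(x) - xi, 0.0)  # soft thresholding estimate
actual = np.mean((eta - mu) ** 2)                   # realized per-coordinate error
```

Note that sure is computed from $x$ alone, while actual uses the hidden means $\mu$; unbiasedness makes the two agree up to $O(m^{-1/2})$ fluctuations.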
The first estimator is

$\hat{R}_\psi(x, \tau) \equiv -\tau^2 + 2\tau^2 \langle \psi'(x) \rangle + \langle (\psi(x) - x)^2 \rangle$.

Also, for $y \in R^n$ and $X \in R^{n\times p}$, denote by $\hat{R}(y, X, \lambda, \tau)$ the estimator of the mean squared error of the LASSO, where

$\hat{R}(y, X, \lambda, \tau) \equiv \frac{\tau^2}{p}\big(2\|\hat{\theta}\|_0 - p\big) + \frac{\|X^T(y - X\hat{\theta})\|_2^2}{p(n - \|\hat{\theta}\|_0)^2}$.

Remark 2. Note that $\hat{R}(y, X, \lambda, \tau)$ is just a special case of $\hat{R}_\psi(x, \tau)$ with $x = \hat{\theta}^u$ and $\psi(\cdot) = \eta(\,\cdot\,; \xi)$ for $\xi = \lambda/(1 - \|\hat{\theta}\|_0/p)$.
We are now ready to state the following theorem on the asymptotic MSE of the AMP:
Theorem 4.1. Let $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in N}$ be a converging sequence of instances of the standard Gaussian design model. Denote the sequence of estimators of $\theta_0(n)$ by $\{\theta^t(n)\}_{t\ge 0}$, the pseudo-data by $\{y^t(n)\}_{t\ge 0}$, and the residuals by $\{\epsilon^t(n)\}_{t\ge 0}$ produced by the AMP algorithm using the sequence of Lipschitz continuous functions $\{\eta_t\}_{t\ge 0}$ as in Eq. (3.1).
Then, as $n \to \infty$, the mean squared error of the AMP algorithm at iteration $t+1$ has the same limit as $\hat{R}_{\eta_t}(y^t, \hat{\tau}_t)$, where $\hat{\tau}_t = \|\epsilon^t\|_2/\sqrt{n}$. More precisely, with probability one,

$\lim_{n\to\infty} \|\theta^{t+1} - \theta_0\|_2^2/p(n) = \lim_{n\to\infty} \hat{R}_{\eta_t}(y^t, \hat{\tau}_t)$.   (4.1)

In other words, $\hat{R}_{\eta_t}(y^t, \hat{\tau}_t)$ is a consistent estimator of the asymptotic mean squared error of the AMP algorithm at iteration $t+1$.
The above theorem allows us to accurately predict how far the AMP estimate is from the true signal at iteration $t+1$, and this can be utilized as a stopping rule for the AMP algorithm.
Note that it was shown in [BM12b] that the left hand side of Eq. (4.1) equals $E[(\eta_t(\Theta_0 + \tau_t Z) - \Theta_0)^2]$. Combining this with the above theorem, we easily obtain

$\lim_{n\to\infty} \hat{R}_{\eta_t}(y^t, \hat{\tau}_t) = E[(\eta_t(\Theta_0 + \tau_t Z) - \Theta_0)^2]$.

We state the following version of Theorem 4.1 for the LASSO.
Theorem 4.2. Let $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in N}$ be a converging sequence of instances of the standard Gaussian design model. Denote the LASSO estimator of $\theta_0(n)$ by $\hat{\theta}(n, \lambda)$. Then, with probability one,

$\lim_{n\to\infty} \|\hat{\theta} - \theta_0\|_2^2/p(n) = \lim_{n\to\infty} \hat{R}(y, X, \lambda, \hat{\tau})$,

where $\hat{\tau} = \|y - X\hat{\theta}\|_2/[n - \|\hat{\theta}\|_0]$. In other words, $\hat{R}(y, X, \lambda, \hat{\tau})$ is a consistent estimator of the asymptotic mean squared error of the LASSO.
Note that Theorem 4.2 enables us to assess the quality of the LASSO estimation without knowing the true signal itself or the noise (or their distributions). The following corollary can be shown using the above theorem and Eq. (3.3).
Corollary 4.3. In the standard Gaussian design model, the variance of the noise can be accurately estimated by $\hat{\sigma}^2/n \equiv \hat{\tau}^2 - \hat{R}(y, X, \lambda, \hat{\tau})/\delta$, where $\delta = n/p$ and the other variables are defined as in Theorem 4.2. In other words, we have

$\lim_{n\to\infty} \hat{\sigma}^2/n = \sigma_0^2$,   (4.2)

almost surely, providing us a consistent estimator for the variance of the noise in the LASSO.
Remark 3. Theorems 4.1 and 4.2 provide a rigorous method for selecting the regularization parameter optimally. Also, note that obtaining the expression in Theorem 4.2 only requires solving one solution path of the LASSO problem, versus the $k$ solution paths required by $k$-fold cross-validation methods.
Additionally, using the exponential convergence of the AMP algorithm for the standard Gaussian design model, proved by [BM12b], one can use $O(\log(1/\epsilon))$ iterations of the AMP algorithm and Theorem 4.1 to obtain the solution path with an additional error of up to $O(\epsilon)$.

4.2 General Gaussian Design Model

In Section 4.1, we devised our estimators based on the standard Gaussian design model. Motivated by Theorem 4.2, we state the following conjecture of [JM13].
Let $\{\Omega(n)\}_{n\in N}$ be a sequence of inverse covariance matrices. Define the general Gaussian design model by the converging sequence of instances $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in N}$ where, for each $n$, the rows of the design matrix $X(n)$ are iid multivariate Gaussian, i.e. $N_p(0, \Omega(n)^{-1})$.
Conjecture 4.4 ([JM13]). Let $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in N}$ be a converging sequence of instances under the general Gaussian design model with a sequence of proper inverse covariance matrices $\{\Omega(n)\}_{n\in N}$. Assume that the empirical distribution of $\{(\theta_{0,i}, \Omega_{ii})\}_{i=1}^p$ converges weakly to the distribution of a random vector $(\Theta_0, \Upsilon)$. Denote the LASSO estimator of $\theta_0(n)$ by $\hat{\theta}(n, \lambda)$ and the LASSO pseudo-data by $\hat{\theta}^u(n, \lambda) \equiv \hat{\theta} + \Omega X^T(y - X\hat{\theta})/[n - \|\hat{\theta}\|_0]$. Then, for some $\tau \in R$, the empirical distribution of $\{\theta_{0,i}, \hat{\theta}^u_i, \Omega_{ii}\}$ converges weakly to the joint distribution of $(\Theta_0, \Theta_0 + \tau\Upsilon^{1/2} Z, \Upsilon)$, where $Z \sim N_1(0, 1)$ and $(\Theta_0, \Upsilon)$ are independent random variables. Further, the empirical distribution of $(y - X\hat{\theta})/[n - \|\hat{\theta}\|_0]$ converges weakly to $N(0, \tau^2)$.
A heuristic justification of this conjecture using the replica method from statistical physics is offered in [JM13].
Using the above conjecture, we define the following generalized estimator of the linearly transformed risk under the general Gaussian design model. The construction of the estimator is essentially the same as before, i.e. apply SURE to the unbiased pseudo-data.
Definition 3. For an inverse covariance matrix $\Omega$ and a suitable matrix $V \in R^{p\times p}$, let $W = V\Omega V^T$ and define an estimator of $\|V(\hat{\theta} - \theta)\|_2^2/p$ as

$\hat{\Gamma}_\Omega(y, X, \tau, \lambda, V) = \frac{\tau^2}{p}\Big(\mathrm{Tr}(W_{SS}) - \mathrm{Tr}(W_{\tilde{S}\tilde{S}}) - 2\,\mathrm{Tr}\big(W_{\tilde{S}S}\, \Omega_{S\tilde{S}}\, \Omega_{\tilde{S}\tilde{S}}^{-1}\big)\Big) + \frac{\|V\Omega X^T(y - X\hat{\theta})\|_2^2}{p(n - \|\hat{\theta}\|_0)^2}$,

where $y \in R^n$ and $X \in R^{n\times p}$ denote the linear observations and the design matrix, respectively. Further, $\hat{\theta}(n, \lambda)$ is the LASSO solution for penalty level $\lambda$ and $\tau$ is a real number. $S \subset [p]$ is the support of $\hat{\theta}$ and $\tilde{S}$ is $[p] \setminus S$. Finally, for a $p\times p$ matrix $M$ and subsets $D, E$ of $[p]$, the notation $M_{DE}$ refers to the $|D| \times |E|$ sub-matrix of $M$ obtained by intersecting the rows with indices in $D$ and the columns with indices in $E$.
Derivation of the above formula is rather involved and we refer the reader to [BEM13] for a detailed argument. A notable case, $V = I$, corresponds to the mean squared error of the LASSO for the general Gaussian design, and the estimator $\hat{R}(y, X, \lambda, \tau)$ is just a special case of the estimator $\hat{\Gamma}_\Omega(y, X, \tau, \lambda, V)$. That is, when $V = \Omega = I$, we have $\hat{\Gamma}_I(y, X, \tau, \lambda, I) = \hat{R}(y, X, \lambda, \tau)$.
Now, we state the following analog of Theorem 4.2.
Theorem 4.5.
Let $\{\theta_0(n), X(n), \sigma^2(n)\}_{n\in\mathbb{N}}$ be a converging sequence of instances of the general Gaussian design model with inverse covariance matrices $\{\Omega(n)\}_{n\in\mathbb{N}}$. Denote the LASSO estimator of $\theta_0(n)$ by $\widehat{\theta}(n,\lambda)$. If Conjecture 4.4 holds, then, with probability one,
\[
\lim_{n\to\infty} \|\widehat{\theta} - \theta_0\|_2^2/p(n) = \lim_{n\to\infty} \widehat{\Gamma}_\Omega(y, X, \widehat{\tau}, \lambda, I),
\]
where $\widehat{\tau} = \|y - X\widehat{\theta}\|_2/[n - \|\widehat{\theta}\|_0]$. In other words, $\widehat{\Gamma}_\Omega(y, X, \widehat{\tau}, \lambda, I)$ is a consistent estimator of the asymptotic MSE of the LASSO.

We will assume that a similar state evolution holds for the general design. In fact, for the general case, the replica method suggests the relation
\[
\lim_{n\to\infty} \|\Omega^{-\frac{1}{2}}(\widehat{\theta} - \theta_0)\|_2^2/p(n) = \delta(\tau^2 - \sigma_0^2).
\]
Rearranging gives $\sigma_0^2 = \tau^2 - \lim_{n\to\infty}\|\Omega^{-1/2}(\widehat{\theta} - \theta_0)\|_2^2/[\delta\, p(n)]$, and the norm on the right-hand side is exactly the quantity that $\widehat{\Gamma}_\Omega(y, X, \widehat{\tau}, \lambda, \Omega^{-1/2})$ estimates. Hence, motivated by Corollary 4.3, we state the following result for the general Gaussian design model.

Corollary 4.6. Assume that Conjecture 4.4 holds. In the general Gaussian design model, the variance of the noise can be accurately estimated by
\[
\widehat{\sigma}^2(n, \Omega)/n \equiv \widehat{\tau}^2 - \widehat{\Gamma}_\Omega(y, X, \widehat{\tau}, \lambda, \Omega^{-\frac{1}{2}})/\delta\,,
\]
where $\delta = n/p$ and the other variables are defined as in Theorem 4.5. Also, we have
\[
\lim_{n\to\infty} \widehat{\sigma}^2/n = \sigma_0^2,
\]
almost surely, providing us with a consistent estimator of the noise level in the LASSO.

Corollary 4.6 extends the results of Corollary 4.3 to general Gaussian design matrices. The derivation of the formulas in Theorem 4.5 and Corollary 4.6 follows arguments similar to those for the standard Gaussian design model.
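As a self-contained numerical illustration of Definition 3 and Corollary 4.6, the sketch below simulates the identity-design case ($\Omega = V = I$, so $\Omega^{-1/2} = I$), solves the LASSO with a plain ISTA loop, and combines $\widehat{\tau}$ with $\widehat{\Gamma}_\Omega$ into the noise estimate $\widehat{\tau}^2 - \widehat{\Gamma}_\Omega/\delta$. The problem sizes, penalty level, seed, and solver are illustrative assumptions, not choices made in the paper, and at these small sizes the estimate is only indicative of the asymptotic guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 400                        # illustrative sizes (assumption)
delta, sigma = n / p, 0.5              # sigma is the true noise sd (assumption)

# Identity-design instance: rows of X are iid N_p(0, I), theta0 is sparse.
theta0 = np.zeros(p)
theta0[:20] = rng.normal(size=20)
X = rng.normal(size=(n, p))
y = X @ theta0 + sigma * rng.normal(size=n)

def lasso_ista(X, y, lam, iters=500):
    """Plain ISTA for min_theta ||y - X theta||_2^2 / (2n) + lam * ||theta||_1."""
    n = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = theta - X.T @ (X @ theta - y) / (n * L)
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return theta

def gamma_hat(y, X, theta_hat, tau, Omega, V):
    """The estimator of Definition 3 for ||V (theta_hat - theta0)||_2^2 / p."""
    n, p = X.shape
    S = theta_hat != 0                 # support of the LASSO solution
    Sc = ~S                            # its complement, tilde-S
    df = int(S.sum())                  # ||theta_hat||_0
    W = V @ Omega @ V.T
    resid = np.linalg.norm(V @ Omega @ X.T @ (y - X @ theta_hat)) ** 2 \
        / (p * (n - df) ** 2)
    traces = (np.trace(W[np.ix_(S, S)]) - np.trace(W[np.ix_(Sc, Sc)])
              - 2 * np.trace(W[np.ix_(Sc, S)] @ Omega[np.ix_(S, Sc)]
                             @ np.linalg.inv(Omega[np.ix_(Sc, Sc)])))
    return resid + tau ** 2 * traces / p

theta_hat = lasso_ista(X, y, lam=0.1)
df = int(np.count_nonzero(theta_hat))

# tau_hat as in Theorem 4.5, then the noise estimate of Corollary 4.6
# (Omega = I, so the transformation V = Omega^{-1/2} is the identity).
tau_hat = np.linalg.norm(y - X @ theta_hat) / (n - df)
I = np.eye(p)
sigma2_over_n = tau_hat ** 2 - gamma_hat(y, X, theta_hat, tau_hat, I, I) / delta
```

With $\Omega = V = I$, the trace term in Definition 3 collapses to $2\|\widehat{\theta}\|_0 - p$ because $W = I$ makes the cross-trace vanish; the general path is exercised by passing non-identity `Omega` and `V` matrices.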
In particular, they are obtained by applying SURE to the distributional result of Conjecture 4.4 and using the stationarity condition of the LASSO. Details of this derivation can be found in [BEM13].

References

[BC13] A. Belloni and V. Chernozhukov, Least squares after model selection in high-dimensional sparse models, Bernoulli (2013).
[BdG11] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data, Springer-Verlag Berlin Heidelberg, 2011.
[BEM13] M. Bayati, M. A. Erdogdu, and A. Montanari, Estimating LASSO risk and noise level, long version (in preparation), 2013.
[BM12a] M. Bayati and A. Montanari, The dynamics of message passing on dense graphs, with applications to compressed sensing, IEEE Trans. on Inform. Theory 57 (2012), 764–785.
[BM12b] M. Bayati and A. Montanari, The LASSO risk for Gaussian matrices, IEEE Trans. on Inform. Theory 58 (2012).
[BRT09] P. Bickel, Y. Ritov, and A. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, The Annals of Statistics 37 (2009), 1705–1732.
[BS05] Z. Bai and J. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, Springer, 2005.
[BT09] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sciences 2 (2009), 183–202.
[BY93] Z. D. Bai and Y. Q. Yin, Limit of the smallest eigenvalue of a large dimensional sample covariance matrix, The Annals of Probability 21 (1993), 1275–1294.
[CD95] S. S. Chen and D. L. Donoho, Examples of basis pursuit, Proceedings of Wavelet Applications in Signal and Image Processing III (San Diego, CA), 1995.
[CRT06] E. Candès, J. K. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics 59 (2006), 1207–1223.
[CT07] E. Candès and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Annals of Statistics 35 (2007), 2313–2351.
[DMM09] D. L. Donoho, A. Maleki, and A. Montanari, Message passing algorithms for compressed sensing, Proceedings of the National Academy of Sciences 106 (2009), 18914–18919.
[DMM11] D. L. Donoho, A. Maleki, and A. Montanari, The noise-sensitivity phase transition in compressed sensing, IEEE Transactions on Information Theory 57 (2011), no. 10, 6920–6941.
[FGH12] J. Fan, S. Guo, and N. Hao, Variance estimation using refitted cross-validation in ultrahigh dimensional regression, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2012), 37–65.
[JM13] A. Javanmard and A. Montanari, Hypothesis testing in high-dimensional regression under the Gaussian random design model: asymptotic theory, preprint, arXiv:1301.4240, 2013.
[Joh12] I. Johnstone, Gaussian Estimation: Sequence and Wavelet Models, book draft, 2012.
[MB06] N. Meinshausen and P. Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics 34 (2006), no. 3, 1436–1462.
[NSvdG10] N. Städler, P. Bühlmann, and S. van de Geer, ℓ1-penalization for mixture regression models (with discussion), Test 19 (2010), 209–285.
[RFG09] S. Rangan, A. K. Fletcher, and V. K. Goyal, Asymptotic analysis of MAP estimation via the replica method and applications to compressed sensing, 2009.
[SJKG07] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for large-scale ℓ1-regularized least squares, IEEE Journal of Selected Topics in Signal Processing 1 (2007), no. 4, 606–617.
[Ste81] C. Stein, Estimation of the mean of a multivariate normal distribution, The Annals of Statistics 9 (1981), 1135–1151.
[SZ12] T. Sun and C. H. Zhang, Scaled sparse linear regression, Biometrika (2012), 1–20.
[Tib96] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Royal Statist. Soc. B 58 (1996), 267–288.
[Wai09] M. J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming, IEEE Transactions on Information Theory 55 (2009), no. 5, 2183–2202.
[ZY06] P. Zhao and B. Yu, On model selection consistency of Lasso, The Journal of Machine Learning Research 7 (2006), 2541–2563.