{"title": "A Residual Bootstrap for High-Dimensional Regression with Near Low-Rank Designs", "book": "Advances in Neural Information Processing Systems", "page_first": 3239, "page_last": 3247, "abstract": "We study the residual bootstrap (RB) method in the context of high-dimensional linear regression. Specifically, we analyze the distributional approximation of linear contrasts $c^{\\top}(\\hat{\\beta}_{\\rho}-\\beta)$, where $\\hat{\\beta}_{\\rho}$ is a ridge-regression estimator. When regression coefficients are estimated via least squares, classical results show that RB consistently approximates the laws of contrasts, provided that $p\\ll n$, where the design matrix is of size $n\\times p$. Up to now, relatively little work has considered how additional structure in the linear model may extend the validity of RB to the setting where $p/n\\asymp 1$. In this setting, we propose a version of RB that resamples residuals obtained from ridge regression. Our main structural assumption on the design matrix is that it is nearly low rank --- in the sense that its singular values decay according to a power-law profile. Under a few extra technical assumptions, we derive a simple criterion for ensuring that RB consistently approximates the law of a given contrast. We then specialize this result to study confidence intervals for mean response values $X_i^{\\top} \\beta$, where $X_i^{\\top}$ is the $i$th row of the design. More precisely, we show that conditionally on a Gaussian design with near low-rank structure, RB \\emph{simultaneously} approximates all of the laws $X_i^{\\top}(\\hat{\\beta}_{\\rho}-\\beta)$, $i=1,\\dots,n$. This result is also notable as it imposes no sparsity assumptions on $\\beta$. Furthermore, since our consistency results are formulated in terms of the Mallows (Kantorovich) metric, the existence of a limiting distribution is not required.", "full_text": "A Residual Bootstrap for High-Dimensional\nRegression with Near Low-Rank Designs\n\nMiles E. 
Lopes
Department of Statistics
University of California, Berkeley
Berkeley, CA 94720
mlopes@stat.berkeley.edu

Abstract

We study the residual bootstrap (RB) method in the context of high-dimensional linear regression. Specifically, we analyze the distributional approximation of linear contrasts $c^\top(\hat\beta_\rho - \beta)$, where $\hat\beta_\rho$ is a ridge-regression estimator. When regression coefficients are estimated via least squares, classical results show that RB consistently approximates the laws of contrasts, provided that $p \ll n$, where the design matrix is of size $n \times p$. Up to now, relatively little work has considered how additional structure in the linear model may extend the validity of RB to the setting where $p/n \asymp 1$. In this setting, we propose a version of RB that resamples residuals obtained from ridge regression. Our main structural assumption on the design matrix is that it is nearly low rank, in the sense that its singular values decay according to a power-law profile. Under a few extra technical assumptions, we derive a simple criterion for ensuring that RB consistently approximates the law of a given contrast. We then specialize this result to study confidence intervals for mean response values $X_i^\top\beta$, where $X_i^\top$ is the $i$th row of the design. More precisely, we show that conditionally on a Gaussian design with near low-rank structure, RB simultaneously approximates all of the laws $X_i^\top(\hat\beta_\rho - \beta)$, $i = 1, \dots, n$. This result is also notable as it imposes no sparsity assumptions on $\beta$. Furthermore, since our consistency results are formulated in terms of the Mallows (Kantorovich) metric, the existence of a limiting distribution is not required.

1 Introduction

Until recently, much of the emphasis in the theory of high-dimensional statistics has been on "first order" problems, such as estimation and prediction. As the understanding of these problems has become more complete, attention has begun to shift increasingly toward "second order" problems dealing with hypothesis tests, confidence intervals, and uncertainty quantification [1-6]. In this direction, much less is understood about the effects of structure, regularization, and dimensionality, leaving many questions open. One collection of such questions that has attracted growing interest deals with the operating characteristics of the bootstrap in high dimensions [7-9]. Because the bootstrap is among the most widely used tools for approximating the sampling distributions of test statistics and estimators, there is much practical value in understanding what factors allow the bootstrap to succeed in the high-dimensional regime.

The regression model and linear contrasts. In this paper, we focus our attention on high-dimensional linear regression, and our aim is to know when the residual bootstrap (RB) method consistently approximates the laws of linear contrasts. (A review of RB is given in Section 2.)

To specify the model, suppose that we observe a response vector $Y \in \mathbb{R}^n$, generated according to
$$Y = X\beta + \varepsilon, \qquad (1)$$
where $X \in \mathbb{R}^{n \times p}$ is the observed design matrix, $\beta \in \mathbb{R}^p$ is an unknown vector of coefficients, and the error variables $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ are drawn i.i.d. from an unknown distribution $F_0$, with mean 0 and unknown variance $\sigma^2 < \infty$.
As is conventional in high-dimensional statistics, we assume the model (1) is embedded in a sequence of models indexed by $n$. Hence, we allow $X$, $\beta$, and $p$ to vary implicitly with $n$. We leave $p/n$ unconstrained until Section 3.3, where we assume $p/n \asymp 1$ in Theorem 3, and then in Section 3.4 we assume further that $p/n$ is bounded strictly between 0 and 1. The distribution $F_0$ is fixed with respect to $n$, and none of our results require $F_0$ to have more than four moments.

Although we are primarily interested in cases where the design matrix $X$ is deterministic, we also study the performance of the bootstrap conditionally on a Gaussian design. For this reason, we use the symbol $\mathbb{E}[\cdots|X]$ even when the design is non-random, so that confusion does not arise in relating different sections of the paper. Likewise, the symbol $\mathbb{E}[\cdots]$ refers to unconditional expectation over all sources of randomness. Whenever the design is random, we assume $X \perp\!\!\!\perp \varepsilon$, denoting the distribution of $X$ by $P_X$ and the distribution of $\varepsilon$ by $P_\varepsilon$.

Within the context of the regression model, we focus on linear contrasts $c^\top(\hat\beta - \beta)$, where $c \in \mathbb{R}^p$ is a fixed vector and $\hat\beta \in \mathbb{R}^p$ is an estimate of $\beta$. The importance of contrasts arises from the fact that they unify many questions about a linear model. For instance, testing the significance of the $i$th coefficient $\beta_i$ may be addressed by choosing $c$ to be the standard basis vector, $c^\top = e_i^\top$. Another important problem is quantifying the uncertainty of point predictions, which may be addressed by choosing $c^\top = X_i^\top$, i.e. the $i$th row of the design matrix. In this case, an approximation to the law of the contrast leads to a confidence interval for the mean response value $\mathbb{E}[Y_i] = X_i^\top\beta$.
Further applications of contrasts occur in the broad topic of ANOVA [10].

Intuition for structure and regularization in RB. The following two paragraphs explain the core conceptual aspects of the paper. To understand the role of regularization in applying RB to high-dimensional regression, it is helpful to think of RB in terms of two ideas. First, if $\hat\beta_{LS}$ denotes the ordinary least squares estimator, then it is a simple but important fact that contrasts can be written as $c^\top(\hat\beta_{LS} - \beta) = a^\top\varepsilon$, where $a^\top := c^\top(X^\top X)^{-1}X^\top$. Hence, if it were possible to sample directly from $F_0$, then the law of any such contrast could be easily determined. Since $F_0$ is unknown, the second key idea is to use the residuals of some estimator $\hat\beta$ as a proxy for samples from $F_0$. When $p \ll n$, the least-squares residuals are a good proxy [11, 12]. However, it is well known that least squares tends to overfit when $p/n \asymp 1$. When $\hat\beta_{LS}$ fits "too well", its residuals are "too small", and hence they give a poor proxy for $F_0$. Therefore, by using a regularized estimator $\hat\beta$, overfitting can be avoided, and the residuals of $\hat\beta$ may offer a better way of obtaining "approximate samples" from $F_0$.

The form of regularized regression we focus on is ridge regression:
$$\hat\beta_\rho := (X^\top X + \rho I_{p \times p})^{-1}X^\top Y, \qquad (2)$$
where $\rho > 0$ is a user-specified regularization parameter. As will be seen in Sections 3.2 and 3.3, the residuals obtained from ridge regression lead to a particularly good approximation of $F_0$ when the design matrix $X$ is nearly low rank, in the sense that most of its singular values are close to 0. In essence, this condition is a form of sparsity, since it implies that the rows of $X$ nearly lie in a low-dimensional subspace of $\mathbb{R}^p$.
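Concretely, the estimator in (2) has a closed form that can be computed by solving the regularized normal equations. The following is a minimal sketch (the function name and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def ridge_estimator(X, Y, rho):
    """Ridge estimator from (2): beta_hat_rho = (X^T X + rho * I)^{-1} X^T Y."""
    p = X.shape[1]
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + rho * np.eye(p), X.T @ Y)
```

As $\rho \to \infty$ the estimate shrinks toward zero, and when $X^\top X$ is invertible the limit $\rho \to 0$ recovers least squares.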
However, this type of structural condition has a significant advantage over the more well-studied assumption that $\beta$ is sparse. Namely, the assumption that $X$ is nearly low rank can be inspected directly in practice, whereas sparsity in $\beta$ is typically unverifiable. In fact, our results impose no conditions on $\beta$ other than that $\|\beta\|_2$ remains bounded as $(n, p) \to \infty$. Finally, it is worth noting that near low-rank design matrices occur very commonly in applications, where the phenomenon is often referred to as collinearity [13, ch. 17].

Contributions and outline. The primary contribution of this paper is a complement to the work of Bickel and Freedman [12] (hereafter B&F 1983), who showed that in general, the RB method fails to approximate the laws of least-squares contrasts $c^\top(\hat\beta_{LS} - \beta)$ when $p/n \asymp 1$. Instead, we develop an alternative set of results, proving that even when $p/n \asymp 1$, RB can successfully approximate the laws of "ridged contrasts" $c^\top(\hat\beta_\rho - \beta)$ for many choices of $c \in \mathbb{R}^p$, provided that the design matrix $X$ is nearly low rank. A particularly interesting consequence of our work is that RB successfully approximates the law $c^\top(\hat\beta_\rho - \beta)$ for a certain choice of $c$ that was shown in B&F 1983 to "break" RB when applied to least squares. Specifically, such a $c$ can be chosen as one of the rows of $X$ with a high leverage score (see Section 4). This example corresponds to the practical problem of setting confidence intervals for mean response values $\mathbb{E}[Y_i] = X_i^\top\beta$. (See [12, p. 41], as well as Lemma 2 and Theorem 4 in Section 3.4.) Lastly, from a technical point of view, a third notable aspect of our results is that they are formulated in terms of the Mallows-$\ell_2$ metric, which frees us from having to impose conditions that force a limiting distribution to exist.

Apart from B&F 1983, the most closely related works we are aware of are the recent papers [7] and [8], which also consider RB in the high-dimensional setting. However, those works focus on the role of sparsity in $\beta$ and do not make use of low-rank structure in the design, whereas our work deals only with structure in the design and imposes no sparsity assumptions on $\beta$.

The remainder of the paper is organized as follows. In Section 2, we formulate the problem of approximating the laws of contrasts, and describe our proposed methodology for RB based on ridge regression. Then, in Section 3, we state several results that lay the groundwork for Theorem 4, which shows that RB can successfully approximate all of the laws $\mathcal{L}(X_i^\top(\hat\beta_\rho - \beta)\,|\,X)$, $i = 1, \dots, n$, conditionally on a Gaussian design. Due to space constraints, all proofs are deferred to material that will appear in a separate work.

Notation and conventions. If $U$ and $V$ are random variables, then $\mathcal{L}(U|V)$ denotes the law of $U$, conditionally on $V$. If $a_n$ and $b_n$ are two sequences of real numbers, then the notation $a_n \lesssim b_n$ means that there is an absolute constant $\kappa_0 > 0$ and an integer $n_0 \geq 1$ such that $a_n \leq \kappa_0 b_n$ for all $n \geq n_0$. The notation $a_n \asymp b_n$ means that $a_n \lesssim b_n$ and $b_n \lesssim a_n$. For a square matrix $A \in \mathbb{R}^{k \times k}$ whose eigenvalues are real, we denote them by $\lambda_{\min}(A) = \lambda_k(A) \leq \cdots \leq \lambda_1(A) = \lambda_{\max}(A)$.

2 Problem setup and methodology

Problem setup.
For any $c \in \mathbb{R}^p$, it is clear that conditionally on $X$, the law of $c^\top(\hat\beta_\rho - \beta)$ is completely determined by $F_0$, and hence it makes sense to use the notation
$$\Psi_\rho(F_0; c) := \mathcal{L}\big(c^\top(\hat\beta_\rho - \beta)\,\big|\,X\big). \qquad (3)$$
The problem we aim to solve is to approximate the distribution $\Psi_\rho(F_0; c)$ for suitable choices of $c$.

Review of the residual bootstrap (RB) procedure. We briefly explain the steps involved in the residual bootstrap procedure, applied to the ridge estimator $\hat\beta_\rho$ of $\beta$. To proceed somewhat indirectly, consider the following "bias-variance" decomposition of $\Psi_\rho(F_0; c)$, conditionally on $X$:
$$\Psi_\rho(F_0; c) = \underbrace{\mathcal{L}\big(c^\top(\hat\beta_\rho - \mathbb{E}[\hat\beta_\rho|X])\,\big|\,X\big)}_{=:\ \Phi_\rho(F_0;c)} + \underbrace{c^\top\big(\mathbb{E}[\hat\beta_\rho|X] - \beta\big)}_{=:\ \mathrm{bias}(\Phi_\rho(F_0;c))}. \qquad (4)$$
Note that the distribution $\Phi_\rho(F_0; c)$ has mean zero, so the second term on the right side is the bias of $\Phi_\rho(F_0; c)$ as an estimator of $\Psi_\rho(F_0; c)$. Furthermore, the distribution $\Phi_\rho(F_0; c)$ may be viewed as the "variance component" of $\Psi_\rho(F_0; c)$. We will be interested in situations where the regularization parameter $\rho$ may be chosen small enough that the bias component is small. In this case, one has $\Psi_\rho(F_0; c) \approx \Phi_\rho(F_0; c)$, and it is then enough to find an approximation to the law $\Phi_\rho(F_0; c)$, which is unknown. To this end, a simple manipulation of $c^\top(\hat\beta_\rho - \mathbb{E}[\hat\beta_\rho])$ leads to
$$\Phi_\rho(F_0; c) = \mathcal{L}\big(c^\top(X^\top X + \rho I_{p \times p})^{-1}X^\top\varepsilon\,\big|\,X\big). \qquad (5)$$
Now, to approximate $\Phi_\rho(F_0; c)$, let $\hat F$ be any centered estimate of $F_0$. (Typically, $\hat F$ is obtained by using the centered residuals of some estimator of $\beta$, but this is not necessary in general.) Also, let $\varepsilon^* = (\varepsilon_1^*, \dots, \varepsilon_n^*) \in \mathbb{R}^n$ be an i.i.d. sample from $\hat F$. Then, replacing $\varepsilon$ with $\varepsilon^*$ in line (5) yields
$$\Phi_\rho(\hat F; c) = \mathcal{L}\big(c^\top(X^\top X + \rho I_{p \times p})^{-1}X^\top\varepsilon^*\,\big|\,X\big). \qquad (6)$$
At this point, we define the (random) measure $\Phi_\rho(\hat F; c)$ to be the RB approximation to $\Phi_\rho(F_0; c)$. Hence, it is clear that the RB approximation is simply a "plug-in rule".

A two-stage approach. An important feature of the procedure just described is that we are free to use any centered estimator $\hat F$ of $F_0$. This fact offers substantial flexibility in approximating $\Psi_\rho(F_0; c)$. One way of exploiting this flexibility is to consider a two-stage approach, where a "pilot" ridge estimator $\hat\beta_\varrho$ is used to first compute residuals whose centered empirical distribution function is $\hat F_\varrho$, say. Then, in the second stage, the distribution $\hat F_\varrho$ is used to approximate $\Phi_\rho(F_0; c)$ via the relation (6). In more detail, if $(\hat e_1(\varrho), \dots, \hat e_n(\varrho)) = \hat e(\varrho) := Y - X\hat\beta_\varrho$ are the residuals of $\hat\beta_\varrho$, then we define $\hat F_\varrho$ to be the distribution that places mass $1/n$ at each of the values $\hat e_i(\varrho) - \bar e(\varrho)$, with $\bar e(\varrho) := \frac{1}{n}\sum_{i=1}^n \hat e_i(\varrho)$. Here, it is important to note that the value $\varrho$ is chosen to optimize $\hat F_\varrho$ as an approximation to $F_0$. By contrast, the choice of $\rho$ depends on the relative importance of width and coverage probability for confidence intervals based on $\Phi_\rho(\hat F_\varrho; c)$.
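This two-stage plug-in rule can be sketched in a few lines of NumPy. The function name is ours; the sampling step implements the relation (6), with $\hat F$ taken to be the centered empirical distribution of the pilot ridge residuals:

```python
import numpy as np

def rb_contrast_draws(X, Y, c, rho, pilot_reg, B=1000, seed=None):
    """Residual bootstrap draws approximating the plug-in law in (6).

    pilot_reg is the pilot ridge parameter used to form residuals; rho is the
    regularization level of the contrast under study.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Stage 1: pilot ridge fit and its residuals e = Y - X beta_hat_pilot.
    beta_pilot = np.linalg.solve(X.T @ X + pilot_reg * np.eye(p), X.T @ Y)
    resid = Y - X @ beta_pilot
    resid = resid - resid.mean()   # F_hat puts mass 1/n at each centered residual
    # a^T := c^T (X^T X + rho I)^{-1} X^T, so each bootstrap draw is z_j = a^T eps*.
    a = np.linalg.solve(X.T @ X + rho * np.eye(p), c) @ X.T
    # Stage 2: resample residuals i.i.d. from F_hat and form the draws.
    eps_star = rng.choice(resid, size=(B, n), replace=True)
    return eps_star @ a
```

The empirical distribution of the returned draws converges, as $B \to \infty$, to the plug-in law in (6); its quantiles yield bootstrap confidence intervals for the contrast.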
Theorems 1, 3, and 4 will offer some guidance in selecting $\varrho$ and $\rho$.

Resampling algorithm. To summarize the discussion above, if $B$ is a user-specified number of bootstrap replicates, our proposed method for approximating $\Psi_\rho(F_0; c)$ is given below.

1. Select $\rho$ and $\varrho$, and compute the residuals $\hat e(\varrho) = Y - X\hat\beta_\varrho$.
2. Compute the centered distribution function $\hat F_\varrho$, putting mass $1/n$ at each $\hat e_i(\varrho) - \bar e(\varrho)$.
3. For $j = 1, \dots, B$:
   - Draw a vector $\varepsilon^* \in \mathbb{R}^n$ of $n$ i.i.d. samples from $\hat F_\varrho$.
   - Compute $z_j := c^\top(X^\top X + \rho I_{p \times p})^{-1}X^\top\varepsilon^*$.
4. Return the empirical distribution of $z_1, \dots, z_B$.

Clearly, as $B \to \infty$, the empirical distribution of $z_1, \dots, z_B$ converges weakly to $\Phi_\rho(\hat F_\varrho; c)$, with probability 1. As is conventional, our theoretical analysis in the next section ignores Monte Carlo issues, and addresses only the performance of $\Phi_\rho(\hat F_\varrho; c)$ as an approximation to $\Psi_\rho(F_0; c)$.

3 Main results

The following metric will be central to our theoretical results, and has been a standard tool in the analysis of the bootstrap, beginning with the work of Bickel and Freedman [14].

The Mallows (Kantorovich) metric. For two random vectors $U$ and $V$ in a Euclidean space, the Mallows-$\ell_2$ metric is defined by
$$d_2^2(\mathcal{L}(U), \mathcal{L}(V)) := \inf_{\pi \in \Pi}\Big\{\mathbb{E}\big[\|U - V\|_2^2\big] : (U, V) \sim \pi\Big\}, \qquad (7)$$
where the infimum is over the class $\Pi$ of joint distributions $\pi$ whose marginals are $\mathcal{L}(U)$ and $\mathcal{L}(V)$. It is worth noting that convergence in $d_2$ is strictly stronger than weak convergence, since it also requires convergence of second moments. Additional details may be found in the paper [14].

3.1 A bias-variance decomposition for bootstrap approximation

To give some notation for analyzing the bias-variance decomposition of $\Psi_\rho(F_0; c)$ in line (4), we define the following quantities based upon the ridge estimator $\hat\beta_\rho$. Namely, the variance is
$$v_\rho = v_\rho(X; c) := \mathrm{var}(\Psi_\rho(F_0; c)\,|\,X) = \sigma^2\|c^\top(X^\top X + \rho I_{p \times p})^{-1}X^\top\|_2^2.$$
To express the bias of $\Phi_\rho(F_0; c)$, we define the vector $\delta(X) \in \mathbb{R}^p$ according to
$$\delta(X) := \beta - \mathbb{E}[\hat\beta_\rho] = \big[I_{p \times p} - (X^\top X + \rho I_{p \times p})^{-1}X^\top X\big]\beta, \qquad (8)$$
and then put
$$b_\rho^2 = b_\rho^2(X; c) := \mathrm{bias}^2(\Phi_\rho(F_0; c)) = (c^\top\delta(X))^2. \qquad (9)$$
We will sometimes omit the arguments of $v_\rho$ and $b_\rho^2$ to lighten notation. Note that $v_\rho(X; c)$ does not depend on $\beta$, and $b_\rho^2(X; c)$ only depends on $\beta$ through $\delta(X)$.

The following result gives a regularized and high-dimensional extension of some lemmas in Freedman's early work [11] on RB for least squares. The result does not require any structural conditions on the design matrix, or on the true parameter $\beta$.

Theorem 1 (consistency criterion). Suppose $X \in \mathbb{R}^{n \times p}$ is fixed. Let $\hat F$ be any estimator of $F_0$, and let $c \in \mathbb{R}^p$ be any vector such that $v_\rho = v_\rho(X; c) \neq 0$. Then with $P_\varepsilon$-probability 1, the following inequality holds for every $n \geq 1$ and every $\rho > 0$:
$$d_2^2\Big(\tfrac{1}{\sqrt{v_\rho}}\Psi_\rho(F_0; c),\ \tfrac{1}{\sqrt{v_\rho}}\Phi_\rho(\hat F; c)\Big) \leq \tfrac{1}{\sigma^2}\,d_2^2(F_0, \hat F) + \tfrac{b_\rho^2}{v_\rho}. \qquad (10)$$

Remarks.
Observe that the normalization $1/\sqrt{v_\rho}$ ensures that the bound is non-trivial, since the distribution $\Psi_\rho(F_0; c)/\sqrt{v_\rho}$ has variance equal to 1 for all $n$ (and hence does not become degenerate for large $n$). To consider the choice of $\rho$, it is simple to verify that the ratio $b_\rho^2/v_\rho$ decreases monotonically as $\rho$ decreases. Note also that as $\rho$ becomes small, the variance $v_\rho$ becomes large, and likewise, confidence intervals based on $\Phi_\rho(\hat F; c)$ become wider. In other words, there is a trade-off between the width of the confidence interval and the size of the bound (10).

Sufficient conditions for consistency of RB. An important practical aspect of Theorem 1 is that for any given contrast $c$, the variance $v_\rho(X; c)$ can be easily estimated, since it only requires an estimate of $\sigma^2$, which can be obtained from $\hat F$. Consequently, whenever theoretical bounds on $d_2^2(F_0, \hat F)$ and $b_\rho^2(X; c)$ are available, the right side of line (10) can be controlled. In this way, Theorem 1 offers a simple route for guaranteeing that RB is consistent. In Sections 3.2 and 3.3 to follow, we derive a bound on $\mathbb{E}[d_2^2(F_0, \hat F)|X]$ in the case where $\hat F$ is chosen to be $\hat F_\varrho$. Later on, in Section 3.4, we study RB consistency in the context of prediction with a Gaussian design, and there we derive high-probability bounds on both $v_\rho(X; c)$ and $b_\rho^2(X; c)$, where $c$ is a particular row of $X$.

3.2 A link between bootstrap consistency and MSPE

If $\hat\beta$ is an estimator of $\beta$, its mean-squared prediction error (MSPE), conditionally on $X$, is defined as
$$\mathrm{mspe}(\hat\beta\,|\,X) := \tfrac{1}{n}\,\mathbb{E}\big[\|X(\hat\beta - \beta)\|_2^2\,\big|\,X\big]. \qquad (11)$$
The previous subsection showed that in-law approximation of contrasts is closely tied to the approximation of $F_0$.
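In simulations where $\beta$ is known, the quantities $v_\rho$ and $b_\rho^2$ appearing in Theorem 1 can be evaluated exactly from the formulas in (8) and (9). A minimal sketch (the function name is ours):

```python
import numpy as np

def ridge_bias_variance(X, c, beta, sigma2, rho):
    """Return (v_rho, b2_rho): the variance sigma^2 ||c^T (X^T X + rho I)^{-1} X^T||_2^2
    and the squared bias (c^T delta(X))^2, with delta(X) as in (8)-(9)."""
    p = X.shape[1]
    G = X.T @ X + rho * np.eye(p)
    a = np.linalg.solve(G, c) @ X.T                       # a^T = c^T G^{-1} X^T
    v_rho = sigma2 * np.sum(a ** 2)
    delta = beta - np.linalg.solve(G, X.T @ (X @ beta))   # delta(X) = beta - E[beta_hat_rho]
    b2_rho = float(c @ delta) ** 2
    return v_rho, b2_rho
```

Consistent with the remarks after Theorem 1, shrinking $\rho$ inflates $v_\rho$ while driving the ratio $b_\rho^2/v_\rho$ down.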
We now take a second step, showing that if the centered residuals of an estimator $\hat\beta$ are used to approximate $F_0$, then the quality of this approximation can be bounded naturally in terms of $\mathrm{mspe}(\hat\beta\,|\,X)$. This result applies to any estimator $\hat\beta$ computed from the observations (1).

Theorem 2. Suppose $X \in \mathbb{R}^{n \times p}$ is fixed. Let $\hat\beta$ be any estimator of $\beta$, and let $\hat F$ be the empirical distribution of the centered residuals of $\hat\beta$. Also, let $F_n$ denote the empirical distribution of $n$ i.i.d. samples from $F_0$. Then for every $n \geq 1$,
$$\mathbb{E}\big[d_2^2(\hat F, F_0)\,\big|\,X\big] \leq 2\,\mathrm{mspe}(\hat\beta\,|\,X) + 2\,\mathbb{E}[d_2^2(F_n, F_0)] + \tfrac{2\sigma^2}{n}. \qquad (12)$$

Remarks. As we will see in the next section, the MSPE of ridge regression can be bounded in a sharp way when the design matrix is approximately low rank, and there we will analyze $\mathrm{mspe}(\hat\beta_\varrho\,|\,X)$ for the pilot estimator. Consequently, when near low-rank structure is available, the only remaining issue in controlling the right side of line (12) is to bound the quantity $\mathbb{E}[d_2^2(F_n, F_0)|X]$. The very recent work of Bobkov and Ledoux [15] provides an in-depth study of this question, and they derive a variety of bounds under different tail conditions on $F_0$. We summarize one of their results below.

Lemma 1 (Bobkov and Ledoux, 2014). If $F_0$ has a finite fourth moment, then
$$\mathbb{E}[d_2^2(F_n, F_0)] \lesssim \log(n)\,n^{-1/2}. \qquad (13)$$

Remarks. The fact that the squared distance is bounded at the rate $\log(n)\,n^{-1/2}$ is an indication that $d_2$ is a rather strong metric on distributions. For a detailed discussion of this result, see Corollaries 7.17 and 7.18 in the paper [15].
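For univariate distributions, the infimum in (7) is attained by the quantile coupling, so the Mallows-$\ell_2$ distance between two $n$-point empirical distributions reduces to a root-mean-square difference of order statistics. A short sketch (our own helper, not from the paper):

```python
import numpy as np

def mallows_l2(u, v):
    """Mallows-l2 (2-Wasserstein) distance between the empirical distributions
    of two univariate samples of equal size: the optimal coupling in (7)
    pairs sorted values."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    assert u.shape == v.shape, "equal sample sizes assumed in this sketch"
    return float(np.sqrt(np.mean((u - v) ** 2)))
```

A location shift by $t$ changes the distance by exactly $|t|$, reflecting that $d_2$ tracks second moments rather than weak convergence alone.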
Although it is possible to obtain faster rates when more stringent tail conditions are placed on $F_0$, we will only need a fourth moment, since the $\mathrm{mspe}(\hat\beta\,|\,X)$ term in Theorem 2 will often have a slower rate than $\log(n)\,n^{-1/2}$, as discussed in the next section.

3.3 Consistency of ridge regression in MSPE for near low-rank designs

In this subsection, we show that when the tuning parameter $\varrho$ is set at a suitable rate, the pilot ridge estimator $\hat\beta_\varrho$ is consistent in MSPE when the design matrix is nearly low rank, even when $p/n$ is large, and without any sparsity constraints on $\beta$. We now state some assumptions.

A1. There is a number $\nu > 0$, and absolute constants $\kappa_1, \kappa_2 > 0$, such that
$$\kappa_1 i^{-\nu} \leq \lambda_i(\hat\Sigma) \leq \kappa_2 i^{-\nu} \quad \text{for all } i = 1, \dots, n \wedge p,$$
where $\hat\Sigma := \tfrac{1}{n}X^\top X$ denotes the sample covariance matrix.
A2. There are absolute constants $\theta, \gamma > 0$ such that for every $n \geq 1$, $\varrho/n = n^{-\theta}$ and $\rho/n = n^{-\gamma}$.
A3. The vector $\beta \in \mathbb{R}^p$ satisfies $\|\beta\|_2 \lesssim 1$.

Due to Theorem 2, the following bound shows that the residuals of $\hat\beta_\varrho$ may be used to extract a consistent approximation to $F_0$. Two other notable features of the bound are that it is non-asymptotic and dimension-free.

Theorem 3. Suppose that $X \in \mathbb{R}^{n \times p}$ is fixed and that Assumptions A1-A3 hold, with $p/n \asymp 1$. Assume further that $\theta$ is chosen as $\theta = 2\nu/3$ when $\nu \in (0, \tfrac12)$, and $\theta = \tfrac{\nu}{\nu+1}$ when $\nu > \tfrac12$. Then,
$$\mathrm{mspe}(\hat\beta_\varrho\,|\,X) \lesssim \begin{cases} n^{-2\nu/3} & \text{if } \nu \in (0, \tfrac12), \\ n^{-\nu/(\nu+1)} & \text{if } \nu > \tfrac12. \end{cases} \qquad (14)$$
Also, both bounds in (14) are tight, in the sense that $\beta$ can be chosen so that $\hat\beta_\varrho$ attains either rate.

Remarks.
Since the eigenvalues $\lambda_i(\hat\Sigma)$ are observable, they may be used to estimate $\nu$ and guide the selection of $\varrho/n = n^{-\theta}$. However, from a practical point of view, we found it easier to select $\varrho$ via cross-validation in numerical experiments, rather than via an estimate of $\nu$.

A link with Pinsker's Theorem. In the particular case when $F_0$ is a centered Gaussian distribution, the "prediction problem" of estimating $X\beta$ is very similar to estimating the mean parameters of a Gaussian sequence model, with error measured in the $\ell_2$ norm. In the alternative sequence-model format, the decay condition on the eigenvalues of $\tfrac{1}{n}X^\top X$ translates into an ellipsoid constraint on the mean parameter sequence [16, 17]. For this reason, Theorem 3 may be viewed as a "regression version" of $\ell_2$ error bounds for the sequence model under an ellipsoid constraint (cf. Pinsker's Theorem, [16, 17]). Because the latter problem has a very well developed literature, there may be various "neighboring results" elsewhere. Nevertheless, we could not find a direct reference for our stated MSPE bound in the current setup. For the purposes of our work in this paper, the more important point to take away from Theorem 3 is that it can be coupled with Theorem 2 to prove consistency of RB.

3.4 Confidence intervals for mean responses, conditionally on a Gaussian design

In this section, we consider the situation where the design matrix $X$ has rows $X_i^\top \in \mathbb{R}^p$ drawn i.i.d. from a multivariate normal distribution $N(0, \Sigma)$, with $X \perp\!\!\!\perp \varepsilon$. (The covariance matrix $\Sigma$ may vary with $n$.)
Conditionally on a realization of $X$, we analyze the RB approximation of the laws $\Psi_\rho(F_0; X_i) = \mathcal{L}(X_i^\top(\hat\beta_\rho - \beta)\,|\,X)$. As discussed in Section 1, this corresponds to the problem of setting confidence intervals for the mean responses $\mathbb{E}[Y_i] = X_i^\top\beta$. Assuming that the population eigenvalues $\lambda_i(\Sigma)$ obey a decay condition, we show below in Theorem 4 that RB succeeds with high $P_X$-probability. Moreover, this consistency statement holds for all of the laws $\Psi_\rho(F_0; X_i)$ simultaneously. That is, among the $n$ distinct laws $\Psi_\rho(F_0; X_i)$, $i = 1, \dots, n$, even the worst bootstrap approximation is still consistent. We now state some population-level assumptions.

A4. The operator norm of $\Sigma \in \mathbb{R}^{p \times p}$ satisfies $\|\Sigma\|_{op} \lesssim 1$.

Next, we impose a decay condition on the eigenvalues of $\Sigma$. This condition also ensures that $\Sigma$ is invertible for each fixed $p$, even though the bottom eigenvalue may become arbitrarily small as $p$ becomes large. It is important to notice that we now use $\eta$ for the decay exponent of the population eigenvalues, whereas we used $\nu$ when describing the sample eigenvalues in the previous section.

A5. There is a number $\eta > 0$, and absolute constants $k_1, k_2 > 0$, such that for all $i = 1, \dots, p$,
$$k_1 i^{-\eta} \leq \lambda_i(\Sigma) \leq k_2 i^{-\eta}.$$
A6. There are absolute constants $k_3, k_4 \in (0, 1)$ such that for all $n \geq 3$, we have the bounds $k_3 \leq \tfrac{p}{n} \leq k_4$ and $p \leq n - 2$.

The following lemma collects most of the effort needed in proving our final result in Theorem 4. Here it is also helpful to recall the notation $\rho/n = n^{-\gamma}$ and $\varrho/n = n^{-\theta}$ from Assumption A2.

Lemma 2. Suppose that the matrix $X \in \mathbb{R}^{n \times p}$ has rows $X_i^\top$ drawn i.i.d. from $N(0, \Sigma)$, and that Assumptions A2-A6 hold. Furthermore, assume that $\gamma$ is chosen so that $0 < \gamma < \min\{\eta, 1\}$.
Then, the statements below are true.

(i) (bias inequality) Fix any $\tau > 0$. Then, there is an absolute constant $\kappa_0 > 0$ such that for all large $n$, the following event holds with $P_X$-probability at least $1 - n^{-\tau} - ne^{-n/16}$:
$$\max_{1 \leq i \leq n} b_\rho^2(X; X_i) \leq \kappa_0 \cdot n^{-\gamma} \cdot (\tau + 1)\log(n + 2). \qquad (15)$$

(ii) (variance inequality) There are absolute constants $\kappa_1, \kappa_2 > 0$ such that for all large $n$, the following event holds with $P_X$-probability at least $1 - 4n\exp(-\kappa_1 n^{\gamma/\eta})$:
$$\max_{1 \leq i \leq n} \frac{1}{v_\rho(X; X_i)} \leq \kappa_2\, n^{1 - \gamma/\eta}. \qquad (16)$$

(iii) (mspe inequalities) Suppose that $\theta$ is chosen as $\theta = 2\eta/3$ when $\eta \in (0, \tfrac12)$, and that $\theta = \tfrac{\eta}{1+\eta}$ when $\eta > \tfrac12$. Then, there are absolute constants $\kappa_3, \kappa_4, \kappa_5, \kappa_6 > 0$ such that for all large $n$,
$$\mathrm{mspe}(\hat\beta_\varrho\,|\,X) \leq \begin{cases} \kappa_4\, n^{-2\eta/3} & \text{with $P_X$-probability at least } 1 - \exp(-\kappa_3 n^{2 - 4\eta/3}), \text{ if } \eta \in (0, \tfrac12), \\ \kappa_6\, n^{-\eta/(\eta+1)} & \text{with $P_X$-probability at least } 1 - \exp(-\kappa_5 n^{2/(1+\eta)}), \text{ if } \eta > \tfrac12. \end{cases} \qquad (17)$$

Remarks. Note that the two rates in part (iii) coincide as $\eta$ approaches $1/2$. At a conceptual level, the entire lemma may be explained in relatively simple terms. Viewing the quantities $\mathrm{mspe}(\hat\beta_\varrho\,|\,X)$, $b_\rho^2(X; X_i)$, and $v_\rho(X; X_i)$ as functionals of a Gaussian matrix, the proof involves deriving concentration bounds for each of them. Indeed, this is plausible given that these quantities are smooth functionals of $X$. However, the difficulty of the proof arises from the fact that they are also highly non-linear functionals of $X$.
We now combine Lemmas 1 and 2 with Theorems 1 and 2 to show that all of the laws $\Psi_\rho(F_0; X_i)$ can be simultaneously approximated via our two-stage RB method.

Theorem 4. Suppose that $F_0$ has a finite fourth moment, Assumptions 2--6 hold, and $\gamma$ is chosen so that $\tfrac{\eta}{1+\eta} < \gamma < \min\{\eta, 1\}$. Also suppose that $\theta$ is chosen as $\theta = 2\eta/3$ when $\eta \in (0, \tfrac12)$, and $\theta = \tfrac{\eta}{\eta+1}$ when $\eta > \tfrac12$. Then, there is a sequence of positive numbers $\delta_n$ with $\lim_{n\to\infty} \delta_n = 0$, such that the event
$$E\Big[\, \max_{1 \le i \le n} d_2^2\Big(\tfrac{1}{\sqrt{v_\rho}}\,\Psi_\rho(F_0; X_i),\ \tfrac{1}{\sqrt{v_\rho}}\,\Phi_\rho(\hat F_\varrho; X_i)\Big) \,\Big|\, X \Big] \le \delta_n$$
has $P_X$-probability tending to 1 as $n \to \infty$.

Remark. Lemma 2 gives explicit bounds on the numbers $\delta_n$, as well as the probabilities of the corresponding events, but we have stated the result in this way for the sake of readability.

4 Simulations

In four different settings of $n$, $p$, and the decay parameter $\eta$, we compared the nominal 90% confidence intervals (CIs) of four methods: "oracle", "ridge", "normal", and "OLS", to be described below. In each setting, we generated $N_1 := 100$ random designs $X$ with i.i.d. rows drawn from $N(0, \Sigma)$, where $\lambda_j(\Sigma) = j^{-\eta}$, $j = 1, \dots, p$, and the eigenvectors of $\Sigma$ were drawn randomly by setting them to be the Q factor in a QR decomposition of a standard $p \times p$ Gaussian matrix. Then, for each realization of $X$, we generated $N_2 := 1000$ realizations of $Y$ according to the model (1), where $\beta = \mathbf{1}/\|\mathbf{1}\|_2 \in \mathbb{R}^p$, and $F_0$ is the centered $t$ distribution on 5 degrees of freedom, rescaled to have standard deviation $\sigma = 0.1$.
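The data-generating recipe above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' code; the function names `make_design` and `draw_response` are ours:

```python
import numpy as np

def make_design(n, p, eta, rng):
    """Draw X with i.i.d. N(0, Sigma) rows, where Sigma has eigenvalues
    lambda_j = j^{-eta} and a random orthogonal eigenbasis, taken as the
    Q factor in a QR decomposition of a standard p x p Gaussian matrix."""
    lam = np.arange(1, p + 1, dtype=float) ** (-eta)
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    sqrt_sigma = (Q * np.sqrt(lam)) @ Q.T       # symmetric square root of Sigma
    return rng.standard_normal((n, p)) @ sqrt_sigma

def draw_response(X, rng, sigma=0.1, df=5):
    """Y = X beta + eps, with beta = 1/||1||_2 and eps drawn from a centered
    t_5 distribution rescaled to have standard deviation sigma."""
    n, p = X.shape
    beta = np.ones(p) / np.sqrt(p)              # equals 1/||1||_2
    eps = rng.standard_t(df, size=n) * sigma / np.sqrt(df / (df - 2))
    mu = X @ beta                               # mean responses X_i' beta
    return mu, mu + eps

rng = np.random.default_rng(0)
X = make_design(n=100, p=45, eta=0.5, rng=rng)
mu, Y = draw_response(X, rng)
print(X.shape, Y.shape)  # (100, 45) (100,)
```

The rescaling factor `np.sqrt(df / (df - 2))` is the standard deviation of a $t_5$ variable, so `eps` has standard deviation `sigma` exactly as described.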
For each $X$, and each corresponding $Y$, we considered the problem of setting a 90% CI for the mean response value $X_{i^\star}^\top \beta$, where $X_{i^\star}^\top$ is the row with the highest leverage score, i.e. $i^\star = \operatorname{argmax}_{1 \le i \le n} H_{ii}$ and $H := X(X^\top X)^{-1} X^\top$. This problem was shown in Bickel and Freedman (1983) [12] to be a case where the standard RB method based on least squares fails when $p/n \asymp 1$. Below, we refer to this method as "OLS".

To describe the other three methods, "ridge" refers to the interval $[X_{i^\star}^\top \hat\beta_\rho - \hat q_{0.95},\ X_{i^\star}^\top \hat\beta_\rho - \hat q_{0.05}]$, where $\hat q_\alpha$ is the $\alpha$ quantile of the numbers $z_1, \dots, z_B$ computed in the proposed algorithm in Section 2, with $B = 1000$ and $c^\top = X_{i^\star}^\top$. To choose the parameters $\rho$ and $\varrho$ for a given $X$ and $Y$, we first computed $\hat r$ as the value that optimized the MSPE of a ridge estimator $\hat\beta_r$ with respect to 5-fold cross-validation; i.e. cross-validation was performed for every distinct pair $(X, Y)$. We then put $\varrho = 5\hat r$ and $\rho = 0.1\hat r$, as we found the prefactors 5 and 0.1 to work adequately across various settings. (Optimizing $\varrho$ with respect to MSPE is motivated by Theorems 1, 2, and 3. Also, choosing $\rho$ to be somewhat smaller than $\varrho$ conforms with the constraints on $\theta$ and $\gamma$ in Theorem 4.) The method "normal" refers to the CI based on the normal approximation $\mathcal{L}(X_{i^\star}^\top(\hat\beta_\rho - \beta)\,|\,X) \approx N(0, \hat\tau^2)$, where $\hat\tau^2 = \hat\sigma^2 \|X_{i^\star}^\top (X^\top X + \rho I_{p\times p})^{-1} X^\top\|_2^2$, $\rho = 0.1\hat r$, and $\hat\sigma^2$ is the usual unbiased estimate of $\sigma^2$ based on OLS residuals. The "oracle" method refers to the interval $[X_{i^\star}^\top \hat\beta_\rho - \tilde q_{0.95},\ X_{i^\star}^\top \hat\beta_\rho - \tilde q_{0.05}]$, with $\rho = 0.1\hat r$, and $\tilde q_\alpha$ being the empirical $\alpha$ quantile of $X_{i^\star}^\top(\hat\beta_\rho - \beta)$ over all 1000 realizations of $Y$ based on a given $X$. (This accounts for the randomness in $\rho = 0.1\hat r$.)

Within a given setting of the triplet $(n, p, \eta)$, we refer to the "coverage" of a method as the fraction of the $N_1 \times N_2 = 10^5$ instances where the method's CI contained the parameter $X_{i^\star}^\top \beta$. Also, we refer to the "width" as the average width of a method's intervals over all of the $10^5$ instances. The four settings of $(n, p, \eta)$ correspond to moderate/high dimension and moderate/fast decay of the eigenvalues $\lambda_i(\Sigma)$. Even in the moderate case of $p/n = 0.45$, the results show that the OLS intervals are too narrow and have coverage noticeably less than 90%. As expected, this effect becomes more pronounced when $p/n = 0.95$. The ridge and normal intervals perform reasonably well across settings, with both performing much better than OLS. However, it should be emphasized that our study of RB is motivated by the desire to gain insight into the behavior of the bootstrap in high dimensions, rather than trying to outperform particular methods.
In future work, we plan to investigate the relative merits of the ridge and normal intervals in greater detail.

Table 1: Comparison of nominal 90% confidence intervals

                                     oracle   ridge   normal   OLS
setting 1 (n = 100, p = 45, η = 0.5)
  width                                0.21    0.20     0.23    0.16
  coverage                             0.90    0.87     0.91    0.81
setting 2 (n = 100, p = 95, η = 0.5)
  width                                0.22    0.26     0.26    0.06
  coverage                             0.90    0.88     0.88    0.42
setting 3 (n = 100, p = 45, η = 1)
  width                                0.20    0.21     0.22    0.16
  coverage                             0.90    0.90     0.91    0.81
setting 4 (n = 100, p = 95, η = 1)
  width                                0.21    0.26     0.23    0.06
  coverage                             0.90    0.92     0.87    0.42

Acknowledgements. MEL thanks Prof. Peter J. Bickel for many helpful discussions, and gratefully acknowledges the DOE CSGF under grant DE-FG02-97ER25308, as well as the NSF-GRFP.

References

[1] C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B, 76(1):217–242, 2014.

[2] A. Javanmard and A. Montanari. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. arXiv preprint arXiv:1301.4240, 2013.

[3] A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. arXiv preprint arXiv:1306.3171, 2013.

[4] P. Bühlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.

[5] S. van de Geer, P. Bühlmann, and Y. Ritov. On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv preprint arXiv:1303.0518, 2013.

[6] J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor. Exact inference after model selection via the lasso. arXiv preprint arXiv:1311.6238, 2013.

[7] A. Chatterjee and S. N. Lahiri.
Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. The Annals of Statistics, 41(3):1232–1259, 2013.

[8] H. Liu and B. Yu. Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression. Electronic Journal of Statistics, 7:3124–3169, 2013.

[9] V. Chernozhukov, D. Chetverikov, and K. Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.

[10] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer, 2005.

[11] D. A. Freedman. Bootstrapping regression models. The Annals of Statistics, 9(6):1218–1228, 1981.

[12] P. J. Bickel and D. A. Freedman. Bootstrapping regression models with many parameters. In Festschrift for Erich L. Lehmann, pages 28–48. Wadsworth, 1983.

[13] N. R. Draper and H. Smith. Applied regression analysis. Wiley-Interscience, 1998.

[14] P. J. Bickel and D. A. Freedman. Some asymptotic theory for the bootstrap. The Annals of Statistics, pages 1196–1217, 1981.

[15] S. Bobkov and M. Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Preprint, 2014.

[16] A. B. Tsybakov. Introduction to nonparametric estimation. Springer, 2009.

[17] L. Wasserman. All of nonparametric statistics. Springer, 2006.