{"title": "High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity", "book": "Advances in Neural Information Processing Systems", "page_first": 2726, "page_last": 2734, "abstract": "Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings.", "full_text": "High-dimensional regression with noisy and missing data:\n\nProvable guarantees with non-convexity\n\nPo-Ling Loh\n\nDepartment of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nploh@berkeley.edu\n\nMartin J. 
Wainwright\n\nDepartments of Statistics and EECS\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nwainwrig@stat.berkeley.edu\n\nAbstract\n\nAlthough the standard formulations of prediction problems involve fully-observed and noise-\nless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data,\npossibly involving dependencies. We study these issues in the context of high-dimensional\nsparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or\ndependent data. Many standard approaches to noisy or missing data, such as those using the\nEM algorithm, lead to optimization problems that are inherently non-convex, and it is dif\ufb01cult\nto establish theoretical guarantees on practical algorithms. While our approach also involves\noptimizing non-convex programs, we are able to both analyze the statistical error associated\nwith any global optimum, and prove that a simple projected gradient descent algorithm will\nconverge in polynomial time to a small neighborhood of the set of global minimizers. On\nthe statistical side, we provide non-asymptotic bounds that hold with high probability for the\ncases of noisy, missing, and/or dependent data. On the computational side, we prove that\nunder the same types of conditions required for statistical consistency, the projected gradient\ndescent algorithm will converge at geometric rates to a near-global minimizer. We illustrate\nthese theoretical predictions with simulations, showing agreement with the predicted scalings.\n\n1\n\nIntroduction\n\nIn standard formulations of prediction problems, it is assumed that the covariates are fully-observed and sam-\npled independently from some underlying distribution. However, these assumptions are not realistic for many\napplications, in which covariates may be observed only partially, observed with corruption, or exhibit dependen-\ncies. 
Consider the problem of modeling the voting behavior of politicians: in this setting, votes may be missing due to abstentions, and temporally dependent due to collusion or "tit-for-tat" behavior. Similarly, surveys often suffer from the missing data problem, since users fail to respond to all questions. Sensor network data also tends to be both noisy due to measurement error, and partially missing due to failures or drop-outs of sensors.
There are a variety of methods for dealing with noisy and/or missing data, including various heuristic methods, as well as likelihood-based methods involving the expectation-maximization (EM) algorithm (e.g., see the book [1] and references therein). A challenge in this context is the possible non-convexity of the associated optimization problems. For instance, in applications of EM, problems in which the negative likelihood is a convex function often become non-convex with missing or noisy data. Consequently, although the EM algorithm will converge to a local minimum, it is difficult to guarantee that the local optimum is close to a global minimum.
In this paper, we study these issues in the context of high-dimensional sparse linear regression, in the case when the predictors or covariates are noisy, missing, and/or dependent. Our main contribution is to develop and study some simple methods for handling these issues, and to prove theoretical results about both the associated statistical error and the optimization error. Like EM-based approaches, our estimators are based on solving optimization problems that may be non-convex; however, despite this non-convexity, we are still able to prove that a simple form of projected gradient descent will produce an output that is "sufficiently close", meaning as small as the statistical error, to any global optimum.
As a second result, we bound the size of this statistical error, showing that it has the same scaling as the minimax rates for the classical cases of perfectly observed and independently sampled covariates. In this way, we obtain estimators for noisy, missing, and/or dependent data with guarantees similar to the usual fully-observed and independent case. The resulting estimators also allow us to solve the problem of high-dimensional Gaussian graphical model selection with missing data.
There is a large body of work on the problem of corrupted covariates or errors-in-variables for regression problems (see the papers and books [2, 3, 4, 5] and references therein). Much of the earlier theoretical work is classical in nature, where the sample size n diverges with the dimension p held fixed. Most relevant to this paper is more recent work that has examined issues of corrupted and/or missing data in the context of high-dimensional sparse linear models, allowing for n ≪ p. Städler and Bühlmann [6] developed an EM-based method for sparse inverse covariance matrix estimation in the missing data regime, and used this result to derive an algorithm for sparse linear regression with missing data. As mentioned above, however, it is difficult to guarantee that EM will converge to a point close to a global optimum of the likelihood, in contrast to the methods studied here. Rosenbaum and Tsybakov [7] studied the sparse linear model when the covariates are corrupted by noise, and proposed a modified form of the Dantzig selector, involving a convex program. This convexity produces a computationally attractive method, but the statistical error bounds that they establish scale proportionally with the size of the additive perturbation, and hence are often weaker than the bounds that can be proved using our methods.
The remainder of this paper is organized as follows. We begin in Section 2 with background and a precise description of the problem.
We then introduce the class of estimators we will consider and the form of the projected gradient descent algorithm. Section 3 is devoted to a description of our main results, including a pair of general theorems on the statistical and optimization error, and then a series of corollaries applying our results to the cases of noisy, missing, and dependent data. In Section 4, we present simulations confirming that our methods work in practice. For detailed proofs, we refer the reader to the technical report [8].

Notation. For a matrix M, we write ‖M‖_max := max_{i,j} |m_ij| for the elementwise ℓ∞-norm of M. Furthermore, |||M|||_1 denotes the induced ℓ1-operator norm (maximum absolute column sum) of M, and |||M|||_op is the induced ℓ2-operator norm (spectral norm) of M. We write κ(M) := λ_max(M)/λ_min(M) for the condition number of M.

2 Background and problem set-up

In this section, we provide a formal description of the problem and motivate the class of estimators studied in the paper. We then describe a class of projected gradient descent algorithms to be used in the sequel.

2.1 Observation model and high-dimensional framework

Suppose we observe a response variable y_i ∈ R that is linked to a covariate vector x_i ∈ R^p via the linear model

  y_i = ⟨x_i, β*⟩ + ε_i,  for i = 1, 2, ..., n.  (1)

Here, the regression vector β* ∈ R^p is unknown, and ε_i ∈ R is observation noise, independent of x_i. Rather than directly observing each x_i ∈ R^p, we observe a vector z_i ∈ R^p linked to x_i via some conditional distribution:

  z_i ~ Q(· | x_i),  for i = 1, 2, ..., n.  (2)

This setup allows us to model various types of disturbances to the covariates, including

(a) Additive noise: We observe z_i = x_i + w_i, where w_i ∈ R^p is a random vector independent of x_i, say zero-mean with known covariance matrix Σ_w.

(b) Missing data: For a fraction ρ ∈ [0, 1), we observe a random vector z_i ∈ R^p such that independently for each component j, we observe z_ij = x_ij with probability 1 − ρ, and z_ij = ∗ with probability ρ. This model can also be generalized to allow for different missing probabilities for each covariate.

Our first set of results is deterministic, depending on specific instantiations of the observed variables {(y_i, z_i)}_{i=1}^n. However, we are also interested in proving results that hold with high probability when the x_i's and z_i's are drawn at random from some distribution. We develop results for both the i.i.d. setting and the case of dependent covariates, where the x_i's are generated according to a stationary vector autoregressive (VAR) process. Furthermore, we work within a high-dimensional framework where n ≪ p, and assume β* has at most k non-zero parameters, where the sparsity k is also allowed to increase to infinity with the sample size n. We assume the scaling ‖β*‖_2 = O(1), which is reasonable in order to have a non-diverging signal-to-noise ratio.

2.2 M-estimators for noisy and missing covariates

We begin by examining a simple deterministic problem. Let Cov(X) = Σ_x ≻ 0, and consider the program

  β̂ ∈ arg min_{‖β‖_1 ≤ R} { (1/2) β^T Σ_x β − ⟨Σ_x β*, β⟩ }.  (3)

As long as the constraint radius R is at least ‖β*‖_1, the unique solution to this convex program is β̂ = β*. This idealization suggests various estimators based on the plug-in principle. We form unbiased estimates of Σ_x and Σ_x β*, denoted by Γ̂ and γ̂, respectively, and consider the modified program and its regularized version:

  β̂ ∈ arg min_{‖β‖_1 ≤ R} { (1/2) β^T Γ̂ β − ⟨γ̂, β⟩ },  (4)
  β̂ ∈ arg min_{β ∈ R^p} { (1/2) β^T Γ̂ β − ⟨γ̂, β⟩ + λ_n ‖β‖_1 },  (5)

where λ_n > 0 is the regularization parameter. The Lasso [9, 10] is a special case of these programs, where

  Γ̂_Las := (1/n) X^T X  and  γ̂_Las := (1/n) X^T y,  (6)

and we have introduced the shorthand y = (y_1, ..., y_n)^T ∈ R^n and X ∈ R^{n×p}, with x_i^T as its i-th row. In this paper, we focus on more general instantiations of the programs (4) and (5), involving different choices of the pair (Γ̂, γ̂) that are adapted to the cases of noisy and/or missing data. Note that the matrix Γ̂_Las is positive semidefinite, so the Lasso program is convex. In sharp contrast, for the cases of noisy or missing data, the most natural choice of the matrix Γ̂ is not positive semidefinite, hence the loss functions appearing in the problems (4) and (5) are non-convex. It is generally impossible to provide a polynomial-time algorithm that converges to a (near) global optimum of a non-convex problem. Remarkably, we prove that a simple projected gradient descent algorithm still converges with high probability to a vector close to any global optimum in our setting.
Let us illustrate these ideas with some examples:

Example 1 (Additive noise). Suppose we observe the n × p matrix Z = X + W, where W is a random matrix independent of X, with rows w_i drawn i.i.d. from a zero-mean distribution with known covariance Σ_w. Consider the pair

  Γ̂_add := (1/n) Z^T Z − Σ_w  and  γ̂_add := (1/n) Z^T y,  (7)

which correspond to unbiased estimators of Σ_x and Σ_x β*, respectively.
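As a quick sanity check on Example 1, the pair (7) can be formed directly; the following is a minimal numpy sketch (the instance parameters and variable names are our own illustrative choices), showing that subtracting Σ_w can make Γ̂_add indefinite when n ≪ p:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma_w = 50, 200, 0.2          # high-dimensional regime: n << p

# Ground truth: sparse beta*, i.i.d. standard Gaussian design (Sigma_x = I)
beta_star = np.zeros(p)
beta_star[:5] = 1.0 / np.sqrt(5)
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_normal(n)

# Additive noise: Z = X + W with known Sigma_w = sigma_w^2 I
Z = X + sigma_w * rng.standard_normal((n, p))

Gamma_add = Z.T @ Z / n - sigma_w**2 * np.eye(p)   # unbiased for Sigma_x
gamma_add = Z.T @ y / n                            # unbiased for Sigma_x beta*

# (1/n) Z'Z has rank at most n, so Gamma_add has at least p - n eigenvalues
# equal to -sigma_w^2: the quadratic loss in (4) is non-convex here.
min_eig = np.linalg.eigvalsh(Gamma_add).min()
```

Because the rank deficiency is deterministic, `min_eig` is negative for any draw of the data in this regime, not just with high probability.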
Note that when Σ_w = 0 (corresponding to the noiseless case), the estimators reduce to the standard Lasso. However, when Σ_w ≠ 0, the matrix Γ̂_add is not positive semidefinite in the high-dimensional regime (n ≪ p) of interest. Indeed, since the matrix (1/n) Z^T Z has rank at most n, the subtracted matrix Σ_w may cause Γ̂_add to have a large number of negative eigenvalues.

Example 2 (Missing data). Suppose each entry of X is missing independently with probability ρ ∈ [0, 1), and we observe the matrix Z ∈ R^{n×p} with entries

  Z_ij = X_ij with probability 1 − ρ, and Z_ij = 0 otherwise.

Given the observed matrix Z ∈ R^{n×p}, consider an estimator of the general form (4), based on the choices

  Γ̂_mis := (1/n) Z̃^T Z̃ − ρ diag((1/n) Z̃^T Z̃)  and  γ̂_mis := (1/n) Z̃^T y,  (8)

where Z̃_ij = Z_ij/(1 − ρ). It is easy to see that the pair (Γ̂_mis, γ̂_mis) reduces to the pair (Γ̂_Las, γ̂_Las) for the standard Lasso when ρ = 0, corresponding to no missing data. In the more interesting case when ρ ∈ (0, 1), the matrix (1/n) Z̃^T Z̃ in equation (8) has rank at most n, so the subtracted diagonal matrix may cause the matrix Γ̂_mis to have a large number of negative eigenvalues when n ≪ p, and the associated quadratic function is not convex.

2.3 Restricted eigenvalue conditions

Given an estimate β̂, there are various ways to assess its closeness to β*. We focus on the ℓ2-norm ‖β̂ − β*‖_2, as well as the closely related ℓ1-norm ‖β̂ − β*‖_1.
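Returning to Example 2, the missing-data surrogates (8) admit an equally short sketch (a hypothetical numpy fragment of ours; the check that the pair reduces to the Lasso pair at ρ = 0 mirrors the remark above):

```python
import numpy as np

def missing_surrogates(Z, y, rho):
    """Surrogates (8); missing entries of Z are stored as zeros."""
    n = Z.shape[0]
    Zt = Z / (1.0 - rho)                    # rescaled matrix Z-tilde
    A = Zt.T @ Zt / n
    Gamma = A - rho * np.diag(np.diag(A))   # diagonal correction
    gamma = Zt.T @ y / n
    return Gamma, gamma

rng = np.random.default_rng(1)
n, p, rho = 50, 200, 0.2
X = rng.standard_normal((n, p))
y = X[:, :5].sum(axis=1) + 0.1 * rng.standard_normal(n)
Z = X * (rng.random((n, p)) > rho)          # each entry observed w.p. 1 - rho

Gamma_mis, gamma_mis = missing_surrogates(Z, y, rho)
# rho = 0 recovers the standard Lasso pair (6)
Gamma_las, gamma_las = missing_surrogates(X, y, 0.0)

# With rho > 0 and n << p, the diagonal correction again produces negative
# eigenvalues, so program (4) is non-convex in this case as well.
```

For any vector in the (at least (p − n)-dimensional) null space of Z̃ᵀZ̃/n, the quadratic form equals minus the ρ-weighted diagonal term, which is why the minimum eigenvalue is negative here.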
When the covariate matrix X is fully observed (so that the Lasso can be applied), it is well understood that a sufficient condition for ℓ2-recovery is that the matrix Γ̂_Las = (1/n) X^T X satisfy a restricted eigenvalue (RE) condition (e.g., [11, 12, 13]). In this paper, we use the following condition:

Definition 1 (Lower-RE condition). The matrix Γ̂ satisfies a lower restricted eigenvalue condition with curvature α_ℓ > 0 and tolerance τ_ℓ(n, p) > 0 if

  θ^T Γ̂ θ ≥ α_ℓ ‖θ‖_2² − τ_ℓ(n, p) ‖θ‖_1²  for all θ ∈ R^p.  (9)

It can be shown that when the Lasso matrix Γ̂_Las = (1/n) X^T X satisfies this RE condition (9), the Lasso estimate has low ℓ2-error for any vector β* supported on any subset of size at most k ≲ 1/τ_ℓ(n, p). Moreover, it is known that for various random choices of the design matrix X, the Lasso matrix Γ̂_Las will satisfy such an RE condition with high probability (e.g., [14]). We also make use of the analogous upper restricted eigenvalue condition:

Definition 2 (Upper-RE condition). The matrix Γ̂ satisfies an upper restricted eigenvalue condition with smoothness α_u > 0 and tolerance τ_u(n, p) > 0 if

  θ^T Γ̂ θ ≤ α_u ‖θ‖_2² + τ_u(n, p) ‖θ‖_1²  for all θ ∈ R^p.  (10)

In recent work on high-dimensional projected gradient descent, Agarwal et al. [15] use a more general form of the bounds (9) and (10), called the restricted strong convexity (RSC) and restricted smoothness (RSM) conditions.

2.4 Projected gradient descent

In addition to proving results about the global minima of the programs (4) and (5), we are also interested in polynomial-time procedures for approximating such optima. We show that the simple projected gradient descent algorithm can be used to solve the program (4).
The algorithm generates a sequence of iterates β^t according to

  β^{t+1} = Π( β^t − (1/η) (Γ̂ β^t − γ̂) ),  (11)

where η > 0 is a stepsize parameter, and Π denotes the ℓ2-projection onto the ℓ1-ball of radius R. This projection can be computed rapidly in O(p) time, for instance using a procedure due to Duchi et al. [16]. Our analysis shows that under a reasonable set of conditions, the iterates for the family of programs (4) converge to a point extremely close to any global optimum in both ℓ1-norm and ℓ2-norm, even for the non-convex program.

3 Main results and consequences

We provide theoretical guarantees for both the constrained estimator (4) and the regularized variant

  β̂ ∈ arg min_{‖β‖_1 ≤ b0 √k} { (1/2) β^T Γ̂ β − ⟨γ̂, β⟩ + λ_n ‖β‖_1 },  (12)

for a constant b0 ≥ ‖β*‖_2, which is a hybrid between the constrained (4) and regularized (5) programs.

3.1 Statistical error

In controlling the statistical error, we assume that the matrix Γ̂ satisfies a lower-RE condition with curvature α_ℓ and tolerance τ_ℓ(n, p), as previously defined (9). In addition, recall that the matrix Γ̂ and vector γ̂ serve as surrogates to the deterministic quantities Σ_x ∈ R^{p×p} and Σ_x β* ∈ R^p, respectively. We assume there is a function φ(Q, σ_ε), depending on the standard deviation σ_ε of the observation noise vector ε from equation (1) and the conditional distribution Q from equation (2), such that the following deviation conditions are satisfied:

  ‖(Γ̂ − Σ_x) β*‖_∞ ≤ φ(Q, σ_ε) √(log p / n)  and  ‖γ̂ − Σ_x β*‖_∞ ≤ φ(Q, σ_ε) √(log p / n).  (13)

The following result applies to any global optimum β̂ of the program (12) with λ_n ≥ 4 φ(Q, σ_ε) √(log p / n).

Theorem 1 (Statistical error). Suppose the surrogates (Γ̂, γ̂) satisfy the deviation bounds (13), and the matrix Γ̂ satisfies the lower-RE condition (9) with parameters (α_ℓ, τ_ℓ) such that

  √k τ_ℓ(n, p) ≤ min{ α_ℓ / (128 √k), (φ(Q, σ_ε) / (2 b0)) √(log p / n) }.  (14)

Then for any vector β* with sparsity at most k, there is a universal positive constant c0 such that any global optimum β̂ satisfies the bounds

  ‖β̂ − β*‖_2 ≤ (c0 √k / α_ℓ) max{ φ(Q, σ_ε) √(log p / n), λ_n },  (15a)
  ‖β̂ − β*‖_1 ≤ (8 c0 k / α_ℓ) max{ φ(Q, σ_ε) √(log p / n), λ_n }.  (15b)

The same bounds (without λ_n) also apply to the constrained program (4) with the radius choice R = ‖β*‖_1.

Remarks: Note that for the standard Lasso pair (Γ̂_Las, γ̂_Las), bounds of the form (15) for sub-Gaussian noise are well-known from past work (e.g., [12, 17, 18, 19]). The novelty of Theorem 1 is in allowing for general pairs of such surrogates, which can lead to non-convexity in the underlying M-estimator.

3.2 Optimization error

Although Theorem 1 provides guarantees that hold uniformly for any choice of global minimizer, it does not provide any guidance on how to approximate such a global minimizer using a polynomial-time algorithm. Nonetheless, we are able to show that for the family of programs (4), under reasonable conditions on Γ̂ satisfied in various settings, a simple projected gradient method will converge geometrically fast to a very good approximation of any global optimum.

Theorem 2 (Optimization error). Consider the program (4) with any choice of radius R for which the constraint is active. Suppose that the surrogate matrix Γ̂ satisfies the lower-RE (9) and upper-RE (10) conditions with τ_u, τ_ℓ ≍ (log p)/n, and that we apply projected gradient descent (11) with constant stepsize η = 2α_u. Then as long as n ≳ k log p, there is a contraction coefficient γ ∈ (0, 1) independent of (n, p, k) and universal positive constants (c1, c2) such that for any global optimum β̂, the gradient descent iterates {β^t}_{t=0}^∞ satisfy the bound

  ‖β^t − β̂‖_2² ≤ γ^t ‖β^0 − β̂‖_2² + c1 (log p / n) ‖β̂ − β*‖_1² + c2 ‖β̂ − β*‖_2²  for all t = 0, 1, 2, ....  (16)

In addition, we have the ℓ1-bound

  ‖β^t − β̂‖_1 ≤ 2√k ‖β^t − β̂‖_2 + 2√k ‖β̂ − β*‖_2 + 2 ‖β̂ − β*‖_1  for all t = 0, 1, 2, ....  (17)

Note that the bound (16) controls the ℓ2-distance between the iterate β^t at time t, which is easily computed in polynomial time, and any global optimum β̂ of the program (4), which may be difficult to compute. Since γ ∈ (0, 1), the first term in the bound vanishes as t increases. Together with Theorem 1, equations (16) and (17) imply that the ℓ2- and ℓ1-optimization error are bounded as O(k log p / n) and O(k √(log p / n)), respectively.

3.3 Some consequences

Both Theorems 1 and 2 are deterministic results; applying them to specific models requires additional work to establish the stated conditions.
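To make the algorithmic side concrete, here is a self-contained sketch of the update (11) applied to the additive-noise surrogate of Example 1. The ℓ1-ball projection below uses the standard O(p log p) sort-based routine (a simplification of the O(p) method of Duchi et al. [16]); the instance parameters and names are illustrative choices of ours.

```python
import numpy as np

def project_l1(v, R):
    """Euclidean projection of v onto the l1-ball of radius R (sort-based)."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - R)[0][-1]
    tau = (css[k] - R) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd(Gamma, gamma, R, eta, T):
    """Iterates of update (11): beta <- Pi(beta - (Gamma beta - gamma)/eta)."""
    beta = np.zeros(gamma.size)
    iterates = [beta]
    for _ in range(T):
        beta = project_l1(beta - (Gamma @ beta - gamma) / eta, R)
        iterates.append(beta)
    return iterates

# Additive-noise instance (Example 1): Sigma_x = I, Sigma_w = sigma_w^2 I
rng = np.random.default_rng(2)
n, p, k, sigma_w = 400, 128, 8, 0.2
beta_star = np.zeros(p)
beta_star[:k] = 1.0 / np.sqrt(k)
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_normal(n)
Z = X + sigma_w * rng.standard_normal((n, p))
Gamma = Z.T @ Z / n - sigma_w**2 * np.eye(p)
gamma = Z.T @ y / n

R_ball = np.abs(beta_star).sum()                     # radius choice R = ||beta*||_1
eta = 2 * np.abs(np.linalg.eigvalsh(Gamma)).max()    # stepsize 1/eta, eta = 2*alpha_u
iters = pgd(Gamma, gamma, R_ball, eta, T=300)
err = np.linalg.norm(iters[-1] - beta_star)          # statistical-error scale
```

The proxy stepsize 2·max|eig(Γ̂)| stands in for the upper-RE smoothness 2α_u, which is not directly observable; the final iterate lands within statistical precision of β*, consistent with (15a) and (16).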
We turn to the statements of some consequences of these theorems for different cases of noisy, missing, and dependent data. A zero-mean random variable Z is sub-Gaussian with parameter σ > 0 if E[e^{λZ}] ≤ exp(λ²σ²/2) for all λ ∈ R. We say that a random matrix X ∈ R^{n×p} is sub-Gaussian with parameters (Σ, σ²) if each row x_i^T ∈ R^p is sampled independently from a zero-mean distribution with covariance Σ, and for any unit vector u ∈ R^p, the random variable u^T x_i is sub-Gaussian with parameter at most σ.
We begin with the case of i.i.d. samples with additive noise, as described in Example 1.

Corollary 1. Suppose we observe Z = X + W, where the random matrices X, W ∈ R^{n×p} are sub-Gaussian with parameters (Σ_x, σ_x²) and (Σ_w, σ_w²), respectively, and the sample size is lower-bounded as n ≳ max{ ((σ_x² + σ_w²) / λ_min(Σ_x))², 1 } k log p. Then for the M-estimator based on the surrogates (Γ̂_add, γ̂_add), the results of Theorems 1 and 2 hold with parameters

  α_ℓ = (1/2) λ_min(Σ_x)  and  φ(Q, σ_ε) = c0 { σ_x² + σ_w² + σ_ε √(σ_x² + σ_w²) },

with probability at least 1 − c1 exp(−c2 log p).

For i.i.d. samples with missing data, we have the following:

Corollary 2. Suppose X ∈ R^{n×p} is a sub-Gaussian matrix with parameters (Σ_x, σ_x²), and Z is the missing-data matrix with parameter ρ. If n ≳ max{ σ_x⁴ / ((1−ρ)⁴ λ_min²(Σ_x)), 1 } k log p, then Theorems 1 and 2 hold with probability at least 1 − c1 exp(−c2 log p) for α_ℓ = (1/2) λ_min(Σ_x) and

  φ(Q, σ_ε) = c0 (σ_x / (1−ρ)) ( σ_ε + σ_x / (1−ρ) ).

Now consider the case where the rows of X are drawn from a vector autoregressive (VAR) process according to

  x_{i+1} = A x_i + v_i,  for i = 1, 2, ..., n − 1,  (18)

where v_i ∈ R^p is a zero-mean noise vector with covariance matrix Σ_v, and A ∈ R^{p×p} is a driving matrix with spectral norm |||A|||_2 < 1. We assume the rows of X are drawn from a Gaussian distribution with covariance Σ_x, such that Σ_x = A Σ_x A^T + Σ_v, so the process is stationary. Corollary 3 corresponds to the case of additive noise for a Gaussian VAR process. A similar result can be derived in the missing data setting.

Corollary 3. Suppose the rows of X are drawn according to a Gaussian VAR process (18) with driving matrix A, and the additive noise matrix W is i.i.d. with Gaussian rows. If n ≳ max{ ζ⁴ / λ_min²(Σ_x), 1 } k log p, with

  ζ² = |||Σ_w|||_op + 2 |||Σ_x|||_op / (1 − |||A|||_op),

then Theorems 1 and 2 hold with probability at least 1 − c1 exp(−c2 log p) for α_ℓ = (1/2) λ_min(Σ_x) and

  φ(Q, σ_ε) = c0 (σ_ε ζ + ζ²).

3.4 Application to graphical model inverse covariance estimation

The problem of inverse covariance estimation for a Gaussian graphical model is closely related to the Lasso. Meinshausen and Bühlmann [20] prescribed a way to recover the support of the precision matrix Θ when each column of Θ is k-sparse, via linear regression and the Lasso. More recently, Yuan [21] proposed a method for estimating Θ using linear regression and the Dantzig selector, and obtained error bounds on |||Θ̂ − Θ|||_1 when the columns of Θ are bounded in ℓ1. Both of these results assume the rows of X are observed noiselessly and independently.
Suppose we are given a matrix X ∈ R^{n×p} of samples from a multivariate Gaussian distribution, where each row is distributed according to N(0, Σ). We assume the rows of X are either i.i.d. or sampled from a Gaussian VAR process (18). Based on the modified Lasso, we devise a method to estimate Θ from a corrupted observation matrix Z. Let X^j denote the j-th column of X, and let X^{−j} denote the matrix X with the j-th column removed. By standard results on Gaussian graphical models, there exists a vector θ^j ∈ R^{p−1} such that

  X^j = X^{−j} θ^j + ε^j,  (19)

where ε^j is a vector of i.i.d. Gaussians and ε^j ⊥⊥ X^{−j}. Defining a_j := −(Σ_jj − Σ_{j,−j} θ^j)^{−1}, we have Θ_{j,−j} = a_j θ^j.
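These standard Gaussian identities (θ^j = −Θ_{j,−j}/Θ_jj and a_j = −Θ_jj) are easy to verify numerically; a small sketch, using an arbitrary tridiagonal precision matrix of our choosing:

```python
import numpy as np

p, j = 6, 2
# Tridiagonal precision matrix: 1 on the diagonal, 0.1 on the off-diagonals
Theta = np.eye(p) + 0.1 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)

idx = [i for i in range(p) if i != j]
# Population regression of X^j on X^{-j}: theta_j = Sigma_{-j,-j}^{-1} Sigma_{-j,j}
theta_j = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, j])
# a_j := -(Sigma_jj - Sigma_{j,-j} theta_j)^{-1}
a_j = -1.0 / (Sigma[j, j] - Sigma[j, idx] @ theta_j)

# Check Theta_{j,-j} = a_j * theta_j, and the equivalent form a_j = -Theta_jj
assert np.allclose(Theta[j, idx], a_j * theta_j)
assert np.isclose(a_j, -Theta[j, j])
```

The second assertion follows because the residual variance Σ_jj − Σ_{j,−j}θ^j equals 1/Θ_jj by the Schur-complement formula for the inverse.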
Our algorithm estimates θ̂^j and â_j for each j and combines the estimates to obtain Θ̂_{j,−j} = â_j θ̂^j.
In the additive noise case, we observe Z = X + W. The equations (19) yield Z^j = X^{−j} θ^j + (ε^j + W^j). Note that δ^j = ε^j + W^j is a vector of i.i.d. Gaussians, and since X ⊥⊥ W, we have δ^j ⊥⊥ X^{−j}. Hence, our results on covariates with additive noise produce an estimate of θ^j by solving the program (4) or (12) with the pair (Γ̂^{(j)}, γ̂^{(j)}) = (Σ̂_{−j,−j}, (1/n) (Z^{−j})^T Z^j), where Σ̂ = (1/n) Z^T Z − Σ_w. When Z is a missing-data version of X, we similarly estimate the vectors θ^j with suitable corrections. We arrive at the following algorithm:

Algorithm 3.1.
(1) Perform p linear regressions of the variables Z^j upon the remaining variables Z^{−j}, using the modified Lasso program (4) or (12) with the estimators (Γ̂^{(j)}, γ̂^{(j)}), to obtain estimates θ̂^j.
(2) Estimate the scalars a_j using â_j := −(Σ̂_jj − Σ̂_{j,−j} θ̂^j)^{−1}. Set Θ̃_{j,−j} = â_j θ̂^j and Θ̃_jj = −â_j.
(3) Construct the matrix Θ̂ = arg min_{Θ ∈ S^p} |||Θ − Θ̃|||_1, where S^p is the set of symmetric p × p matrices.

Note that the minimization in step (3) is a linear program, so it is easily solved with standard methods. We have:

Corollary 4. Suppose the columns of the matrix Θ are k-sparse, and suppose the condition number κ(Θ) is nonzero and finite.
Suppose the deviation conditions\n\nk(b\u0393(j) \u2212 \u03a3\u2212j,\u2212j)\u03b8jk\u221e \u2264 \u03d5(Q, \u03c3\u0001)\n\nrlog p\n\nkb\u03b3(j) \u2212 \u03a3\u2212j,\u2212j\u03b8jk\u221e \u2264 \u03d5(Q, \u03c3\u0001)\n\nhold for all j, and suppose we have the following additional deviation condition onb\u03a3:\nFinally, suppose the lower-RE condition holds uniformly over the matrices b\u0393(j) with the scaling (14). Then\n\nrlog p\n\nunder the estimation procedure of Algorithm 3.1, there exists a universal constant c0 such that\n\n(20)\n\n(21)\n\nn\n\nn\n\n.\n\nn\n\nand\n\nrlog p\nkb\u03a3 \u2212 \u03a3kmax \u2264 c\u03d5(Q, \u03c3\u0001)\n(cid:0) \u03d5(Q, \u03c3\u0001)\n\n\u03bbmin(\u03a3)\n\n\u03bbmin(\u03a3)\n\n|||b\u0398 \u2212 \u0398|||op \u2264 c0\u03ba2(\u03a3)\n\nrlog p\n\n(cid:1)k\n\n.\n\nn\n\n+ \u03d5(Q, \u03c3\u0001)\n\n\u03b1\u2018\n\n4 Simulations\n\nIn this section, we provide simulation results to con\ufb01rm that the scalings predicted by our theory are sharp.\nIn Figure 1, we plot the results of simulations under the additive noise model described in Example 1, using\n\u03a3x = I and \u03a3w = \u03c32\nwI with \u03c3w = 0.2. Panel (a) provides plots of \u20182-error versus the sample size n, for\np \u2208 {128, 256, 512}. For all three choices of dimensions, the error decreases to zero as the sample size n in-\ncreases, showing consistency of the method. If we plot the \u20182-error versus the rescaled sample size n/(k log p),\nas depicted in panel (b), the curves roughly align for different values of p, agreeing with Theorem 1. Panel (c)\nshows analogous curves for VAR data with additive noise, using a driving matrix A with |||A|||op = 0.2.\n\n(a)\n\nFigure 1. Plots of the error kb\u03b2 \u2212 \u03b2\u2217k2 after running projected gradient descent on the non-convex objective, with\nsparsity k \u2248 \u221a\nrescaled sample size\nby Theorem 1, the curves align for different values of p in the rescaled plot.\n\np. 
Plot (a) is an error plot for i.i.d. data with additive noise, and plot (b) shows \u20182-error versus the\nk log p . Plot (c) depicts a similar (rescaled) plot for VAR data with additive noise. As predicted\n\nn\n\n(b)\n\n(c)\n\nWe also veri\ufb01ed the results of Theorem 2 empirically. Figure 2 shows the results of applying projected gradient\ndescent to solve the optimization problem (4) in the cases of additive noise and missing data. We \ufb01rst applied\n\nprojected gradient to obtain an initial estimate b\u03b2, then reapplied projected gradient descent 10 times, tracking\nthe optimization error k\u03b2t \u2212b\u03b2k2 (in blue) and statistical error k\u03b2t \u2212 \u03b2\u2217k2 (in red). As predicted by Theorem 2,\n\nthe iterates exhibit geometric convergence to roughly the same \ufb01xed point, regardless of starting point.\nFinally, we simulated the inverse covariance matrix estimation algorithm on three types of graphical models:\n\n(a) Chain-structured graphs. In this case, all nodes of are arranged in a line. The diagonal entries of \u0398\n\nequal 1, and entries corresponding to links in the chain equal 0.1. Then \u0398 is rescaled so |||\u0398|||op = 1.\nIn this case, all nodes are connected to a central node, which has degree\nk \u2248 0.1p. All other nodes have degree 1. The diagonal entries of \u0398 are set equal to 1, and all entries\ncorresponding to edges in the graph are set equal to 0.1. Then \u0398 is rescaled so |||\u0398|||op = 1.\n\n(b) Star-structured graphs.\n\n(c) Erd\u00a8os-Renyi graphs. As in Rothman et al. [22], we \ufb01rst generate a matrix B with diagonal entries 0,\nand all other entries independently equal to 0.5 with probability k/p, and 0 otherwise. 
Then \u03b4 is chosen\nso \u0398 = B + \u03b4I has condition number p, and \u0398 is rescaled so |||\u0398|||op = 1.\n\n7\n\n0500100015002000250030000.080.10.120.140.160.180.20.220.240.260.28nl2 norm errorAdditive noise p=128p=256p=51224681012141618200.080.10.120.140.160.180.20.220.240.260.28n/(k log p)l2 norm errorAdditive noise p=128p=256p=512024681012141618200.050.10.150.20.250.30.350.40.450.50.55n/(k log p)l2 norm errorAdditive noise with autoregressive data p=128p=256p=512\f(a)\n\nFigure 2. Plots of the optimization error log(k\u03b2t \u2212 b\u03b2k2) and statistical error log(k\u03b2t \u2212 \u03b2\u2217k2) versus iteration\n\nnumber t, generated by running projected gradient descent on the non-convex objective. As predicted by Theorem 2,\nthe optimization error decreases geometrically.\n\n(b)\n\n|||b\u0398\u2212\u0398|||op plotted against the sample size n for a chain-structured graph, with panel (a) showing the\n\nAfter generating the matrix X of n i.i.d. samples from the appropriate graphical model, with covariance matrix\n\u03a3x = \u0398\u22121, we generated the corrupted matrix Z = X + W with \u03a3w = (0.2)2I. Figure 3 shows the rescaled\n\u20182-error 1\u221a\noriginal plot and panel (b) plotting against the rescaled sample size. We obtained qualitatively similar results\nfor the star and Erd\u00a8os-Renyi graphs, in the presence of missing and/or dependent data.\n\nk\n\n(a) \u20182 error plot for chain graph, additive noise\n\nk\n\nFigure 3. (a) Plots of the rescaled error 1\u221a\nGaussian graphical model with additive noise. As predicted by Theorems 1 and 2, all curves align when the rescaled\nerror is plotted against the ratio\n\nk log p , as shown in (b). 
5 Discussion

In this paper, we formulated an ℓ1-constrained minimization problem for sparse linear regression on corrupted data. The source of corruption may be additive noise or missing data, and although the resulting objective is not generally convex, we showed that projected gradient descent is guaranteed to converge to a point within statistical precision of the optimum. In addition, we established ℓ1- and ℓ2-error bounds that hold with high probability when the data are drawn i.i.d. from a sub-Gaussian distribution, or drawn from a Gaussian VAR process. Finally, we used our results on linear regression to perform sparse inverse covariance estimation for a Gaussian graphical model, based on corrupted data. The bounds we obtain for the spectral norm of the error are of the same order as existing bounds for inverse covariance matrix estimation with uncorrupted, i.i.d. data.

Acknowledgments

PL acknowledges support from a Hertz Foundation Fellowship and an NDSEG Fellowship; MJW and PL were also partially supported by grants NSF-DMS-0907632 and AFOSR-09NL184.
The authors thank Alekh Agarwal, Sahand Negahban, and John Duchi for discussions and guidance.

References

[1] R. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.

[2] J. T. Hwang. Multiplicative errors-in-variables models with applications to recent data released by the U.S. Department of Energy. Journal of the American Statistical Association, 81(395):680–688, 1986.

[3] R. J. Carroll, D. Ruppert, and L. A. Stefanski. Measurement Error in Nonlinear Models. Chapman and Hall, 1995.

[4] S. J. Iturria, R. J. Carroll, and D. Firth. Polynomial regression and estimating functions in the presence of multiplicative measurement error. Journal of the Royal Statistical Society, Series B, 61:547–561, 1999.

[5] Q. Xu and J. You. Covariate selection for linear errors-in-variables regression models. Communications in Statistics - Theory and Methods, 36(2):375–386, 2007.

[6] N. Städler and P. Bühlmann. Missing values: Sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, pages 1–17, 2010.

[7] M. Rosenbaum and A. B. Tsybakov. Sparse recovery under matrix uncertainty. Annals of Statistics, 38:2620–2651, 2010.

[8] P. Loh and M. J. Wainwright.
High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Technical report, UC Berkeley, September 2011. Available at http://arxiv.org/abs/1109.3714.

[9] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[10] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[11] S. van de Geer. The deterministic Lasso. In Proceedings of Joint Statistical Meeting, 2007.

[12] P. J. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

[13] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

[14] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[15] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Technical report, UC Berkeley, April 2011. Available at http://arxiv.org/abs/1104.4824.

[16] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272–279, 2008.

[17] C. H. Zhang and J. Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008.

[18] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.

[19] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu.
A unified framework for the analysis of regularized M-estimators. In Advances in Neural Information Processing Systems, 2009.

[20] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.

[21] M. Yuan. High-dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261–2286, August 2010.

[22] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.