{"title": "Dirty Statistical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 611, "page_last": 619, "abstract": "We provide a unified framework for the high-dimensional analysis of \u201csuperposition-structured\u201d or \u201cdirty\u201d statistical models: where the model parameters are a \u201csuperposition\u201d of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of $M$-estimators that minimize the sum of any loss function, and an instance of what we call a \u201chybrid\u201d regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures.", "full_text": "Dirty Statistical Models\n\nEunho Yang\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\neunho@cs.utexas.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nAbstract\n\nWe provide a uni\ufb01ed framework for\nthe high-dimensional analysis of\n\u201csuperposition-structured\u201d or \u201cdirty\u201d statistical models: where the model param-\neters are a superposition of structurally constrained parameters. We allow for any\nnumber and types of structures, and any statistical model. We consider the gen-\neral class of M-estimators that minimize the sum of any loss function, and an\ninstance of what we call a \u201chybrid\u201d regularization, that is the in\ufb01mal convolution\nof weighted regularization functions, one for each structural component. 
We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures.

1 Introduction

High-dimensional statistical models have been the subject of considerable focus over the past decade, both theoretically as well as in practice. In these high-dimensional models, the ambient dimension of the problem p may be of the same order as, or even substantially larger than, the sample size n. It has now become well understood that even in this type of high-dimensional $p \gg n$ scaling, it is possible to obtain statistically consistent estimators provided one imposes structural constraints on the statistical models. Examples of such structural constraints include sparsity constraints (e.g. compressed sensing), graph structure (for graphical model estimation), low-rank structure (for matrix-structured problems), and sparse additive structure (for non-parametric models), among others. For each of these structural constraints, a large body of work has proposed and analyzed statistically consistent estimators. For instance, a key subclass leverages such structural constraints via specific regularization functions. Examples include $\ell_1$-regularization for sparse models, nuclear norm regularization for low-rank matrix-structured models, and so on.
A caveat to this strong line of work is that imposing such "clean" structural constraints as sparsity or low-rank structure is typically too stringent for real-world messy data. What if the parameters are not exactly sparse, or not exactly low rank? Indeed, over the last couple of years, there has been an emerging line of work that addresses this caveat by "mixing and matching" different structures. Chandrasekaran et al.
[5] consider the problem of recovering an unknown low-rank and an unknown sparse matrix, given the sum of the two matrices; for this problem they point to applications in system identification in linear time-invariant systems, and optical imaging systems, among others. Chandrasekaran et al. [6] also apply this matrix-decomposition estimation to the learning of latent-variable Gaussian graphical models, where they estimate an inverse covariance matrix that is the sum of sparse and low-rank matrices. A number of papers have applied such decomposition estimation to robust principal component analysis: Candès et al. [3] learn a covariance matrix that is the sum of a low-rank factored matrix and a sparse "error/outlier" matrix, while [9, 15] learn a covariance matrix that is the sum of a low-rank matrix and a column-sparse error matrix. Hsu et al. [7] analyze this estimation of a sum of a low-rank and an elementwise sparse matrix in the noisy setting; while Agarwal et al. [1] extend this to the sum of a low-rank matrix and a matrix with general structure. Another application is multi-task learning, where [8] learn a multiple-linear-regression coefficient matrix that is the sum of a sparse and a block-sparse matrix. This strong line of work can be seen to follow the recipe of estimating a superposition of two structures; and indeed the results show that this simple extension provides a vast increase in the practical applicability of structurally constrained models.
The statistical guarantees in these papers for the corresponding M-estimators typically require fairly extensive technical arguments that extend the analyses of specific single-structured regularized estimators in highly non-trivial ways.
This long line of work on M-estimators and analyses for specific pairs of superposition structures and specific statistical models leads to the question: is there a unified framework for studying any general tuple (i.e. not just a pair) of structures, for any general statistical model? This is precisely the focus of this paper: we provide a unified framework of "superposition-structured" or "dirty" statistical models, with any number and any types of structures, for any statistical model. By such "superposition-structure," we mean the constraint that the parameter be a superposition of "clean" structurally constrained parameters. Beyond unifying the burgeoning list of works above, and providing guarantees for many novel superpositions (for instance, of more than two structures) not yet considered in the literature, another key motivation is to provide insight into the key ingredients characterizing the statistical guarantees for such dirty statistical models. Our unified analysis allows the following very general class of M-estimators, which minimize the sum of any loss function and an instance of what we call a "hybrid" regularization function, that is the infimal convolution of any weighted regularization functions, one for each structural component. As we show, this is equivalent to an M-estimator that is the sum of (a) a loss function applied to the sum of multiple parameter vectors, one corresponding to each structural component; and (b) a weighted sum of regularization functions, one for each of the parameter vectors.
We stress that our analysis allows for general loss functions, and general component regularization functions. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures.

2 Problem Setup

We consider the following general statistical modeling setting. Consider a random variable Z with distribution P, and suppose we are given n observations $Z_1^n := \{Z_1, \ldots, Z_n\}$ drawn i.i.d. from P. We are interested in estimating some parameter $\theta^* \in \mathbb{R}^p$ of the distribution P. We assume that the statistical model parameter $\theta^*$ is "superposition-structured," so that it is the sum of parameter components, each of which is constrained by a specific structure. For a formalization of the notion of structure, we first review some terminology from [11]. There, they use subspace pairs $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$, where $\mathcal{M} \subseteq \bar{\mathcal{M}}$, to capture any structured parameter. $\mathcal{M}$ is the model subspace that captures the constraints imposed on the model parameter, and is typically low-dimensional. $\bar{\mathcal{M}}^\perp$ is the perturbation subspace of parameters that represents perturbations away from the model subspace. They also define the property of decomposability of a regularization function, which captures the suitability of a regularization function $\mathcal{R}$ to a particular structure. Specifically, a regularization function $\mathcal{R}$ is said to be decomposable with respect to a subspace pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$, if

$\mathcal{R}(u + v) = \mathcal{R}(u) + \mathcal{R}(v)$, for all $u \in \mathcal{M}$, $v \in \bar{\mathcal{M}}^\perp$.

For any structure such as sparsity, low rank, etc., we can define the corresponding low-dimensional model subspaces, as well as regularization functions that are decomposable with respect to the corresponding subspace pairs.
I. Sparse vectors. Given any subset $S \subseteq \{1, \ldots, p\}$ of the coordinates, let $\mathcal{M}(S)$ be the subspace of vectors in $\mathbb{R}^p$ that have support contained in S. It can be seen that any parameter $\theta \in \mathcal{M}(S)$ would be at most $|S|$-sparse. For this case, we use $\bar{\mathcal{M}}(S) = \mathcal{M}(S)$, so that $\bar{\mathcal{M}}^\perp(S) = \mathcal{M}^\perp(S)$. As shown in [11], the $\ell_1$ norm $\mathcal{R}(\theta) = \|\theta\|_1$, commonly used as a sparsity-encouraging regularization function, is decomposable with respect to the subspace pairs $(\mathcal{M}(S), \mathcal{M}^\perp(S))$.
II. Low-rank matrices. Consider the class of matrices $\Theta \in \mathbb{R}^{k \times m}$ that have rank $r \le \min\{k, m\}$. For any given matrix $\Theta$, we let $\mathrm{row}(\Theta) \subseteq \mathbb{R}^m$ and $\mathrm{col}(\Theta) \subseteq \mathbb{R}^k$ denote its row space and column space respectively. For a given pair of r-dimensional subspaces $U \subseteq \mathbb{R}^k$ and $V \subseteq \mathbb{R}^m$, we define the subspace pairs as follows: $\mathcal{M}(U, V) := \{\Theta \in \mathbb{R}^{k \times m} \mid \mathrm{row}(\Theta) \subseteq V,\ \mathrm{col}(\Theta) \subseteq U\}$ and $\bar{\mathcal{M}}^\perp(U, V) := \{\Theta \in \mathbb{R}^{k \times m} \mid \mathrm{row}(\Theta) \subseteq V^\perp,\ \mathrm{col}(\Theta) \subseteq U^\perp\}$. As [11] show, the nuclear norm $\mathcal{R}(\Theta) = |||\Theta|||_1$ is decomposable with respect to the subspace pairs $(\mathcal{M}(U, V), \bar{\mathcal{M}}^\perp(U, V))$.
In our dirty statistical model setting, we do not have just one, but a set of structures; suppose we index them by the set I. Our key structural constraint can then be stated as: $\theta^* = \sum_{\alpha \in I} \theta^*_\alpha$, where $\theta^*_\alpha$ is a "clean" structured parameter with respect to a subspace pair $(\mathcal{M}_\alpha, \bar{\mathcal{M}}^\perp_\alpha)$, for $\mathcal{M}_\alpha \subseteq \bar{\mathcal{M}}_\alpha$. We also assume we are given a set of regularization functions $\mathcal{R}_\alpha(\cdot)$, for $\alpha \in I$, that are suited to the respective structures, in the sense that they are decomposable with respect to the subspace pairs $(\mathcal{M}_\alpha, \bar{\mathcal{M}}^\perp_\alpha)$.
Let $\mathcal{L} : \Omega \times \mathcal{Z}^n \mapsto \mathbb{R}$ be some loss function that assigns a cost to any parameter $\theta \in \Omega \subseteq \mathbb{R}^p$, for a given set of observations $Z_1^n$. For ease of notation, in the sequel, we adopt the shorthand $\mathcal{L}(\theta)$ for $\mathcal{L}(\theta; Z_1^n)$.
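The decomposability property above is easy to verify numerically for the $\ell_1$ norm. The following sketch (our own illustration in Python/NumPy, not from the paper) checks $\mathcal{R}(u + v) = \mathcal{R}(u) + \mathcal{R}(v)$ for a small support set:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10
S = [0, 3, 7]                             # support set S
Sc = [i for i in range(p) if i not in S]  # complement of S

u = np.zeros(p)
u[S] = rng.normal(size=len(S))            # u lies in the model subspace M(S)
v = np.zeros(p)
v[Sc] = rng.normal(size=len(Sc))          # v lies in the orthogonal complement of M(S)

# Decomposability of the l1 norm: R(u + v) = R(u) + R(v)
assert np.isclose(np.abs(u + v).sum(), np.abs(u).sum() + np.abs(v).sum())
```

The same check fails for, say, the $\ell_2$ norm, which is not decomposable with respect to these subspace pairs.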
We are interested in the following "superposition" estimator:

$\min_{(\theta_\alpha)_{\alpha \in I}} \mathcal{L}\Big(\sum_{\alpha \in I} \theta_\alpha\Big) + \sum_{\alpha \in I} \lambda_\alpha \mathcal{R}_\alpha(\theta_\alpha)$,   (1)

where $(\lambda_\alpha)_{\alpha \in I}$ are the regularization penalties. This optimization problem involves not just one parameter vector, but multiple parameter vectors, one for each structural component: while the loss function applies only to the sum of these, separate regularization functions are applied to the corresponding parameter vectors. We will now see that this can be rewritten as a standard M-estimation problem which minimizes, over a single parameter vector, the sum of a loss function and a special "dirty" regularization function.
Given a vector $c := (c_\alpha)_{\alpha \in I}$ of convex-combination weights, suppose we define the following "dirty" regularization function, that is the infimal convolution of a set of regularization functions:

$\mathcal{R}(\theta; c) = \inf\Big\{\sum_{\alpha \in I} c_\alpha \mathcal{R}_\alpha(\theta_\alpha) : \sum_{\alpha \in I} \theta_\alpha = \theta\Big\}$.   (2)

It can be shown that provided the individual regularization functions $\mathcal{R}_\alpha(\cdot)$, for $\alpha \in I$, are norms, $\mathcal{R}(\cdot; c)$ is a norm as well. We discuss this and other properties of this hybrid regularization function $\mathcal{R}(\cdot; c)$ in Appendix A.
Proposition 1. Suppose $(\hat{\theta}_\alpha)_{\alpha \in I}$ is the solution to the M-estimation problem in (1). Then $\hat{\theta} := \sum_{\alpha \in I} \hat{\theta}_\alpha$ is the solution to the following problem:

$\min_{\theta \in \Omega} \mathcal{L}(\theta) + \lambda \mathcal{R}(\theta; c)$,   (3)

where $c_\alpha = \lambda_\alpha/\lambda$. Similarly, if $\hat{\theta}$ is the solution to (3), then there is a solution $(\hat{\theta}_\alpha)_{\alpha \in I}$ to the M-estimation problem (1), such that $\hat{\theta} = \sum_{\alpha \in I} \hat{\theta}_\alpha$.

Proposition 1 shows that the optimization problems (1) and (3) are equivalent. While the tuning parameters in (1) correspond to the regularization penalties $(\lambda_\alpha)_{\alpha \in I}$, the tuning parameters in (3) correspond to the weights $(c_\alpha)_{\alpha \in I}$ specifying the "dirty" regularization function. In our unified analysis theorem, we will provide guidance on setting these tuning parameters as a function of various model parameters.

3 Error Bounds for Convex M-estimators

Our goal is to provide error bounds $\|\hat{\theta} - \theta^*\|$ between the target parameter $\theta^*$, the minimizer of the population risk, and our M-estimate $\hat{\theta}$ from (1), for any error norm $\|\cdot\|$. A common example of an error norm is the $\ell_2$ norm $\|\cdot\|_2$. We now turn to the properties of the loss function and regularization function that underlie our analysis. We first restate some natural assumptions on the loss and regularization functions.
(C1) The loss function $\mathcal{L}$ is convex and differentiable.
(C2) The regularizers $\mathcal{R}_\alpha$ are norms, and are decomposable with respect to the subspace pairs $(\mathcal{M}_\alpha, \bar{\mathcal{M}}^\perp_\alpha)$, where $\mathcal{M}_\alpha \subseteq \bar{\mathcal{M}}_\alpha$.
Our next assumption is a restricted strong convexity assumption [11]. Specifically, we will require the loss function $\mathcal{L}$ to satisfy:
(C3) (Restricted Strong Convexity) For all $\Delta_\alpha \in \Omega_\alpha$, where $\Omega_\alpha$ is the parameter space for the parameter component $\alpha$,

$\delta\mathcal{L}(\Delta_\alpha; \theta^*) := \mathcal{L}(\theta^* + \Delta_\alpha) - \mathcal{L}(\theta^*) - \langle \nabla_\theta \mathcal{L}(\theta^*), \Delta_\alpha \rangle \ \ge\ \kappa_{\mathcal{L}} \|\Delta_\alpha\|^2 - g_\alpha \mathcal{R}^2_\alpha(\Delta_\alpha)$,

where $\kappa_{\mathcal{L}}$ is a "curvature" parameter, and $g_\alpha \mathcal{R}^2_\alpha(\Delta_\alpha)$ is a "tolerance" parameter.

Note that these conditions (C1)-(C3) are imposed even when the model has a single clean structural constraint; see [11].
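As a concrete illustration of an estimator of the form (1), the sketch below (our own, not from the paper; function names and tuning values are hypothetical) solves a two-structure instance, a sparse plus low-rank matrix decomposition under the squared loss, by alternating exact proximal updates over the two components:

```python
import numpy as np

def soft_threshold(A, t):
    # prox of t * (entrywise l1 norm)
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def svd_soft_threshold(A, t):
    # prox of t * (nuclear norm): soft-threshold the singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def dirty_decomposition(Y, lam_sparse, lam_lowrank, n_iter=100):
    """Alternating minimization for an instance of (1) with two components:
    min_{S, L} 0.5 * ||Y - S - L||_F^2 + lam_sparse * ||S||_1 + lam_lowrank * ||L||_*
    """
    S = np.zeros_like(Y)
    L = np.zeros_like(Y)
    for _ in range(n_iter):
        S = soft_threshold(Y - L, lam_sparse)        # exact minimization over S
        L = svd_soft_threshold(Y - S, lam_lowrank)   # exact minimization over L
    return S, L
```

Each block update solves its subproblem exactly, so the objective is non-increasing; the sum S + L plays the role of the single parameter $\hat{\theta}$ in the equivalent formulation (3).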
Note that $g_\alpha$ is usually a function of the problem size that decreases in the sample size; in the standard Lasso with $|I| = 1$, for instance, $g_\alpha = \frac{\log p}{n}$.
Our next assumption is on the interaction between the different structured components.
(C4) (Structural Incoherence) For all $\Delta_\alpha \in \Omega_\alpha$,

$\mathcal{L}\Big(\theta^* + \sum_{\alpha \in I} \Delta_\alpha\Big) + (|I| - 1)\mathcal{L}(\theta^*) - \sum_{\alpha \in I} \mathcal{L}(\theta^* + \Delta_\alpha) \ \le\ \frac{\kappa_{\mathcal{L}}}{2} \sum_{\alpha \in I} \|\Delta_\alpha\|^2 + \sum_{\alpha \in I} h_\alpha \mathcal{R}^2_\alpha(\Delta_\alpha)$.

Note that for a model with a single clean structural constraint, with $|I| = 1$, the condition (C4) is trivially satisfied since the LHS becomes 0. We will see in the sequel that for a large collection of loss functions, including all linear loss functions, the condition (C4) simplifies considerably, and moreover holds with high probability, typically with $h_\alpha = 0$. We note that this condition is much weaker than the "incoherence" conditions typically imposed when analyzing specific instances of such superposition-structured models (see e.g. references in the introduction), where the assumptions typically include (a) assuming that the structured subspaces $(\bar{\mathcal{M}}_\alpha)_{\alpha \in I}$ intersect only at $\{0\}$, and (b) that the sizes of these subspaces are extremely small.
Finally, we will use the notion of subspace compatibility constant defined in [11], which captures the relationship between the regularization function $\mathcal{R}(\cdot)$ and the error norm $\|\cdot\|$, over vectors in the subspace $\mathcal{M}$: $\Psi(\mathcal{M}, \|\cdot\|) := \sup_{u \in \mathcal{M} \setminus \{0\}} \frac{\mathcal{R}(u)}{\|u\|}$.
Theorem 1. Suppose we solve the M-estimation problem in (3), with hybrid regularization $\mathcal{R}(\cdot; c)$, where the convex-combination weights c are set as $c_\alpha = \lambda_\alpha / \sum_{\beta \in I} \lambda_\beta$, with $\lambda_\alpha \ge 2\mathcal{R}^*_\alpha\big(\nabla_{\theta_\alpha} \mathcal{L}(\theta^*; Z_1^n)\big)$. Further, suppose conditions (C1)-(C4) are satisfied.
Then, the parameter error bounds are given as:

$\|\hat{\theta} - \theta^*\| \ \le\ \frac{3|I|}{2\bar{\kappa}} \max_{\alpha \in I} \lambda_\alpha \Psi_\alpha(\bar{\mathcal{M}}_\alpha) + \frac{|I|\sqrt{\tau_{\mathcal{L}}}}{\sqrt{\bar{\kappa}}}$,

where $\bar{g} := \max_{\alpha \in I} \sqrt{g_\alpha + h_\alpha}$, $\bar{\kappa} := \kappa_{\mathcal{L}} - 32\,\bar{g}^2 |I| \big(\max_{\alpha \in I} \Psi_\alpha(\bar{\mathcal{M}}_\alpha)\big)^2$, and $\tau_{\mathcal{L}} := \sum_{\alpha \in I} \Big[32\,\bar{g}^2\, \mathcal{R}^2_\alpha\big(\Pi_{\bar{\mathcal{M}}^\perp_\alpha}(\theta^*_\alpha)\big) + \frac{2\lambda_\alpha}{|I|}\, \mathcal{R}_\alpha\big(\Pi_{\bar{\mathcal{M}}^\perp_\alpha}(\theta^*_\alpha)\big)\Big]$.

Remarks: (R1) It is instructive to compare Theorem 1 to the main theorem in [11], where they derive parameter error bounds for any M-estimator with a decomposable regularizer, for any "clean" structure. Our theorem can be viewed as a generalization: we recover their theorem when we have a single structure with $|I| = 1$. We cannot derive our result in turn from their theorem applied to the M-estimator (3) with the hybrid regularization function $\mathcal{R}(\cdot; c)$: the "superposition" structure is not captured by a pair of subspaces, nor is the hybrid regularization function decomposable, as is required by their theorem. Our setting as well as our analysis is strictly more general, which is why we needed the additional structural incoherence assumption (C4) (which is trivially satisfied when $|I| = 1$).
(R2) Agarwal et al. [1] provide Frobenius norm error bounds for the matrix-decomposition problem of recovering the sum of a low-rank and a general structured matrix. In addition to the greater generality of our theorem and framework, Theorem 1 addresses two key drawbacks of their theorem even in their specific setting. First, the proof of their theorem requires the regularization penalty for the second structure to be bounded strongly away from zero: their convergence rate does not approach zero even with an infinite number of samples n.
Theorem 1, in contrast, imposes the weaker condition $\lambda_\alpha \ge 2\mathcal{R}^*_\alpha\big(\nabla_{\theta_\alpha} \mathcal{L}(\theta^*; Z_1^n)\big)$, which, as we show in the corollaries, allows the convergence rates to go to zero as a function of the number of samples. Second, they assumed much stronger conditions for their theorem to hold; in Theorem 1, in contrast, we pose much milder "local" RSC conditions (C3), and a structural incoherence condition (C4).
(R3) The statement in the theorem is deterministic for fixed choices of $(\lambda_\alpha)$. We also note that the theorem holds for any set of subspace pairs $(\mathcal{M}_\alpha, \bar{\mathcal{M}}^\perp_\alpha)_{\alpha \in I}$ with respect to which the corresponding regularizers are decomposable. As noted earlier, $\mathcal{M}_\alpha$ should ideally be set to the structured subspace in which the true parameter at least approximately lies, and which we want to be as small as possible (note that the bound includes a term that depends on the size of this subspace via the subspace compatibility constant). In particular, if we assume that the subspaces are chosen so that $\mathcal{R}_\alpha\big(\Pi_{\bar{\mathcal{M}}^\perp_\alpha}(\theta^*_\alpha)\big) = 0$, i.e. $\theta^*_\alpha \in \bar{\mathcal{M}}_\alpha$, then we obtain the simpler bound in the following corollary.
Corollary 1. Suppose we solve the M-estimation problem in (1), with hybrid regularization $\mathcal{R}(\cdot; c)$, where the convex-combination weights c are set as $c_\alpha = \lambda_\alpha / \sum_{\beta \in I} \lambda_\beta$, with $\lambda_\alpha \ge 2\mathcal{R}^*_\alpha\big(\nabla_{\theta_\alpha} \mathcal{L}(\theta^*; Z_1^n)\big)$, and suppose conditions (C1)-(C4) are satisfied. Further, suppose that the subspace pairs are chosen so that $\theta^*_\alpha \in \bar{\mathcal{M}}_\alpha$. Then, the parameter error bounds are given as:

$\|\hat{\theta} - \theta^*\| \ \le\ \frac{3|I|}{2\bar{\kappa}} \max_{\alpha \in I} \lambda_\alpha \Psi_\alpha(\bar{\mathcal{M}}_\alpha)$.

It is now instructive to compare the bounds of Theorem 1 and Corollary 1. Theorem 1 has two terms, the first of which is the sole term in the bound in Corollary 1.
This first term can be thought of as the "estimation error" component of the error bound, which remains when the parameter has exactly the structure being modeled by the regularizers. The second term can be thought of as the "approximation error" component of the error bound, which is the penalty for the parameter not exactly lying in the structured subspaces modeled by the regularizers. The key term in the "estimation error" component, in Theorem 1 and Corollary 1, is:

$\max_{\alpha \in I} \lambda_\alpha \Psi_\alpha(\bar{\mathcal{M}}_\alpha)$.

Note that each $\lambda_\alpha$ is larger than a particular norm of the sample score function (the gradient of the loss at the true parameter): since the expected value of the score function is zero, the magnitude of the sample score function captures the amount of "noise" in the data. This is in turn scaled by $\Psi_\alpha(\bar{\mathcal{M}}_\alpha)$, which captures the size of the structured subspace corresponding to the parameter component $\theta^*_\alpha$. This key term can thus be thought of as capturing the amount of noise in the data relative to the particular structure at hand.
We now provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures.

4 Convergence Rates for Linear Regression

In this section, we consider the linear regression model:

$Y = X\theta^* + w$,   (4)

where $Y \in \mathbb{R}^n$ is the observation vector, and $\theta^* \in \mathbb{R}^p$ is the true parameter. $X \in \mathbb{R}^{n \times p}$ is the "observation" matrix, while $w \in \mathbb{R}^n$ is the observation noise. For this class of statistical models, we will consider the instantiation of (1) with the loss function $\mathcal{L}$ consisting of the squared loss:

$\min_{(\theta_\alpha)_{\alpha \in I}} \Big\{ \frac{1}{n}\Big\|Y - X\sum_{\alpha \in I} \theta_\alpha\Big\|_2^2 + \sum_{\alpha \in I} \lambda_\alpha \mathcal{R}_\alpha(\theta_\alpha) \Big\}$.   (5)

For this regularized least squares estimator (5), conditions (C1)-(C2) in Theorem 1 trivially hold. The restricted strong convexity condition (C3) reduces to the following. Noting that $\mathcal{L}(\theta^* + \Delta_\alpha) - \mathcal{L}(\theta^*) - \langle \nabla_\theta \mathcal{L}(\theta^*), \Delta_\alpha \rangle = \frac{1}{n}\|X\Delta_\alpha\|_2^2$, we obtain the following restricted eigenvalue condition:

(D3) $\frac{1}{n}\|X\Delta_\alpha\|_2^2 \ \ge\ \kappa_{\mathcal{L}} \|\Delta_\alpha\|^2 - g_\alpha \mathcal{R}^2_\alpha(\Delta_\alpha)$ for all $\Delta_\alpha \in \Omega_\alpha$.

Finally, our structural incoherence condition reduces to the following: noting that $\mathcal{L}(\theta^* + [...]$, where $\langle\langle A, B \rangle\rangle := \sum_i \sum_j A_{ij} B_{ij}$.
As in the previous linear regression example, we again impose the assumption (C-Linear) on the population covariance matrix of a $\Sigma$-Gaussian ensemble, but in this case with the notational change of $P_{\bar{\mathcal{M}}_\alpha}$ denoting the matrix corresponding to the projection operator onto the row spaces of matrices in $\bar{\mathcal{M}}_\alpha$. Thus, with the low-rank matrix structure discussed in Section 2, we would have $P_{\bar{\mathcal{M}}_\alpha} = UU^\top$. Under the (C-Linear) assumption, the following proposition then extends Proposition 2:
Proposition 3. Consider the problem (8) with the matrix parameter $\Theta$. Under the same assumptions as in Proposition 2, we have, with probability at least $1 - \frac{c_1}{\max\{n, p\}}$,

$\frac{2}{n}\sum_{\alpha < \beta} \langle\langle X\Delta_\alpha, X\Delta_\beta \rangle\rangle \ \le\ \frac{\kappa_{\mathcal{L}}}{4} \sum_\alpha |||\Delta_\alpha|||_F^2$.

Consider an instance of this multiple linear regression model with the superposition structure consisting of row-sparse, column-sparse and elementwise sparse matrices: $\Theta^* = \Theta^*_r + \Theta^*_c + \Theta^*_s$.
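One simple way to compute such a superposition estimate is proximal gradient descent: all components share the same gradient of the squared loss, and each component is then passed through its own proximal operator. The sketch below is our own illustration (not the paper's algorithm), using a row-sparse plus elementwise-sparse pair with $\ell_2$ row norms (i.e. a = 2) and omitting the column-sparse component for brevity:

```python
import numpy as np

def soft_threshold(A, t):
    # prox of t * (entrywise l1 norm)
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def row_soft_threshold(A, t):
    # prox of t * (sum of l2 norms of the rows), i.e. group soft-thresholding
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def dirty_regression(X, Y, lam_row, lam_sparse, n_iter=500):
    """Proximal gradient descent for
    min_{Tr, Ts} (1/n) * ||Y - X(Tr + Ts)||_F^2
                 + lam_row * ||Tr||_{r,2} + lam_sparse * ||Ts||_1
    """
    n, p = X.shape
    m = Y.shape[1]
    Tr = np.zeros((p, m))
    Ts = np.zeros((p, m))
    # Lipschitz constant of the joint gradient is 4 * sigma_max(X)^2 / n
    step = n / (4.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        G = (2.0 / n) * X.T @ (X @ (Tr + Ts) - Y)  # gradient, shared by both blocks
        Tr = row_soft_threshold(Tr - step * G, step * lam_row)
        Ts = soft_threshold(Ts - step * G, step * lam_sparse)
    return Tr, Ts
```

Because the nonsmooth penalty is separable across the two components, the joint proximal step decomposes into the two blockwise thresholding operations shown above.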
In order to obtain estimators for this model, we use the hybrid regularization function $\sum_{\alpha \in I} \lambda_\alpha \mathcal{R}_\alpha(\theta_\alpha) = \lambda_r \|\Theta_r\|_{r,a} + \lambda_c \|\Theta_c\|_{c,a} + \lambda_s \|\Theta_s\|_1$, where $\|\cdot\|_{r,a}$ denotes the sum of the $\ell_a$ norms of the rows for $a \ge 2$, $\|\cdot\|_{c,a}$ similarly denotes the sum of the $\ell_a$ norms of the columns, and $\|\cdot\|_1$ is the entrywise $\ell_1$ norm of a matrix.
Corollary 3. Consider the multiple linear regression model (7) where $\Theta^*$ is the sum of $\Theta^*_r$ with $s_r$ nonzero rows, $\Theta^*_c$ with $s_c$ nonzero columns, and $\Theta^*_s$ with s nonzero elements. Suppose that the design matrix X is a $\Sigma$-Gaussian ensemble with the properties of column normalization and $\sigma_{\max}(X) \le \sqrt{n}$. Further, suppose that (6) holds and W is elementwise sub-Gaussian with parameter $\sigma$. Then, if we solve (8) with

$\lambda_r = 8\Big\{\frac{m^{1-1/a}}{\sqrt{n}} + \sqrt{\frac{\log p}{n}}\Big\}$, $\lambda_c = 8\Big\{\frac{p^{1-1/a}}{\sqrt{n}} + \sqrt{\frac{\log m}{n}}\Big\}$, and $\lambda_s = 8\sqrt{\frac{\log p + \log m}{n}}$,

then, with probability at least $1 - c_1\exp(-c_2 n \lambda_s^2) - \frac{c_3}{p^2} - \frac{c_3}{m^2}$, the error of the estimate $\hat{\Theta}$ is bounded as:

$|||\hat{\Theta} - \Theta^*|||_F \ \le\ \frac{36}{\bar{\kappa}} \max\Big\{ \sqrt{\frac{s(\log p + \log m)}{n}},\ \frac{\sqrt{s_r}\, m^{1-1/a}}{\sqrt{n}} + \sqrt{\frac{s_r \log p}{n}},\ \frac{\sqrt{s_c}\, p^{1-1/a}}{\sqrt{n}} + \sqrt{\frac{s_c \log m}{n}} \Big\}$.

6 Convergence Rates for Principal Component Analysis

In this section, we consider the robust/noisy principal component analysis problem, where we are given n i.i.d. random vectors $Z_i \in \mathbb{R}^p$ with $Z_i = U_i + v_i$. $U_i \sim N(0, \Theta^*)$ is the "uncorrupted" set of observations, with a low-rank covariance matrix $\Theta^* = LL^\top$, for some loading matrix $L \in \mathbb{R}^{p \times r}$. $v_i \in \mathbb{R}^p$ is a noise/error vector; in standard factor analysis, $v_i$ is a spherical Gaussian noise vector: $v_i \sim N(0, \sigma^2 I_{p \times p})$ (or $v_i = 0$); and the goal is to recover the loading matrix L from the samples. In PCA with sparse noise, $v_i \sim N(0, \Gamma^*)$, where $\Gamma^*$ is elementwise sparse.
In this case, the covariance matrix of $Z_i$ has the form $\Sigma = \Theta^* + \Gamma^*$, where $\Theta^*$ is low-rank and $\Gamma^*$ is sparse. We can thus write the sample covariance model as: $Y := \frac{1}{n}\sum_{i=1}^n Z_i Z_i^\top = \Theta^* + \Gamma^* + W$, where $W \in \mathbb{R}^{p \times p}$ is a Wishart-distributed random matrix. For this class of statistical models, we will consider the following instantiation of (1):

$\min_{(\Theta, \Gamma)} |||Y - \Theta - \Gamma|||_F^2 + \lambda_\Theta |||\Theta|||_1 + \lambda_\Gamma \|\Gamma\|_1$,   (9)

where $|||\cdot|||_1$ denotes the nuclear norm while $\|\cdot\|_1$ denotes the elementwise $\ell_1$ norm (we will use $|||\cdot|||_2$ for the spectral norm).
In contrast to the previous two examples, (9) includes a trivial design matrix, $X = I_{p \times p}$, which allows (D4) to hold under the simpler (C-Linear) condition:

$\max\Big\{ \sigma_{\max}\big(P_{\bar{\mathcal{M}}_\Theta} P_{\bar{\mathcal{M}}_\Gamma}\big),\ \sigma_{\max}\big(P_{\bar{\mathcal{M}}_\Theta} P_{\bar{\mathcal{M}}^\perp_\Gamma}\big),\ \sigma_{\max}\big(P_{\bar{\mathcal{M}}^\perp_\Theta} P_{\bar{\mathcal{M}}_\Gamma}\big),\ \sigma_{\max}\big(P_{\bar{\mathcal{M}}^\perp_\Theta} P_{\bar{\mathcal{M}}^\perp_\Gamma}\big) \Big\} \ \le\ \frac{1}{16\Lambda^2}$,   (10)

where $\Lambda$ is $\max_{\{1,2\}}\Big\{2 + \frac{3\lambda_1 \Psi_1(\bar{\mathcal{M}}_1)}{2\lambda_2 \Psi_2(\bar{\mathcal{M}}_2)}\Big\}$.
Corollary 4. Consider the principal component analysis model where $\Theta^*$ has rank at most r and $\Gamma^*$ has s nonzero entries. Suppose that (10) holds. Then, given the choice of

$\lambda_\Theta = 16\sqrt{p}\, |||\Sigma|||_2 \sqrt{\frac{p}{n}}$, $\lambda_\Gamma = 32\rho(\Sigma)\sqrt{\frac{\log p}{n}}$,

where $\rho(\Sigma) = \max_j \Sigma_{jj}$, the optimal error of (9) is bounded by

$|||\hat{\Theta} - \Theta^*|||_F \ \le\ \frac{48}{\bar{\kappa}} \max\Big\{ \sqrt{p}\, |||\Sigma|||_2 \sqrt{\frac{rp}{n}},\ 2\rho(\Sigma)\sqrt{\frac{s \log p}{n}} \Big\}$,

with probability at least $1 - c_1\exp(-c_2 \log p)$.
Remarks. Agarwal et al. [1] also analyze this model, and propose to use the M-estimator in (9), with the additional constraint $\|\Theta\|_\infty \le \alpha/p$. Under a stricter "global" RSC condition, they compute the error bound $|||\hat{\Theta} - \Theta^*|||_F \asymp \max\big\{ \sqrt{p}\,|||\Sigma|||_2 \sqrt{rp/n},\ \rho(\Sigma)\sqrt{s \log p / n} \big\} + \alpha/p$, where $\alpha$ is a parameter between 1 and p. This bound is similar to that in Corollary 4, but with an additional term $\alpha/p$, so that it does not go to zero as a function of n. It also faces a trade-off: a smaller value of $\alpha$ to reduce the error bound would make the assumption on the maximum element of $\Theta^*$ stronger as well. Our corollaries do not suffer these lacunae; see also our remarks in (R2) following Theorem 1. [14] extended the result of [1] to the special case where $\Theta^* = \Theta^*_r + \Theta^*_s$, using the notation of the previous section; the remarks above also apply there. Note that our work and [1] derive Frobenius error bounds under restricted strong convexity conditions; other recent works such as [7] also derive such Frobenius error bounds, but under stronger conditions (see [1] for details).

Acknowledgments
We acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, DMS-1264033.

References
[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Annals of Statistics, 40(2):1171–1197, 2012.
[2] E. J. Candès, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), May 2011.
[4] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. In 48th Annual Allerton Conference on Communication, Control and Computing, 2010.
[5] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2), 2011.
[6] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization.
Annals of Statistics (with discussion), 40(4), 2012.
[7] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Trans. Inform. Theory, 57:7221–7234, 2011.
[8] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In Neur. Info. Proc. Sys. (NIPS), 23, 2010.
[9] M. McCoy and J. A. Tropp. Two proposals for robust PCA using semidefinite programming. Electron. J. Statist., 5:1123–1160, 2011.
[10] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069–1097, 2011.
[11] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[12] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research (JMLR), 99:2241–2259, 2010.
[13] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[14] H. Xu and C. Leng. Robust multi-task regression with grossly corrupted observations. Inter. Conf. on AI and Statistics (AISTATS), 2012.
[15] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
", "award": [], "sourceid": 378, "authors": [{"given_name": "Eunho", "family_name": "Yang", "institution": "UT Austin"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "UT Austin"}]}