{"title": "Robust PCA via Outlier Pursuit", "book": "Advances in Neural Information Processing Systems", "page_first": 2496, "page_last": 2504, "abstract": "Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm, called Outlier Pursuit, that under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the *exact* optimal low-dimensional subspace and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics and financial applications, and beyond. 
Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct *column space* of the uncorrupted matrix, rather than the exact matrix itself.", "full_text": "Robust PCA via Outlier Pursuit\n\nHuan Xu\nElectrical and Computer Engineering\nUniversity of Texas at Austin\nhuan.xu@mail.utexas.edu\n\nConstantine Caramanis\nElectrical and Computer Engineering\nUniversity of Texas at Austin\ncmcaram@ece.utexas.edu\n\nSujay Sanghavi\nElectrical and Computer Engineering\nUniversity of Texas at Austin\nsanghavi@mail.utexas.edu\n\nAbstract\n\nSingular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, that under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace, and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation is of paramount interest in bioinformatics and financial applications, and beyond. 
Our techniques involve matrix decomposition using nuclear norm minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself.\n\n1 Introduction\n\nThis paper is about the following problem: suppose we are given a large data matrix M, and we know it can be decomposed as\n\nM = L0 + C0,\n\nwhere L0 is a low-rank matrix, and C0 is non-zero in only a fraction of the columns. Aside from these broad restrictions, both components are arbitrary. In particular we do not know the rank (or the row/column space) of L0, or the number and positions of the non-zero columns of C0. Can we recover the column space of the low-rank matrix L0, and the identities of the non-zero columns of C0, exactly and efficiently?\n\nWe are primarily motivated by Principal Component Analysis (PCA), arguably the most widely used technique for dimensionality reduction in statistical data analysis. The canonical PCA problem [1] seeks to find the best (in the least-square-error sense) low-dimensional subspace approximation to high-dimensional points. Using the Singular Value Decomposition (SVD), PCA finds the lower-dimensional approximating subspace by forming a low-rank approximation to the data matrix, formed by considering each point as a column; the output of PCA is the (low-dimensional) column space of this low-rank approximation.\n\nIt is well known (e.g., [2–4]) that standard PCA is extremely fragile to the presence of outliers: even a single corrupted point can arbitrarily alter the quality of the approximation. 
Such non-probabilistic or persistent data corruption may stem from sensor failures, malicious tampering, or the simple fact that some of the available data may not conform to the presumed low-dimensional source/model. In terms of the data matrix, this means that most of the column vectors will lie in a low-dimensional space – and hence the corresponding matrix L0 will be low-rank – while the remaining columns will be outliers – corresponding to the column-sparse matrix C0. The natural question in this setting is to ask if we can still (exactly or near-exactly) recover the column space of the uncorrupted points, and the identities of the outliers. This is precisely our problem.\n\nRecent years have seen a lot of work on both robust PCA [3, 5–12], and on the use of convex optimization for recovering low-dimensional structure [4, 13–15]. Our work lies at the intersection of these two fields, but has several significant differences from work in either space. We compare and relate our work to existing literature, and expand on the differences, in Section 3.3.\n\n2 Problem Setup\n\nThe precise PCA with outliers problem that we consider is as follows: we are given n points in p-dimensional space. A fraction 1 − γ of the points lie on an r-dimensional true subspace of the ambient Rp, while the remaining γn points are arbitrarily located – we call these outliers/corrupted points. We do not have any prior information about the true subspace or its dimension r. Given the set of points, we would like to learn (a) the true subspace and (b) the identities of the outliers.\n\nAs is common practice, we collate the points into a p × n data matrix M, each of whose columns is one of the points, and each of whose rows is one of the p coordinates. It is then clear that the data matrix can be decomposed as\n\nM = L0 + C0.\n\nHere L0 is the matrix corresponding to the non-outliers; thus rank(L0) = r. 
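To make the model concrete, here is a small synthetic instance of the decomposition M = L0 + C0 (a sketch of ours, not from the paper; the dimensions, seed, and placement of the corrupted columns are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r, n_out = 30, 50, 3, 5   # illustrative sizes: 5 of 50 columns corrupted

# Low-rank part L0: inlier columns lie in an r-dimensional subspace;
# the columns at the outlier positions are zero.
L0 = rng.standard_normal((p, r)) @ rng.standard_normal((r, n))
L0[:, :n_out] = 0.0

# Column-sparse part C0: arbitrary entries, supported on the outlier columns.
C0 = np.zeros((p, n))
C0[:, :n_out] = rng.standard_normal((p, n_out))

M = L0 + C0
print(np.linalg.matrix_rank(L0))                           # 3
print(int(np.count_nonzero(np.linalg.norm(C0, axis=0))))   # 5
```

Neither the rank r nor the outlier support is revealed to the recovery algorithm; only M is observed.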
Consider its Singular Value Decomposition (SVD)\n\nL0 = U0Σ0V0⊤.    (1)\n\nThus it is clear that the columns of U0 form an orthonormal basis for the r-dimensional true subspace. Also note that at most (1 − γ)n of the columns of L0 are non-zero (the rest correspond to the outliers). C0 is the matrix corresponding to the outliers; we will denote the set of non-zero columns of C0 by I0, with |I0| = γn. These non-zero columns are completely arbitrary.\n\nWith this notation, our intent is to exactly recover the column space of L0, and the set of outliers I0. Clearly, this is not always going to be possible (regardless of the algorithm used) and thus we need to impose a few additional assumptions. We develop these in Section 2.1 below.\n\nWe are also interested in the noisy case, where\n\nM = L0 + C0 + N,\n\nand N corresponds to any additional noise. In this case we are interested in approximate identification of both the true subspace and the outliers.\n\n2.1 Incoherence: When does exact recovery make sense?\n\nIn general, our objective of splitting a low-rank matrix from a column-sparse one is not always a well defined one. As an extreme example, consider the case where the data matrix M is non-zero in only one column. Such a matrix is both low-rank and column-sparse, thus the problem is unidentifiable. To make the problem meaningful, we need to impose that the low-rank matrix L0 cannot itself be column-sparse as well. This is done via the following incoherence condition.\n\nDefinition: A matrix L ∈ Rp×n with SVD as in (1), and (1 − γ)n of whose columns are non-zero, is said to be column-incoherent with parameter µ if\n\nmax_i ||V⊤e_i||^2 ≤ µr/((1 − γ)n),\n\nwhere {e_i} are the coordinate unit vectors.\n\nThus if V has a column aligned with a coordinate axis, then µ = (1 − γ)n/r. Similarly, if V is perfectly incoherent (e.g. 
if r = 1 and every non-zero entry of V has magnitude 1/√((1 − γ)n)) then µ = 1.\n\nIn the standard PCA setup, if the points are generated by some low-dimensional isometric Gaussian distribution, then with high probability, one will have µ = O(max(1, log(n)/r)) [16]. Alternatively, if the points are generated by a uniform distribution over a bounded set, then µ = Θ(1).\n\nA small incoherence parameter µ essentially enforces that the matrix L0 will have column support that is spread out. Note that this is quite natural from the application perspective. Indeed, if the left hand side is as large as 1, it essentially means that one of the directions of the column space which we wish to recover is defined by only a single observation. Given the regime of a constant fraction of arbitrarily chosen and arbitrarily corrupted points, such a setting is not meaningful. Indeed, having a small incoherence µ is an assumption made in all methods based on nuclear norm minimization to date [4, 15–17].\n\nWe would like to identify the outliers, which can be arbitrary. However, clearly an “outlier” point that lies in the true subspace is a meaningless concept. Thus, in matrix terms, we require that no column of C0 lies in the column space of L0.\n\nThe parameters µ and γ are not required for the execution of the algorithm, and do not need to be known a priori. They only arise in the analysis of our algorithm’s performance.\n\nOther Notation and Preliminaries: Capital letters such as A are used to represent matrices, and accordingly, Ai denotes the ith column vector. Letters U, V, I and their variants (complements, subscripts, etc.) are reserved for column space, row space and column support respectively. There are four associated projection operators we use throughout. 
The projection onto the column space, U, is denoted by PU and given by PU(A) = UU⊤A, and similarly for the row space, PV(A) = AVV⊤. The matrix PI(A) is obtained from A by setting column Ai to zero for all i ∉ I. Finally, PT is the projection to the space spanned by U and V, given by PT(·) = PU(·) + PV(·) − PUPV(·). Note that PT depends on U and V, and we suppress this notation wherever it is clear which U and V we are using. The complementary operators, PU⊥, PV⊥, PT⊥ and PIc, are defined as usual. The same notation is also used to represent a subspace of matrices: e.g., we write A ∈ PU for any matrix A that satisfies PU(A) = A. Five matrix norms are used: ||A||_* is the nuclear norm, ||A|| is the spectral norm, ||A||_{1,2} is the sum of the ℓ2 norms of the columns Ai, ||A||_{∞,2} is the largest ℓ2 norm of the columns, and ||A||_F is the Frobenius norm. The only vector norm used is ||·||_2, the ℓ2 norm. Depending on the context, I is either the identity matrix or the identity operator; e_i is the ith basis vector. The SVD of L0 is U0Σ0V0⊤. The rank of L0 is denoted as r, and we have γ ≜ |I0|/n, i.e., the fraction of outliers.\n\n3 Main Results and Consequences\n\nWhile we do not recover the matrix L0, we show that the goal of PCA can be attained: even under our strong corruption model, with a constant fraction of points corrupted, we show that we can – under very weak assumptions – exactly recover both the column space of L0 (i.e., the low-dimensional space the uncorrupted points lie on) and the column support of C0 (i.e., the identities of the outliers), from M. If there is additional noise corrupting the data matrix, i.e. 
if we have M = L0 + C0 + N, a natural variant of our approach finds a good approximation.\n\n3.1 Algorithm\n\nGiven data matrix M, our algorithm, called Outlier Pursuit, generates (a) a matrix ˆU, with orthonormal columns, that spans the low-dimensional true subspace we want to recover, and (b) a set of column indices ˆI corresponding to the outlier points. To ensure success, one choice of the tuning parameter is λ = 3/(7√(γn)), as Theorem 1 below suggests.\n\nAlgorithm 1 Outlier Pursuit\nFind (ˆL, ˆC), the optimum of the following convex optimization program.\n\nMinimize: ||L||_* + λ||C||_{1,2}\nSubject to: M = L + C    (2)\n\nCompute the SVD ˆL = U1Σ1V1⊤ and output ˆU = U1.\nOutput the set of non-zero columns of ˆC, i.e. ˆI = {j : ˆCij ≠ 0 for some i}.\n\nWhile in the noiseless case there are simple algorithms with similar performance, the benefit of the algorithm, and of the analysis, is the extension to more realistic and interesting situations where, in addition to gross corruption of some samples, there is additional noise. Adapting the Outlier Pursuit algorithm, we have the following variant for the noisy case.\n\nNoisy Outlier Pursuit:\n\nMinimize: ||L||_* + λ||C||_{1,2}\nSubject to: ||M − (L + C)||_F ≤ ε    (3)\n\nOutlier Pursuit (and its noisy variant) is a convex surrogate for the following natural (but combinatorial and intractable) first approach to the recovery problem:\n\nMinimize: rank(L) + λ||C||_{0,c}\nSubject to: M = L + C    (4)\n\nwhere ||·||_{0,c} stands for the number of non-zero columns of a matrix.\n\n3.2 Performance\n\nWe show that under rather weak assumptions, Outlier Pursuit exactly recovers the column space of the low-rank matrix L0, and the identities of the non-zero columns of the outlier matrix C0. The formal statement appears below.\n\nTheorem 1 (Noiseless Case). 
Suppose we observe M = L0 + C0, where L0 has rank r and incoherence parameter µ. Suppose further that C0 is supported on at most γn columns. Any output of Outlier Pursuit recovers the column space exactly, and identifies exactly the indices of columns corresponding to outliers not lying in the recovered column space, as long as the fraction of corrupted points, γ, satisfies\n\nγ/(1 − γ) ≤ c1/(µr),    (5)\n\nwhere c1 = 9/121. This can be achieved by setting the parameter λ in Outlier Pursuit to be 3/(7√(γn)) – indeed it holds for any λ in a specific range which we provide below.\n\nFor the case where in addition to the corrupted points we have noisy observations, ˜M = M + N, we have the following result.\n\nTheorem 2 (Noisy Case). Suppose we observe ˜M = M + N = L0 + C0 + N, where\n\nγ/(1 − γ) ≤ c2/(µr),    (6)\n\nwith c2 = 9/1024, and ||N||_F ≤ ε. Let the output of Noisy Outlier Pursuit be L′, C′. Then there exist ˜L, ˜C such that M = ˜L + ˜C, ˜L has the correct column space, and ˜C the correct column support, and\n\n||L′ − ˜L||_F ≤ 10√n ε;    ||C′ − ˜C||_F ≤ 9√n ε.\n\nThe conditions in this theorem are essentially tight in the following scaling sense (i.e., up to universal constants). If there is no additional structure imposed, beyond what we have stated above, then up to scaling, in the noiseless case, Outlier Pursuit can recover from as many outliers (i.e., the same fraction) as any possible algorithm with arbitrary complexity. 
In particular, it is easy to see that if the rank of the matrix L0 is r, and the fraction of outliers satisfies γ ≥ 1/(r + 1), then the problem is not identifiable, i.e., no algorithm can separate authentic and corrupted points.¹\n\n¹Note that this is no longer true in the presence of stronger assumptions, e.g., an isometric distribution, on the authentic points [12].\n\n3.3 Related Work\n\nRobust PCA has a long history (e.g., [3, 5–11]). Each of these algorithms either performs standard PCA on a robust estimate of the covariance matrix, or finds directions that maximize a robust estimate of the variance of the projected data. These algorithms seek to approximately recover the column space, and moreover, no existing approach attempts to identify the set of outliers. This outlier identification, while outside the scope of traditional PCA algorithms, is important in a variety of applications such as finance, bioinformatics, and more.\n\nMany existing robust PCA algorithms suffer from two pitfalls: performance degradation as the dimension increases, and computational intractability. To wit, [18] shows that several robust PCA algorithms, including the M-estimator [19], Convex Peeling [20], Ellipsoidal Peeling [21], Classical Outlier Rejection [22], Iterative Deletion [23] and Iterative Trimming [24], have breakdown points proportional to the inverse of the dimensionality, and hence are useless in the high dimensional regime we consider.\n\nAlgorithms with a non-diminishing breakdown point, such as Projection-Pursuit [25], are non-convex or even combinatorial, and hence computationally intractable (NP-hard) as the size of the problem scales. 
In contrast to these, the performance of Outlier Pursuit does not degrade with the dimension p, and the convex program can be solved in polynomial time.\n\nAlgorithms based on nuclear norm minimization to recover low rank matrices are now standard, since the seminal paper [14]. Recent work [4, 15] has taken the nuclear norm minimization approach to the decomposition of a low-rank matrix and an overall sparse matrix. At a high level, these papers are close in spirit to ours. However, there are critical differences in the problem setup, the results, and in key analysis techniques. First, these algorithms fail in our setting as they cannot handle outliers – entire columns where every entry is corrupted. Second, from a technical and proof perspective, all the above works investigate exact signal recovery – the intended outcome is known ahead of time, and one just needs to investigate the conditions needed for success. In our setting, however, the convex optimization cannot recover L0 itself exactly. This requires an auxiliary “oracle problem” as well as different analysis techniques, on which we elaborate below.\n\n4 Proof Outline and Comments\n\nIn this section we provide an outline of the proof of Theorem 1. The full proofs of all theorems appear in a full version available online [26]. The proof follows three main steps:\n\n1. Identify the first-order necessary and sufficient conditions for any pair (L′, C′) to be the optimum of the convex program (2).\n\n2. Consider a candidate pair (ˆL, ˆC) that is the optimum of an alternate optimization problem, often called the “oracle problem”. The oracle problem ensures that the pair (ˆL, ˆC) has the desired column space and column support, respectively.\n\n3. 
Show that this (ˆL, ˆC) is the optimum of Outlier Pursuit.\n\nWe remark that the aim of the matrix recovery papers [4, 15, 16] was exact recovery of the entire matrix, and thus the optimality conditions required are clear. Since our setup precludes exact recovery of L0 and C0,² our optimality conditions must imply the optimality for Outlier Pursuit of an as-of-yet-undetermined pair (ˆL, ˆC), the solution to the oracle problem. We now elaborate.\n\nOptimality Conditions: We now specify the conditions a candidate optimum needs to satisfy; these arise from the standard subgradient conditions for the norms involved. Suppose the pair (L′, C′) is a feasible point of (2), i.e. we have that L′ + C′ = M. Let the SVD of L′ be given by L′ = U′Σ′V′⊤. For any matrix X, define PT′(X) := U′U′⊤X + XV′V′⊤ − U′U′⊤XV′V′⊤, the projection of X onto matrices that share the same column space or row space with L′.\n\nLet I′ be the set of non-zero columns of C′, and let H′ be the column-normalized version of C′. That is, column H′_i = C′_i/||C′_i||_2 for all i ∈ I′, and H′_i = 0 for all i ∉ I′. Finally, for any matrix X let PI′(X) denote the matrix with all columns in I′c set to 0, and the columns in I′ left as-is.\n\n²The optimum ˆL of (2) will be non-zero in every column of C0 that is not orthogonal to L0’s column space.\n\nProposition 1. 
With notation as above, (L′, C′) is an optimum of the Outlier Pursuit program (2) if there exists a Q such that\n\nPT′(Q) = U′V′⊤\nPI′(Q) = λH′\n||Q − PT′(Q)|| ≤ 1\n||Q − PI′(Q)||_{∞,2} ≤ λ.    (7)\n\nFurther, if both inequalities above are strict, dubbed Q strictly satisfies (7), then (L′, C′) is the unique optimum.\n\nNote that here ||·|| is the spectral norm (i.e. largest singular value) and ||·||_{∞,2} is the magnitude – i.e. ℓ2 norm – of the column with the largest magnitude.\n\nOracle Problem: We develop our candidate solution (ˆL, ˆC) by considering the alternate optimization problem where we add constraints to (2) based on what we hope its optimum should be. In particular, recall the SVD of the true L0 = U0Σ0V0⊤ and define for any matrix X the projection onto the space of all matrices with column space contained in U0 as PU0(X) := U0U0⊤X. Similarly, for the column support I0 of the true C0, define the projection PI0(X) to be the matrix that results when all the columns in I0c are set to 0.\n\nNote that U0 and I0 above correspond to the truth. Thus, with this notation, we would like the optimum of (2) to satisfy PU0(ˆL) = ˆL, as this is nothing but the fact that ˆL has recovered the true subspace. Similarly, having ˆC satisfy PI0(ˆC) = ˆC means that we have succeeded in identifying the outliers. The oracle problem arises by imposing these as additional constraints in (2). Formally:\n\nOracle Problem:\n\nMinimize: ||L||_* + λ||C||_{1,2}\nSubject to: M = L + C; PU0(L) = L; PI0(C) = C.    (8)\n\nObtaining Dual Certificates for Outlier Pursuit: We now construct a dual certificate of (ˆL, ˆC) to establish Theorem 1. Let the SVD of ˆL be ˆU ˆΣ ˆV⊤. 
It is easy to see that there exists an orthonormal matrix V ∈ Rn×r such that ˆU ˆV⊤ = U0V⊤, where U0 spans the column space of L0. Moreover, it is easy to show that P_ˆU(·) = PU0(·), P_ˆV(·) = PV(·), and hence the operator P_ˆT defined by ˆU and ˆV obeys P_ˆT(·) = PU0(·) + PV(·) − PU0PV(·). Let ˆH be the matrix satisfying PI0c(ˆH) = 0 and ˆH_i = ˆC_i/||ˆC_i||_2 for all i ∈ I0.\n\nDefine the matrix G ∈ Rr×r as\n\nG ≜ PI0(V⊤)(PI0(V⊤))⊤ = Σ_{i∈I0} [(V⊤)_i][(V⊤)_i]⊤,\n\nand the constant c ≜ ||G||. Further define the matrices Δ1 ≜ λPU0(ˆH), and\n\nΔ2 ≜ PU0⊥ PI0c PV [I + Σ_{i=1}^∞ (PV PI0 PV)^i] PV (λˆH) = PI0c PV [I + Σ_{i=1}^∞ (PV PI0 PV)^i] PV PU0⊥ (λˆH).\n\nThen we can define the dual certificate for strict optimality of the pair (ˆL, ˆC).\n\nProposition 2. If c < 1, γ/(1 − γ) < (1 − c)²/((3 − c)²µr), and\n\n(1 − c)√(µr/(1 − γ)) / (√n (1 − c − √(γµr/(1 − γ)))) < λ < (1 − c)/((2 − c)√(nγ)),\n\nthen Q ≜ U0V⊤ + λˆH − Δ1 − Δ2 strictly satisfies Condition (7), i.e., it is the dual certificate.\n\nConsider the (much) simpler case where the corrupted columns are assumed to be orthogonal to the column space of L0 which we seek to recover. Indeed, in that setting, where V0 = ˆV = V, we automatically satisfy the condition PI0 ∩ PV0 = {0}. In the general case, we require the condition c < 1 to recover the same property. 
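As a small numerical illustration of this constant (ours, not from the paper; the basis V, its dimensions, and the support I0 below are synthetic assumptions), c = ||G|| measures how much the row space concentrates on the outlier support, and a spread-out support keeps c well below 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, n_out = 40, 2, 4

# Synthetic orthonormal row-space basis V (n x r) and a hypothetical
# outlier support I0 = {0, 1, 2, 3}.
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
I0 = range(n_out)

# G = P_I0(V^T) (P_I0(V^T))^T: sum of outer products of the rows of V on I0.
G = sum(np.outer(V[i], V[i]) for i in I0)
c = np.linalg.norm(G, 2)     # spectral norm of G
print(0.0 < c < 1.0)         # True: only 4 of 40 rows carry mass, so c << 1
```

If instead a single direction of the row space were supported entirely on I0, c would reach 1 and the condition of Proposition 2 would fail.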
Moreover, considering that the columns of H are either zero, or defined as normalizations of the columns of the matrix C (i.e., normalizations of outliers), the fact that PU0(H) = PV0(H) = PT0(H) = 0 is immediate, as is the condition that PI0(U0V0⊤) = 0. For the general, non-orthogonal case, however, we require the matrices Δ1 and Δ2 to obtain these equalities, and the rest of the dual certificate properties. In the full version [26] we show in detail how these ideas and the oracle problem are used to construct the dual certificate Q. Extending these ideas, we then quickly obtain the proof for the noisy case.\n\n5 Implementation issues and numerical experiments\n\nSolving nuclear-norm minimizations naively requires the use of general purpose SDP solvers, which unfortunately still have questionable scaling capabilities. Instead, we use the proximal gradient algorithm [27], a.k.a. Singular Value Thresholding [28], to solve Outlier Pursuit. The algorithm converges with a rate of O(k^{−2}), where k is the number of iterations; each iteration involves a singular value decomposition and a thresholding step, therefore requiring significantly less computational time than interior point methods.\n\nOur first experiment investigates the phase-transition property of Outlier Pursuit, using randomly generated synthetic data. Fix n = p = 400. For different r and numbers of outliers γn, we generated matrices A ∈ Rp×r and B ∈ R(n−γn)×r where each entry is an independent N(0, 1) random variable, and then set L∗ := A × B⊤ (the “clean” part of M). The outliers C∗ ∈ Rp×γn are generated either neutrally, where each entry of C∗ is i.i.d. N(0, 1), or adversarially, where every column is an identical copy of a random Gaussian vector. Outlier Pursuit succeeds if ˆC ∈ PI and ˆL ∈ PU. 
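For readers who want to reproduce a toy version of this experiment, here is a self-contained proximal-splitting sketch of program (2) (our simplification with a geometrically decreasing penalty µ, not the authors' solver; the problem sizes, schedule, and seed are arbitrary assumptions):

```python
import numpy as np

def outlier_pursuit(M, lam, n_iter=200, mu_min=1e-6):
    """Heuristic alternating-prox sketch of min ||L||_* + lam*||C||_{1,2}
    s.t. M = L + C: singular value thresholding for L, column-wise l2
    shrinkage for C, with a decreasing penalty mu (continuation)."""
    L = np.zeros_like(M)
    C = np.zeros_like(M)
    mu = 0.99 * np.linalg.norm(M, 2)
    for _ in range(n_iter):
        # L-step: prox of mu*||.||_* at M - C (shrink singular values).
        U, s, Vt = np.linalg.svd(M - C, full_matrices=False)
        L = (U * np.maximum(s - mu, 0.0)) @ Vt
        # C-step: prox of mu*lam*||.||_{1,2} at M - L (shrink whole columns).
        A = M - L
        norms = np.maximum(np.linalg.norm(A, axis=0), 1e-12)
        C = A * np.maximum(1.0 - mu * lam / norms, 0.0)
        mu = max(0.9 * mu, mu_min)
    return L, C

# Rank-1 inliers plus 3 arbitrary outlier columns.
rng = np.random.default_rng(0)
p = n = 30
n_out = 3
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
M = np.outer(u, 3.0 * rng.standard_normal(n))
M[:, :n_out] = rng.standard_normal((p, n_out))   # planted outlier columns

lam = 3.0 / (7.0 * np.sqrt(n_out))   # the paper's tuning, lambda = 3/(7*sqrt(gamma*n))
L_hat, C_hat = outlier_pursuit(M, lam)
flagged = np.argsort(np.linalg.norm(C_hat, axis=0))[-n_out:]
print(sorted(int(j) for j in flagged))   # indices of the planted outlier columns
```

The column-wise shrinkage is the proximal operator of the ||·||_{1,2} norm, mirroring how singular value thresholding is the proximal operator of the nuclear norm.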
Note that if many outliers span the same direction, it would be difficult to identify whether they are all outliers, or just a new direction of the true space. Indeed, such a setup is order-wise the worst case: as we prove in the full version [26], a matching lower bound is achieved when all outliers are identical.\n\nFigure 1: Complete Observation: Results averaged over 10 trials. (Panels: (a) Random Outlier; (b) Identical Outlier; (c) Noisy Outlier Detection – success rate vs. σ/s for s = 10, 20 with random/identical outliers.)\n\nFigure 1 shows the phase transition property. We represent success in gray scale, with white denoting success, and black failure. When outliers are random (the easier case) Outlier Pursuit succeeds even when r = 20 with 100 outliers. In the adversarial case, we observe a phase transition: Outlier Pursuit succeeds when r × γ is small, and fails otherwise, consistent with our theory’s predictions.\n\nWe then fix r = γn = 5 and examine the outlier identification ability of Outlier Pursuit with noisy observations. We scale each outlier so that the ℓ2 distance of the outlier to the span of the true samples equals a pre-determined value s. Each true sample is then corrupted with a Gaussian random vector with ℓ2 magnitude σ. We perform (noiseless) Outlier Pursuit on this noisy observation matrix, and claim that the algorithm successfully identifies the outliers if, for the resulting ˆC matrix, ||ˆC_j||_2 < ||ˆC_i||_2 for all j ∉ I and i ∈ I, i.e., there exists a threshold value to separate out the outliers. 
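This success criterion (a single threshold on column norms isolating the outlier columns of ˆC) can be checked mechanically; a small helper of ours, with a toy ˆC standing in for a solver output:

```python
import numpy as np

def separable_by_threshold(C_hat, outlier_idx):
    """True iff ||C_j||_2 < ||C_i||_2 for every inlier j and outlier i,
    i.e. some threshold on column l2 norms isolates the outliers."""
    norms = np.linalg.norm(C_hat, axis=0)
    mask = np.zeros(C_hat.shape[1], dtype=bool)
    mask[list(outlier_idx)] = True
    return bool(norms[~mask].max() < norms[mask].min())

# Toy check: columns 0 and 1 play the outliers.
C = np.zeros((4, 5))
C[:, 0] = 3.0
C[:, 1] = 2.0
C[0, 2] = 0.5            # small residual on an inlier column
print(separable_by_threshold(C, [0, 1]))   # True
print(separable_by_threshold(C, [2]))      # False
```
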
Figure 1(c) shows the result: when σ/s ≤ 0.3 for the identical outlier case, and σ/s ≤ 0.7 for the random outlier case, Outlier Pursuit correctly identifies the outliers.\n\nWe further study the case of decomposing M under incomplete observation, which is motivated by robust collaborative filtering: we generate M as before, but only observe each entry with a given probability (independently). Letting Ω be the set of observed entries, we solve\n\nMinimize: ||L||_* + λ||C||_{1,2}\nSubject to: PΩ(L + C) = PΩ(M).    (9)\n\nThe same success condition is used. Figure 2 shows a very promising result: the successful decomposition rate under incomplete observation is close to that of the complete observation case even when only 30% of entries are observed. Given this empirical result, a natural direction of future research is to understand the theoretical guarantees of (9) in the incomplete observation case.\n\nFigure 2: Partial Observation. (Panels: (a) 30% entries observed; (b) 80% entries observed; (c) success rate vs. fraction of observed entries.)\n\nNext we report some experimental results on the USPS digit dataset. The goal of this experiment is to show that Outlier Pursuit can be used to identify anomalies within a dataset. We use the data from [29], and construct the observation matrix M as containing the first 220 samples of digit “1” and the last 11 samples of “7”. The learning objective is to correctly identify all the “7”s. Note that throughout the experiment, label information is unavailable to the algorithm, i.e., there is no training stage. Since the columns of digit “1” are not exactly low rank, an exact decomposition is not possible. 
Hence, we use the ℓ2 norm of each column of the resulting C matrix to identify the outliers: a larger ℓ2 norm means that the sample is more likely to be an outlier – essentially, we apply thresholding after C is obtained. Figure 3(a) shows the ℓ2 norm of each column of the resulting C matrix. We see that all “7”s are indeed identified. However, two “1” samples (columns 71 and 177) are also identified as outliers, due to the fact that these two samples are written in a way that differs from the rest of the “1”s, as shown in Figure 4. Under the same setup, we also simulate the case where only 80% of the entries are observed. As Figure 3(b) and (c) show, results similar to those of the complete observation case are obtained, i.e., all true “7”s, and also the “1”s No 71 and No 177, are identified.\n\nFigure 3: Outlyingness: ℓ2 norm of C_i. (Panels: (a) Complete Observation; (b) Partial Obs. (one run); (c) Partial Obs. (average).)\n\nFigure 4: Typical “1”, “7” and abnormal “1” (samples No 71 and No 177).\n\n6 Conclusion and Future Directions\n\nThis paper considers robust PCA from a matrix decomposition approach, and develops the algorithm Outlier Pursuit. Under some mild conditions, we show that Outlier Pursuit can exactly recover the column space of the low-rank part, and exactly identify the outliers. 
This result is new, differing both from results in Robust PCA and from results using nuclear-norm approaches for matrix completion and matrix reconstruction. One central innovation we introduce is the use of an oracle problem: whenever the recovery concept (in this case, the column space) does not uniquely correspond to a single matrix (we believe many, if not most, cases of interest fall under this description), such a tool will be quite useful. Immediate goals for future work include considering specific applications, in particular robust collaborative filtering (here, the goal is to decompose a partially observed column-corrupted matrix), and obtaining tight bounds for outlier identification in the noisy case.

Acknowledgements  H. Xu would like to acknowledge support from DTRA grant HDTRA1-08-0029. C. Caramanis would like to acknowledge support from NSF grants EFRI-0735905, CNS-0721532, CNS-0831580, and DTRA grant HDTRA1-08-0029. S. Sanghavi would like to acknowledge support from the NSF CAREER program, Grant 0954059.

References

[1] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics, Berlin: Springer, 1986.
[2] P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
[3] L. Xu and A. L. Yuille. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Trans. on Neural Networks, 6(1):131–143, 1995.
[4] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? ArXiv:0912.3599, 2009.
[5] S. J. Devlin, R. Gnanadesikan, and J. R. Kettenring. Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association, 76(374):354–362, 1981.
[6] T. N. Yang and S. D. Wang. Robust algorithms for principal component analysis. Pattern Recognition Letters, 20(9):927–933, 1999.
[7] C. Croux and G. Haesbroeck.
Principal component analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies. Biometrika, 87(3):603–618, 2000.
[8] F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In ICCV'01, pages 362–369, 2001.
[9] F. De la Torre and M. J. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54(1/2/3):117–142, 2003.
[10] C. Croux, P. Filzmoser, and M. Oliveira. Algorithms for projection-pursuit robust principal component analysis. Chemometrics and Intelligent Laboratory Systems, 87(2):218–225, 2007.
[11] S. C. Brubaker. Robust PCA and clustering on noisy mixtures. In SODA'09, pages 1078–1087, 2009.
[12] H. Xu, C. Caramanis, and S. Mannor. Principal component analysis with contaminated data: The high dimensional case. In COLT'10, pages 490–502, 2010.
[13] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory, 52(2):489–509, 2006.
[14] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. To appear in SIAM Review, 2010.
[15] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. ArXiv:0906.2220, 2009.
[16] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.
[17] E. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. on Information Theory, 56(5):2053–2080, 2010.
[18] D. L. Donoho. Breakdown properties of multivariate location estimators. Qualifying paper, Harvard University, 1982.
[19] R. Maronna.
Robust M-estimators of multivariate location and scatter. The Annals of Statistics, 4:51–67, 1976.
[20] V. Barnett. The ordering of multivariate data. Journal of the Royal Statistical Society, Series A, 138:318–344, 1976.
[21] D. Titterington. Estimation of correlation coefficients by ellipsoidal trimming. Applied Statistics, 27:227–234, 1978.
[22] V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, New York, 1978.
[23] A. Dempster and M. Gasko-Green. New tools for residual analysis. The Annals of Statistics, 9(5):945–959, 1981.
[24] S. J. Devlin, R. Gnanadesikan, and J. R. Kettenring. Robust estimation and outlier detection with correlation coefficients. Biometrika, 62:531–545, 1975.
[25] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo. Journal of the American Statistical Association, 80(391):759–766, 1985.
[26] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. http://arxiv.org/abs/1010.4237, 2010.
[27] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[28] J.-F. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20:1956–1982, 2008.
[29] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.