{"title": "Correlated-PCA: Principal Components' Analysis when Data and Noise are Correlated", "book": "Advances in Neural Information Processing Systems", "page_first": 1768, "page_last": 1776, "abstract": "Given a matrix of observed data, Principal Components Analysis (PCA) computes a small number of orthogonal directions that contain most of its variability. Provably accurate solutions for PCA have been in use for a long time. However, to the best of our knowledge, all existing theoretical guarantees for it assume that the data and the corrupting noise are mutually independent, or at least uncorrelated. This is valid in practice often, but not always. In this paper, we study the PCA problem in the setting where the data and noise can be correlated. Such noise is often also referred to as ``data-dependent noise\". We obtain a correctness result for the standard eigenvalue decomposition (EVD) based solution to PCA under simple assumptions on the data-noise correlation. We also develop and analyze a generalization of EVD, cluster-EVD, that improves upon EVD in certain regimes.", "full_text": "Correlated-PCA: Principal Components\u2019 Analysis\n\nwhen Data and Noise are Correlated\n\nNamrata Vaswani and Han Guo\n\nIowa State University, Ames, IA, USA\nEmail: {namrata,hanguo}@iastate.edu\n\nAbstract\n\nGiven a matrix of observed data, Principal Components Analysis (PCA) computes\na small number of orthogonal directions that contain most of its variability. Prov-\nably accurate solutions for PCA have been in use for a long time. However, to\nthe best of our knowledge, all existing theoretical guarantees for it assume that the\ndata and the corrupting noise are mutually independent, or at least uncorrelated.\nThis is valid in practice often, but not always. In this paper, we study the PCA\nproblem in the setting where the data and noise can be correlated. Such noise is\noften also referred to as \u201cdata-dependent noise\u201d. 
We obtain a correctness result for the standard eigenvalue decomposition (EVD) based solution to PCA under simple assumptions on the data-noise correlation. We also develop and analyze a generalization of EVD, cluster-EVD, that improves upon EVD in certain regimes.\n\n1 Introduction\n\nPrincipal Components Analysis (PCA) is among the most frequently used tools for dimension reduction. Given a matrix of data, it computes a small number of orthogonal directions that contain all (or most) of the variability of the data. The subspace spanned by these directions is the “principal subspace”. To use PCA for dimension reduction, one projects the observed data onto this subspace. The standard solution to PCA is to compute the reduced singular value decomposition (SVD) of the data matrix, or, equivalently, to compute the reduced eigenvalue decomposition (EVD) of the empirical covariance matrix of the data. If all eigenvalues are nonzero, a threshold is used and all eigenvectors with eigenvalues above the threshold are retained. This solution, which we henceforth refer to as simple EVD, or just EVD, has been used for many decades and is well-studied in the literature, e.g., see [1] and references therein. However, to the best of our knowledge, all existing results for it assume that the true data and the corrupting noise in the observed data are independent, or, at least, uncorrelated. This is often valid in practice, but not always. Here, we study the PCA problem in the setting where the data and noise vectors may be correlated (correlated-PCA). 
Such noise is sometimes called “data-dependent” noise.\n\nContributions. (1) Under a boundedness assumption on the true data vectors, and some other assumptions, for a fixed desired subspace error level, we show that the sample complexity of simple-EVD for correlated-PCA scales as f² r² log n, where n is the data vector length, f is the condition number of the true data covariance matrix and r is its rank. Here, “sample complexity” refers to the number of samples needed to get a small enough subspace recovery error with high probability (whp). The dependence on f² is problematic for datasets with large condition numbers, and especially in the high-dimensional setting where n is large. (2) To address this, we also develop and analyze a generalization of simple-EVD, called cluster-EVD. Under an eigenvalues’ “clustering” assumption, cluster-EVD weakens the dependence on f.\n\nTo our best knowledge, the correlated-PCA problem has not been explicitly studied. We first encountered it while solving the dynamic robust PCA problem in the Recursive Projected Compressive Sensing (ReProCS) framework [2, 3, 4, 5]. The version of correlated-PCA studied here is motivated by these works. Some other somewhat related recent works include [6, 7], which study stochastic optimization based techniques for PCA, and [8, 9, 10, 11], which study online PCA.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNotation. We use the interval notation [a, b] to mean all of the integers between a and b, inclusive, and similarly for [a, b) etc. We use ‖·‖ to denote the l2 norm of a vector or the induced l2 norm of a matrix. For other lp norms, we use ‖·‖p. For a set T, IT refers to the n × |T| matrix of columns of the identity matrix indexed by the entries in T. For a matrix A, AT := A IT. 
A tall matrix with orthonormal columns is referred to as a basis matrix. For basis matrices P̂ and P, we quantify the subspace error (SE) between their range spaces using\n\nSE(P̂, P) := ‖(I − P̂ P̂′) P‖.   (1)\n\n1.1 Correlated-PCA: Problem Definition\n\nWe are given a time sequence of data vectors, yt, that satisfy\n\nyt = ℓt + wt, with wt = Mt ℓt and ℓt = P at,   (2)\n\nwhere P is an n × r basis matrix with r ≪ n. Here ℓt is the true data vector that lies in a low-dimensional subspace of Rⁿ, range(P); at is its projection into this r-dimensional subspace; and wt is the data-dependent noise. We refer to Mt as the correlation / data-dependency matrix. The goal is to estimate range(P). We make the following assumptions on ℓt and Mt.\n\nAssumption 1.1. The subspace projection coefficients, at, are zero mean, mutually independent and bounded random vectors (r.v.), with a diagonal covariance matrix Λ. Define λ− := λmin(Λ), λ+ := λmax(Λ) and f := λ+/λ−. Since the at’s are bounded, we can also define a finite constant η := max_{j=1,2,...,r} max_t (at)j² / λj. Thus, (at)j² ≤ η λj.\n\nFor most bounded distributions, η will be a small constant more than one; e.g., if the distribution of all entries of at is iid zero mean uniform, then η = 3. From Assumption 1.1, clearly, the ℓt’s are also zero mean, bounded, and mutually independent r.v.’s with a rank-r covariance matrix Σ whose EVD is Σ = P Λ P′. In the model, for simplicity, we assume Λ to be fixed. However, even if we replace Λ by Λt and define λ− = min_t λmin(Λt) and λ+ = max_t λmax(Λt), all our results will still hold.\n\nAssumption 1.2. Decompose Mt as Mt = M2,t M1,t. 
Assume that\n\n‖M1,t P‖ ≤ q < 1, ‖M2,t‖ ≤ 1,   (3)\n\nand, for any sequence of positive semi-definite Hermitian matrices, At, the following holds:\n\n‖(1/α) Σ_{t=1}^{α} M2,t At M2,t′‖ ≤ (β/α) max_{t∈[1,α]} ‖At‖, for a β < α.   (4)\n\nWe will need the above to hold for all α ≥ α0 and for all β ≤ c0 α with a c0 ≪ 1. We set α0 and c0 in Theorems 2.1 and 3.3; both will depend on q. Observe that, using (3), ‖wt‖/‖ℓt‖ ≤ q, and so q is an upper bound on the noise-to-signal ratio.\n\nTo understand the assumption on M2,t, notice that, if we allow β = α, then (4) always holds and is not an assumption. Let B denote the matrix on the LHS of (4). One example situation when (4) holds with a β ≪ α is if B is block-diagonal with blocks At. In this case, it holds with β = 1. In fact, it also holds with β = 1 if B is permutation-similar to a block-diagonal matrix. The matrix B will be of this form if M2,t = ITt with all the sets Tt being mutually disjoint. More generally, if B is permutation-similar to a block-diagonal matrix with blocks given by summations of At’s over at most β0 < α time instants, then (4) holds with β = β0. This will happen if M2,t = ITt with Tt = T[k] for at most β0 time instants and if the sets T[k] are mutually disjoint for different k. Finally, the T[k]’s need not even be mutually disjoint. As long as they are such that B is a matrix with nonzero blocks only on the main diagonal and on a few diagonals near it, e.g., if it is block tri-diagonal, it can be shown that the above assumption holds. 
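This block-diagonal case is easy to verify numerically. The sketch below (Python; the sizes n, s, α, the cyclic support schedule, and the random PSD choices for the At are all illustrative, not from the paper) builds B = (1/α) Σt ITt At ITt′ for supports Tt that cycle through disjoint blocks, so each distinct block repeats at most β0 = 2 times, and checks (4) with β = β0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, alpha = 20, 4, 8   # illustrative sizes, not from the paper

# Supports cycle through the n/s = 5 disjoint length-s blocks, so each
# distinct block repeats at most beta0 = 2 times during t = 1, ..., alpha.
supports = [np.arange((t * s) % n, (t * s) % n + s) for t in range(alpha)]

B = np.zeros((n, n))
max_norm = 0.0
for T in supports:
    M = rng.standard_normal((s, s))
    A = M @ M.T                      # a random PSD matrix A_t
    B[np.ix_(T, T)] += A             # I_T A_t I_T' adds A_t into block (T, T)
    max_norm = max(max_norm, np.linalg.norm(A, 2))

beta0 = 2
lhs = np.linalg.norm(B / alpha, 2)   # || (1/alpha) sum_t I_T A_t I_T' ||
assert lhs <= (beta0 / alpha) * max_norm + 1e-9
```

Because each At is PSD, the norm of each diagonal block of B is at most the sum of the norms of the At’s placed there, which is why β becomes the number of repeats rather than α.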
This example is generalized in Assumption 1.3 given below.\n\n1.2 Examples of correlated-PCA problems\n\nOne key example of correlated-PCA is the PCA with missing data (PCA-missing) problem. Let Tt denote the set of missing entries at time t. Suppose we set the missing entries of yt to zero. Then,\n\nyt = ℓt − ITt ITt′ ℓt.   (5)\n\nIn this case, M2,t = ITt and M1,t = −ITt′. Thus, q is an upper bound on ‖ITt′ P‖. Clearly, it will be small if the columns of P are dense vectors. For the reader familiar with low-rank matrix completion (MC), e.g., [12, 13], PCA-missing can also be solved by first solving the low-rank matrix completion problem to recover L, followed by PCA on the completed matrix. This would, of course, be much more expensive than directly solving PCA-missing and would need more assumptions.\n\nAnother example where correlated-PCA occurs is that of robust PCA (low-rank + sparse formulation) [14, 15, 16] when the sparse component’s magnitude is correlated with ℓt. Let Tt denote the support set of wt and let xt be the |Tt|-length vector of its nonzero entries. If we assume linear dependency of xt on ℓt, we can write out yt as\n\nyt = ℓt + ITt xt = ℓt + ITt Ms,t ℓt.   (6)\n\nThus M2,t = ITt and M1,t = Ms,t, and so q is an upper bound on ‖Ms,t P‖. In the rest of the paper, we refer to this problem as “PCA with sparse data-dependent corruptions (PCA-SDDC)”. One key application where it occurs is in foreground-background separation for videos consisting of a slowly changing background sequence (modeled as lying close to a low-dimensional subspace) and a sparse foreground image sequence consisting typically of one or more moving objects [14]. The PCA-SDDC problem is to estimate the background sequence’s subspace. 
In this case, ℓt is the background image at time t, Tt is the support set of the foreground image at t, and xt is the difference between foreground and background intensities on Tt. An alternative solution approach for PCA-SDDC is to use an RPCA solution such as principal components’ pursuit (PCP) [14, 15] or Alternating-Minimization (Alt-Min-RPCA) [17] to first recover the matrix L, followed by PCA on L. However, as shown in Sec. 5, Table 1, this approach will be much slower; and it will work only if its required incoherence assumptions hold. For example, if the columns of P are sparse, it fails.\n\nFor both problems above, a solution for PCA will work only when the corrupting noise wt is small compared to ℓt. A sufficient condition for this is that q is small.\n\nA third example where correlated-PCA, and its generalization, correlated-PCA with partial subspace knowledge, occurs is in the subspace update step of Recursive Projected Compressive Sensing (ReProCS) for dynamic robust PCA [3, 5].\n\nIn all three of the above applications, the assumptions on the data-noise correlation matrix given in Assumption 1.2 hold if there are enough changes of a certain type in the set of missing or corrupted entries, Tt. One example where this is true is the case of a 1D object of length s or less that remains static for at most β̃ frames at a time. When it moves, it moves by at least a certain fraction of s pixels. The following assumption is inspired by the object’s support.\n\nAssumption 1.3. Let l denote the number of times the set Tt changes in the interval [1, α] (or in any given interval of length α in case of dynamic robust PCA). So 0 ≤ l ≤ α − 1. Let t0 := 1; let tk, with tk < tk+1, denote the time instants in this interval at which Tt changes; and let T[k] denote the distinct sets. In other words, Tt = T[k] for t ∈ [tk, tk+1), for each k = 0, 1, . . . , l. 
Assume that the following hold with a β < α:\n\n1. (tk+1 − tk) ≤ β̃ and |T[k]| ≤ s;\n2. ρ² β̃ ≤ β, where ρ is the smallest positive integer such that, for any 0 ≤ k ≤ l, T[k] and T[k+ρ] are disjoint;\n3. for any k1, k2 satisfying 0 ≤ k1 < k2 ≤ l, the sets (T[k1] ∖ T[k1+1]) and (T[k2] ∖ T[k2+1]) are disjoint.\n\nAn implicit assumption for condition 3 to hold is that Σ_{k=0}^{l} |T[k] ∖ T[k+1]| ≤ n. Observe that conditions 2 and 3 enforce an upper bound on the maximum support size s.\n\nTo connect Assumption 1.3 with the moving object example given above, condition 1 holds if the object’s size is at most s and if it moves at least once every β̃ frames. Condition 2 holds if, every time it moves, it moves in the same direction and by at least s/ρ pixels. Condition 3 holds if, every time it moves, it moves in the same direction and by at most d0 ≥ s/ρ pixels, with d0 α ≤ n (or, more generally, the motion is such that, if the object were to move at each frame, and if it started at the top of the frame, it would not reach the bottom of the frame in a time interval of length α).\n\nThe following lemma [4] shows that, with Assumption 1.3 on Tt, M2,t = ITt satisfies the assumption on M2,t given in Assumption 1.2. Its proof generalizes the discussion below Assumption 1.2.\n\nLemma 1.4 ([4], Lemmas 5.2 and 5.3). Assume that Assumption 1.3 holds. For any sequence of |Tt| × |Tt| symmetric positive semi-definite matrices At,\n\n‖Σ_{t=1}^{α} ITt At ITt′‖ ≤ (ρ² β̃) max_{t∈[1,α]} ‖At‖ ≤ β max_{t∈[1,α]} ‖At‖.\n\nThus, if ‖ITt′ P‖ ≤ q < 1, then the PCA-missing problem satisfies Assumption 1.2. 
If ‖Ms,t P‖ ≤ q < 1, then the PCA-SDDC problem satisfies Assumption 1.2.\n\nAssumption 1.3 is one model on Tt that ensures that, if M2,t = ITt, the assumption on M2,t given in Assumption 1.2 holds. For its many generalizations, see Supplementary Material, Sec. 7, or [4]. As explained in [18], data-dependent noise also often occurs in molecular biology applications, where the noise affects the measurement levels through the very same process as the interesting signal.\n\n2 Simple EVD\n\nSimple EVD computes the top eigenvectors of the empirical covariance matrix, (1/α) Σ_{t=1}^{α} yt yt′, of the observed data. The following can be shown.\n\nTheorem 2.1 (simple-EVD result). Let P̂ denote the matrix containing all the eigenvectors of (1/α) Σ_{t=1}^{α} yt yt′ with eigenvalues above a threshold, λthresh, as its columns. Pick a ζ so that rζ ≤ 0.01. Suppose that the yt’s satisfy (2) and the following hold.\n\n1. Assumption 1.1 on ℓt holds. Define\n\nα0 := C η² (11 r² log n / (rζ)²) max(f, qf, q²f)², C := 32/0.01².\n\n2. Assumption 1.2 on Mt holds for any α ≥ α0 and for any β satisfying\n\nβ/α ≤ min( ((1 − rζ)/2)² (rζ)² / (4.1 (qf)²), (rζ)/(q²f) ).\n\n3. Set algorithm parameters λthresh = 0.95 λ− and α ≥ α0.\n\nThen, with probability at least 1 − 6n⁻¹⁰, SE(P̂, P) ≤ rζ.\n\nProof: The proof involves a careful application of the sin θ theorem [19] to bound the subspace error, followed by using matrix Hoeffding [20] to obtain high-probability bounds on each of the terms in the sin θ bound. It is given in the Supplementary Material, Section 8.\n\nConsider the lower bound on α. 
We refer to this as the “sample complexity”. Since q < 1, and η is a small constant (e.g., for the uniform distribution, η = 3), for a fixed error level, rζ, α0 simplifies to c f² r² log n. Notice that the dependence on n is logarithmic. It is possible to show that the sample complexity scales as log n because we assume that the ℓt’s are bounded r.v.’s. As a result, we can apply the matrix Hoeffding inequality [20] to bound the perturbation between the observed data’s empirical covariance matrix and that of the true data. The bounded r.v. assumption is actually a more practical one than the usual Gaussian assumption, since most sources of data have finite power.\n\nBy replacing matrix Hoeffding by Theorem 5.39 of [21] in places where one can apply a concentration of measure result to Σ_t at at′ / α (which is an r × r matrix), and by matrix Bernstein [20] elsewhere, it should be possible to further reduce the sample complexity to c max((qf)² r log n, f² (r + log n)). It should also be possible to remove the boundedness assumption and replace it by a Gaussian (or a sub-Gaussian) assumption, but that would increase the sample complexity to c (qf)² n.\n\nConsider the upper bound on β/α. Clearly, the smaller term is the first one. This depends on 1/(qf)². Thus, when f is large and q is not small enough, the required bound may be impractically small. As will be evident from the proof (see Remark 8.3 in Supplementary Material), we get this bound because wt is correlated with ℓt and this results in E[ℓt wt′] ≠ 0. If wt and ℓt were uncorrelated, qf would get replaced by λmax(Cov(wt))/λ− in the upper bound on β/α as well as in the sample complexity.\n\nApplication to PCA-missing and PCA-SDDC. 
By Lemma 1.4, the following is immediate.\n\nFigure 1: Eigenvalue clusters of the three low-rankified videos.\n\nCorollary 2.2. Consider the PCA-missing model, (5), and assume that max_t ‖ITt′ P‖ ≤ q < 1; or consider the PCA-SDDC model, (6), and assume that max_t ‖Ms,t P‖ ≤ q < 1. Assume that everything in Theorem 2.1 holds except that we replace Assumption 1.2 by Assumption 1.3. Then, with probability at least 1 − 6n⁻¹⁰, SE(P̂, P) ≤ rζ.\n\n3 Cluster-EVD\n\nTo try to relax the strong dependence on f² of the result above, we develop a generalization of simple-EVD that we call cluster-EVD. This requires the clustering assumption.\n\n3.1 Clustering assumption\n\nTo state the assumption, define the following partition of the index set {1, 2, . . . , r} based on the eigenvalues of Σ. Let λi denote its i-th largest eigenvalue.\n\nDefinition 3.1 (g-condition-number partition of {1, 2, . . . , r}). Define G1 = {1, 2, . . . , r1}, where r1 is the index for which λ1/λ_{r1} ≤ g and λ1/λ_{r1+1} > g. In words, to define G1, start with the index of the first (largest) eigenvalue and keep adding indices of the smaller eigenvalues to the set until the ratio of the maximum to the minimum eigenvalue first exceeds g.\n\nFor each k > 1, define Gk = {r∗+1, r∗+2, . . . , r∗+rk}, where r∗ = Σ_{i=1}^{k−1} ri and rk is the index for which λ_{r∗+1}/λ_{r∗+rk} ≤ g and λ_{r∗+1}/λ_{r∗+rk+1} > g. In words, to define Gk, start with the index of the (r∗+1)-th eigenvalue, and repeat the above procedure. Stop when λ_{r∗+rk+1} = 0, i.e., when there are no more nonzero eigenvalues. Define ϑ = k as the number of sets in the partition. Thus {G1, G2, . . .
, Gϑ} is the desired partition.\n\nDefine G0 = [.], Gk := (P)_{Gk}, λ+_k := max_{i∈Gk} λi(Λ), λ−_k := min_{i∈Gk} λi(Λ), and\n\nχ := max_{k=1,2,...,ϑ} λ+_{k+1} / λ−_k.\n\nχ quantifies the “distance” between consecutive sets of the above partition. Moreover, by definition, λ+_k / λ−_k ≤ g. Clearly, g ≥ 1 and χ ≤ 1 always. We assume the following.\n\nAssumption 3.2. For a 1 ≤ g+ < f and a 0 ≤ χ+ < 1, assume that there exists a g satisfying 1 ≤ g ≤ g+ and a χ satisfying 0 ≤ χ ≤ χ+, for which we can define a g-condition-number partition of {1, 2, . . . , r} that satisfies χ ≤ χ+. The number of sets in the partition is ϑ. When g+ and χ+ are small, we say that the eigenvalues are “well-clustered” with “clusters” Gk.\n\nThis assumption can be understood as a generalization of the eigen-gap condition needed by the block power method, which is a fast algorithm for obtaining the k top eigenvectors of a matrix [22]. We expect it to hold for data that has variability across different scales. The large-scale variations would result in the first (largest eigenvalues’) cluster and the smaller-scale variations would form the later clusters. This would be true, for example, for video “textures” such as moving waters or waving trees in a forest. We tested this assumption on some such videos. We describe our conclusions here for three videos: “lake” (video of moving lake waters), “waving-tree” (video consisting of waving trees), and “curtain” (video of window curtains moving due to the wind). 
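The partition in Definition 3.1 can be computed by a single greedy pass over the sorted spectrum. A minimal sketch (the function name and the toy spectrum are ours, purely for illustration):

```python
import numpy as np

def cond_number_partition(eigs, g):
    """Greedy g-condition-number partition (Definition 3.1): grow each
    cluster G_k until (first eigenvalue of the cluster) / (next eigenvalue)
    would exceed g, and stop at the first zero eigenvalue."""
    eigs = np.sort(np.asarray(eigs, dtype=float))[::-1]   # descending order
    clusters, start = [], 0
    while start < len(eigs) and eigs[start] > 0:
        end = start + 1
        while end < len(eigs) and eigs[end] > 0 and eigs[start] / eigs[end] <= g:
            end += 1
        clusters.append(list(range(start, end)))          # indices of this cluster
        start = end
    return clusters

# Toy spectrum with variability at two scales (illustrative numbers):
print(cond_number_partition([100, 90, 80, 1.0, 0.9], g=3))  # → [[0, 1, 2], [3, 4]]
```

Given the clusters, ϑ is their number and χ is the maximum, over consecutive clusters, of (largest eigenvalue of the next cluster)/(smallest eigenvalue of the current one); this is how values like the ones reported for the three videos can be obtained.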
For each video, we first made it low-rank by keeping the eigenvectors corresponding to the smallest number of eigenvalues that contain at least 90% of the total energy and projecting the video onto this subspace. For the low-rankified lake video, f = 74 and Assumption 3.2 holds with ϑ = 6 clusters, g+ = 2.6 and χ+ = 0.7. For the waving-tree video, f = 180 and Assumption 3.2 holds with ϑ = 6, g+ = 9.4 and χ+ = 0.72. For the curtain video, f = 107 and the assumption holds with ϑ = 3, g+ = 16.1 and χ+ = 0.5. We show the clusters of eigenvalues in Fig. 1.\n\nAlgorithm 1 Cluster-EVD\nParameters: α, ĝ, λthresh.\nSet Ĝ0 ← [.]. Set the flag Stop ← 0. Set k ← 1.\nRepeat\n1. Let Ĝdet,k := [Ĝ0, Ĝ1, . . . , Ĝk−1] and let Ψk := (I − Ĝdet,k Ĝdet,k′). Notice that Ψ1 = I. Compute D̂k = Ψk ((1/α) Σ_{t=(k−1)α+1}^{kα} yt yt′) Ψk.\n2. Find the k-th cluster, Ĝk: let λ̂i = λi(D̂k);\n(a) find the index r̂k for which λ̂1/λ̂_{r̂k} ≤ ĝ and either λ̂1/λ̂_{r̂k+1} > ĝ or λ̂_{r̂k+1} < λthresh;\n(b) set Ĝk = {r̂∗+1, r̂∗+2, . . . , r̂∗+r̂k}, where r̂∗ := Σ_{j=1}^{k−1} r̂j;\n(c) if λ̂_{r̂k+1} < λthresh, update the flag Stop ← 1.\n3. Compute Ĝk ← eigenvectors(D̂k, r̂k); increment k.\nUntil Stop == 1.\nSet ϑ̂ ← k.\n
Output P̂ ← [Ĝ1 · · · Ĝϑ̂].\neigenvectors(M, r) returns a basis matrix for the span of the top r eigenvectors of M.\n\n3.2 Cluster-EVD algorithm\n\nThe cluster-EVD approach is summarized in Algorithm 1. Its main idea is as follows. We start by computing the empirical covariance matrix of the first set of α observed data points, D̂1 := (1/α) Σ_{t=1}^{α} yt yt′. Let λ̂i denote its i-th largest eigenvalue. To estimate the first cluster, Ĝ1, we start with the index of the first (largest) eigenvalue and keep adding indices of the smaller eigenvalues to it until the ratio of the maximum to the minimum eigenvalue exceeds ĝ, or until the minimum eigenvalue goes below a “zero threshold”, λthresh. Then, we estimate the first cluster’s subspace, range(G1), by computing the top r̂1 eigenvectors of D̂1. To get the second cluster and its subspace, we project the next set of α yt’s orthogonal to Ĝ1, followed by repeating the above procedure. This is repeated for each k > 1. The algorithm stops when λ̂_{r̂k+1} < λthresh.\n\nAlgorithm 1 is related to, but significantly different from, the ones introduced in [3, 5] for the subspace deletion step of ReProCS. The one introduced in [3] assumed that the clusters were known to the algorithm (which is unrealistic). The one studied in [5] has an automatic cluster estimation approach, but one that needs a larger lower bound on α compared to what Algorithm 1 needs.\n\n3.3 Main result\n\nWe give the performance guarantee for Algorithm 1 here. Its parameters are set as follows. We set ĝ to a value that is a little larger than g. This is needed to allow for the fact that λ̂i is not equal to the i-th eigenvalue of Λ but is within a small margin of it. 
For the same reason, we need to also use a nonzero “zeroing” threshold, λthresh, that is larger than zero but smaller than λ−. We set α large enough to ensure that SE(P̂, P) ≤ rζ holds with a high enough probability.\n\nTheorem 3.3 (cluster-EVD result). Consider Algorithm 1. Pick a ζ so that r²ζ ≤ 0.0001 and r²ζf ≤ 0.01. Suppose that the yt’s satisfy (2) and the following hold.\n\n1. Assumption 1.1 and Assumption 3.2 on ℓt hold with χ+ satisfying χ+ ≤ min(1 − rζ − 0.08/0.25, (g+ − 0.0001)/(1.01 g+ + 0.0001) − 0.0001). Define\n\nα0 := C η² (r² (11 log n + log ϑ) / (rζ)²) max(g+, q g+, √(q²f), q √(f g+), (rζ) √(f g+), q (rζ) f, (rζ)² f)², C := (32 · 16)/0.01².\n\n2. Assumption 1.2 on Mt holds with α ≥ α0 and with β satisfying\n\nβ/α ≤ min( ((1 − rζ − χ+)/2)² (rk ζ)² / (4.1 (q g+)²), (rk ζ)/(q²f) ).\n\n3. Set algorithm parameters ĝ = 1.01 g+ + 0.0001, λthresh = 0.95 λ− and α ≥ α0.\n\nThen, with probability at least 1 − 12n⁻¹⁰, SE(P̂, P) ≤ rζ.\n\nProof: The proof is given in Section 9 in Supplementary Material.\n\nWe can also get corollaries for PCA-missing and PCA-SDDC for cluster-EVD. We have given one specific value for ĝ and λthresh in Theorem 3.3 for simplicity. One can, in fact, set ĝ to be anything that satisfies (12) given in Supplementary Material, and one can set λthresh to be anything satisfying 5rζλ− ≤ λthresh ≤ 0.95λ−. Also, it should be possible to reduce the sample complexity of cluster-EVD to c max(q²(g+)² r log n, (g+)²(r + log n)) using the approach explained in Sec. 
2.\n\n4 Discussion\n\nComparing simple-EVD and cluster-EVD. Consider the lower bounds on α. In the cluster-EVD (c-EVD) result, Theorem 3.3, if q is small enough (e.g., if q ≤ 1/√f), and if (r²ζ)f ≤ 0.01, it is clear that the maximum in the max(·) expression is achieved by (g+)². Thus, in this regime, c-EVD needs α ≥ C (r² (11 log n + log ϑ)/(rζ)²) (g+)², and its sample complexity is ϑα. In the EVD result (Theorem 2.1), g+ gets replaced by f and ϑ by 1, and so its sample complexity is α ≥ C (11 r² log n/(rζ)²) f².\n\nIn situations where the condition number f is very large but g+ is much smaller and ϑ is small (the clustering assumption holds well), the sample complexity of c-EVD will be much smaller than that of simple-EVD. However, notice that the lower bound on α for simple-EVD holds for any q < 1 and for any ζ with rζ < 0.01, while the c-EVD lower bound given above holds only when q is small enough, e.g., q = O(1/√f), and ζ is small enough, e.g., rζ = O(1/f). This tighter bound on ζ is needed because the error of the k-th step of c-EVD depends on the errors of the previous steps times f. Secondly, the c-EVD result also needs χ+ and ϑ to be small (clustering assumption holds well), whereas, for simple-EVD, by definition, χ+ = 0 and ϑ = 1. Another thing to note is that the constants in both lower bounds are very large, with the c-EVD one being even larger.\n\nTo compare the upper bounds on β, assume that the same α is used by both, i.e., α = max(α0(EVD), α0(c-EVD)). As long as rk is large enough, χ+ is small enough, and g is small enough, the upper bound on β needed by the c-EVD result is significantly looser. 
For example, if χ+ = 0.2, ϑ = 2, and rk = r/2, then c-EVD needs β ≤ (0.5 · 0.79 · 0.5)² (rζ)²/(4.1 q²g²) α, while simple-EVD needs β ≤ (0.5 · 0.99)² (rζ)²/(4.1 q²f²) α. If g = 3 but f = 100, clearly the c-EVD bound is looser.\n\nComparison with other results for PCA-SDDC and PCA-missing. To our knowledge, there is no other result for correlated-PCA. Hence, we provide comparisons of the corollaries given above for the PCA-missing and PCA-SDDC special cases with works that also study these or related problems. An alternative solution for either PCA-missing or PCA-SDDC is to first recover the entire matrix L and then compute its subspace via SVD on the estimated L. For the PCA-missing problem, this can be done by using any of the low-rank matrix completion techniques, e.g., nuclear norm minimization (NNM) [13] or alternating minimization (Alt-Min-MC) [23]. Similarly, for PCA-SDDC, this can be done by solving any of the recent provably correct RPCA techniques such as principal components’ pursuit (PCP) [14, 15, 16] or alternating minimization (Alt-Min-RPCA) [17].\n\nHowever, as explained earlier, doing the above has two main disadvantages. The first is that it is much slower (see Sec. 5). The difference in speed is most dramatic when solving the matrix-sized convex programs such as NNM or PCP, but even the Alt-Min methods are slower. If we use the time complexity from [17], then finding the span of the top k singular vectors of an n × m matrix takes O(nmk) time. Thus, if ϑ is a constant, both simple-EVD and c-EVD need O(nαr) time, whereas Alt-Min-RPCA needs O(nαr²) time per iteration [17]. The second disadvantage is that the above methods for MC or RPCA need more assumptions to provably correctly recover L. All the above methods need an incoherence assumption on both the left singular vectors, P, and the right singular vectors, V, of L. 
Of course, it is possible that, if one studies these methods with the goal of only\nrecovering the column space of L correctly, the incoherence assumption on the right singular vectors\nis not needed. From simulation experiments (see Sec. 5), the incoherence of the left singular vectors\nis de\ufb01nitely needed. On the other hand, for the PCA-SDDC problem, simple-EVD or c-EVD do not\neven need the incoherence assumption on P .\nThe disadvantage of both EVD and c-EVD, or in fact of any solution for the PCA problem, is that\nthey work only when q is small enough (the corrupting noise is small compared to (cid:96)t).\n\n4.1q2f 2 \u03b1. If g = 3 but f = 100, clearly the c-EVD bound is looser.\n\n7\n\n\fMean Subspace Error (SE)\n\nAverage Execution Time\n\nA-M-RPCA c-EVD\n0.0549\n0.0613\n\n1.0000\n0.4846\n\nEVD\n0.0255\n0.0223\n\nPCP\n0.2361\n1.6784\n\nA-M-RPCA\n\nc-EVD\n0.0908\n0.3626\n\nEVD\n0.0911\n0.3821\n\nPCP\n1.0000\n0.4970\n\n0.0810\n5.5144\n\nExpt 1\nExpt 2\nTable 1: Comparison of SE( \u02c6P , P ) and execution time (in seconds). A-M-RPCA: Alt-Min-RPCA. Expt 1:\nsimulated data, Expt 2: lake video with simulated foreground.\n5 Numerical Experiments\nWe use the PCA-SDDC problem as our case study example. We compare EVD and cluster-EVD\n(c-EVD) with PCP [15], solved using [24], and with Alt-Min-RPCA [17] (implemented using code\nfrom the authors\u2019 webpage). For both PCP and Alt-Min-RPCA, \u02c6P is recovered as the top r eigenvec-\ntors of of the estimated L. To show the advantage of EVD or c-EVD, we let (cid:96)t = P at with columns\nof P being sparse. These were chosen as the \ufb01rst r = 5 columns of the identity matrix. We gen-\nerate at\u2019s iid uniformly with zero mean and covariance matrix \u039b = diag(100, 100, 100, 0.1, 0.1).\nThus the condition number f = 1000. The clustering assumption holds with \u03d1 = 2, g+ = 1 and\n\u03c7+ = 0.001. 
The noise w_t is generated as w_t = I_{T_t} M_{s,t} ℓ_t, with T_t generated to satisfy Assumption 1.3 with s = 5, ρ = 2, and β̃ = 1, and with the entries of M_{s,t} being iid N(0, q^2) with q = 0.01. We used n = 500. EVD and c-EVD (Algorithm 1) were implemented with α = 300, λ_thresh = 0.095, and ĝ = 3. Values of SE(P̂, P) and execution time, averaged over 10000 Monte Carlo repetitions, are shown in the first row of Table 1. Since the columns of P are sparse, both PCP and Alt-Min-RPCA fail: both have average SE close to one, whereas the average SE of c-EVD and EVD is 0.0908 and 0.0911, respectively. Also, both EVD and c-EVD are much faster than the other two. We also ran an experiment with the same settings but with P dense; in this case, the EVD and c-EVD errors were similar to those above, while the PCP and Alt-Min-RPCA errors were less than 10^-5.
For our second experiment, we used the images of a low-rankified real video sequence as the ℓ_t's. We chose the escalator sequence from http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html since the video changes are only in the region where the escalator moves (and hence can be modeled as being sparse). We made it exactly low-rank by retaining its top 5 eigenvectors and projecting onto their subspace. This resulted in a data matrix L of size n × 2α with n = 20800 and α = 50, of rank r = 5. We overlaid a simulated moving foreground block on it; the intensity of the moving block was controlled to ensure that q is small. We used α = 50 for EVD and c-EVD. We let P be the eigenvectors of the low-rankified video with nonzero eigenvalues and computed SE(P̂, P). The errors and execution times are displayed in the second row of Table 1. Since n is very large, the difference in speed is most apparent in this case.
Thus, c-EVD outperforms PCP and Alt-Min-RPCA when the columns of P are sparse.
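To make the first experiment's setup concrete, here is a minimal numerical sketch of the simple-EVD branch. This is not the paper's code: the dimension is shrunk to n = 100 for speed, and the support T_t is drawn uniformly at random instead of via Assumption 1.3's (s, ρ, β̃) pattern; both substitutions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha, s, q = 100, 5, 300, 5, 0.01

# Sparse P: first r columns of the identity; Lambda with condition number f = 1000.
P = np.eye(n)[:, :r]
Lam = np.diag([100.0, 100.0, 100.0, 0.1, 0.1])

# l_t = P a_t, with a_t zero-mean uniform and covariance Lambda: Var(U[-b, b]) = b^2/3.
half_width = np.sqrt(3.0 * np.diag(Lam))
A = rng.uniform(-1, 1, size=(r, alpha)) * half_width[:, None]
L = P @ A

# Data-dependent noise w_t = I_{T_t} M_{s,t} l_t: random size-s support (illustrative),
# M_{s,t} an s x n matrix with iid N(0, q^2) entries applied to l_t.
W = np.zeros((n, alpha))
for t in range(alpha):
    T = rng.choice(n, size=s, replace=False)
    W[T, t] = rng.normal(0, q, size=(s, n)) @ L[:, t]
Y = L + W

# Simple EVD: top-r left singular vectors of Y (eigenvectors of the empirical covariance).
U, _, _ = np.linalg.svd(Y / np.sqrt(alpha), full_matrices=False)
P_hat = U[:, :r]

# Subspace error SE(P_hat, P) = || (I - P_hat P_hat^T) P ||_2
SE = np.linalg.norm(P - P_hat @ (P_hat.T @ P), ord=2)
print(SE)  # small when q is small
```

Because q is small, the data-dependent noise is weak relative to ℓ_t, and the top-r eigenvectors of the empirical covariance recover span(P) to a small subspace error even though P is sparse (no incoherence needed).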
c-EVD also outperforms EVD, but the advantage in mean error is not as large as our theorems predict. One reason is that the constant in the required lower bounds on α is very large: it is hard to pick an α this large that is still only O(log n) unless n is very large. Secondly, both guarantees are only sufficient conditions.
6 Conclusions and Future Work
We studied the problem of PCA in noise that is correlated with the data (data-dependent noise). We obtained sample complexity bounds for the most commonly used PCA solution, simple EVD. We also developed and analyzed a generalization of EVD, called cluster-EVD, that has lower sample complexity under extra assumptions. We provided a detailed comparison of our results with those for other approaches to solving its example applications: PCA with missing data and PCA with sparse data-dependent corruptions.
We used the matrix Hoeffding inequality [20] to obtain our results. As explained in Sec. 2, the sample complexity bounds can be significantly improved if it is replaced by [21, Theorem 5.39] and matrix Bernstein at the appropriate places; we have obtained such a result in ongoing work [25]. Moreover, as done in [5] (for ReProCS), the mutual independence of the ℓ_t's can easily be replaced by the more practical assumption of the ℓ_t's following an autoregressive model, with almost no change to our assumptions. Thirdly, by generalizing the proof techniques developed here, we can also study the problem of correlated-PCA with partial subspace knowledge; the solution to the latter problem helps greatly simplify the proof of correctness of ReProCS for online dynamic RPCA. The boundedness assumption on the ℓ_t's can be replaced by a Gaussian or well-behaved sub-Gaussian assumption, but this will increase the sample complexity to O(n).
Finally, an open-ended question is how to relax Assumption 1.2 on M_t and still get results similar to Theorem 2.1 or Theorem 3.3.

References
[1] B. Nadler, "Finite sample approximation results for principal component analysis: A matrix perturbation approach," The Annals of Statistics, vol. 36, no. 6, 2008.
[2] C. Qiu and N. Vaswani, "Real-time robust principal components' pursuit," in Allerton Conf. on Communication, Control, and Computing, 2010.
[3] C. Qiu, N. Vaswani, B. Lois, and L. Hogben, "Recursive robust PCA or recursive sparse recovery in large but structured noise," IEEE Trans. Info. Th., pp. 5007–5039, August 2014.
[4] B. Lois and N. Vaswani, "Online matrix completion and online robust PCA," in IEEE Intl. Symp. Info. Th. (ISIT), 2015.
[5] J. Zhan, B. Lois, H. Guo, and N. Vaswani, "Online (and Offline) Robust PCA: Novel Algorithms and Performance Guarantees," in Intnl. Conf. Artif. Intell. and Stat. (AISTATS), 2016.
[6] R. Arora, A. Cotter, and N. Srebro, "Stochastic optimization of PCA with capped MSG," in Adv. Neural Info. Proc. Sys. (NIPS), 2013, pp. 1815–1823.
[7] O. Shamir, "A stochastic PCA and SVD algorithm with an exponential convergence rate," arXiv:1409.2848, 2014.
[8] C. Boutsidis, D. Garber, Z. Karnin, and E. Liberty, "Online principal components analysis," in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2015, pp. 887–901.
[9] A. Balsubramani, S. Dasgupta, and Y. Freund, "The fast convergence of incremental PCA," in Adv. Neural Info. Proc. Sys. (NIPS), 2013, pp. 3174–3182.
[10] Z. Karnin and E. Liberty, "Online PCA with spectral bounds," in Proc. Conference on Computational Learning Theory (COLT), 2015, pp. 505–509.
[11] I. Mitliagkas, C. Caramanis, and P. Jain, "Memory limited, streaming PCA," in Adv. Neural Info. Proc. Sys. (NIPS), 2013, pp. 2886–2894.
[12] M. Fazel, "Matrix rank minimization with applications," PhD thesis, Stanford Univ., 2002.
[13] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., no. 9, pp. 717–772, 2008.
[14] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of ACM, vol. 58, no. 3, 2011.
[15] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, "Rank-sparsity incoherence for matrix decomposition," SIAM Journal on Optimization, vol. 21, 2011.
[16] D. Hsu, S. M. Kakade, and T. Zhang, "Robust matrix decomposition with sparse corruptions," IEEE Trans. Info. Th., Nov. 2011.
[17] P. Netrapalli, U. N. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain, "Non-convex robust PCA," in Adv. Neural Info. Proc. Sys. (NIPS), 2014.
[18] J. Gillberg, P. Marttinen, M. Pirinen, A. J. Kangas, P. Soininen, M. Ali, A. S. Havulinna, M.-R. Järvelin, M. Ala-Korpela, and S. Kaski, "Multiple output regression with latent noise," Journal of Machine Learning Research, 2016.
[19] C. Davis and W. M. Kahan, "The rotation of eigenvectors by a perturbation. III," SIAM J. Numer. Anal., vol. 7, pp. 1–46, Mar. 1970.
[20] J. A. Tropp, "User-friendly tail bounds for sums of random matrices," Found. Comput. Math., vol. 12, no. 4, 2012.
[21] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," Compressed Sensing, pp. 210–268, 2012.
[22] G. H. Golub and H. A. Van der Vorst, "Eigenvalue computation in the 20th century," Journal of Computational and Applied Mathematics, vol. 123, no. 1, pp. 35–65, 2000.
[23] P. Netrapalli, P. Jain, and S. Sanghavi, "Low-rank matrix completion using alternating minimization," in Symposium on Theory of Computing (STOC), 2013.
[24] Z. Lin, M. Chen, and Y. Ma, "Alternating direction algorithms for l1 problems in compressive sensing," Tech. Rep., University of Illinois at Urbana-Champaign, November 2009.
[25] N. Vaswani, "PCA in Data-Dependent Noise (Correlated-PCA): Improved Finite Sample Guarantees," to be posted on arXiv, 2017.