{"title": "Robust PCA with compressed data", "book": "Advances in Neural Information Processing Systems", "page_first": 1936, "page_last": 1944, "abstract": "The robust principal component analysis (RPCA) problem seeks to separate low-rank trends from sparse outliers within a data matrix, that is, to approximate an $n\\times d$ matrix $D$ as the sum of a low-rank matrix $L$ and a sparse matrix $S$. We examine the RPCA problem under data compression, where the data $Y$ is approximately given by $(L + S)\\cdot C$, that is, a low-rank $+$ sparse data matrix that has been compressed to size $n\\times m$ (with $m$ substantially smaller than the original dimension $d$) via multiplication with a compression matrix $C$. We give a convex program for recovering the sparse component $S$ along with the compressed low-rank component $L\\cdot C$, together with upper bounds on the error of this reconstruction that scale naturally with the compression dimension $m$ and coincide with existing results for the uncompressed setting $m=d$. Our results can also handle error introduced through additive noise or through missing data. The scaling of dimension, compression, and signal complexity in our theoretical results is verified empirically through simulations, and we also apply our method to a data set measuring chlorine concentration across a network of sensors, to test its performance in practice.", "full_text": "Robust PCA with compressed data\n\nWooseok Ha\nUniversity of Chicago\nhaywse@uchicago.edu\n\nRina Foygel Barber\nUniversity of Chicago\nrina@uchicago.edu\n\nAbstract\n\nThe robust principal component analysis (RPCA) problem seeks to separate low-rank trends from sparse outliers within a data matrix, that is, to approximate an n × d matrix D as the sum of a low-rank matrix L and a sparse matrix S. 
We examine the robust principal component analysis (RPCA) problem under data compression, where the data Y is approximately given by (L + S)·C, that is, a low-rank + sparse data matrix that has been compressed to size n × m (with m substantially smaller than the original dimension d) via multiplication with a compression matrix C. We give a convex program for recovering the sparse component S along with the compressed low-rank component L·C, together with upper bounds on the error of this reconstruction that scale naturally with the compression dimension m and coincide with existing results for the uncompressed setting m = d. Our results can also handle error introduced through additive noise or through missing data. The scaling of dimension, compression, and signal complexity in our theoretical results is verified empirically through simulations, and we also apply our method to a data set measuring chlorine concentration across a network of sensors to test its performance in practice.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is a tool for providing a low-rank approximation to a data matrix D ∈ R^{n×d}, with the aim of reducing dimension or capturing the main directions of variation in the data. More recently, there has been increased focus on more general forms of PCA that are more robust to realistic flaws in the data such as heavy-tailed outliers. The robust PCA (RPCA) problem formulates a decomposition of the data,\n\nD ≈ L + S ,\n\ninto a low-rank component L (capturing trends across the data matrix) and a sparse component S (capturing outlier measurements that may obscure the low-rank trends), which we seek to separate based only on observing the data matrix D [3, 10]. Depending on the application, we may be primarily interested in one or the other component:\n\n• In some settings, the sparse component S may represent unwanted outliers, e.g. 
corrupted measurements — we may wish to clean the data by removing the outliers and recovering the low-rank component L.\n\n• In other settings, the sparse component S may contain the information of interest — for instance, in image or video data, S may capture the foreground objects which are of interest, while L may capture background components which we wish to subtract.\n\nExisting methods to separate the sparse and low-rank components include convex [3, 10] and non-convex [9] methods, and can handle extensions or additional challenges such as missing data [3], column-sparse rather than elementwise-sparse structure [11], streaming data [6, 7], and different types of structures superimposed with a low-rank component [1].\n\nIn this paper, we examine the possibility of demixing sparse and low-rank structure, under the additional challenge of working with data that has been compressed,\n\nY = D·C ≈ (L + S)·C ∈ R^{n×m} ,\n\nwhere L, S ∈ R^{n×d} comprise the (approximately) low-rank and (approximately) sparse components of the original data matrix D, while C ∈ R^{d×m} is a random or fixed compression matrix. In general, we think of the compression dimension m as being significantly smaller than d, motivated by several considerations:\n\n• Communication constraints: if the n × d data matrix consists of d-dimensional measurements taken at n remote sensors, compression would allow the sensors to transmit information of dimension m ≪ d;\n\n• Storage constraints: storing a matrix with nm many entries instead of nd many entries;\n\n• Data privacy: if the data is represented as the n × d matrix, where n-dimensional features were collected from d individuals, we can preserve privacy by compressing the data by a random linear transformation and allowing access to the database only through the compressed data. 
This privacy-preserving method has been called matrix masking in the privacy literature and was studied by [12] in the context of high-dimensional linear regression.\n\nRandom projection methods have been shown to be highly useful for reducing dimensionality without much loss of accuracy for numerical tasks such as least squares regression [8] or low-rank matrix computations [5]. Here we use random projections to compress data while preserving the information about the underlying low-rank and sparse structure. [13] also applied random projection methods to the robust PCA problem, but their purpose is to accelerate the computational task of low-rank approximation, which is different from the aim of our work.\n\nIn the compressed robust PCA setting, we hope to learn about both the low-rank and sparse components. Unlike compressed sensing problems where sparse structure may be reconstructed perfectly with undersampling, here we face a different type of challenge:\n\n• The sparse component S is potentially identifiable from the compressed component S·C, using the tools of compressed sensing; however,\n\n• The low-rank component L is not identifiable from its compression L·C. Specifically, if we let P_C ∈ R^{d×d} be the projection operator onto the column span of C, then the two low-rank matrices L and L′ = L·P_C cannot be distinguished after multiplication by C.\n\nTherefore, our goal will be to recover both the sparse component S, and the compressed low-rank component L·C. Note that recovering L·C is similar to the goal of recovering the column span of L, which may be a useful interpretation if we think of the columns of the data matrix D as data points lying in R^n; the column span of L characterizes a low-rank subspace of R^n that captures the main trends in the data.\n\nNotation We will use the following notation throughout the paper. We write [n] = {1, . . . , n} for any n ≥ 1. 
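The non-identifiability of L noted above can be checked numerically. Below is a minimal numpy sketch (the dimensions are arbitrary illustrative choices): the matrices L and L·P_C differ, yet produce exactly the same compressed data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 50, 10

# a rank-3 matrix L and a (Gaussian-style) compression matrix C
L = rng.standard_normal((n, 3)) @ rng.standard_normal((3, d))
C = rng.standard_normal((d, m)) / np.sqrt(m)

# P_C: orthogonal projection onto the column span of C
Q, _ = np.linalg.qr(C)   # d x m, orthonormal columns spanning col(C)
P_C = Q @ Q.T            # d x d projector

L_proj = L @ P_C                              # a genuinely different matrix ...
assert np.linalg.norm(L - L_proj) > 1.0
assert np.allclose(L @ C, L_proj @ C)         # ... indistinguishable after compression
```

Since C's columns lie in the span of Q, we have P_C·C = C, so L·C = (L·P_C)·C exactly, which is the identifiability obstruction described above.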
We write ||v||_0 or ||M||_0 to denote the number of nonzero entries in a vector v or matrix M (note that this is not in fact a norm). M_i· denotes the ith row of a matrix M and is treated as a column vector. We will use the matrix norms ||M||_F (Frobenius norm), ||M||_1 (elementwise ℓ1 norm), ||M||_∞ (elementwise ℓ∞ norm), ||M|| (spectral norm, i.e. largest singular value), and ||M||_* (nuclear norm, also known as the trace norm, given by the sum of the singular values of M).\n\n2 Problem and method\n\nWe begin by formally defining the problem at hand. The data, which takes the form of an n × d matrix, is well-approximated by a sum L* + S*, where L* is low-rank and S* is sparse. However, we can only access this data through a (noisy) compression: our observed data is the n × m matrix\n\nY = (L* + S*)·C + Z ,  (1)\n\nwhere C ∈ R^{d×m} is the compression matrix, and Z ∈ R^{n×m} absorbs all sources of error and noise — we discuss specific models for Z later on.\n\nGiven this model, our goal will be to learn about both the low-rank and sparse structure. In the ordinary robust PCA setting, the task of separating the low-rank and sparse components is known to be possible when the underlying low-rank component L* satisfies certain conditions, e.g. the incoherence condition in [3] or the spikiness condition in [1]. In order to successfully decompose the low-rank and sparse components in the compressed data, we thus need similar conditions to hold for the compressed low-rank component, which we define as the product P* := L*·C. As we will see, if L* satisfies the spikiness condition, i.e. ||L*||_∞ ≤ α_0, then the compressed low-rank component P* satisfies a similar spikiness condition, i.e. a bound on ||P*C⊤||_∞. 
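To make the observation model (1) concrete, here is a small numpy sketch that generates Y = (L* + S*)·C + Z under a Gaussian compression matrix; the dimensions, sparsity level, and noise scale are arbitrary illustrative choices, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, r, s = 30, 100, 25, 3, 5    # s = nonzeros per row of S*

# low-rank component L* and row-sparse outlier component S*
L_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
S_star = np.zeros((n, d))
for i in range(n):
    cols = rng.choice(d, size=s, replace=False)
    S_star[i, cols] = 5 * rng.standard_normal(s)   # large outlier entries

# compression matrix (Gaussian model) and additive noise
C = rng.standard_normal((d, m)) / np.sqrt(m)       # E[C C^T] = I_d
Z = 0.1 * rng.standard_normal((n, m))

Y = (L_star + S_star) @ C + Z      # observed n x m compressed data
P_star = L_star @ C                # compressed low-rank target P* = L* C
```

Note that the recovery targets are S* and P* = L*·C, not L* itself, for the identifiability reasons discussed in the introduction.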
This motivates the possibility of recovering both the low-rank and sparse components in the case of compressed data.\n\nAs discussed above, while we can aim to recover the sparse component S*, there is no hope of recovering the original low-rank component L*, since L* is not identifiable in the compressed model. Therefore, we propose a natural convex program for recovering the underlying compressed low-rank component P* = L*·C and the sparse component S*. Note that as discussed in [5], random projection preserves the column span of L*, and so we can recover the column span of L* via P*. We define our estimators of the sparse component S*, and the low-rank product P*, as follows:\n\n(P̂, Ŝ) = arg min_{(P,S) : ||PC⊤||_∞ ≤ α} { (1/2)||Y − P − S·C||_F^2 + ν||P||_* + λ||S||_1 } .  (2)\n\nNote that we impose the spikiness condition ||PC⊤||_∞ ≤ α on P in order to guarantee good performance for demixing the two superimposed components — in a later section, we will see that the same condition holds for P*. This method is parametrized by the triple (α, ν, λ), and natural scalings for these tuning parameters are discussed alongside our theoretical results.\n\n2.1 Sources of errors and noise\n\nNext, we give several examples of models and interpretations for the error term Z in (1).\n\nRandom noise First, we may consider a model where the signal has an exact low-rank + sparse decomposition, with well-behaved additive noise added before and/or after the compression step:\n\nY = (L* + S* + Z_pre)·C + Z_post ,\n\nwhere the entries of the pre- and post-compression noise, Z_pre and Z_post, are i.i.d. mean-zero subgaussian random variables. In this case, the noise term Z in (1) is given by Z = Z_pre·C + Z_post.\n\nMisspecified model Next, we may consider a case where the original data can be closely approximated by a low-rank + sparse decomposition, but this decomposition is not exact. 
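A minimal solver sketch for a program of the form (2): the experiments section solves (2) by alternating minimization, and one simple variant — shown here under the assumption that the spikiness constraint is dropped (α = ∞, as in the paper's experiments) — alternates an exact singular-value-thresholding step in P with a proximal gradient step in S. The step size and iteration count below are illustrative choices, not the paper's implementation.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(sig - tau, 0)[:, None] * Vt)

def soft(M, tau):
    """Entrywise soft thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def compressed_rpca(Y, C, nu, lam, n_iters=200):
    """Alternate an exact nuclear-norm prox in P with a proximal gradient
    step in S for the objective 0.5||Y - P - S C||_F^2 + nu||P||_* + lam||S||_1."""
    n, m = Y.shape
    d = C.shape[0]
    P = np.zeros((n, m))
    S = np.zeros((n, d))
    eta = 1.0 / np.linalg.norm(C, 2) ** 2   # 1/Lipschitz constant of the S-gradient
    for _ in range(n_iters):
        P = svt(Y - S @ C, nu)              # exact minimizer over P given S
        grad_S = (P + S @ C - Y) @ C.T      # gradient of the quadratic in S
        S = soft(S - eta * grad_S, eta * lam)
    return P, S
```

Both steps monotonically decrease the objective, since each is an exact prox or a prox-gradient step with step size 1/L; tuning ν and λ at the scalings given in the theorems is a separate matter.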
In this case, we could express the original (uncompressed) data as L* + S* + Z_model, where Z_model captures the error of the low-rank + sparse decomposition. Then this model misspecification can be absorbed into the noise term Z, i.e. Z = Z_model·C.\n\nMissing data Given an original data matrix D = L* + S*, we might have access only to a partial version of this matrix. We write D_Ω to denote the available data, where Ω ⊆ [n] × [d] indexes the entries where data is available, and (D_Ω)_ij = D_ij·1{(i,j)∈Ω}. Then, a low-rank + sparse model for our compressed data is given by\n\nY = D_Ω·C = (L* + S*_Ω)·C + Z_missing·C ,\n\nwhere Z_missing = L*_Ω − L*. In some settings, we may first want to adjust D_Ω before compressing the data, for instance, by reweighting the observed entries in D_Ω to ensure a closer approximation to D. Denoting the reweighted matrix of partial observations by D̃_Ω, we have compressed data\n\nY = D̃_Ω·C = (L* + S̃*_Ω)·C + Z_missing·C ,\n\nwith Z_missing = L̃*_Ω − L*, and where S̃*_Ω is the reweighted matrix of S*_Ω. Then the error from the missing data can be absorbed into the Z term, i.e. Z = Z_missing·C.\n\nCombinations Finally, the observed data Y may differ from the compressed low-rank + sparse decomposition (L* + S*)·C due to a combination of the factors above, in which case we may write\n\nZ = (Z_pre + Z_model + Z_missing)·C + Z_post .\n\n2.2 Models for the compression matrix C\n\nNext, we consider several scenarios for the compression matrix C.\n\nRandom compression In some settings, the original data naturally lies in R^{n×d}, but is compressed by the user for some purpose. For instance, if we have data from d individuals, with each data point lying in R^n, we may compress this data for the purpose of providing privacy to the individuals in the data set. 
Alternately, we may compress data to adhere to constraints on communication bandwidth or on data storage. In either case, we control the choice of the compression matrix C, and are free to use a simple random model. Here we consider two models:\n\nGaussian model: the entries of C are generated as C_ij iid∼ N(0, 1/m).  (3)\n\nOrthogonal model: C = √(d/m)·U, where U ∈ R^{d×m} has orthonormal columns and is chosen uniformly at random.  (4)\n\nNote that in each case, E[CC⊤] = I_d.\n\nMultivariate regression / multitask learning In a multivariate linear regression, we observe a matrix of data Y that follows a model Y = X·B + W, where X is an observed design matrix, B is an unknown matrix of coefficients (generally the target parameter), and W is a matrix of noise terms. Often, the rows of Y are thought of as (independent) samples, where each row is a multivariate response. In this setting, the accuracy of the regression can often be improved by leveraging low-rank or sparse structure that arises naturally in the matrix of coefficients B. If B is approximately low-rank + sparse, the methodology of this paper can be applied: taking the transpose of the multivariate regression model, we have Y⊤ = B⊤·X⊤ + W⊤. Comparing to our initial model (1), we replace Y with Y⊤ and use the compression matrix C = X⊤. Then, if B⊤ ≈ L* + S* is a low-rank + sparse approximation, the multivariate regression can be formulated as a problem of the form (1) by setting the error term to equal Z = (B⊤ − L* − S*)·X⊤ + W⊤.\n\n3 Theoretical results\n\nIn this section, we develop theoretical error bounds for the compressed robust PCA problem under several of the scenarios described above. 
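The two compression models (3)-(4) can be generated as follows. Drawing U via the QR factorization of a Gaussian matrix is one standard way to obtain a (near-)uniformly random orthonormal frame; it is an implementation choice here, not the paper's prescription.

```python
import numpy as np

def gaussian_compression(d, m, rng):
    """Gaussian model (3): C_ij ~ N(0, 1/m), so that E[C C^T] = I_d."""
    return rng.standard_normal((d, m)) / np.sqrt(m)

def orthogonal_compression(d, m, rng):
    """Orthogonal model (4): C = sqrt(d/m) * U, with U a random d x m
    matrix with orthonormal columns (here via QR of a Gaussian matrix)."""
    G = rng.standard_normal((d, m))
    U, _ = np.linalg.qr(G)
    return np.sqrt(d / m) * U

rng = np.random.default_rng(0)
d, m = 200, 50
C_orth = orthogonal_compression(d, m, rng)
# for the orthogonal model, C^T C = (d/m) I_m holds exactly
assert np.allclose(C_orth.T @ C_orth, (d / m) * np.eye(m))
```

For the Gaussian model, E[CC⊤] = I_d holds only in expectation, whereas the orthogonal model satisfies C⊤C = (d/m)·I_m exactly, which is one reason it is convenient in the simulations.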
We first give a general deterministic result in Section 3.1, then specialize this result to handle scenarios of pre- and post-compression noise and missing data. Results for multivariate regression are given in the Supplementary Materials.\n\n3.1 Deterministic result\n\nWe begin by stating a version of the Restricted Eigenvalue property found in the compressed sensing and sparse regression literature [2]:\n\nDefinition 1. For a matrix X ∈ R^{m×d} and for c_1, c_2 ≥ 0, X satisfies the restricted eigenvalue property with constants (c_1, c_2), denoted by RE_{m,d}(c_1, c_2), if\n\n||Xv||_2 ≥ c_1·||v||_2 − c_2·√(log(d)/m)·||v||_1 for all v ∈ R^d .\n\nWe now give our main result for the accuracy of the convex program (2), a theorem that we will see can be specialized to many of the settings described earlier. This theorem gives a deterministic result and does not rely on a random model for the compression matrix C or the error matrix Z.\n\nTheorem 1. Let L* ∈ R^{n×d} be any matrix with rank(L*) ≤ r, and let S* ∈ R^{n×d} be any matrix with at most s nonzero entries per row, that is, max_i ||S*_i·||_0 ≤ s. Let C ∈ R^{d×m} be any compression matrix and define the data Y and the error/noise term Z as in (1). Let P* = L*·C as before. Suppose that C⊤ satisfies RE_{m,d}(c_1, c_2), where c_0 := c_1 − c_2·√(16 s log(d)/m) > 0. If the parameters (α, ν, λ) satisfy\n\nα ≥ ||L*CC⊤||_∞ ,  ν ≥ 2||Z|| ,  λ ≥ 2||ZC⊤||_∞ + 4α ,  (5)\n\nthen deterministically, the solution (P̂, Ŝ) to the convex program (2) satisfies\n\n||P̂ − P*||_F^2 + c_0^2·||Ŝ − S*||_F^2 ≤ 18rν^2 + 9c_0^{-2}·snλ^2 .  (6)\n\nWe now highlight several applications of this theorem to specific settings: a random compression model with Gaussian or subgaussian noise, and a random compression model with missing data. 
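As an empirical sanity check on Definition 1, one can test the restricted eigenvalue inequality for X = C⊤ under the Gaussian compression model. The constants c_1 = 0.5 and c_2 = 1 below are illustrative guesses, not the constants established in the paper's proofs, and the test vectors are a small random probe rather than a certificate over all of R^d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 400, 100
C = rng.standard_normal((d, m)) / np.sqrt(m)   # Gaussian model
X = C.T                                        # check RE for X = C^T
c1, c2 = 0.5, 1.0                              # illustrative constants

def re_holds(v):
    lhs = np.linalg.norm(X @ v)
    rhs = c1 * np.linalg.norm(v) - c2 * np.sqrt(np.log(d) / m) * np.linalg.norm(v, 1)
    return lhs >= rhs

# probe with dense random directions and with sparse directions
trials = [rng.standard_normal(d) for _ in range(20)]
for _ in range(20):
    v = np.zeros(d)
    idx = rng.choice(d, size=5, replace=False)
    v[idx] = rng.standard_normal(5)
    trials.append(v)

violations = sum(not re_holds(v) for v in trials)
```

For dense v the ℓ1 penalty term makes the right-hand side strongly negative, so the inequality is binding only for very sparse v — exactly the regime the definition is designed to control.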
(An application to the multivariate linear regression model is given in the Supplementary Materials.)\n\n3.2 Results for random compression with subgaussian noise\n\nSuppose the compression matrix C is random, and that the error term Z in the model (1) comes from i.i.d. subgaussian noise, e.g. measurement error that takes place before and/or after the compression:\n\nZ = Z_pre·C + Z_post .\n\nOur model for this setting is as follows: for fixed matrices L* and S*, where rank(L*) ≤ r and max_i ||S*_i·||_0 ≤ s, we observe data\n\nY = (L* + S* + Z_pre)·C + Z_post ,  (7)\n\nwhere the compression matrix C is generated under either the Gaussian (3) or orthogonal (4) model, and where the noise matrices Z_pre, Z_post are independent from each other and from C, with entries\n\n(Z_pre)_ij iid∼ N(0, σ_pre^2) and (Z_post)_ij iid∼ N(0, σ_post^2) .  (8)\n\nFor this section, we assume d ≥ m without further comment (that is, the compression should reduce the dimension of the data). Let σ_max^2 = max{σ_pre^2, σ_post^2}. Specializing the result of Theorem 1 to this setting, we obtain the following probabilistic guarantee:\n\nTheorem 2. Assume the model (7). Suppose that rank(L*) ≤ r, max_i ||S*_i·||_0 ≤ s, and ||L*||_∞ ≤ α_0. Then there exist universal constants c, c′, c″ > 0 such that if we define\n\nα = 5α_0·√(d log(nd)/m) ,  ν = 24σ_max·√(d(n+m)/m) ,  λ = 32σ_max·√(d log(nd)/m) + 4α ,\n\nand if m ≥ c·s log(nd), then the solution (P̂, Ŝ) to the convex program (2) satisfies\n\n||P̂ − P*||_F^2 + ||Ŝ − S*||_F^2 ≤ c′·(d/m)·( σ_max^2·r(n+m) + (σ_max^2 + α_0^2)·sn log(nd) )\n\nwith probability at least 1 − c″/(nd).\n\nRemark 1. If the entries of Z_pre and Z_post are subgaussian rather than Gaussian, then the same result holds, except for a change in the constants appearing in the parameters (α, ν, λ). 
(Recall that a random variable X is σ²-subgaussian if E[e^{tX}] ≤ e^{t²σ²/2} for all t ∈ R.)\n\nRemark 2. In the case d = m, our result matches Corollary 2 in Agarwal et al. [1] exactly, except that our result involves a multiplicative logarithmic factor log(nd) in the α_0 term whereas theirs does not.¹ This additional log factor arises when we upper bound ||L*CC⊤||_∞, which is unavoidable if we want the bound to hold with high probability.\n\nRemark 3. Theorem 2 shows the natural scaling: the first term r(n+m) is the degrees of freedom of the compressed rank-r matrix P, whereas the term sn log(nd) is the signal complexity of the sparse component S, which has sn many nonzero entries. The multiplicative factor (d/m)·σ_max^2 can be interpreted as the noise variance of the problem, amplified by the compression.\n\n3.3 Results for random compression with missing data\n\nNext, we consider a missing data scenario where the original n × d matrix is only partially observed. The original (complete) data is D = L* + S* ∈ R^{n×d}, a low-rank + sparse decomposition.² However, only a subset Ω ⊆ [n] × [d] of entries are observed — we are given access to D_ij for each (i, j) ∈ Ω. After a reweighting step, we compress this data with a compression matrix C ∈ R^{d×m}, for instance, in order to reduce communication, storage, or computation requirements.\n\n¹Note that s·n in our paper is equivalent to s in [1], since their work defines s to be the total number of nonzero entries in S* while we count entries per row.\n\n²For clarity of presentation, we do not include additive noise before or after compression in this section. However, our theoretical analysis for additive noise (Theorem 2) and for missing data (Theorem 3) can be combined in a straightforward way to obtain an error bound scaling as a sum of the two respective bounds.\n\nFirst, we specify a model for the missing data. 
For each (i, j) ∈ [n] × [d], let ρ_ij ∈ [0, 1] be the probability that this entry is observed. Additionally, we assume that the sampling scheme is independent across all entries, and that the ρ_ij's are known.³\n\nTo proceed, we first define a reweighted version of the partially observed data matrix and then multiply by the compression matrix C:\n\nY = D̃_Ω·C where (D̃_Ω)_ij = D_ij/ρ_ij · 1{(i,j)∈Ω} .  (9)\n\nDefine also the reweighted versions of the low-rank and sparse components,\n\n(L̃*_Ω)_ij = L_ij/ρ_ij · 1{(i,j)∈Ω} and (S̃*_Ω)_ij = S_ij/ρ_ij · 1{(i,j)∈Ω} ,\n\nand note that we then have\n\nY = (L̃*_Ω + S̃*_Ω)·C = (L* + S̃*_Ω)·C + Z ,  (10)\n\nwhere Z = (L̃*_Ω − L*)·C. The role of the reweighting step (9) is to ensure that this noise term Z has mean zero. Note that in the reformulation (10) of the model, Y is approximated with a compression of L* + S̃*_Ω, where L* is the original low-rank component while S̃*_Ω is defined above. While the original sparse component S* is not identifiable via the missing data model (since we have no information to help us recover entries S*_ij for (i, j) ∉ Ω), this new decomposition L* + S̃*_Ω has a sparse component that is identifiable, since by definition, S̃*_Ω preserves the sparsity of S* but has no nonzero entries in unobserved locations, that is, (S̃*_Ω)_ij = 0 whenever (i, j) ∉ Ω.\n\nWith this model in place, we obtain the following probabilistic guarantee for this setting, which is another specialized version of Theorem 1. We note that we again have no assumptions on the values of the entries in S*, only on the sparsity level — e.g. there is no bound assumed on ||S*||_∞.\n\nTheorem 3. Assume the model (9). Suppose that rank(L*) ≤ r, max_i ||S*_i·||_0 ≤ s, and ||L*||_∞ ≤ α_0. If the sampling scheme satisfies ρ_ij ≥ ρ_min for all (i, j) ∈ [n] × [d] for some positive constant ρ_min > 0, then there exist universal constants c, c′, c″ > 0 such that if we define\n\nα = 5α_0·√(d log(nd)/m) ,  ν = 10ρ_min⁻¹α_0·√(d(n+m) log(nd)/m) ,  λ = 12ρ_min⁻¹α_0·√(d log²(nd)/m) + 4α ,\n\nand if m ≥ c·s log(nd), then the solution (P̂, Ŝ) to the convex program (2) satisfies\n\n||P̂ − P*||_F^2 + ||Ŝ − S̃*_Ω||_F^2 ≤ c′·(d/m)·ρ_min⁻²·α_0^2·( r(n+m) log(nd) + sn log²(nd) )\n\nwith probability at least 1 − c″/(nd).\n\n4 Experiments\n\nIn this section, we first use simulated data to study the behavior of the convex program (2) for different compression dimensions, signal complexities, and levels of missingness, which shows close agreement with the scaling predicted by our theory. We also apply our method to a data set consisting of chlorine measurements across a network of sensors. For simplicity, in all experiments, we select α = ∞, which is easier for optimization and generally results in a solution that still has low spikiness (that is, the solution is the same as if we had imposed a bound with finite α).\n\n4.1 Simulated data\n\nHere we run a series of simulations on compressed data to examine the performance of the convex program (2). In all cases, we used the compression matrix C generated under the orthogonal model (4). We solve the convex program (2) via alternating minimization over L and S, selecting the regularization parameters ν and λ that minimize the squared Frobenius error. All results are averaged over 5 trials.\n\n³In practice, the assumption that the ρ_ij's are known is not prohibitive. For example, we might model ρ_ij = α_i·β_j (the row and column locations of the observed entries are chosen independently, e.g. 
see [4]), or a logistic model, log( ρ_ij/(1 − ρ_ij) ) = α_i + β_j. In either case, fitting a model using the observed set Ω is extremely accurate.\n\nFigure 1: Results for the noisy data experiment. The total squared error, calculated as in Theorem 2, is plotted against the compression ratio d/m, for n = d = 400 and n = d = 800. Note the linear scaling, as predicted by the theory.\n\nFigure 2: Results for the varying-rank (top row) and varying-sparsity (bottom row) experiments. The total squared error, calculated as in Theorem 2, is plotted against the rank r or the sparsity proportion s/d, for n = d = 200 and n = d = 400 and several compression dimensions m. Note the nearly linear scaling for most values of m.\n\nSimulation 1: compression ratio. First we examine the role of the compression dimension m. We fix the matrix dimension n = d ∈ {400, 800}. The low-rank component is given by L* = √r·UV⊤, where U and V are n × r and d × r matrices with i.i.d. N(0, 1) entries, for rank r = 10. 
The sparse component S* has 1% of its entries generated as 5·N(0, 1), that is, s = 0.01d. The data is D = L* + S* + Z, where Z_ij iid∼ N(0, 0.25). Figure 1 shows the squared Frobenius error ||P̂ − P*||_F^2 + ||Ŝ − S*||_F^2 plotted against the compression ratio d/m. We see the error scaling linearly with the compression ratio, which supports our theoretical results.\n\nSimulation 2: rank and sparsity. Next we study the role of rank and sparsity, for a matrix of size n = d = 200 or n = d = 400. We generate the data D as before, but we either vary the rank r ∈ {5, 10, . . . , 50}, or we vary the sparsity s with s/d ∈ {0.01, 0.02, . . . , 0.1}. Figure 2 shows the squared Frobenius error plotted against either the varying rank or the varying sparsity. We repeat this experiment for several different compression dimensions m. We see a slight deviation from linear scaling for the smallest m, which may be due to the fact that our theorems give upper bounds rather than tight matching upper and lower bounds (or perhaps because the smallest value of m does not satisfy the condition stated in the theorems). However, for all but the smallest m, we see error scaling nearly linearly with rank or with sparsity, which is consistent with our theory.\n\nSimulation 3: missing data. Finally, we perform experiments in the presence of missing entries in the data matrix D = L* + S*. We fix dimensions n = d = 400 and generate L* and S* as before, with r = 10 and s = 0.01d, but do not add noise. 
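Our reading of the simulated-data generation shared by Simulations 1-3, as a numpy sketch: the noise is included for Simulations 1-2 and switched off for Simulation 3, and the factor √r follows the text's L* = √r·UV⊤ (dimensions are scaled down here so the example runs quickly).

```python
import numpy as np

def make_instance(n, d, r=10, sparse_frac=0.01, noise_sd=0.5, rng=None):
    """Simulated data, our reading of the paper's setup:
    L* = sqrt(r) * U V^T with U, V having i.i.d. N(0,1) entries;
    S* has a sparse_frac fraction of entries drawn as 5 * N(0,1);
    Z_ij ~ N(0, noise_sd^2), with noise_sd=0 for the missing-data runs."""
    rng = rng if rng is not None else np.random.default_rng(0)
    L = np.sqrt(r) * rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
    S = np.zeros((n, d))
    mask = rng.random((n, d)) < sparse_frac
    S[mask] = 5 * rng.standard_normal(mask.sum())
    Z = noise_sd * rng.standard_normal((n, d))
    return L, S, L + S + Z

L_star, S_star, D = make_instance(n=100, d=100)          # Simulations 1-2 style
L3, S3, D3 = make_instance(n=100, d=100, noise_sd=0.0)   # Simulation 3: no noise
```

The compressed observations would then be Y = D·C for a compression matrix C drawn from the orthogonal model (4), as stated in the experiments setup.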
To introduce the missing entries in the data, we use a uniform sampling scheme, where each entry of D is observed with probability ρ, with ρ ∈ {0.1, 0.2, . . . , 1}. Figure 3 shows the squared Frobenius error ||P̂ − P*||_F^2 + ||Ŝ − S̃*_Ω||_F^2 (see Theorem 3 for details) across a range of probabilities ρ. We see that the squared error scales approximately linearly with 1/ρ², as predicted by our theory.\n\nFigure 3: Results for the missing data experiment. The total squared error, calculated as in Theorem 3, is plotted against ρ (proportion of observed data) or against 1/ρ², for various values of m (m = 100, 200, 300, 400), based on one trial. Note the nearly linear scaling with respect to 1/ρ².\n\nFigure 4: Results for the chlorine data (averaged over 2 trials), plotting the log of the relative error on the test set against the compression dimension m, for a low-rank + sparse model and a low-rank-only model. The low-rank + sparse model performs better across a range of compression dimensions m (up to 8–9% reduction in error).\n\n4.2 Chlorine sensor data\n\nTo illustrate our method in a specific application, we consider chlorine concentration data from a network of sensors.⁴ The data contains a realistic simulation of chlorine concentration measurements from n = 166 sensors in a hydraulic system over d = 4310 time points. We assume D is well approximated by a low-rank + sparse decomposition. We then compress the data using the orthogonal model (4) and study the performance of our estimators (2) for varying m. In order to evaluate performance, we use 80% of the entries to fit the model, 10% as a validation set for selecting tuning parameters, and the final 10% as a test set. We compare against a low-rank matrix reconstruction, equivalent to setting Ŝ = 0 and fitting only the low-rank component L̂. (Details are given in the Supplementary Materials.) The results are displayed in Figure 4, where we see that the error of the recovery varies smoothly with the compression dimension m, and that the low-rank + sparse decomposition gives better data reconstruction than the low-rank-only model.\n\n5 Discussion\n\nIn this paper, we have examined the robust PCA problem under data compression, where we seek to decompose a data matrix into low-rank + sparse components with access only to a partial projection of the data. This provides a tool for accurate modeling of data with multiple superimposed structures, while enabling restrictions on communication, privacy, or other considerations that may make compression necessary. Our theoretical results show an intuitive tradeoff between the compression ratio and the error of the fitted low-rank + sparse decomposition, and they coincide with existing results in the extreme case of no compression (compression ratio = 1). Future directions for this 
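The uniform sampling scheme combined with the reweighting step (9) — which under uniform sampling ρ_ij = ρ reduces to dividing the observed entries by ρ — can be sketched as:

```python
import numpy as np

def observe_and_reweight(D, rho, rng):
    """Observe each entry of D independently with probability rho, then
    form the reweighted matrix as in (9): D_ij / rho on the observed set
    and 0 elsewhere, so that E[D_tilde] = D entrywise."""
    mask = rng.random(D.shape) < rho
    return np.where(mask, D / rho, 0.0), mask

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 80))
D_tilde, mask = observe_and_reweight(D, rho=0.5, rng=rng)

# unbiasedness check: averaging many independent reweighted copies recovers D
avg = np.mean([observe_and_reweight(D, 0.5, rng)[0] for _ in range(2000)], axis=0)
assert np.max(np.abs(avg - D)) < 0.5
```

The mean-zero property of the resulting noise term Z = (L̃*_Ω − L*)·C is exactly this entrywise unbiasedness, propagated through the linear compression.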
Future directions for this\nproblem include adapting the method to the streaming data (online learning) setting.\n\n4Data obtained from http://www.cs.cmu.edu/afs/cs/project/spirit-1/www/\n\n8\n\n\fReferences\n[1] Alekh Agarwal, Sahand Negahban, Martin J Wainwright, et al. Noisy matrix decomposition\nvia convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171\u2013\n1197, 2012.\n\n[2] Peter J Bickel, Ya\u2019acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and\n\ndantzig selector. The Annals of Statistics, pages 1705\u20131732, 2009.\n\n[3] Emmanuel J Cand\u00e8s, Xiaodong Li, Yi Ma, and John Wright. Robust principal component\n\nanalysis? Journal of the ACM (JACM), 58(3):11, 2011.\n\n[4] Rina Foygel, Ohad Shamir, Nati Srebro, and Ruslan R Salakhutdinov. Learning with the\nweighted trace-norm under arbitrary sampling distributions. In Advances in Neural Informa-\ntion Processing Systems, pages 2133\u20132141, 2011.\n\n[5] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:\nProbabilistic algorithms for constructing approximate matrix decompositions. SIAM review,\n53(2):217\u2013288, 2011.\n\n[6] Jun He, Laura Balzano, and John Lui. Online robust subspace tracking from partial informa-\n\ntion. arXiv preprint arXiv:1109.3827, 2011.\n[7] Jun He, Laura Balzano, and Arthur Szlam.\n\nIncremental gradient on the grassmannian for\nonline foreground and background separation in subsampled video. In IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), pages 1568\u20131575. IEEE, 2012.\n\n[8] Odalric Maillard and R\u00e9mi Munos. Compressed least-squares regression.\n\nNeural Information Processing Systems, pages 1213\u20131221, 2009.\n\nIn Advances in\n\n[9] Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek\nJain. Non-convex robust PCA. 
In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.\n\n[10] John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems, pages 2080–2088, 2009.\n\n[11] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.\n\n[12] Shuheng Zhou, John Lafferty, and Larry Wasserman. Compressed and privacy-sensitive sparse regression. IEEE Transactions on Information Theory, 55(2):846–866, 2009.\n\n[13] Tianyi Zhou and Dacheng Tao. GoDec: Randomized low-rank & sparse matrix decomposition in noisy case. In Proceedings of the 28th International Conference on Machine Learning, pages 33–40, 2011.\n", "award": [], "sourceid": 1188, "authors": [{"given_name": "Wooseok", "family_name": "Ha", "institution": "The University of Chicago"}, {"given_name": "Rina", "family_name": "Foygel Barber", "institution": "University of Chicago"}]}